Commit 9b45bd8

Merge branch 'master' into pprados/02-pymupdf
2 parents 0e6c904 + e156b37 commit 9b45bd8

27 files changed

+870
-380
lines changed

cookbook/mongodb-langchain-cache-memory.ipynb

Lines changed: 1 addition & 1 deletion
@@ -156,7 +156,7 @@
156156
"metadata": {},
157157
"outputs": [],
158158
"source": [
159-
"# Ensure you have an HF_TOKEN in your development enviornment:\n",
159+
"# Ensure you have an HF_TOKEN in your development environment:\n",
160160
"# access tokens can be created or copied from the Hugging Face platform (https://huggingface.co/docs/hub/en/security-tokens)\n",
161161
"\n",
162162
"# Load MongoDB's embedded_movies dataset from Hugging Face\n",

docs/docs/how_to/document_loader_markdown.ipynb

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
1616
"- Basic usage;\n",
1717
"- Parsing of Markdown into elements such as titles, list items, and text.\n",
1818
"\n",
19-
"LangChain implements an [UnstructuredMarkdownLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html) object which requires the [Unstructured](https://unstructured-io.github.io/unstructured/) package. First we install it:"
19+
"LangChain implements an [UnstructuredMarkdownLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader.html) object which requires the [Unstructured](https://docs.unstructured.io/welcome/) package. First we install it:"
2020
]
2121
},
2222
{
Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# HyperbrowserLoader"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.\n",
15+
"\n",
16+
"Key Features:\n",
17+
"- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n",
18+
"- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n",
19+
"- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more\n",
20+
"- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n",
21+
"\n",
22+
"This notebook provides a quick overview for getting started with the Hyperbrowser [document loader](https://python.langchain.com/docs/concepts/#document-loaders).\n",
23+
"\n",
24+
"For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).\n",
25+
"\n",
26+
"## Overview\n",
27+
"### Integration details\n",
28+
"\n",
29+
"| Class | Package | Local | Serializable | JS support|\n",
30+
"| :--- | :--- | :---: | :---: | :---: |\n",
31+
"| HyperbrowserLoader | langchain-hyperbrowser | ❌ | ❌ | ❌ | \n",
32+
"### Loader features\n",
33+
"| Source | Document Lazy Loading | Native Async Support |\n",
34+
"| :---: | :---: | :---: | \n",
35+
"| HyperbrowserLoader | ✅ | ✅ | \n",
36+
"\n",
37+
"## Setup\n",
38+
"\n",
39+
"To access the Hyperbrowser document loader, install the `langchain-hyperbrowser` integration package, create a Hyperbrowser account, and get an API key.\n",
40+
"\n",
41+
"### Credentials\n",
42+
"\n",
43+
"Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable.\n"
44+
]
45+
},
46+
{
47+
"cell_type": "markdown",
48+
"metadata": {},
49+
"source": [
50+
"### Installation\n",
51+
"\n",
52+
"Install **langchain-hyperbrowser**."
53+
]
54+
},
55+
{
56+
"cell_type": "code",
57+
"execution_count": null,
58+
"metadata": {},
59+
"outputs": [],
60+
"source": [
61+
"%pip install -qU langchain-hyperbrowser"
62+
]
63+
},
64+
{
65+
"cell_type": "markdown",
66+
"metadata": {},
67+
"source": [
68+
"## Initialization\n",
69+
"\n",
70+
"Now we can instantiate our loader object and load documents:\n"
71+
]
72+
},
73+
{
74+
"cell_type": "code",
75+
"execution_count": null,
76+
"metadata": {},
77+
"outputs": [],
78+
"source": [
79+
"from langchain_hyperbrowser import HyperbrowserLoader\n",
80+
"\n",
81+
"loader = HyperbrowserLoader(\n",
82+
"    urls=\"https://example.com\",\n",
83+
"    api_key=\"YOUR_API_KEY\",\n",
84+
")"
85+
]
86+
},
87+
{
88+
"cell_type": "markdown",
89+
"metadata": {},
90+
"source": [
91+
"## Load"
92+
]
93+
},
94+
{
95+
"cell_type": "code",
96+
"execution_count": null,
97+
"metadata": {},
98+
"outputs": [
99+
{
100+
"data": {
101+
"text/plain": [
102+
"Document(metadata={'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, page_content='Example Domain\\n\\n# Example Domain\\n\\nThis domain is for use in illustrative examples in documents. You may use this\\ndomain in literature without prior coordination or asking for permission.\\n\\n[More information...](https://www.iana.org/domains/example)')"
103+
]
104+
},
105+
"execution_count": null,
106+
"metadata": {},
107+
"output_type": "execute_result"
108+
}
109+
],
110+
"source": [
111+
"docs = loader.load()\n",
112+
"docs[0]"
113+
]
114+
},
115+
{
116+
"cell_type": "code",
117+
"execution_count": null,
118+
"metadata": {},
119+
"outputs": [],
120+
"source": [
121+
"print(docs[0].metadata)"
122+
]
123+
},
124+
{
125+
"cell_type": "markdown",
126+
"metadata": {},
127+
"source": [
128+
"## Lazy Load"
129+
]
130+
},
131+
{
132+
"cell_type": "code",
133+
"execution_count": null,
134+
"metadata": {},
135+
"outputs": [],
136+
"source": [
137+
"page = []\n",
138+
"for doc in loader.lazy_load():\n",
139+
"    page.append(doc)\n",
140+
"    if len(page) >= 10:\n",
141+
"        # do some paged operation, e.g.\n",
142+
"        # index.upsert(page)\n",
143+
"\n",
144+
"        page = []"
145+
]
146+
},
147+
{
148+
"cell_type": "markdown",
149+
"metadata": {},
150+
"source": [
151+
"## Advanced Usage\n",
152+
"\n",
153+
"You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page."
154+
]
155+
},
156+
{
157+
"cell_type": "code",
158+
"execution_count": null,
159+
"metadata": {},
160+
"outputs": [],
161+
"source": [
162+
"loader = HyperbrowserLoader(\n",
163+
"    urls=\"https://hyperbrowser.ai\", api_key=\"YOUR_API_KEY\", operation=\"crawl\"\n",
164+
")"
165+
]
166+
},
167+
{
168+
"cell_type": "markdown",
169+
"metadata": {},
170+
"source": [
171+
"Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait."
172+
]
173+
},
174+
{
175+
"cell_type": "code",
176+
"execution_count": null,
177+
"metadata": {},
178+
"outputs": [],
179+
"source": [
180+
"loader = HyperbrowserLoader(\n",
181+
"    urls=\"https://example.com\",\n",
182+
"    api_key=\"YOUR_API_KEY\",\n",
183+
"    operation=\"scrape\",\n",
184+
"    params={\"scrape_options\": {\"include_tags\": [\"h1\", \"h2\", \"p\"]}},\n",
185+
")"
186+
]
187+
},
188+
{
189+
"cell_type": "markdown",
190+
"metadata": {},
191+
"source": [
192+
"## API reference\n",
193+
"\n",
194+
"- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)\n",
195+
"- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)\n",
196+
"- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)"
197+
]
198+
}
199+
],
200+
"metadata": {
201+
"kernelspec": {
202+
"display_name": "Python 3 (ipykernel)",
203+
"language": "python",
204+
"name": "python3"
205+
},
206+
"language_info": {
207+
"codemirror_mode": {
208+
"name": "ipython",
209+
"version": 3
210+
},
211+
"file_extension": ".py",
212+
"mimetype": "text/x-python",
213+
"name": "python",
214+
"nbconvert_exporter": "python",
215+
"pygments_lexer": "ipython3",
216+
"version": "3.9.16"
217+
}
218+
},
219+
"nbformat": 4,
220+
"nbformat_minor": 4
221+
}
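
Note on the HyperbrowserLoader notebook: the Credentials cell asks the reader to set `HYPERBROWSER_API_KEY` but no code cell does so, and the Lazy Load loop never processes a final partial page of fewer than 10 documents. The sketch below is one possible way to cover both, not part of the commit. The helper names `get_api_key` and `batched` are hypothetical, and passing the key through `api_key=` follows the constructor calls shown in the notebook.

```python
import os
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def get_api_key(env_var: str = "HYPERBROWSER_API_KEY") -> str:
    """Read the API key from the environment and fail loudly if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including a final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    page: List[T] = []
    for item in items:
        page.append(item)
        if len(page) >= size:
            yield page
            page = []
    if page:
        # Flush the remainder that the notebook loop would leave unprocessed.
        yield page


if __name__ == "__main__":
    # Stand-in data; with the real package this would be loader.lazy_load().
    print(list(batched(range(5), 2)))  # prints [[0, 1], [2, 3], [4]]
```

With the package installed, the same helpers would presumably be used as `loader = HyperbrowserLoader(urls="https://example.com", api_key=get_api_key())` followed by `for page in batched(loader.lazy_load(), 10): ...`, where the body is whatever paged operation the notebook's `index.upsert(page)` comment stands for.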
