|
13 | 13 | "\n",
|
14 | 14 | "`Textract` supports `JPEG`, `PNG`, `PDF`, and `TIFF` file formats; more information is available in [the documentation](https://docs.aws.amazon.com/textract/latest/dg/limits-document.html).\n",
|
15 | 15 | "\n",
|
16 |
| - "The following samples demonstrate the use of `Amazon Textract` in combination with LangChain as a DocumentLoader." |
| 16 | + "The following examples demonstrate the use of `Amazon Textract` in combination with LangChain as a DocumentLoader." |
17 | 17 | ]
|
18 | 18 | },
|
19 | 19 | {
|
|
41 | 41 | "id": "400b25c6-befa-4730-a201-39ff112c8858",
|
42 | 42 | "metadata": {},
|
43 | 43 | "source": [
|
44 |
| - "## Sample 1\n", |
| 44 | + "## Example 1: Loading from a local file\n", |
45 | 45 | "\n",
|
46 | 46 | "The first example uses a local file, which is sent internally to the Amazon Textract synchronous API [DetectDocumentText](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html). \n",
|
47 | 47 | "\n",
|
|
100 | 100 | "id": "4cf7f19c-3635-453a-9c76-4baf98b8d7f4",
|
101 | 101 | "metadata": {},
|
102 | 102 | "source": [
|
103 |
| - "## Sample 2\n", |
104 |
| - "The next sample loads a file from an HTTPS endpoint. \n", |
| 103 | + "## Example 2: Loading from a URL\n", |
| 104 | + "The next example loads a file from an HTTPS endpoint. \n", |
105 | 105 | "The document must be a single page, because Amazon Textract requires all multi-page documents to be stored on S3."
|
106 | 106 | ]
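The single-page constraint described in this cell can be checked before calling the loader. The following is a minimal sketch, not part of the notebook: the helper names `source_kind` and `may_be_multi_page` are hypothetical, and the rule encoded is only the one stated above (multi-page documents must be on S3).

```python
from urllib.parse import urlparse


def source_kind(path: str) -> str:
    """Classify a document location as 's3', 'url', or 'local' (hypothetical helper)."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "url"
    # Anything else (no scheme, or a Windows drive letter) is treated as a local path.
    return "local"


def may_be_multi_page(path: str) -> bool:
    """Per the text above, only documents stored on S3 may have multiple pages."""
    return source_kind(path) == "s3"
```

A caller could use `may_be_multi_page` to fail early with a clear message instead of waiting for an error from the service.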
|
107 | 107 | },
|
|
150 | 150 | "id": "3a9cd8ec-e663-4dc7-9db1-d2f575253141",
|
151 | 151 | "metadata": {},
|
152 | 152 | "source": [
|
153 |
| - "## Sample 3\n", |
| 153 | + "## Example 3: Loading multi-page PDF documents\n", |
154 | 154 | "\n",
|
155 | 155 | "Processing a multi-page document requires the document to be on S3. The sample document resides in a bucket in us-east-2, and Textract must be called in that same region to succeed, so we set the region_name on the client and pass it to the loader to ensure Textract is called from us-east-2. Alternatively, you could run your notebook in us-east-2, set the AWS_DEFAULT_REGION environment variable to us-east-2, or, when running in a different environment, pass in a boto3 Textract client with that region name, as in the cell below."
|
156 | 156 | ]
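The region handling described in this cell might look roughly like the sketch below. It assumes `boto3` and `langchain_community` are installed and that AWS credentials are configured; `resolve_region` and `load_multi_page` are hypothetical names introduced here for illustration, and the S3 URI is supplied by the caller.

```python
import os
from typing import Optional


def resolve_region(explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit region, else fall back to AWS_DEFAULT_REGION (hypothetical helper)."""
    return explicit or os.environ.get("AWS_DEFAULT_REGION")


def load_multi_page(s3_uri: str, region_name: Optional[str] = None):
    """Load a multi-page document from S3 with a region-pinned Textract client."""
    # Imported here so the helper above works without the AWS libraries installed.
    import boto3
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    # The client must target the region of the bucket that holds the document.
    client = boto3.client("textract", region_name=resolve_region(region_name))
    loader = AmazonTextractPDFLoader(s3_uri, client=client)
    return loader.load()
```

Passing the client explicitly keeps the region choice visible in the code rather than depending on the environment the notebook happens to run in.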
|
|
214 | 214 | }
|
215 | 215 | },
|
216 | 216 | "source": [
|
217 |
| - "## Sample 4\n", |
| 217 | + "## Example 4: Customizing the output format\n", |
218 | 218 | "\n",
|
219 | 219 | "You can pass an additional parameter, `linearization_config`, to the AmazonTextractPDFLoader; it determines how the parser linearizes the text output after Textract runs."
|
220 | 220 | ]
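As a rough illustration of `linearization_config`, the sketch below hides header, footer, and figure layout elements from the linearized text. It assumes the `amazon-textract-textractor` package provides `TextLinearizationConfig` with these flags; the function name `load_without_layout_noise` is hypothetical, and the document path is supplied by the caller.

```python
def load_without_layout_noise(path: str):
    """Load a document while omitting header, footer, and figure layout text."""
    # Imported lazily so that defining this function needs no AWS libraries.
    from langchain_community.document_loaders import AmazonTextractPDFLoader
    from textractor.data.text_linearization_config import TextLinearizationConfig

    # Assumed flags: each one suppresses a class of layout element in the output.
    config = TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
        hide_figure_layout=True,
    )
    loader = AmazonTextractPDFLoader(path, linearization_config=config)
    return loader.load()
```

Dropping these elements tends to help when the text feeds a downstream chain, since repeated headers and footers add noise to every page.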
|
|
248 | 248 | "## Using the AmazonTextractPDFLoader in a LangChain chain (e.g. OpenAI)\n",
|
249 | 249 | "\n",
|
250 | 250 | "The AmazonTextractPDFLoader can be used in a chain the same way the other loaders are used.\n",
|
251 |
| - "Textract itself does have a [Query feature](https://docs.aws.amazon.com/textract/latest/dg/API_Query.html), which offers similar functionality to the QA chain in this sample, which is worth checking out as well." |
| 251 | + "Textract itself does have a [Query feature](https://docs.aws.amazon.com/textract/latest/dg/API_Query.html), which offers similar functionality to the QA chain in this example, which is worth checking out as well." |
252 | 252 | ]
|
253 | 253 | },
|
254 | 254 | {
|
|