|
60 | 60 | "\n",
|
61 | 61 | "Before we dive into the technical weeds, we need to set up the notebook's runtime and filesystem environments. The code cells below do the following:\n",
|
62 | 62 | "- Install required libraries\n",
|
63 |
| - "- Fetch the PDF dataset that we will be working with `fda-approved-drug.pdf`\n", |
64 |
| - " - External link: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf\n", |
65 |
| - "- Fetch pre-existing versions of the parsed documents for each solution. For the sake of the notebook, this allows readers to see the final results without having to set up the required infrastructure\n", |
| 63 | + "- Confirm that data dependencies from the GitHub repo have been downloaded. These will be under `data/document-parsing` and contain the following:\n", |
| 64 | + " - the PDF document that we will be working with, `fda-approved-drug.pdf` (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)\n", |
| 65 | + " - precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results to allow readers to skip ahead to the RAG section without having to set up the required infrastructure for each solution.)\n", |
66 | 66 | "- Add utility functions needed for later sections"
|
67 | 67 | ]
|
68 | 68 | },
|
|
76 | 76 | "source": [
|
77 | 77 | "%%capture\n",
|
78 | 78 | "! sudo apt install tesseract-ocr poppler-utils\n",
|
79 |
| - "! pip install cohere hnswlib google-cloud-documentai google-cloud-storage boto3 langchain-text-splitters llama_parse pytesseract pdf2image pandas\n" |
| 79 | + "! pip install cohere fsspec hnswlib google-cloud-documentai google-cloud-storage boto3 langchain-text-splitters llama_parse pytesseract pdf2image pandas\n" |
80 | 80 | ]
|
81 | 81 | },
|
82 | 82 | {
|
|
87 | 87 | },
|
88 | 88 | "outputs": [],
|
89 | 89 | "source": [
|
90 |
| - "\"\"\"\n", |
91 |
| - "Create filesystem\n", |
92 |
| - "\"\"\"\n", |
93 |
| - "\n", |
94 |
| - "from pathlib import Path\n", |
95 |
| - "\n", |
96 |
| - "data_dir = \"data\"\n", |
| 90 | + "data_dir = \"data/document-parsing\"\n", |
97 | 91 | "source_filename = \"example-drug-label\"\n",
|
98 |
| - "extension = \"pdf\"\n", |
99 |
| - "Path(data_dir).mkdir(parents=True, exist_ok=True)" |
| 92 | + "extension = \"pdf\"" |
100 | 93 | ]
|
101 | 94 | },
|
102 | 95 | {
|
103 | 96 | "cell_type": "code",
|
104 | 97 | "execution_count": null,
|
105 |
| - "metadata": { |
106 |
| - "id": "3Em5H09-Yp19" |
107 |
| - }, |
| 98 | + "metadata": {}, |
108 | 99 | "outputs": [],
|
109 | 100 | "source": [
|
110 |
| - "\"\"\"\n", |
111 |
| - "Fetch notebook's data\n", |
112 |
| - "\"\"\"\n", |
113 |
| - "\n", |
114 |
| - "import fsspec\n", |
115 | 101 | "from pathlib import Path\n",
|
116 | 102 | "\n",
|
117 |
| - "destination = Path(data_dir)\n", |
118 |
| - "fs = fsspec.filesystem(\"github\", org=\"gchatz22\", repo=\"temp-cohere-resources\")\n", |
119 |
| - "fs.get(fs.ls(data_dir), destination.as_posix(), recursive=True)" |
| 103 | + "sources = [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]\n", |
| 104 | + "\n", |
| 105 | + "filenames = [\"{}-parsed-fda-approved-drug.txt\".format(source) for source in sources]\n", |
| 106 | + "filenames.append(\"fda-approved-drug.pdf\")\n", |
| 107 | + "\n", |
| 108 | + "for filename in filenames: \n", |
| 109 | + " file_path = Path(f\"{data_dir}/{filename}\")\n", |
| 110 | + " if file_path.is_file() == False:\n", |
| 111 | + " print(f\"File {filename} not found at {data_dir}!\")" |
120 | 112 | ]
|
121 | 113 | },
|
122 | 114 | {
|
|
138 | 130 | "outputs": [],
|
139 | 131 | "source": [
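| | + "# Write a parsed document to disk so later sections can reload it.\n", |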
|
140 | 132 | "def store_document(path: str, doc_content: str):\n",
|
141 |
| - " file = open(path, 'w')\n", |
142 |
| - " file.write(doc_content)\n", |
143 |
| - " file.close()" |
| 133 | + " with open(path, 'w') as f:\n", |
| 134 | + " f.write(doc_content)" |
144 | 135 | ]
|
145 | 136 | },
|
146 | 137 | {
|
|
229 | 220 | "id": "cxwJ_jZpgNDo"
|
230 | 221 | },
|
231 | 222 | "source": [
|
232 |
| - "e#### Parsing the document\n", |
| 223 | + "#### Parsing the document\n", |
233 | 224 | "\n",
|
234 | 225 | "The following block can be executed in one of two ways\n",
|
235 | 226 | "1. Inside a Google Vertex AI environment\n",
|
|
432 | 423 | "outputs": [],
|
433 | 424 | "source": [
|
434 | 425 | "filename = \"gcp-parsed-{}.txt\".format(source_filename)\n",
|
435 |
| - "doc = open(\"{}/{}\".format(data_dir, filename), \"r\")\n", |
436 |
| - "parsed_document = doc.read()\n", |
437 |
| - "doc.close()\n", |
| 426 | + "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n", |
| 427 | + " parsed_document = doc.read()\n", |
438 | 428 | "\n",
|
439 | 429 | "print(parsed_document[:1000])"
|
440 | 430 | ]
|
|
476 | 466 | "At minimum, you will need access to the following AWS resources to get started:\n",
|
477 | 467 | "\n",
|
478 | 468 | "- Textract\n",
|
479 |
| - "- an S3 bucket containing the document(s) to process - in this case, our `Example Label.pdf` file\n", |
| 469 | + "- an S3 bucket containing the document(s) to process - in this case, our `example-drug-label.pdf` file\n", |
480 | 470 | "- an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.\n",
|
481 | 471 | "- an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic\n",
|
482 | 472 | "\n",
|
|
825 | 815 | "sns_topic_arn = \"your-sns-arn\" # this can be found under the topic you created in the Amazon SNS dashboard\n",
|
826 | 816 | "sns_role_arn = \"sns-role-arn\" # this is an IAM role that allows Textract to interact with SNS\n",
|
827 | 817 | "\n",
|
828 |
| - "file_name = \"Example Label.pdf\"" |
| 818 | + "file_name = \"example-drug-label.pdf\"" |
829 | 819 | ]
|
830 | 820 | },
|
831 | 821 | {
|
|
979 | 969 | "outputs": [],
|
980 | 970 | "source": [
|
981 | 971 | "filename = \"aws-parsed-{}.txt\".format(source_filename)\n",
|
982 |
| - "doc = open(\"{}/{}\".format(data_dir, filename), \"r\")\n", |
983 |
| - "parsed_document = doc.read()\n", |
984 |
| - "doc.close()\n", |
| 972 | + "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n", |
| 973 | + " parsed_document = doc.read()\n", |
985 | 974 | "\n",
|
986 | 975 | "print(parsed_document[:1000])"
|
987 | 976 | ]
|
|
1031 | 1020 | "parsed_documents = []\n",
|
1032 | 1021 | "\n",
|
1033 | 1022 | "input_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)\n",
|
1034 |
| - "file_data = open(input_path, 'rb')\n", |
1035 |
| - "response = requests.post(\n", |
1036 |
| - " url=UNSTRUCTURED_URL,\n", |
1037 |
| - " files={\"files\": (\"{}.{}\".format(source_filename, extension), file_data)},\n", |
1038 |
| - " data={\n", |
1039 |
| - " \"output_format\": (None, \"application/json\"),\n", |
1040 |
| - " \"stratergy\": \"hi_res\",\n", |
1041 |
| - " \"pdf_infer_table_structure\": \"true\",\n", |
1042 |
| - " \"include_page_breaks\": \"true\"\n", |
1043 |
| - " },\n", |
1044 |
| - " headers={\"Accept\": \"application/json\"}\n", |
1045 |
| - ")\n", |
| 1023 | + "with open(input_path, 'rb') as file_data:\n", |
| 1024 | + " response = requests.post(\n", |
| 1025 | + " url=UNSTRUCTURED_URL,\n", |
| 1026 | + " files={\"files\": (\"{}.{}\".format(source_filename, extension), file_data)},\n", |
| 1027 | + " data={\n", |
| 1028 | + " \"output_format\": (None, \"application/json\"),\n", |
| 1029 | + " \"stratergy\": \"hi_res\",\n", |
| 1030 | + " \"pdf_infer_table_structure\": \"true\",\n", |
| 1031 | + " \"include_page_breaks\": \"true\"\n", |
| 1032 | + " },\n", |
| 1033 | + " headers={\"Accept\": \"application/json\"}\n", |
| 1034 | + " )\n", |
| 1035 | + "\n", |
1046 | 1036 | "parsed_response = response.json()\n",
|
1047 | 1037 | "\n",
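| | + "# Concatenate the text of all parsed elements into a single document string.\n", |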
|
1048 | 1038 | "parsed_document = \" \".join([parsed_entry[\"text\"] for parsed_entry in parsed_response])\n",
|
|
1083 | 1073 | "outputs": [],
|
1084 | 1074 | "source": [
|
1085 | 1075 | "filename = \"unstructured-io-parsed-{}.txt\".format(source_filename)\n",
|
1086 |
| - "doc = open(\"{}/{}\".format(data_dir, filename), \"r\")\n", |
1087 |
| - "parsed_document = doc.read()\n", |
1088 |
| - "doc.close()\n", |
| 1076 | + "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n", |
| 1077 | + " parsed_document = doc.read()\n", |
1089 | 1078 | "\n",
|
1090 | 1079 | "print(parsed_document[:1000])"
|
1091 | 1080 | ]
|
|
1233 | 1222 | "# Text parsing\n",
|
1234 | 1223 | "\n",
|
1235 | 1224 | "filename = \"llamaparse-text-parsed-{}.txt\".format(source_filename)\n",
|
1236 |
| - "doc = open(\"{}/{}\".format(data_dir, filename), \"r\")\n", |
1237 |
| - "parsed_document = doc.read()\n", |
1238 |
| - "doc.close()\n", |
1239 | 1225 | "\n",
|
| 1226 | + "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n", |
| 1227 | + " parsed_document = doc.read()\n", |
| 1228 | + " \n", |
1240 | 1229 | "print(parsed_document[:1000])"
|
1241 | 1230 | ]
|
1242 | 1231 | },
|
|
1251 | 1240 | "# Markdown parsing\n",
|
1252 | 1241 | "\n",
|
1253 | 1242 | "filename = \"llamaparse-markdown-parsed-fda-approved-drug.txt\"\n",
|
1254 |
| - "doc = open(\"{}/{}\".format(data_dir, filename), \"r\")\n", |
1255 |
| - "parsed_document = doc.read()\n", |
1256 |
| - "doc.close()\n", |
1257 |
| - "\n", |
| 1243 | + "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n", |
| 1244 | + " parsed_document = doc.read()\n", |
| 1245 | + " \n", |
1258 | 1246 | "print(parsed_document[:1000])"
|
1259 | 1247 | ]
|
1260 | 1248 | },
|
|
1401 | 1389 | "outputs": [],
|
1402 | 1390 | "source": [
|
1403 | 1391 | "filename = \"pytesseract-parsed-{}.txt\".format(source_filename)\n",
|
1404 |
| - "doc = open(\"{}/{}\".format(data_dir, filename), \"r\")\n", |
1405 |
| - "parsed_document = doc.read()\n", |
1406 |
| - "doc.close()\n", |
| 1392 | + "with open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n", |
| 1393 | + " parsed_document = doc.read()\n", |
1407 | 1394 | "\n",
|
1408 | 1395 | "print(parsed_document[:1000])"
|
1409 | 1396 | ]
|
|
1414 | 1401 | "id": "SCbkT4oZSfs9"
|
1415 | 1402 | },
|
1416 | 1403 | "source": [
|
1417 |
| - "## Document Questions\n", |
1418 | 1404 | "<a name=\"document-questions\"></a>\n",
|
| 1405 | + "## Document Questions\n", |
1419 | 1406 | "\n",
|
1420 | 1407 | "We can now ask a set of simple + complex questions and see how each parsing solution performs with Command-R. The questions are\n",
|
1421 | 1408 | "- **What are the most common adverse reactions of Iwilfin?**\n",
|
|
1493 | 1480 | "\n",
|
1494 | 1481 | "documents = []\n",
|
1495 | 1482 | "\n",
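| | + "# Load the precomputed parsed output for the selected parsing solution.\n", |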
|
1496 |
| - "doc = open(\"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, source), \"r\")\n", |
| 1483 | + "with open(\"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, source), \"r\") as doc:\n", |
1497 | 1484 | "doc_content = doc.read()\n",
|
1498 |
| - "doc.close()\n", |
1499 | 1485 | "\n",
|
1500 | 1486 | "\"\"\"\n",
|
1501 | 1487 | "Personal notes on chunking\n",
|
|