docling-project
diff --git a/docs/examples/backend_xml_rag.ipynb b/docs/examples/backend_xml_rag.ipynb
Lines changed: 1 addition & 125 deletions
@@ -431,130 +431,6 @@
     "print(f\"Fetched and exported {doc_num} documents.\")"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Using the backend converter (optional)\n",
-    "\n",
-    "- The custom backend converters `PubMedDocumentBackend` and `PatentUsptoDocumentBackend` aim at handling the parsing of PMC articles and USPTO patents, respectively.\n",
-    "- As any other backends, you can leverage the function `is_valid()` to check if the input document is supported by the this backend.\n",
-    "- Note that some XML sections in the original USPTO zip file may not represent patents, like sequence listings, and therefore they will show as invalid by the backend."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\n",
-      "Document ipg241217-1.xml is a valid patent? True\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "316241ca89a843bda3170f2a5c76c639",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/4014 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Found 3928 patents out of 4014 XML files.\n"
-     ]
-    }
-   ],
-   "source": [
-    "from tqdm.notebook import tqdm\n",
-    "\n",
-    "from docling.backend.xml.jats_backend import JatsDocumentBackend\n",
-    "from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\n",
-    "from docling.datamodel.base_models import InputFormat\n",
-    "from docling.datamodel.document import InputDocument\n",
-    "\n",
-    "# check PMC\n",
-    "in_doc = InputDocument(\n",
-    "    path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n",
-    "    format=InputFormat.XML_JATS,\n",
-    "    backend=JatsDocumentBackend,\n",
-    ")\n",
-    "backend = JatsDocumentBackend(\n",
-    "    in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n",
-    ")\n",
-    "print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n",
-    "\n",
-    "# check USPTO\n",
-    "in_doc = InputDocument(\n",
-    "    path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n",
-    "    format=InputFormat.XML_USPTO,\n",
-    "    backend=PatentUsptoDocumentBackend,\n",
-    ")\n",
-    "backend = PatentUsptoDocumentBackend(\n",
-    "    in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n",
-    ")\n",
-    "print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n",
-    "\n",
-    "patent_valid = 0\n",
-    "pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\n",
-    "for in_path in pbar:\n",
-    "    in_doc = InputDocument(\n",
-    "        path_or_stream=in_path,\n",
-    "        format=InputFormat.XML_USPTO,\n",
-    "        backend=PatentUsptoDocumentBackend,\n",
-    "    )\n",
-    "    backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n",
-    "    patent_valid += int(backend.is_valid())\n",
-    "\n",
-    "print(f\"Found {patent_valid} patents out of {doc_num} XML files.\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Calling the function `convert()` will convert the input document into a `DoclingDocument`"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Patent \"Semiconductor package\" has 19 claims\n"
-     ]
-    }
-   ],
-   "source": [
-    "doc = backend.convert()\n",
-    "\n",
-    "claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\n",
-    "print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "✏️ **Tip**: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic `DocumentConverter` object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in [Simple Conversion](#simple-conversion) is the recommended usage for the supported XML files."
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -923,7 +799,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.8"
+   "version": "3.12.10"
   }
  },
  "nbformat": 4,