diff --git a/README.md b/README.md index a3cddbb..376aa5b 100644 --- a/README.md +++ b/README.md @@ -38,6 +38,12 @@ This repo contains material from workshops centered around Docling. Python Summit Accepted / TBA + + 17 Oct 2025 + Seville, Spain + PyCon Spain 2025 + View Details + 6-9 Oct 2025 Orlando diff --git a/workshops/2025_10_17_PyConES/Docling PyconES Presentation.pdf b/workshops/2025_10_17_PyConES/Docling PyconES Presentation.pdf new file mode 100644 index 0000000..efcebcf Binary files /dev/null and b/workshops/2025_10_17_PyConES/Docling PyconES Presentation.pdf differ diff --git a/workshops/2025_10_17_PyConES/README.md b/workshops/2025_10_17_PyConES/README.md new file mode 100644 index 0000000..461231b --- /dev/null +++ b/workshops/2025_10_17_PyConES/README.md @@ -0,0 +1,14 @@ +# PyConES 🇪🇸 2025 | Docling Workshop + + +**Event:** [PyCon Spain 2025](https://2025.es.pycon.org/) + +**Date:** October 17th, 2025 + +**Location:** Pablo de Olavide University, Seville, Spain + +**Workshop Documentation:** [ibm.biz/pycones25](https://ibm.biz/pycones25) + +**Speakers:** +- [Simon Sanchez Viloria](https://github.com/simonsanvil) (IBM Expert Labs) +- [Andres Ruiz Calvo](https://github.com/andresruizc) (IBM Consulting) diff --git a/workshops/2025_10_17_PyConES/notebooks/Lab1_Docling_convert.ipynb b/workshops/2025_10_17_PyConES/notebooks/Lab1_Docling_convert.ipynb new file mode 100644 index 0000000..c89c0a5 --- /dev/null +++ b/workshops/2025_10_17_PyConES/notebooks/Lab1_Docling_convert.ipynb @@ -0,0 +1,764 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Hk7eDm0UvuV4" + }, + "source": [ + "\n", + "# Transforma tus documentos en datos listos para IA con Docling\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Lab 1: Conversión de Documentos\n", + "\n", + "Bienvenido a el primer lab de esta workshop de Docling. Este viaje de tres partes te llevará desde los conceptos básicos del procesamiento de documentos hasta la construcción de sistemas de IA avanzados y transparentes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Objetivos\n", + "\n", + "En este primer lab, aprenderás a transformar documentos complejos en datos estructurados listos para IA. Pero esto no se trata solo de extracción, sino de una conversión inteligente que preserva todo lo importante para las aplicaciones de IA posteriores.\n", + "\n", + "### Al final de este lab, habrás dominado:\n", + "\n", + "- **Carga de Documentos**: Ingesta de PDFs, documentos de Word, PowerPoints y más\n", + "- **Extracción de Estructura**: Preservando de la jerarquía y las relaciones entre los elementos\n", + "- **Excelencia en Tablas**: Conversión de tablas complejas en formatos utilizables.\n", + "- **Manejo de Imágenes**: Extracción y preparación de imágenes para el procesamiento de IA.\n", + "- **Preservación de Metadatos**: Mantenimiento de la información necesaria para referenciar y contextualizar los datos." 
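, + "\n", + "Todo el laboratorio gira en torno a la clase `DocumentConverter` de Docling. Como adelanto (más abajo lo veremos paso a paso), la conversión más sencilla posible se parece a esto:\n", + "\n", + "```python\n", + "from docling.document_converter import DocumentConverter\n", + "\n", + "# convierte un documento (ruta local o URL) y lo exporta a Markdown\n", + "doc = DocumentConverter().convert(\"https://arxiv.org/pdf/2501.17887\").document\n", + "print(doc.export_to_markdown()[:500])\n", + "```"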
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introducción\n", + "\n", + "**[Docling](https://docling-project.github.io/docling/)** es una herramienta de código abierto para el procesamiento, análisis y conversión de documentos diseñado para aplicaciones de IA generativa.\n", + "\n", + "### Características Clave\n", + "- **Soporte multi-formato**: PDF, DOCX, XLSX, HTML, imágenes y más\n", + "- **Comprensión avanzada de PDF**: Diseño de página, orden de lectura, estructura de tabla, bloques de código, fórmulas\n", + "- **Representación unificada de DoclingDocument**: Estructura de datos consistente en todos los formatos\n", + "- **Opciones de exportación flexibles**: Markdown, HTML, JSON, DocTags\n", + "- **Ejecución local**: Permite procesar datos sensibles sin servicios externos\n", + "- **Integración con frameworks**: LangChain, LlamaIndex y otros frameworks de IA" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ¿Por qué usamos Docling para la conversión de documentos?\n", + "\n", + "Los datos son la base de todos los sistemas de IA. Para aprovechar la mayor cantidad de datos posible, necesitamos poder ingerir datos de varios formatos con precisión. Sin embargo, los LLM generalmente requieren datos en un formato específico, de ahí la necesidad de conversión.\n", + "\n", + "**Sin una conversión adecuada**:\n", + "- La información se pierde o se desordena\n", + "- Las tablas se convierten en texto ilegible\n", + "- Las imágenes desaparecen por completo\n", + "- La estructura del documento se destruye\n", + "\n", + "**Con la conversión avanzada de Docling**:\n", + "- Cada pieza de información se preserva\n", + "- Las tablas mantienen su estructura\n", + "- Las imágenes se extraen y son procesables\n", + "- El diseño y las relaciones se comprenden" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LhPuTXE5laO7" + }, + "source": [ + "### Instalación Básica" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "U4UTqpiL8Dfe", + "outputId": "e111d023-bae6-4548-8fb3-be1958c60107" + }, + "outputs": [], + "source": [ + "!uv pip install docling matplotlib pillow pandas python-dotenv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TV-71yke857S" + }, + "source": [ + "### Importamos los componentes esenciales de la librería\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2eg9Lln_89Cv" + }, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "# Core Docling imports\n", + "from docling.document_converter import DocumentConverter\n", + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", + "from docling.document_converter import PdfFormatOption\n", + "from IPython.display import display\n", + "\n", + "# For advanced features\n", + "from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, TextItem, DoclingDocument\n", + "\n", + "# For data processing and visualization\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Create output directory\n", + "output_dir = Path(\"output\")\n", + "output_dir.mkdir(exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5N0cmabF_Ria" + }, + "source": [ + "## 1. 
Conversión Básica de Documentos\n", + "\n", + "### Ejemplo Mínimo\n", + "\n", + "La forma más sencilla de convertir un documento es inicializar un `DocumentConverter` y llamar a su método `convert()` con la ruta del fichero o ficheros a convertir.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PjeZWdaMQq3j" + }, + "outputs": [], + "source": [ + "# Docling Technical Report\n", + "docling_paper = \"https://arxiv.org/pdf/2501.17887\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qRBIgujl_01F", + "outputId": "cc3cb054-fd05-47d2-a5af-d00b48c20b67" + }, + "outputs": [], + "source": [ + "# Conversión simple\n", + "\n", + "# Crear una instancia del convertidor con parametros por defecto\n", + "converter = DocumentConverter()\n", + "\n", + "# Convertir un documento\n", + "result = converter.convert(docling_paper)\n", + "doc = result.document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "C-OlC5yvlaO7", + "outputId": "381d699b-bd99-4587-9d6a-616cb3915cb5" + }, + "outputs": [], + "source": [ + "# Exportar a Markdown\n", + "md_out = doc.export_to_markdown()\n", + "\n", + "# Imprimir un extracto del resultado\n", + "print(f\"{md_out}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nxDEG2s-lSLq" + }, + "source": [ + "### Exploración de la Estructura del Documento\n", + "\n", + "Uno de las superpoderes de Docling es comprender la estructura del documento, algo crítico a la hora de hacer chunking inteligente o extraer partes específicas del documento." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VJv614U1lS1x", + "outputId": "1b7d46bf-0fd3-42c5-c72a-ad316d7e11dc" + }, + "outputs": [], + "source": [ + "# Document metadata\n", + "print(f\"Titulo: {doc.name}\")\n", + "print(f\"Numero de paginas: {len(doc.pages)}\")\n", + "print(f\"Numero de tablas: {len(doc.tables)}\")\n", + "print(f\"Numero de imagenes: {len(doc.pictures)}\")\n", + "\n", + "# Exploramos la estructura del documento de forma jerarquica\n", + "print(\"\\nEstructura del documento:\")\n", + "for i, (item, level) in enumerate(doc.iterate_items()):\n", + " if i < 10: # Show first 10 items\n", + " item_type = type(item).__name__\n", + " text_preview = item.text[:200] if hasattr(item, 'text') else 'N/A'\n", + " print(f\"{' ' * level}{item_type}: {text_preview}\")\n", + " else: \n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CrrwFDxmHsbz" + }, + "source": [ + "### Opciones de Exportación\n", + "\n", + "Docling ofrece varias opciones de exportación para adaptarse a diferentes necesidades." 
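, + "\n", + "Un apunte útil: el JSON que guardaremos en la siguiente celda con `save_as_json` conserva el `DoclingDocument` completo, por lo que puedes recargarlo más adelante sin repetir la conversión. Un boceto mínimo (asume que tu versión de `docling-core` expone `DoclingDocument.load_from_json`; consulta su documentación si difiere):\n", + "\n", + "```python\n", + "# Boceto: recargar un DoclingDocument guardado previamente como JSON\n", + "from docling_core.types.doc import DoclingDocument\n", + "\n", + "doc_recargado = DoclingDocument.load_from_json(output_dir / \"document.json\")\n", + "print(doc_recargado.name)\n", + "```"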
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AiCbsd1ifnns" + }, + "outputs": [], + "source": [ + "markdown_text = doc.export_to_markdown() # exportar a markdown (string)\n", + "html_text = doc.export_to_html() # exportar a html (string)\n", + "json_dict = doc.export_to_dict() # exportar a dict (json serializable)\n", + "doc_tags = doc.export_to_doctags() # exportar a doctags (string)\n", + "\n", + "# Guardar el documento en varios formatos\n", + "doc.save_as_markdown(\n", + " output_dir / \"document.md\",\n", + " image_mode=ImageRefMode.PLACEHOLDER, # Otras opciones: EMBEDDED o REFERENCED (con generate_picture_images=True)\n", + " image_placeholder=\"\",\n", + " # ...\n", + ")\n", + "\n", + "# ...\n", + "\n", + "# Exporta a JSON\n", + "doc.save_as_json(\n", + " output_dir / \"document.json\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Siéntete libre de explorar los distintos formatos de exportación y visualizar los resultados en comparación con el PDF original." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Inspecciona la salida de DoclingDocument\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6HoHUwTRlwMF" + }, + "source": [ + "### Tratar con Tablas\n", + "\n", + "Docling ofrece excelentes capacidades de extracción de tablas gracias a los modelos [TableFormer](https://github.com/docling-project/docling-ibm-models) que impulsan la extracción de tablas. Para este ejemplo, utilicemos un documento con más tablas." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kBJvmt1tlaO8" + }, + "outputs": [], + "source": [ + "pycon_example = \"https://2025.es.pycon.org/theme/files/[ES]PyConES25.pdf\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5iSBDPf_f4gb", + "outputId": "6028ed2c-5641-4ea2-894f-8d42bd607b05" + }, + "outputs": [], + "source": [ + "# Convertimos el documento\n", + "table_result = converter.convert(pycon_example)\n", + "table_doc = table_result.document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AD7k7Sg_laO8", + "outputId": "eea71290-ff3f-49af-d427-613e96a45692" + }, + "outputs": [], + "source": [ + "print(f\"El documento contiene {len(table_doc.tables)} tablas\")\n", + "\n", + "# Exportar todas las tablas\n", + "for table_idx, table in enumerate(table_doc.tables):\n", + " # Convertir a pandas DataFrame\n", + " df = table.export_to_dataframe(doc=table_doc)\n", + "\n", + " print(f\"\\n## Tabla {table_idx}\")\n", + " print(f\"Dimensiones: {df.shape}\")\n", + " display(df)\n", + "\n", + " # Save as CSV\n", + " df.to_csv(output_dir / f\"table_{table_idx}.csv\", index=False)\n", + "\n", + " # Save as HTML\n", + " with open(output_dir / f\"table_{table_idx}.html\", \"w\") as fp:\n", + " fp.write(table.export_to_html(doc=table_doc))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8tCfJxhHaIRR" + }, + "source": [ + "### Extracción de Imágenes\n", + "\n", + "Tal como las tablas, Docling también extrae imágenes y las hace accesibles en la estructura del documento." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EANW0BKX0Q25" + }, + "source": [ + "#### Extracción y Visualización de Imágenes\n", + "\n", + "Podemos configurar la pipeline de Docling para generar imágenes de las páginas del PDF y extraer las imágenes embebidas en el documento. 
Esto es especialmente útil para documentos con gráficos, diagramas, fotos u otros elementos visuales." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DNDC06dGaH5M", + "outputId": "fc27a03f-d8b0-4282-9eb8-b4b4bc014cbb" + }, + "outputs": [], + "source": [ + "IMAGE_RESOLUTION_SCALE = 2.0 # 2x resolution (144 DPI)\n", + "\n", + "pipeline_options = PdfPipelineOptions(\n", + " images_scale=IMAGE_RESOLUTION_SCALE,\n", + " generate_page_images=True,\n", + " generate_picture_images=True,\n", + ")\n", + "\n", + "converter_with_images = DocumentConverter(\n", + " format_options={\n", + " InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n", + " }\n", + ")\n", + "\n", + "# Conversión con extracción de imágenes\n", + "img_result = converter_with_images.convert(pycon_example)\n", + "img_doc = img_result.document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tVhQ9MSWlaO8", + "outputId": "24308bf5-7020-48e0-992c-ce01c0102a5a" + }, + "outputs": [], + "source": [ + "# Crear directorio para imágenes\n", + "images_dir = output_dir / \"images\"\n", + "images_dir.mkdir(exist_ok=True)\n", + "\n", + "# Guardar imágenes de las páginas\n", + "for page_no, page in img_doc.pages.items():\n", + " page_image_filename = images_dir / f\"page_{page_no}.png\"\n", + " with page_image_filename.open(\"wb\") as fp:\n", + " page.image.pil_image.save(fp, format=\"PNG\")\n", + "\n", + "print(f\"Saved {len(img_doc.pages)} page images\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Revisa la carpeta `output/` para ver las imágenes extraídas." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bmfwcH_uUKkp" + }, + "source": [ + "### Extraer y Guardar Tablas e Imágenes\n", + "\n", + "Para un procesamiento personalizado, también podemos extraer figuras y tablas como imágenes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kmxxCUVHUKTl", + "outputId": "e73d0748-eebf-4614-8fcc-e7a84ca25c71" + }, + "outputs": [], + "source": [ + "# Extraer y guardar tablas e imágenes\n", + "table_counter = 0\n", + "picture_counter = 0\n", + "\n", + "for element, level in img_doc.iterate_items():\n", + " if isinstance(element, TableItem):\n", + " table_counter += 1\n", + " image_filename = images_dir / f\"table_{table_counter}.png\"\n", + " with image_filename.open(\"wb\") as fp:\n", + " element.get_image(img_doc).save(fp, \"PNG\")\n", + "\n", + " elif isinstance(element, PictureItem):\n", + " picture_counter += 1\n", + " image_filename = images_dir / f\"figure_{picture_counter}.png\"\n", + " with image_filename.open(\"wb\") as fp:\n", + " element.get_image(img_doc).save(fp, \"PNG\")\n", + "\n", + "print(f\"Extraído {table_counter} tabla(s) y {picture_counter} figuras como imágenes\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8pxfi8azdLTQ" + }, + "source": [ + "### Inspeccionar el Contenido de las Imágenes\n", + "\n", + "Docling preservará automáticamente los subtítulos y extraerá el contenido de texto de las imágenes extraídas. 
Veamos qué se ha extraído:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ia--zHY9dNQc", + "outputId": "4c950171-48e4-43e8-f283-0d84c9a206b7" + }, + "outputs": [], + "source": [ + "def inspect_pictures_with_images(doc: DoclingDocument, image_size=(6, 4)):\n", + " \"\"\"Display pictures inline with their text content\"\"\"\n", + " for idx, picture in enumerate(doc.pictures[:6]): # Limit to first 6 pictures\n", + " print(f\"\\n{'='*60}\")\n", + " print(f\"Imagen {idx}\")\n", + " print(f\"{'='*60}\")\n", + "\n", + " # Display the image\n", + " try:\n", + " img = picture.get_image(doc)\n", + " if img:\n", + " plt.figure(figsize=image_size)\n", + " plt.imshow(img)\n", + " plt.axis('off')\n", + " plt.title(f\"Picture {idx}\")\n", + " plt.show()\n", + " except Exception as e:\n", + " print(f\"Could not display image: {e}\")\n", + "\n", + " # Display metadata\n", + " caption = picture.caption_text(doc)\n", + " if caption:\n", + " print(f\"Subtitulo: {caption}\")\n", + "\n", + " if hasattr(picture, 'prov') and picture.prov:\n", + " print(f\"Location: Page {picture.prov[0].page_no}\")\n", + "\n", + " # Display embedded text\n", + " print(\"\\nTexto embebido:\")\n", + " text_found = False\n", + " for item, level in doc.iterate_items(root=picture, traverse_pictures=True):\n", + " if isinstance(item, TextItem):\n", + " print(f\"{' ' * (level + 1)}- {item.text}\")\n", + " text_found = True\n", + "\n", + " if not text_found:\n", + " print(\" (No text elements found)\")\n", + "\n", + "# Use the simple inline display\n", + "inspect_pictures_with_images(img_doc)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8hjtRz-mq1Id" + }, + "source": [ + "### Visualización de la Estructura del Documento con Bounding Boxes\n", + "\n", + "Para entender cómo se extrae cada parte del documento, visualicemos los elementos extraídos. Podemos hacer esto utilizando uno de los visualizadores integrados en Docling:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dt_xLM4blaO8", + "outputId": "ff8d949d-5264-4002-cff3-9b3d6ab5821a" + }, + "outputs": [], + "source": [ + "from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer\n", + "\n", + "layout_visualizer = LayoutVisualizer()\n", + "page_images = layout_visualizer.get_visualization(doc=img_doc)\n", + "\n", + "num_pages_to_viz = 2 # first N pages to visualize\n", + "pages_to_viz = list(page_images.keys())[:num_pages_to_viz]\n", + "for page in pages_to_viz:\n", + " display(page_images[page])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LkBT1FFUwzYN" + }, + "source": [ + "## 3. Enriquecimiento de Imagenes con VLM" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yDM-OeUgy0ym" + }, + "source": [ + "### Clasificación y Descripción de Imágenes\n", + "\n", + "Además de las capacidades básicas de subtitulado, Docling también puede procesar imágenes utilizando LLMs multimodales. Esto nos dará descripciones de imágenes más detalladas para ayudarnos a aprovechar los datos de las imágenes de manera más efectiva.\n", + "\n", + "Existen dos formas de hacer esto en Docling:\n", + "\n", + "1. Con modelos de HuggingFace cargados localmente: Este método es gratuito pero generalmente requiere hardware potente o utilizar modelos pequeños como [SmolVLM](https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct) que pueden no ser tan precisos.\n", + "2. 
Utilizando un VLLM basado en API: Si estas sirviendo un modelo de visión a través de una API como Ollama, [watsonx.ai](https://watsonx.ai/), o cualquiera que sea compatible con la API de OpenAI, puedes usarla para procesar las imágenes. Este método es generalmente más preciso y fácil de usar, pero puede incurrir en costos dependiendo del proveedor y el uso." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "RUN_LOCAL_VLM = False # Set to True if you have a local VLM setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "referenced_widgets": [ + "4ba3a486c40d4b39ae7c4bb6c004b440" + ] + }, + "id": "WXMHPBTdDkev", + "outputId": "5c21974b-e704-44bb-b0c8-6a6cedfdaaab" + }, + "outputs": [], + "source": [ + "from docling.datamodel.pipeline_options import PictureDescriptionApiOptions, PictureDescriptionVlmOptions\n", + "\n", + "# Configure enrichment pipeline\n", + "if RUN_LOCAL_VLM:\n", + " # activamos enriquecimiento de imagenes con un modelo VLM local\n", + " # (NOTA: si tu runtime no tiene GPUs, el rendimiento puede llegar a ser muy lento)\n", + " VLM_REPO_NAME = \"HuggingFaceTB/SmolVLM-500M-Instruct\"\n", + " VLM_MODEL_NAME = VLM_REPO_NAME.split(\"/\")[-1]\n", + " enrichment_options = PdfPipelineOptions(\n", + " do_picture_description=True,\n", + " picture_description_options=PictureDescriptionVlmOptions(\n", + " repo_id=VLM_REPO_NAME,\n", + " prompt=\"Describe in detail what is depicted in the image\",\n", + " generation_config=dict(\n", + " max_new_tokens=400,\n", + " do_sample=False\n", + " )\n", + " ),\n", + " generate_picture_images=True, # preserva las imagenes de las figuras para poder exportarlas luego\n", + " images_scale=1.0,\n", + " )\n", + "else:\n", + " # activamos enriquecimiento de imagenes con un modelo VLM remoto (requiere endpoint compatible con OpenAI)\n", + " VLM_MODEL_NAME = \"granite-vision-3.2-2b\"\n", + " enrichment_options = PdfPipelineOptions(\n", + " do_picture_description=True,\n", + " enable_remote_services=True,\n", + " picture_description_options=PictureDescriptionApiOptions(\n", + " # cualquier endpoint de LLMs compatible con la API de OpenAI funcionaría (e.g., ollama, litellm, LMStudio, ...)\n", + " url=\"http://0.0.0.0:4000/v1/chat/completions\", \n", + " params=dict(\n", + " model=VLM_MODEL_NAME,\n", + " seed=42,\n", + " max_tokens=400,\n", + " provenance=VLM_MODEL_NAME,\n", + " ),\n", + " prompt=\"Give a detailed description of what is depicted in the image\",\n", + " concurrency=4, \n", + " picture_area_threshold=0.1,\n", + " timeout=90,\n", + " ),\n", + " generate_picture_images=True, # preserva las imagenes de las figuras para poder exportarlas luego\n", + " images_scale=1.0,\n", + " )\n", + "\n", + "\n", + "converter_enriched = DocumentConverter(\n", + " format_options={\n", + " InputFormat.PDF: PdfFormatOption(pipeline_options=enrichment_options)\n", + " }\n", + ")\n", + "enr_result = converter_enriched.convert(docling_paper)\n", + "enr_doc = enr_result.document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vUVWwmcq4SJE", + "outputId": "374020e2-a78e-488f-a74f-b1c3eb288541" + }, + "outputs": [], + "source": [ + "from docling_core.types.doc.document import PictureDescriptionData\n", + "from IPython import display\n", + "\n", + "html_buffer = []\n", + "# display the first 5 pictures and their captions and annotations:\n", + "for pic in enr_doc.pictures[:5]:\n", + " html_item = (\n", + " f\"
<h3>Picture <code>{pic.self_ref}</code></h3>\"\n", + " f'<img src=\"{pic.image.uri!s}\" /><br />'\n", + " f\"<h4>Caption</h4>{pic.caption_text(doc=enr_doc)}<br />\"\n", + " )\n", + " for annotation in pic.annotations:\n", + " if not isinstance(annotation, PictureDescriptionData):\n", + " continue\n", + " html_item += (\n", + " f\"<h4>Annotations ({VLM_MODEL_NAME})</h4>{annotation.text}<br />\\n\"\n", + " )\n", + " html_buffer.append(html_item)\n", + "display.HTML(\"<hr />
\".join(html_buffer))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UStdCE9f1RgJ" + }, + "source": [ + "## Resumen\n", + "\n", + "### Lo que has logrado en el Laboratorio 1\n", + "\n", + "¡Felicidades! Has dominado el primer paso crítico en la IA de documentos:\n", + "\n", + "- Conversión de documentos básica y avanzada con retroalimentación visual\n", + "- Múltiples formatos de exportación con opciones de visualización\n", + "- Extracción de tablas e imágenes con verificación visual\n", + "- Modelos de enriquecimiento y VLMs\n", + "\n", + "### Próximos Pasos\n", + "\n", + "Puedes continuar explorando Docling por tu cuenta probando con tus propios documentos o documentos más complejos, o avanzar al siguiente laboratorio en esta serie.\n", + "\n", + "Para más información, explora los recursos a continuación:\n", + "\n", + "- GitHub: https://github.com/docling-project/docling\n", + "- Documentation: https://docling-project.github.io/docling/\n", + "- Technical Report: https://arxiv.org/abs/2408.09869\n", + "- Examples: https://github.com/docling-project/docling/tree/main/examples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/workshops/2025_10_17_PyConES/notebooks/Lab2_Chunking.ipynb b/workshops/2025_10_17_PyConES/notebooks/Lab2_Chunking.ipynb new file mode 100644 index 0000000..c60413f --- /dev/null +++ b/workshops/2025_10_17_PyConES/notebooks/Lab2_Chunking.ipynb @@ -0,0 +1,1152 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 🧠 Hybrid Document Chunking Workshop\n", + "\n", + "Welcome to the **Hybrid Document Chunking Workshop**! This comprehensive notebook demonstrates how to use Docling's advanced chunking capabilities for RAG (Retrieval-Augmented Generation) applications with full structured output support.\n", + "\n", + "## 🎯 Learning Objectives\n", + "\n", + "By the end of this workshop, you will:\n", + "- Understand hybrid chunking and its advantages over simple text splitting\n", + "- Configure advanced document processing pipelines\n", + "- Process various document formats with OCR, table extraction, and figure exports\n", + "- Generate structured output with organized folder hierarchies\n", + "- Implement tokenization-aware chunking strategies\n", + "- Use LLM-powered image descriptions for multimodal RAG\n", + "- Analyze and visualize chunks for optimal RAG performance\n", + "\n", + "## 📋 Workshop Sections\n", + "\n", + "1. **🔧 Setup & Dependencies** - Install packages with UV\n", + "2. **⚙️ Advanced Pipeline Configuration** - All processing options explained\n", + "3. **📊 Structured Output System** - Organized folder hierarchies \n", + "4. **🖼️ Figure & Table Exports** - Visual content extraction\n", + "5. **🤖 LLM Image Descriptions** - AI-powered multimodal processing\n", + "6. **🧩 Hybrid Chunking Engine** - Smart, context-aware chunking\n", + "7. **📈 Analysis & Visualization** - Comprehensive chunk quality analysis\n", + "8. 
**🎛️ Interactive Configuration Testing** - Compare different settings\n", + "9. **💡 Best Practices** - Production-ready recommendations\n", + "\n", + "Let's dive deep! 🌊\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 🔧 Setup & Dependencies with UV\n", + "\n", + "UV is a fast Python package installer and dependency manager. Let's install all required packages for this comprehensive workshop.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b738b04", + "metadata": {}, + "outputs": [], + "source": [ + "! echo \"::group::Install Dependencies\"\n", + "%pip install uv\n", + "! uv pip install \"git+https://github.com/ibm-granite-community/utils.git\" \\\n", + " transformers \\\n", + " pillow \\\n", + " langchain_community \\\n", + " 'langchain_huggingface[full]' \\\n", + " docling \\\n", + " replicate \\\n", + " matplotlib\n", + "! echo \"::endgroup::\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01317cbc", + "metadata": {}, + "outputs": [], + "source": [ + "# Core imports for the workshop\n", + "import json\n", + "import sys\n", + "import warnings\n", + "from pathlib import Path\n", + "from typing import Dict, List, Any, Optional\n", + "\n", + "import requests\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from IPython.display import display, HTML, Markdown\n", + "from ibm_granite_community.notebook_utils import get_env_var\n", + "\n", + "# Suppress PyTorch MPS warnings on Mac\n", + "warnings.filterwarnings(\"ignore\", message=\".*pin_memory.*not supported on MPS.*\")\n", + "warnings.filterwarnings(\"ignore\", category=UserWarning, module=\"torch.*\")\n", + "\n", + "# Docling imports\n", + "from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\n", + "from docling.document_converter import DocumentConverter, PdfFormatOption\n", + "from docling.chunking import HybridChunker\n", + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", + "from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n", + "from docling_core.types.doc import PictureItem, TableItem\n", + "from transformers import AutoTokenizer\n", + "from langchain_community.llms import Replicate\n", + "\n", + "\n", + "# Optional replicate imports for image descriptions\n", + "try:\n", + " import base64\n", + " from dotenv import load_dotenv\n", + " import os\n", + " REPLICATE_AVAILABLE = True\n", + " # Load environment variables\n", + " load_dotenv()\n", + " os.environ['REPLICATE_API_TOKEN'] = get_env_var(\"REPLICATE_API_TOKEN\")\n", + " assert os.environ.get('REPLICATE_API_TOKEN'), \"Replicate unavailable\"\n", + "except AssertionError:\n", + " REPLICATE_AVAILABLE = False\n", + " print(\"⚠️ Replicate not available. Image descriptions will be disabled.\")\n", + "\n", + "print(\"✅ All imports successful!\")\n", + "print(\"🚀 Ready for advanced document processing!\")\n", + "\n", + "# Set up plotting style\n", + "plt.style.use('default')\n", + "sns.set_palette(\"husl\")" + ] + }, + { + "cell_type": "markdown", + "id": "bfeee639", + "metadata": {}, + "source": [ + "## 2. ⚙️ Advanced Pipeline Configuration\n", + "\n", + "The `AdvancedPipelineConfig` class provides comprehensive control over document processing. 
This replicates all the features from the command-line processor, including structured output, exports, and LLM integration.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "821a683f", + "metadata": {}, + "outputs": [], + "source": [ + "class AdvancedPipelineConfig:\n", + " \"\"\"Comprehensive configuration class replicating document_processor.py functionality.\"\"\"\n", + " \n", + " def __init__(\n", + " self,\n", + " # Core processing options\n", + " do_ocr: bool = True,\n", + " do_table_structure: bool = True,\n", + " generate_page_images: bool = False, # Will be enabled if export_figures=True\n", + " generate_picture_images: bool = False, # Will be enabled if export_figures=True\n", + " \n", + " # Chunking options\n", + " chunk_max_tokens: int = 512,\n", + " chunk_merge_peers: bool = True,\n", + " embedding_model: str = \"sentence-transformers/all-MiniLM-L6-v2\",\n", + " \n", + " # Export options (the key new features!)\n", + " export_figures: bool = False,\n", + " export_tables: bool = False,\n", + " images_scale: float = 2.0, # Image resolution scale (1.0 = 72 DPI)\n", + " \n", + " # Structured output options\n", + " organize_output: bool = True, # Create organized folder structure\n", + " save_metadata: bool = True, # Save comprehensive document metadata\n", + " export_markdown: bool = True, # Export as markdown\n", + " output_dir: str = \"workshop_output\",\n", + " \n", + " # LLM Image description options\n", + " describe_images: bool = False,\n", + " llm_model: str = \"ibm-granite/granite-vision-3.3-2b\",\n", + " \n", + " # Supported formats\n", + " allowed_formats: List[InputFormat] = None,\n", + " ):\n", + " \"\"\"Initialize comprehensive pipeline configuration.\n", + " \n", + " Key New Features Explained:\n", + " \n", + " 🏗️ STRUCTURED OUTPUT (organize_output=True):\n", + " Creates organized folder hierarchy:\n", + " output_dir/files/document_name/\n", + " ├── json/ # Main JSON output + markdown\n", + " ├── metadata/ # Separate metadata file\n", + " └── exports/ # All visual exports\n", + " ├── figures/ # Extracted figures/pictures\n", + " ├── tables/ # Table images\n", + " └── pages/ # Page screenshots\n", + " \n", + " 🖼️ FIGURE EXPORTS (export_figures=True):\n", + " • Extracts all figures and pictures from documents\n", + " • Saves as high-resolution PNG files\n", + " • Automatically generates page screenshots\n", + " • Preserves visual content for multimodal RAG\n", + " \n", + " 📊 TABLE EXPORTS (export_tables=True):\n", + " • Extracts tables to multiple formats: CSV, HTML, Markdown\n", + " • Saves table images as PNG files\n", + " • Preserves table structure and content\n", + " • Enables both text and visual table retrieval\n", + " \n", + " 🤖 LLM IMAGE DESCRIPTIONS (describe_images=True):\n", + " • Supports Claude, GPT-4V, Gemini Vision, etc.\n", + " • Creates searchable text descriptions\n", + " • Tracks costs and token usage\n", + " • Enables semantic search over visual content\n", + " \n", + " 📋 COMPREHENSIVE METADATA (save_metadata=True):\n", + " • Processing configuration details\n", + " • Document statistics and structure info\n", + " • File metadata and format detection\n", + " • Chunk quality metrics\n", + " \"\"\"\n", + " # Core processing\n", + " self.do_ocr = do_ocr\n", + " self.do_table_structure = do_table_structure\n", + " self.generate_page_images = generate_page_images\n", + " self.generate_picture_images = generate_picture_images\n", + " \n", + " # Chunking\n", + " self.chunk_max_tokens = chunk_max_tokens\n", + " 
self.chunk_merge_peers = chunk_merge_peers\n", + " self.embedding_model = embedding_model\n", + " \n", + " # Export options\n", + " self.export_figures = export_figures\n", + " self.export_tables = export_tables\n", + " self.images_scale = images_scale\n", + " \n", + " # Structured output\n", + " self.organize_output = organize_output\n", + " self.save_metadata = save_metadata\n", + " self.export_markdown = export_markdown\n", + " self.output_dir = Path(output_dir)\n", + " \n", + " # LLM features\n", + " self.describe_images = describe_images\n", + " self.llm_model = llm_model\n", + " \n", + " # Multi-format support\n", + " self.allowed_formats = allowed_formats or [\n", + " InputFormat.PDF,\n", + " InputFormat.DOCX,\n", + " InputFormat.PPTX,\n", + " InputFormat.XLSX,\n", + " InputFormat.HTML,\n", + " InputFormat.MD,\n", + " InputFormat.IMAGE,\n", + " ]\n", + " \n", + " # Auto-enable image generation if we're exporting figures\n", + " if self.export_figures:\n", + " self.generate_page_images = True\n", + " self.generate_picture_images = True\n", + " \n", + " def to_pipeline_options(self) -> Optional[PdfPipelineOptions]:\n", + " \"\"\"Convert configuration to Docling PdfPipelineOptions.\"\"\"\n", + " try:\n", + " options = PdfPipelineOptions(\n", + " do_ocr=self.do_ocr,\n", + " do_table_structure=self.do_table_structure,\n", + " generate_page_images=self.generate_page_images,\n", + " generate_picture_images=self.generate_picture_images,\n", + " )\n", + " \n", + " # Set images scale for high-quality exports\n", + " if self.export_figures or self.generate_page_images or self.generate_picture_images:\n", + " options.images_scale = self.images_scale\n", + " \n", + " return options\n", + " except Exception as e:\n", + " print(f\"Warning: Could not create pipeline options: {e}\")\n", + " return None\n", + " \n", + " def summary(self) -> str:\n", + " \"\"\"Return a comprehensive summary of the configuration.\"\"\"\n", + " formats_str = ', '.join([f.name for f in self.allowed_formats])\n", + " \n", + " return f\"\"\"🔧 Advanced Pipeline Configuration:\n", + " \n", + "📊 CORE PROCESSING:\n", + " - OCR: {'✓' if self.do_ocr else '✗'}\n", + " - Table Structure: {'✓' if self.do_table_structure else '✗'}\n", + " - Page Images: {'✓' if self.generate_page_images else '✗'}\n", + " - Picture Images: {'✓' if self.generate_picture_images else '✗'}\n", + " \n", + "🧩 CHUNKING:\n", + " - Method: Hybrid (tokenization-aware)\n", + " - Max Tokens/Chunk: {self.chunk_max_tokens}\n", + " - Merge Peers: {'✓' if self.chunk_merge_peers else '✗'}\n", + " - Embedding Model: {self.embedding_model}\n", + " \n", + "📁 STRUCTURED OUTPUT:\n", + " - Organize Output: {'✓' if self.organize_output else '✗'}\n", + " - Save Metadata: {'✓' if self.save_metadata else '✗'}\n", + " - Export Markdown: {'✓' if self.export_markdown else '✗'}\n", + " - Output Directory: {self.output_dir}\n", + " \n", + "🖼️ EXPORTS:\n", + " - Export Figures: {'✓' if self.export_figures else '✗'}\n", + " - Export Tables: {'✓' if self.export_tables else '✗'}\n", + " - Images Scale: {self.images_scale}x ({int(self.images_scale * 72)} DPI)\n", + " \n", + "🤖 LLM FEATURES:\n", + " - Describe Images: {'✓' if self.describe_images else '✗'}\n", + " - LLM Model: {self.llm_model}\n", + " \n", + "⚙️ FORMATS:\n", + " - Supported: {formats_str}\"\"\"\n", + "\n", + "# Create a comprehensive configuration (matching the example command)\n", + "config = AdvancedPipelineConfig(\n", + " chunk_max_tokens=256,\n", + " organize_output=True,\n", + " export_figures=True,\n", + " 
export_tables=True,\n", + " save_metadata=True,\n", + " describe_images=True,\n", + " llm_model=\"ibm-granite/granite-vision-3.3-2b\",\n", + " output_dir=\"workshop_output\"\n", + ")\n", + "\n", + "print(\"✅ AdvancedPipelineConfig class defined!\")\n", + "print(\"\\n\" + \"=\"*60)\n", + "print(config.summary())\n", + "print(\"=\"*60)\n" + ] + }, + { + "cell_type": "markdown", + "id": "171cdaea", + "metadata": {}, + "source": [ + "## 3. 📊 Structured Output System Explained\n", + "\n", + "The structured output system creates a comprehensive, organized hierarchy that makes it easy to manage and access all processed content. Let's understand each component:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4eb6a9e0", + "metadata": {}, + "outputs": [], + "source": [ + "def explain_output_structure():\n", + " \"\"\"Show the complete output structure with explanations.\"\"\"\n", + " \n", + " structure_diagram = \"\"\"\n", + "📁 STRUCTURED OUTPUT HIERARCHY:\n", + "\n", + "workshop_output/\n", + "└── files/ # Categorized by type (files vs audio)\n", + " └── document_name/ # One folder per document\n", + " ├── json/ # Main content\n", + " │ ├── document_name.json # Complete processed data\n", + " │ └── document_name.md # Markdown export\n", + " │\n", + " ├── metadata/ # Metadata only\n", + " │ └── document_name_metadata.json\n", + " │\n", + " └── exports/ \n", + " ├── figures/ \n", + " │ ├── document-picture-1.png\n", + " │ ├── document-picture-2.png\n", + " │ └── ...\n", + " │\n", + " ├── tables/ \n", + " │ ├── document-table-1.png # Visual\n", + " │ ├── document-table-1.csv # Data \n", + " │ ├── document-table-1.html # Formatted\n", + " │ ├── document-table-1.md # Markdown\n", + " │ └── ...\n", + " │\n", + " ├── pages/ # Page screenshots\n", + " │ ├── document-page-1.png\n", + " │ ├── document-page-2.png\n", + " │ └── ...\n", + " \"\"\"\n", + " \n", + " explanations = {\n", + " \"📄 json/\": \"Contains the main processed data as JSON and the original text as Markdown. This is your primary content for RAG.\",\n", + " \n", + " \"📋 metadata/\": \"Separate metadata file with processing config, document stats, file info, and quality metrics.\",\n", + " \n", + " \"🖼️ figures/\": \"All pictures, diagrams, charts, and visual elements extracted as high-res PNG files for multimodal RAG.\",\n", + " \n", + " \"📊 tables/\": \"Tables in multiple formats: PNG images for visual retrieval, CSV for data analysis, HTML for web display, Markdown for text processing.\",\n", + " \n", + " \"📖 pages/\": \"Screenshots of each document page, useful for layout-aware applications and visual document search.\",\n", + " \n", + " \"🤖 image-descriptions.json\": \"LLM-generated descriptions of all images, with cost tracking and metadata. Makes visual content searchable via text.\"\n", + " }\n", + " \n", + " print(structure_diagram)\n", + " print(\"\\n🔍 COMPONENT EXPLANATIONS:\\n\")\n", + " \n", + " for component, explanation in explanations.items():\n", + " print(f\"{component}\")\n", + " print(f\" {explanation}\")\n", + " print()\n", + " \n", + "\n", + "explain_output_structure()" + ] + }, + { + "cell_type": "markdown", + "id": "2ee0186c", + "metadata": {}, + "source": [ + "## 4. 
🧠 Comprehensive Document Processor\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "016b1dde", + "metadata": {}, + "outputs": [], + "source": [ + "class ComprehensiveDocumentProcessor:\n", + " \"\"\"Full-featured document processor with all advanced capabilities.\"\"\"\n", + " \n", + " def __init__(self, config: Optional[AdvancedPipelineConfig] = None):\n", + " \"\"\"Initialize the comprehensive processor.\"\"\"\n", + " self.config = config or AdvancedPipelineConfig()\n", + " \n", + " # Initialize DocumentConverter\n", + " self.converter = self._initialize_converter()\n", + " \n", + " # Initialize HybridChunker\n", + " self.hybrid_chunker = None\n", + " self.tokenizer = None\n", + " self._initialize_hybrid_chunker()\n", + " \n", + " # Store conversion result for exports\n", + " self._conversion_result = None\n", + " \n", + " def _initialize_converter(self) -> DocumentConverter:\n", + " \"\"\"Initialize DocumentConverter with multi-format support.\"\"\"\n", + " try:\n", + " pipeline_options = self.config.to_pipeline_options()\n", + " \n", + " format_options = {}\n", + " if InputFormat.PDF in self.config.allowed_formats and pipeline_options:\n", + " format_options[InputFormat.PDF] = PdfFormatOption(\n", + " pipeline_cls=StandardPdfPipeline,\n", + " backend=PyPdfiumDocumentBackend,\n", + " pipeline_options=pipeline_options\n", + " )\n", + " \n", + " converter = DocumentConverter(\n", + " allowed_formats=self.config.allowed_formats,\n", + " format_options=format_options if format_options else None\n", + " )\n", + " \n", + " formats_str = ', '.join([f.name for f in self.config.allowed_formats])\n", + " print(f\"✅ Initialized converter for formats: {formats_str}\")\n", + " return converter\n", + " \n", + " except Exception as e:\n", + " print(f\"⚠️ Warning: Could not apply format options: {e}\")\n", + " return DocumentConverter()\n", + " \n", + " def _initialize_hybrid_chunker(self):\n", + " \"\"\"Initialize the hybrid chunker with tokenizer.\"\"\"\n", + " try:\n", + " print(f\"🔧 Loading tokenizer: {self.config.embedding_model}\")\n", + " self.tokenizer = AutoTokenizer.from_pretrained(self.config.embedding_model)\n", + " \n", + " self.hybrid_chunker = HybridChunker(\n", + " tokenizer=self.tokenizer,\n", + " max_tokens=self.config.chunk_max_tokens,\n", + " merge_peers=self.config.chunk_merge_peers\n", + " )\n", + " print(f\"✅ Initialized HybridChunker (max_tokens={self.config.chunk_max_tokens})\")\n", + " \n", + " except Exception as e:\n", + " print(f\"❌ Error: Could not initialize hybrid chunker: {e}\")\n", + " raise e\n", + " \n", + " def process_document(self, file_path: str) -> Dict[str, Any]:\n", + " \"\"\"Process a document with full structured output.\"\"\"\n", + " try:\n", + " print(f\"🔄 Processing document: {Path(file_path).name}\")\n", + " \n", + " # Convert document\n", + " result = self.converter.convert(file_path)\n", + " doc = result.document\n", + " self._conversion_result = result\n", + " \n", + " # Create output structure\n", + " doc_folder = self._get_output_folder(file_path)\n", + " \n", + " # Extract comprehensive metadata\n", + " metadata = self._extract_comprehensive_metadata(result, file_path, doc_folder)\n", + " \n", + " # Create chunks\n", + " chunks = self._create_hybrid_chunks(doc)\n", + " \n", + " # Extract tables and headers\n", + " tables = self._extract_tables(doc)\n", + " headers = self._extract_headers(doc)\n", + " \n", + " # Handle exports (figures, tables)\n", + " exports_info = {}\n", + " if self.config.export_figures or 
self.config.export_tables:\n", + " exports_info = self._handle_exports(result, doc, file_path, doc_folder)\n", + " \n", + " # Prepare complete processed data\n", + " processed_data = {\n", + " \"metadata\": metadata,\n", + " \"content\": {\n", + " \"full_text\": doc.export_to_markdown(),\n", + " \"structured_content\": json.loads(doc.to_json()) if hasattr(doc, 'to_json') else {},\n", + " },\n", + " \"chunks\": chunks,\n", + " \"tables\": tables,\n", + " \"headers\": headers,\n", + " \"exports\": exports_info,\n", + " \"document_stats\": {\n", + " \"total_characters\": len(doc.export_to_markdown()),\n", + " \"total_words\": len(doc.export_to_markdown().split()),\n", + " \"total_chunks\": len(chunks),\n", + " \"total_tables\": len(tables),\n", + " \"total_headers\": len(headers),\n", + " }\n", + " }\n", + " \n", + " # Create structured output if enabled\n", + " if self.config.organize_output:\n", + " output_structure = self._create_output_structure(doc_folder, processed_data)\n", + " processed_data[\"output_structure\"] = output_structure\n", + " \n", + " print(f\"✅ Document processed successfully! Created {len(chunks)} chunks\")\n", + " return processed_data\n", + " \n", + " except Exception as e:\n", + " return {\n", + " \"error\": f\"Failed to process document: {str(e)}\",\n", + " \"metadata\": {\"source_file\": str(file_path)},\n", + " }\n", + " \n", + " def _get_output_folder(self, file_path: str) -> Path:\n", + " \"\"\"Determine output folder structure.\"\"\"\n", + " if self.config.organize_output:\n", + " return self.config.output_dir / \"files\" / Path(file_path).stem\n", + " else:\n", + " return self.config.output_dir\n", + " \n", + " def _create_hybrid_chunks(self, doc) -> List[Dict[str, Any]]:\n", + " \"\"\"Create chunks using Docling's HybridChunker.\"\"\"\n", + " print(\"🧩 Creating hybrid chunks...\")\n", + " chunks = []\n", + " \n", + " try:\n", + " chunk_iter = self.hybrid_chunker.chunk(dl_doc=doc)\n", + " \n", + " for i, chunk in enumerate(chunk_iter):\n", + " contextualized_text = self.hybrid_chunker.contextualize(chunk=chunk)\n", + " \n", + " chunk_data = {\n", + " \"chunk_id\": i,\n", + " \"text\": chunk.text,\n", + " \"contextualized_text\": contextualized_text,\n", + " \"token_count\": len(self.tokenizer.encode(chunk.text)) if self.tokenizer else len(chunk.text.split()),\n", + " \"char_count\": len(chunk.text),\n", + " \"contextualized_char_count\": len(contextualized_text),\n", + " \"metadata\": {\n", + " \"headings\": getattr(chunk.meta, 'headings', []) if hasattr(chunk, 'meta') else [],\n", + " \"page_info\": getattr(chunk.meta, 'page_info', []) if hasattr(chunk, 'meta') else [],\n", + " \"content_type\": getattr(chunk.meta, 'content_type', None) if hasattr(chunk, 'meta') else None,\n", + " \"chunk_type\": \"hybrid\"\n", + " }\n", + " }\n", + " chunks.append(chunk_data)\n", + " \n", + " print(f\"✅ Created {len(chunks)} hybrid chunks\")\n", + " return chunks\n", + " \n", + " except Exception as e:\n", + " print(f\"❌ Error: Hybrid chunking failed: {e}\")\n", + " raise e\n", + " \n", + " def _extract_tables(self, doc) -> List[Dict[str, Any]]:\n", + " \"\"\"Extract table information from the document.\"\"\"\n", + " tables = []\n", + " try:\n", + " if hasattr(doc, 'to_json'):\n", + " doc_dict = json.loads(doc.to_json())\n", + " if 'tables' in doc_dict:\n", + " for i, table in enumerate(doc_dict['tables']):\n", + " tables.append({\n", + " \"table_id\": i,\n", + " \"content\": table,\n", + " \"extraction_method\": \"structured_json\"\n", + " })\n", + " \n", + " if 
hasattr(doc, 'tables'):\n", + " for i, table in enumerate(doc.tables):\n", + " tables.append({\n", + " \"table_id\": len(tables),\n", + " \"content\": str(table) if hasattr(table, '__str__') else table,\n", + " \"extraction_method\": \"direct_attribute\"\n", + " })\n", + " \n", + " if tables:\n", + " print(f\"✅ Extracted {len(tables)} tables\")\n", + " \n", + " except Exception as e:\n", + " print(f\"⚠️ Warning: Could not extract tables: {e}\")\n", + " \n", + " return tables\n", + " \n", + " def _extract_headers(self, doc) -> List[Dict[str, Any]]:\n", + " \"\"\"Extract header information from the document.\"\"\"\n", + " headers = []\n", + " try:\n", + " markdown_content = doc.export_to_markdown()\n", + " lines = markdown_content.split('\\n')\n", + " \n", + " for i, line in enumerate(lines):\n", + " line = line.strip()\n", + " if line.startswith('#'):\n", + " level = len(line) - len(line.lstrip('#'))\n", + " text = line.lstrip('#').strip()\n", + " if text:\n", + " headers.append({\n", + " \"level\": level,\n", + " \"text\": text,\n", + " \"line_number\": i,\n", + " })\n", + " \n", + " if headers:\n", + " print(f\"✅ Extracted {len(headers)} headers\")\n", + " \n", + " except Exception as e:\n", + " print(f\"⚠️ Warning: Could not extract headers: {e}\")\n", + " \n", + " return headers\n", + " \n", + " def _extract_comprehensive_metadata(self, result, file_path: str, doc_folder: Path) -> Dict[str, Any]:\n", + " \"\"\"Extract comprehensive metadata about the document and processing.\"\"\"\n", + " file_path_obj = Path(file_path)\n", + " \n", + " metadata = {\n", + " \"source_file\": str(file_path),\n", + " \"file_name\": file_path_obj.name,\n", + " \"file_stem\": file_path_obj.stem,\n", + " \"file_type\": file_path_obj.suffix.lower(),\n", + " \"file_size_bytes\": file_path_obj.stat().st_size if file_path_obj.exists() else 0,\n", + " \"title\": getattr(result.document, 'title', None) or file_path_obj.stem,\n", + " \"output_folder\": str(doc_folder),\n", + " \"processing_config\": {\n", + " \"chunking_method\": \"hybrid\",\n", + " \"max_tokens_per_chunk\": self.config.chunk_max_tokens,\n", + " \"ocr_enabled\": self.config.do_ocr,\n", + " \"table_structure_enabled\": self.config.do_table_structure,\n", + " \"export_figures\": self.config.export_figures,\n", + " \"export_tables\": self.config.export_tables,\n", + " \"organize_output\": self.config.organize_output,\n", + " \"describe_images\": self.config.describe_images,\n", + " \"llm_model\": self.config.llm_model if self.config.describe_images else None,\n", + " \"embedding_model\": self.config.embedding_model,\n", + " },\n", + " }\n", + " \n", + " # Document structure metadata\n", + " doc = result.document\n", + " if hasattr(doc, 'pages'):\n", + " metadata[\"page_count\"] = len(doc.pages) if doc.pages else 0\n", + " \n", + " if hasattr(doc, 'tables'):\n", + " metadata[\"table_count\"] = len(doc.tables) if doc.tables else 0\n", + " \n", + " # Content statistics\n", + " full_text = doc.export_to_markdown()\n", + " metadata[\"content_stats\"] = {\n", + " \"total_characters\": len(full_text),\n", + " \"total_words\": len(full_text.split()),\n", + " \"total_lines\": len(full_text.split('\\n')),\n", + " }\n", + " \n", + " return metadata\n", + " \n", + " def _create_output_structure(self, doc_folder: Path, processed_data: Dict[str, Any]) -> Dict[str, str]:\n", + " \"\"\"Create organized output folder structure and save files.\"\"\"\n", + " folders = {\n", + " \"json_folder\": doc_folder / \"json\",\n", + " \"metadata_folder\": doc_folder / 
\"metadata\",\n", + " \"exports_folder\": doc_folder / \"exports\",\n", + " \"figures_folder\": doc_folder / \"exports\" / \"figures\",\n", + " \"tables_folder\": doc_folder / \"exports\" / \"tables\", \n", + " \"pages_folder\": doc_folder / \"exports\" / \"pages\",\n", + " }\n", + " \n", + " # Create directories\n", + " for folder in folders.values():\n", + " folder.mkdir(parents=True, exist_ok=True)\n", + " \n", + " # Save JSON output\n", + " json_file = folders[\"json_folder\"] / f\"{processed_data['metadata']['file_stem']}.json\"\n", + " with json_file.open('w', encoding='utf-8') as f:\n", + " json.dump(processed_data, f, indent=2, ensure_ascii=False)\n", + " \n", + " # Save metadata separately\n", + " metadata_file = folders[\"metadata_folder\"] / f\"{processed_data['metadata']['file_stem']}_metadata.json\"\n", + " with metadata_file.open('w', encoding='utf-8') as f:\n", + " json.dump(processed_data['metadata'], f, indent=2, ensure_ascii=False)\n", + " \n", + " # Save markdown if enabled\n", + " if self.config.export_markdown and hasattr(self._conversion_result, 'document'):\n", + " markdown_file = folders[\"json_folder\"] / f\"{processed_data['metadata']['file_stem']}.md\"\n", + " with markdown_file.open('w', encoding='utf-8') as f:\n", + " f.write(self._conversion_result.document.export_to_markdown())\n", + " \n", + " print(f\"📁 Created structured output in: {doc_folder}\")\n", + " return {k: str(v) for k, v in folders.items()}\n", + " \n", + " # We'll add the export methods in the next cell due to length...\n", + "\n", + "print(\"✅ ComprehensiveDocumentProcessor class defined!\")\n", + "\n", + "print(\"🔧 Class capabilities:\")\n", + "print(\" • Document formats: PDF, DOCX\")\n", + "print(\" • Chunking: Hybrid chunking with configurable token limits\")\n", + "print(\" • Exports: Figures, tables, pages, markdown\")\n", + "print(\" • Output organization: Structured folders with metadata\")\n", + "print(\" • LLM integration: Image descriptions and content analysis\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "621a7c43", + "metadata": {}, + "outputs": [], + "source": [ + "def _handle_exports(self, result, doc, file_path: str, doc_folder: Path) -> Dict[str, Any]:\n", + " \"\"\"Handle figure and table exports based on configuration.\"\"\"\n", + " exports_info = {\n", + " \"figures_exported\": 0,\n", + " \"tables_exported\": 0,\n", + " \"pages_exported\": 0,\n", + " \"export_directory\": str(doc_folder / \"exports\"),\n", + " \"exported_files\": [],\n", + " \"image_descriptions\": []\n", + " }\n", + " \n", + " if not (self.config.export_figures or self.config.export_tables):\n", + " return exports_info\n", + " \n", + " # Create output directories\n", + " exports_folder = doc_folder / \"exports\"\n", + " figures_folder = exports_folder / \"figures\"\n", + " tables_folder = exports_folder / \"tables\"\n", + " pages_folder = exports_folder / \"pages\"\n", + " \n", + " for folder in [exports_folder, figures_folder, tables_folder, pages_folder]:\n", + " folder.mkdir(parents=True, exist_ok=True)\n", + " \n", + " doc_filename = Path(file_path).stem\n", + " \n", + " # Export figures if enabled\n", + " if self.config.export_figures:\n", + " figure_info = self._export_figures(result, doc, doc_filename, figures_folder, tables_folder, pages_folder, exports_folder)\n", + " exports_info.update(figure_info)\n", + " \n", + " # Export tables if enabled\n", + " if self.config.export_tables:\n", + " table_info = self._export_tables(result, doc, doc_filename, tables_folder)\n", 
+ " exports_info.update(table_info)\n", + " \n", + " return exports_info\n", + "\n", + "def _export_figures(self, result, doc, doc_filename: str, figures_folder: Path, tables_folder: Path, pages_folder: Path, exports_folder: Path) -> Dict[str, Any]:\n", + " \"\"\"Export figures, page images, and pictures with optional LLM descriptions.\"\"\"\n", + " print(\"🖼️ Exporting figures...\")\n", + " exported_files = []\n", + " image_descriptions = []\n", + " \n", + " try:\n", + " # Save page images to pages subfolder\n", + " page_counter = 0\n", + " if hasattr(result, 'document') and hasattr(result.document, 'pages'):\n", + " for page_no, page in result.document.pages.items():\n", + " if hasattr(page, 'image') and page.image:\n", + " page_counter += 1\n", + " page_image_filename = pages_folder / f\"{doc_filename}-page-{page.page_no}.png\"\n", + " with page_image_filename.open(\"wb\") as fp:\n", + " page.image.pil_image.save(fp, format=\"PNG\")\n", + " exported_files.append(str(page_image_filename))\n", + " \n", + " # Save images of figures and tables\n", + " table_counter = 0\n", + " picture_counter = 0\n", + " \n", + " if hasattr(result, 'document'):\n", + " for element, _level in result.document.iterate_items():\n", + " if isinstance(element, TableItem):\n", + " table_counter += 1\n", + " element_image_filename = tables_folder / f\"{doc_filename}-table-{table_counter}.png\"\n", + " try:\n", + " with element_image_filename.open(\"wb\") as fp:\n", + " element.get_image(result.document).save(fp, \"PNG\")\n", + " exported_files.append(str(element_image_filename))\n", + " \n", + " # Add LLM description if enabled\n", + " if self.config.describe_images and REPLICATE_AVAILABLE:\n", + " desc_result = self._describe_image_with_llm(element_image_filename, self.config.llm_model)\n", + " desc_result.update({\n", + " \"type\": \"table\",\n", + " \"image_filename\": element_image_filename.name,\n", + " \"sequence_number\": table_counter\n", + " })\n", + " image_descriptions.append(desc_result)\n", + " \n", + " except Exception as e:\n", + " print(f\"⚠️ Warning: Could not export table {table_counter} image: {e}\")\n", + " \n", + " if isinstance(element, PictureItem):\n", + " picture_counter += 1\n", + " element_image_filename = figures_folder / f\"{doc_filename}-picture-{picture_counter}.png\"\n", + " try:\n", + " with element_image_filename.open(\"wb\") as fp:\n", + " element.get_image(result.document).save(fp, \"PNG\")\n", + " exported_files.append(str(element_image_filename))\n", + " \n", + " # Add LLM description if enabled\n", + " if self.config.describe_images and REPLICATE_AVAILABLE:\n", + " desc_result = self._describe_image_with_llm(element_image_filename, self.config.llm_model)\n", + " desc_result.update({\n", + " \"type\": \"picture/figure\",\n", + " \"image_filename\": element_image_filename.name,\n", + " \"sequence_number\": picture_counter\n", + " })\n", + " image_descriptions.append(desc_result)\n", + " \n", + " except Exception as e:\n", + " print(f\"⚠️ Warning: Could not export picture {picture_counter} image: {e}\")\n", + " \n", + " # Save consolidated image descriptions if any were generated\n", + " if self.config.describe_images and image_descriptions:\n", + " consolidated_descriptions = {\n", + " \"document_name\": doc_filename,\n", + " \"timestamp\": pd.Timestamp.now().isoformat(),\n", + " \"total_images\": len(image_descriptions),\n", + " \"total_cost\": sum(desc.get(\"cost\", 0) for desc in image_descriptions),\n", + " \"total_input_tokens\": sum(desc.get(\"input_tokens\", 0) for desc 
in image_descriptions),\n", + " \"total_output_tokens\": sum(desc.get(\"output_tokens\", 0) for desc in image_descriptions),\n", + " \"model_used\": self.config.llm_model,\n", + " \"descriptions\": image_descriptions\n", + " }\n", + " \n", + " consolidated_filename = exports_folder / f\"{doc_filename}-image-descriptions.json\"\n", + " with consolidated_filename.open(\"w\", encoding=\"utf-8\") as fp:\n", + " json.dump(consolidated_descriptions, fp, indent=2, ensure_ascii=False)\n", + " exported_files.append(str(consolidated_filename))\n", + " print(f\"🤖 Generated {len(image_descriptions)} LLM image descriptions\")\n", + " \n", + " print(f\"✅ Exported {len(exported_files)} figure files\")\n", + " return {\n", + " \"figures_exported\": table_counter + picture_counter,\n", + " \"pages_exported\": page_counter,\n", + " \"figure_files\": exported_files,\n", + " \"image_descriptions\": image_descriptions\n", + " }\n", + " \n", + " except Exception as e:\n", + " print(f\"❌ Warning: Figure export failed: {e}\")\n", + " return {\"figures_exported\": 0, \"figure_files\": []}\n", + "\n", + "def _export_tables(self, result, doc, doc_filename: str, tables_folder: Path) -> Dict[str, Any]:\n", + " \"\"\"Export tables to various formats (CSV, HTML, Markdown).\"\"\"\n", + " print(\"📊 Exporting tables...\")\n", + " exported_files = []\n", + " \n", + " try:\n", + " table_counter = 0\n", + " if hasattr(doc, 'tables'):\n", + " for table_ix, table in enumerate(doc.tables):\n", + " table_counter += 1\n", + " \n", + " try:\n", + " table_df = table.export_to_dataframe(doc=doc)\n", + " \n", + " # Save as CSV\n", + " csv_filename = tables_folder / f\"{doc_filename}-table-{table_ix + 1}.csv\"\n", + " table_df.to_csv(csv_filename, index=False)\n", + " exported_files.append(str(csv_filename))\n", + " \n", + " # Save as HTML\n", + " html_filename = tables_folder / f\"{doc_filename}-table-{table_ix + 1}.html\"\n", + " with html_filename.open(\"w\") as fp:\n", + " fp.write(table.export_to_html(doc=doc))\n", + " exported_files.append(str(html_filename))\n", + " \n", + " # Save as Markdown\n", + " md_filename = tables_folder / f\"{doc_filename}-table-{table_ix + 1}.md\"\n", + " with md_filename.open(\"w\") as fp:\n", + " fp.write(f\"## Table {table_ix + 1}\\n\\n\")\n", + " fp.write(table_df.to_markdown(index=False))\n", + " exported_files.append(str(md_filename))\n", + " \n", + " except Exception as e:\n", + " print(f\"⚠️ Warning: Could not export table {table_ix + 1}: {e}\")\n", + " \n", + " print(f\"✅ Exported {table_counter} tables in {len(exported_files)} files\")\n", + " return {\n", + " \"tables_exported\": table_counter,\n", + " \"table_files\": exported_files\n", + " }\n", + " \n", + " except Exception as e:\n", + " print(f\"❌ Warning: Table export failed: {e}\")\n", + " return {\"tables_exported\": 0, \"table_files\": []}\n", + "\n", + "def _describe_image_with_llm(self, image_path: Path, model: str = \"ibm-granite/granite-vision-3.3-2b\") -> Dict[str, Any]:\n", + " \"\"\"Describe an image using Replicate granite vision model.\"\"\"\n", + " try:\n", + " import replicate\n", + " import time\n", + " \n", + " # Convert image to base64 data URI\n", + " with open(image_path, \"rb\") as image_file:\n", + " image_data = base64.b64encode(image_file.read()).decode('utf-8')\n", + " image_data_uri = f\"data:image/png;base64,{image_data}\"\n", + " \n", + " prompt_text = \"Describe this image in detail. 
Focus on the main content, text, data, charts, diagrams, or any other relevant information that would be useful for document understanding and search.\"\n", + " \n", + " # Use Replicate granite vision model to describe the image\n", + " output = replicate.run(\n", + " \"ibm-granite/granite-vision-3.3-2b\",\n", + " input={\n", + " \"image\": image_data_uri,\n", + " \"top_p\": 1,\n", + " \"prompt\": prompt_text,\n", + " \"max_tokens\": 1024,\n", + " \"temperature\": 0.2\n", + " }\n", + " )\n", + " time.sleep(5) # To avoid rate limiting\n", + " \n", + " # granite vision returns an iterator, collect all output\n", + " description_parts = []\n", + " for item in output:\n", + " description_parts.append(str(item))\n", + " \n", + " description = \"\".join(description_parts).strip()\n", + " \n", + " return {\n", + " \"success\": True,\n", + " \"description\": description,\n", + " \"prompt\": prompt_text,\n", + " \"image_path\": str(image_path),\n", + " \"model\": \"ibm-granite/granite-vision-3.3-2b\",\n", + " \"input_tokens\": 0, # Replicate doesn't provide token info\n", + " \"output_tokens\": len(description.split()) if description else 0,\n", + " \"cost\": 0.0, # Replicate doesn't provide cost info in response\n", + " \"timestamp\": pd.Timestamp.now().isoformat(),\n", + " \"error\": None\n", + " }\n", + " \n", + " except Exception as e:\n", + " return {\n", + " \"success\": False,\n", + " \"error\": str(e),\n", + " \"description\": None,\n", + " \"prompt\": None,\n", + " \"image_path\": str(image_path),\n", + " \"model\": \"ibm-granite/granite-vision-3.3-2b\",\n", + " \"input_tokens\": 0,\n", + " \"output_tokens\": 0,\n", + " \"cost\": 0.0,\n", + " \"timestamp\": pd.Timestamp.now().isoformat()\n", + " }\n", + "\n", + "\n", + "# Add these methods to our ComprehensiveDocumentProcessor class\n", + "ComprehensiveDocumentProcessor._handle_exports = _handle_exports\n", + "ComprehensiveDocumentProcessor._export_figures = _export_figures \n", + "ComprehensiveDocumentProcessor._export_tables = _export_tables\n", + "ComprehensiveDocumentProcessor._describe_image_with_llm = _describe_image_with_llm\n", + "\n", + "print(\"✅ Export functionality added to ComprehensiveDocumentProcessor!\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "0ee6cb84", + "metadata": {}, + "source": [ + "## 5. 
🎯 Complete Demo: Processing a Document with Full Features\n", + "\n", + "Now let's demonstrate the complete functionality by processing a document with all features enabled, just like the command-line example!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5878b27c", + "metadata": {}, + "outputs": [], + "source": [ + "# lets first download the PDF locally for processing\n", + "import requests\n", + "\n", + "url = \"https://arxiv.org/pdf/2501.17887\"\n", + "document_path = \"docling_paper.pdf\"\n", + "response = requests.get(url)\n", + "with open(document_path, \"wb\") as f:\n", + " f.write(response.content)\n", + "print(f\"✅ Downloaded sample PDF: {document_path}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3bf24b6a", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize the comprehensive processor with full features\n", + "processor = ComprehensiveDocumentProcessor(config)\n", + "\n", + "# Check for available example files\n", + "print(f\" Using document path: {document_path}\")\n", + "\n", + "\n", + "print(f\"\\nConfiguration Summary:\")\n", + "print(\"=\"*60)\n", + "print(config.summary())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "185ff64a", + "metadata": {}, + "outputs": [], + "source": [ + "# Process the document with full features\n", + "print(\"🚀 Starting comprehensive document processing...\")\n", + "print(\"This may take a few minutes depending on document size and features enabled.\")\n", + "\n", + "result = processor.process_document(document_path)\n", + "\n", + "# Analyze and display results\n", + "if \"error\" in result:\n", + " print(f\"❌ Error processing document: {result['error']}\")\n", + " print(\"💡 Make sure the document path is correct and you have the required API keys if using image descriptions.\")\n", + "else:\n", + " print(f\"\\n🎉 Document processed successfully!\")\n", + " \n", + " # Display comprehensive summary\n", + " stats = result[\"document_stats\"]\n", + " metadata = result[\"metadata\"]\n", + " exports = result.get(\"exports\", {})\n", + " \n", + " print(f\"\\n📊 PROCESSING RESULTS:\")\n", + " print(\"=\"*60)\n", + " print(f\"📄 Document: {metadata['file_name']}\")\n", + " print(f\"📁 Output folder: {metadata['output_folder']}\")\n", + " print(f\"📏 File size: {metadata['file_size_bytes']:,} bytes\")\n", + " if \"page_count\" in metadata:\n", + " print(f\"📖 Pages: {metadata['page_count']}\")\n", + " \n", + " print(f\"\\n🧩 CONTENT ANALYSIS:\")\n", + " print(f\" • Total characters: {stats['total_characters']:,}\")\n", + " print(f\" • Total words: {stats['total_words']:,}\")\n", + " print(f\" • Total chunks: {stats['total_chunks']}\")\n", + " print(f\" • Total tables: {stats['total_tables']}\")\n", + " print(f\" • Total headers: {stats['total_headers']}\")\n", + " \n", + " if stats['total_chunks'] > 0:\n", + " avg_chars_per_chunk = stats['total_characters'] / stats['total_chunks']\n", + " avg_words_per_chunk = stats['total_words'] / stats['total_chunks']\n", + " print(f\" • Avg chars/chunk: {avg_chars_per_chunk:.0f}\")\n", + " print(f\" • Avg words/chunk: {avg_words_per_chunk:.0f}\")\n", + " \n", + " print(f\"\\n🎨 EXPORTS SUMMARY:\")\n", + " if exports:\n", + " if exports.get('figures_exported', 0) > 0:\n", + " print(f\" • Figures exported: {exports['figures_exported']}\")\n", + " if exports.get('tables_exported', 0) > 0:\n", + " print(f\" • Tables exported: {exports['tables_exported']}\")\n", + " if exports.get('pages_exported', 0) > 0:\n", + " print(f\" • Pages 
exported: {exports['pages_exported']}\")\n", + " if exports.get('figure_files'):\n", + " print(f\" • Export files created: {len(exports['figure_files'])}\")\n", + " if exports.get('image_descriptions'):\n", + " total_cost = sum(desc.get(\"cost\", 0) for desc in exports['image_descriptions'])\n", + " total_tokens = sum(desc.get(\"input_tokens\", 0) + desc.get(\"output_tokens\", 0) for desc in exports['image_descriptions'])\n", + " print(f\" • LLM descriptions: {len(exports['image_descriptions'])}\")\n", + " print(f\" • Total LLM cost: ${total_cost:.4f}\")\n", + " print(f\" • Total LLM tokens: {total_tokens:,}\")\n", + " else:\n", + " print(\" • No exports configured\")\n", + " \n", + " if \"output_structure\" in result:\n", + " print(f\"\\n📁 STRUCTURED OUTPUT CREATED:\")\n", + " structure = result[\"output_structure\"]\n", + " for folder_type, folder_path in structure.items():\n", + " folder_name = folder_type.replace(\"_folder\", \"\")\n", + " print(f\" • {folder_name}: {folder_path}\")" + ] + }, + { + "cell_type": "markdown", + "id": "3d838f82", + "metadata": {}, + "source": [ + "### ✅ What We've Done\n", + "\n", + "- **🔧 UV Package Management**: Modern Python dependency management\n", + "- **⚙️ Advanced Pipeline Configuration**: All document processing options\n", + "- **📊 Structured Output System**: Organized, production-ready folder hierarchies \n", + "- **🖼️ Figure & Table Exports**: Multi-format visual content extraction\n", + "- **🤖 LLM Image Descriptions**: AI-powered multimodal processing capabilities\n", + "- **🧩 Hybrid Chunking**: Smart, context-aware chunking for optimal RAG\n", + "- **📈 Comprehensive Analysis**: Quality metrics and visualization tools\n", + "\n", + "\n", + "- **Docling Documentation**: [docling-project.github.io](https://docling-project.github.io/docling/)\n", + "- **UV Package Manager**: [docs.astral.sh/uv](https://docs.astral.sh/uv/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/workshops/2025_10_17_PyConES/notebooks/Lab3_RAG.ipynb b/workshops/2025_10_17_PyConES/notebooks/Lab3_RAG.ipynb new file mode 100644 index 0000000..7ffe8d3 --- /dev/null +++ b/workshops/2025_10_17_PyConES/notebooks/Lab3_RAG.ipynb @@ -0,0 +1,2056 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "U48hO1_V_JRG", + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "# Construyendo un sistema RAG multimodal con Docling\n", + "\n", + "*Usando IBM Granite vision, embeddings de texto y modelos de IA generativa*\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oYjGpqcgXR9W" + }, + "source": [ + "## Lab 3: Del Papel al Conocimiento para IA Transparente - El Viaje Completo\n", + "Bienvenido al laboratorio final de nuestra workshop de Docling! 
Has recorrido un largo camino:\n", + "\n", + "- **Lab 1**: Aprendiste a convertir documentos preservando la estructura\n", + "- **Lab 2**: Dominaste estrategias inteligentes de chunking\n", + "- **Lab 3**: Ahora, construiremos un sistema RAG completo y listo para producción con una característica revolucionaria: el *visual grounding*\n", + "\n", + "\n", + "Este laboratorio representa la culminación de todo lo que has aprendido, mostrando cómo Docling permite no solo el procesamiento de documentos, sino sistemas de IA verdaderamente transparentes.\n", + "\n", + "## ¿Por qué este laboratorio es importante?\n", + "\n", + "Los sistemas RAG tradicionales tienen un problema de confianza. Cuando una IA proporciona información, los usuarios a menudo se preguntan:\n", + "\n", + "- \"¿De dónde proviene esta información?\" 🔍\n", + "- \"¿Puedo verificar que esto es preciso?\" ✅\n", + "- \"¿Está la IA alucinando o utilizando datos reales?\" 🤔\n", + "\n", + "**Visual Grounding** resuelve este problema mostrando a los usuarios exactamente de dónde se recuperó la información en los documentos originales. Esto no es solo una característica *agradable* de un sistema de IA, es esencial para casos de uso donde la precisión y la verificabilidad son cruciales:\n", + "\n", + "- **Salud**: 🏥 Verificar fuentes de información médica\n", + "- **Legal**: ⚖️ Rastrear citas a ubicaciones exactas en documentos\n", + "- **Financiero**: 💰 Auditar ideas financieras generadas por IA\n", + "- **Investigación**: 🔬 Validar afirmaciones científicas\n", + "- **Empresarial**: 🏢 Construir sistemas de IA internos confiables\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5HIV_JkOXw9q" + }, + "source": [ + "## ¿Qué hace especial a este lab?\n", + "\n", + "Estamos construyendo un sistema RAG multimodal con *visual grounding* que:\n", + "\n", + "1. **Procesa múltiples tipos de datos**: Texto, tablas e imágenes de tus documentos\n", + "2. **Muestra fuentes exactas**: Resalta la ubicación precisa de la información recuperada\n", + "3. **Comprende imágenes**: Utiliza modelos de visión por IA para comprender el contenido visual\n", + "4. **Mantiene la transparencia**: Cada respuesta puede ser verificada visualmente\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gAmR3BxvZ2Y0" + }, + "source": [ + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cJl_fmofYBmH" + }, + "source": [ + "## Entendiendo RAG Multimodal con Visual Grounding\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ghh3uCNMCFwU" + }, + "source": [ + "### ¿Qué es RAG Multimodal?\n", + "\n", + "[Retrieval Augmented Generation (RAG)](https://www.ibm.com/think/topics/retrieval-augmented-generation) es una técnica utilizada con LLMs para conectar el modelo con una base de conocimiento externa sin necesidad de realizar [fine-tuning](https://www.ibm.com/think/topics/rag-vs-fine-tuning).\n", + "\n", + "Los sistemas RAG tradicionales están limitados a casos de uso basados en texto. Sin embargo, los documentos reales contienen:\n", + "- **Texto**: Párrafos, listas, encabezados\n", + "- **Tablas**: Datos estructurados, información financiera\n", + "- **Imágenes**: Gráficos, diagramas, fotos, ilustraciones\n", + "\n", + "El RAG Multimodal puede utilizar [LLMs multimodales](https://www.ibm.com/think/topics/multimodal-ai) (MLLM) para procesar información de múltiples tipos de datos que se incluyen como parte de la base de conocimiento externa utilizada en RAG. 
Los datos multimodales pueden incluir texto, imágenes, audio, video u otras formas.\n", + "\n", + "\n", + "Puedes leer más sobre RAG Multimodal en este [artículo de IBM](https://www.ibm.com/think/topics/multimodal-rag)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gpxgNPDCZEDW" + }, + "source": [ + "### Visual Grounding: La Capa de Transparencia\n", + "\n", + "El grounding agrega una capa crucial de transparencia a los sistemas RAG. Cuando el sistema recupera información para responder a una consulta, no solo devuelve texto, sino que también muestra exactamente de dónde proviene esa información en el documento original mediante:\n", + "\n", + "- Dibujar cuadros delimitadores en las páginas del documento\n", + "- Resaltar regiones específicas\n", + "- Etiquetar tipos de contenido (TEXTO, TABLA, IMAGEN)\n", + "- Usar diferentes colores para múltiples fuentes\n", + "\n", + "En este notebook, utilizarás los modelos IBM Granite, capaces de procesar diferentes modalidades, mejorados con las capacidades de grounding visual de Docling para crear un sistema de IA transparente y verificable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-_S93MsAZz1S" + }, + "source": [ + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_eDALN1A9LF8" + }, + "source": [ + "## Objetivos\n", + "\n", + "En este laboratorio, aprenderás a:\n", + "\n", + "1. **Configurar Docling para el Grounding Visual**: Configurar el procesamiento de documentos para mantener referencias visuales\n", + "2. **Procesar Contenido Multimodal**: Manejar texto, tablas e imágenes con los metadatos adecuados\n", + "3. **Aprovechar los Modelos de Visión AI**: Utilizar los modelos de visión IBM Granite para entendimiento de imágenes\n", + "4. **Construir una Base de Datos Vectorial**: Almacenar embeddings con metadatos para visual grounding\n", + "5. **Implementar Atribución Visual**: Mostrar a los usuarios exactamente de dónde proviene la información\n", + "6. **Crear un Pipeline RAG Completo**: Combinar todos los componentes en un sistema listo para producción" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iSIhob4jZcAN" + }, + "source": [ + "### Tecnologías que usaremos\n", + "\n", + "Componentes clave:\n", + "\n", + "1. **[Docling](https://docling-project.github.io/docling/):** Un kit de herramientas de código abierto utilizado para analizar y convertir documentos.\n", + "2. **[LangChain](https://langchain.com)**: Para orquestar el pipeline RAG\n", + "3. **[IBM Granite Vision Models](https://www.ibm.com/granite/)**: Para entendimiento de contenido de imágenes\n", + "4. 
**Visual Grounding**: La capacidad única de Docling para atribución de fuentes\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IQlsFgvGZ9hH" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vooxv7ltEZBf" + }, + "source": [ + "## Prerequisitos\n", + "\n", + "Antes de comenzar, asegúrate de tener:\n", + "- Completados los Laboratorios 1 y 2 (o conocimiento equivalente de Docling)\n", + "- Python 3.10, 3.11 o 3.12 instalado\n", + "- Comprensión básica de embeddings y bases de datos vectoriales\n", + "- Familiaridad con los conceptos de los laboratorios anteriores" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UEoM938B_JRH" + }, + "outputs": [], + "source": [ + "import sys\n", + "assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), \"Use Python 3.10, 3.11, or 3.12 to run this notebook.\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instalación de dependencias" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "BfMWUUSs_JRI", + "jupyter": { + "outputs_hidden": false + }, + "outputId": "ae0cf79a-d586-4836-dbdc-97d18b217a16", + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "! echo \"::group::Install Dependencies\"\n", + "%pip install uv\n", + "! uv pip install \"git+https://github.com/ibm-granite-community/utils.git\" \\\n", + " transformers \\\n", + " pillow \\\n", + " langchain_community \\\n", + " 'langchain_huggingface[full]' \\\n", + " langchain_milvus 'pymilvus[milvus_lite]'\\\n", + " docling \\\n", + " replicate \\\n", + " matplotlib\n", + "! echo \"::endgroup::\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4gu-Oeay_JRJ" + }, + "source": [ + "## Importar las librerías necesarias" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "G-ke7lF4CFwW" + }, + "outputs": [], + "source": [ + "# To see detailed information about the document processing and visual grounding operations, we'll configure INFO log level.\n", + "# NOTE: It is okay to skip running this cell if you prefer less verbose output.\n", + "import logging\n", + "\n", + "logging.basicConfig(level=logging.INFO)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "rJwoqaBPHySg" + }, + "outputs": [], + "source": [ + "import json\n", + "import base64\n", + "import io\n", + "import itertools\n", + "import tempfile\n", + "from pathlib import Path\n", + "from tempfile import mkdtemp\n", + "from collections import Counter\n", + "\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import PIL.Image\n", + "import PIL.ImageOps\n", + "from PIL import ImageDraw\n", + "from IPython.display import display\n", + "\n", + "# Docling imports for document processing and visual grounding\n", + "from docling.document_converter import DocumentConverter, PdfFormatOption\n", + "from docling.datamodel.base_models import InputFormat\n", + "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", + "from docling.datamodel.document import DoclingDocument\n", + "from docling.chunking import DocMeta\n", + "from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\n", + "from docling_core.types.doc.document import TableItem, RefItem\n", + "from docling_core.types.doc.labels import DocItemLabel\n", + "\n", + "# LangChain imports for RAG 
pipeline\n", + "from langchain_huggingface import HuggingFaceEmbeddings\n", + "from langchain_community.llms import Replicate\n", + "from langchain_core.documents import Document\n", + "from langchain_core.vectorstores import VectorStore\n", + "from langchain_milvus import Milvus\n", + "from langchain.prompts import PromptTemplate\n", + "from langchain.chains.retrieval import create_retrieval_chain\n", + "from langchain.chains.combine_documents import create_stuff_documents_chain\n", + "\n", + "# Model imports\n", + "from transformers import AutoTokenizer, AutoProcessor\n", + "from ibm_granite_community.notebook_utils import get_env_var, escape_f_string" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yodQeEWVa3Xa" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyixsnLzCFwW" + }, + "source": [ + "### Selección de Modelos" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AB_Hry0ebAtE" + }, + "source": [ + "### Los Tres Pilares de RAG Multimodal\n", + "\n", + "Para un sistema completo de RAG multimodal con visual grounding, necesitamos tres tipos de modelos, cada uno cumpliendo un propósito crucial:\n", + "\n", + "1. **Modelo de Embeddings**: Convierte texto en representaciones vectoriales\n", + " - Permite la búsqueda semántica (\"encontrar contenido similar en significado\")\n", + " - Debe manejar texto de fragmentos, tablas y descripciones de imágenes\n", + "\n", + "2. **Modelo de Visión**: Entiende y describe contenido visual\n", + " - Procesa imágenes encontradas en documentos\n", + " - Genera descripciones textuales para su recuperación\n", + "\n", + "3. **Modelo de Lenguaje**: Genera respuestas finales\n", + " - Sintetiza la información recuperada\n", + " - Produce respuestas coherentes y precisas" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NEIpVe6yIAN6" + }, + "source": [ + "### Modelo de Embeddings\n", + "\n", + "Usaremos un [modelo de embeddings Granite](https://huggingface.co/collections/ibm-granite/granite-embedding-models-6750b30c802c1926a35550bb) para generar vectores de embeddings de texto.\n", + "\n", + "\n", + "- Optimizado para texto en múltiples idiomas\n", + "- Compacto (107 millones de parámetros) para un procesamiento rápido\n", + "- Excelente comprensión semántica\n", + "- Ventana de contexto de 512 tokens\n", + "\n", + "Si deseas usar un modelo diferente, puedes consultar [esta receta de modelos de embeddings de la comunidad de IBM Granite](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 528, + "referenced_widgets": [ + "30aed391581c47598facbe74763a71f3", + "bb9f7ac4cfce4dc1aa2cd4f9456c4a05", + "a8214a9c57ca467faa6b32127332d1b7", + "7ebbf8143a604b409e558131b83706a9", + "bd7b3990a2e5414a8bfdd5d58aa4c6c4", + "1d17691f611147058a744cf1b6d92ec5", + "005e089874c14d16a72d263559c6f11e", + "f519c63bf62b471eb4138a54774b25b2", + "46444911d17e43e28758b79d6fd1698c", + "105cf6ef3ba8456c8666a158d0f3075a", + "ded96aceadd4480b89374a99d67f56e2", + "ffa777f2454743a6ab66c35bbd1ac9dd", + "146abad9d66d4c2ca8f2f31ffe4bee64", + "1b9419b296104289a2f68d3bc2615ea2", + "f87a0f86f8ed41579edb87cb2936b980", + "ff1e65972bee45eeafb4070ca8064517", + "b9491b7c2a83410097591f434df30df1", + "268a66e6bdd94deaad362f7e6e860f87", + "0511ef3a989a4cd49fa4909918e53cf9", + "7ad3eeeb3980431a91f55b3d7bd97529", + "c768a35348e14b2cad0d9bc6ed047f41", + "3109510d96ca4385aca0a3d6b0b99e65", + "8d94a499aa02419da19ff30e9a4ccc8b", + "ff2b9ac983e54316858ea7f870d041a9", + "809e4f104bd04244a5bc2c1c65790497", + "de429ecb3f7c44978cf6bd9085259933", + "ea4efa50585b49e5a3242c64fb83a241", + "dc81b5fd026544ba975f4baa890c9ed2", + "1078c105925a437b93f26a8f04662b88", + "c82003732c1146d0b5b59c757f1859b8", + "78e5987e84d14efab9619dd6de9da651", + "45bdc919748f41ec946a78453533b9f1", + "63db65a7e1ba44a49953e564475774e1", + "0d13347b9e4848d5998b2419ed5765b4", + "83cb2acd36634bd8ab2e323b54cc8f43", + "5530b12b7d7146b7be524bf542509765", + "bef293aaada14720b396c2b2b464e5fe", + "be3cd11187f04f2ba79a29d6a93ad80b", + "701317fcc9e047678eca988cda1f0b19", + "67e6bc45803c4b069c1a054fe1ee28f0", + "f28fc775dcd34c66bf48c41f829dde38", + "c1bfcd0e354c4805a13d0b1fc111ee45", + "6399b41f5b9a4be8813893cd3224b1ce", + "69570c2887f04f3f8d67bbd975eaa79e", + "73c46fdee29e46068167061884060eb5", + "8e523d385eb743899c1086245f25f72a", + "c699c8d54bf845ddb0c4519a135dd862", + "fb534df5f7b34766b8a8e783c2008466", + "09563f36f1c8401c98acdb6176deb521", + "18bdb8d9c93a4251842c161ce7c15926", + "e22cbf4f9bcb4370abfebeb52272d15c", + "42057f222a1e462fb146991dae15b5fd", + "7e8a2b4d2d914a118ccfd66c460ce89c", + "bb496aa21cb841e5a368f12a7ef51209", + "f0b3502132724ec8a6c6345855b6a74d", + "c62bedacd48d4c70a1db6896bd3ead1b", + "6108fd6deb8c448fa37918088c9ac44a", + "d388ea4db4d14dbaad7fb7f2af893fdf", + "e5c720ed8ff44bf995dd22cc346364ca", + "49689529e77541fea13c6ae5b9bfb63b", + "ad8827ca671d4732a55a62bedd6533de", + "312ad4ea71454b55baab0c6552a4dac8", + "77ae0c3def7348d987d91a3aef9f8d12", + "2d856987a7994377b4482d4229681a5c", + "35f8a006f3c84b168282a927f065afc8", + "c66a1538bb0b44c6a617ba57189ef7b9", + "99f8b9c321f940d885df7cfa57d104ae", + "65278a69c9a24cddbbd1b20ab08abaeb", + "fe3acc6c2c19435891201245f1bf078c", + "65595dd5223c457e863e9ff049f6f90b", + "f34b67eba2ca41e4a3a8c08e6649256c", + "17361220c0594b9ca30f6500543db6e8", + "36a09ad6dd31454d8b8185a7a67e36fc", + "c5e095ef30f7444dbf98a3356c676234", + "a9116fd44acf4d3290d43f7643d372d1", + "72814321ea6440e1a6fe2a5f83236376", + "e3e72eb246594d3497e175ad017e6267", + "17b72d7de14b48c296350ed182da1abd", + "aabfc86e48924653b7b816f3653d2ca6", + "60ce3f06c5144ba3b8f13cb9813ad8c6", + "91e10ba5e5dd45c1a6e869829f657264", + "ea0ca547ee1b4d7895a16f525eae20d5", + "c029c57f8d2a4f89bbe759281c3ad16f", + "2d870e4a1d3146e49c71c07306b6a10d", + "0256b42e2d9e43658a52d9bb767fec3c", + "e05013c54d594ce28aa98b2ac2d72dd2", + "1ba73dfd141f43aa9ba337192ad39667", + "7049612009674999be46373e34af2245", + 
"927b891f7b7f4061b0ba2ba71de64dc7", + "0f963da83c9b4469afe820915871fc62", + "71e6963c6e6d4753bef3e6f036061558", + "50bd57c1bac040cb860f2aea40e2f64e", + "e0b9bfb59c8044f19087cef6b08f3d15", + "3381fa31190349e6bb996e1e333704f0", + "5ff928c8b9694b2997eb492519cb73ef", + "488135d46a0c4ac2af0feba79f1e2b12", + "9d5ca3199c2940e486c8e691abcfe370", + "50da0c70814f4623a599891fb022caa5", + "95c84535d1fc4223ae24694e2d8f932d", + "873b3db7c970448b887deab25c48d37f", + "08e964cf2358451e8b07678709e98ef2", + "35fca115556c425aa1f205294cd302b4", + "16c4dbd4c35a4c78b5254a245028c710", + "0cda12f4cb304c68ab52be19030ae4bd", + "81619090425b434bbee76003feec3baa", + "bd21dad115cf4e77937a7f2701cbcae5", + "72b986829a23454f9d0381e23e731e6d", + "9d36c4e7a66447c1bd93028168494b21", + "197ce4c0d60f43d786064011d0655d63", + "01eb40035b6544ddb790da91c4c49f5b", + "a4654dce525f428ea1826f551f6ecf96", + "1e544112cf7d44db8c3c0538be77bf1b", + "38764c10d2af43198d15d50241405bf6", + "20f39553605149899141309285c4d3ea", + "0116ae1b4a724e4eb3a2d890f6e2d916", + "dd1ec695d2be45ce957691f086b90c10", + "fe183c499c1e414ca6c1c41347a74594", + "19ca17543ff443e19f043478cd82657a", + "d8723555c1ce41a28a2dcfc24b8ed26d", + "02d6cc2e89d54f699475cb9f070900a7", + "12caabb1fa3349afa1b5ba246484534c" + ] + }, + "id": "mvztNZly_JRJ", + "outputId": "b65d35cf-4549-4e77-87f9-46ba5c3f6138" + }, + "outputs": [], + "source": [ + "embeddings_model_path = \"ibm-granite/granite-embedding-107m-multilingual\"\n", + "embeddings_model = HuggingFaceEmbeddings(\n", + " model_name=embeddings_model_path,\n", + ")\n", + "embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)\n", + "print(f\"Embeddings model loaded: {embeddings_model_path}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J9nTm3qICFwW" + }, + "source": [ + "### Cargar el Modelo de Visión Granite\n", + "\n", + "El modelo de visión nos ayudará a comprender las imágenes dentro de los documentos. 
Esto es crucial para un RAG verdaderamente multimodal, ya que muchos documentos contienen información visual importante.\n", + "\n", + "**¿Por qué usar un modelo alojado en la nube en lugar de uno local?**\n", + "- Procesamiento más rápido sin necesidad de GPU\n", + "- Rendimiento consistente\n", + "- Fácil de escalar\n", + "\n", + "**Nota**: Para configurar Replicate, consulta [Introducción a Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb).\n", + "\n", + "Para conectarte a un modelo en un proveedor diferente a Replicate, consulta la [receta de LLMs de Langchain de la comunidad de IBM Granite](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 377, + "referenced_widgets": [ + "a534e43df09f4bd184b5ec03735ba106", + "35df11e94a084fca84a79ff238dd41dc", + "fcc03f7726b0475b8c692f8102f300db", + "e38eb7bd68c443b49597a42bf99fddb6", + "9f0f75412d184759a4c866ce0c57ebca", + "8fe1a9a63b1f48878b3c46ee8aadcc79", + "fc671127aeab4397b217363b696a8d32", + "1ac06c72a57d450f959e4f676f3eb953", + "eda16fcae5774b6ab97db390f2cf6212", + "0358db24b603458c94bfbaa265bcd699", + "8ea2420779ae49938724db859bdfc218", + "f15ae056261348698338ef17784b267c", + "c89e1c71be9a4f9d9c520a9e655de69d", + "021d9eca558d4c52a7d5719fa595019a", + "b44231c04c5a42808cc49bfef5cb5761", + "afc6940a831a4712a1c6ec3bdcdeed7c", + "6a2ef5dbc966469c9e68bde49fa7b912", + "03d324aaa6734a058bd2ae8ce3fcc198", + "85874d0df7ba4223b2fc49a150ccf235", + "018db3282d104bfdbad8e79d568e6452", + "1656c7ac85ab48949681a70f34706ec4", + "18e60b6f5e1c415dbf1726d5678ff69a", + "d7776f5095b1447a87af1a5ff743ffea", + "f58d57126aca42d7ac30eb5f0d6c5862", + "c6f57eb2d7144b9382cbe71e65bd78c9", + "f2c5e47cd6194c7bb48228ea6abdece2", + "033c76f5356f4e7290d0f7c9f5c7e976", + "ba4b9c4a029c445a9c2d7bc2f4b94c2f", + "279c4bf80ed34918a39c9eed77e20448", + "e848ffc816fd4ac0ae220a575d816257", + "5535857e2f964b80947edb78e2af619c", + "7784639f83124f9e8891390744b5f7e8", + "ccc8df11a0a748d6ae1098f718bf909a", + "936fbdcc7f8f4d7e9f4c60a91f1bce82", + "c0863ca49d4343dda2f1e522305ae51c", + "22793dec401e416ab7e4a1fff8731353", + "14005f1853094e009bc0792807b56132", + "4618c66a4e964ea2b6ebbb0b2f620abf", + "91e6cb46a47f429fa124ee4830905f20", + "eba20cb1f3094c56a743398a488117a9", + "3d97db037b7e445d854bcceac1478fea", + "df5f7eadbaa544a9a90a8325440559ba", + "5378f121bb09445d83fdaf58d7260b28", + "7affa0c2c9694c5f8e117ce1820ff07d", + "8b051518b298442ea0e50ae4de00b607", + "99776d2b3526490da7dd7f91aace918d", + "3f97c371ec024e308e4e4bf9deac89a1", + "c11039a4c7644d51bf815a8c6a147b62", + "011f08213068401dabee1bef4f21cd30", + "23589f5c9174456e90a91ad04ab3190b", + "9980cf7d59434126b5aa29c724b924cc", + "616ddd6f314d47c3941762d2423e53a9", + "d6c46e36441a4b8c8cedae9379528e49", + "033085b9dc7b4a229f10560ec13ed5e4", + "197584bb2ee748ffae5a12d773c74c59", + "4158830094b849bda48c0f3383762906", + "5ce4d6659b09413c9d1c023c6c02d0a0", + "c5e2a132b2c54a1aaef37d05b6c02d24", + "a512326a7ce64f51b7acccde449cf211", + "e5d42ce5909f4bad8988074b3f7e095b", + "df483317ae1a491d8a6823fd0c542371", + "d86b3c886e094fecbebac6d06a60c83e", + "689f0d3ec2fe4a2cb4a906c4a8727bb3", + "cb9eb27155bc45b3a7581d41a8fbc924", + "06610e0f196146a9bf3f6d5bd542b3c4", + "c704fb332dcd4f558684ff8dbbcaab5d", + "9f299a085bdc49cf9928a8b727bf745f", + 
"bfe7ee9615354df2b55844ac87e53347", + "d7b4dfe4d3f34302986549791633a5a4", + "9fc1888a5f054a7ea1773f343519234b", + "328a38ed6f774e229cf94220b5e87d26", + "581df90d93a24b3c85f73147e84a04db", + "7c0dc1f65d8f477ebc9984e9014bbc7f", + "00068a3f007e48bcabe426b077055012", + "b6bb194f067d4e8bb3eb74ee34c617e5", + "f36158e5ffe8438a98207c333476e885", + "c89195399f53453caa59964dfc8fb311", + "20723c36ae014a79aae19124387ce7f4", + "208333b9ad43414db26be168e88a9764", + "067dfbebbd6b493ebe8e56cc68efe016", + "db953222e9514659bc18e4e9b15d756b", + "7fb731dd508d47f5a7abb5eb16e790d7", + "58ee2cc2bd084374a294e8d3edc0833e", + "df2261a91c354e4aa3133b1eb19e48d9", + "80b2b32f78314bc89d549242a1f24a26", + "aa8b3d7be6604fc58fbb55e2016a4f24", + "896a033ae3894e8990cb626883ad7459", + "3033414c0baf4d03857520b79de534fc", + "221c2936753f4954b81fa29d167303c9", + "ea3a35bc32524806957cc6ef391f035e", + "b6cde6b309bf4c79afa1e7250e737ed7", + "79705b356a834632aa5324b66daeef1d", + "e50b987dc41c4231aa8a5c9219fc49d5", + "6cacca41b9024c329d2bcba9e821d012", + "b14dd3351e7f4465a057e2038ff1108d", + "ea809b34f1d545fea87b12d16a991991", + "c21796d6756c4247bac05db4453d80eb", + "6a477ba04063486d87f937d97dac615e", + "80ec01e074b74fc5829a4be11cc64887" + ] + }, + "id": "-5272bCOCFwW", + "outputId": "d6ad7a1f-2988-4acc-8622-710a694d5702" + }, + "outputs": [], + "source": [ + "vision_model_path = \"ibm-granite/granite-vision-3.3-2b\"\n", + "vision_model = Replicate(\n", + " model=vision_model_path,\n", + " replicate_api_token=get_env_var(\"REPLICATE_API_TOKEN\"),\n", + " model_kwargs={\n", + " \"max_tokens\": embeddings_tokenizer.max_len_single_sentence,\n", + " \"min_tokens\": 100,\n", + " \"temperature\": 0.01 # low temperature for reproduceability\n", + " },\n", + ")\n", + "vision_processor = AutoProcessor.from_pretrained(vision_model_path)\n", + "print(f\"Vision model loaded: {vision_model_path}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ma8eWR10_JRJ" + }, + "source": [ + "### Cargar el Modelo de Lenguaje Granite\n", + "\n", + "Finalmente, nuestro modelo de lenguaje generará respuestas basadas en el contexto recuperado.\n", + "\n", + "**¿Por qué usar este modelo?**:\n", + "- 8B parámetros: Buen equilibrio entre calidad y velocidad\n", + "- Ajustado a instrucciones: Sigue nuestros prompts con precisión\n", + "- Familia Granite: Código abierto y con licencia Apache 2.0" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 226, + "referenced_widgets": [ + "44c6e3c381944af9ba6850e66e066f1f", + "10f070266fff46509ca258b61dae89e5", + "f0e36bc66e794361aee399c9c4f2d683", + "d03e6698b9d74f00af51217c993fa765", + "a5bbf421325644fdbc6edf12f8a034d9", + "64f672136d8c4d4abc184e6fa973b61e", + "0c838d83f14b4f37bbcf83e7f5401049", + "8f32c79df5d942129f3bdbc3342d8a05", + "1eaa73c964af482b8a6852ddcdbc4830", + "f67521ed26cd46e982675813d0fa5167", + "b12c8c3c552e451693c1063a298e7a6d", + "99b1488e5d094dba9e72c7bc2ca21427", + "cbb843c832fd4527a107089ca5f2fad3", + "07ead5501f5b4c358a2b0baa34b08490", + "2c5b43a5f60544fc9b6f5b5e7114b1b8", + "40a81bb7bd1e475885afbd2e09ef1fc6", + "85a41be27c604244b35c582e32be1c30", + "0749c65dab7e4a46b2f3dce6ccde0987", + "06eca4f2a8cc4bc6a22f190a68ca53e5", + "d8b0ed5af0a24cb1ae9d530de10b91f5", + "e31a3cf13e1b46d9a304cfff8ba8cc4d", + "9b6f8669ac5a459f84ec622b831e6b94", + "3fe5bb2331e34a62a54d07dc6ccb181b", + "7ee6bf328f9c41779e0ec04a11128d23", + "f52d124f4eaf47209d9736876b079301", + "f8a832d2353844efb5605d9c833a38b5", + 
"bf4655af6bc54811b9c190a796088651", + "6560e20b62374bddac5c2ba17c463b34", + "7fa8f0281e57466691155c9372286419", + "6d1220067e184e0f9637f10962ea8aff", + "46fa9c2dbdc44d8ba27e551ff6d84b65", + "df8b0f131e9a44e2a925d34ca2ccecb6", + "e7c88f3fb527430caaf311309de069a5", + "76c3c4d8293d4953876edff0f1a75898", + "657e53b228354a5695f73b5a7f6e2335", + "e00786348c644dcdaf5026296145e703", + "2d2f6e60a1c8457593943784a91c20fa", + "dc6d9d6bb5d14358b3c1d15f93d6d398", + "7fc30294363d4e9d8f1d38297996f17d", + "ba50a4b2fefd4b069a8f02f169d4b753", + "515b69ef7ef640c1b2d0c078c4075c0d", + "543b1fa684d24d57b5756816a1948dd6", + "6c3f9316cc424841a7bf2d8a9e70df70", + "40062877163d4ec38610bc4cbf0f6fcb", + "8efb1ca61b53466abedc92f95ae16633", + "d87cb3db4f1749bc8f749d2bf7fc7d45", + "2a233805f4f2437595b06c70a4d702b1", + "c3f51829de3b448291e0cc66cff2312b", + "218a3779ad18493eb1d23a06c6774388", + "b69c6b35e51d496891dc2af739ee6526", + "198a035bd5e94ab7a7b8768ce5c85e0a", + "5bccd67539f94be19746bc2777eac659", + "82796002bad64fada9314d34bcf4fecd", + "c0c081a48abe4410b16e60173eaa4326", + "51cb65a268a74e2e9e8b5e357c3b022f", + "65d629651352431384420d3606289847", + "cce0890083934592b0f60e04fb9622ac", + "70690a4626714ab397e249ad2cfe286e", + "3351e52693674feaac570cf92c83bf55", + "a7c612ec254d4de59ea7a85022610b9a", + "3e1b1638c3f64d76a2b6990a7474385c", + "ddb41d7e7adf4d97906ec238bd63edc6", + "69257ae2351c4565a97320b554fc3c16", + "4939785c8a514734ba57f363860fd587", + "b2be447bf8844e4eaba87d100f1af07b", + "ac210c1417ac4f16bf4ec22a111bd6cc" + ] + }, + "id": "Ckyj7Zrh_JRK", + "outputId": "41c24db5-5e33-4afb-e358-fdb2efe3ff38" + }, + "outputs": [], + "source": [ + "model_path = \"ibm-granite/granite-3.3-8b-instruct\"\n", + "model = Replicate(\n", + " model=model_path,\n", + " replicate_api_token=get_env_var(\"REPLICATE_API_TOKEN\"),\n", + " model_kwargs={\n", + " \"max_tokens\": 1000,\n", + " \"min_tokens\": 100,\n", + " \"temperature\": 0.01 # low temperature for reproduceability\n", + "\n", + " },\n", + ")\n", + "tokenizer = AutoTokenizer.from_pretrained(model_path)\n", + "print(f\"Language model loaded: {model_path}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VVfNx7tCcB8l" + }, + "source": [ + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nviHG3n7_JRK", + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "## Procesamiento de documentos con soporte para visual grounding" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m0TSvvRFJTDJ" + }, + "source": [ + "El visual grounding requiere una configuración especial durante la conversión de documentos. A diferencia del procesamiento estándar, necesitamos:\n", + "\n", + "1. **Generar imágenes de página de alta calidad**: Para resaltar elementos de la página visualmente.\n", + "2. **Preservar la información de coordenadas**: Para saber dónde se encuentra el contenido.\n", + "3. **Mantener la estructura del documento**: Para una atribución de fuentes precisa.\n", + "4. 
**Almacenar los documentos adecuadamente**: Para su posterior recuperación y visualización.\n", + "\n", + "Vamos a configurar Docling con estos requisitos:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "u3jGNMkKJZ-R", + "outputId": "b85c1c47-5206-4701-fd54-20212154f568" + }, + "outputs": [], + "source": [ + "# Configure the document converter to support visual grounding\n", + "pdf_pipeline_options = PdfPipelineOptions(\n", + " do_ocr=False, # Set to True if your PDFs contain scanned images\n", + " generate_picture_images=True, # Extract images from documents\n", + " generate_page_images=True, # CRITICAL: Generate page images for visual grounding\n", + ")\n", + "\n", + "format_options = {\n", + " InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),\n", + "}\n", + "\n", + "converter = DocumentConverter(format_options=format_options)\n", + "print(\"Document converter configured with visual grounding support\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eZ7Guu7A_JRK" + }, + "source": [ + "### Creamos un almacén local de documentos para el visual grounding\n", + "\n", + "Almacenaremos documentos que mantendrán la estructura completa del documento necesaria para el visual grounding. Esto es esencial para resaltar las coordenadas de origen más adelante:" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YNGz_0gZ_JRK", + "outputId": "f3585ca6-1668-49ae-c7cf-252301e98d94" + }, + "outputs": [], + "source": [ + "# Create document store for visual grounding\n", + "doc_store = {}\n", + "doc_store_root = Path(mkdtemp()) # Temporary directory for document store\n", + "print(f\"Document store created at: {doc_store_root}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tSLYgc7JCFwX" + }, + "source": [ + "### Convertir documentos con seguimiento visual\n", + "\n", + "Ahora procesaremos documentos mientras preservamos toda la información necesaria para el visual grounding:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FLDMCxFbCFwX", + "outputId": "d2a325a8-04ef-4cf2-94ac-c8c6a9f372e0" + }, + "outputs": [], + "source": [ + "sources = [\n", + " \"https://arxiv.org/pdf/2501.17887\" # Docling paper\n", + " # \"https://arxiv.org/pdf/2206.01062\", # DocLayNet paper\n", + " # \"https://arxiv.org/pdf/2311.18481\", # DocQA\n", + " # Añade más documentos según sea necesario\n", + "]\n", + "\n", + "conversions = {}\n", + "\n", + "print(\"Iniciando la conversión de documentos con visual grounding...\")\n", + "for source in sources:\n", + " # Por cada fuente, convertimos el documento preservando las imágenes y guardándolas en nuestro almacén local de documentos\n", + " print(f\"\\n Procesando: {source}\")\n", + "\n", + " # Convert document\n", + " result = converter.convert(source=source)\n", + " docling_document = result.document\n", + " conversions[source] = docling_document\n", + "\n", + " # Save document to store for visual grounding\n", + " # The binary hash ensures unique identification\n", + " file_path = doc_store_root / f\"{docling_document.origin.binary_hash}.json\"\n", + " docling_document.save_as_json(file_path)\n", + " doc_store[docling_document.origin.binary_hash] = file_path\n", + "\n", + " print(\"Documento convertido y guardado en el almacén 
local.\")\n", + " print(f\" - Document ID: {docling_document.origin.binary_hash}\")\n", + " print(f\" - Paginas: {len(docling_document.pages)}\")\n", + " print(f\" - Tablas: {len(docling_document.tables)}\")\n", + " print(f\" - Imágenes: {len(docling_document.pictures)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0vX3XqSOKBos" + }, + "source": [ + "## Procesado del Contenido del Documento con Metadatos de Atribución\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2dvccMBeKC7c" + }, + "source": [ + "### La Importancia de los Metadatos para el Visual Grounding\n", + "\n", + "Para que el visual grounding funcione, cada fragmento de contenido debe mantener metadatos sobre su ubicación de origen. Esto incluye:\n", + "- **Números de página**: Qué página(s) contienen este contenido\n", + "- **Cajas delimitadoras**: Coordenadas exactas en la página\n", + "- **Referencias de documentos**: Enlaces de regreso al documento fuente\n", + "- **Tipo de contenido**: Si es texto, tabla o imagen\n", + "\n", + "Este metadato es lo que nos permite resaltar regiones de interés específicas en las páginas del documento más adelante." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mdT7gzIeKFyd" + }, + "source": [ + "## Procesamos los Chunks de Texto con metadatos de ubicación\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4zgnHW4UCFwX" + }, + "source": [ + "Ahora procesamos cualquier tabla en los documentos. Convertimos los datos de la tabla al formato markdown para pasarlos al modelo de lenguaje. Se crea una lista de documentos LangChain a partir de las representaciones en markdown de la tabla." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "VDa6yik4CFwX", + "outputId": "a8ff50a6-20d6-4a0f-c198-a7a6e4dc9f46" + }, + "outputs": [], + "source": [ + "from docling.chunking import DocMeta\n", + "\n", + "doc_id = 0\n", + "texts: list[Document] = []\n", + "\n", + "print(\"\\nProcessing text chunks with visual grounding metadata...\")\n", + "for source, docling_document in conversions.items():\n", + " chunker = HybridChunker(tokenizer=embeddings_tokenizer)\n", + "\n", + " for chunk in chunker.chunk(docling_document):\n", + " items = chunk.meta.doc_items\n", + "\n", + " # Skip single-item chunks that are tables (we'll process them separately)\n", + " if len(items) == 1 and isinstance(items[0], TableItem):\n", + " continue\n", + "\n", + " refs = \" \".join(map(lambda item: item.get_ref().cref, items))\n", + " text = chunk.text\n", + "\n", + " # Create document with enhanced metadata for visual grounding\n", + " document = Document( # langchain_core.documents.Document\n", + " page_content=text,\n", + " metadata={\n", + " \"doc_id\": (doc_id:=doc_id+1),\n", + " \"source\": source,\n", + " \"ref\": refs, # References for tracking specific document items\n", + " \"dl_meta\": chunk.meta.model_dump(), # CRITICAL: Store chunk metadata for visual grounding\n", + " \"origin_hash\": docling_document.origin.binary_hash # Link to stored document\n", + " },\n", + " )\n", + " texts.append(document)\n", + "\n", + "print(f\"Created {len(texts)} text chunks with visual grounding metadata\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SzoFPRBjCFwX" + }, + "source": [ + "### Procesamos las tablas con información espacial\n", + "\n", + "Las tablas requieren un manejo especial para preservar su estructura y ubicación:" + ] + }, + { + 
"cell_type": "code", + "execution_count": 32, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "GaiUN8nVCFwX", + "outputId": "bb25ef96-c755-4884-89bb-093e47524dd1" + }, + "outputs": [], + "source": [ + "doc_id = len(texts)\n", + "tables: list[Document] = []\n", + "\n", + "print(\"\\nProcessing tables...\")\n", + "for source, docling_document in conversions.items():\n", + " for table in docling_document.tables:\n", + " if table.label in [DocItemLabel.TABLE]:\n", + " ref = table.get_ref().cref\n", + " text = table.export_to_markdown(docling_document)\n", + "\n", + " # Extract provenance information for visual grounding\n", + " prov_data = []\n", + " if hasattr(table, 'prov') and table.prov:\n", + " for prov in table.prov:\n", + " # Get the page to access its height for coordinate conversion\n", + " if prov.page_no < len(docling_document.pages):\n", + " page = docling_document.pages[prov.page_no]\n", + " # Convert to top-left origin and normalize\n", + " bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n", + " bbox_norm = bbox.normalized(page.size)\n", + "\n", + " prov_data.append({\n", + " \"page_no\": prov.page_no,\n", + " \"bbox\": {\n", + " \"l\": bbox_norm.l, # Use normalized coordinates\n", + " \"t\": bbox_norm.t,\n", + " \"r\": bbox_norm.r,\n", + " \"b\": bbox_norm.b\n", + " }\n", + " })\n", + "\n", + " document = Document(\n", + " page_content=text,\n", + " metadata={\n", + " \"doc_id\": (doc_id:=doc_id+1),\n", + " \"source\": source,\n", + " \"ref\": ref,\n", + " \"origin_hash\": docling_document.origin.binary_hash,\n", + " \"item_type\": \"table\", # Mark as table\n", + " \"prov_data\": prov_data # Store provenance as simple data\n", + " },\n", + " )\n", + " tables.append(document)\n", + "\n", + "print(f\"Created {len(tables)} table documents\")" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [], + "source": [ + "print(tables[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GWHbpeEuCFwX" + }, + "source": [ + "### Procesamos las imágenes con entendimiento visual\n", + "\n", + "Para un verdadero RAG multimodal, necesitamos entender el contenido de las imágenes. Usaremos el modelo de visión Granite para generar descripciones.\n", + "\n", + "**NOTA**: El procesamiento de imágenes puede llevar tiempo dependiendo de la cantidad de imágenes y del servicio del modelo de visión. Cada imagen será analizada individualmente." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "htIYVVjHPKSX", + "outputId": "251f0d3e-132c-43b6-eeae-e87c6bd87a03" + }, + "outputs": [], + "source": [ + "import replicate\n", + "import time\n", + "\n", + "def encode_image(image: PIL.Image.Image, format: str = \"png\") -> str:\n", + " \"\"\"Encode image to base64 for vision model processing\"\"\"\n", + " image = PIL.ImageOps.exif_transpose(image) or image\n", + " image = image.convert(\"RGB\")\n", + "\n", + " buffer = io.BytesIO()\n", + " image.save(buffer, format)\n", + " encoding = base64.b64encode(buffer.getvalue()).decode(\"utf-8\")\n", + " uri = f\"data:image/{format};base64,{encoding}\"\n", + " return uri\n", + "\n", + "# Configuración del prompt de visión - siéntete libre de experimentar con esto!\n", + "image_prompt = \"Give a detailed description of what is depicted in the image\"\n", + "conversation = [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\"type\": \"image\"},\n", + " {\"type\": \"text\", \"text\": image_prompt},\n", + " ],\n", + " },\n", + "]\n", + "vision_prompt = vision_processor.apply_chat_template(\n", + " conversation=conversation,\n", + " add_generation_prompt=True,\n", + ")\n", + "\n", + "pictures: list[Document] = []\n", + "doc_id = len(texts) + len(tables)\n", + "num_image_descriptions = 0\n", + "for source, docling_document in conversions.items():\n", + " num_pictures = len(docling_document.pictures)\n", + " for i, picture in enumerate(docling_document.pictures):\n", + " ref = picture.get_ref().cref\n", + " print(f\" Processing image: {ref} ({i+1}/{num_pictures})\")\n", + "\n", + " image = picture.get_image(docling_document)\n", + " if image:\n", + " num_image_descriptions += 1\n", + " # Generate image description using vision model\n", + " # text = vision_model.invoke(vision_prompt, image=encode_image(image))\n", + " resp = replicate.run(\n", + " \"ibm-granite/granite-vision-3.3-2b\",\n", + " input={\n", + " \"image\": encode_image(image),\n", + " \"prompt\": image_prompt,\n", + " \"max_tokens\": embeddings_tokenizer.max_len_single_sentence,\n", + " \"min_tokens\": 100\n", + " }\n", + " )\n", + " time.sleep(5) # Small delay to avoid rate limiting\n", + " text = ''.join(resp)\n", + " # Extract provenance information for visual grounding\n", + " prov_data = []\n", + " if hasattr(picture, 'prov') and picture.prov:\n", + " for prov in picture.prov:\n", + " # Get the page to access its height for coordinate conversion\n", + " if prov.page_no < len(docling_document.pages):\n", + " page = docling_document.pages[prov.page_no]\n", + " # Convert to top-left origin and normalize\n", + " bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n", + " bbox_norm = bbox.normalized(page.size)\n", + "\n", + " prov_data.append({\n", + " \"page_no\": prov.page_no,\n", + " \"bbox\": {\n", + " \"l\": bbox_norm.l,\n", + " \"t\": bbox_norm.t,\n", + " \"r\": bbox_norm.r,\n", + " \"b\": bbox_norm.b\n", + " }\n", + " })\n", + "\n", + " document = Document(\n", + " page_content=text,\n", + " metadata={\n", + " \"doc_id\": (doc_id:=doc_id+1),\n", + " \"source\": source,\n", + " \"ref\": ref,\n", + " \"origin_hash\": docling_document.origin.binary_hash,\n", + " \"item_type\": \"picture\", # Mark as picture for special handling\n", + " \"prov_data\": prov_data # Store normalized provenance data\n", + " },\n", + " )\n", + " pictures.append(document)\n", + "\n", + "print(f\"Created {len(pictures)} image 
descriptions\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UV3N4DIVL1a_" + }, + "source": [ + "### Mostramos una muestra de los documentos procesados\n", + "\n", + "Examinemos lo que hemos creado para entender la naturaleza multimodal de nuestro sistema:" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "collapsed": true, + "id": "C_F_3_ZUL1yS", + "outputId": "5a54c832-6f9d-432f-8df1-1205791b0b8d" + }, + "outputs": [], + "source": [ + "import textwrap\n", + "\n", + "print(\"\\nSample processed documents:\")\n", + "print(\"=\" * 80)\n", + "\n", + "# Show sample text chunks\n", + "print(\"\\nTEXT CHUNK EXAMPLES:\")\n", + "print(\"-\" * 80)\n", + "for i, text_doc in enumerate(texts[:3]): # Show first 3 text chunks\n", + " print(f\"\\nText Chunk {i+1}:\")\n", + " print(f\" Document ID: {text_doc.metadata['doc_id']}\")\n", + " print(f\" Source: {text_doc.metadata['source'].split('/')[-1]}\") # Just filename\n", + " print(f\" Reference: {text_doc.metadata['ref']}\")\n", + " print(f\" Has visual grounding: {'dl_meta' in text_doc.metadata}\")\n", + " print(\" Content preview:\")\n", + " print(f\" {text_doc.page_content[:250]}...\")\n", + " if i < 2: # Add separator between examples except after the last one\n", + " print(\"-\" * 40)\n", + "\n", + "# Show sample tables\n", + "print(\"\\n\\nTABLE EXAMPLES:\")\n", + "print(\"-\" * 80)\n", + "if tables:\n", + " for i, table_doc in enumerate(tables[:3]): # Show first 3 tables\n", + " print(f\"\\nTable {i+1}:\")\n", + " print(f\" Document ID: {table_doc.metadata['doc_id']}\")\n", + " print(f\" Reference: {table_doc.metadata['ref']}\")\n", + " print(\" Content preview (Markdown format):\")\n", + " # Show first few lines of the table\n", + " table_lines = table_doc.page_content.split('\\n')[:8]\n", + " for line in table_lines:\n", + " print(f\" {line}\")\n", + "else:\n", + " print(\" No tables found in the document.\")\n", + "\n", + "# Show sample images with descriptions\n", + "print(\"\\n\\nIMAGE EXAMPLES WITH AI-GENERATED DESCRIPTIONS:\")\n", + "print(\"-\" * 80)\n", + "if pictures:\n", + " for i, pic_doc in enumerate(pictures[:3]): # Show first 3 images\n", + " print(f\"\\nImage {i+1}:\")\n", + " print(f\" Document ID: {pic_doc.metadata['doc_id']}\")\n", + " print(f\" Reference: {pic_doc.metadata['ref']}\")\n", + " print(\" AI-Generated Description:\")\n", + " # Wrap the description for better readability\n", + " wrapped_text = textwrap.fill(pic_doc.page_content, width=70, initial_indent=\" \", subsequent_indent=\" \")\n", + " print(wrapped_text)\n", + "\n", + " # Display the actual image\n", + " source = pic_doc.metadata['source']\n", + " ref = pic_doc.metadata['ref']\n", + " docling_document = conversions[source]\n", + " picture = RefItem(cref=ref).resolve(docling_document)\n", + " image = picture.get_image(docling_document)\n", + " if image:\n", + " print(\"\\n Original Image:\")\n", + " # Resize image for display if too large\n", + " display_image = image.copy()\n", + " max_width = 600\n", + " if display_image.width > max_width:\n", + " ratio = max_width / display_image.width\n", + " new_height = int(display_image.height * ratio)\n", + " display_image = display_image.resize((max_width, new_height), PIL.Image.Resampling.LANCZOS)\n", + " display(display_image)\n", + "\n", + " if i < min(2, len(pictures)-1):\n", + " print(\"-\" * 40)\n", + "else:\n", + " print(\" No images found in the document.\")" + ] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "W292HEjOOkC3" + }, + "source": [ + "### Comprendiendo la Atribución Visual de Fuentes\n", + "\n", + "El grounding visual es lo que diferencia a este sistema. Las funciones que definiremos en las siguientes celdas nos permiten:\n", + "1. **Localizar**: Encontrar la fuente exacta de cualquier información recuperada\n", + "2. **Resaltar**: Dibujar indicadores visuales en las páginas del documento\n", + "3. **Diferenciar**: Usar estilos distintos para texto, tablas e imágenes\n", + "4. **Verificar**: Permitir a los usuarios confirmar las respuestas de la IA contra los documentos fuente\n" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "id": "x0SFvP-tOtkE" + }, + "outputs": [], + "source": [ + "# Esta función visualiza de dónde proviene un fragmento de texto en el documento original.\n", + "def visualize_chunk_grounding(chunk, doc_store, highlight_color=\"blue\"):\n", + " \"\"\"\n", + " Visualize where a text chunk comes from in the original document.\n", + "\n", + " This function:\n", + " 1. Loads the original document from the store\n", + " 2. Finds the pages containing the chunk content\n", + " 3. Draws bounding boxes around the source regions\n", + " 4. Displays the highlighted pages\n", + "\n", + " Args:\n", + " chunk: LangChain Document with visual grounding metadata\n", + " doc_store: Dictionary mapping document hashes to file paths\n", + " highlight_color: Color for highlighting (blue, green, red, etc.)\n", + "\n", + " Returns:\n", + " Dictionary of page images with highlights\n", + " \"\"\"\n", + " # Get the origin hash\n", + " origin_hash = chunk.metadata.get(\"origin_hash\")\n", + " if not origin_hash:\n", + " print(\"No origin hash found in metadata\")\n", + " return None\n", + "\n", + " # Load the full document from store\n", + " dl_doc = DoclingDocument.load_from_json(doc_store.get(origin_hash))\n", + "\n", + " print(f\"Visualizing source location for chunk {chunk.metadata.get('doc_id', 'Unknown')}\")\n", + "\n", + " # Handle different types of content\n", + " page_images = {}\n", + " item_type = chunk.metadata.get(\"item_type\", \"text\")\n", + "\n", + " if item_type in [\"picture\", \"table\"] and \"prov_data\" in chunk.metadata:\n", + " # Handle tables and pictures with simple provenance data\n", + " prov_data = chunk.metadata[\"prov_data\"]\n", + "\n", + " if not prov_data:\n", + " print(f\"No provenance data available for this {item_type}\")\n", + " return None\n", + "\n", + " for prov in prov_data:\n", + " page_no = prov[\"page_no\"]\n", + "\n", + " # Get page image\n", + " if page_no < len(dl_doc.pages):\n", + " page = dl_doc.pages[page_no]\n", + " if hasattr(page, 'image') and page.image:\n", + " if page_no not in page_images:\n", + " img = page.image.pil_image.copy()\n", + " page_images[page_no] = {\n", + " 'image': img,\n", + " 'page': page,\n", + " 'draw': ImageDraw.Draw(img)\n", + " }\n", + "\n", + " # Draw bounding box\n", + " draw = page_images[page_no]['draw']\n", + " bbox = prov[\"bbox\"]\n", + "\n", + " # Draw bounding box\n", + " draw = page_images[page_no]['draw']\n", + " bbox = prov[\"bbox\"]\n", + "\n", + " # The coordinates are already normalized and in top-left origin\n", + " # Just scale to image dimensions\n", + " img_width = page_images[page_no]['image'].width\n", + " img_height = page_images[page_no]['image'].height\n", + "\n", + " l = int(bbox[\"l\"] * img_width)\n", + " r = int(bbox[\"r\"] * img_width)\n", + " t = int(bbox[\"t\"] * img_height)\n", + " b = int(bbox[\"b\"] 
* img_height)\n", + "\n", + " # Ensure coordinates are valid (min/max) just in case\n", + " l, r = min(l, r), max(l, r)\n", + " t, b = min(t, b), max(t, b)\n", + "\n", + " # Clamp to image bounds\n", + " l = max(0, min(l, img_width - 1))\n", + " r = max(0, min(r, img_width - 1))\n", + " t = max(0, min(t, img_height - 1))\n", + " b = max(0, min(b, img_height - 1))\n", + "\n", + " # Draw highlight with different styles for different types\n", + " if item_type == \"picture\":\n", + " draw.rectangle([l, t, r, b], outline=highlight_color, width=4)\n", + " draw.text((l, t-20), \"IMAGE\", fill=highlight_color)\n", + " elif item_type == \"table\":\n", + " draw.rectangle([l, t, r, b], outline=highlight_color, width=3)\n", + " draw.text((l, t-20), \"TABLE\", fill=highlight_color)\n", + "\n", + " elif \"dl_meta\" in chunk.metadata:\n", + " # Handle text chunks with DocMeta\n", + " try:\n", + " meta = DocMeta.model_validate(chunk.metadata[\"dl_meta\"])\n", + "\n", + " # Process each item in the chunk to find source locations\n", + " for doc_item in meta.doc_items:\n", + " if hasattr(doc_item, 'prov') and doc_item.prov:\n", + " for prov in doc_item.prov:\n", + " page_no = prov.page_no\n", + "\n", + " # Get or create page image\n", + " if page_no not in page_images:\n", + " if page_no < len(dl_doc.pages):\n", + " page = dl_doc.pages[page_no]\n", + " if hasattr(page, 'image') and page.image:\n", + " img = page.image.pil_image.copy()\n", + " page_images[page_no] = {\n", + " 'image': img,\n", + " 'page': page,\n", + " 'draw': ImageDraw.Draw(img)\n", + " }\n", + "\n", + " # Draw bounding box on the page\n", + " if page_no in page_images:\n", + " page_data = page_images[page_no]\n", + " page = page_data['page']\n", + " draw = page_data['draw']\n", + "\n", + " # Convert coordinates to image space\n", + " bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n", + " bbox = bbox.normalized(page.size)\n", + "\n", + " # Scale to actual image dimensions\n", + " l = int(bbox.l * page_data['image'].width)\n", + " r = int(bbox.r * page_data['image'].width)\n", + " t = int(bbox.t * page_data['image'].height)\n", + " b = int(bbox.b * page_data['image'].height)\n", + "\n", + " # Draw highlight rectangle\n", + " draw.rectangle([l, t, r, b], outline=highlight_color, width=2)\n", + " except Exception as e:\n", + " print(f\"Error processing text chunk metadata: {e}\")\n", + " return None\n", + " else:\n", + " print(\"No visual grounding metadata available for this chunk\")\n", + " return None\n", + "\n", + " # Display highlighted pages\n", + " for page_no, page_data in sorted(page_images.items()):\n", + " plt.figure(figsize=(12, 16))\n", + " plt.imshow(page_data['image'])\n", + " plt.axis('off')\n", + "\n", + " # Add title indicating content type\n", + " if item_type == \"picture\":\n", + " title = \"Image Location\"\n", + " elif item_type == \"table\":\n", + " title = \"Table Location\"\n", + " else:\n", + " title = \"Text Location\"\n", + " plt.title(f'{title} - Page {page_no + 1}', fontsize=16)\n", + " plt.tight_layout()\n", + " plt.show()\n", + "\n", + " return page_images\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MuVVgC_YQaln" + }, + "source": [ + "## Popular la base de datos vectorial con embeddings y metadatos" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_wqnAw0Te1zi" + }, + "source": [ + "### Comprendiendo las Bases de Datos Vectoriales en RAG Multimodal\n", + "\n", + "Las bases de datos vectoriales son el motor de búsqueda de nuestro sistema RAG. 
Estas:\n", + "- Almacenan representaciones numéricas (embeddings) de nuestro contenido\n", + "- Permiten búsquedas por similitud semántica\n", + "- Mantienen todos los metadatos necesarios para el grounding visual\n", + "- Soportan una recuperación rápida a escala\n", + "\n", + "Para contenido multimodal, esto significa:\n", + "- Los fragmentos de texto se incrustan directamente\n", + "- El markdown de las tablas se incrusta para búsquedas estructurales\n", + "- Las descripciones de imágenes generadas por IA se incrustan para búsquedas visuales\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_0oC5arQbHi" + }, + "source": [ + "### Configuración de la Base de Datos Vectorial\n", + "\n", + "Usaremos Milvus, una base de datos vectorial de alto rendimiento. Milvus es una base de datos de código abierto diseñada para manejar grandes volúmenes de datos vectoriales, lo que la hace ideal para aplicaciones de IA y aprendizaje automático. Para más opciones de bases de datos vectoriales, consulta [esta receta de Almacenamiento Vectorial con Langchain](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YQIOysr3Qgg5", + "outputId": "0feefa9e-27df-446f-ae90-3e9f156eebd5" + }, + "outputs": [], + "source": [ + "# Create a temporary database file\n", + "db_file = tempfile.NamedTemporaryFile(prefix=\"vectorstore_\", suffix=\".db\", delete=False).name\n", + "print(f\"Vector database will be saved to: {db_file}\")\n", + "\n", + "# Initialize Milvus vector store\n", + "vector_db: VectorStore = Milvus(\n", + " embedding_function=embeddings_model,\n", + " connection_args={\"uri\": db_file},\n", + " auto_id=True,\n", + " enable_dynamic_field=True, # Allows flexible metadata storage\n", + " index_params={\"index_type\": \"AUTOINDEX\"}, # Automatic index optimization\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sy-12B1eQq_P" + }, + "source": [ + "### Añadimos Documentos a la Base de Datos Vectorial\n", + "\n", + "Ahora añadiremos todos nuestros documentos procesados (fragmentos de texto, tablas y descripciones de imágenes) a la base de datos vectorial:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JlQMqXOEQt2o", + "outputId": "0302a03a-c95a-4e46-f1e5-9c9320a7abda" + }, + "outputs": [], + "source": [ + "print(\"\\nAñadiendo documentos a la base de datos vectorial...\")\n", + "documents = list(itertools.chain(texts, tables, pictures))\n", + "ids = vector_db.add_documents(documents)\n", + "print(f\"Añadidos {len(ids)} documentos a la base de datos vectorial.\")\n", + "print(f\" - Fragmentos de texto: {len(texts)}\")\n", + "print(f\" - Tablas: {len(tables)}\")\n", + "print(f\" - Descripciones de imágenes: {len(pictures)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xavz2IieQyqI" + }, + "source": [ + "## Test de Recuperación con Atribución Visual" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YjuJtoCKfPAo" + }, + "source": [ + "### Probando la Recuperación con Atribución Visual\n", + "\n", + "Antes de construir todo el pipeline de RAG, probemos que nuestra recuperación y atribución visual funcionen correctamente. 
Esto ayuda a verificar:\n", + "- Se encuentra contenido basado en similitud semántica\n", + "- Se preserva la metadata de atribución visual\n", + "- Se manejan correctamente diferentes tipos de contenido" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8gLXwOObfSCR" + }, + "source": [ + "### Basic Retrieval Test\n" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yuc09LOtQ5IP", + "outputId": "c8d08282-c57a-442c-c154-1469271947d3" + }, + "outputs": [], + "source": [ + "# Test query\n", + "test_query = \"What is docling?\"\n", + "\n", + "print(f\"\\nTesting retrieval for query: '{test_query}'\")\n", + "print(\"=\" * 80)\n", + "\n", + "# Retrieve relevant documents\n", + "retrieved_docs = vector_db.as_retriever().invoke(test_query)\n", + "\n", + "# Display retrieved documents\n", + "for i, doc in enumerate(retrieved_docs):\n", + " print(f\"\\nRetrieved Document {i+1}:\")\n", + "\n", + " # Determine content type\n", + " item_type = doc.metadata.get('item_type', 'text')\n", + " if item_type == 'picture':\n", + " content_type = \"AI-Generated Image Description\"\n", + " elif item_type == 'table':\n", + " content_type = \"Table (Markdown)\"\n", + " else:\n", + " content_type = \"Text Chunk\"\n", + "\n", + " print(f\"Type: {content_type}\")\n", + " print(f\"Content preview: {doc.page_content[:200]}...\")\n", + " print(f\"Source: {doc.metadata['source'].split('/')[-1]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4iBcEjhVRBcG" + }, + "source": [ + "## Construcción del Pipeline RAG Completo" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ma2N5kKTRHWO" + }, + "source": [ + "\n", + "Ahora implementaremos el sistema RAG multimodal completo que:\n", + "1. Recupera contenido multimodal relevante\n", + "2. Muestra exactamente de dónde proviene cada pieza\n", + "3. Genera respuestas precisas y fundamentadas" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_60Mv9wgfmkm" + }, + "source": [ + "### Diferenciando Tipos de Contenido\n", + "\n", + "Nuestro sistema maneja tres tipos de contenido de manera distinta:\n", + "\n", + "1. **Fragmentos de Texto**: El resaltado estándar muestra pasajes de texto\n", + "2. **Tablas**: Bordes gruesos con etiquetas \"TABLE\" marcan datos estructurados\n", + "3. **Imágenes**: Bordes distintivos con etiquetas \"IMAGE\" muestran ubicaciones de imágenes\n", + "\n", + "Esta diferenciación visual ayuda a los usuarios a comprender rápidamente qué tipo de contenido contribuyó a la respuesta de la IA." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6PRpHSBcJehQ" + }, + "source": [ + "### Probando el Sistema RAG Multimodal con Visual Grounding" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "id": "wmQdDvcXRD39" + }, + "outputs": [], + "source": [ + "# Bringing It All Together\n", + "def rag_with_visual_grounding(question, vector_db, doc_store, model, tokenizer, top_k=5):\n", + " \"\"\"\n", + " Perform RAG with visual grounding of results.\n", + "\n", + " This function:\n", + " 1. Retrieves relevant chunks from the vector database\n", + " 2. Visualizes where each chunk comes from in the original document\n", + " 3. 
Generates a response using the retrieved context\n", + "\n", + " Args:\n", + " question: User's query\n", + " vector_db: Vector database with embedded documents\n", + " doc_store: Document store for visual grounding\n", + " model: Language model for response generation\n", + " tokenizer: Tokenizer for the language model\n", + " top_k: Number of chunks to retrieve\n", + "\n", + " Returns:\n", + " Tuple of (outputs, relevant_chunks)\n", + " \"\"\"\n", + " print(f\"\\nPregunta: {question}\")\n", + " print(\"=\" * 80)\n", + "\n", + " # Step 1: Retrieve relevant chunks\n", + " print(f\"\\nRecuperando los {top_k} fragmentos relevantes...\")\n", + " retriever = vector_db.as_retriever(search_kwargs={\"k\": top_k})\n", + " relevant_chunks = retriever.invoke(question)\n", + "\n", + " print(f\"Se encontraron {len(relevant_chunks)} fragmentos relevantes\")\n", + "\n", + " # Step 2: Visualize each chunk's source location\n", + " print(\"\\nVisualizando la ubicación de los fragmentos recuperados...\")\n", + "\n", + " for i, chunk in enumerate(relevant_chunks):\n", + " print(f\"\\n--- Resultado {i+1} ---\")\n", + "\n", + " # Determine content type\n", + " item_type = chunk.metadata.get('item_type', 'text')\n", + " if item_type == 'picture':\n", + " content_type = \"Descripción de Imagen Generada por IA\"\n", + " color = 'red'\n", + " elif item_type == 'table':\n", + " content_type = \"Tabla (Markdown)\"\n", + " color = 'green'\n", + " else:\n", + " content_type = \"Fragmento de Texto\"\n", + " color = 'blue'\n", + "\n", + " print(f\"Content type: {content_type}\")\n", + " print(f\"Text preview: {chunk.page_content[:200]}...\")\n", + " print(f\"Source: {chunk.metadata.get('source', 'Unknown').split('/')[-1]}\")\n", + "\n", + " # Show visual grounding if available\n", + " if \"dl_meta\" in chunk.metadata or \"prov_data\" in chunk.metadata:\n", + " visualize_chunk_grounding(\n", + " chunk,\n", + " doc_store,\n", + " highlight_color=color\n", + " )\n", + " else:\n", + " print(\" (No visual grounding available for this chunk)\")\n", + "\n", + " # Step 3: Create RAG pipeline for response generation\n", + " print(\"\\nGenerando respuesta con LLM...\")\n", + "\n", + " # Create Granite prompt template\n", + " prompt = tokenizer.apply_chat_template(\n", + " conversation=[{\n", + " \"role\": \"system\",\n", + " \"content\": \"Answer the question based on the provided context in a concise manner. 
Always answer in the same language as the question.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"{input}\",\n", + " }],\n", + " documents=[{\n", + " \"doc_id\": \"0\",\n", + " \"text\": \"{context}\",\n", + " }],\n", + " add_generation_prompt=True,\n", + " tokenize=False,\n", + " )\n", + "\n", + " prompt_template = PromptTemplate.from_template(\n", + " template=escape_f_string(prompt, \"input\", \"context\")\n", + " )\n", + "\n", + " # Document prompt template\n", + " document_prompt_template = PromptTemplate.from_template(template=\"\"\"\\\n", + "<|end_of_text|>\n", + "<|start_of_role|>document {{\"document_id\": \"{doc_id}\"}}<|end_of_role|>\n", + "{page_content}\"\"\")\n", + "\n", + " # Create chains\n", + " combine_docs_chain = create_stuff_documents_chain(\n", + " llm=model,\n", + " prompt=prompt_template,\n", + " document_prompt=document_prompt_template,\n", + " document_separator=\"\",\n", + " )\n", + "\n", + " rag_chain = create_retrieval_chain(\n", + " retriever=retriever,\n", + " combine_docs_chain=combine_docs_chain,\n", + " )\n", + "\n", + " # Generate response\n", + " outputs = rag_chain.invoke({\"input\": question})\n", + "\n", + " print(\"\\nRespuesta generada:\")\n", + " print(\"=\" * 80)\n", + " print(outputs['answer'])\n", + " print(\"=\" * 80)\n", + "\n", + " return outputs, relevant_chunks" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OQitCyY0RVPn" + }, + "source": [ + "## Demostración Final" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4d__u-q6RVnO" + }, + "source": [ + "\n", + "Vamos a ejecutar una consulta y ver el sistema completo en acción:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "ud0Oy_K4Ra7u", + "outputId": "f7f459a6-e44b-4878-c119-bc4f28d69d08" + }, + "outputs": [], + "source": [ + "main_query = \"Como funciona la pipeline de conversión de PDFs de Docling?\"\n", + "outputs, chunks = rag_with_visual_grounding(\n", + " main_query,\n", + " vector_db,\n", + " doc_store,\n", + " model,\n", + " tokenizer,\n", + " top_k=3\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dABGUdQ5RaX4" + }, + "source": [ + "# Resumen y Próximos Pasos\n", + "\n", + "### Lo que Has Logrado\n", + "\n", + "¡Felicidades! Has construido con éxito un sistema avanzado de RAG multimodal. Esto es lo que has aprendido:\n", + "\n", + "1. **Implementación de un RAG con *Visual Grounding***\n", + " - Configuraste Docling para preservar referencias visuales\n", + " - Mantuviste metadatos de coordenadas a lo largo del pipeline de procesamiento\n", + " - Creaste atribución visual para todos los tipos de contenido\n", + "\n", + "2. **Procesamiento de Documentos Multimodal**\n", + " - Manejo fluido de texto, tablas e imágenes\n", + " - Uso de chunking inteligente para optimizar la recuperación\n", + " - Uso de modelos de visión IA para comprensión de imágenes\n", + "\n", + "3. **Arquitectura Transparente de RAG**\n", + " - Como establecer confianza en tu sistema RAG mediante verificación visual\n", + " - Habilitación de atribución de fuentes para cada respuesta\n", + " - Creación de un resaltado diferenciado por cada tipo de contenido\n", + "\n", + "4. 
**Integración Lista para Producción**\n", + " - Combinación efectiva de múltiples modelos de IA\n", + " - Creación de un pipeline escalable de procesamiento de documentos\n", + "\n", + "### ¿Por qué el Grounding Visual lo Cambia Todo?\n", + "\n", + "Los sistemas RAG tradicionales son \"cajas negras\": los usuarios deben confiar ciegamente en la IA. Tu sistema:\n", + "\n", + "- **Muestra las fuentes**: Cada afirmación puede ser verificada visualmente\n", + "- **Construye confianza**: Los usuarios ven exactamente de dónde proviene la información\n", + "- **Habilita auditorías**: Perfecto para industrias reguladas\n", + "- **Reduce alucinaciones**: La verificación visual detecta errores\n", + "\n", + "### El Poder de la Comprensión Multimodal\n", + "\n", + "Al procesar texto, tablas e imágenes, tu sistema:\n", + "\n", + "- Captura información completa del documento\n", + "- Habilita consultas y respuestas más ricas\n", + "- Maneja la complejidad de documentos del mundo real\n", + "- Proporciona respuestas completas" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J3zahmhOg0Jd" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "phDxA2ivg0vn" + }, + "source": [ + "## Proximos Pasos:\n", + "\n", + "\n", + "1. **Experimenta con otros documentos**\n", + " - Prueba con documentos en español, francés o alemán\n", + " - Prueba documentos con diagramas técnicos y gráficos\n", + " - Procesa informes con contenido mixto\n", + "\n", + "2. **Personaliza la IA para ti**\n", + " - Utiliza los modelos de embeddings, visión y lenguaje que se adapten a tu flujo de trabajo común.\n", + " - Ajusta los prompts para mejorar la calidad de las descripciones de imágenes y respuestas del modelo de lenguaje.\n", + " \n", + " ```python\n", + " # Ejemplo: Prompts de imagen específicos a un dominio\n", + " medical_prompt = \"Describe esta imagen médica, señalando cualquier anomalía o característica clave\"\n", + " financial_prompt = \"Analiza este gráfico financiero, identificando tendencias y puntos de datos clave\"\n", + " ```\n", + "\n", + "3. **Optimiza el Rendimiento**\n", + " - Procesa documentos en batch: [Docling Batch Conversion](https://docling-project.github.io/docling/examples/batch_convert/)\n", + " - Usa aceleración por GPU con modelos de visión locales." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lVPQsnmFg-HB" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "db3UNKhQhI7Z" + }, + "source": [ + "## Recursos Adicionales\n", + "\n", + "### Documentación Oficial\n", + "- **[Documentación de Docling](https://github.com/docling-project/docling)**: Últimas características y actualizaciones\n", + "- **[Modelos IBM Granite](https://www.ibm.com/granite/)**: Tarjetas de modelo y capacidades\n", + "- **[Documentación de LangChain](https://python.langchain.com/)**: Patrones y mejores prácticas de RAG\n", + "- **[Documentación de Milvus](https://milvus.io/docs)**: Optimización de bases de datos vectoriales\n", + "\n", + "### Recursos Comunitarios\n", + "- Únete a la [comunidad de Docling en GitHub](https://github.com/docling-project/docling/discussions)\n", + "- Comparte tus implementaciones\n", + "- Contribuye con mejoras Docling! 
❤️\n", + " \n", + "### Temas Relacionados para Explorar\n", + "- Análisis de Diseño de Documentos\n", + "- Embeddings Multimodales\n", + "- Respuesta a Preguntas Visuales\n", + "- Sistemas de IA Explicables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ytIk0-fghmMy" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wADjxUDfhm5L" + }, + "source": [ + "## Conclusión\n", + "\n", + "Has completado un increíble viaje desde la conversión básica de documentos hasta la construcción de un sistema de IA sofisticado y transparente. La combinación de la comprensión de documentos de Docling, las capacidades de IA de Granite y el grounding visual crea una aplicación poderosa.\n", + "\n", + "Tu sistema RAG multimodal representa la vanguardia de la IA para documentos. Ya sea que estés construyendo para el sector de la salud, legal, financiero o cualquier otro dominio, ahora tienes las herramientas para crear sistemas de IA que no solo son poderosos, sino también confiables y transparentes." + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}