Part of the [GPT-RAG](https://github.com/Azure/gpt-rag) solution.
The **GPT-RAG Data Ingestion** service automates the processing of diverse document types—such as PDFs, images, spreadsheets, transcripts, and SharePoint files—preparing them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.
For full documentation, visit the **[GPT-RAG documentation site](https://azure.github.io/GPT-RAG/)**.

## How data ingestion works

The service performs the following steps:
- **Scan sources**: Detects new or updated content in configured sources
- **Process content**: Chunks and enriches data for retrieval
- **Index documents**: Writes processed chunks into Azure AI Search
- **Schedule execution**: Runs on a CRON-based scheduler defined by environment variables
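The scan → process → index cycle above can be sketched as a minimal pipeline. This is an illustrative outline, not the service's actual code: the function names, the `CRON_RUN_SCHEDULE` variable name, and the dict standing in for Azure AI Search are all hypothetical.

```python
import os

def scan_sources(sources):
    # Detect new or updated documents (hypothetical stub: a real scanner
    # would compare timestamps or change feeds per data source).
    return [doc for doc in sources if doc.get("updated")]

def process_content(doc, chunk_size=20):
    # Chunk and enrich the document (here: naive fixed-size text chunks).
    text = doc["text"]
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def index_documents(index, doc_id, chunks):
    # Write processed chunks into the search index (a plain dict stands in
    # for Azure AI Search in this sketch).
    index[doc_id] = chunks

def run_once(sources, index):
    for doc in scan_sources(sources):
        index_documents(index, doc["id"], process_content(doc))

# The CRON expression would come from an environment variable in the real
# service; CRON_RUN_SCHEDULE is a hypothetical name used for illustration.
schedule = os.environ.get("CRON_RUN_SCHEDULE", "0 * * * *")

index = {}
sources = [
    {"id": "a.txt", "text": "alpha " * 10, "updated": True},
    {"id": "b.txt", "text": "beta " * 10, "updated": False},
]
run_once(sources, index)
print(sorted(index))  # only the updated document is indexed
```

In the real service each step runs on the configured schedule rather than once; the sketch only shows how the stages hand data to one another.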
## Supported data sources
- [Blob Storage](docs/blob_data_source.md)
- [NL2SQL Metadata](docs/nl2sql_data_source.md)
- SharePoint
## Supported formats and chunkers
The ingestion service selects a chunker based on the file extension, ensuring each document is processed with the most suitable method.
- **`.pdf` files** — Processed by the [DocAnalysisChunker](chunking/chunkers/doc_analysis_chunker.py) using the Document Intelligence API. Structured elements such as tables and sections are extracted and converted into Markdown, then segmented with LangChain splitters. When Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are handled the same way.
- **Image files** (`.bmp`, `.png`, `.jpeg`, `.tiff`) — The [DocAnalysisChunker](chunking/chunkers/doc_analysis_chunker.py) applies OCR to extract text before chunking.
- **Text-based files** (`.txt`, `.md`, `.json`, `.csv`) — Processed by the [LangChainChunker](chunking/chunkers/langchain_chunker.py), which splits content into paragraphs or sections.
- **Specialized formats**:
  - `.vtt` (video transcripts) — Handled by the [TranscriptionChunker](chunking/chunkers/transcription_chunker.py), which splits content by time codes.
  - `.xlsx` (spreadsheets) — Processed by the [SpreadsheetChunker](chunking/chunkers/spreadsheet_chunker.py), chunked by rows or sheets.
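Row-based spreadsheet chunking can be illustrated with a small sketch. This is not the actual SpreadsheetChunker implementation: plain lists stand in for sheet rows, and the header-repetition behavior is an assumption shown for illustration.

```python
def chunk_rows(header, rows, rows_per_chunk=2):
    # Group rows so each chunk stays under a row budget, repeating the
    # header row in every chunk so each chunk is self-describing.
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        chunks.append([header] + rows[start:start + rows_per_chunk])
    return chunks

header = ["product", "qty"]
rows = [["apple", 3], ["pear", 5], ["plum", 2]]
for chunk in chunk_rows(header, rows):
    print(chunk)
```

Chunking by sheet instead would group all rows of one worksheet into a single chunk; the trade-off is chunk size versus keeping related rows together.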
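The extension-to-chunker mapping described in this section can be sketched as a simple lookup table. The registry below is illustrative (the real service's dispatch code may differ), but the extension-to-chunker pairs mirror the list above; the string class names and the default fallback are assumptions.

```python
from pathlib import Path

# Hypothetical registry mirroring the mapping documented above.
CHUNKER_BY_EXTENSION = {
    ".pdf": "DocAnalysisChunker",
    ".bmp": "DocAnalysisChunker", ".png": "DocAnalysisChunker",
    ".jpeg": "DocAnalysisChunker", ".tiff": "DocAnalysisChunker",
    ".txt": "LangChainChunker", ".md": "LangChainChunker",
    ".json": "LangChainChunker", ".csv": "LangChainChunker",
    ".vtt": "TranscriptionChunker",
    ".xlsx": "SpreadsheetChunker",
}

def select_chunker(filename: str, default: str = "LangChainChunker") -> str:
    # Dispatch purely on the (case-insensitive) file extension.
    ext = Path(filename).suffix.lower()
    return CHUNKER_BY_EXTENSION.get(ext, default)

print(select_chunker("report.pdf"))   # DocAnalysisChunker
print(select_chunker("meeting.vtt"))  # TranscriptionChunker
```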
## How to deploy the data ingestion service
### Prerequisites
Before deploying the application, you must provision the infrastructure as described in the [GPT-RAG](https://github.com/azure/gpt-rag) repo. This includes creating all necessary Azure resources required to support the application runtime.
<details markdown="block">
<summary>Click to view <strong>software</strong> prerequisites</summary>
<br>

The machine used to customize and/or deploy the service should have:

</details>
> For earlier versions, use the corresponding release in the GitHub repository (e.g., v1.0.0 for the initial version).
## 🤝 Contributing

We appreciate contributions! See [CONTRIBUTING](https://github.com/Azure/gpt-rag/blob/main/CONTRIBUTING.md) for guidelines on submitting pull requests.