
Commit d3ca4b3

Multimodal dapt curation edits (#813)
* Edits to address VDR

  Signed-off-by: Rucha Apte <[email protected]>

* Edits to interpret outputs and requirements

  Signed-off-by: Rucha Apte <[email protected]>

* Update tutorials/multimodal_dapt_curation/curator/README.md

  Co-authored-by: Sarah Yurick <[email protected]>
  Signed-off-by: Rucha Apte <[email protected]>

---------

Signed-off-by: Rucha Apte <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
1 parent 9850de9 commit d3ca4b3

File tree

4 files changed, +15 -1 lines changed


tutorials/multimodal_dapt_curation/README.md

Lines changed: 3 additions & 0 deletions
````diff
@@ -13,8 +13,11 @@ In this section, we guide you through extracting various modalities (text, image
 The second part of the tutorial covers best practices for data curation for DAPT. This stage processes extracted text, tables, charts, and images using the curation pipeline. To complete the prerequisites and execute the tutorial, follow the README in the `curator` folder within the directory.
 
 ## Instructions
+
+- For this tutorial, we will maintain two separate environments (or Docker containers) to ensure that the dependencies for each part remain isolated and do not interfere with one another.
 - Ensure that all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) are completed before proceeding.
 - Follow the respective READMEs in the `ingest` and `curator` folders for step-by-step guidance.
+- Please make sure you are using Python 3.10 when running this tutorial.
 
 ## License
 Refer to the respective repositories for licensing information.
````
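The "two separate environments" instruction added above could look something like the sketch below. The environment names (`ingest_env`, `curator_env`) are placeholders, the `python3.10` launcher is assumed to be on the PATH, and the actual prerequisites come from the respective `ingest` and `curator` READMEs, so treat this as illustrative rather than part of the tutorial.

```sh
# Illustrative sketch of keeping the extraction and curation dependencies isolated.
python3.10 -m venv ingest_env      # environment for the nv-ingest extraction part
python3.10 -m venv curator_env     # environment for the NeMo Curator curation part

# Work on one part at a time, e.g. the curation part:
source curator_env/bin/activate
pip install -r tutorials/multimodal_dapt_curation/curator/requirements.txt
deactivate
```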

tutorials/multimodal_dapt_curation/curator/README.md

Lines changed: 10 additions & 0 deletions
````diff
@@ -27,6 +27,16 @@ The tutorial follows these steps:
 6. Apply semantic deduplication to get rid of duplicate images extracted
 7. Save the filtered and curated data
 
+## Interpreting the outputs
+The tutorial provides detailed logging of the dataset curation process:
+- It begins by printing the original dataset lengths for text extracted from different modalities, such as tables and charts.
+- It then displays the progressive reductions in dataset size as various filters are applied:
+  - Fuzzy deduplication
+  - Semantic deduplication
+  - Additional filtering mechanisms
+- During the PII redaction step, the number of names and email addresses redacted from the dataset is also reported.
+Once the tutorial completes, the final curated outputs are saved in the `curated/` directory. The results are organized by modality, such as `text/` or `tables_charts/`, for easy access and inspection.
+
 ## Usage
 After installing the NeMo Curator package, install the required dependencies and run the pipeline using the following command:
 ```sh
````
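To sanity-check the saved results described in the new "Interpreting the outputs" section, a quick directory listing is usually enough. This is a minimal sketch that assumes the pipeline was run from the `curator` folder, so `curated/` is relative to the current directory; `text/` and `tables_charts/` are the per-modality subdirectories mentioned in the README.

```sh
# Minimal sketch: list the curated outputs and get a rough file count per modality.
ls curated/
find curated/text curated/tables_charts -type f | wc -l
```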

tutorials/multimodal_dapt_curation/curator/requirements.txt

Lines changed: 0 additions & 1 deletion
````diff
@@ -1,6 +1,5 @@
 arxiv==2.1.0
 arxiv-downloader
-cchardet
 nltk==3.8.1
 pdfminer.six==20221105
 poppler-utils
````

tutorials/multimodal_dapt_curation/ingest/README.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -50,6 +50,8 @@ python main.py --analyze --display
 ```
 
 ## Notes
+Extraction should complete within 30 seconds. If it hangs, your build key is likely being throttled. Consider using a self-hosted solution or contact NVIDIA to request an unlimited key.
+
 - Computing Recall on Extraction
   - In order to calculate end-to-end recall accuracy of a retrieval pipeline refer to following [tutorial](https://github.com/NVIDIA/nv-ingest/blob/main/evaluation/bo767_recall.ipynb)
 - Exploring Outputs
````
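A simple way to check the 30-second expectation from the new note is to time the extraction command shown in the hunk context. This is only a rough sanity check, not an official diagnostic from the tutorial.

```sh
# Rough sanity check against the ~30 second expectation; if this runs much
# longer, the build key may be getting throttled as described above.
time python main.py --analyze --display
```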
