
Commit d3ca4b3

Multimodal dapt curation edits (#813)
* Edits to address VDR

  Signed-off-by: Rucha Apte <[email protected]>

* Edits to interpret outputs and requirements

  Signed-off-by: Rucha Apte <[email protected]>

* Update tutorials/multimodal_dapt_curation/curator/README.md

  Co-authored-by: Sarah Yurick <[email protected]>
  Signed-off-by: Rucha Apte <[email protected]>

---------

Signed-off-by: Rucha Apte <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
1 parent 9850de9 commit d3ca4b3

File tree

4 files changed, +15 -1 lines changed


tutorials/multimodal_dapt_curation/README.md

Lines changed: 3 additions & 0 deletions
````diff
@@ -13,8 +13,11 @@ In this section, we guide you through extracting various modalities (text, image
 The second part of the tutorial covers best practices for data curation for DAPT. This stage processes extracted text, tables, charts, and images using the curation pipeline. To complete the prerequisites and execute the tutorial, follow the README in the `curator` folder within the directory.
 
 ## Instructions
+
+- For this tutorial, we will maintain two separate environments (or Docker containers) to ensure that the dependencies for each part remain isolated and do not interfere with one another.
 - Ensure that all prerequisites for both `nv-ingest` (extraction) and `curator` (curation) are completed before proceeding.
 - Follow the respective READMEs in the `ingest` and `curator` folders for step-by-step guidance.
+- Please make sure you are using Python 3.10 when running this tutorial.
 
 ## License
 Refer to the respective repositories for licensing information.
````
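The "two separate environments" instruction added above could look something like the sketch below. The environment names (`ingest_env`, `curator_env`) are placeholders, the `python3.10` launcher is assumed to be on the PATH, and the actual prerequisites come from the respective `ingest` and `curator` READMEs, so treat this as illustrative rather than part of the tutorial.

```sh
# Illustrative sketch of keeping the extraction and curation dependencies isolated.
python3.10 -m venv ingest_env      # environment for the nv-ingest extraction part
python3.10 -m venv curator_env     # environment for the NeMo Curator curation part

# Work on one part at a time, e.g. the curation part:
source curator_env/bin/activate
pip install -r tutorials/multimodal_dapt_curation/curator/requirements.txt
deactivate
```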

tutorials/multimodal_dapt_curation/curator/README.md

Lines changed: 10 additions & 0 deletions
````diff
@@ -27,6 +27,16 @@ The tutorial follows these steps:
 6. Apply semantic deduplication to get rid of duplicate images extracted
 7. Save the filtered and curated data
 
+## Interpreting the outputs
+The tutorial provides detailed logging of the dataset curation process:
+- It begins by printing the original dataset lengths for text extracted from different modalities, such as tables and charts.
+- It then displays the progressive reductions in dataset size as various filters are applied:
+  - Fuzzy deduplication
+  - Semantic deduplication
+  - Additional filtering mechanisms
+- During the PII redaction step, the number of names and email addresses redacted from the dataset is also reported.
+Once the tutorial completes, the final curated outputs are saved in the `curated/` directory. The results are organized by modality, such as `text/` or `tables_charts/`, for easy access and inspection.
+
 ## Usage
 After installing the NeMo Curator package, install the required dependencies and run the pipeline using the following command:
 ```sh
````
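To sanity-check the saved results described in the new "Interpreting the outputs" section, a quick directory listing is usually enough. This is a minimal sketch that assumes the pipeline was run from the `curator` folder, so `curated/` is relative to the current directory; `text/` and `tables_charts/` are the per-modality subdirectories mentioned in the README.

```sh
# Minimal sketch: list the curated outputs and get a rough file count per modality.
ls curated/
find curated/text curated/tables_charts -type f | wc -l
```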

tutorials/multimodal_dapt_curation/curator/requirements.txt

Lines changed: 0 additions & 1 deletion
````diff
@@ -1,6 +1,5 @@
 arxiv==2.1.0
 arxiv-downloader
-cchardet
 nltk==3.8.1
 pdfminer.six==20221105
 poppler-utils
````

tutorials/multimodal_dapt_curation/ingest/README.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -50,6 +50,8 @@ python main.py --analyze --display
 ```
 
 ## Notes
+Extraction should complete within 30 seconds. If it hangs, your build key is likely being throttled. Consider using a self-hosted solution or contact NVIDIA to request an unlimited key.
+
 - Computing Recall on Extraction
   - In order to calculate end-to-end recall accuracy of a retrieval pipeline refer to following [tutorial](https://github.com/NVIDIA/nv-ingest/blob/main/evaluation/bo767_recall.ipynb)
 - Exploring Outputs
````
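A simple way to check the 30-second expectation from the new note is to time the extraction command shown in the hunk context. This is only a rough sanity check, not an official diagnostic from the tutorial.

```sh
# Rough sanity check against the ~30 second expectation; if this runs much
# longer, the build key may be getting throttled as described above.
time python main.py --analyze --display
```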
