Commit f8679c6

arhamm1, greptile-apps[bot], and lbliii authored
Update getting started - text quickstart.md (#1238)
* Update getting started - text quickstart.md

Signed-off-by: Arham Mehta <[email protected]>

* Update text.md

Signed-off-by: Arham Mehta <[email protected]>

* Update docs/get-started/text.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arham Mehta <[email protected]>

---------

Signed-off-by: Arham Mehta <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: L.B. <[email protected]>
1 parent e76c002 commit f8679c6

1 file changed (+12, -13 lines changed)

docs/get-started/text.md

Lines changed: 12 additions & 13 deletions
@@ -12,22 +12,22 @@ modality: "text-only"

 # Get Started with Text Curation

-This guide helps you set up and get started with NeMo Curator's text curation capabilities. Follow these steps to prepare your environment and run your first text curation pipeline.
+This guide provides step-by-step instructions for setting up NeMo Curator's text curation capabilities. Follow these instructions to prepare your environment and execute your first text curation pipeline.

 ## Prerequisites

-To use NeMo Curator's text curation modules, ensure you meet the following requirements:
+To use NeMo Curator's text curation modules, ensure your system meets the following requirements:

 * Python 3.10, 3.11, or 3.12
 * packaging >= 22.0
 * uv (for package management and installation)
 * Ubuntu 22.04/20.04
 * NVIDIA GPU (optional for most text modules, required for GPU-accelerated operations)
   * Volta™ or higher (compute capability 7.0+)
-  * CUDA 12 (or above)
+  * CUDA 12 (or later)

 :::{tip}
-If you don't have `uv` installed, refer to the [Installation Guide](../admin/installation.md) for setup instructions, or install it quickly with:
+If `uv` is not installed, refer to the [Installation Guide](../admin/installation.md) for setup instructions, or install it quickly using:

 ```bash
 curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
@@ -40,7 +40,7 @@ source $HOME/.local/bin/env

 ## Installation Options

-You can install NeMo Curator in three ways:
+You can install NeMo Curator using one of the following methods:

 ::::{tab-set}

@@ -100,7 +100,7 @@ NeMo Curator uses a pipeline-based architecture for processing text data. Before

 ## Set Up Data Directory

-Create a directory structure for your text datasets:
+Create the following directories for your text datasets:

 ```bash
 mkdir -p ~/nemo_curator/data/sample
@@ -133,34 +133,33 @@ pipeline.add_stage(
     JsonlReader(
         file_paths="~/nemo_curator/data/sample/",
         files_per_partition=4,
-        fields=["text", "id"] # Only read required columns for efficiency
+        fields=["text", "id"]
     )
 )

 # Add quality filtering stages
 pipeline.add_stage(
     ScoreFilter(
-        score_fn=WordCountFilter(min_words=50, max_words=100000),
+        filter_obj=WordCountFilter(min_words=50, max_words=100000),
         text_field="text",
-        score_field="word_count" # Optional: save scores for analysis
+        score_field="word_count"
     )
 )

 pipeline.add_stage(
     ScoreFilter(
-        score_fn=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
+        filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
         text_field="text",
-        score_field="non_alpha_score" # Optional: save scores for analysis
+        score_field="non_alpha_score"
     )
 )

 # Write the curated results
 pipeline.add_stage(
     JsonlWriter("~/nemo_curator/data/curated")
 )
-
 # Execute the pipeline
-results = pipeline.run() # Uses XennaExecutor by default for distributed processing
+results = pipeline.run()

 print(f"Pipeline completed successfully! Processed {len(results) if results else 0} tasks.")
 ```
