Commit 9c89d09

Merge upstream/main into merge_datasets branch
Resolved conflicts:

- Updated megatron_tokenizer.py to use AutoTokenizer.from_pretrained
- Updated test mocks to match upstream implementation
- Added upstream tutorial entries (Llama Nemotron, GLiNER PII)
- Preserved merge_datasets documentation section

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>
2 parents: 387815d + 605321b

File tree

118 files changed: +5403 / −2508 lines


.github/workflows/cicd-main.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -128,7 +128,7 @@ jobs:
       matrix:
         os: [ubuntu-latest]
         python-version: ["3.10", "3.12"]
-        folder: ["backends", "core", "models", "pipelines", "stages-audio", "stages-common", "stages-deduplication", "stages-image", "stages-synthetic", "stages-text", "stages-video", "tasks", "utils"]
+        folder: ["backends", "config", "core", "models", "pipelines", "stages-audio", "stages-common", "stages-deduplication", "stages-image", "stages-synthetic", "stages-text", "stages-video", "tasks", "utils"]
     needs: [pre-flight, cicd-wait-in-queue]
     runs-on: ${{ matrix.os }}
     name: Unit_Test_${{ matrix.folder}}_CPU_python-${{ matrix.python-version }}
@@ -247,7 +247,7 @@ jobs:
     if: |
       (
         needs.pre-flight.outputs.docs_only == 'true'
-        || success()
+        || always()
       )
       && !cancelled()
     runs-on: ubuntu-latest
```
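For context on the `success()` → `always()` change: in GitHub Actions, `success()` is true only when no required upstream job has failed, while `always()` evaluates to true regardless of upstream status. A minimal sketch of the resulting behavior, using a hypothetical job name not taken from this workflow:

```yaml
jobs:
  # Hypothetical illustration only -- not part of this repository's workflow.
  summary:
    needs: [build]
    # With success(), this job is skipped whenever `build` fails.
    # With always(), it still runs after a failure (e.g., to report status),
    # while `!cancelled()` keeps it from running on manual cancellation.
    if: always() && !cancelled()
    runs-on: ubuntu-latest
    steps:
      - run: echo "runs even if build failed, but not when cancelled"
```

The `docs_only == 'true'` clause in the actual workflow additionally forces the job to run for docs-only changes.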

.gitignore

Lines changed: 14 additions & 0 deletions
```diff
@@ -158,3 +158,17 @@ data/

 # InternVideo2 dependency (cloned by installation script)
 InternVideo/
+
+# UV cache directory
+.uv_cache/
+
+# Ray temp directory
+.ray_temp/
+
+uv.lock
+pyproject.toml
+
+token_test/
+*.parquet
+*.bin
+*.idx
```

benchmarking/Dockerfile

Lines changed: 2 additions & 2 deletions
```diff
@@ -12,8 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-ARG NEMO_CURATOR_IMAGE=nemo_curator
-FROM ${NEMO_CURATOR_IMAGE} AS nemo_curator_benchmarking
+ARG CURATOR_IMAGE=nemo_curator
+FROM ${CURATOR_IMAGE} AS nemo_curator_benchmarking

 # Add system utilities useful for benchmark and debug
 RUN apt-get update \
```
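The renamed build argument can be overridden at build time. A sketch, assuming the benchmarking image is built from the repository root and a `nemo_curator` base image already exists locally (tags here are illustrative):

```
# Build against the default base image (nemo_curator):
docker build -f benchmarking/Dockerfile -t nemo_curator_benchmarking .

# Or point CURATOR_IMAGE at a different base tag:
docker build -f benchmarking/Dockerfile \
  --build-arg CURATOR_IMAGE=my_registry/nemo_curator:dev \
  -t nemo_curator_benchmarking .
```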

benchmarking/README.md

Lines changed: 6 additions & 10 deletions
````diff
@@ -33,11 +33,10 @@ Note: you may only need to do this periodically when the environment needs to be

 **2. Update config:**

-Update `results_path`, `artifacts_path`, and `datasets_path` in the YAML config file based on your preferences. In this example, we'll edit the YAML config `./benchmarking/nightly-benchmark.yaml`
+Update `results_path` and `datasets_path` in the YAML config file based on your preferences. In this example, we'll edit the YAML config `./benchmarking/nightly-benchmark.yaml`

 ```yaml
 results_path: /path/where/results/are/stored
-artifacts_path: /path/where/artifacts/are/stored
 datasets_path: /path/to/datasets
 ```
````
```diff
@@ -67,7 +66,7 @@ Results are written to the `results_path` specified in your configuration, organ
 A **session** represents a single invocation of the benchmarking framework. Each session:
 - Has a unique name with timestamp (e.g., `benchmark-run__2025-01-23__14-30-00`)
 - Contains one or more benchmark entries
-- Produces a session directory with results and artifacts
+- Produces a session directory with results
 - Captures environment metadata (system info, package versions, etc.)

 ### Scripts
```
```diff
@@ -113,7 +112,7 @@ See [Sinks: Custom Reporting & Actions](#sinks-custom-reporting--actions) for de

 The framework uses one or more YAML files to configure benchmark sessions. Multiple configuration files are merged, allowing separation of concerns (e.g., machine-specific paths vs. benchmark definitions).

-A useful pattern is to use multiple YAML files, where configuration that does not typically change is in one or more files, and user or machine-specific configuration is others. For example, `my_paths_and_reports.yaml` could have results / artifacts / datasets paths and personal sink settings (individual slack channel, etc.), and `release-benchmarks.yaml` could have the team-wide configuration containing the individual benchmark entries and performance requirements.
+A useful pattern is to use multiple YAML files, where configuration that does not typically change is in one or more files, and user or machine-specific configuration is others. For example, `my_paths_and_reports.yaml` could have results / datasets paths and personal sink settings (individual slack channel, etc.), and `release-benchmarks.yaml` could have the team-wide configuration containing the individual benchmark entries and performance requirements.

 This can be especially useful during development. During development you'll not only want to use your own paths and report settings, you'll also want to use the standard benchmarking environment (i.e. a container), but cannot afford to rebuild the Docker image for each code change you're evaluating. The `--use-host-curator` flag is intended for this case. This flag will use your Curator source dir on host inside the container via a volume mount (this works because the container has curator installed in editable mode), and no image rebuild step is needed.

```
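The multi-file pattern the README describes might be laid out like this; file names follow the README's own example, and everything beyond the two base-path keys is a placeholder comment rather than real framework configuration:

```yaml
# my_paths_and_reports.yaml -- user/machine-specific, not shared with the team
results_path: /home/me/benchmarks/results
datasets_path: /mnt/data/curator-datasets
# ...personal sink settings (e.g., an individual Slack channel) would go here

# release-benchmarks.yaml -- team-wide configuration
# ...individual benchmark entries and performance requirements would go here
```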
````diff
@@ -125,13 +124,11 @@ An example of a development scenario using this pattern looks like this:
 ### Configuration Structure

 ```yaml
-# Required: Base paths for results, artifacts, and datasets
+# Required: Base paths for results and datasets
 # These paths must exist on the host machine
 # When running in Docker with tools/run.sh, paths are automatically mapped to container volumes
-# These base paths can be referenced in other configuration values using {results_path}, {artifacts_path}, {datasets_path}
-# NOTE: the current version of the framework does not use artifacts_path
+# These base paths can be referenced in other configuration values using {results_path}, {datasets_path}
 results_path: /path/to/results
-artifacts_path: /path/to/artifacts
 datasets_path: /path/to/datasets

 # Optional: Global timeout for all entries (seconds)
````
```diff
@@ -247,7 +244,6 @@ datasets:

 Available base path placeholders:
 - `{results_path}` - Resolves to the configured `results_path`
-- `{artifacts_path}` - Resolves to the configured `artifacts_path` *Note: unused in current version of the framework*
 - `{datasets_path}` - Resolves to the configured `datasets_path`

 **Dataset references** - Reference datasets in entry arguments:
```
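The base-path placeholder substitution described in this hunk can be sketched with Python's `str.format_map`; this is an illustrative re-implementation (the function name and signature are hypothetical), not the framework's actual code:

```python
# Illustrative sketch of base-path placeholder resolution; the real
# benchmarking framework may implement this differently.

def resolve_placeholders(value: str, results_path: str, datasets_path: str) -> str:
    """Expand {results_path} and {datasets_path} in a config value."""
    return value.format_map({
        "results_path": results_path,
        "datasets_path": datasets_path,
    })

resolved = resolve_placeholders(
    "{datasets_path}/common-crawl/input.jsonl",
    results_path="/path/to/results",
    datasets_path="/path/to/datasets",
)
print(resolved)  # /path/to/datasets/common-crawl/input.jsonl
```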
````diff
@@ -312,7 +308,7 @@ Run benchmarks using a configuration file:
 ```

 This command:
-- Reads the configuration file and extracts `results_path`, `artifacts_path`, and `datasets_path`
+- Reads the configuration file and extracts `results_path` and `datasets_path`
 - Automatically creates volume mounts to map these paths into the container
 - Runs the benchmarking framework with the Curator code built into the Docker image
 - Passes environment variables like `SLACK_WEBHOOK_URL` and `MLFLOW_TRACKING_URI` to the container
````
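The volume-mount behavior in the list above could be approximated as follows; this is a hypothetical helper sketching the idea of mapping each host path to the same path inside the container, not the actual `tools/run.sh` logic:

```python
# Hypothetical sketch of deriving docker `-v` flags from the two base
# paths; the real tools/run.sh may construct its command differently.

def volume_args(results_path: str, datasets_path: str) -> list[str]:
    """Map each host path to the identical path inside the container."""
    args: list[str] = []
    for host_path in (results_path, datasets_path):
        args += ["-v", f"{host_path}:{host_path}"]
    return args

print(volume_args("/path/to/results", "/path/to/datasets"))
# ['-v', '/path/to/results:/path/to/results', '-v', '/path/to/datasets:/path/to/datasets']
```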

benchmarking/config.yaml

Lines changed: 0 additions & 65 deletions
This file was deleted.

benchmarking/dummy-config.yaml

Lines changed: 0 additions & 53 deletions
This file was deleted.
