132 commits
55577ce
Add readme for coreset selection engine
aviban15 Feb 4, 2026
390c5c4
Add files via upload
sidrocks Feb 4, 2026
f2fabb8
initial commit
sidrocks Feb 4, 2026
be875cc
initial commit
sidrocks Feb 4, 2026
08f086a
Add files via upload
sidrocks Feb 10, 2026
b0f5042
feat: adapting to curriculum_v6.yaml
pankaj1311 Feb 10, 2026
924d99f
feat: added sharding logic and global dedup hash within shard and aft…
pankaj1311 Feb 10, 2026
fc5d5eb
Add coreset engine v5 (stage-wise selection)
sidrocks Feb 11, 2026
9951988
Initial Data processing/Exact dedup on curriculum data
BalajiAJ Feb 11, 2026
7611254
Initial Data processing/Exact dedup on curriculum data
BalajiAJ Feb 11, 2026
86fb9a9
Exact dedup on curriculum data using pyspark
BalajiAJ Feb 11, 2026
a936491
Exact dedup on curriculum data using pyspark
BalajiAJ Feb 11, 2026
2e44956
Disable total_token as it will be read from curriculum.yaml stage-wise
sidrocks Feb 11, 2026
05669cd
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of github.com:…
sidrocks Feb 11, 2026
41bde5e
Added optional Band-Infer feature to infer a band on band or difficul…
sidrocks Feb 12, 2026
7492ba1
remove total-tokens
sidrocks Feb 12, 2026
ec91a2a
Added glue_job_single.py
abhi1021 Feb 12, 2026
907e316
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of https://git…
abhi1021 Feb 12, 2026
891825b
Update S3 bucket name in Glue job configuration
abhi1021 Feb 12, 2026
d2f3e58
feat: add the emr scripts for dedup and stats
pankaj1311 Feb 12, 2026
4f90cd4
added new allowed_domains
abhi1021 Feb 13, 2026
ca3099a
feat: fixed shard.sh
pankaj1311 Feb 13, 2026
5970629
feat: fixed batch_processor
pankaj1311 Feb 13, 2026
e0b2d67
feat: fixed batch_processor
pankaj1311 Feb 13, 2026
0201c26
feat: fixed batch_processor
pankaj1311 Feb 13, 2026
6c0c0fc
feat: includes changes to folder segregation (Author - Balaji)
pankaj1311 Feb 13, 2026
f8568f1
feat: included commands for running on EC2 and notebook for distribut…
pankaj1311 Feb 13, 2026
7cc4141
feat: included commands for running on EC2 and fixes to resume operat…
pankaj1311 Feb 13, 2026
7ea7d33
added missing fields in the coreset selection indices
sidrocks Feb 13, 2026
21a4f92
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of
sidrocks Feb 13, 2026
af2307d
chore: updated the commands
pankaj1311 Feb 13, 2026
6cb3d20
commit as modifying local fields Merge branch 'p3/feat/stage-wise-cor…
sidrocks Feb 13, 2026
338e654
chore: updated the python to python3 in commands
pankaj1311 Feb 13, 2026
9eacd09
delete
sidrocks Feb 13, 2026
798a53d
remove pem
sidrocks Feb 13, 2026
d537731
Update kv-t3-459.pem
sidrocks Feb 13, 2026
bae0857
del pem
sidrocks Feb 13, 2026
714e0cb
Dedup script update
BalajiAJ Feb 14, 2026
7198d6c
Dedup script update
BalajiAJ Feb 14, 2026
8da4e32
band-inference applied score-source and validation of coreset generat…
sidrocks Feb 14, 2026
5440ff4
Remove folder output2 coreset output
sidrocks Feb 14, 2026
a5026f5
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of github.com:…
sidrocks Feb 15, 2026
fa68da0
columns fix issue in dedup script
BalajiAJ Feb 16, 2026
229fda7
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of https://git…
BalajiAJ Feb 16, 2026
c040fa8
columns fix issue in dedup script
BalajiAJ Feb 16, 2026
c7cd591
include code for B6 - will be optional and get activated only curricu…
sidrocks Feb 16, 2026
e92d64b
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of github.com:…
sidrocks Feb 16, 2026
662ce78
columns fix issue in dedup script
BalajiAJ Feb 17, 2026
6a7d036
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of https://git…
BalajiAJ Feb 17, 2026
1029609
feat: major changes to dependency management, to be merged with stagi…
pankaj1311 Feb 17, 2026
3bb7000
feat: evaluate the curriculums created
pankaj1311 Feb 17, 2026
a61e7f7
feat: fixes to run tests from relative paths, can be altered later pe…
pankaj1311 Feb 17, 2026
732cef5
chore: add coresets build commands to md
pankaj1311 Feb 17, 2026
feffea4
chore: update contribution.md
pankaj1311 Feb 17, 2026
441b4fa
chore: add automation scripts and documentation for running it manual…
pankaj1311 Feb 17, 2026
9a0fcd8
feat: automation script for coresets generation
pankaj1311 Feb 17, 2026
c8df7b5
feat: CI pipeline for coresets generation
pankaj1311 Feb 17, 2026
1fccdbf
chore: standardizing the resume option
pankaj1311 Feb 17, 2026
14aeb77
feat: this commit includes the automation script to run coresets gene…
pankaj1311 Feb 17, 2026
323cc26
chore: updated .gitignore
pankaj1311 Feb 17, 2026
d54c9ab
feat: includes the CI pipeline for running the coresets generation (n…
pankaj1311 Feb 17, 2026
280c6df
chore: remove the duplicate automation script
pankaj1311 Feb 17, 2026
2ede4f9
feat: s3 spatial analysis + plots (#513)
pankaj1311 Feb 20, 2026
4f42ac6
P3/feat/coresets dist analysis (#520)
pankaj1311 Feb 21, 2026
45bbd1a
Merge branch 'staging' into p3/feat/stage-wise-coreset-selection_v2
pankaj1311 Feb 21, 2026
8686515
P3/feat/coresets dist analysis (#533)
pankaj1311 Feb 21, 2026
84ccd04
P3/feat/coresets dist analysis (#534)
pankaj1311 Feb 21, 2026
7a9f3ee
P3/feat/coresets dist analysis (#535)
pankaj1311 Feb 21, 2026
dea8de4
P3/feat/coresets dist analysis (#536)
pankaj1311 Feb 21, 2026
4794224
P3/feat/coresets dist analysis (#537)
pankaj1311 Feb 21, 2026
41592f7
P3/feat/coresets dist analysis (#538)
pankaj1311 Feb 21, 2026
1ffd2c9
P3/feat/coresets dist analysis (#539)
pankaj1311 Feb 21, 2026
22864c1
P3/feat/coresets dist analysis (#540)
pankaj1311 Feb 21, 2026
cd31c8d
feat: included the user interruption logic for shards
pankaj1311 Feb 22, 2026
3e8e872
feat: included the interruption logic
pankaj1311 Feb 22, 2026
c86c256
feat: fixed default params and pre-commit fixes
pankaj1311 Feb 22, 2026
56b5e7b
Fix: Protected slices adding tokens to B4/B5 skipping disallowed doma…
sidrocks Feb 22, 2026
1e96e1e
Updated deliverables.md with latest update to code and additional req…
sidrocks Feb 23, 2026
39100ff
chore: added the report to be published
pankaj1311 Feb 23, 2026
63331f5
chore: added the report to be published
pankaj1311 Feb 23, 2026
bdb54b1
Updated documentation on coreset selection operational process
sidrocks Feb 23, 2026
8908553
Add files via upload
vj1117 Feb 24, 2026
550441c
docs(coreset): add T3 critical review and production readiness audit
AnkitaMungalpara Feb 24, 2026
20c0f5e
Update T3_GO_NOGO_REVIEW_REPORT_240226_v4.md
vj1117 Feb 24, 2026
489f045
Add files via upload
vj1117 Feb 25, 2026
9ec2a15
chore: updated the report with comments from T3 team
pankaj1311 Feb 25, 2026
a172247
chore: updated the report with comments from T3 team
pankaj1311 Feb 25, 2026
87a9b02
Revise token accounting and data contract details
sidrocks Feb 25, 2026
28d27d4
Upstream data contract specs addded
sidrocks Feb 25, 2026
4361030
Merge branch 'p3/feat/stage-wise-coreset-selection_v2' of github.com:…
sidrocks Feb 25, 2026
914fbc1
Update token accounting and add upstream data contract
sidrocks Feb 25, 2026
a7c82e8
Revise chunk file schema section in documentation
sidrocks Feb 25, 2026
37ee632
Update deliverables section in T3 report
BalajiAJ Feb 25, 2026
dd57dd6
Update T3 report with deduplication and performance details
BalajiAJ Feb 25, 2026
814dc03
Add files via upload
vj1117 Feb 26, 2026
395e593
Add files via upload
vj1117 Feb 26, 2026
82526b4
update ablation report generation include additional metric for singl…
sidrocks Feb 26, 2026
38c0359
feat: added changes to the production playbook
pankaj1311 Feb 26, 2026
deaf3e5
feat: update commands.sh and include new band policies to curriculum.…
pankaj1311 Feb 26, 2026
05ea404
chore: pre-commit fixes
pankaj1311 Feb 26, 2026
fe4216a
Added instruction of using total_tokens during pre-run, merge report …
sidrocks Feb 26, 2026
cf037e5
Merge branch 'staging' into p3/feat/stage-wise-coreset-selection_v2
pankaj1311 Feb 26, 2026
c76b502
chore: cleanup
pankaj1311 Feb 26, 2026
6caf629
chore: updated gitignore
pankaj1311 Feb 26, 2026
d33ceae
Merge branch 'staging' into p3/feat/stage-wise-coreset-selection_v2
pankaj1311 Feb 27, 2026
c45bbd1
chore: minor fixes to commands.sh
pankaj1311 Feb 27, 2026
7b0c8f4
chore: minor fixes to commands.sh
pankaj1311 Feb 27, 2026
1e08818
Added final review summary report
sidrocks Feb 27, 2026
4374fec
chore: fixed path in validate_infra.sh
pankaj1311 Feb 27, 2026
3bed63b
chore: fixed path in validate_infra.sh
pankaj1311 Feb 27, 2026
d356126
chore: final fixes made to infra validations
pankaj1311 Feb 27, 2026
847c5c3
Merge branch 'staging' into p3/feat/stage-wise-coreset-selection_v2
pankaj1311 Feb 28, 2026
2139644
fix tiebreaker when band_scores of multiple records are same
sidrocks Mar 2, 2026
b2b6ac1
Merge branch 'staging' into p3/feat/stage-wise-coreset-selection_v2
pankaj1311 Mar 3, 2026
8125aaa
chore: minor pre-commit fixes and updates to the documentation
pankaj1311 Mar 3, 2026
12d8729
feat: add t1_file_path to the final T3 output
pankaj1311 Mar 3, 2026
ab4e00f
Inclusion of new field t1_file_path as part of output coreset selecte…
sidrocks Mar 3, 2026
044ceb3
feat: added nvme setup script
pankaj1311 Mar 4, 2026
e91c42d
Application of language policy twice - base selection and protected s…
sidrocks Mar 4, 2026
9596d4f
chore: pre-commit fix
pankaj1311 Mar 4, 2026
018dbe2
chore: added s3 paths and inventory script
pankaj1311 Mar 4, 2026
f9da0a7
feat: removing accidentally added submodules
pankaj1311 Mar 4, 2026
20531c3
chore: pre-commit fixes
pankaj1311 Mar 4, 2026
d349b18
feat: added the source field
pankaj1311 Mar 4, 2026
3eee423
Added new domains - Translation, Golden and also update the new domai…
sidrocks Mar 4, 2026
7dfa3d0
feat: divided foreground and background scripts, removed dead paramet…
pankaj1311 Mar 4, 2026
cff0707
chore: pre-commit fixes
pankaj1311 Mar 4, 2026
4c8ea35
feat: create output paths based on enable_nvme flag
pankaj1311 Mar 5, 2026
d1c0d14
feat: removed the domain from curriculum policy
pankaj1311 Mar 5, 2026
bad055a
feat: asw sync changes
pankaj1311 Mar 5, 2026
a2222b3
chore: included languages to curriculum.yaml
pankaj1311 Mar 5, 2026
5bcb3e2
feat: added post processing steps
pankaj1311 Mar 6, 2026
@@ -40,7 +40,30 @@ language_and_context:
- lang: "en"
max_share: 0.92
secondary_languages:
- lang: ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]
- lang:
[
"as",
"bn",
"gu",
"hi",
"kn",
"ml",
"mr",
"or",
"pa",
"ta",
"te",
"bn_roman",
"gu_roman",
"hi_roman",
"ml_roman",
"mr_roman",
"or_roman",
"pa_roman",
"ta_roman",
"te_roman",
"kn_roman",
]
max_share: 0.08
earliest_stage: "1B"
excluded_languages: ["zh", "ja", "ko", "fr", "de", "es"]
@@ -108,7 +131,16 @@ difficulty_system:
name: "Nursery"
intent: "Surface language acquisition"
allowed_modalities: ["general_text"]
allowed_domains: ["web", "social", "qa"]
allowed_domains:
[
"web",
"social",
"qa",
"education",
"language_literacy",
"conversation",
"translation",
]
constraints:
tokenizer:
avg_max: 5000
@@ -123,7 +155,18 @@ difficulty_system:
name: "Primary"
intent: "Fluent everyday language"
allowed_modalities: ["general_text", "clean_exposition"]
allowed_domains: ["web", "encyclopedia", "news", "social", "qa"]
allowed_domains:
[
"web",
"encyclopedia",
"news",
"social",
"qa",
"education",
"language_literacy",
"conversation",
"translation",
]
constraints:
tokenizer:
avg_max: 10000
@@ -139,7 +182,16 @@ difficulty_system:
intent: "Structured knowledge without explicit reasoning"
allowed_modalities: ["general_text", "structured_knowledge"]
allowed_domains:
["encyclopedia", "news", "education", "literature", "web", "qa"]
[
"encyclopedia",
"news",
"education",
"literature",
"web",
"qa",
"conversation",
"translation",
]
constraints:
tokenizer:
avg_max: 20000
@@ -154,7 +206,16 @@ difficulty_system:
name: "Undergraduate"
intent: "Reasoning emergence"
allowed_modalities: ["structured_knowledge", "technical_text", "code"]
allowed_domains: ["science", "math", "education", "code", "literature"]
allowed_domains:
[
"science",
"math",
"education",
"code",
"literature",
"conversation",
"translation",
]
constraints:
tokenizer:
avg_max: 40000
@@ -391,7 +452,15 @@ domains:

band_domain_policy:
B0:
["web", "social", "qa", "education", "language_literacy", "conversation"]
[
"web",
"social",
"qa",
"education",
"language_literacy",
"conversation",
"translation",
]
B1:
[
"web",
@@ -402,6 +471,7 @@ domains:
"education",
"language_literacy",
"conversation",
"translation",
]
B2:
[
@@ -412,8 +482,18 @@
"web",
"qa",
"conversation",
"translation",
]
B3:
[
"science",
"math",
"education",
"code",
"literature",
"conversation",
"translation",
]
B3: ["science", "math", "education", "code", "literature", "conversation"]
B4: ["science", "math", "code", "instruction"]
B5: ["instruction", "science", "math", "code"]

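The `band_domain_policy` mapping above gates which domains may contribute tokens to each band. A minimal sketch of how a consumer might enforce it — the dict literal mirrors a few bands from the YAML fragment, and the helper name `is_domain_allowed` is illustrative, not part of the pipeline:

```python
# Sketch: enforcing band_domain_policy when admitting chunks to a band.
# The policy entries mirror the curriculum.yaml fragment above (B0 now
# includes "translation"); only a subset of bands is shown.

BAND_DOMAIN_POLICY = {
    "B0": ["web", "social", "qa", "education", "language_literacy",
           "conversation", "translation"],
    "B4": ["science", "math", "code", "instruction"],
    "B5": ["instruction", "science", "math", "code"],
}


def is_domain_allowed(band: str, domain: str) -> bool:
    """Return True when `domain` may contribute tokens to `band`."""
    allowed = BAND_DOMAIN_POLICY.get(band)
    if allowed is None:
        # Unknown band: fail closed so misconfigured chunks are skipped.
        return False
    return domain in allowed
```

In practice the policy would be loaded from `curriculum.yaml` rather than inlined; the fail-closed default keeps chunks with an unrecognized band out of every stage.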
@@ -269,6 +269,7 @@ def _build_stage_coreset(self, stage_name: str, stage_config) -> dict:
"byte_length": getattr(all_chunks[cid], "byte_length", 0),
"source_doc_id": getattr(all_chunks[cid], "source_doc_id", ""),
"source_url": getattr(all_chunks[cid], "source_url", None),
"t1_file_path": getattr(all_chunks[cid], "t1_file_path", None),
# Many datasets use `source` as the dataset identifier; keep both.
"source": getattr(all_chunks[cid], "source", None)
or all_chunks[cid].dataset_id,
@@ -870,13 +871,15 @@ def _base_iter_batches() -> (
columns = [
"chunk_id",
"dataset_id",
"source",
"token_count_estimate",
"byte_length",
"domain",
"language",
"band",
"source_doc_id",
"source_url",
"t1_file_path",
"token_ids",
# Optional continuous score columns used by --band-score-source.
"band_score",
@@ -1307,6 +1310,8 @@ def _build_stage_coreset(self, stage_name: str, stage_config) -> dict:
or meta_dict.get("source_doc_id", ""),
source_url=row.get("source_url", None)
or meta_dict.get("source_url", None),
t1_file_path=row.get("t1_file_path", None)
or meta_dict.get("t1_file_path", None),
)

# Preserve raw input source when available (some datasets distinguish dataset_id vs source).
@@ -1420,6 +1425,7 @@ def _build_stage_coreset(self, stage_name: str, stage_config) -> dict:
),
"source_doc_id": getattr(meta, "source_doc_id", ""),
"source_url": getattr(meta, "source_url", None),
"t1_file_path": getattr(meta, "t1_file_path", None),
# Preserve original `source` when present; fallback to dataset_id.
"source": getattr(meta, "source", None)
or meta.dataset_id,
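The diff above propagates `t1_file_path` with the same flat-column-first, nested-metadata-second fallback used for `source_doc_id` and `source_url`. A standalone sketch of that precedence (the helper name `resolve_field` is illustrative):

```python
# Sketch of the fallback pattern used above: prefer the flat column,
# then the nested metadata dict, then a default. Note that `or` treats
# any falsy value (None, "", 0) as missing, matching the code in the diff.

def resolve_field(row: dict, meta: dict, name: str, default=None):
    """Resolve `name` from a flat row, falling back to nested metadata."""
    return row.get(name) or meta.get(name) or default


row = {"chunk_id": "ch_001", "t1_file_path": None}
meta = {"t1_file_path": "s3://t1-raw/example/file.parquet"}

t1_path = resolve_field(row, meta, "t1_file_path")
```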
@@ -95,7 +95,7 @@ print(df.head())
**Sample Columns (typical):**

- `chunk_id`, `dataset_id`, `token_count`, `domain`, `language`, `band`
- `byte_length`, `source_doc_id`, `source_url`
- `byte_length`, `source_doc_id`, `source_url`, `t1_file_path`
- `source` (when available)


@@ -120,7 +120,7 @@ with open("output/coresets/1B/selected_indices.jsonl") as f:
**Sample Output (schema-aligned):**

```json
{"chunk_id":"ch_001","dataset_id":"books","source":"books","token_count":2048,"byte_length":6463,"domain":"literature","language":"en","band":"B0","source_doc_id":"part-00000-...parquet","source_url":"s3://..."}
{"chunk_id":"ch_001","dataset_id":"books","source":"books","token_count":2048,"byte_length":6463,"domain":"literature","language":"en","band":"B0","source_doc_id":"part-00000-...parquet","source_url":"s3://...","t1_file_path":"s3://t1-raw/.../file.parquet"}
```

### CSV Format
@@ -140,8 +140,8 @@ df = pd.read_csv("output/coresets/1B/selected_indices.csv")
**Sample Output (schema-aligned):**

```csv
chunk_id,dataset_id,source,token_count,byte_length,domain,language,band,source_doc_id,source_url
ch_001,books,books,2048,6463,literature,en,B0,part-00000-...parquet,s3://...
chunk_id,dataset_id,source,token_count,byte_length,domain,language,band,source_doc_id,source_url,t1_file_path
ch_001,books,books,2048,6463,literature,en,B0,part-00000-...parquet,s3://...,s3://t1-raw/.../file.parquet
```

## Configuration Examples
@@ -204,8 +204,9 @@ Each row/object contains:
- **source**: Original dataset source label when provided (often same as dataset_id)
- **source_doc_id**: Document source file name
- **source_url**: URL if available
- **t1_file_path**: Path/URI to the original raw source file for this chunk, as recorded by the T1 dataset team

* source_url+source_doc_id --> Leads to the source dataset file and then use chunk_id to pull the exact record data (Raw dataset)
* `t1_file_path` and/or `source_url`+`source_doc_id` can be used for traceability to the source dataset; use `chunk_id` to locate the exact record within that source.

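The traceability rule above can be sketched as a small helper: prefer `t1_file_path` when the T1 team supplied it, otherwise fall back to `source_url` plus `source_doc_id`. The helper name and paths are illustrative, not part of the pipeline:

```python
# Sketch: choosing a traceability target for a selected-indices record.
# Prefers `t1_file_path`; falls back to source_url + source_doc_id.
# Use `chunk_id` to locate the exact record inside the returned source.

def trace_target(record: dict):
    """Return the best URI/path for locating the raw source of a chunk."""
    if record.get("t1_file_path"):
        return record["t1_file_path"]
    if record.get("source_url") and record.get("source_doc_id"):
        return f'{record["source_url"]}/{record["source_doc_id"]}'
    return None


record = {
    "chunk_id": "ch_001",
    "source_url": "s3://bucket/dataset",
    "source_doc_id": "part-00000.parquet",
    "t1_file_path": "s3://t1-raw/example/file.parquet",
}
```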
## Performance Comparison

@@ -13,7 +13,7 @@ Defines the **expected upstream chunk schema** consumed by the coreset pipeline
> 1) **Metadata-only chunk pool** (like `data/outputv2/b0_shard_0.jsonl`): IDs + band/domain/language + counts + band probabilities/scores.
> 2) **Text/tokens present**: includes `chunk_text` and/or `token_ids` to enable real dedup and richer diversity scoring.
>
> Band inference is controlled by the streaming entrypoint flags `--band-inference` and `--band-score-source` (sometimes described informally as “band inference” / “band source score”).

### Required fields

@@ -37,6 +37,7 @@ Defines the **expected upstream chunk schema** consumed by the coreset pipeline
| `dataset_id` (or `source`) | string | Traceability/output | JSONL defaults to `"ds"` (aliases: `dataset_id` or `source` or `metadata.source`) |
| `byte_length` | int | Traceability/output | Defaults to `0` |
| `source_doc_id` | string | Traceability/output | Should be provided; otherwise empty/missing propagates |
| `t1_file_path` | string | Traceability/output | Optional. When provided by the T1 dataset team, points to the original raw source file containing the record (distinct from `source_doc_id`, which is typically the processed Parquet part filename). |
| `source_url` | string | Traceability/output | Optional |
| `quality_flags` | list[str] | Output metadata | Defaults to `[]` |
| `sensitive_markers` | list[str] | Output metadata | Defaults to `[]` |
@@ -48,7 +49,7 @@ This file is a **metadata-only** chunk pool: it does **not** include `chunk_text`

### Columns present (verbatim)

`agentic_score`, `band`, `band_p_B0`, `band_p_B1`, `band_p_B2`, `band_p_B3`, `band_p_B4`, `band_p_B5`, `band_score`, `byte_length`, `chunk_id`, `code_score`, `compression_ratio`, `cot_score`, `difficulty_score`, `domain`, `fertility_estimate`, `has_agentic`, `has_code`, `has_cot`, `has_reasoning`, `language`, `math_score`, `reasoning_score`, `source`, `source_doc_id`, `source_url`, `token_count_estimate`, `unique_token_ratio`, `word_count`.
`agentic_score`, `band`, `band_p_B0`, `band_p_B1`, `band_p_B2`, `band_p_B3`, `band_p_B4`, `band_p_B5`, `band_score`, `byte_length`, `chunk_id`, `code_score`, `compression_ratio`, `cot_score`, `difficulty_score`, `domain`, `fertility_estimate`, `has_agentic`, `has_code`, `has_cot`, `has_reasoning`, `language`, `math_score`, `reasoning_score`, `source`, `source_doc_id`, `source_url`, `t1_file_path`, `token_count_estimate`, `unique_token_ratio`, `word_count`.

### What the pipeline consumes from these columns

@@ -63,7 +64,8 @@ This file is a **metadata-only** chunk pool: it does **not** include `chunk_text`
- `band` → `ChunkMetadata.band`
- `source_doc_id` → `ChunkMetadata.source_doc_id`
- `source_url` → `ChunkMetadata.source_url`
- `band_score` → attached dynamically as `metadata.band_score` (used for ranking when present)
- `t1_file_path` → attached dynamically as `ChunkMetadata.t1_file_path` (propagates to selected-indices output when present)
- `band_score` → attached dynamically as `ChunkMetadata.band_score` (used for ranking when present)

When running the streaming entrypoint `coreset_builder.py` with `--band-inference` enabled (anything other than `none`), the builder may also read `difficulty_score` and/or `band_p_B0..band_p_B6` (per `--band-score-source`) to:

@@ -76,12 +78,12 @@ Other fields in this file (e.g., `has_code`, `*_score`, `word_count`, `unique_to

Flat record (recommended):
```json
{"chunk_id":"ch_001","dataset_id":"books","token_count_estimate":2048,"byte_length":9876,"domain":"clean_web","language":"en","band":"B2","source_doc_id":"part-00000","source_url":"s3://...","token_ids":[1,2,3]}
{"chunk_id":"ch_001","dataset_id":"books","token_count_estimate":2048,"byte_length":9876,"domain":"clean_web","language":"en","band":"B2","source_doc_id":"part-00000","source_url":"s3://...","t1_file_path":"s3://t1-raw/.../file.parquet","token_ids":[1,2,3]}
```

Nested metadata (accepted):
```json
{"uid":"ch_001","token_count":2048,"metadata":{"source":"books","domain":"clean_web","language":"en","band":"B2","source_doc_id":"part-00000"}}
{"uid":"ch_001","token_count":2048,"metadata":{"source":"books","domain":"clean_web","language":"en","band":"B2","source_doc_id":"part-00000","t1_file_path":"s3://t1-raw/.../file.parquet"}}
```
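Since both shapes are accepted, a loader has to normalize the nested form into the recommended flat record. A sketch under the field aliases shown above (`uid` → `chunk_id`, `token_count` vs `token_count_estimate`, `metadata.source` → `dataset_id`, JSONL default `"ds"`); the helper itself is illustrative:

```python
# Sketch: normalizing the accepted nested-metadata shape into the
# recommended flat record, following the alias rules in the tables above.

def normalize_chunk(record: dict) -> dict:
    meta = record.get("metadata", {})
    return {
        "chunk_id": record.get("chunk_id") or record.get("uid"),
        "dataset_id": record.get("dataset_id")
        or record.get("source")
        or meta.get("source", "ds"),
        "token_count_estimate": record.get("token_count_estimate")
        or record.get("token_count"),
        "domain": record.get("domain") or meta.get("domain"),
        "language": record.get("language") or meta.get("language"),
        "band": record.get("band") or meta.get("band"),
        "source_doc_id": record.get("source_doc_id")
        or meta.get("source_doc_id", ""),
        "t1_file_path": record.get("t1_file_path") or meta.get("t1_file_path"),
    }


nested = {
    "uid": "ch_001",
    "token_count": 2048,
    "metadata": {"source": "books", "domain": "clean_web", "language": "en",
                 "band": "B2", "source_doc_id": "part-00000"},
}
flat = normalize_chunk(nested)
```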

## Parquet: minimum viable columns
@@ -90,4 +92,4 @@
- `chunk_id`, `dataset_id`, `domain`, `language`, `band`, `byte_length`, `source_doc_id`, and one of `token_count`/`token_count_estimate`

Optional columns:
- `source_url`, `quality_flags`, `sensitive_markers`, `start_offset`, `token_ids`
- `source_url`, `t1_file_path`, `quality_flags`, `sensitive_markers`, `start_offset`, `token_ids`
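The required/optional split above lends itself to a pre-flight check before handing a Parquet pool to the pipeline. A sketch on a plain column list — in practice the names would come from something like `pyarrow.parquet.ParquetFile(path).schema.names` (an assumption; the check itself is just set arithmetic):

```python
# Sketch: validating that a Parquet chunk pool exposes the minimum
# viable columns listed above. Either token_count or token_count_estimate
# satisfies the token-count requirement.

REQUIRED = {"chunk_id", "dataset_id", "domain", "language", "band",
            "byte_length", "source_doc_id"}
TOKEN_COUNT_ALIASES = {"token_count", "token_count_estimate"}


def missing_columns(columns) -> set:
    """Return the set of required columns absent from `columns`."""
    cols = set(columns)
    missing = REQUIRED - cols
    if not cols & TOKEN_COUNT_ALIASES:
        missing.add("token_count|token_count_estimate")
    return missing


ok_cols = ["chunk_id", "dataset_id", "domain", "language", "band",
           "byte_length", "source_doc_id", "token_count_estimate",
           "source_url", "t1_file_path"]
```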