@nabinchha
Contributor

@nabinchha nabinchha commented Nov 4, 2025

Fix for: #2

Users should be able to point to a local folder with a wildcard pattern like so (for parquet, json, jsonl, csv). If we follow this pattern, duckdb can read all matching files for each of these extensions.

parquet_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.parquet")
json_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.json")
jsonl_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.jsonl")
csv_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.csv")
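
As a rough illustration of what accepting such a pattern involves, the sketch below expands a wildcard path and rejects unsupported extensions. It uses only the standard library; `resolve_seed_files` and `SUPPORTED_EXTENSIONS` are hypothetical names for this example and are not taken from the PR, and the actual validation in the codebase may differ.

```python
import glob
from pathlib import Path

# Extensions named in the PR description; the constant name is hypothetical.
SUPPORTED_EXTENSIONS = {".parquet", ".json", ".jsonl", ".csv"}


def resolve_seed_files(dataset: str) -> list[str]:
    """Hypothetical helper: expand a seed path or wildcard pattern into files.

    Raises ValueError for an unsupported extension and FileNotFoundError
    when the pattern matches nothing.
    """
    suffix = Path(dataset).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported seed dataset extension {suffix!r}; "
            f"expected one of {sorted(SUPPORTED_EXTENSIONS)}"
        )
    # Sorting is an assumption made here so the result is deterministic.
    matches = sorted(glob.glob(dataset))
    if not matches:
        raise FileNotFoundError(f"No files match {dataset!r}")
    return matches
```

A plain file path without a wildcard passes through the same function, since `glob.glob` returns the path itself when it exists.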

#8 should merge first.

We need this for BigIron to support any sort of non-trivial partitioned seed dataset. We currently have a workaround in BigIron that consolidates partitions into one file, but that does not scale at all.

Example preview result with the seed dataset pointing to "csv/*.csv":

[11:36:46] [INFO] 0️⃣ Using the first matching file in 'csv/*.csv' to determine column names in seed dataset
[11:36:46] [INFO] 🕵️ Preview generation in progress
[11:36:46] [INFO] ✅ Validation passed
[11:36:46] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[11:36:46] [INFO] 🩺 Running health checks for models...
[11:36:46] [INFO]   |-- 👀 Checking 'nvidia/nvidia-nemotron-nano-9b-v2'...
col_names: ['language', 'greetings', 'name']
[11:36:47] [INFO]   |-- ✅ Passed!
[11:36:47] [INFO] 🌱 Sampling 10 records from seed dataset
[11:36:47] [INFO]   |-- seed dataset size: 22 records
[11:36:47] [INFO]   |-- sampling strategy: shuffle
[11:36:47] [INFO]   |-- selection: partition 2 of 3
[11:36:47] [INFO]   |-- seed dataset size after selection: 7 records
[11:36:47] [INFO] 📝 Preparing llm-text column generation
[11:36:47] [INFO]   |-- column name: 'greetings_completion'
[11:36:47] [INFO]   |-- model config:
{
    "alias": "nano-v2",
    "model": "nvidia/nvidia-nemotron-nano-9b-v2",
    "inference_parameters": {
        "temperature": 0.5,
        "top_p": null,
        "max_tokens": 2048,
        "max_parallel_requests": 4,
        "timeout": null,
        "extra_body": null
    },
    "provider": null
}
[11:36:47] [INFO]   |-- default model provider: 'nvidia'
[11:36:47] [INFO] 🐙 Processing llm-text column 'greetings_completion' with 4 concurrent workers
[11:37:01] [INFO] 📊 Model usage summary:
{
    "nvidia/nvidia-nemotron-nano-9b-v2": {
        "token_usage": {
            "prompt_tokens": 397,
            "completion_tokens": 5102,
            "total_tokens": 5499
        },
        "request_usage": {
            "successful_requests": 10,
            "failed_requests": 0,
            "total_requests": 10
        },
        "tokens_per_second": 380,
        "requests_per_minute": 41
    }
}
[11:37:01] [INFO] 📐 Measuring dataset column statistics:
[11:37:01] [INFO]   |-- 📝 column: 'greetings_completion'
[11:37:01] [INFO]   |-- 🌱 column: 'language'
[11:37:01] [INFO]   |-- 🌱 column: 'greetings'
[11:37:01] [INFO]   |-- 🌱 column: 'name'
[11:37:01] [INFO] 🎉 Preview complete!
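
The first log line above says column names come from the first matching file. A minimal sketch of that idea for the CSV case, using only the standard library, is below. The function name `first_file_column_names` is hypothetical, and taking the "first" file in sorted order is an assumption; the PR does not state how the first match is chosen.

```python
import csv
import glob


def first_file_column_names(pattern: str) -> list[str]:
    """Hypothetical helper: read the header row of the first CSV matching `pattern`."""
    # Assumption: "first" means first in sorted order of the matched paths.
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No files match {pattern!r}")
    with open(matches[0], newline="", encoding="utf-8") as handle:
        # An empty file yields an empty column list instead of raising.
        return next(csv.reader(handle), [])
```

This only inspects one file, so it relies on every partition sharing the same columns.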

Base automatically changed from nm/seed-config-partition-strategy to main November 4, 2025 23:36
johnnygreco
johnnygreco previously approved these changes Nov 4, 2025
@eric-tramel
Contributor

Big fan here. Thanks, @nabinchha!

eric-tramel
eric-tramel previously approved these changes Nov 5, 2025
@nabinchha nabinchha dismissed stale reviews from eric-tramel and johnnygreco via 690b00f November 5, 2025 18:23
@nabinchha
Contributor Author

nabinchha commented Nov 5, 2025

Found another validation fix we need to make while testing a notebook:
690b00f cc @johnnygreco

As is, we will not support wildcard seed paths when the seed dataset needs to be uploaded to a datastore.

@nabinchha nabinchha requested a review from johnnygreco November 5, 2025 18:27
@nabinchha nabinchha requested a review from eric-tramel November 5, 2025 18:39
johnnygreco
johnnygreco previously approved these changes Nov 5, 2025
@nabinchha nabinchha merged commit d01e5bf into main Nov 5, 2025
10 checks passed
@nabinchha nabinchha deleted the nabinchha/bug/2-support-seed-path-with-partition-files branch November 5, 2025 19:37
4 participants