Tool pool: data mixture, dataset tool12 #128

@zanguixuan3

Description

2025-12-31 02:01:44 | INFO | data_engine.utils.logger_utils:144 - Create logger ID 3 with loglevel: INFO, export to /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/log/tool_data_mixture_postprocess_internal_time_20251231020144.txt
2025-12-31 02:01:44 | INFO | data_engine.core.executor_tools:52 - Preparing tool...
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:44 - Setting up data ingester...
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:30 - Using dataset_path: /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input, repo:longrui/tools12, branch:main
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:55 - Preparing exporter...
Parsed as multiple values: ['0.4', '0.2']
Parse result - weights: ['0.4', '0.2'], max_samples: 3
2025-12-31 02:01:44 | INFO | data_engine.core.executor_tools:59 - Launching tool...
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:41 - model_id:longrui/tools12
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:43 - endpoint:http://modelhub.cmr-co.com
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:44 - Input parameters: repo_id:longrui/tools12, repo_type:dataset, revision:main, cache_dir:/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input, endpoint:http://modelhub.cmr-co.com, token:b2bc8452d426461d8e4aac51b82fdebc

Downloading .gitattributes: 0%| | 0.00/2.34k [00:00<?, ?B/s]
Downloading .gitattributes: 100%|##########| 2.34k/2.34k [00:00<00:00, 2.82MB/s]

Downloading README.md: 0%| | 0.00/25.0 [00:00<?, ?B/s]
Downloading README.md: 100%|##########| 25.0/25.0 [00:00<00:00, 34.8kB/s]

Downloading data1.jsonl: 0%| | 0.00/813 [00:00<?, ?B/s]
Downloading data1.jsonl: 100%|##########| 813/813 [00:00<00:00, 1.27MB/s]

Downloading data2.parquet: 0%| | 0.00/1.09k [00:00<?, ?B/s]
Downloading data2.parquet: 100%|##########| 1.09k/1.09k [00:00<00:00, 1.61MB/s]
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:54 - result: /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input, _src_path: /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:95 - Data ingested from /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input
_accelerator 5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
2025-12-31 02:01:44 | DEBUG | data_engine.tools.base_tool:137 - Op [data_mixture_postprocess_internal] running with number of procs:3
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:109 - Processing tool...
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: []
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:99 - handling files: {'.jsonl': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data1.jsonl']}
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: []
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:99 - handling files: {'.parquet': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data2.parquet']}
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: ['.jsonl']
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:99 - handling files: {'.jsonl': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data1.jsonl']}

Generating jsonl split: 0 examples [00:00, ? examples/s]
Generating jsonl split: 3 examples [00:00, 289.16 examples/s]
2025-12-31 02:01:45 | INFO | data_engine.format.formatter:187 - Unifying the input dataset formats...
2025-12-31 02:01:45 | INFO | data_engine.format.formatter:202 - There are 3 sample(s) in the original dataset.

Filter (num_proc=3): 0%| | 0/3 [00:00<?, ? examples/s]
Filter (num_proc=3): 67%|######6 | 2/3 [00:00<00:00, 18.10 examples/s]
Filter (num_proc=3): 100%|##########| 3/3 [00:00<00:00, 12.65 examples/s]
2025-12-31 02:01:46 | INFO | data_engine.format.formatter:216 - 3 samples left after filtering empty text.
2025-12-31 02:01:46 | WARNING | data_engine.format.formatter:265 - No global config passed into unify_format function. Relative paths in the dataset might not be converted to their absolute versions. Data of other modalities might not be able to find by Data-Juicer.
2025-12-31 02:01:46 | INFO | data_engine.format.mixture_formatter:137 - sampled 2 from 3
2025-12-31 02:01:46 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: ['.parquet']
2025-12-31 02:01:46 | INFO | data_engine.utils.file_utils:99 - handling files: {'.parquet': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data2.parquet']}

Generating parquet split: 0 examples [00:00, ? examples/s]
Generating parquet split: 6 examples [00:00, 654.30 examples/s]
2025-12-31 02:01:47 | INFO | data_engine.format.formatter:187 - Unifying the input dataset formats...
2025-12-31 02:01:47 | INFO | data_engine.format.formatter:202 - There are 6 sample(s) in the original dataset.

Filter (num_proc=3): 0%| | 0/6 [00:00<?, ? examples/s]
Filter (num_proc=3): 100%|##########| 6/6 [00:00<00:00, 56.32 examples/s]
Filter (num_proc=3): 100%|##########| 6/6 [00:00<00:00, 26.87 examples/s]
2025-12-31 02:01:47 | INFO | data_engine.format.formatter:216 - 6 samples left after filtering empty text.
2025-12-31 02:01:47 | WARNING | data_engine.format.formatter:265 - No global config passed into unify_format function. Relative paths in the dataset might not be converted to their absolute versions. Data of other modalities might not be able to find by Data-Juicer.
2025-12-31 02:01:47 | INFO | data_engine.format.mixture_formatter:137 - sampled 1 from 6
2025-12-31 02:01:47 | INFO | data_engine.format.mixture_formatter:143 - There are 3 in final dataset
2025-12-31 02:01:47 | INFO | data_engine.exporter.base_exporter:161 - Export dataset into a single file...

Creating json from Arrow format: 0%| | 0/1 [00:00<?, ?ba/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 72.56ba/s]
_accelerator -5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5
2025-12-31 02:01:47 | INFO | data_engine.tools.base_tool:114 - Tool are done in 2.992s.
2025-12-31 02:01:47 | INFO | data_engine.tools.base_tool:121 - Exporting dataset to somewhere...
2025-12-31 02:01:47 | INFO | data_engine.exporter.csghub_exporter:97 - Start to upload /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data to repo: longrui/tools12 with branch: main
2025-12-31 02:01:47 | INFO | data_engine.exporter.csghub_exporter:200 - repo longrui/tools12 all branches: ['main', 'refs-convert-parquet']
2025-12-31 02:01:47 | INFO | data_engine.exporter.csghub_exporter:153 - Start to push /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data to repo: longrui/tools12 with branch: v1,user_name: longrui, token: b2bc8452d426461d8e4aac51b82fdebc
2025-12-31 02:02:02 | INFO | data_engine.exporter.csghub_exporter:166 - Done push /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data to repo: longrui/tools12 with branch: v1
2025-12-31 02:02:02 | INFO | data_engine.exporter.csghub_exporter:169 - Remove /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_git
2025-12-31 02:02:02 | INFO | data_engine.exporter.csghub_exporter:172 - Remove /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data
2025-12-31 02:02:02 | WARNING | data_server.job.JobExecutor:127 - Job 115 still in PROCESSING state in finally block, marking as FAILED
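
The final WARNING says the job was still in PROCESSING when the executor's finally block ran, so it was marked FAILED even though the push had finished. The `JobExecutor` source is not part of this issue, so the following is only a minimal sketch of one way this symptom can arise and be avoided; `Job`, `JobState`, `run_job`, and the `mark_success` flag are hypothetical names, not the real `data_server` API.

```python
from enum import Enum


class JobState(Enum):
    PROCESSING = "PROCESSING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class Job:
    """Hypothetical stand-in for the job record (not the real data_server model)."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.state = JobState.PROCESSING


def run_job(job, tool, mark_success=True):
    """Run `tool` and make sure the job always leaves the PROCESSING state.

    `mark_success=False` imitates a success path that forgets to update
    the state, which reproduces the symptom seen in the log.
    """
    try:
        tool()
        if mark_success:
            # Record success as soon as the tool returns without raising.
            job.state = JobState.SUCCEEDED
    except Exception:
        job.state = JobState.FAILED
        raise
    finally:
        # Safety net, assumed to resemble the check behind the WARNING:
        # anything still PROCESSING at this point is treated as failed.
        if job.state is JobState.PROCESSING:
            job.state = JobState.FAILED
    return job.state


ok_job = Job(115)
print(run_job(ok_job, lambda: None))                      # JobState.SUCCEEDED

bad_job = Job(115)
print(run_job(bad_job, lambda: None, mark_success=False))  # JobState.FAILED
```

If the real executor follows a similar pattern, checking whether the success branch updates the job state before the finally block runs would be a reasonable first step.
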
Same as the previous issue.
The output file was written successfully, but the log does not show a fully successful run.
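
The sampling counts in the log (2 of 3 rows from data1.jsonl, 1 of 6 rows from data2.parquet, 3 rows in the final dataset) suggest that the data path itself behaved as configured. As a quick sanity check, the sketch below shows one allocation rule that reproduces those counts from `weights: ['0.4', '0.2']` and `max_samples: 3`. The rule (normalize the weights, round up, cap by dataset size and by the remaining budget) is an assumption chosen to match the log, and `allocate_samples` is a hypothetical helper, not a function from `data_engine`.

```python
import math
from fractions import Fraction


def allocate_samples(weights, sizes, max_samples):
    """Split `max_samples` across datasets in proportion to `weights`.

    Weights are given as strings (as they appear in the log) and parsed
    exactly with Fraction to avoid floating-point rounding surprises.
    """
    parsed = [Fraction(w) for w in weights]
    total = sum(parsed)
    remaining = max_samples
    counts = []
    for weight, size in zip(parsed, sizes):
        # Proportional share of the budget, rounded up.
        wanted = math.ceil(max_samples * weight / total)
        # Never take more than the dataset holds or the budget allows.
        take = min(wanted, size, remaining)
        counts.append(take)
        remaining -= take
    return counts


counts = allocate_samples(["0.4", "0.2"], sizes=[3, 6], max_samples=3)
print(counts, sum(counts))  # [2, 1] 3
```

Under this assumed rule the result matches the "sampled 2 from 3", "sampled 1 from 6", and "There are 3 in final dataset" lines, which is consistent with the report that only the final job status is wrong.
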

Metadata

Assignees: No one assigned
Labels: P0, bug (Something isn't working)
