Description
2025-12-31 02:01:44 | INFO | data_engine.utils.logger_utils:144 - Create logger ID 3 with loglevel: INFO, export to /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/log/tool_data_mixture_postprocess_internal_time_20251231020144.txt
2025-12-31 02:01:44 | INFO | data_engine.core.executor_tools:52 - Preparing tool...
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:44 - Setting up data ingester...
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:30 - Using dataset_path: /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input, repo:longrui/tools12, branch:main
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:55 - Preparing exporter...
Parsed as multiple values: ['0.4', '0.2']
Parse result - weights: ['0.4', '0.2'], max_samples: 3
2025-12-31 02:01:44 | INFO | data_engine.core.executor_tools:59 - Launching tool...
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:41 - model_id:longrui/tools12
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:43 - endpoint:http://modelhub.cmr-co.com
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:44 - Input parameters: repo_id:longrui/tools12, repo_type:dataset, revision:main, cache_dir:/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input, endpoint:http://modelhub.cmr-co.com, token:b2bc8452d426461d8e4aac51b82fdebc
Downloading .gitattributes: 0%| | 0.00/2.34k [00:00<?, ?B/s]
Downloading .gitattributes: 100%|##########| 2.34k/2.34k [00:00<00:00, 2.82MB/s]
Downloading README.md: 0%| | 0.00/25.0 [00:00<?, ?B/s]
Downloading README.md: 100%|##########| 25.0/25.0 [00:00<00:00, 34.8kB/s]
Downloading data1.jsonl: 0%| | 0.00/813 [00:00<?, ?B/s]
Downloading data1.jsonl: 100%|##########| 813/813 [00:00<00:00, 1.27MB/s]
Downloading data2.parquet: 0%| | 0.00/1.09k [00:00<?, ?B/s]
Downloading data2.parquet: 100%|##########| 1.09k/1.09k [00:00<00:00, 1.61MB/s]
2025-12-31 02:01:44 | INFO | data_engine.ingester.csghub_ingester:54 - result: /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input, _src_path: /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:95 - Data ingested from /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input
_accelerator 5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
2025-12-31 02:01:44 | DEBUG | data_engine.tools.base_tool:137 - Op [data_mixture_postprocess_internal] running with number of procs:3
2025-12-31 02:01:44 | INFO | data_engine.tools.base_tool:109 - Processing tool...
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: []
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:99 - handling files: {'.jsonl': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data1.jsonl']}
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: []
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:99 - handling files: {'.parquet': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data2.parquet']}
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: ['.jsonl']
2025-12-31 02:01:44 | INFO | data_engine.utils.file_utils:99 - handling files: {'.jsonl': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data1.jsonl']}
Generating jsonl split: 0 examples [00:00, ? examples/s]
Generating jsonl split: 3 examples [00:00, 289.16 examples/s]
2025-12-31 02:01:45 | INFO | data_engine.format.formatter:187 - Unifying the input dataset formats...
2025-12-31 02:01:45 | INFO | data_engine.format.formatter:202 - There are 3 sample(s) in the original dataset.
Filter (num_proc=3): 0%| | 0/3 [00:00<?, ? examples/s]
Filter (num_proc=3): 67%|######6 | 2/3 [00:00<00:00, 18.10 examples/s]
Filter (num_proc=3): 100%|##########| 3/3 [00:00<00:00, 12.65 examples/s]
2025-12-31 02:01:46 | INFO | data_engine.format.formatter:216 - 3 samples left after filtering empty text.
2025-12-31 02:01:46 | WARNING | data_engine.format.formatter:265 - No global config passed into unify_format function. Relative paths in the dataset might not be converted to their absolute versions. Data of other modalities might not be able to find by Data-Juicer.
2025-12-31 02:01:46 | INFO | data_engine.format.mixture_formatter:137 - sampled 2 from 3
2025-12-31 02:01:46 | INFO | data_engine.utils.file_utils:78 - params suffixes defined: ['.parquet']
2025-12-31 02:01:46 | INFO | data_engine.utils.file_utils:99 - handling files: {'.parquet': ['/data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/input/data2.parquet']}
Generating parquet split: 0 examples [00:00, ? examples/s]
Generating parquet split: 6 examples [00:00, 654.30 examples/s]
2025-12-31 02:01:47 | INFO | data_engine.format.formatter:187 - Unifying the input dataset formats...
2025-12-31 02:01:47 | INFO | data_engine.format.formatter:202 - There are 6 sample(s) in the original dataset.
Filter (num_proc=3): 0%| | 0/6 [00:00<?, ? examples/s]
Filter (num_proc=3): 100%|##########| 6/6 [00:00<00:00, 56.32 examples/s]
Filter (num_proc=3): 100%|##########| 6/6 [00:00<00:00, 26.87 examples/s]
2025-12-31 02:01:47 | INFO | data_engine.format.formatter:216 - 6 samples left after filtering empty text.
2025-12-31 02:01:47 | WARNING | data_engine.format.formatter:265 - No global config passed into unify_format function. Relative paths in the dataset might not be converted to their absolute versions. Data of other modalities might not be able to find by Data-Juicer.
2025-12-31 02:01:47 | INFO | data_engine.format.mixture_formatter:137 - sampled 1 from 6
2025-12-31 02:01:47 | INFO | data_engine.format.mixture_formatter:143 - There are 3 in final dataset
2025-12-31 02:01:47 | INFO | data_engine.exporter.base_exporter:161 - Export dataset into a single file...
Creating json from Arrow format: 0%| | 0/1 [00:00<?, ?ba/s]
Creating json from Arrow format: 100%|##########| 1/1 [00:00<00:00, 72.56ba/s]
_accelerator -5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5
2025-12-31 02:01:47 | INFO | data_engine.tools.base_tool:114 - Tool are done in 2.992s.
2025-12-31 02:01:47 | INFO | data_engine.tools.base_tool:121 - Exporting dataset to somewhere...
2025-12-31 02:01:47 | INFO | data_engine.exporter.csghub_exporter:97 - Start to upload /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data to repo: longrui/tools12 with branch: main
2025-12-31 02:01:47 | INFO | data_engine.exporter.csghub_exporter:200 - repo longrui/tools12 all branches: ['main', 'refs-convert-parquet']
2025-12-31 02:01:47 | INFO | data_engine.exporter.csghub_exporter:153 - Start to push /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data to repo: longrui/tools12 with branch: v1,user_name: longrui, token: b2bc8452d426461d8e4aac51b82fdebc
2025-12-31 02:02:02 | INFO | data_engine.exporter.csghub_exporter:166 - Done push /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data to repo: longrui/tools12 with branch: v1
2025-12-31 02:02:02 | INFO | data_engine.exporter.csghub_exporter:169 - Remove /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_git
2025-12-31 02:02:02 | INFO | data_engine.exporter.csghub_exporter:172 - Remove /data/dataflow/数据混合tools12_a30b4aba-67bb-4a71-8f78-4848b5a23fb1/output/_df_dataset.jsonl/_data/_data
2025-12-31 02:02:02 | WARNING | data_server.job.JobExecutor:127 - Job 115 still in PROCESSING state in finally block, marking as FAILED
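For reference, the sample counts in the log above ("sampled 2 from 3", "sampled 1 from 6", 3 in the final dataset) are consistent with normalizing the weights `0.4` and `0.2` and scaling them by `max_samples: 3`. The sketch below only reproduces that arithmetic; `allocate_samples` is a hypothetical helper, and the rounding rule is an assumption rather than the actual logic in `data_engine.format.mixture_formatter`.

```python
def allocate_samples(weights, max_samples):
    """Split max_samples across datasets in proportion to their weights.

    Hypothetical helper: the real mixture formatter may round differently.
    """
    values = [float(w) for w in weights]  # the log shows weights parsed as strings
    total = sum(values)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalize each weight, scale by max_samples, and round to a whole count.
    return [round(max_samples * v / total) for v in values]


counts = allocate_samples(["0.4", "0.2"], 3)
print(counts)  # [2, 1]
```

With these inputs the sketch yields 2 samples from `data1.jsonl` and 1 from `data2.parquet`, matching the log, so the mixing step itself appears to have behaved as configured.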
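Same as the previous issue.
The output file is written successfully, but the log does not report full success: the job is still marked as FAILED.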
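The final WARNING line in the log says the job was still in the PROCESSING state when the `finally` block ran, which would happen if the success path never records a terminal status before the fallback check. The sketch below illustrates that pattern and one possible fix. Every name in it (`JobStatus`, `Job`, `run_job`) is hypothetical; it is not the actual code of `data_server.job.JobExecutor`, and the real cause may differ.

```python
from enum import Enum


class JobStatus(Enum):
    PROCESSING = "PROCESSING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


class Job:
    """Hypothetical stand-in for a job record; starts in PROCESSING."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.status = JobStatus.PROCESSING


def run_job(job, task):
    """Run task and make sure the job always ends in a terminal state."""
    try:
        task()
        # Record success inside the try block. If this assignment is missing,
        # the fallback in `finally` marks a successful run as FAILED.
        job.status = JobStatus.FINISHED
    except Exception:
        job.status = JobStatus.FAILED
        raise
    finally:
        # Safety net only: reached with PROCESSING when no branch set a status.
        if job.status is JobStatus.PROCESSING:
            job.status = JobStatus.FAILED
    return job.status


job = Job(115)
print(run_job(job, lambda: None))  # JobStatus.FINISHED
```

If the executor relies on a separate callback or database update to set the finished state, checking whether that update runs before the `finally` block would be a reasonable first step.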