
Tool pool: "Raw arXiv data to JSONL" conversion bug, run fails #121

@zanguixuan3

Description


2026-01-04 02:54:32 | INFO | data_engine.utils.logger_utils:144 - Create logger ID 3 with loglevel: INFO, export to /data/dataflow/arXiv2jsontools4_9b73afdf-f297-431b-ad23-b157a648a50f/output/log/tool_raw_arxiv_to_jsonl_preprocess_internal_time_20260104025432.txt
2026-01-04 02:54:32 | INFO | data_engine.core.executor_tools:52 - Preparing tool...
2026-01-04 02:54:32 | INFO | data_engine.tools.base_tool:44 - Setting up data ingester...
2026-01-04 02:54:32 | INFO | data_engine.ingester.csghub_ingester:30 - Using dataset_path: /data/dataflow/arXiv2jsontools4_9b73afdf-f297-431b-ad23-b157a648a50f/input, repo:longrui/tools4, branch:main
2026-01-04 02:54:32 | INFO | data_engine.tools.base_tool:55 - Preparing exporter...
2026-01-04 02:54:32 | INFO | data_engine.core.executor_tools:59 - Launching tool...
2026-01-04 02:54:32 | INFO | data_engine.ingester.csghub_ingester:41 - model_id:longrui/tools4
2026-01-04 02:54:32 | INFO | data_engine.ingester.csghub_ingester:43 - endpoint:http://modelhub.cmr-co.com
2026-01-04 02:54:32 | INFO | data_engine.ingester.csghub_ingester:44 - Input parameters: repo_id:longrui/tools4, repo_type:dataset, revision:main, cache_dir:/data/dataflow/arXiv2jsontools4_9b73afdf-f297-431b-ad23-b157a648a50f/input, endpoint:http://modelhub.cmr-co.com, token:b2bc8452d426461d8e4aac51b82fdebc

Downloading .gitattributes: 0%| | 0.00/2.25k [00:00<?, ?B/s]
Downloading .gitattributes: 100%|##########| 2.25k/2.25k [00:00<00:00, 2.76MB/s]

Downloading README.md: 0%| | 0.00/25.0 [00:00<?, ?B/s]
Downloading README.md: 100%|##########| 25.0/25.0 [00:00<00:00, 24.9kB/s]

Downloading open.tar.gz: 0%| | 0.00/1.78M [00:00<?, ?B/s]
Downloading open.tar.gz: 100%|##########| 1.78M/1.78M [00:00<00:00, 61.9MB/s]
2026-01-04 02:54:33 | INFO | data_engine.ingester.csghub_ingester:54 - result: /data/dataflow/arXiv2jsontools4_9b73afdf-f297-431b-ad23-b157a648a50f/input, _src_path: /data/dataflow/arXiv2jsontools4_9b73afdf-f297-431b-ad23-b157a648a50f/input
2026-01-04 02:54:33 | INFO | data_engine.tools.base_tool:95 - Data ingested from /data/dataflow/arXiv2jsontools4_9b73afdf-f297-431b-ad23-b157a648a50f/input
_accelerator 5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
2026-01-04 02:54:33 | DEBUG | data_engine.tools.base_tool:137 - Op [raw_arxiv_to_jsonl_preprocess_internal] running with number of procs:3
2026-01-04 02:54:33 | INFO | data_engine.tools.base_tool:109 - Processing tool...
_accelerator -5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5
2026-01-04 02:54:33 | INFO | data_engine.tools.base_tool:114 - Tool are done in 0.128s.
2026-01-04 02:54:33 | INFO | data_engine.tools.base_tool:121 - Exporting dataset to somewhere...
2026-01-04 02:54:33 | INFO | data_engine.exporter.csghub_exporter:94 - The target dir is empty, no need to upload anything, abort.
2026-01-04 02:54:33 | WARNING | data_server.job.JobExecutor:127 - Job 119 still in PROCESSING state in finally block, marking as FAILED
Follow-up dataset attachment:

yanshishujuji.zip
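
The log above shows the tool finishing in 0.128s and the exporter aborting because the target directory is empty, so one thing worth checking is whether the downloaded archive contains any files the tool could process and whether anything was written to the output directory. The following is a minimal triage sketch, not part of data_engine: the helper names `list_archive_members` and `is_dir_empty` are hypothetical, and it assumes the input is an ordinary gzip-compressed tar like the `open.tar.gz` seen in the log.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the names of regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]


def is_dir_empty(dir_path):
    """Return True if the directory is missing or contains no entries."""
    path = Path(dir_path)
    return not path.is_dir() or not any(path.iterdir())
```

Calling `list_archive_members` on the downloaded `open.tar.gz` and `is_dir_empty` on the job's `output` directory would show whether the archive layout or the tool's processing step is the more likely reason that no JSONL was produced.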

Labels: P0, bug