
OpenCSG Scrape URL Data Preprocessing BUG #120

@zanguixuan3

Description

Tool: OpenCSG Scrape URL Data Preprocessing (Preprocess). An LLM-based data scraping tool for websites and local documents (XML, HTML, JSON, etc.).

2026-01-04 02:44:26 | INFO | data_engine.utils.logger_utils:144 - Create logger ID 3 with loglevel: INFO, export to /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/output/log/tool_opencsg_scrape_url_data_preprocess_internal_time_20260104024426.txt
2026-01-04 02:44:26 | INFO | data_engine.core.executor_tools:52 - Preparing tool...
2026-01-04 02:44:26 | INFO | data_engine.tools.base_tool:44 - Setting up data ingester...
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:30 - Using dataset_path: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input, repo:longrui/Test1, branch:main
2026-01-04 02:44:26 | INFO | data_engine.tools.base_tool:55 - Preparing exporter...
2026-01-04 02:44:26 | INFO | data_engine.core.executor_tools:59 - Launching tool...
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:41 - model_id:longrui/Test1
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:43 - endpoint:http://modelhub.cmr-co.com
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:44 - 入参:repo_id:longrui/Test1, repo_type:dataset, revision:main, cache_dir:/data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input, endpoint:http://modelhub.cmr-co.com, token:b2bc8452d426461d8e4aac51b82fdebc

Downloading .gitattributes: 0%| | 0.00/2.25k [00:00<?, ?B/s]
Downloading .gitattributes: 100%|##########| 2.25k/2.25k [00:00<00:00, 2.77MB/s]

Downloading README.md: 0%| | 0.00/30.0 [00:00<?, ?B/s]
Downloading README.md: 100%|##########| 30.0/30.0 [00:00<00:00, 44.4kB/s]

Downloading excel.xlsx: 0%| | 0.00/17.4k [00:00<?, ?B/s]
Downloading excel.xlsx: 100%|##########| 17.4k/17.4k [00:00<00:00, 21.5MB/s]
2026-01-04 02:44:27 | INFO | data_engine.ingester.csghub_ingester:54 - result: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input, _src_path: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input
2026-01-04 02:44:27 | INFO | data_engine.tools.base_tool:95 - Data ingested from /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input
_accelerator 5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
2026-01-04 02:44:27 | DEBUG | data_engine.tools.base_tool:137 - Op [opencsg_scrape_url_data_preprocess_internal] running with number of procs:3
2026-01-04 02:44:27 | INFO | data_engine.tools.base_tool:109 - Processing tool...
2026-01-04 02:44:27 | INFO | data_engine.tools.legacies.opencsg_scrapegraphai:61 - target_dir: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/output/_df_dataset.jsonl/_data, url: https://top.baidu.com/board?tab=realtime, prompt: Give me all the news with their abstracts
2026-01-04 02:44:27 | ERROR | data_server.job.JobExecutor:107 - Job 117 execution failed with error: argument of type 'NoneType' is not iterable
2026-01-04 02:44:27 | INFO | data_server.job.JobExecutor:117 - Job 117 marked as FAILED
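
For anyone triaging this: the error text in the log is the message CPython typically raises when the `in` operator is applied to `None`. The log does not show a traceback, so which value is `None` inside the tool is unknown. The sketch below only reproduces that class of error and shows one possible guard. The function names and the `"model"` key are hypothetical and are not taken from the data_engine code.

```python
from typing import Optional


def has_key_unsafe(config: Optional[dict], key: str) -> bool:
    # Hypothetical helper: membership test on a value that may be None.
    # Raises TypeError when config is None.
    return key in config


def has_key_safe(config: Optional[dict], key: str) -> bool:
    # Guarded variant: treat a missing (None) mapping as empty.
    return key in (config or {})


def reproduce() -> str:
    # Trigger the failure and return the exception text for inspection.
    try:
        has_key_unsafe(None, "model")
    except TypeError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    print(reproduce())
    print(has_key_safe(None, "model"))
```

Running the job with a full traceback enabled would show which argument is `None` at the failing membership test. A guard like `has_key_safe` only hides the symptom if that value is actually required.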

Labels: P0, bug
