Description
Tool: OpenCSG scrape URL data preprocess (Preprocess), an LLM-based tool for scraping data from websites and local documents (XML, HTML, JSON, etc.).
2026-01-04 02:44:26 | INFO | data_engine.utils.logger_utils:144 - Create logger ID 3 with loglevel: INFO, export to /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/output/log/tool_opencsg_scrape_url_data_preprocess_internal_time_20260104024426.txt
2026-01-04 02:44:26 | INFO | data_engine.core.executor_tools:52 - Preparing tool...
2026-01-04 02:44:26 | INFO | data_engine.tools.base_tool:44 - Setting up data ingester...
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:30 - Using dataset_path: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input, repo:longrui/Test1, branch:main
2026-01-04 02:44:26 | INFO | data_engine.tools.base_tool:55 - Preparing exporter...
2026-01-04 02:44:26 | INFO | data_engine.core.executor_tools:59 - Launching tool...
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:41 - model_id:longrui/Test1
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:43 - endpoint:http://modelhub.cmr-co.com
2026-01-04 02:44:26 | INFO | data_engine.ingester.csghub_ingester:44 - 入参:repo_id:longrui/Test1, repo_type:dataset, revision:main, cache_dir:/data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input, endpoint:http://modelhub.cmr-co.com, token:b2bc8452d426461d8e4aac51b82fdebc
Downloading .gitattributes: 0%| | 0.00/2.25k [00:00<?, ?B/s]
Downloading .gitattributes: 100%|##########| 2.25k/2.25k [00:00<00:00, 2.77MB/s]
Downloading README.md: 0%| | 0.00/30.0 [00:00<?, ?B/s]
Downloading README.md: 100%|##########| 30.0/30.0 [00:00<00:00, 44.4kB/s]
Downloading excel.xlsx: 0%| | 0.00/17.4k [00:00<?, ?B/s]
Downloading excel.xlsx: 100%|##########| 17.4k/17.4k [00:00<00:00, 21.5MB/s]
2026-01-04 02:44:27 | INFO | data_engine.ingester.csghub_ingester:54 - result: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input, _src_path: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input
2026-01-04 02:44:27 | INFO | data_engine.tools.base_tool:95 - Data ingested from /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/input
_accelerator 5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
2026-01-04 02:44:27 | DEBUG | data_engine.tools.base_tool:137 - Op [opencsg_scrape_url_data_preprocess_internal] running with number of procs:3
2026-01-04 02:44:27 | INFO | data_engine.tools.base_tool:109 - Processing tool...
2026-01-04 02:44:27 | INFO | data_engine.tools.legacies.opencsg_scrapegraphai:61 - target_dir: /data/dataflow/urlTest_9503623a-2e60-475d-aa10-e64ebe5bad73/output/_df_dataset.jsonl/_data, url: https://top.baidu.com/board?tab=realtime, prompt: Give me all the news with their abstracts
2026-01-04 02:44:27 | ERROR | data_server.job.JobExecutor:107 - Job 117 execution failed with error: argument of type 'NoneType' is not iterable
2026-01-04 02:44:27 | INFO | data_server.job.JobExecutor:117 - Job 117 marked as FAILED
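
For context, the text in the final ERROR line is what Python's `TypeError` reports when the `in` operator is applied to `None`. The log does not show which value was `None` or where in the tool the check happens, so the sketch below is only a minimal illustration of the failure mode and one possible guard. The names `has_key`, `unguarded_error`, and `config` are hypothetical and do not come from the data_engine code.

```python
from typing import Optional


def has_key(config: Optional[dict], key: str) -> bool:
    """Membership test that tolerates a missing (None) container."""
    return config is not None and key in config


def unguarded_error(config: Optional[dict], key: str) -> Optional[str]:
    """Run the unguarded membership test and return the error text, if any."""
    try:
        _ = key in config  # raises TypeError when config is None
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Hypothetical case: an optional setting was never populated.
    print(unguarded_error(None, "url"))
    print(has_key(None, "url"))
```

If the real cause is an unset parameter or an empty response, a guard of this kind would only turn the crash into a clearer failure; the full traceback is still needed to locate the actual `None` value.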