Open
Labels
question (Further information is requested)
Description
Before Asking
- I have pulled the latest code from the main branch and run it again, and the problem still exists.
Search before asking
Question
project_name:
dataset_path:
export_path:
np: 64
text_keys: 'text'
open_tracer: true
trace_num: 1000000000000
process:
  - document_minhash_deduplicator:
      jaccard_threshold: 0.7
      window_size: 10
      num_permutations: 128
      tokenization: 'character'
      num_cpus: 64
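My config has only one operator, document_minhash_deduplicator. During the run, the CPU is fully utilized only in the compute_hash stage; in every other stage CPU utilization is extremely low, only about 3%. Why is that? Deduplicating 1.5 million samples takes nearly 2 days.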
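For readers unfamiliar with the parameters in the config above, here is a minimal Python sketch of what they typically mean in MinHash deduplication: character shingles of length window_size, a signature of num_permutations minimum hash values, and an estimated Jaccard similarity compared against jaccard_threshold. This is not Data-Juicer's implementation; the function names, the SHA-1 base hash, and the affine permutation scheme are all assumptions made for illustration only.

```python
import hashlib
import random

# Illustrative constants (assumptions, not taken from Data-Juicer).
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def char_shingles(text, window_size=10):
    """Return the set of character n-grams of length window_size."""
    if len(text) <= window_size:
        return {text}
    return {text[i:i + window_size] for i in range(len(text) - window_size + 1)}


def make_permutations(num_permutations=128, seed=42):
    """Draw (a, b) pairs for affine hash functions h -> (a*h + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randint(1, MERSENNE_PRIME - 1), rng.randint(0, MERSENNE_PRIME - 1))
        for _ in range(num_permutations)
    ]


def base_hash(shingle):
    """Map a shingle to a 32-bit integer using the first 4 bytes of SHA-1."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash_signature(text, perms, window_size=10):
    """Compute one minimum per permutation over all shingle hashes."""
    hashes = [base_hash(s) for s in char_shingles(text, window_size)]
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    perms = make_permutations(num_permutations=128)
    doc_a = "Data deduplication removes near-identical documents from a corpus."
    doc_b = "Data deduplication removes near-identical documents from a dataset."
    sim = estimated_jaccard(
        minhash_signature(doc_a, perms), minhash_signature(doc_b, perms)
    )
    # Pairs at or above jaccard_threshold (0.7 in the config) would be duplicates.
    print(f"estimated Jaccard: {sim:.2f}, duplicate: {sim >= 0.7}")
```

In this sketch each signature depends on a single document, so that step can be spread across worker processes with no coordination. Grouping similar signatures into duplicate clusters needs a view across documents, which may be one reason the later stages behave differently from compute_hash; the source does not confirm that this is the cause here.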
Additional
No response