
Commit d4ab729: add more demos (#12)

Parent: 7a06df6

20 files changed: +1413 −22 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -203,7 +203,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 - Data Processing:
   - Scientific Literature (e.g. [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sci_data/summary)]
   - Programming Code (e.g. [TheStack](https://huggingface.co/datasets/bigcode/the-stack)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_code_data/summary)]
-  - Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/sft_data_zh/summary)]
+  - Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sft_zh_data/summary)]
 - Tool Pool:
   - Dataset Splitting by Language [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_dataset_splitting_by_language/summary)]
   - Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]
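The command in the hunk header above illustrates Data-Juicer's dotted CLI overrides: any key in the YAML config can be overridden on the command line (here, the `lang` parameter of the `language_id_score_filter` operator). A minimal sketch of what such a config might look like; the exact contents of `configs/demo/process.yaml` are not shown in this commit, so the keys below are assumptions for illustration:

```yaml
# Illustrative fragment only -- keys other than the operator name and its
# `lang` parameter are assumptions, not quoted from the repository.
process:
  - language_id_score_filter:
      lang: en        # overridable at runtime via --language_id_score_filter.lang
      min_score: 0.8  # keep samples whose language-id confidence is at least 0.8
```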

README_ZH.md

Lines changed: 1 addition & 1 deletion
(translated from Chinese)

@@ -200,7 +200,7 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
 * Data Processing:
   * Scientific Literature (e.g. [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sci_data/summary)]
   * Programming Code (e.g. [TheStack](https://huggingface.co/datasets/bigcode/the-stack)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_code_data/summary)]
-  * Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/sft_data_zh/summary)]
+  * Chinese Instruction Data (e.g. [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)) [[ModelScope](https://modelscope.cn/studios/Data-Juicer/process_sft_zh_data/summary)]
 * Tool Pool:
   * Dataset Splitting by Language [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_dataset_splitting_by_language/summary)]
   * Quality Classifier for CommonCrawl [[ModelScope](https://modelscope.cn/studios/Data-Juicer/tool_quality_classifier/summary)]

demos/README.md

Lines changed: 14 additions & 8 deletions
@@ -16,6 +16,9 @@ streamlit run app.py
 - Data (`data`)
   - This folder contains some sample datasets.

+- Overview scan (`overview_scan`)
+  - This demo introduces the basic concepts and functions of Data-Juicer, such as features, configuration, operators, and so on.
+
 - Data process loop (`data_process_loop`)
   - This demo analyzes and processes a dataset, providing a comparison of statistical information before and after the processing.

@@ -28,17 +31,20 @@ streamlit run app.py
 - Data visualization statistics (`data_visualization_statistics`)
   - This demo analyzes the dataset and obtains up to 13 statistics.

+- Process SFT Chinese data (`process_sft_zh_data`)
+  - This demo analyzes and processes part of the Chinese dataset in Alpaca-CoT to show how to process IFT or SFT data for LLM fine-tuning.
+
+- Process SCI data (`process_sci_data`)
+  - This demo analyzes and processes part of the arXiv dataset to show how to process scientific literature data for LLM pre-training.
+
+- Process code data (`process_code_data`)
+  - This demo analyzes and processes part of the Stack-Exchange dataset to show how to process code data for LLM pre-training.
+
 - Text quality classifier (`tool_quality_classifier`)
   - This demo provides 3 text quality classifiers to score the dataset.

 - Dataset splitting by language (`tool_dataset_splitting_by_language`)
   - This demo splits a dataset into different sub-datasets by language.

-## Demos Coming Soon
-- Overview scan
-- Auto evaluation helm
-- Data mixture
-- SFT data zh
-- Process sci data
-- Process code data
-- Data process hpo
+- Data mixture (`data_mixture`)
+  - This demo selects and mixes samples from multiple datasets and exports them into a new dataset.

demos/README_ZH.md

Lines changed: 17 additions & 12 deletions
(translated from Chinese)

@@ -16,30 +16,35 @@ streamlit run app.py
 - Sample datasets (`data`)
   - This folder contains some sample datasets.

+- Overview scan (`overview_scan`)
+  - This demo introduces the basic concepts and functions of Data-Juicer, such as features, the configuration system, operators, and so on.
+
 - Data process loop (`data_process_loop`)
   - This demo analyzes and processes a dataset, and compares the statistics of the dataset before and after processing.

 - Lexical diversity visualization (`data_visualization_diversity`)
-  - This demo analyzes the verb-noun structure of an SFT dataset , and plots it as a hierarchical sunburst chart.
+  - This demo analyzes the verb-noun structure of an SFT dataset and plots it as a hierarchical sunburst chart.

 - Operator effect visualization (`data_visualization_op_effect`)
   - This demo analyzes the statistics of a dataset and, based on them, shows the effect of each `Filter` operator under different thresholds.

 - Statistics visualization (`data_visualization_statistics`)
-  - The demo analyzes a dataset and obtains up to 13 kinds of statistics.
+  - This demo analyzes a dataset and obtains up to 13 kinds of statistics.
+
+- Process SFT Chinese data (`process_sft_zh_data`)
+  - Taking part of the Chinese data in Alpaca-CoT as an example, this demo shows the analysis and processing workflow for instruction-following and supervised fine-tuning data for LLMs.
+
+- Process scientific literature data for pre-training (`process_sci_data`)
+  - Taking part of the arXiv data as an example, this demo shows the analysis and processing workflow for scientific literature data in LLM pre-training.
+
+- Process code data for pre-training (`process_code_data`)
+  - Taking part of the Stack-Exchange data as an example, this demo shows the analysis and processing workflow for code data in LLM pre-training.

 - Text quality classifier (`tool_quality_classifier`)
-  - This demo provides 3 text quality classifiers , which score the dataset.
+  - This demo provides 3 text quality classifiers, which score the dataset.

 - Dataset splitting by language (`tool_dataset_splitting_by_language`)
   - This demo splits a dataset into different sub-datasets by language.

-## Demos Coming Soon
-- Overview scan | first look
-- Auto evaluation helm | automatic HELM evaluation
-- Data mixture | data mixing
-- SFT data zh | Chinese instruction fine-tuning data processing
-- Process sci data | scientific literature data processing
-- Process code data | code data processing
-- Data process hpo | automatic hyperparameter optimization for data mixing
-
+- Data mixture (`data_mixture`)
+  - This demo samples from multiple datasets and mixes them into a new dataset.

demos/data_mixture/app.py

Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,123 @@
1+
from pathlib import Path
2+
3+
import pandas as pd
4+
import streamlit as st
5+
6+
from data_juicer.format import load_formatter
7+
8+
if st.__version__ >= '1.23.0':
9+
data_editor = st.data_editor
10+
else:
11+
data_editor = st.data_editor.experimental_data_editor
12+
13+
14+
@st.cache_data
15+
def convert_csv(df):
16+
# IMPORTANT: Cache the conversion to prevent computation on every rerun
17+
return df.to_csv(encoding='utf_8_sig').encode('utf-8')
18+
19+
20+
@st.cache_data
21+
def convert_jsonl(df):
22+
# IMPORTANT: Cache the conversion to prevent computation on every rerun
23+
return df.to_json(orient='records', lines=True,
24+
force_ascii=False).encode('utf-8')
25+
26+
27+
class Visualize:
28+
29+
@staticmethod
30+
def setup():
31+
st.set_page_config(
32+
page_title='Data-Juicer',
33+
page_icon=':smile',
34+
layout='wide',
35+
# initial_sidebar_state="expanded",
36+
)
37+
38+
readme_link = 'https://github.com/alibaba/data-juicer'
39+
st.markdown(
40+
'<div align = "center"> <font size = "70"> Data-Juicer \
41+
</font> </div>',
42+
unsafe_allow_html=True,
43+
)
44+
st.markdown(
45+
f'<div align = "center"> A Data-Centric Text Processing System for \
46+
Large Language Models, \
47+
see more details in our <a href={readme_link}>page</a></div>',
48+
unsafe_allow_html=True,
49+
)
50+
51+
@staticmethod
52+
def mix_dataset():
53+
54+
data_files = list(Path('./data').glob('*jsonl'))
55+
56+
data_files_dict = {file.stem: str(file) for file in data_files}
57+
col1, col2 = st.columns(2)
58+
all_selected = []
59+
with col1:
60+
col3, col4 = st.columns(2)
61+
with col3:
62+
st.subheader('Select datasets')
63+
options = sorted(list(data_files_dict.keys()))
64+
selected_ds = st.multiselect(label='datasets',
65+
options=options,
66+
label_visibility='hidden')
67+
for ds in selected_ds:
68+
all_selected.append({'dataset': ds, 'weight': 1.0})
69+
with col4:
70+
st.subheader('Select sampling method')
71+
options = ['Random']
72+
st.selectbox(label='method',
73+
options=options,
74+
label_visibility='hidden')
75+
76+
st.subheader('Set weight (0.0-1.0)')
77+
datasets = data_editor(all_selected, use_container_width=True)
78+
ds_names = [ds['dataset'] for ds in datasets]
79+
ds_files = [data_files_dict[ds['dataset']] for ds in datasets]
80+
weights = [ds['weight'] for ds in datasets]
81+
with col2:
82+
st.subheader('Show selected dataset details')
83+
display_select = st.checkbox('Display')
84+
if display_select:
85+
if len(datasets) > 0:
86+
tabs = st.tabs(ds_names)
87+
for tab, ds_file in zip(tabs, ds_files):
88+
with tab:
89+
st.write(pd.read_json(ds_file, lines=True))
90+
91+
start_btn = st.button('Start to mix datasets', use_container_width=True)
92+
if start_btn:
93+
if len(datasets) > 0:
94+
data_path = ' '.join([
95+
' '.join([str(weight), ds_file])
96+
for ds_file, weight in zip(ds_files, weights)
97+
])
98+
formatter = load_formatter(data_path)
99+
df = pd.DataFrame(formatter.load_dataset())
100+
101+
st.session_state.dataset = df
102+
else:
103+
st.warning('Please select one dataset at least')
104+
105+
dataset = st.session_state.get('dataset', pd.DataFrame())
106+
st.subheader('Mixed dataset')
107+
st.dataframe(dataset, use_container_width=True)
108+
st.download_button(label='Download mixed dataset as JSONL',
109+
data=convert_jsonl(dataset),
110+
file_name='mixed_dataset.jsonl')
111+
112+
@staticmethod
113+
def visualize():
114+
Visualize.setup()
115+
Visualize.mix_dataset()
116+
117+
118+
def main():
119+
Visualize.visualize()
120+
121+
122+
if __name__ == '__main__':
123+
main()
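The weighted `data_path` string that `mix_dataset` hands to `load_formatter` follows a simple space-separated `weight file` convention. A minimal sketch of how it is assembled; the file names below are hypothetical examples, not files from the repo:

```python
# Build a Data-Juicer style weighted dataset path: '<w1> <f1> <w2> <f2> ...'
# The file names here are hypothetical examples.
ds_files = ['data/sample1.jsonl', 'data/sample2.jsonl']
weights = [0.5, 1.0]

data_path = ' '.join(f'{w} {f}' for f, w in zip(ds_files, weights))
print(data_path)  # -> 0.5 data/sample1.jsonl 1.0 data/sample2.jsonl
```

Each weight scales how many samples are drawn from the file that follows it, so the same file list can yield differently balanced mixtures.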
