Skip to content

Commit 2e62d2a

Browse files
HYLcoolcyruszhanggemini-code-assist[bot]
authored
Release v1.5.0 (#918)
* * update version and doc * * ignore the first two items in the shape and focus on the specific shapes * + add model_params to text_tagging_by_prompt_mapper + flush the buffer when outputting the trace results and wait for 1 sec * bugfix: use /workspace for shared access in ray-head and ray-worker * Update data_juicer/ops/mapper/text_tagging_by_prompt_mapper.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: cyruszhang <cyrus.ylzhang@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent ae290f7 commit 2e62d2a

File tree

11 files changed

+99
-38
lines changed

11 files changed

+99
-38
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,9 @@ tests/tools/tmp_*/
3232
tests/ops/deduplicator/chinese_dedup/
3333
tests/ops/deduplicator/english_dedup/
3434

35+
# temp directory for distributed Ray tests (shared between containers)
36+
tmp/
37+
3538

3639
# perf bench data
3740
perf_bench_data/

README.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,15 @@ for s in res_ds:
8686

8787
## 📰 News
8888

89+
<details open>
90+
<summary>[2026-02-12] Release v1.5.0: <b>Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs</b></summary>
91+
92+
- 🚀 *Enhanced Distributed Execution Framework* -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.
93+
- 🤖 *Expanded Embodied AI Video Processing* -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.
94+
- 💪🏻 *System Performance & Developer Experience Optimizations* -- Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.
95+
- 🐳 *Critical Bug Fixes & Stability Improvements* -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.
96+
</details>
97+
8998
<details open>
9099
<summary>[2026-02-02] Release v1.4.6: <b>Copilot, Video Bytes I/O & Ray Tracing </b></summary>
91100

@@ -96,7 +105,7 @@ for s in res_ds:
96105
- 🐳 *Enhancements & fixes* — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.
97106
</details>
98107

99-
<details open>
108+
<details>
100109
<summary>[2026-01-15] Release v1.4.5: <b>20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade</b> </summary>
101110

102111
- *Embodied-AI OPs*: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus *S3 upload/download*.

README_ZH.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -85,6 +85,15 @@ for s in res_ds:
8585

8686
## 📰 动态
8787

88+
<details open>
89+
<summary>[2026-02-12] Release v1.5.0: <b>分区Ray执行器,OP级环境隔离,以及更多具身算子</b></summary>
90+
91+
- 🚀 *分布式执行框架升级* — 新增分区Ray执行器与OP级隔离环境,强化容错性、可扩展性及依赖冲突管理。
92+
- 🤖 *具身AI视频处理能力扩展* — 集成相机校准、视频去畸变、手部重建、位姿估计等专用操作符,提升多视角视频处理能力。
93+
- 💪🏻 *系统性能与开发体验优化* — 支持批处理推理、内存/日志精简、关键逻辑重构,并更新文档与问题模板。
94+
- 🐳 *关键问题修复与稳定性提升* — 修复重复项追踪、参数冲突、首页渲染等缺陷,增强系统可靠性。
95+
</details>
96+
8897
<details open>
8998
<summary>[2026-02-02] Release v1.4.6: <b>Copilot、视频字节 I/O 与 Ray 追踪</b></summary>
9099

@@ -95,7 +104,7 @@ for s in res_ds:
95104
- 🐳 *增强与修复* — 刷新 Docker 镜像、小幅性能提升、GitHub Insights 流量工作流、Ray 兼容性更新以及 Bug/文档修复。
96105
</details>
97106

98-
<details open>
107+
<details >
99108
<summary>[2026-01-15] Release v1.4.5: <b>20+ 新 OP、Ray vLLM 管道与 Sphinx 文档升级</b> </summary>
100109

101110
- *具身 AI OP*:添加/增强了用于视频标题生成(VLM)、视频对象分割(YOLOE+SAM2)、视频深度估计(可视化 + 点云)、人体姿态(MMPose)、图像标签(VLM)、单图像 3D 人体网格恢复(SAM 3D Body)的映射器,以及 *S3 上传/下载*

data_juicer/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
__version__ = "1.4.6"
1+
__version__ = "1.5.0"
22

33
import sys
44

data_juicer/core/tracer/ray_tracer.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -167,5 +167,7 @@ def finalize_traces(self):
167167
# We'll use a generic name for now, could be improved with operator type detection
168168
res_name = self.get_trace_file_path(op_name)
169169
dif_df = pd.DataFrame(traces)
170-
dif_df.to_json(res_name, orient="records", lines=True, force_ascii=False)
170+
with open(res_name, "w") as out_buf:
171+
dif_df.to_json(out_buf, orient="records", lines=True, force_ascii=False)
172+
out_buf.flush()
171173
print(f"Exported {len(traces)} traced samples for op [{op_name}] to {res_name}")

data_juicer/ops/mapper/text_tagging_by_prompt_mapper.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -74,6 +74,7 @@ def __init__(
7474
tensor_parallel_size: int = None,
7575
max_model_len: int = None,
7676
max_num_seqs: int = 256,
77+
model_params: Dict = None,
7778
sampling_params: Dict = None,
7879
*args,
7980
**kwargs,
@@ -93,6 +94,7 @@ def __init__(
9394
derived from the model config.
9495
:param max_num_seqs: It is only valid when enable_vllm is True.
9596
Maximum number of sequences to be processed in a single iteration.
97+
:param model_params: Parameters for model initialization.
9698
:param sampling_params: Sampling parameters for text generation.
9799
e.g {'temperature': 0.9, 'top_p': 0.95}
98100
:param args: extra args
@@ -117,7 +119,8 @@ def __init__(
117119
self.prompt = prompt
118120
self.tag_list = tag_list
119121
self.enable_vllm = enable_vllm
120-
model_params = {"trust_remote_code": trust_remote_code, "max_num_seqs": max_num_seqs}
122+
model_params = (model_params or {}).copy()
123+
model_params.update({"trust_remote_code": trust_remote_code, "max_num_seqs": max_num_seqs})
121124
if tensor_parallel_size is not None:
122125
model_params["tensor_parallel_size"] = tensor_parallel_size
123126
if max_model_len is not None:

tests/core/executor/test_partitioned_integration.py

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
import shutil
1919
import tempfile
2020
import unittest
21+
import uuid
2122

2223
from data_juicer.config import init_configs
2324
from data_juicer.core.executor.ray_executor_partitioned import PartitionedRayExecutor
@@ -31,7 +32,12 @@ class PartitionedExecutorIntegrationTest(DataJuicerTestCaseBase):
3132

3233
def setUp(self) -> None:
3334
super().setUp()
34-
self.tmp_dir = tempfile.mkdtemp(prefix='test_partitioned_integration_')
35+
# Use a shared directory under root_path instead of system /tmp
36+
# This ensures the temp directory is accessible by all Ray workers
37+
# in distributed mode (e.g., Docker containers sharing /workspace)
38+
unique_name = f'test_partitioned_integration_{uuid.uuid4().hex[:8]}'
39+
self.tmp_dir = os.path.join(self.root_path, 'tmp', unique_name)
40+
os.makedirs(self.tmp_dir, exist_ok=True)
3541

3642
def tearDown(self) -> None:
3743
super().tearDown()
@@ -458,7 +464,12 @@ class CheckpointResumeIntegrationTest(DataJuicerTestCaseBase):
458464

459465
def setUp(self) -> None:
460466
super().setUp()
461-
self.tmp_dir = tempfile.mkdtemp(prefix='test_ckpt_resume_')
467+
# Use a shared directory under root_path instead of system /tmp
468+
# This ensures the temp directory is accessible by all Ray workers
469+
# in distributed mode (e.g., Docker containers sharing /workspace)
470+
unique_name = f'test_ckpt_resume_{uuid.uuid4().hex[:8]}'
471+
self.tmp_dir = os.path.join(self.root_path, 'tmp', unique_name)
472+
os.makedirs(self.tmp_dir, exist_ok=True)
462473

463474
def tearDown(self) -> None:
464475
super().tearDown()

tests/core/executor/test_ray_executor_partitioned.py

Lines changed: 15 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,9 @@
11
import os
2+
import shutil
23
import tempfile
34
import unittest
5+
import uuid
6+
47
from data_juicer.core.executor.ray_executor_partitioned import PartitionedRayExecutor
58
from data_juicer.config import init_configs
69
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase, TEST_TAG
@@ -11,13 +14,16 @@ class PartitionedRayExecutorTest(DataJuicerTestCaseBase):
1114

1215
def setUp(self) -> None:
1316
super().setUp()
14-
# Create temporary directory
15-
self.tmp_dir = tempfile.mkdtemp(prefix='test_ray_executor_partitioned_')
17+
# Use a shared directory under root_path instead of system /tmp
18+
# This ensures the temp directory is accessible by all Ray workers
19+
# in distributed mode (e.g., Docker containers sharing /workspace)
20+
unique_name = f'test_ray_executor_partitioned_{uuid.uuid4().hex[:8]}'
21+
self.tmp_dir = os.path.join(self.root_path, 'tmp', unique_name)
22+
os.makedirs(self.tmp_dir, exist_ok=True)
1623

1724
def tearDown(self) -> None:
1825
super().tearDown()
1926
# Clean up temporary directory
20-
import shutil
2127
if os.path.exists(self.tmp_dir):
2228
shutil.rmtree(self.tmp_dir)
2329

@@ -537,11 +543,15 @@ class PartitionedRayExecutorEdgeCasesTest(DataJuicerTestCaseBase):
537543

538544
def setUp(self) -> None:
539545
super().setUp()
540-
self.tmp_dir = tempfile.mkdtemp(prefix='test_ray_executor_edge_')
546+
# Use a shared directory under root_path instead of system /tmp
547+
# This ensures the temp directory is accessible by all Ray workers
548+
# in distributed mode (e.g., Docker containers sharing /workspace)
549+
unique_name = f'test_ray_executor_edge_{uuid.uuid4().hex[:8]}'
550+
self.tmp_dir = os.path.join(self.root_path, 'tmp', unique_name)
551+
os.makedirs(self.tmp_dir, exist_ok=True)
541552

542553
def tearDown(self) -> None:
543554
super().tearDown()
544-
import shutil
545555
if os.path.exists(self.tmp_dir):
546556
shutil.rmtree(self.tmp_dir)
547557

tests/core/tracer/test_ray_tracer.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
import unittest
33
import tempfile
44
import shutil
5+
import time
56
import jsonlines as jl
67
from data_juicer.core.tracer.ray_tracer import RayTracer
78
from data_juicer.utils.unittest_utils import TEST_TAG
@@ -58,6 +59,7 @@ def test_collect_mapper_sample_basic(self):
5859

5960
# Finalize traces to write to file
6061
ray.get(tracer.finalize_traces.remote())
62+
time.sleep(1)
6163

6264
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_mapper.jsonl')
6365
self.assertTrue(os.path.exists(trace_file_path))
@@ -87,6 +89,7 @@ def test_collect_mapper_sample_no_change(self):
8789

8890
# Finalize traces to write to file
8991
ray.get(tracer.finalize_traces.remote())
92+
time.sleep(1)
9093

9194
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_mapper.jsonl')
9295
# File should not exist since no samples were collected
@@ -105,6 +108,7 @@ def test_collect_mapper_sample_with_trace_keys(self):
105108

106109
# Finalize traces to write to file
107110
ray.get(tracer.finalize_traces.remote())
111+
time.sleep(1)
108112

109113
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_mapper.jsonl')
110114
self.assertTrue(os.path.exists(trace_file_path))
@@ -135,6 +139,7 @@ def test_collect_mapper_sample_with_missing_trace_keys(self):
135139

136140
# Finalize traces to write to file
137141
ray.get(tracer.finalize_traces.remote())
142+
time.sleep(1)
138143

139144
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_mapper.jsonl')
140145
self.assertTrue(os.path.exists(trace_file_path))
@@ -166,6 +171,7 @@ def test_collect_mapper_sample_not_in_op_list(self):
166171

167172
# Finalize traces to write to file
168173
ray.get(tracer.finalize_traces.remote())
174+
time.sleep(1)
169175

170176
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_mapper.jsonl')
171177
self.assertFalse(os.path.exists(trace_file_path))
@@ -183,6 +189,7 @@ def test_collect_filter_sample_basic(self):
183189

184190
# Finalize traces to write to file
185191
ray.get(tracer.finalize_traces.remote())
192+
time.sleep(1)
186193

187194
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_filter.jsonl')
188195
self.assertTrue(os.path.exists(trace_file_path))
@@ -208,6 +215,7 @@ def test_collect_filter_sample_should_keep(self):
208215

209216
# Finalize traces to write to file
210217
ray.get(tracer.finalize_traces.remote())
218+
time.sleep(1)
211219

212220
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_filter.jsonl')
213221
self.assertFalse(os.path.exists(trace_file_path))
@@ -225,6 +233,7 @@ def test_collect_filter_sample_not_in_op_list(self):
225233

226234
# Finalize traces to write to file
227235
ray.get(tracer.finalize_traces.remote())
236+
time.sleep(1)
228237

229238
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-test_filter.jsonl')
230239
self.assertFalse(os.path.exists(trace_file_path))
@@ -254,6 +263,7 @@ def test_collect_mapper_sample_show_num_limit(self):
254263

255264
# Finalize traces to write to file
256265
ray.get(tracer.finalize_traces.remote())
266+
time.sleep(1)
257267

258268
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-limited_mapper.jsonl')
259269
self.assertTrue(os.path.exists(trace_file_path))
@@ -288,6 +298,7 @@ def test_collect_filter_sample_show_num_limit(self):
288298

289299
# Finalize traces to write to file
290300
ray.get(tracer.finalize_traces.remote())
301+
time.sleep(1)
291302

292303
trace_file_path = os.path.join(self.work_dir, 'trace', 'sample_trace-limited_filter.jsonl')
293304
self.assertTrue(os.path.exists(trace_file_path))
@@ -327,6 +338,7 @@ def test_finalize_traces_empty(self):
327338

328339
# Don't collect anything, just finalize
329340
ray.get(tracer.finalize_traces.remote())
341+
time.sleep(1)
330342

331343
# No trace files should exist
332344
trace_dir = os.path.join(self.work_dir, 'trace')

tests/ops/mapper/test_text_tagging_by_prompt_mapper.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -52,7 +52,9 @@ def test_tagging_vllm(self):
5252
enable_vllm=True,
5353
max_model_len=1024,
5454
max_num_seqs=16,
55-
sampling_params={'temperature': 0.1, 'top_p': 0.95, 'max_tokens': 256})
55+
sampling_params={'temperature': 0.1, 'top_p': 0.95, 'max_tokens': 256},
56+
model_params={'gpu_memory_utilization': 0.8},
57+
)
5658

5759

5860
if __name__ == '__main__':

0 commit comments

Comments
 (0)