Commit d531861

fix: fix cpu bind (#173)
Signed-off-by: Song Gao <disxiaofei@163.com>
1 parent 05517e4 commit d531861

9 files changed, +579 -56 lines changed

docs/cgroup_binding.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# cgroup Binding and Startup Gate Synchronization

## Purpose

This document explains how veloFlux binds FlowInstance processes to Linux cgroup v2 paths and how
the startup gate coordinates Manager/worker startup order.

The goals are:

- keep CPU isolation deterministic at process level
- avoid worker runtime initialization before cgroup binding is complete
- provide explicit failure signals and logs for operations/debugging

## Configuration

All fields are under `server`:

- `default_cgroup_path`
  Optional cgroup v2 path for the in-process default FlowInstance (Manager process).
- `extra_flow_instances[].cgroup_path`
  Optional cgroup v2 path per worker subprocess.
- `startup_gate_path`
  Optional base directory for startup gate marker files.
- `startup_gate_timeout_ms`
  Optional worker wait timeout for gate readiness.

## Effective Enablement Rules

- Manager default-instance binding runs only when `default_cgroup_path` is configured.
- Worker cgroup binding runs per instance only when that instance has `cgroup_path`.
- Startup gate is enabled only if:
  - at least one extra instance has a non-empty `cgroup_path`, and
  - `startup_gate_path` is configured.
- If no extra instance declares `cgroup_path`, startup gate is skipped.
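
The gate rule above reduces to a single predicate. The following is a minimal sketch, not the project's code: `InstanceSpec` and `startup_gate_enabled` are hypothetical names, and treating an empty `startup_gate_path` as "not configured" is an assumption.

```rust
/// Hypothetical, trimmed-down view of one `extra_flow_instances` entry.
#[derive(Debug, Clone)]
pub struct InstanceSpec {
    pub id: String,
    pub cgroup_path: Option<String>,
}

/// Returns true when the startup gate should be used:
/// at least one extra instance has a non-empty `cgroup_path`
/// AND `startup_gate_path` is configured.
pub fn startup_gate_enabled(instances: &[InstanceSpec], startup_gate_path: Option<&str>) -> bool {
    let any_bound = instances.iter().any(|spec| {
        spec.cgroup_path
            .as_deref()
            .map_or(false, |p| !p.trim().is_empty())
    });
    // Assumption: an empty gate path string counts as "not configured".
    let gate_configured = startup_gate_path.map_or(false, |p| !p.trim().is_empty());
    any_bound && gate_configured
}
```

Keeping the rule in one pure function makes the "gate enabled/disabled reason" log line easy to derive from the same inputs.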

## Startup Sequence

### 1) Manager (`default`) pre-runtime bind

Before creating the Tokio runtime, the main process tries to join `default_cgroup_path`.

- success: logs one `flow instance bound to cgroup` record
- failure: logs reason + cgroup snapshot and exits startup

This keeps the Manager binding attempt in a pre-runtime, single-threaded stage, before the
multi-threaded Tokio runtime is built.
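
As a rough illustration of the bind itself, the sketch below writes a PID to `cgroup.procs` under a caller-supplied root. `join_cgroup` and its `cgroup_root` parameter are hypothetical (the commit's own entry point is `join_current_process`), and `/sys/fs/cgroup` as the usual cgroup v2 mount point is an assumption about the host.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Moves `pid` into the cgroup at `<cgroup_root>/<cgroup_path>` by writing to `cgroup.procs`.
/// Assumption: `cgroup_root` is the cgroup v2 mount point, commonly `/sys/fs/cgroup`.
pub fn join_cgroup(cgroup_root: &Path, cgroup_path: &str, pid: u32) -> io::Result<()> {
    let procs = cgroup_root
        .join(cgroup_path.trim_start_matches('/'))
        .join("cgroup.procs");
    // The kernel provides `cgroup.procs`, so open the existing file instead of creating it.
    let mut f = OpenOptions::new().write(true).open(&procs)?;
    f.write_all(format!("{}\n", pid).as_bytes())
}
```

Taking the root as a parameter is only a convenience of this sketch: it lets the function be exercised against a temporary directory without privileges.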

### 2) Worker spawn and gate-controlled startup

For each `extra_flow_instances` entry (serially):

1. Manager spawns worker process (`--worker ...`).
2. If configured, Manager binds worker PID to `cgroup_path`.
3. If startup gate is enabled:
   - on bind success: Manager writes `<instance>.ready`
   - on bind failure: Manager writes `<instance>.fail`, kills workers, returns error
4. Worker process waits for gate readiness before creating its runtime and listeners:
   - sees `.ready` -> continue startup
   - sees `.fail` -> exit with failure
   - timeout -> exit with failure

When the startup gate is enabled, this guarantees worker runtime init happens only after
Manager-side binding succeeds.
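
The worker-side wait in step 4 can be sketched as a polling loop over the two marker files. This is an illustration under stated assumptions, not the body of `startup_gate::wait_until_ready`: `GateOutcome`, `wait_for_gate`, and the 20 ms poll interval are all hypothetical.

```rust
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of waiting on the gate (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum GateOutcome {
    Ready,
    Failed,
    TimedOut,
}

/// Polls `run_dir` for `<instance>.ready` or `<instance>.fail` until `timeout` elapses.
pub fn wait_for_gate(run_dir: &Path, instance: &str, timeout: Duration) -> GateOutcome {
    let ready = run_dir.join(format!("{}.ready", instance));
    let fail = run_dir.join(format!("{}.fail", instance));
    let deadline = Instant::now() + timeout;
    loop {
        // Check `.fail` first so a failed bind is never mistaken for readiness.
        if fail.exists() {
            return GateOutcome::Failed;
        }
        if ready.exists() {
            return GateOutcome::Ready;
        }
        if Instant::now() >= deadline {
            return GateOutcome::TimedOut;
        }
        // Assumption: the poll interval is arbitrary; the source does not state one.
        thread::sleep(Duration::from_millis(20));
    }
}
```

The loop uses only blocking std calls, which fits the requirement that the wait runs before any async runtime exists in the worker.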

## Startup Gate Details

- Manager creates a per-run gate directory under `startup_gate_path` (for example:
  `boot-<pid>-<timestamp_ms>`).
- Marker writes are atomic (tmp file + rename).
- Marker naming:
  - ready: `<instance>.ready`
  - fail: `<instance>.fail`
- On gate session creation, stale entries under `startup_gate_path` are removed best-effort.
- On Manager shutdown (normal path), the per-run gate directory is removed best-effort.
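
A minimal sketch of the per-run directory name and the tmp-file-plus-rename marker write follows. `run_dir_name` and `write_marker` are hypothetical helpers, and the `.tmp` suffix is an assumption; the only property that matters is that the temporary file lives in the same directory as the final marker.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Builds a per-run directory name in the `boot-<pid>-<timestamp_ms>` shape.
pub fn run_dir_name(pid: u32, timestamp_ms: u128) -> String {
    format!("boot-{}-{}", pid, timestamp_ms)
}

/// Writes `<instance>.<kind>` atomically: write a temp file, then rename it into place.
pub fn write_marker(run_dir: &Path, instance: &str, kind: &str, body: &str) -> io::Result<PathBuf> {
    let final_path = run_dir.join(format!("{}.{}", instance, kind));
    // Assumption: the temp suffix is arbitrary. Keeping the temp file in the same
    // directory keeps the rename on one filesystem.
    let tmp_path = run_dir.join(format!("{}.{}.tmp", instance, kind));
    {
        let mut f = fs::File::create(&tmp_path)?;
        f.write_all(body.as_bytes())?;
        f.sync_all()?;
    }
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}
```

Because the rename publishes the file in one step, a worker polling for `<instance>.ready` sees either no marker or a complete one, never a partially written file.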

## Failure Behavior

If any worker bind/gate step fails:

- current worker is terminated
- previously started workers are terminated
- startup returns error to Manager bootstrap path

This keeps startup behavior fail-fast and avoids partial multi-instance availability.
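
The fail-fast rollback can be sketched generically. The `Worker` trait and `start_all` below are hypothetical stand-ins for real process handles and for the Manager's spawn loop; each element is treated as "started" once the loop reaches it.

```rust
/// Hypothetical stand-in for a spawned worker handle.
pub trait Worker {
    fn id(&self) -> &str;
    fn terminate(&mut self);
}

/// Binds workers serially. On the first failure, terminates the failing worker and
/// every worker handled before it, then returns the error.
pub fn start_all<W, F>(mut workers: Vec<W>, mut bind: F) -> Result<Vec<W>, String>
where
    W: Worker,
    F: FnMut(&W) -> Result<(), String>,
{
    for idx in 0..workers.len() {
        if let Err(err) = bind(&workers[idx]) {
            // Roll back: the current worker plus everything started before it.
            for w in workers[..=idx].iter_mut() {
                w.terminate();
            }
            return Err(format!("worker {} failed: {}", workers[idx].id(), err));
        }
    }
    Ok(workers)
}
```

Workers after the failing index are never touched, which matches the serial spawn order described in the startup sequence.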

## cgroup Join Implementation Notes

- `join_pid` validates cgroup path format (`/...`, no parent traversal).
- If target cgroup already matches `/proc/<pid>/cgroup`, the join is treated as no-op success.
- Primary write target is `cgroup.procs`.
- If `cgroup.procs` write returns `EINVAL`, code attempts fallback via `cgroup.threads`.
- Failure logs include a cgroup snapshot (`cgroup.type`, `cgroup.subtree_control`, `cpu.max`,
  and `cpuset.*`) for root-cause diagnosis.
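
The first two notes (path validation and the no-op check) can be illustrated with two pure functions. Both names are hypothetical and the real `join_pid` checks may be stricter; `already_in_cgroup` takes the text of `/proc/<pid>/cgroup` as a string so it can be tried without reading `/proc`.

```rust
use std::path::{Component, Path};

/// Accepts only absolute cgroup paths (`/...`) that contain no `..` component.
/// Hypothetical helper; the real validation may differ.
pub fn check_cgroup_path(path: &str) -> Result<&str, String> {
    if !path.starts_with('/') {
        return Err(format!("cgroup path must start with '/': {:?}", path));
    }
    let has_parent = Path::new(path)
        .components()
        .any(|c| matches!(c, Component::ParentDir));
    if has_parent {
        return Err(format!("cgroup path must not contain '..': {:?}", path));
    }
    Ok(path)
}

/// Returns true when the `0::` (cgroup v2) entry of `/proc/<pid>/cgroup` content
/// already names `target`, in which case the join can be skipped.
pub fn already_in_cgroup(proc_cgroup: &str, target: &str) -> bool {
    proc_cgroup
        .lines()
        .filter_map(|line| line.strip_prefix("0::"))
        .map(|value| {
            let value = value.trim();
            // An empty unified entry is treated as the root cgroup.
            if value.is_empty() { "/" } else { value }
        })
        .next()
        .map_or(false, |current| current == target)
}
```

For example, `already_in_cgroup("0::/veloflux/manager\n", "/veloflux/manager")` is true, so a second join attempt for the same target would be a no-op.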

## Operational Guidance

- Prepare cgroup tree/controller delegation before starting veloFlux.
- Use startup logs to verify:
  - default Manager bind
  - per-worker bind result
  - gate enabled/disabled reason
- If startup fails, inspect:
  - cgroup path existence and type (domain/leaf)
  - controller delegation (`cpu` in relevant subtree controls)
  - gate directory permission and timeout settings
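
For manual inspection, a best-effort snapshot of the relevant control files can be gathered along these lines. `cgroup_snapshot` is a hypothetical analogue of the `debug_snapshot` function referenced in this commit, not its implementation, and the output format is an assumption.

```rust
use std::fs;
use std::path::Path;

/// Collects a best-effort, one-line snapshot of a cgroup directory.
/// Missing or unreadable files are reported inline instead of aborting.
pub fn cgroup_snapshot(dir: &Path) -> String {
    let mut names = vec![
        "cgroup.type".to_string(),
        "cgroup.subtree_control".to_string(),
        "cpu.max".to_string(),
    ];
    // Add every `cpuset.*` file that exists, sorted for stable output.
    if let Ok(entries) = fs::read_dir(dir) {
        let mut cpuset: Vec<String> = entries
            .filter_map(|e| e.ok())
            .filter_map(|e| e.file_name().into_string().ok())
            .filter(|n| n.starts_with("cpuset."))
            .collect();
        cpuset.sort();
        names.extend(cpuset);
    }
    names
        .iter()
        .map(|name| match fs::read_to_string(dir.join(name)) {
            Ok(body) => format!("{}={:?}", name, body.trim()),
            Err(err) => format!("{}=<{:?}>", name, err.kind()),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

A `cgroup.type` other than `domain`, or a `cgroup.subtree_control` without `cpu`, points at the delegation items in the checklist above.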

scripts/perf_pr_host/README.md

Lines changed: 5 additions & 0 deletions
@@ -22,6 +22,11 @@ Update these paths using `setup_cgroup.sh` output:
 - `server.extra_flow_instances[0].cgroup_path`
 - `server.extra_flow_instances[1].cgroup_path`
 
+For startup synchronization between manager and worker processes, set:
+
+- `server.startup_gate_path` (recommended: `/tmp/veloflux-startup-gate`)
+- `server.startup_gate_timeout_ms` (example: `15000`)
+
 ## 1) Setup cgroup
 
 ```bash

scripts/perf_pr_host/config.yaml

Lines changed: 6 additions & 5 deletions
@@ -27,17 +27,18 @@ server:
   # Example:
   # CG_MANAGER=/veloflux-ci/perf-pr-host-001/veloflux/manager
   default_cgroup_path: "/veloflux-ci/perf-pr-host-001/veloflux/manager"
+  startup_gate_path: "/tmp/veloflux-startup-gate"
+  startup_gate_timeout_ms: 15000
 
   extra_flow_instances:
     - id: "fi_critical"
       worker_addr: "127.0.0.1:18081"
-      metrics_addr: "0.0.0.0:19891"
-      profile_addr: "0.0.0.0:16061"
+      metrics_addr: "127.0.0.1:19891"
+      profile_addr: "127.0.0.1:16061"
       cgroup_path: "/veloflux-ci/perf-pr-host-001/veloflux/fi_critical"
 
     - id: "fi_best"
       worker_addr: "127.0.0.1:18082"
-      metrics_addr: "0.0.0.0:19892"
-      profile_addr: "0.0.0.0:16062"
+      metrics_addr: "127.0.0.1:19892"
+      profile_addr: "127.0.0.1:16062"
       cgroup_path: "/veloflux-ci/perf-pr-host-001/veloflux/fi_best"
-
src/cgroup.rs

Lines changed: 28 additions & 1 deletion
@@ -50,6 +50,26 @@ fn cgroup_threads_path(
     Ok(dir.join("cgroup.threads"))
 }
 
+#[cfg(target_os = "linux")]
+fn pid_cgroup_v2_path(pid: u32) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
+    let path = PathBuf::from(format!("/proc/{pid}/cgroup"));
+    let raw =
+        std::fs::read_to_string(&path).map_err(|err| format!("read {}: {err}", path.display()))?;
+    for line in raw.lines() {
+        if let Some(value) = line.strip_prefix("0::") {
+            let value = value.trim();
+            if value.is_empty() {
+                return Ok("/".to_string());
+            }
+            if value.starts_with('/') {
+                return Ok(value.to_string());
+            }
+            return Ok(format!("/{value}"));
+        }
+    }
+    Err(format!("missing unified cgroup entry in {}", path.display()).into())
+}
+
 #[cfg(target_os = "linux")]
 pub fn debug_snapshot(cgroup_path: &str) -> String {
     let dir = match cgroup_dir(cgroup_path) {
@@ -122,7 +142,14 @@ pub fn join_pid(
     pid: u32,
     cgroup_path: &str,
 ) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
-    let procs_path = cgroup_procs_path(cgroup_path)?;
+    let target_path = validate_cgroup_path(cgroup_path)?;
+    if let Ok(current_path) = pid_cgroup_v2_path(pid) {
+        if current_path == target_path {
+            return Ok(());
+        }
+    }
+
+    let procs_path = cgroup_procs_path(target_path)?;
     let mut f = std::fs::OpenOptions::new()
         .write(true)
         .open(&procs_path)

src/config.rs

Lines changed: 10 additions & 0 deletions
@@ -133,6 +133,8 @@ impl Default for MetricsConfig {
 pub struct ServerConfig {
     pub manager_addr: Option<String>,
     pub default_cgroup_path: Option<String>,
+    pub startup_gate_path: Option<String>,
+    pub startup_gate_timeout_ms: Option<u64>,
     #[serde(default)]
     pub extra_flow_instances: Vec<FlowInstanceSpec>,
 }
@@ -142,6 +144,8 @@ impl Default for ServerConfig {
         Self {
             manager_addr: Some(crate::server::DEFAULT_MANAGER_ADDR.to_string()),
             default_cgroup_path: None,
+            startup_gate_path: None,
+            startup_gate_timeout_ms: None,
             extra_flow_instances: Vec::new(),
         }
     }
@@ -191,6 +195,12 @@ impl AppConfig {
         if let Some(path) = self.server.default_cgroup_path.as_ref() {
             opts.default_cgroup_path = Some(path.clone());
         }
+        if let Some(path) = self.server.startup_gate_path.as_ref() {
+            opts.startup_gate_path = Some(path.clone());
+        }
+        if let Some(timeout_ms) = self.server.startup_gate_timeout_ms {
+            opts.startup_gate_timeout_ms = Some(timeout_ms);
+        }
         if !self.server.extra_flow_instances.is_empty() {
             opts.extra_flow_instances = self.server.extra_flow_instances.clone();
         }

src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -12,4 +12,5 @@ pub mod cgroup;
 pub mod config;
 pub mod logging;
 pub mod server;
+pub mod startup_gate;
 pub use manager::{register_schema, schema_registry, SchemaParser};

src/main.rs

Lines changed: 99 additions & 20 deletions
@@ -4,39 +4,118 @@ static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
 
 use veloflux::server;
 
-#[tokio::main]
-async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
+#[derive(Debug, Clone)]
+struct WorkerCliArgs {
+    instance_id: String,
+    config_path: String,
+    startup_gate_dir: Option<String>,
+    startup_gate_timeout_ms: Option<u64>,
+}
+
+impl WorkerCliArgs {
+    fn parse(args: Vec<String>) -> Result<Self, Box<dyn std::error::Error + Send + Sync>> {
+        let mut instance_id = None;
+        let mut config_path = None;
+        let mut startup_gate_dir = None;
+        let mut startup_gate_timeout_ms = None;
+        let mut it = args.into_iter().peekable();
+        while let Some(arg) = it.next() {
+            match arg.as_str() {
+                "--flow-instance-id" => {
+                    instance_id = it.next();
+                }
+                "--config" => {
+                    config_path = it.next();
+                }
+                "--startup-gate-dir" => {
+                    startup_gate_dir = it.next();
+                }
+                "--startup-gate-timeout-ms" => {
+                    if let Some(raw) = it.next() {
+                        let parsed = raw
+                            .parse::<u64>()
+                            .map_err(|_| format!("invalid --startup-gate-timeout-ms: {raw}"))?;
+                        startup_gate_timeout_ms = Some(parsed);
+                    }
+                }
+                _ => {}
+            }
+        }
+
+        let instance_id = instance_id.ok_or("--flow-instance-id is required in --worker mode")?;
+        let config_path = config_path.ok_or("--config is required in --worker mode")?;
+        Ok(Self {
+            instance_id,
+            config_path,
+            startup_gate_dir,
+            startup_gate_timeout_ms,
+        })
+    }
+}
+
+fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
     let args = std::env::args().skip(1).collect::<Vec<_>>();
     if args.iter().any(|arg| arg == "--worker") {
-        return run_worker(args).await;
+        let worker_args = WorkerCliArgs::parse(args)?;
+        if let Some(run_dir) = worker_args.startup_gate_dir.as_deref() {
+            let timeout_ms = worker_args
+                .startup_gate_timeout_ms
+                .unwrap_or(veloflux::startup_gate::DEFAULT_STARTUP_GATE_TIMEOUT_MS);
+            veloflux::startup_gate::wait_until_ready(
+                run_dir,
+                &worker_args.instance_id,
+                timeout_ms,
+            )?;
+        }
+        let rt = tokio::runtime::Builder::new_multi_thread()
+            .enable_all()
+            .build()?;
+        return rt.block_on(run_worker(worker_args));
     }
 
     let result = veloflux::bootstrap::default_init()?;
     // Keep logging guard alive for the duration of the application
     let _logging_guard = result.logging_guard;
 
-    let ctx = server::init(result.options, result.instance).await?;
-    server::start(ctx).await
-}
-
-async fn run_worker(args: Vec<String>) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
-    let mut instance_id = None;
-    let mut config_path = None;
-    let mut it = args.into_iter().peekable();
-    while let Some(arg) = it.next() {
-        match arg.as_str() {
-            "--flow-instance-id" => {
-                instance_id = it.next();
+    if let Some(path) = result.options.default_cgroup_path.as_deref() {
+        match veloflux::cgroup::join_current_process(path) {
+            Ok(()) => {
+                tracing::info!(
+                    flow_instance_id = "default",
+                    pid = std::process::id(),
+                    cgroup_path = %path,
+                    reason = "manager process joined target cgroup (pre-runtime single-thread stage)",
+                    "flow instance bound to cgroup"
+                );
             }
-            "--config" => {
-                config_path = it.next();
+            Err(err) => {
+                tracing::error!(
+                    flow_instance_id = "default",
+                    pid = std::process::id(),
+                    cgroup_path = %path,
+                    reason = %err,
+                    cgroup_snapshot = %veloflux::cgroup::debug_snapshot(path),
+                    "failed to bind flow instance to cgroup"
+                );
+                return Err(err);
             }
-            _ => {}
         }
     }
 
-    let instance_id = instance_id.ok_or("--flow-instance-id is required in --worker mode")?;
-    let config_path = config_path.ok_or("--config is required in --worker mode")?;
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .enable_all()
+        .build()?;
+    rt.block_on(async move {
+        let ctx = server::init(result.options, result.instance).await?;
+        server::start(ctx).await
+    })
+}
+
+async fn run_worker(
+    worker_args: WorkerCliArgs,
+) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
+    let instance_id = worker_args.instance_id;
+    let config_path = worker_args.config_path;
 
     flow::init_process_once();
     flow::metrics::set_flow_instance_id(&instance_id);
