# Resource control based on cgroup (#7)
## Summary
### General Motivation

Currently, we don't control or limit the resource usage of worker processes, except by running the worker in a container (see the `container` section of the [doc](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#api-reference)). In most scenarios a container is unnecessary, but resource control is still needed for isolation.

[Control groups](https://man7.org/linux/man-pages/man7/cgroups.7.html), usually referred to as cgroups, are a Linux kernel feature that allows processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

So, the goal of this proposal is to achieve resource control for worker processes via cgroups on Linux.

### Should this change be within `ray` or outside?

These changes would be within Ray core.
## Stewardship
### Required Reviewers
@ericl, @edoakes, @simon-mo, @chenk008, @raulchen

### Shepherd of the Proposal (should be a senior committer)
@ericl
## Design and Architecture

### Cluster level API
We should add some new system configs for resource control:

- `worker_resource_control_method`: Set to `"cgroup"` by default.
- `cgroup_manager`: Set to `"cgroupfs"` by default. We should also support `systemd`.
- `cgroup_mount_path`: Set to `/sys/fs/cgroup/` by default.
- `cgroup_unified_hierarchy`: Whether to use cgroup v2. Set to `False` by default.
- `cgroup_use_cpuset`: Whether to use cpuset. Set to `False` by default.
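To make the config surface concrete, here is a hypothetical sketch (the constant and helper names are illustrative, not Ray's actual `_system_config` machinery) of how these defaults could be merged with user overrides:

```python
# Hypothetical defaults for the proposed system configs; the key names come
# from this proposal, but the helper itself is illustrative only.
RESOURCE_CONTROL_DEFAULTS = {
    "worker_resource_control_method": "cgroup",
    "cgroup_manager": "cgroupfs",       # or "systemd"
    "cgroup_mount_path": "/sys/fs/cgroup/",
    "cgroup_unified_hierarchy": False,  # True selects cgroup v2
    "cgroup_use_cpuset": False,
}


def resolve_resource_control_config(overrides=None):
    """Merge user overrides into the defaults, rejecting unknown keys."""
    config = dict(RESOURCE_CONTROL_DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in config:
            raise ValueError(f"unknown resource control config: {key}")
        config[key] = value
    return config
```

Rejecting unknown keys early gives users a clear error instead of a silently ignored typo.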
### User level API
#### Simple API using a flag
```python
runtime_env = {
    "enable_resource_control": True,
}
```

```python
from ray.runtime_env import RuntimeEnv
runtime_env = RuntimeEnv(
    enable_resource_control=True
)
```
#### Full set of APIs
```python
runtime_env = {
    "enable_resource_control": True,
    "resource_control_config": {
        "cpu_enabled": True,
        "memory_enabled": True,
        "cpu_strict_usage": False,
        "memory_strict_usage": True,
    }
}
```

```python
from ray.runtime_env import RuntimeEnv, ResourceControlConfig

runtime_env = RuntimeEnv(
    resource_control_config=ResourceControlConfig(
        cpu_enabled=True, memory_enabled=True, cpu_strict_usage=False, memory_strict_usage=True)
)
```
### Implementation
#### Work steps

When we run `ray.init` with a `runtime_env` and `eager_install` is enabled, the main steps are:
- (**Step 1**) The Raylet (Worker Pool) receives the published message that the job has started.
- (**Step 2**) The Raylet sends the `GetOrCreateRuntimeEnv` RPC to the Agent.
- (**Step 3**) The Agent sets up the `runtime_env`. For `resource_control`, the Agent generates a `command_prefix` in the `runtime_env_context` that describes how to enable resource control, e.g. using cgroupfs or systemd. A cgroupfs sample looks like:
  `mkdir /sys/fs/cgroup/{worker_id} && echo "200000 1000000" > /sys/fs/cgroup/{worker_id}/cpu.max && echo {pid} > /sys/fs/cgroup/{worker_id}/cgroup.procs`

When we create a `Task` or `Actor` with a `runtime_env` (or an inherited `runtime_env`), the main steps are:
- (**Step 3**) The worker submits the task.
- (**Step 4**) The `task_spec` is received by the Raylet after scheduling.
- (**Step 5**) The Raylet sends the `GetOrCreateRuntimeEnv` RPC to the Agent.
- (**Step 6**) The Agent generates the `command_prefix` in the `runtime_env_context` and replies to the RPC.
- (**Step 7**) The Raylet starts the new worker process with the `runtime_env_context`.
- (**Step 8**) `setup_worker.py` sets up `resource_control` via the `command_prefix` for the new worker process.
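The `command_prefix` generation described above could be sketched as follows. This is a hypothetical helper (the name and signature are illustrative, not Ray's actual code); it targets cgroup v2 and leaves `{pid}` as a literal placeholder, assuming `setup_worker.py` substitutes the worker's pid later:

```python
def build_cgroupfs_command_prefix(worker_id, num_cpus=None, memory=None,
                                  mount_path="/sys/fs/cgroup"):
    """Build the shell command prefix that creates a cgroup (v2) for a worker.

    num_cpus is a fractional CPU count (e.g. 0.2 -> quota 200000 of a
    1000000us period); memory is a byte limit written to memory.max.
    """
    cgroup_dir = f"{mount_path}/{worker_id}"
    cmds = [f"mkdir -p {cgroup_dir}"]
    if num_cpus is not None:
        period = 1000000  # cgroup v2 cpu.max period, in microseconds
        quota = int(num_cpus * period)
        cmds.append(f'echo "{quota} {period}" > {cgroup_dir}/cpu.max')
    if memory is not None:
        cmds.append(f"echo {memory} > {cgroup_dir}/memory.max")
    # {pid} stays literal here; it is assumed to be filled in at worker start.
    cmds.append(f"echo {{pid}} > {cgroup_dir}/cgroup.procs")
    return " && ".join(cmds)
```

For example, `build_cgroupfs_command_prefix("worker_abc", num_cpus=0.2)` reproduces the sample command shown in Step 3.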
#### Cgroup Manager
The Cgroup Manager is used to create or delete cgroups and to bind worker processes to cgroups. We plan to integrate the Cgroup Manager into the Agent.

We should abstract the Cgroup Manager because there is more than one way to manage cgroups on Linux. The two main ways are cgroupfs and systemd, which are also used in [container technology](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/).
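The abstraction could look like the sketch below. All names are hypothetical (this is not Ray's actual API), and the cgroupfs backend simply writes files under the mount path, which only behaves like a real cgroup when that path is an actual cgroupfs mount:

```python
import os
from abc import ABC, abstractmethod


class CgroupManager(ABC):
    """Hypothetical interface shared by the cgroupfs and systemd backends."""

    @abstractmethod
    def create_cgroup(self, worker_id, limits):
        ...

    @abstractmethod
    def add_process(self, worker_id, pid):
        ...

    @abstractmethod
    def delete_cgroup(self, worker_id):
        ...


class CgroupfsManager(CgroupManager):
    """cgroupfs backend: manipulates the cgroup filesystem directly."""

    def __init__(self, mount_path="/sys/fs/cgroup"):
        self.mount_path = mount_path

    def _dir(self, worker_id):
        return os.path.join(self.mount_path, worker_id)

    def create_cgroup(self, worker_id, limits):
        # limits maps controller files to values, e.g. {"cpu.max": "200000 1000000"}.
        os.makedirs(self._dir(worker_id), exist_ok=True)
        for controller_file, value in limits.items():
            with open(os.path.join(self._dir(worker_id), controller_file), "w") as f:
                f.write(str(value))

    def add_process(self, worker_id, pid):
        with open(os.path.join(self._dir(worker_id), "cgroup.procs"), "a") as f:
            f.write(f"{pid}\n")

    def delete_cgroup(self, worker_id):
        # On a real cgroupfs the directory is removed with rmdir once empty;
        # the file removal below only exists so this sketch runs on a plain fs.
        d = self._dir(worker_id)
        for name in os.listdir(d):
            os.remove(os.path.join(d, name))
        os.rmdir(d)
```

A `SystemdManager` implementing the same interface would then slot in behind the `cgroup_manager` config without changing the Agent's call sites.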
##### cgroupfs
Using cgroupfs to manage cgroups looks like:
```
mkdir /sys/fs/cgroup/{worker_id}
echo "200000 1000000" > /sys/fs/cgroup/{worker_id}/cpu.max
echo {pid} > /sys/fs/cgroup/{worker_id}/cgroup.procs
```
NOTE: This is an example based on cgroup v2. The command lines for cgroup v1 are different and incompatible.

We could also use [libcgroup](https://github.com/libcgroup/libcgroup/blob/main/README) to simplify the implementation. This library supports both cgroup v1 and cgroup v2.
##### systemd
On systemd-based systems, systemd manages the cgroup tree itself, and [its documentation](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/) states that creating and deleting cgroups without going through the systemd API may become unsupported over time. That is the main motivation for a systemd backend. Using systemd to manage cgroups looks like:
```
systemctl set-property {worker_id}.service CPUQuota=20%
systemctl start {worker_id}.service
```

NOTE: The full set of config options is documented [here](https://man7.org/linux/man-pages/man5/systemd.resource-control.5.html). We can also use `StartTransientUnit` to create the cgroup together with the worker process in one step. This is a [dbus](https://www.freedesktop.org/wiki/Software/systemd/dbus/) API, and there is a [dbus-python](https://dbus.freedesktop.org/doc/dbus-python/) module we can use.
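As an aside (this is an assumption about systemd's transient-unit tooling, not something the proposal specifies), the `systemd-run` CLI can create the scope and launch the worker command in a single step, avoiding the separate `set-property`/`start` pair:

```
systemd-run --scope --unit={worker_id} -p CPUQuota=20% -p MemoryMax=1G {worker_command}
```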
##### More references
- [Yarn NodeManagerCgroups](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html)
#### About cgroup v1 and v2
[Cgroup v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt) is more reasonable, but we should also support cgroup v1 because v1 is widely used and has been hard-coded into the [OCI](https://opencontainers.org/) standards. See this [blog](https://www.redhat.com/sysadmin/fedora-31-control-group-v2) for more.

Check whether cgroup v2 has been enabled on your Linux system:
```
mount | grep '^cgroup' | awk '{print $1}' | uniq
```

And you can try to enable cgroup v2:
```
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
reboot
```
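The cgroup version could also be probed programmatically. The helper below is a hypothetical sketch; it assumes the cgroup v2 convention that a `cgroup.controllers` file exists at the root of a unified hierarchy:

```python
import os


def detect_cgroup_version(mount_path="/sys/fs/cgroup"):
    """Return 2 if mount_path looks like a cgroup v2 (unified) hierarchy, else 1.

    This is a heuristic only: some systems mount both hierarchies at once,
    which is one reason the proposal adds an explicit config instead.
    """
    if os.path.isfile(os.path.join(mount_path, "cgroup.controllers")):
        return 2
    return 1
```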
#### How to delete cgroups
When worker processes die, we should delete the cgroups that were created for them. This work is only needed for cgroupfs, because systemd handles it automatically.

So, we should add a new RPC named `CleanForDeadWorker` to `RuntimeEnvService`. The Raylet sends this RPC to the Agent, and the Agent deletes the cgroup.

```
message CleanForDeadWorkerRequest {
  string worker_id = 1;
}

message CleanForDeadWorkerReply {
  AgentRpcStatus status = 1;
  string error_message = 2;
}

service RuntimeEnvService {
  ...
  rpc CleanForDeadWorker(CleanForDeadWorkerRequest)
      returns (CleanForDeadWorkerReply);
}
```
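The Agent-side handling of this RPC could be sketched as follows. This is a hypothetical handler for the cgroupfs case only (the function name is illustrative); the reply fields mirror the proto above:

```python
import os


def clean_for_dead_worker(worker_id, mount_path="/sys/fs/cgroup"):
    """Remove the cgroup created for a dead worker.

    A cgroup directory can only be rmdir'ed once it has no member processes,
    so failures are reported in the reply instead of raised.
    """
    cgroup_dir = os.path.join(mount_path, worker_id)
    try:
        os.rmdir(cgroup_dir)  # cgroup dirs are removed with rmdir, not rm -r
    except FileNotFoundError:
        pass  # already cleaned up; treat as success
    except OSError as e:
        return {"status": "ERROR", "error_message": str(e)}
    return {"status": "OK", "error_message": ""}
```

Making the handler idempotent (a missing directory counts as success) keeps Raylet-side retries safe.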
## Compatibility, Deprecation, and Migration Plan

This proposal does not change any existing APIs or any default behaviors of Ray Core.

## Test Plan and Acceptance Criteria

We plan to benchmark the resources of worker processes:
- CPU soft control: The worker can use idle CPU time beyond its CPU quota.
- CPU hard control: With the config `cpu_strict_usage=True`, the worker cannot exceed its CPU quota.
- Memory soft control: The worker can use idle memory beyond its memory quota.
- Memory hard control: With the config `memory_strict_usage=True`, the worker cannot exceed its memory quota.

Acceptance criteria:
- A set of reasonable APIs.
- A set of reasonable benchmark results.

## (Optional) Follow-on Work

In the first version, we may support only one cgroup manager, based on cgroupfs. We can support a systemd-based cgroup manager in the future.
We can also add control of more resources, like `blkio`, `devices`, and `net_cls`.