Skip to content

Reduce bind mounting (and thus --bind args) in containers. #1387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from
12 changes: 10 additions & 2 deletions cwltool/docker.py
Original file line number Diff line number Diff line change
Expand Up @@ -251,6 +251,13 @@ def append_volume(
runtime: List[str], source: str, target: str, writable: bool = False
) -> None:
"""Add binding arguments to the runtime list."""
if (
os.path.isfile(source)
and target.endswith(os.path.basename(source))
and not writable
):
source = os.path.dirname(source)
target = os.path.dirname(target)
options = [
"type=bind",
"source=" + source,
Expand All @@ -260,8 +267,9 @@ def append_volume(
options.append("readonly")
output = StringIO()
csv.writer(output).writerow(options)
mount_arg = output.getvalue().strip()
runtime.append("--mount={}".format(mount_arg))
mount_arg = f"--mount={output.getvalue().strip()}"
if mount_arg not in runtime:
runtime.append(mount_arg)
# Unlike "--volume", "--mount" will fail if the volume doesn't already exist.
if not os.path.exists(source):
os.makedirs(source)
Expand Down
4 changes: 3 additions & 1 deletion cwltool/pathmapper.py
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,9 @@ def setup(self, referenced_files: List[CWLObjectType], basedir: str) -> None:
stagedir = self.stagedir
for fob in referenced_files:
if self.separateDirs:
stagedir = os.path.join(self.stagedir, "stg%s" % uuid.uuid4())
stagedir = os.path.join(self.stagedir,
"s5g%s" % uuid.uuid5(uuid.UUID('6a56ca02-b6f0-4c1a-a4b0-fb0068ce80ad'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

Copy link
Contributor Author

@DailyDreaming DailyDreaming Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuid5 is deterministic. This groups together files that were previously grouped together by providing the same uuid for each file with the same os.path.dirname, so it violates completely separating files. This should avoid conflicts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! Why s5g instead of stg?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you be passing something to uuid.uuid5 related to the fob.location?

Copy link
Contributor Author

@DailyDreaming DailyDreaming Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed s5g just in case this needs to distinguish behavior in the future based on versions. Though it's not necessary.

And fob['location'] is line 176, the comment blurb doesn't make it visible here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So "s5g%s" % uuid.uuid5(uuid.UUID('6a56ca02-b6f0-4c1a-a4b0-fb0068ce80ad' generates a static string that will never change, to which you make a subdir based upon the directory name of the location field. So why not skip to the directory name itself? Or use it to see the uuid5 generator?

What if there are location URIs that end in the same directory name but are otherwise unrelated? This could lead to files overwriting each other. Or files being adjacent to one another when they should not be.

Copy link
Contributor Author

@DailyDreaming DailyDreaming Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go through my thinking on this, and please point out any flaws.

Uuid5 generates different uuids for different seeds, the seed in this PR being both the file's directory and a static uuid. So uuid.uuid5(arg1, arg2) = uuid_a and uuid.uuid5(arg1, arg3) = uuid_b. The difference is that if uuid.uuid5(arg1, arg2) = uuid_a occurs again, it will generate the same uuid_a output.

>>> from uuid import uuid5, UUID
>>> uuid5(UUID('6a56ca02-b6f0-4c1a-a4b0-fb0068ce80ad'), '/tmp/1')
UUID('e3e49c59-338c-5369-a215-3d68bda76d69')
>>> uuid5(UUID('6a56ca02-b6f0-4c1a-a4b0-fb0068ce80ad'), '/tmp/2')
UUID('48220b67-6c36-500b-ae08-ecfa48951864')
>>> uuid5(UUID('6a56ca02-b6f0-4c1a-a4b0-fb0068ce80ad'), '/tmp')
UUID('98da1ac5-114d-539b-b727-948e7754ec95')
>>> uuid5(UUID('6a56ca02-b6f0-4c1a-a4b0-fb0068ce80ad'), '/tmp/1')
UUID('e3e49c59-338c-5369-a215-3d68bda76d69')

The idea being that all pre-existing paths cannot conflict since they co-existed together peacefully in the first place.

So files:

/tmp/a/b.txt
/tmp/b/a.txt
/tmp/b/b.txt
/tmp/c/b.txt

would map to these inside of the container:

uuid_a/b.txt
uuid_b/a.txt
uuid_b/b.txt
uuid_c/b.txt

So why not skip to the directory name itself?

If I understand your question correctly, we could use the same directory names as the files originally had, but they might conflict with folders that already exist within the container.

Or use it to see the uuid5 generator?

If I understand this to be to "seed" the uuid5 generator with the directory, I'm using os.path.dirname(fob['location'])) to seed the uuid5. Though I'm not sure if that's the question here.

What if there are location URIs that end in the same directory name but are otherwise unrelated? This could lead to files overwriting each other. Or files being adjacent to one another when they should not be.

I was wondering about this, and was trying to reason if this could happen if the input files were writable in a an input dir outside of the workdir. I haven't tested this.

Some of this too is that I tried to switch to a simpler strategy and it may just not work.

os.path.dirname(fob['location'])))
self.visit(
fob,
stagedir,
Expand Down
27 changes: 15 additions & 12 deletions cwltool/singularity.py
Original file line number Diff line number Diff line change
Expand Up @@ -282,14 +282,19 @@ def get_from_requirements(
def append_volume(
runtime: List[str], source: str, target: str, writable: bool = False
) -> None:
runtime.append("--bind")
runtime.append(
"{}:{}:{}".format(
docker_windows_path_adjust(source),
docker_windows_path_adjust(target),
"rw" if writable else "ro",
)
)
src = docker_windows_path_adjust(source)
dst = docker_windows_path_adjust(target)
writable_flag = "rw" if writable else "ro"
if (
os.path.isfile(src)
and dst.endswith(os.path.basename(src))
and writable_flag == "ro"
):
src = os.path.dirname(src)
dst = os.path.dirname(dst)
bind_arg = f"--bind={src}:{dst}:{writable_flag}"
if bind_arg not in runtime:
Copy link
Contributor

@ionox0 ionox0 Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried running this PR after getting the argument list too long error when using large number of input directories, and it seems that the source directory here is actually different for each file, and therefore it did not reduce the number of --bind arguments:

...
--bind=/juno/work/access/testing/users/johnsoni/access_qc/5500_HF/jobStore/files/for-job/kind-CWLJob/instance-aop8iz_w/file-34cfcc9ed208473bb67710a5108f1a32:/var/lib/cwl/s5g5ff703f1-41b2-56d3-93b3-7e36c1a37224/all_qc_files/duplex_bam_sequence_qc_dir/C-R0CLHU-N001-d:ro \
--bind=/juno/work/access/testing/users/johnsoni/access_qc/5500_HF/jobStore/files/for-job/kind-CWLJob/instance-aop8iz_w/file-fee328e946c9448c92ee501c6daa4b5a:/var/lib/cwl/s5g5ff703f1-41b2-56d3-93b3-7e36c1a37224/all_qc_files/duplex_bam_sequence_qc_dir/C-R0CLHU-N001-d:ro \
...

It seems toil actually produces different parent directories for files that might have originally came from the same Directory object. So if files that came from the same directory can be stored as such in the jobstore, I would expect this change to fix the issue

runtime.append(bind_arg)

def add_file_or_directory_volume(
self, runtime: List[str], volume: MapperEnt, host_outdir_tgt: Optional[str]
Expand Down Expand Up @@ -403,17 +408,15 @@ def create_runtime(
)
)
else:
runtime.append("--bind")
runtime.append(
"{}:{}:rw".format(
"--bind={}:{}:rw".format(
docker_windows_path_adjust(os.path.realpath(self.outdir)),
self.builder.outdir,
)
)
runtime.append("--bind")
tmpdir = "/tmp" # nosec
runtime.append(
"{}:{}:rw".format(
"--bind={}:{}:rw".format(
docker_windows_path_adjust(os.path.realpath(self.tmpdir)), tmpdir
)
)
Expand Down