Glob Expansion, Presence validation for Provide Artifacts #85
Using new-type semantics for expanded/validated paths with a loose contract, and exporting to the upload process.
Asking for an early review to validate the architecture, and maybe style. Not asking for a look at the functionality because it's untested and almost certainly broken -- I'll make sure it passes new tests and the context from fail-state is usable before asking for that review.

My understanding of `validate` is that it's currently called before we do our business, and is kept somewhat separate. Globbing is a problem because we can't validate until after we're done globbing. I think any implementation will inevitably clash with the current model because of that.

What I'm trying this time: a new-type wrapper on paths that we've glob-expanded and validated. TOCTOU is a huge issue if we were to try to store these long-term, but I think our only current workflow is a pretty direct build & upload, which should be fine? Ideally, internal functions would require the new-type while keeping friction low when exposing them to external functions.

Big question is whether this will be O.K. to slap in with the current way we do validation, or perhaps we could even cut scope if we can't easily think of a cleaner way. I personally like offering a contract with a wrapping type, but I'm not sure if it fits in well enough.
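For readers skimming the thread, the contract idea sketches out roughly like this. A minimal sketch with hypothetical names (`ValidatedPath`, `validate`) and placeholder containment logic, not the PR's actual implementation; the real version would also canonicalize and check readability:

```rust
use std::path::{Path, PathBuf};

/// Newtype whose existence proves the wrapped path passed validation.
/// Internal functions that take `&ValidatedPath` inherit that guarantee.
pub struct ValidatedPath(PathBuf);

impl ValidatedPath {
    /// Hypothetical constructor: only paths under `base` are accepted.
    pub fn validate(base: &Path, candidate: &Path) -> Option<Self> {
        candidate
            .starts_with(base)
            .then(|| ValidatedPath(candidate.to_path_buf()))
    }
}

/// Low-friction hand-off to external APIs that want `&Path`.
impl AsRef<Path> for ValidatedPath {
    fn as_ref(&self) -> &Path {
        &self.0
    }
}
```

Because the wrapper's field is private, the only way to obtain one is through the validating constructor, which is the whole contract.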
Kajmany
left a comment
Annotated smaller bits I'm curious about. Just asking for a skim for now.
```rust
// FIXME: this was a quick hack but other semantics might be preferable
// need to make it easy to expose internals for external APIs
impl AsRef<Path> for InputPath {
    fn as_ref(&self) -> &Path {
        &self.0
    }
}
```
This was a quick and dirty way to expose the internals. Wondering if there are better semantics? Ideally we'd remove friction when passing it to external APIs that want `&Path`, without losing the requirement that internal functions take `&InputPath`.
This makes sense for how to handle the conversion to me!
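In practice the `AsRef` impl means external APIs that are generic over path-like arguments accept the newtype directly, while internal signatures still demand the wrapper. A sketch with hypothetical helper names (`external_api`, `internal_only`):

```rust
use std::path::{Path, PathBuf};

pub struct InputPath(PathBuf);

impl AsRef<Path> for InputPath {
    fn as_ref(&self) -> &Path {
        &self.0
    }
}

/// Stand-in for an external API that is generic over path-like
/// arguments, the way many std::fs functions are.
fn external_api(p: impl AsRef<Path>) -> String {
    p.as_ref().display().to_string()
}

/// Internal function: the parameter type itself is the contract.
fn internal_only(p: &InputPath) -> &Path {
    p.as_ref()
}
```

Callers can pass an `InputPath` wherever `impl AsRef<Path>` is expected, but cannot pass a bare `PathBuf` where `&InputPath` is required.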
```rust
    chal.directory, &provide
);

let docker = docker().await?;
```
Does this do anything? Couldn't tell if it's just unused or if it warms up the daemon or something
Yeah, looks unused. Probably a holdover from before the artifacts/docker were split out. The daemon client will get instantiated on first use regardless of where it's called from, so there's no reason to call it here if it's not used.
```rust
// TODO: verify readability (can stat even if we can't read, duh)

/// For **local** (non-container) asset paths which expands globs
/// but blocks references that are outside the base config directory.
///
/// Recursively descends into folders found as a result of globbing or single specification
/// and accumulates each result as a separate path.
// Ow! my stack!
///
/// Folders, files, and symlinks to them are accepted. All symlinks will be
/// resolved and replaced in returned paths.
///
/// ---
/// Glob-matching is attempted only on paths which do not exist.
pub fn process_all_paths(chal_dir: &Path, paths: &[PathBuf]) -> Result<Vec<InputPath>> {
    // anything above this dir would be leaving the directory
    let canonical_chal_dir = fs::canonicalize(chal_dir)?;
    let base_cfg_dir = canonical_chal_dir
        .parent()
        .and_then(|p| p.parent())
        .ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

    let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
        match std::fs::canonicalize(path) {
            // Assume paths we can resolve aren't globs
            Ok(p) => {
                let stat = fs::metadata(&p)?;
                if stat.is_dir() {
                    // Recursively run on directory contents, hope we don't explode
                    let dentry_paths = fs::read_dir(&p)?
                        .map(|dr| dr.map(|d| d.path()))
                        .collect::<Result<Vec<PathBuf>, _>>()?;
                    acc.extend(process_all_paths(chal_dir, &dentry_paths)?);
                } else if stat.is_file() {
                    acc.push(InputPath::fast_validate_single(&base_cfg_dir, &p)?);
                } else {
                    bail!("Expected canonicalized asset path to be file or directory but {p:?} is not");
                }
                Ok(acc)
            }
            // Try paths we can't locate as globs before failing
            Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
                let abs_path = std::path::absolute(path)?;
                let expanded = expand_one_glob(&abs_path)?;
                acc.extend(process_all_paths(chal_dir, &expanded)?);
                Ok(acc)
            }
            Err(e) => Err(e).with_context(|| format!("failed to canonicalize path {:?}", path)),
        }
    })?;
    Ok(resolved)
}
```
This is going to naturally be a mess. The idea is to scrape folders, follow symlinks, and process globs in a recursive manner. There are external libraries to handle the folders, but I figured integrating them would be more of a mess than iterating dentries, and probably not that much safer?
Definitely incomplete, BTW. Wouldn't look at it seriously until unit testing is done, but drawing attention to it now.
It took me a couple readings to fully parse what this was doing -- would it be cleaner to have the input/output-path helpers return an enum of PathExists/PathFree/Error to represent the three cases here? It feels like we're mixing the path validation into these different places instead of handling the check in one place.
It is currently "failing fast" and validating as we go -- I thought this would be more readable (perf wasn't really the concern), but it sounds like it's doing the opposite.
I'm going to basically re-write this function and most likely make validation separate from the path collection, because it sounds like that'll be more legible. The only problem I can think of that this would introduce is accidentally specifying a directory outside the challenge and getting bogged down. I'll probably use a helper that keeps some state on depth/items accumulated to protect the stack/time for that case (and absurd crawls in-repo, too).
I will keep your other comments on style & itertools in mind during the rewrite. Thanks!
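For reference, the three-way enum suggested above could look something like this. A sketch with hypothetical names (`PathProbe`, `probe`), not the planned implementation:

```rust
use std::io;
use std::path::{Path, PathBuf};

/// One classification for the three outcomes the match currently
/// distinguishes inline, so callers branch in a single place.
enum PathProbe {
    /// The path resolved; carries its canonical form.
    Exists(PathBuf),
    /// Nothing exists at the path; it may be a glob pattern.
    Free,
    /// Any other I/O failure while resolving.
    Error(io::Error),
}

fn probe(path: &Path) -> PathProbe {
    match std::fs::canonicalize(path) {
        Ok(p) => PathProbe::Exists(p),
        Err(e) if e.kind() == io::ErrorKind::NotFound => PathProbe::Free,
        Err(e) => PathProbe::Error(e),
    }
}
```

With this shape, `process_all_paths` would match on `PathProbe` and never inspect `io::ErrorKind` itself, keeping the existence check in one place.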
```diff
 "assets/{chal_slug}/{file}",
 chal_slug = chal.directory.to_string_lossy(),
-file = file.file_name().unwrap().to_string_lossy()
+file = file.as_ref().file_name().unwrap().to_string_lossy()
```
I know the branch is behind, but is there more needed to integrate with upload? Only other effective change I can think of is that InputPath is guaranteed to be canonicalized.
Yeah, this should be just fine. Since this is only pulling off the filename, the canonicalization doesn't matter.
detjensrobert
left a comment
I would break out the new structs and glob helpers into another module here since the artifacts file is getting pretty large with all this.
```rust
/// A path to a readable file that existed within the base directory.
///
/// Vulnerable to TOCTOU issues. Create and consume as close to usage as possible.
pub struct InputPath(PathBuf);
```
I'd prefer to name this ValidatedPath or ExpandedPath since this is what is returned post-validation. InputPath reads more as an un-validated version of Path.
After poking around I see the Input here refers to input files for the archive/rename -- maybe ValidSourcePath and ValidDestPath then?
```rust
///
/// Recursively descends into folders found as a result of globbing or single specification
/// and accumulates each result as a separate path.
// Ow! my stack!
```
```rust
// any file at this path is a no-no
if fs::symlink_metadata(&abs_output).is_ok() {
    bail!("Output path {output:?} already exists");
}
```
The user should be able to force overwrite with a CLI flag, like when rerunning and it would be clobbering previously generated files.
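One way the guard could accommodate that, sketched with a hypothetical `force` parameter wired to a CLI flag (names assumed, not the repo's actual CLI):

```rust
use std::path::Path;

/// Refuse to clobber an existing output unless the user opted in;
/// `force` would come from a CLI flag such as `--force`.
fn check_output(output: &Path, force: bool) -> Result<(), String> {
    // symlink_metadata succeeds for files, dirs, and dangling symlinks alike
    if std::fs::symlink_metadata(output).is_ok() && !force {
        return Err(format!(
            "Output path {output:?} already exists (pass --force to overwrite)"
        ));
    }
    Ok(())
}
```

Keeping the flag a plain `bool` parameter keeps the path logic testable without pulling argument parsing into this module.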
```rust
    .ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
    match std::fs::canonicalize(path) {
```
Should the only place we are trying to canonicalize the path be in the path validate function? It feels weird to be doing it multiple times. Can we use that here, or break it out from the structs into a separate helper?
```rust
files.iter().try_fold(Vec::new(), |mut acc, f| {
    let paths = expand_one_glob(f)?;
    acc.extend(paths);
    Ok::<Vec<PathBuf>, anyhow::Error>(acc)
})
```
`map().flatten_ok().collect()` I think looks cleaner than `try_fold` with `extend()`s scattered in the middle:
```diff
-files.iter().try_fold(Vec::new(), |mut acc, f| {
-    let paths = expand_one_glob(f)?;
-    acc.extend(paths);
-    Ok::<Vec<PathBuf>, anyhow::Error>(acc)
-})
+files
+    .iter()
+    .map(|f| expand_one_glob(f))
+    .flatten_ok()
+    .collect()
```
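To see the shape of the suggestion without the itertools dependency: `flatten_ok` adapts an iterator of `Result<Vec<T>, E>` into one of `Result<T, E>` lazily, while a std-only equivalent buffers the inner `Vec`s and then flattens. A runnable sketch with a hypothetical stand-in for `expand_one_glob`:

```rust
/// Hypothetical stand-in for expand_one_glob: each expansion can fail.
fn expand_one(pattern: &str) -> Result<Vec<String>, String> {
    if pattern.is_empty() {
        return Err("empty pattern".to_string());
    }
    Ok(vec![format!("{pattern}.a"), format!("{pattern}.b")])
}

/// itertools' flatten_ok does this in one lazy adapter; with std alone
/// you can collect the nested Vecs (short-circuiting on Err) and flatten.
fn expand_all(patterns: &[&str]) -> Result<Vec<String>, String> {
    let nested: Vec<Vec<String>> = patterns
        .iter()
        .map(|p| expand_one(p))
        .collect::<Result<_, _>>()?;
    Ok(nested.into_iter().flatten().collect())
}
```

Either way, the first `Err` stops the whole expansion, matching the `?`-per-item behavior of the original `try_fold`.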
```rust
// but we're not exporting these bad semantics outside this module to be fair
let pattern = file.to_string_lossy();
// TODO: will need more context!
let paths = glob::glob(&pattern)?.collect::<Result<Vec<_>, _>>()?;
```
Itertools has try_collect for this exact thing ;):
```diff
-let paths = glob::glob(&pattern)?.collect::<Result<Vec<_>, _>>()?;
+let paths = glob::glob(&pattern)?.try_collect()?;
```
```rust
    .and_then(|p| p.parent())
    .ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
```
Ditto on the fold/map here
```rust
impl InputPath {
    /// Yield canonical path if the input resolves to one readable file
    /// located within base repo directory
    pub fn validate_single(base: &Path, input: &Path) -> Result<Self> {
```
Since these are making an InputPath from the source, validate_from() might be a better name. validate_single reads as if it is not returning anything, and from() is the Rust custom for this sort of conversion.
The type enforcement makes sense to me, more so than doing the validation in
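The naming suggestion could land roughly like this; a sketch with placeholder containment logic, not the PR's real checks. (Note `TryFrom` itself takes a single argument, so with a separate `base` parameter an associated constructor in the `try_from` spirit fits better than the trait.)

```rust
use std::path::{Path, PathBuf};

pub struct InputPath(PathBuf);

impl InputPath {
    /// `validate_from` reads as a fallible conversion in the spirit of
    /// `from`/`try_from`; the containment check here is placeholder logic.
    pub fn validate_from(base: &Path, input: &Path) -> Result<Self, String> {
        if input.starts_with(base) {
            Ok(InputPath(input.to_path_buf()))
        } else {
            Err(format!("{input:?} escapes base directory {base:?}"))
        }
    }
}
```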
Our original motivation was allowing authors to use Unix-like globs when specifying artifacts to upload, but this also seems like a good opportunity to validate that files are present and readable within the challenge directory, where possible.
This is all currently done at build-time, and a subset of the functionality will be called by `validate` that is not dependent on build-time things. Will update description/tasks as this is closer to being finished.
Closes #54
TODO: