Skip to content

Glob Expansion, Presence validation for Provide Artifacts#85

Draft
Kajmany wants to merge 1 commit intomainfrom
pa/artifacts
Draft

Glob Expansion, Presence validation for Provide Artifacts#85
Kajmany wants to merge 1 commit intomainfrom
pa/artifacts

Conversation

@Kajmany
Copy link
Collaborator

@Kajmany Kajmany commented Feb 17, 2026

Our original motivation was allowing authors to use Unix-like globs when specifying artifacts to upload, but this also seems like good opportunity to validate files are present and readable within the challenge directory, where possible.

This is all currently done at build-time, and a subset of the functionality will be called by validate that is not dependent on build-time things.

Will update description/tasks as this is closer to being finished

Closes #54

TODO:

  • Integrate with validate, redundant operation is O.K. as discussed
  • Update 'repo' example folder to work with validate
  • Integrate with build process
  • Integrate with upload
  • Test upload integration, potentially just manually
  • Allow globs for in-repo extraction
  • Validate in-repo paths are files located in-repo
  • Allow in-repo folder paths
  • Unit tests using figment that do file-system interaction
  • Don't over-write files in repo during artifact extraction (allow over-ride?)
  • Allow globs from container provides (potentially has to be inefficient?)
  • Allow folder upload from container provides (maybe better)
  • Last pass: clean up unused, quick doc pass on entire artifact module

using new-type semantics for expanded/validated pahts with a loose contract and exporting to to the upload process
@Kajmany
Copy link
Collaborator Author

Kajmany commented Feb 18, 2026

Asking for an early review to validate the architecture, and maybe style. Not asking for a look at the functionality because it's untested and almost certainly broken -- I'll make sure it passes new tests and the context from fail-state is usable before asking for that review.

My understanding of validate is it currently is called before we do our business, and is kept kind of separate. Globbing is a problem because we can't validate until after we're done globbing. I think any implementation will inevitably clash with the current model because of that.

What I'm trying this time: New-type wrapper on Paths that we've glob-expanded and validated. TOCTOU is a huge issue if we were to try to store these long-term, but I think our only current workflow is a pretty direct build & upload, which should be fine? Ideally, internal functions would require the new-type while friction with exposing them to external functions like fs::copy would be minimized.

Big question is whether this will be O.K. to slap in with the current way we do validation, or perhaps we could even cut scope if we can't easily think of a cleaner way. I personally like offering a contract with a wrapping type, but I'm not sure if it fits in well enough.

Copy link
Collaborator Author

@Kajmany Kajmany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotated smaller bits I'm curious about. Just asking for a skim for now.

Comment on lines +360 to +366
// FIXME: this was a quick hack but other semantics might be preferable
// need to make it easy to expose internal for exernal APIs
impl AsRef<Path> for InputPath {
fn as_ref(&self) -> &Path {
&self.0
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a quick and dirty way to expose the internals. Wondering if there's better semantics? Ideally we'd remove friction with passing it to external APIs that want &Path, but not allow internal functions that should require &InputPath

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense for how to handle the conversion to me!

chal.directory, &provide
);

let docker = docker().await?;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this do anything? Couldn't tell if it's just unused or if it warms up the daemon or something

Copy link
Contributor

@detjensrobert detjensrobert Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks unused. Probably a holdover from before the artifacts/docker were split out. The daemon client will get instantiated on first use regardless of where its called from so there's no reason for it to be done if its not used.

Comment on lines +279 to +328
// TODO: verify readability (can stat even if we can't read, duh)

/// For **local** (non-container) asset paths which expands globs
/// but blocks references that are outside the base config directory.
///
/// Recursively descends into folders found as a result of globbing or single specification
/// and accumulates each result as a separate path.
// Ow! my stack!
///
/// Folders, files, and symlinks to them are accepted. All symlinks will be
/// resolved and replaced in returned paths.
///
/// ---
/// Glob-matching is attempted only on paths which do not exist.
pub fn process_all_paths(chal_dir: &Path, paths: &[PathBuf]) -> Result<Vec<InputPath>> {
// anything above this dir would be leaving the directory
let canonical_chal_dir = fs::canonicalize(chal_dir)?;
let base_cfg_dir = canonical_chal_dir
.parent()
.and_then(|p| p.parent())
.ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
match std::fs::canonicalize(path) {
// Assume paths we can resolve aren't globs
Ok(p) => {
let stat = fs::metadata(&p)?;
if stat.is_dir() {
// Recursively run on directory contents, hope we don't explode
let dentry_paths = fs::read_dir(&p)?.map(|dr| dr.map(|d| d.path())).collect::<Result<Vec<PathBuf>, _>>()?;
acc.extend(process_all_paths(chal_dir, &dentry_paths)?);
} else if stat.is_file() {
acc.push(InputPath::fast_validate_single(&base_cfg_dir, &p)?);
} else {
bail!("Expected canonicalized asset path to be file or directory but {p:?} is not");
}
Ok(acc)
}
// Try paths we can't locate as globs before failing
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
let abs_path = std::path::absolute(path)?;
let expanded = expand_one_glob(&abs_path)?;
acc.extend(process_all_paths(chal_dir, &expanded)?);
Ok(acc)
}
Err(e) => Err(e).with_context(|| format!("failed to canonicalize path {:?}", path)),
}
})?;
Ok(resolved)
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to naturally be a mess. Idea is to scrape folders, follow symlinks, and process globs in a recursive manner. There's external libraries to handle the folders, but I figured integrating them would be more mess than iterating dentries, and probably not that much safer?

Definitely incomplete, BTW. Wouldn't look at it seriously until unit testing is done, but drawing attention to it now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a couple readings to fully parse what this was doing -- would it be cleaner to have the input/outputpath helpers return a enum of PathExists/PathFree/Error to represent the three cases here? it feels like we're mixing the path validation in to these different places instead of handling the check in one place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is currently "failing fast" and validating as we go -- I thought this would be more readable (perf wasn't really the concern) but it sounds like it's doing the opposite.

I'm going to basically re-write this function and most likely go with making validation separate from the path collection, because it sounds like that'll be more legible. The only problem I can think of that would introduce is accidentally specifying a directory outside the challenge and getting bogged down. I'll probably use a helper that keeps some state on depth/items accumulated to protect the stack/time for that case (and absurd crawls in-repo, too)

I will keep in mind your other comments on style & itertools during. Thanks!

"assets/{chal_slug}/{file}",
chal_slug = chal.directory.to_string_lossy(),
file = file.file_name().unwrap().to_string_lossy()
file = file.as_ref().file_name().unwrap().to_string_lossy()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the branch is behind, but is there more needed to integrate with upload? Only other effective change I can think of is that InputPath is guaranteed to be canonicalized.

Copy link
Contributor

@detjensrobert detjensrobert Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this should be just fine. Since this is only pulling off the filename the canonicalization doesn't matter.

Copy link
Contributor

@detjensrobert detjensrobert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would break out the new structs and glob helpers into another module here since the artifacts file is getting pretty large with all this.

/// A path to a readable file that existed within the base directory.
///
/// Vulnerable to TOCTOU issues. Create and consume as close to usage as possible.
pub struct InputPath(PathBuf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to name this ValidatedPath or ExpandedPath since this is what is returned post-validation. InputPath reads more as a un-validated version of Path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After poking around I see the Input here refers to input files for the archive/rename -- maybe ValidSourcePath and ValidDestPath then?

///
/// Recursively descends into folders found as a result of globbing or single specification
/// and accumulates each result as a separate path.
// Ow! my stack!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh!

Comment on lines +383 to +386
// any file at this path is a no-no
if fs::symlink_metadata(&abs_output).is_ok() {
bail!("Output path {output:?} already exists");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user should be able to force overwrite with a CLI flag, like when rerunning and it would be clobbering previously generated files.

.ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
match std::fs::canonicalize(path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the only place we are trying to canonicalize the path be in the path validate function? Feels weird to be doing it multiple times. Can we use that here or break that out from the structs into a separate helper?

Comment on lines +260 to +264
files.iter().try_fold(Vec::new(), |mut acc, f| {
let paths = expand_one_glob(f)?;
acc.extend(paths);
Ok::<Vec<PathBuf>, anyhow::Error>(acc)
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map().flatten_ok().collect() I think looks cleaner than try_fold with scattered extends() in the middle:

Suggested change
files.iter().try_fold(Vec::new(), |mut acc, f| {
let paths = expand_one_glob(f)?;
acc.extend(paths);
Ok::<Vec<PathBuf>, anyhow::Error>(acc)
})
files
.iter()
.map(|f| expand_one_glob(f))
.flatten_ok()
.collect()

// but we're not exporting these bad semantics outside this module to be fair
let pattern = file.to_string_lossy();
// TODO: will need more context!
let paths = glob::glob(&pattern)?.collect::<Result<Vec<_>, _>>()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Itertools has try_collect for this exact thing ;):

Suggested change
let paths = glob::glob(&pattern)?.collect::<Result<Vec<_>, _>>()?;
let paths = glob::glob(&pattern)?.try_collect()?;

.and_then(|p| p.parent())
.ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on the fold/map here

Comment on lines +279 to +328
// TODO: verify readability (can stat even if we can't read, duh)

/// For **local** (non-container) asset paths which expands globs
/// but blocks references that are outside the base config directory.
///
/// Recursively descends into folders found as a result of globbing or single specification
/// and accumulates each result as a separate path.
// Ow! my stack!
///
/// Folders, files, and symlinks to them are accepted. All symlinks will be
/// resolved and replaced in returned paths.
///
/// ---
/// Glob-matching is attempted only on paths which do not exist.
pub fn process_all_paths(chal_dir: &Path, paths: &[PathBuf]) -> Result<Vec<InputPath>> {
// anything above this dir would be leaving the directory
let canonical_chal_dir = fs::canonicalize(chal_dir)?;
let base_cfg_dir = canonical_chal_dir
.parent()
.and_then(|p| p.parent())
.ok_or_else(|| anyhow!("Project root not found (CWD is too shallow)"))?;

let resolved: Vec<InputPath> = paths.iter().try_fold(Vec::new(), |mut acc, path| {
match std::fs::canonicalize(path) {
// Assume paths we can resolve aren't globs
Ok(p) => {
let stat = fs::metadata(&p)?;
if stat.is_dir() {
// Recursively run on directory contents, hope we don't explode
let dentry_paths = fs::read_dir(&p)?.map(|dr| dr.map(|d| d.path())).collect::<Result<Vec<PathBuf>, _>>()?;
acc.extend(process_all_paths(chal_dir, &dentry_paths)?);
} else if stat.is_file() {
acc.push(InputPath::fast_validate_single(&base_cfg_dir, &p)?);
} else {
bail!("Expected canonicalized asset path to be file or directory but {p:?} is not");
}
Ok(acc)
}
// Try paths we can't locate as globs before failing
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
let abs_path = std::path::absolute(path)?;
let expanded = expand_one_glob(&abs_path)?;
acc.extend(process_all_paths(chal_dir, &expanded)?);
Ok(acc)
}
Err(e) => Err(e).with_context(|| format!("failed to canonicalize path {:?}", path)),
}
})?;
Ok(resolved)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a couple readings to fully parse what this was doing -- would it be cleaner to have the input/outputpath helpers return a enum of PathExists/PathFree/Error to represent the three cases here? it feels like we're mixing the path validation in to these different places instead of handling the check in one place.

Comment on lines +360 to +366
// FIXME: this was a quick hack but other semantics might be preferable
// need to make it easy to expose internal for exernal APIs
impl AsRef<Path> for InputPath {
fn as_ref(&self) -> &Path {
&self.0
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense for how to handle the conversion to me!

impl InputPath {
/// Yield canonical path if the input resolves to one readable file
/// located within base repo directory
pub fn validate_single(base: &Path, input: &Path) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are making an InputPath from the source, validate_from() might be a better name. validate_single reads as if it is not returning anything and from() is the rust custom for this sort of conversion.

@detjensrobert
Copy link
Contributor

What I'm trying this time: New-type wrapper on Paths that we've glob-expanded and validated. TOCTOU is a huge issue if we were to try to store these long-term, but I think our only current workflow is a pretty direct build & upload, which should be fine? Ideally, internal functions would require the new-type while friction with exposing them to external functions like fs::copy would be minimized.

Big question is whether this will be O.K. to slap in with the current way we do validation, or perhaps we could even cut scope if we can't easily think of a cleaner way. I personally like offering a contract with a wrapping type, but I'm not sure if it fits in well enough.

The type enforcement makes sense to me moreso than doing the validation in validate() -- though we could probably confirm the presence of the files then it would be tricky to edit or recreate the struct then with the validated types. I think doing static validation (checking presence of repo files without updating the config at all) fits better with our current style of validation, but using a type wrapper is more idiomatic Rust I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Challenge provide: paths should allow globs

2 participants