make DiB_fileStats skip invalid files (fileSize <= 0) to prevent negative totals and bogus allocation #4487
Summary
- Fixes an inconsistency between DiB_fileStats() and DiB_loadFiles() that could drive totalSizeToLoad negative when invalid inputs are present (e.g., broken symlinks, disappearing files, IO errors).
- If totalSizeToLoad goes negative, it later gets cast to size_t and wraps to a huge value, leading to large allocation sizes and “not enough memory for DiB_trainFiles”.
- Aligns DiB_fileStats() with DiB_loadFiles(): both now skip non-positive file sizes.

To be 100% transparent: you need a pretty pathological case to see crashes, to the point where something is severely wrong with your sample dir. I probably wouldn't put up a PR for this normally, but since it is possible to not crash and still get slightly incorrect totals, I thought it was worth it.
Motivation and background
While investigating dictionary training hanging on empty files, I found that my environment (conda-forge zstd labeled v1.5.2) didn’t include the upstream fix for the empty-file infinite loop (#3081). It took me a while to figure that out, and while investigating I looked more broadly at dibio.c and noticed this possible issue.
The bug fixed in this PR did not affect me (it's unrelated to my empty-file hang); I just happened to notice it via code inspection.
Root cause
- DiB_getFileSize() returns the file's size on success, zero for empty files, or a negative value (-1) for entries it cannot stat.
- DiB_loadFiles() correctly skips files with fileSize <= 0.
- DiB_fileStats(), however, only skipped fileSize == 0. It counted negative sizes as samples and added MIN(fileSize, SAMPLESIZE_MAX) to totalSizeToLoad. For fileSize == -1, this subtracts 1 from totalSizeToLoad per invalid entry and inflates the expected sample count. (See the sketch below.)
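For illustration, here is a minimal compilable sketch of the accounting loop before and after the fix. This is not the actual patch: the helper getFileSizeOrNegative(), the totalToLoad_before/after names, and the SAMPLESIZE_MAX value are stand-ins, and the real dibio.c code also handles chunking and other bookkeeping.

```c
#include <stdio.h>
#include <sys/stat.h>

typedef long long S64;              /* stand-in for zstd's S64 */
#define SAMPLESIZE_MAX (128 * 1024) /* illustrative cap, not necessarily zstd's value */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Stand-in for DiB_getFileSize(): size on success, -1 for entries
 * that cannot be stat'ed (broken symlink, vanished file, IO error). */
static S64 getFileSizeOrNegative(const char* fileName)
{
    struct stat st;
    if (stat(fileName, &st) != 0) return -1;
    return (S64)st.st_size;
}

/* Pre-fix accounting: only empty files are skipped. */
static S64 totalToLoad_before(char** fileNames, int nbFiles)
{
    S64 total = 0;
    int n;
    for (n = 0; n < nbFiles; n++) {
        S64 const fileSize = getFileSizeOrNegative(fileNames[n]);
        if (fileSize == 0) continue;  /* bug: fileSize == -1 slips through */
        /* MIN(-1, SAMPLESIZE_MAX) == -1: each invalid entry decrements
         * the total and is still counted as a sample */
        total += MIN(fileSize, SAMPLESIZE_MAX);
    }
    return total;
}

/* Post-fix accounting: skip all non-positive sizes, matching DiB_loadFiles(). */
static S64 totalToLoad_after(char** fileNames, int nbFiles)
{
    S64 total = 0;
    int n;
    for (n = 0; n < nbFiles; n++) {
        S64 const fileSize = getFileSizeOrNegative(fileNames[n]);
        if (fileSize <= 0) continue;  /* fix: invalid entries no longer counted */
        total += MIN(fileSize, SAMPLESIZE_MAX);
    }
    return total;
}

int main(int argc, char** argv)
{
    printf("before: %lld, after: %lld\n",
           totalToLoad_before(argv + 1, argc - 1),
           totalToLoad_after(argv + 1, argc - 1));
    return 0;
}
```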
Impact
- With enough invalid entries in the training set (or just a few if the valid samples are small), totalSizeToLoad can go negative.
- The memory sizing logic casts negative totals to size_t, wrapping to a massive value. This can result in:
  - malloc(huge) → failure → EXM_THROW(12) “not enough memory for DiB_trainFiles”, or
  - minor-to-large drift between the intended and actual allocation sizes, depending on the exact inputs.
Even when it doesn’t crash, this mismatch can under- or over-allocate sample memory. I'm not an expert in the trainer tool, though, so I'm not sure whether that matters in practice or whether it can make the trained dictionaries suboptimal.
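As a quick demonstration of the wrap itself (not zstd code, just standard C conversion-to-unsigned semantics, assuming a 64-bit size_t):

```c
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    /* e.g., 7 invalid entries counted and no valid bytes */
    long long totalSizeToLoad = -7;

    /* conversion to an unsigned type is modulo 2^N: -7 becomes 2^64 - 7 */
    size_t loadedSize = (size_t)totalSizeToLoad;

    printf("loadedSize = %zu\n", loadedSize);  /* 18446744073709551609 on LP64 */
    return 0;
}
```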
Reproduction (added in 96fdb9b)
Minimal setup: a handful of small “good” files plus many nonexistent paths passed on the command line (nonexistent paths bypass symlink filtering).
The temporary debug logs show negative file sizes being counted by DiB_fileStats(), final totals going negative, and the resulting huge loadedSize after casting.
Before (current behavior)
With 5 valid samples plus many nonexistent files:
After (this change, 85f4a7e):
- The “before” output produces a negative total and a huge loadedSize, ending in EXM_THROW(12).
- The “after” output shows consistent nbSamples, sane totals, and modest allocations, proceeding to training as expected.