Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,12 @@ const stopAfter100ms = await glob('**/*.css', {
signal: AbortSignal.timeout(100),
})

// cap concurrent async directory reads when traversing a large tree
const jsfilesThrottled = await glob('**/*.js', { concurrency: 8 })

// sync variants do not support concurrency and throw a TypeError
// globSync('**/*.js', { concurrency: 8 })

// multiple patterns supported as well
const images = await glob(['css/*.{png,jpeg}', 'public/*.{png,jpeg}'])

Expand Down Expand Up @@ -315,6 +321,18 @@ share the previously loaded cache.
This option may be either a string path or a `file://` URL
object or string.

- `concurrency` Limit simultaneous async directory reads to a
positive integer greater than or equal to `8`.

When omitted, async walks use the same unconstrained traversal
as before. When set, it only changes traversal rate, not the
result set, and it works alongside `signal`. Values below `8`,
`0`, negative numbers, and non-integers throw a `RangeError`
because the walker's fan-out needs enough in-flight `readdir()`
work to avoid starvation. Synchronous variants (`globSync()`,
`globStreamSync()`, and `globIterateSync()`) throw a `TypeError`
if `concurrency` is provided.

- `root` A string path resolved against the `cwd` option, which
is used as the starting point for absolute patterns that start
with `/`, (but not drive letters or UNC paths on Windows).
Expand Down
34 changes: 34 additions & 0 deletions benchmark.sh
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,20 @@ CJS
MJS
t node "$wd/bench-working-dir/async.mjs" "$p"

echo -n $'current glob async c=8 \t'
cat > "$wd/bench-working-dir/async-c8.mjs" <<MJS
import { glob } from '$wd/dist/esm/index.js'
glob(process.argv[2], { concurrency: 8 }).then(files => console.log(files.length))
MJS
t node "$wd/bench-working-dir/async-c8.mjs" "$p"

echo -n $'current glob async c=16 \t'
cat > "$wd/bench-working-dir/async-c16.mjs" <<MJS
import { glob } from '$wd/dist/esm/index.js'
glob(process.argv[2], { concurrency: 16 }).then(files => console.log(files.length))
MJS
t node "$wd/bench-working-dir/async-c16.mjs" "$p"

echo -n $'current glob stream \t'
cat > "$wd/bench-working-dir/stream.mjs" <<MJS
import {globStream} from '$wd/dist/esm/index.js'
Expand All @@ -188,6 +202,26 @@ MJS
MJS
t node "$wd/bench-working-dir/stream.mjs" "$p"

echo -n $'current stream c=8 \t'
cat > "$wd/bench-working-dir/stream-c8.mjs" <<MJS
import {globStream} from '$wd/dist/esm/index.js'
let c = 0
globStream(process.argv[2], { concurrency: 8 })
.on('data', () => c++)
.on('end', () => console.log(c))
MJS
t node "$wd/bench-working-dir/stream-c8.mjs" "$p"

echo -n $'current stream c=16 \t'
cat > "$wd/bench-working-dir/stream-c16.mjs" <<MJS
import {globStream} from '$wd/dist/esm/index.js'
let c = 0
globStream(process.argv[2], { concurrency: 16 })
.on('data', () => c++)
.on('end', () => console.log(c))
MJS
t node "$wd/bench-working-dir/stream-c16.mjs" "$p"

# echo -n $'current glob sync cjs -e \t'
# t node -e '
# console.log(require(process.argv[1]).sync(process.argv[2]).length)
Expand Down
10 changes: 10 additions & 0 deletions changelog.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,15 @@
# changeglob

## [Unreleased]

### Added

- Add a `concurrency` option to async glob APIs so callers can cap
simultaneous directory reads without changing match results.
Values below `8` reject with `RangeError`, and sync variants
throw `TypeError` if `concurrency` is provided. Implements
US-006.

## 13

- Move the CLI program out to a separate package, `glob-bin`.
Expand Down
90 changes: 78 additions & 12 deletions src/glob.ts
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,46 @@ const defaultPlatform: NodeJS.Platform =
process.platform
: 'linux'

export const MIN_ASYNC_READDIR_CONCURRENCY = 8
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this limited to 8? Feels arbitrary. What happens when it's lower?

There should really be no reason why it can't be set at 1, in fact; it'd just mean that each readdir waits for the one before it to complete before proceeding. Would be slow, of course, but also very memory and CPU efficient.


const concurrencyMinimumMessage = () =>
`concurrency must be a positive integer greater than or equal to ${MIN_ASYNC_READDIR_CONCURRENCY}`

const concurrencyReason =
'async glob walks need enough concurrent directory reads to avoid starvation in the walker fan-out'
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am dubious of this claim.


const invalidConcurrencyMessage = () =>
`invalid concurrency option: ${concurrencyMinimumMessage()} because ${concurrencyReason}`
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a method just to return a constant string? just define the string and use that.


const syncConcurrencyMessage =
'concurrency option is not supported for synchronous glob variants'

export const validateAsyncConcurrency = (
concurrency: number | undefined,
): number | undefined => {
if (concurrency === undefined) {
return undefined
}

if (
!Number.isInteger(concurrency) ||
concurrency <= 0 ||
concurrency < MIN_ASYNC_READDIR_CONCURRENCY
) {
throw new RangeError(invalidConcurrencyMessage())
}

return concurrency
}

export const assertNoSyncConcurrency = (
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertion methods should not be exported if they're only used in this module. Also, these can be tightened up considerably, something like this in src/index.ts

const validConcurrency = (concurrency: number | undefined, sync: boolean) =>
  concurrency === undefined ? true
  : !sync && Number.isInteger(concurrency) && concurrency > 0
const assertValidConcurrency = (concurrency: number | undefined, sync: boolean) => {
  if (!validConcurrency) {
    throw new Error(
      sync ? 'concurrency not supported for sync glob operations'
      : `Invalid concurrency value: ${concurrency}`
    )
}

Then in the various methods:

export function globSync(...) {
  assertValidConcurrency(options.concurrency, true)
  ...
}

export function glob(...) {
  assertValidConcurrency(options.concurrency, false)
  ...
}

concurrency: number | undefined,
): void => {
if (concurrency !== undefined) {
throw new TypeError(syncConcurrencyMessage)
}
}

/**
* A `GlobOptions` object may be provided to any of the exported methods, and
* must be provided to the `Glob` constructor.
Expand Down Expand Up @@ -71,6 +111,17 @@ export interface GlobOptions {
*/
cwd?: string | URL

/**
* Limit simultaneous async directory reads to a positive integer greater
* than or equal to `8`. Values below `8` are rejected because the async
* walker fan-out needs enough in-flight `readdir()` work to avoid
* starvation.
*
* Has no effect when omitted, and is not supported by synchronous glob
* methods.
*/
concurrency?: number

/**
* Include `.dot` files in normal matches and `globstar`
* matches. Note that an explicit dot in a portion of the pattern
Expand Down Expand Up @@ -408,6 +459,7 @@ export class Glob<Opts extends GlobOptions> implements GlobOptions {
windowsPathsNoEscape: boolean
withFileTypes: FileTypes<Opts>
includeChildMatches: boolean
concurrency?: number

/**
* The options provided to the constructor.
Expand Down Expand Up @@ -455,6 +507,7 @@ export class Glob<Opts extends GlobOptions> implements GlobOptions {
this.realpath = !!opts.realpath
this.absolute = opts.absolute
this.includeChildMatches = opts.includeChildMatches !== false
this.concurrency = opts.concurrency

this.noglobstar = !!opts.noglobstar
this.matchBase = !!opts.matchBase
Expand Down Expand Up @@ -558,21 +611,28 @@ export class Glob<Opts extends GlobOptions> implements GlobOptions {
*/
async walk(): Promise<Results<Opts>>
async walk(): Promise<(string | Path)[]> {
const concurrency = validateAsyncConcurrency(this.concurrency)
const walker = new GlobWalker(this.patterns, this.scurry.cwd, {
...this.opts,
maxDepth:
this.maxDepth !== Infinity ?
this.maxDepth + this.scurry.cwd.depth()
: Infinity,
platform: this.platform,
nocase: this.nocase,
includeChildMatches: this.includeChildMatches,
})

// Walkers always return array of Path objects, so we just have to
// coerce them into the right shape. It will have already called
// realpath() if the option was set to do so, so we know that's cached.
// start out knowing the cwd, at least
return [
...(await new GlobWalker(this.patterns, this.scurry.cwd, {
...this.opts,
maxDepth:
this.maxDepth !== Infinity ?
this.maxDepth + this.scurry.cwd.depth()
: Infinity,
platform: this.platform,
nocase: this.nocase,
includeChildMatches: this.includeChildMatches,
}).walk()),
...(await (
concurrency === undefined ?
walker.walk()
: walker.walkWithConcurrency(concurrency)
)),
]
}

Expand All @@ -581,6 +641,7 @@ export class Glob<Opts extends GlobOptions> implements GlobOptions {
*/
walkSync(): Results<Opts>
walkSync(): (string | Path)[] {
assertNoSyncConcurrency(this.concurrency)
return [
...new GlobWalker(this.patterns, this.scurry.cwd, {
...this.opts,
Expand All @@ -600,7 +661,8 @@ export class Glob<Opts extends GlobOptions> implements GlobOptions {
*/
stream(): Minipass<Result<Opts>, Result<Opts>>
stream(): Minipass<string | Path, string | Path> {
return new GlobStream(this.patterns, this.scurry.cwd, {
const concurrency = validateAsyncConcurrency(this.concurrency)
const stream = new GlobStream(this.patterns, this.scurry.cwd, {
...this.opts,
maxDepth:
this.maxDepth !== Infinity ?
Expand All @@ -609,14 +671,18 @@ export class Glob<Opts extends GlobOptions> implements GlobOptions {
platform: this.platform,
nocase: this.nocase,
includeChildMatches: this.includeChildMatches,
}).stream()
})
return concurrency === undefined ?
stream.stream()
: stream.streamWithConcurrency(concurrency)
}

/**
* Stream results synchronously.
*/
streamSync(): Minipass<Result<Opts>, Result<Opts>>
streamSync(): Minipass<string | Path, string | Path> {
assertNoSyncConcurrency(this.concurrency)
return new GlobStream(this.patterns, this.scurry.cwd, {
...this.opts,
maxDepth:
Expand Down
5 changes: 4 additions & 1 deletion src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ import type {
GlobOptionsWithFileTypesTrue,
GlobOptionsWithFileTypesUnset,
} from './glob.js'
import { Glob } from './glob.js'
import { assertNoSyncConcurrency, Glob } from './glob.js'
import { hasMagic } from './has-magic.js'

export { escape, unescape } from 'minimatch'
Expand Down Expand Up @@ -55,6 +55,7 @@ export function globStreamSync(
pattern: string | string[],
options: GlobOptions = {},
) {
assertNoSyncConcurrency(options.concurrency)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Won't it assert in the constructor anyway?

return new Glob(pattern, options).streamSync()
}

Expand Down Expand Up @@ -108,6 +109,7 @@ export function globSync(
pattern: string | string[],
options: GlobOptions = {},
) {
assertNoSyncConcurrency(options.concurrency)
return new Glob(pattern, options).walkSync()
}

Expand Down Expand Up @@ -163,6 +165,7 @@ export function globIterateSync(
pattern: string | string[],
options: GlobOptions = {},
) {
assertNoSyncConcurrency(options.concurrency)
return new Glob(pattern, options).iterateSync()
}

Expand Down
Loading