Submit draft of RFC for ForEach-Object -Parallel proposal #194
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,68 +16,95 @@ This RFC proposes a new parameter set for the existing ForEach-Object cmdlet to | |
## Motivation | ||
|
||
As a PowerShell User, | ||
I can do simple fan-out concurrency with the PowerShell ForEach-Object cmdlet, without having to obtain and load a separate module, or deal with PowerShell jobs unless I want to. | ||
I can execute `ForEach-Object` piped input in script blocks running in parallel threads, either synchronously or asynchronously, while limiting the number of threads running at a given time. | ||
|
||
## Specification | ||
|
||
There will be two new parameter sets added to the existing ForEach-Object cmdlet to support both synchronous and asynchronous operations for parallel script block execution. | ||
For the synchronous case, the `ForEach-Object` cmdlet will not return until all parallel executions complete. | ||
For the asynchronous case, the `ForEach-Object` cmdlet will immediately return a PowerShell job object that contains child jobs of each parallel execution. | ||
A new `-Parallel` parameter set will be added to the existing ForEach-Object cmdlet that supports running piped input concurrently in a provided script block. | ||
|
||
- `-Parallel` parameter switch specifies parallel script block execution | ||
|
||
- `-ScriptBlock` parameter takes a script block that is executed in parallel for each piped input variable | ||
|
||
- `-ThrottleLimit` parameter takes an integer value that determines the maximum number of script blocks running at the same time | ||
|
||
- `-TimeoutSeconds` parameter takes an integer that specifies the maximum time to wait for completion before the command is aborted | ||
|
||
- `-AsJob` parameter switch indicates that a job is returned, which represents the command running asynchronously | ||
|
||
The 'ForEach-Object -Parallel' command will return only after all piped input has been processed, unless the '-AsJob' switch is used, in which case a job object is returned immediately that monitors the ongoing execution state and collects generated data. | ||
The returned job object can be used with all PowerShell cmdlets that manipulate jobs. | ||
|
||
### Implementation details | ||
|
||
|
||
Implementation will be similar to the ThreadJob module. | ||
Script block execution will be run for each piped input on a separate thread and runspace. | ||
The number of threads that run at a time will be limited by a `-ThrottleLimit` parameter with a default value. | ||
Piped input that exceeds the allowed number of threads will be queued until a thread is available. | ||
For synchronous operation, a `-Timeout` parameter will be available that terminates the wait for completion after a specified time. | ||
Without a `-Timeout` parameter, the cmdlet will wait indefinitely for completion. | ||
Implementation will be similar to the ThreadJob module in that thread script block execution will be contained within a PSThreadChildJob object. | ||
The jobs will be run concurrently on separate runspaces/threads up to the ThrottleLimit value, and the remainder queued to wait for an available runspace/thread to run on. | ||
Initial implementation will not attempt to reuse threads and runspaces when running queued items, due to concerns of stale state breaking script execution. | ||
For example, PowerShell uses thread local storage to store per thread default runspaces. | ||
And even though there is a runspace 'ResetRunspaceState' API method, it only resets session variables and debug/transaction managers. | ||
Imported modules and function definitions are not affected. | ||
A script that defines a constant function would fail if the function is already defined. | ||
The initial assumption will be that runspace/thread creation time is insignificant compared to the time needed to execute the script block, either because of high compute needs or because of long wait times for results. | ||
If this assumption does not hold, the user should consider batching the workload passed to each ForEach-Object iteration, or simply use the sequential (non-parallel) form of the cmdlet. | ||
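
The batching suggestion can be sketched as follows. `Split-IntoBatch` is a hypothetical helper introduced only for illustration; it is not part of PowerShell or of this proposal. It groups the input so that each parallel iteration receives several work items, which spreads the runspace/thread creation cost over the whole batch.

```powershell
# Hypothetical helper (not part of PowerShell or this RFC): splits the input
# into arrays of at most $BatchSize items.
function Split-IntoBatch {
    param(
        [Parameter(Mandatory)] [object[]] $InputObject,
        [Parameter(Mandatory)] [ValidateRange(1, 2147483647)] [int] $BatchSize
    )
    for ($i = 0; $i -lt $InputObject.Count; $i += $BatchSize) {
        # Index of the last item in this batch, clamped to the end of the input.
        $last = [Math]::Min($i + $BatchSize, $InputObject.Count) - 1
        # The unary comma emits the batch as one array object instead of
        # letting the pipeline unroll it into individual items.
        , @($InputObject[$i..$last])
    }
}

# Ten work items grouped into batches of at most four.
$batches = @(Split-IntoBatch -InputObject (1..10) -BatchSize 4)
```

Each element of `$batches` would then be piped to the parallel form of the cmdlet, with the script block looping over the items in `$_`, so that one runspace handles a whole batch rather than a single item.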
|
||
### Synchronous parameter set | ||
The 'TimeoutSeconds' parameter will attempt to halt all script block executions after the timeout has elapsed. However, it may not succeed immediately if a running script is calling a native command or API, in which case the call must return before the running script can be halted. | ||
|
||
Synchronous ForEach-Object -Parallel returns after all script blocks complete running or timeout | ||
### Variable passing | ||
|
||
```powershell | ||
ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSecs 1800 -ScriptBlock {} | ||
``` | ||
ForEach-Object -Parallel will support the PowerShell `$_` current piped item variable within each script block. | ||
It will also support the `$using:` directive for passing variables from script scope into the parallel executed script block scope. | ||
If the passed in variable is a value type, a copy of the value is passed to the script block. | ||
If the passed in variable is a reference type, the reference is passed and each running script block can modify it. | ||
Since the script blocks are running in different threads, modifying a reference type that is not thread safe will result in undefined behavior. | ||
|
||
- `-Parallel` : parameter switch specifies fan-out parallel script block execution | ||
Script block variables will be special cased because they have runspace affinity. | ||
Therefore script block variables will not be passed by reference and instead a new script block object instance will be created from the original script block variable Ast (abstract syntax tree). | ||
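
As a rough illustration of creating a new script block instance from an existing script block's Ast, the sketch below uses the `ScriptBlockAst.GetScriptBlock()` method. Whether the cmdlet would use this exact call is an assumption; the variable names are illustrative only.

```powershell
# A script block created in the current runspace.
$original = { param($x) $x * 2 }

# Build a separate ScriptBlock instance from the parsed syntax tree,
# rather than sharing the original object across runspaces.
$copy = $original.Ast.GetScriptBlock()

# The copy is a distinct object that behaves like the original.
$sameInstance = [object]::ReferenceEquals($original, $copy)
$result = & $copy 21
```

Here `$sameInstance` is `$false` and `$result` is 42: the logic is preserved, but the object handed to each parallel runspace is no longer the caller's instance.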
|
||
|
||
- `-ThrottleLimit` : parameter takes an integer value that determines the maximum number of threads | ||
### Exceptions | ||
|
||
- `-TimeoutSecs` : parameter takes an integer that specifies the maximum time to wait for completion in seconds | ||
For critical exceptions, such as out of memory or stack overflow, the CLR will crash the process. | ||
Since all parallel running script blocks run in different threads in the same process, all running script blocks will terminate, and queued script blocks will never run. | ||
This is different from PowerShell jobs (Start-Job) where each job script runs in a separate child process, and therefore has better isolation from crashes. | ||
The lack of process isolation is one of the costs of better performance while using threads for parallelization. | ||
|
||
### Asynchronous parameter set | ||
For all other catchable exceptions, PowerShell will catch them from each thread and write them as non-terminating error records to the error data stream. | ||
If the `ErrorAction` parameter is set to 'Stop' then the cmdlet will attempt to stop the parallel execution on any error. | ||
|
||
Asynchronous ForEach-Object -Parallel immediately returns a job object for monitoring parallel script block execution | ||
### Stop behavior | ||
|
||
|
||
```powershell | ||
ForEach-Object -Parallel -ThrottleLimit 5 -AsJob -ScriptBlock {} | ||
``` | ||
Whenever a timeout, a terminating error (-ErrorAction Stop), or a stop command (Ctrl+C) occurs, a stop signal will be sent to all running script blocks, and any queued script block iterations will be dequeued. | ||
This does not guarantee that a running script will stop immediately if that script is running a native command or making an API call. | ||
So it is possible for a stop command to be ineffective if one running thread is busy or hung. | ||
|
||
- `-Parallel` : parameter switch specifies fan-out parallel script block execution | ||
We can consider including some kind of 'forcetimeout' parameter that would kill any threads that did not end in a specified time. | ||
|
||
- `-ThrottleLimit` : parameter takes an integer value that determines the maximum number of threads | ||
If a job object is returned (-AsJob) the child jobs that were dequeued by the stop command will remain in the 'NotStarted' state. | ||
|
||
- `-AsJob` : parameter switch returns a job object | ||
### Data streams | ||
|
||
### Variable passing | ||
The Warning, Error, Debug, and Verbose data streams will be written to the cmdlet data streams as received from each running parallel script block. | ||
Progress data streams will not be supported, but can be added later if desired. | ||
|
||
ForEach-Object -Parallel will support the PowerShell `$_` current piped item variable within each script block. | ||
It will also support the `$using:` directive for passing variables from script scope into the parallel executed script block scope. | ||
### Supported scenarios | ||
|
||
### Examples | ||
```powershell | ||
# Ensure needed module is installed on local system | ||
if (! (Get-Module -Name MyLogsModule -ListAvailable)) { | ||
Install-Module -Name MyLogsModule -Force | ||
} | ||
``` | ||
|
||
```powershell | ||
$computerNames = 'computer1','computer2','computer3','computer4','computer5' | ||
$logs = $computerNames | ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSecs 1800 -ScriptBlock { | ||
$logs = $computerNames | ForEach-Object -Parallel -ThrottleLimit 10 -TimeoutSeconds 1800 -ScriptBlock { | ||
Get-Logs -ComputerName $_ | ||
} | ||
``` | ||
|
||
```powershell | ||
$computerNames = 'computer1','computer2','computer3','computer4','computer5' | ||
$job = ForEach-Object -Parallel -ThrottleLimit 10 -InputObject $computerNames -AsJob -ScriptBlock { | ||
$job = ForEach-Object -Parallel -ThrottleLimit 10 -InputObject $computerNames -TimeoutSeconds 1800 -AsJob -ScriptBlock { | ||
Get-Logs -ComputerName $_ | ||
} | ||
$logs = $job | Wait-Job | Receive-Job | ||
|
@@ -91,9 +118,65 @@ $logs = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock { | |
} | ||
``` | ||
|
||
```powershell | ||
$computerNames = 'computer1','computer2','computer3','computer4','computer5' | ||
$logNames = 'System','SQL','AD','IIS' | ||
$logResults = ForEach-Object -Parallel -InputObject $computerNames -ScriptBlock { | ||
Get-Logs -ComputerName $_ -LogNames $using:logNames | ||
} | ForEach-Object -Parallel -ScriptBlock { | ||
Process-Log $_ | ||
} | ||
``` | ||
|
||
### Unsupported scenarios | ||
Review comment: Looking at the unsupported scenarios I see an inconsistency - we can send variables by ...

Review comment: Start-Parallel (https://www.powershellgallery.com/packages/start-parallel/1.3.0.0) goes down the ...
One of the things I keep saying is if the desire is to make a parallel version of ...
The first invoke has a deserialized object without the methods associated with a directory. The second has a "normal" directory object.

Review comment: The parser does add some restrictions to $using variables, mainly because they were used for remoting. But I think we want those restrictions given undefined behavior of assignable $using variables. A new parameter set was added to ForEach-Object specifically to indicate that the -Parallel operation is not the same as the sequential operation using the traditional parameter set. I thought about creating a new cmdlet, but I feel a new parameter set is sufficient to differentiate.
||
|
||
```powershell | ||
# Variables must be passed in via $using: keyword | ||
$LogNameToUse = "IISLogs" | ||
$computers | ForEach-Object -Parallel -ScriptBlock { | ||
# This will fail because $LogNameToUse has not been defined in this scope | ||
Get-Log -ComputerName $_ -LogName $LogNameToUse | ||
} | ||
``` | ||
|
||
```powershell | ||
# Passed in reference variables should not be assigned to | ||
$MyLogs = @() | ||
$computers | ForEach-Object -Parallel -ScriptBlock { | ||
# Not thread safe, undefined behavior | ||
# Cannot assign to using variable | ||
$using:MyLogs += Get-Logs -ComputerName $_ | ||
} | ||
|
||
$dict = [System.Collections.Generic.Dictionary[string,object]]::New() | ||
$computers | ForEach-Object -Parallel -ScriptBlock { | ||
$dict = $using:dict | ||
$logs = Get-Logs -ComputerName $_ | ||
# Not thread safe, undefined behavior | ||
$dict.Add($_, $logs) | ||
} | ||
``` | ||
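
By contrast with the unsupported `Dictionary` example, a thread-safe collection can be shared by reference. The sketch below is a minimal illustration under two assumptions: it uses the syntax that later shipped in PowerShell 7, where `-Parallel` takes the script block directly (not the `-Parallel` switch plus `-ScriptBlock` form in this draft), and it stores the length of each name as a stand-in for the `Get-Logs` results.

```powershell
# Requires PowerShell 7.0 or later (shipped syntax; differs from this draft).
$computers = 'computer1', 'computer2', 'computer3'

# ConcurrentDictionary is designed for concurrent access from multiple threads.
$dict = [System.Collections.Concurrent.ConcurrentDictionary[string,int]]::new()

$computers | ForEach-Object -Parallel {
    # $using: passes the reference; the variable itself is never assigned to.
    $d = $using:dict
    # Stand-in for real work: store the length of the computer name.
    [void]$d.TryAdd($_, $_.Length)
} -ThrottleLimit 2

$entryCount = $dict.Count
```

The reference is only read through `$using:` and mutated through the collection's own thread-safe methods, which avoids both the assignment restriction and the undefined behavior noted above.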
|
||
```powershell | ||
# Value types not passed by reference | ||
$count = 0 | ||
$computers | ForEach-Object -Parallel -ScriptBlock { | ||
# Can't assign to using variable | ||
$using:count += 1 | ||
$logs = Get-Logs -ComputerName $_ | ||
return @{ | ||
ComputerName = $_ | ||
Count = $count | ||
Logs = $logs | ||
} | ||
} | ||
``` | ||
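
One way to get a count without writing to a shared value-type variable is to have each script block emit its result and aggregate after the pipeline completes. This is a sketch, assuming the syntax that later shipped in PowerShell 7 (where `-Parallel` takes the script block directly) and a placeholder string in place of the `Get-Logs` output.

```powershell
# Requires PowerShell 7.0 or later (shipped syntax; differs from this draft).
$computers = 'computer1', 'computer2', 'computer3'

$results = $computers | ForEach-Object -Parallel {
    # Emit one object per input item; nothing outside the block is modified.
    [pscustomobject]@{
        ComputerName = $_
        Logs         = "logs from $_"   # placeholder for Get-Logs output
    }
}

# Aggregate on the calling thread once all script blocks have finished.
$count = @($results).Count
```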
|
||
## Alternate Proposals and Considerations | ||
|
||
|
||
Another option (and a previous RFC proposal) is to resurrect the PowerShell Windows workflow script `foreach -parallel` keyword to be used in normal PowerShell script to perform parallel execution of foreach loop iterations. | ||
However, the majority of the community felt it would be more useful to update the existing ForEach-Object cmdlet with a -Parallel parameter set. | ||
We may want to eventually implement both solutions. | ||
But the ForEach-Object -Parallel proposal in this RFC should be implemented first since it is currently the most popular. | ||
|
||
There are currently other proposals to create a more general framework to support running arbitrary scripts and cmdlets in parallel, by marking them as able to support parallelism (see RFC #206). | ||
That is outside the scope of this RFC, which focuses on extending just the ForEach-Object cmdlet to support parallel execution, and is intended to allow users to do parallel script/command execution without having to resort to PowerShell APIs. |