@mikeperello-scopely

Description of change

According to the AWS documentation (linked under Additional info), a DynamoDB table can be scanned in parallel. This is useful for large scans because, by default, the Scan operation processes data sequentially and returns it to the application in 1 MB increments.

Manual QA steps

To run the tap in parallel, set the following attributes as environment variables:

  • parallel_segment: the ID of the segment this execution scans.
  • parallel_totalsegments: the total number of segments the scan is divided into.
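
As a rough illustration of how these two attributes could map onto the `Segment` and `TotalSegments` parameters of the DynamoDB Scan API, here is a minimal sketch. The helper name `build_scan_params` and its validation rules are assumptions made for this example and are not necessarily how the tap implements it.

```python
import os


def build_scan_params(table_name, env=None):
    """Build keyword arguments for a DynamoDB Scan call.

    Hypothetical helper: reads parallel_segment and parallel_totalsegments
    from the environment (or from the mapping passed as `env`).
    """
    env = os.environ if env is None else env
    params = {"TableName": table_name}

    segment = env.get("parallel_segment")
    total = env.get("parallel_totalsegments")

    # Neither attribute set: fall back to a plain sequential scan.
    if segment is None and total is None:
        return params

    # DynamoDB requires Segment and TotalSegments to be supplied together.
    if segment is None or total is None:
        raise ValueError(
            "parallel_segment and parallel_totalsegments must be set together"
        )

    segment, total = int(segment), int(total)

    # Segment IDs are zero-based, so valid values are 0 .. total - 1.
    if not 0 <= segment < total:
        raise ValueError("parallel_segment must be in the range 0..total-1")

    params["Segment"] = segment
    params["TotalSegments"] = total
    return params
```

The returned dictionary could then be passed to a boto3 DynamoDB client as `client.scan(**params)`.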

Risks

Rollback steps

  • revert this branch

Additional info

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan

The Scan operation can logically divide a table or secondary index into multiple segments, with multiple application workers scanning the segments in parallel. Each worker can be a thread.

To run a parallel scan, run multiple executions of the tap, each with a different parallel_segment value but the same parallel_totalsegments value.
For example:

  • Execution 1:

    • parallel_segment = 0
    • parallel_totalsegments = 2
  • Execution 2:

    • parallel_segment = 1
    • parallel_totalsegments = 2

⚠️ Segment IDs are zero-based: the first parallel_segment must be 0, and the last must be parallel_totalsegments - 1.
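
What each execution does with its segment can be sketched as follows. This is an illustration under assumptions, not the tap's actual code: `scan_segment` is a hypothetical name, and `client` is expected to behave like a boto3 DynamoDB client, whose `scan` method accepts `Segment`, `TotalSegments`, and `ExclusiveStartKey` and returns `Items` plus an optional `LastEvaluatedKey`.

```python
def scan_segment(client, table_name, segment, total_segments):
    """Yield every item in one segment of a parallel scan.

    Hypothetical helper: `client` is any object with a boto3-style
    `scan(**kwargs)` method.
    """
    params = {
        "TableName": table_name,
        "Segment": segment,              # zero-based segment ID
        "TotalSegments": total_segments,  # must match across all workers
    }
    while True:
        page = client.scan(**params)
        yield from page.get("Items", [])

        # Each response holds at most 1 MB of data; a LastEvaluatedKey
        # means this segment has more pages to fetch.
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        params["ExclusiveStartKey"] = last_key
```

With the two executions above, Execution 1 would call `scan_segment(client, table, 0, 2)` and Execution 2 would call `scan_segment(client, table, 1, 2)`, where `client` could be created with `boto3.client("dynamodb")`. Each segment is paginated independently, so the workers do not need to coordinate.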

Add parallel segment and total number of segments to the scan.
@singer-bot

Hi @mikeperello-scopely, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

@singer-bot

You did it @mikeperello-scopely!

Thank you for signing the Singer Contribution License Agreement.
