Processing big data

abk edited this page Jul 27, 2020 · 1 revision

AWS Lambda
- Serverless data processing tool.
- A way to run code snippets in the cloud.
  - Serverless
  - Continuous scaling
- Often used to process data as it moves between services.
- Lambda is typically used as glue between a data stream and DynamoDB.
- Examples
  - Transaction rate alarm: Lambda transforms the incoming data as needed and sends a notification, acting as the glue between services.
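The transaction rate alarm example above can be sketched as a minimal Python handler. This is only an illustration: the event shape (`transactions`, `window_seconds`) and the threshold `MAX_TRANSACTIONS_PER_SECOND` are hypothetical names that do not come from these notes, and the notification step is left as a comment rather than a real service call.

```python
# Hypothetical threshold: alarm above this many transactions per second.
MAX_TRANSACTIONS_PER_SECOND = 10.0


def lambda_handler(event, context):
    """Entry point that Lambda calls once per invocation.

    `event` carries the trigger payload; its shape here is an assumption
    made for this sketch.
    """
    transactions = event.get("transactions", [])
    window_seconds = event.get("window_seconds", 60)

    # Compute the observed rate over the reporting window.
    rate = len(transactions) / window_seconds
    alarm = rate > MAX_TRANSACTIONS_PER_SECOND

    # In a real deployment, this is where the function would notify
    # someone, for example by publishing to a notification service.
    return {"rate_per_second": rate, "alarm": alarm}
```

The handler keeps no state between calls: everything it needs arrives in `event`, which is what lets Lambda scale it out freely.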

Lambda Integration (part 1)
- Why not just run a server?
  - Server management (patches, monitoring, etc.)
  - Servers can be cheap, but scaling gets expensive.
  - With Lambda, you don't pay for processing time you don't use.
  - Easier to split up development between front end and back end.
- Main uses of Lambda
  - Real time file processing
  - Real time stream processing
  - ETL
  - Cron replacement: use time as the trigger to invoke a Lambda function periodically.
  - Process AWS events.
  - There are many Lambda triggers. Noteworthy ones:
    - S3, Kinesis, DynamoDB, IoT
    - A Kinesis stream does NOT push data into Lambda. Lambda polls the stream and PULLs batches of records.
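As an illustration of the real-time file processing use case listed above, here is a minimal sketch of a handler for an S3 trigger. It assumes the standard S3 event notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The helper name `extract_s3_objects` is hypothetical, and the actual processing is reduced to a `print` placeholder.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        # Placeholder for real work (parse, transform, load elsewhere).
        print(f"New object: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Keeping the event parsing in its own function makes it easy to test locally with a hand-built event dictionary, without deploying anything.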

Lambda Integration (part 2)
- Lambda and Elasticsearch Service
  - S3 > AWS Lambda > Elasticsearch Service (process and analyze).
  - S3 > AWS Lambda > AWS Data Pipeline (Data Pipeline processes the data further after Lambda activates it). Data Pipeline can run activities on a schedule, but with Lambda you can trigger it at any time instead of on a fixed timeline.
  - S3 > AWS Lambda > Redshift, with Lambda <> DynamoDB on the side. Lambda has to be stateless, so store any state in DynamoDB.
- Lambda + Kinesis
  - Lambda receives an event with a batch of stream records.
    - You specify the batch size when setting up the trigger.
    - Too large a batch size can cause timeouts.
    - Batches are split if they would exceed Lambda's payload limit (6 MB).
    - If the function errors out or times out, Lambda retries the batch.
    - This can stall the shard if you don't handle errors properly.
    - Use more shards so that processing is NOT held up by errors in a single shard.
    - Lambda processes shard data synchronously.
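The point about error handling in Kinesis batches can be sketched as follows. This is a minimal example under stated assumptions, not a definitive implementation. It relies on the Kinesis event layout, where each record carries base64-encoded data under `kinesis.data`. The `process_record` helper and the JSON payload with an `amount` field are hypothetical. Bad records are caught and collected instead of raised, so a single malformed record does not force the whole batch to be retried.

```python
import base64
import json


def process_record(payload):
    """Hypothetical per-record work: return the parsed 'amount' field."""
    return payload["amount"]


def lambda_handler(event, context):
    processed, failed = [], []
    for record in event.get("Records", []):
        sequence_number = record["kinesis"]["sequenceNumber"]
        try:
            # Kinesis record data is base64-encoded in the Lambda event.
            raw = base64.b64decode(record["kinesis"]["data"])
            processed.append(process_record(json.loads(raw)))
        except (ValueError, KeyError, TypeError) as exc:
            # Record the failure instead of raising, so one bad record
            # does not make the batch retry and stall the shard.
            failed.append(
                {"sequenceNumber": sequence_number, "error": repr(exc)}
            )
    return {"processed": processed, "failed": failed}
```

In practice the `failed` entries would be written somewhere durable for later inspection. Where they go is a design choice these notes do not cover.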

Lambda anti-patterns (where you don't want to use Lambda)
- Long-running applications
- Dynamic websites
- Stateful applications