processing big data

abk edited this page Jul 27, 2020 · 1 revision

  • AWS Lambda
    • Serverless data processing tool.
    • A way to run code snippets in the cloud.
      • Serverless
      • Continuous scaling
    • Often used to process data as it moves between services.
    • Lambda is typically used as glue between a data stream and DynamoDB.
    • Examples
      • Transaction rate alarm: Lambda can transform the data in any way, send a notification, and act as the glue between services.
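
As a rough sketch of the transaction rate alarm example above, a handler might count the records in each incoming Kinesis batch and call a notifier when the count crosses a threshold. The threshold value, the `notify` callable, and the function names are hypothetical; in a real deployment the notifier would probably publish to something like Amazon SNS.

```python
import base64
import json

# Hypothetical threshold: alarm when a single batch carries this many transactions.
MAX_TRANSACTIONS_PER_BATCH = 3


def count_transactions(event):
    """Count Kinesis records in the event whose payload decodes to a JSON object."""
    count = 0
    for record in event.get("Records", []):
        # Kinesis record data arrives base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            if isinstance(json.loads(payload), dict):
                count += 1
        except ValueError:
            # Skip payloads that are not valid JSON instead of failing the batch.
            continue
    return count


def make_handler(notify):
    """Build a Lambda-style handler that calls notify(message) on a rate spike."""
    def handler(event, context):
        count = count_transactions(event)
        alarmed = count >= MAX_TRANSACTIONS_PER_BATCH
        if alarmed:
            notify(f"Transaction rate alarm: {count} transactions in one batch")
        return {"transactions": count, "alarmed": alarmed}
    return handler
```

Passing the notifier in as an argument keeps the handler easy to exercise locally without any AWS credentials.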

  • Lambda integration (part 1).
    • Why not just run a server?
      • Server management (patches, monitoring, etc.)
      • Servers can be cheap, but scaling gets expensive.
      • With Lambda, you don't pay for processing time you don't use.
      • Easier to split up development between front end and back end.
    • Main uses of Lambda
      • Real-time file processing
      • Real-time stream processing
      • ETL
      • Cron replacement: use time as the trigger to invoke a Lambda function periodically.
      • Processing AWS events
      • There are many Lambda triggers. Noteworthy ones:
        • S3, Kinesis, DynamoDB, IoT
        • A Kinesis stream does NOT push data into Lambda. Lambda polls the stream and PULLs the data in.
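
To illustrate the real-time file processing use case with an S3 trigger, here is a minimal sketch of a handler that pulls the bucket name and object key out of each S3 event record. The function names are hypothetical, and the processing step is only a placeholder.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every S3 record in a Lambda event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue  # Not an S3 record; ignore it.
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: list the objects that triggered this invocation."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        # Placeholder for real work (download, transform, load elsewhere).
        print(f"would process s3://{bucket}/{key}")
    return {"processed": len(objects)}
```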

  • Lambda integration (part 2).
    • Lambda and Elasticsearch Service.

      • S3 > AWS Lambda > Elasticsearch Service (process and analyze).
      • S3 > AWS Lambda > AWS Data Pipeline (Data Pipeline processes the data further once Lambda activates it). You can schedule activities in Data Pipeline, but with Lambda you can start them at any time instead of on a fixed schedule.
      • S3 > AWS Lambda > Redshift, with AWS Lambda <> DynamoDB alongside. Lambda has to be stateless, so store the state in DynamoDB.
    • Lambda + Kinesis

      • Lambda receives an event with a batch of stream records.
        • You specify the batch size when setting up the trigger.
        • Too large a batch size can cause timeouts.
        • Batches that would exceed Lambda's payload limit (6 MB) are split.
        • If the Lambda function fails or times out, Lambda retries the batch.
        • This can stall the shard if you don't handle errors properly.
        • Use more shards to ensure processing is NOT held up by errors.
        • Lambda processes shard data synchronously.
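
One way to keep a single bad record from stalling the shard is to catch errors per record and set the failures aside rather than letting the whole invocation fail. The sketch below assumes that approach; `process_record`, the `amount` field, and the `on_bad_record` callback are hypothetical stand-ins for real business logic and a real dead-letter mechanism.

```python
import base64
import json


def process_record(payload):
    """Hypothetical business logic: require an 'amount' field and double it."""
    return payload["amount"] * 2


def handler(event, context, on_bad_record=None):
    """Process a batch of Kinesis records without letting one bad record fail it."""
    results, failed = [], []
    for record in event.get("Records", []):
        kinesis = record["kinesis"]
        try:
            # Kinesis record data arrives base64-encoded.
            payload = json.loads(base64.b64decode(kinesis["data"]))
            results.append(process_record(payload))
        except (ValueError, KeyError, TypeError) as exc:
            # Set the record aside instead of raising, so the batch is not retried.
            failed.append(kinesis["sequenceNumber"])
            if on_bad_record is not None:
                on_bad_record(kinesis["sequenceNumber"], exc)
    return {"results": results, "failed": failed}
```

The trade-off is that failed records are skipped rather than retried, so they need to be logged or stored somewhere if they matter.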

  • Lambda anti-patterns (where you don't want to use Lambda).
    • Long-running applications
    • Dynamic websites
    • Stateful applications
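
Because a Lambda function has to be stateless, any state that must survive between invocations goes into an external store such as DynamoDB, as noted earlier. The sketch below assumes `table` behaves like a boto3 DynamoDB Table resource; the `job_id` key, the `processed_count` attribute, and the `InMemoryTable` stand-in are hypothetical and exist only for illustration.

```python
def record_progress(table, job_id, processed):
    """Add `processed` to a job's running total kept in an external table.

    `table` is assumed to behave like a boto3 DynamoDB Table resource.
    """
    response = table.update_item(
        Key={"job_id": job_id},
        # ADD creates the attribute if it is missing, then increments it.
        UpdateExpression="ADD processed_count :n",
        ExpressionAttributeValues={":n": processed},
        ReturnValues="UPDATED_NEW",
    )
    return response["Attributes"]["processed_count"]


class InMemoryTable:
    """Hypothetical stand-in for local testing; handles only the ADD update above."""

    def __init__(self):
        self.items = {}

    def update_item(self, Key, UpdateExpression, ExpressionAttributeValues, ReturnValues):
        item = self.items.setdefault(Key["job_id"], {"processed_count": 0})
        item["processed_count"] += ExpressionAttributeValues[":n"]
        return {"Attributes": {"processed_count": item["processed_count"]}}
```

Each invocation reads and writes the counter through the table, so no invocation depends on memory left behind by a previous one.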