
Cloud Native

An application is cloud native when it is designed to run in the cloud (private, public, or hybrid) and can adapt to different scenarios, i.e. it is scalable.

The cloud simply means making computing resources (compute, storage, networking) available over the Internet, on demand.

Main cloud native features are:

  • Core elements:
    • [[#Microservices]]: an architectural style where an application is a collection of atomic, resilient services, each with a limited responsibility w.r.t. the entire application and developed by its own small team
    • [[#Containers]]
    • Serverless computing: developers can focus just on the logic (code), while the cloud provider is responsible for the whole computing infrastructure, scaling and availability
      • automation
      • cost efficient
      • faster time-to-market
    • service meshes
    • immutable infrastructure
    • declarative APIs
  • [[#Event-Driven Architecture]] (EDA): components communicate only through events, in a producer-consumer fashion.
    • Events represent state changes (or triggers) that lead to a further action
    • Producer and consumer approach
      • enables real-time interactions
      • enables loose coupling and flexibility: new services can subscribe without changing existing ones
  • Continuous delivery: frequent updates with minimal downtime
  • Elastic scaling: applications can expand and contract resources based on usage
    • Horizontal scaling: add more instances of equal services when demand increases (load balancers are used)
    • Vertical scaling: dynamically adjust computing resources (cpu, ram)
  • More responsive and scalable
  • Enables real-time processing
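The producer-consumer interaction above can be sketched with a minimal in-memory event bus (all names here are illustrative; a real system would use a broker such as Kafka or RabbitMQ):

```python
from collections import defaultdict

# Hypothetical in-memory event bus: illustrates loose coupling, not a real broker.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # New consumers register without touching existing producers.
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Producers emit state changes; every subscriber reacts independently.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
audit_log = []
# A new service subscribes without any change to the producer.
bus.subscribe("order_placed", lambda e: audit_log.append(e["order_id"]))
bus.publish("order_placed", {"order_id": 42})
```

Note how the producer never names its consumers: adding a second subscriber requires no change to existing code, which is exactly the loose coupling EDA promises.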

Applications

  • Healthcare
  • Finance
  • Retail

Innovations

  • AI/ML
  • Blockchain
  • Digital Twin
  • 5G
  • Edge Computing
  • Zero Trust Architecture
  • Automation at Scale

Pros and Cons

Pros

  • Flexible resource allocation
  • Cost reduction
    • eliminate fixed capital costs (no need to buy hardware)
    • pay-as-you-go
    • efficient use of resources
  • Business continuity (DevOps)
    • Release faster
    • Automate processes
  • Faster development and deployment
  • Resilience: self-healing and quick recovery
  • Efficiency: dynamic resource allocation, hardware potential maximization
  • Agility

Cons

  • Security: data privacy and compliance more difficult to handle across multiple environments
  • Vendor lock-in: limited mobility to migrate between cloud providers
  • Skill gaps: need for Docker and Kubernetes specialists

Evolution of Information Systems

![[Pasted image 20250828120616.png|400]]

  • Development process: waterfall → agile (see [[#Evolution of System Design]])
  • Architecture: monolithic (tightly coupled modules) → [[#Microservices]] (more modular)
  • Development & Packaging: physical servers → virtual servers/containers (see [[#Evolution of Servers]])
  • Infrastructure: self-hosted → cloud infrastructure

Maturity Model

![[Pasted image 20250828121118.png|600]]

It helps organizations assess their current status towards becoming fully cloud-native. The maturity model consists of several stages:

  1. Cloud-Aware: applications deployed in the cloud but not optimized for it
  2. Cloud-Enabled: applications deployed in the cloud, with some parts optimized for it
  3. Cloud-Native: applications deployed in and optimized for the cloud
  4. Cloud-Powered: applications deployed in and optimized for the cloud, with the organization fully embracing the cloud approach (in development, deployment, monitoring, scaling)

12 steps towards Cloud Native

  1. Codebase: one codebase (tracked by a VCS), many deployments
  2. Dependencies: declare and isolate dependencies, using files such as requirements.txt (Python) or package.json (Node)
  3. Configuration: store config in env, separated from code
  4. Backing services: treat backing services as attached resources. This enables swapping services by changing config rather than code
  5. Build, release, run: create more stages, with respective configs/envs
  6. Processes: run the app as stateless processes with no persistence. To ensure persistence, a backing service (see 4) should be used
  7. Port binding: export internal service through a specific port
  8. Concurrency: scale out horizontally by running multiple stateless processes
  9. Disposability: fast startup, graceful shutdown, seamless restarts
  10. Dev/prod parity: ensure minimal differences
  11. Logs: stream and aggregate externally, using a centralized system
  12. Admin process: create dedicated admin processes in the same environment used by the app, to avoid running admin/debug/maintenance commands inside the main long-running app service
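Factors 3 and 4 can be sketched in a few lines of Python: configuration is read from the environment, so a backing service is swapped by changing an environment variable, never the code (the variable and default names below are illustrative):

```python
import os

# Hypothetical settings loader: factor 3 (config in env) plus factor 4
# (backing services as attached resources). Swapping the database means
# changing DATABASE_URL, not the source code.
def load_settings(environ=os.environ):
    return {
        "database_url": environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "port": int(environ.get("PORT", "8080")),
    }

# In production the env would carry the real backing-service location:
settings = load_settings({"DATABASE_URL": "postgres://prod-db/app", "PORT": "80"})
```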

The Software Architect

The professional figure responsible for the Cloud Native transformation. Their tasks are:

  • Design (the system) for failure
  • Enable scalability
  • Security by design
  • Collaboration with DevOps

Evolution of System Design

There are two main approaches when it comes to designing a system.

Waterfall Model

Sequential development. It is characterized by:

  • a structured and rigid approach
  • a timeline subdivided into phases, with strong emphasis on documentation and traceability
  • a rigid process, hence poor responsiveness to change
  • late testing → bug fixes at higher cost ![[Pasted image 20260120160100.png|600]]

Agile Methodology

Iterative development: work is broken into short cycles. An alternative to the [[#Waterfall Model]], focused on:

  • flexibility
  • fast delivery
  • customer value
  • individuals + interactions > process + tools
  • working software > documentation
  • reacting to change > following a rigid plan

Extreme Programming

  • Pair programming
  • Test-Driven Development (TDD)
  • Continuous Integration (CI)
  • Constant refactoring
  • Frequent releases ![[Pasted image 20260120160609.png|400]]

Scrum Framework

  • 2-4 week sprints that must deliver a usable increment for the software
  • Backlogs are lists of prioritized tasks:
    • Product backlog: general list of tasks
    • Sprint backlog: list of tasks selected for the specific sprint
  • Core events
    • Daily scrum
    • Sprint review
    • Retrospective
  • Roles
    • Product owner
    • Scrum master
    • Development team

Evolution of Servers

Bare-metal physical servers → VMs/containers/serverless ![[Pasted image 20250828163832.png]]

Bare-metal servers

A physical server that can host just one workload and operates independently.

  • inefficient: high cost with low utilization rate (<20%)
  • complex maintenance and management

VMs

Process of abstracting physical hardware resources into multiple virtual resources. The idea is to maximize hardware utilization: multiple VMs with multiple applications can run simultaneously on a single host, managed by a hypervisor which dynamically allocates resources to them.

  • Each VM is isolated w.r.t. the other ones: multiple OS running on the same hardware
  • Resource sharing principle: CPU, RAM, storage are shared across multiple VMs. Hardware potential is maximized, reducing idle time
  • Enhanced portability
  • Enhanced scalability. Legacy systems can be migrated to VMs (so to the cloud) easily
  • Cost saving: there is no need for a single machine per application
  • Disaster Recovery: VMs can be easily snapshotted and restored

VMs | Hypervisors

It’s the software that allocates resources to each VM and runs independently w.r.t. the VMs. VMs communicate with the HV to access real hardware.

  • Type 1 HV: bare-metal, runs on the physical hardware
    • performance and efficiency
    • used in real data centers (ex: ESXi)
  • Type 2 HV: software installed on existing OS
    • used in less demanding scenarios (ex: VirtualBox, Parallels)

VMs | Lifecycle

  1. Provisioning: the VM is created from a predefined configuration
  2. Operation: VM runs apps by using shared resources
  3. Snapshot and backup
  4. Deprovisioning: VM is shut down when no longer needed

VMs | Cloud

  • Server Virtualization: physical servers are divided into multiple VMs
  • Network Virtualization: physical networks are divided into multiple independent networks
  • Storage Virtualization: storage resources are pooled across multiple services, but they are presented as a unified entity

VMs | Best Practices

  • Isolate VMs to prevent cross-VM attacks
  • Update the [[#Hypervisors]]
  • Implement robust access control policies

Containers

Tip

Containers became very popular with the introduction of Docker in 2014

They encapsulate an application's entire source code and its dependencies, ensuring consistent execution across different environments and systems. Containers share the host OS, but remain isolated and conflict-free.

Containers are stateless by definition: DO NOT store persistent data inside them. Data must be stored in external databases or distributed storage solutions. This allows a container to be restarted or deleted without worrying about losing data.

Using containers enables the [[#Rolling Deployment]] practice: gradually replace the old container version with the new one, without downtime. Of course, robust automation is required to enable seamless deployments and rollbacks.

  • Less overhead than [[#VMs]]
  • Quicker startup/shutdown than [[#VMs]]
  • More portable than [[#VMs]]
  • The only drawback is that cross-OS compatibility is limited

Containers | Components

  • Engine: manages the container lifecycle (Docker, Orbstack) and is made up of
    • Daemon (runtime): executes containers, interacting with the OS kernel
    • Client: CLI or GUI that lets the user send commands to the runtime
  • Image: immutable atomic package, containing dependencies and configurations
  • Containers: running instances of the downloaded images
  • Registry: service that manages (create/share/destroy) Docker container images
    • Visibility
      • public, so cloud-hosted
      • private, so self-hosted ![[Pasted image 20260119164732.png|700]]

Containers | Lifecycle Management

  • Ephemeral: containers are designed to be short-lived. They should be stopped and replaced when not needed
  • Automation: containers are commonly used in CI/CD pipelines ![[Pasted image 20260119115128.png]]

Containers | Deployment

Rolling Deployment

Updates are rolled out gradually, by replacing old containers one at a time. In the middle phase, users will access mixed versions of the containers. ![[Pasted image 20260120171931.png|700]]

Blue/Green Deployment

  • blue: the older version of the system; the current live version
  • green: the newer, updated version of the system

Once the green instance has been tested and is ready, traffic is fully routed from blue to green. Green then becomes the new blue. ![[Pasted image 20260120171955.png|700]]

Canary Deployment

The new version of the system is reachable only by a restricted subset of the users. Once it has been tested, the fully updated system is deployed. ![[Pasted image 20260120172042.png|700]]

Containers | Orchestration

It enables enterprises to manage complex microservice environments efficiently

  • Automated deployment
  • Efficient scaling: the number of containers can be adjusted according to performance
  • Service discovery: mechanism that lets services automatically find and talk to each other without hard-coding IP addresses
  • Serverless compatible: developers can focus on the code rather than on the infrastructure

Containers | Auditing

Containers are designed to run just one service, so all output logs are centralized and emitted via stdout and stderr. Log collectors can be used (ex: Fluentd).

More advanced monitoring tools can be used to aggregate logs from multiple containers across multiple hosts (ex: Prometheus + Grafana).

Containers | Security

  • Vulnerability scanning: tools can be used to statically scan containers’ images
  • Isolation: isolate processes and limit resource access

Containers | Edge Computing

Containers can be deployed close to the data source in an edge computing scenario, to enable real-time data processing. This practice can:

  • improve performance
  • reduce latency

Containers | Best Practices

  • [ [[#Containers Security]] ] Use non-root containers to avoid privilege escalation attacks
  • [ [[#Containers Security]] ] Use robust ACLs to manage permissions
  • [ [[#Containers Security]] ] Implement network segmentation for container communication: separating containers into different virtual networks so they don’t all see or talk to each other by default.
  • [ [[#Containers Auditing]] ] Monitor both the container(s) and the host environment
  • Scan images at multiple stages of the pipeline, to catch issues early, in both dev and prod (ex: Clair).
  • Start from containerizing few core services, not all at once
  • Secure every part of the container lifecycle
  • Automate the pipeline for image creation, testing, deployment and so on (ex: Jenkins)
  • Use orchestration tools to manage large-scale deployments

Hybrid VMs + Containers

Modern environments combine [[#VMs]] with [[#Containers]], so a hybrid approach is used.

  • VMs are used to provide OS-level isolation; one per workload
  • Containers are used to provide process-level isolation; one per part of a workload

Cloud Services Models

The idea is to deliver services over the Internet without worrying about infrastructure or hardware, delegating everything to cloud computing providers. This abstraction can be applied at different layers of a cloud app: the infrastructure, the platform, or the app itself.

IaaS - Infrastructure as a Service

Deliver computing power, storage and network resources via the cloud.

Customers have control over:

  • VM
    • runtime
    • OS
    • apps
  • Network configuration
  • Storage

PaaS - Platform as a Service

Offer a platform with tools for building, testing and deploying applications, without managing infrastructure or the OS of the machine.

Customers have control over:

  • development tools, database, middleware
  • hidden infrastructure layers

PaaS styles:

  • Instance-based PaaS: application code is deployed to VMs
  • Framework-based PaaS: code must conform to a specific framework
  • Metadata-based PaaS: code is expressed as metadata instead of normal source code

SaaS - Software as a Service

Deliver fully functional software applications over the Internet.

  • subscription-based
  • hidden infrastructure layers
  • vendor lock-in ![[Pasted image 20250829112810.png|500]]

SaaS | Multi-tenant

In general, the software offered is used by multiple customers, so the software must support multi-tenancy: customers use the same infrastructure, but their data is logically separated.

Data can be partitioned in two ways (see [[#Resource Management]]):

  • Siloing: dedicated database per tenant
  • Pooling: shared database, different tenant IDs

Each tenant is created after the user has gone through the control plane component, which handles onboarding and authentication.

Different tenants can access different subsets of the services offered by the SaaS, considering their auth level.
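The pooling approach can be sketched with a shared SQLite table where every row carries a tenant ID (the table, column, and tenant names below are illustrative):

```python
import sqlite3

# Pooled multi-tenancy sketch: one shared table, rows logically separated
# by a tenant_id column instead of a dedicated database per tenant.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", "laptop"), ("acme", "mouse"), ("globex", "chair")],
)

def orders_for(tenant_id):
    # Every query is scoped by tenant_id, keeping tenant data separated.
    rows = conn.execute(
        "SELECT item FROM orders WHERE tenant_id = ?", (tenant_id,)
    )
    return [item for (item,) in rows]
```

Siloing, by contrast, would give each tenant its own database (or connection), trading this query-level discipline for stronger isolation at higher cost.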

Resource Management

When resources (computing, storage, network) are shared, three different approaches can be used:

  • Silo: each tenant has a dedicated resource
    • ⊕ high security and high performances
    • ⊖ high costs
  • Pool: resources are shared across the tenants
    • ⊕ more efficient, lower costs
    • ⊖ potentially insecure
  • Hybrid: some resources are shared, others are dedicated

Cloud Deployment Models

  • Public Cloud: infrastructure shared with the public
    • scales on-demand
    • pay-as-you-go
  • Private Cloud: infrastructure dedicated to the single organization
    • highly customizable
    • higher costs
  • Hybrid Cloud: combines public and private
    • flexible (between public and private approach)
  • Community Cloud: infrastructure shared among specific organizations (ex: GARR) with a common mission
    • shared policies
    • shared requirements
    • shared governance

Security is critical in every model. Compliance standards are employed, such as HIPAA and PCI DSS.

Top 3 Cloud Providers

![[Pasted image 20260120182725.png|700]]

Principles

The following is a selection of principles used in the cloud-native scenario. A principle is a set of rules guiding system design decisions. Principles define:

  • the system structure
  • the interactions and relationships between components

and help manage the complexity of the system.

Cloud Native | API First Principle

All services are exposed via APIs, typically REST.

  • Description before implementation: documentation is a contract between services

Cloud Native | Monolithic Architecture Principle

All components are tightly coupled in a single application; APIs are exposed.

  • Suitable for small applications
  • The entire system is bound to the slowest component

Cloud Native | Polylithic Architecture Principle

The application is broken into multiple independent components, that can be built and deployed independently.

  • modular approach
  • single components can be scaled

Cloud Native | Polyglot Persistence Principle

In this scenario, multiple database technologies are used, each of them suitable for its dedicated purpose.

  • Relational DB ← transaction-heavy systems (ex: MySQL, Postgres)
  • NoSQL DB ← scalability and high throughput (ex: MongoDB)
  • Graph DB ← complex relationships and queries (ex: Neo4j)

Cloud Native | Modeled with Business Domain Principle

The idea is to develop the software by connecting the implementation closely to the business concepts; also known as [[#Domain-Driven Design]] (DDD) ![[Pasted image 20260121113735.png]]

Cloud Native | Consumer First Principle

The API definition is tailored to the customer. Define how microservices will interact, from the user's perspective.

  • API documented with tools like Swagger, Confluence, JIRA

Cloud Native | Decentralize Everything Principle

Promotes self-direction, self-sufficiency and autonomy.

  • decentralize microservices → more testable, scalable
  • infrastructure as code (ex: Kubernetes)
  • decentralize DevOps → each pipeline is assigned to one team
  • decentralize data → [[#Cloud Native Polyglot Persistence Principle]]

Cloud Native | Digital Decoupling Principle (DDP)

Migrating logic and data to service-based platforms.

Cloud Native | Isolate Failure Principle (IFP)

Prevent system-wide failures by isolating issues, so that other services can keep running despite the failed ones. Avoids single-point-of-failure scenarios.

Cloud Native | Location-Independent Principle (LIP)

#TODO

Security | Defense in Depth Principle

Implement security in multiple layers of the application:

  • logic → secure coding, authentication
  • data → encryption
  • network → firewalls, security groups, VPNs

The more control the user has, the less protection is ensured by the provider.

Example: Amazon AWS

  • Infrastructure services (ex: [[#EC2]]) → more customizable, more user responsibility
  • Container services (ex: RDS) → mid
  • Managed services (ex: [[#S3]], KMS, [[#DynamoDB]]) → less customizable, less user responsibility

Security | Security by Design Principle

Vulnerabilities and attacks in production are reduced because the product was built to be secure from the ground up. Multiple layers of security are used (see [[#Defense in Depth Principle]]) and services are independent from each other (see [[#Cloud Native Isolate Failure Principle (IFP)]]).

Container | Single Concern Principle (SCP)

One responsibility per container. This allows the image to be deployed for different scenarios, with different configuration, of course.

Container | High Observability Principle

Understand the internal state of the system by leveraging on its auditing features (see [[#Containers Auditing]]), such as logs and metrics. More complex tools can be used such as Grafana, Prometheus, NewRelic.

Container | Lifecycle Conformance Principle

Containers should react to lifecycle events.

  • SIGTERM, sent by docker stop → graceful shutdown
  • SIGKILL → sent by the engine if the container has not shut down within 10 s
  • post-start → perform initialization tasks
  • pre-stop → perform cleanup tasks

Container | Image Immutability Principle (IIP)

A container’s image must not be changed once built. When multiple behaviors are needed, they must be set in the configuration.

  • image reusability
  • rollback enhanced

Container | Process Disposability Principle (PDP)

Containers are meant to be ephemeral, so designed for quick startup and shutdown.

  • allows auto-scaling features (required by cloud native approach)

Container | Self-Containment Principle

A container must contain everything it needs at build time, such as libraries and dependencies. Configuration and state live outside the container and are injected via configuration.

Container | Runtime Confinement Principle (RCP)

Each container should declare its resource requirements and communicate them to the hosting platform. Applications that go beyond the limits are eligible for termination.

Code Quality | Don’t Repeat Yourself Principle (DRY)

Reuse code through functions and shared libraries, and reuse images.

  • use semantic versioning (semver)
  • limit duplicated code by relying on libraries

Code Quality | Isolation

Changes to one module should not impact others. Each microservice is responsible for its own state. Allows [[#Cloud Native Isolate Failure Principle (IFP)]]

  • limit dependencies

Common Closure Principle (CCP)

Components that change for the same reason should be packaged together.

  • Reduces the number of services to be redeployed when an update is necessary

Separation of Concerns (SoC)

Decouple software in isolated modules; break larger problems into smaller ones and well define boundaries.

  • separate UI, API, logic, DB

Architecture Layering

Layering an application means organizing it into horizontal and vertical layers.

Layering in traditional applications

  • Presentation
  • Business logic
  • Persistence
  • Database (?)

Layering in cloud-native applications

  • Presentation
  • Integration: event-driven architecture
  • Legacy: integration with internal enterprise application
  • Third-party SaaS: integration with externals

Orthogonal Design Principles: Cohesion and Coupling

Orthogonality means that components/modules can be modified without affecting the others; this presumes a modular design. ![[Pasted image 20260121150500.png|300]] Orthogonal design is based on the cohesion and coupling principles:

  • Cohesion: measures how strongly the elements inside a module are related to each other.
    • High cohesion: each element performs a well-defined task, so it’s easier to understand, maintain and reuse.
    • Low cohesion: elements are loosely related, making the module harder to maintain and reuse
    • Functional Cohesion: each element performs a well-defined task
    • Sequential Cohesion: each element performs a task that is part of a sequence of the module
    • Communication Cohesion: each module operates on the same data or reference the same structure. Reusability is reduced because tied to a specific data source
    • Procedural Cohesion: modules are grouped on the order of execution inside the module
    • Temporal Cohesion: elements must execute at the same time. This limits reusability
    • Logical Cohesion: elements are grouped because perform similar tasks
    • Coincidental Cohesion: elements are grouped for no particular reason
  • Coupling: measures the degree of interdependence between modules/microservices
    • No Coupling: modules do not communicate/depend (ex: server and clients that exchange details over socket)
    • Message Coupling: modules communicate without passing parameters (ex: dependency injection)
    • Data Coupling: modules communicate by passing parameters with no sharing of unnecessary information
    • Stamp Coupling: modules communicate by passing parameters where only part of data is required, maybe using complex structures
    • Control Coupling: one module controls the other one by passing control parameters
    • External Coupling: modules share an externally imposed data format/protocol/interface
    • Common Coupling: modules depend on shared global data
    • Content Coupling: one module modifies the internal behavior of another one

SOLID Design Principles

Set of principles used to reduce dependencies, improve maintainability/flexibility/scalability of the software, in order to make it cloud-native compliant.

Single Responsibility Principle (SRP)

Every module must have its own responsibility/purpose w.r.t. a specific domain (see [[#Separation of Concerns (SoC)]]).

Open-Closed Principle (OCP)

Software entities should be open for extension, but closed for modification. Enhancements should be made without modifying existing functionality.

Liskov Substitution Principle

Objects of a superclass should be replaceable with objects of its subclasses without affecting the correctness of the application.

Example: assume we have just defined the Document class and then derived the Letter, SMS, and Novel classes as children of Document. Wherever a Document object is required in our code, it should be possible to use any object instantiated from one of Document's child classes, such as SMS, without compromising the proper functioning of the program.
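A sketch of this example in Python; the word_count method is an illustrative behavior shared by all Document subtypes, not part of the original example:

```python
# Liskov Substitution sketch: any Document subclass can stand in for Document.
class Document:
    def __init__(self, text):
        self.text = text

    def word_count(self):
        return len(self.text.split())

class SMS(Document):
    pass

class Letter(Document):
    pass

def summarize(doc: Document):
    # Works for any Document subtype; no isinstance checks are needed,
    # which is exactly what LSP guarantees.
    return f"{type(doc).__name__}: {doc.word_count()} words"
```

If a subclass strengthened preconditions (say, SMS raised an error on texts longer than 160 characters where Document did not), substituting it would break callers and violate the principle.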

Interface Segregation Principle (ISP)

Consumers should not be forced to depend on methods or properties they do not use. This leads to less dead code.

Dependency Inversion Principle (DIP)

High-level modules should not depend on low-level modules; both should depend on abstractions between them. This leads to loose coupling.
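A minimal sketch of the principle: a hypothetical OrderService (high-level) depends only on a Notifier abstraction, so concrete channels can be swapped without touching it (all class names here are illustrative):

```python
from abc import ABC, abstractmethod

# The abstraction both sides depend on.
class Notifier(ABC):
    @abstractmethod
    def send(self, message: str) -> None: ...

# A low-level detail implementing the abstraction; an EmailNotifier or
# SmsNotifier could be substituted without changing OrderService.
class ConsoleNotifier(Notifier):
    def __init__(self):
        self.sent = []

    def send(self, message: str) -> None:
        self.sent.append(message)

class OrderService:
    def __init__(self, notifier: Notifier):
        self.notifier = notifier  # injected abstraction → loose coupling

    def place_order(self, order_id: int):
        self.notifier.send(f"order {order_id} placed")
```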

Microservices

Hexagonal Architecture

Note

It is not directly related to microservices. A full monolith or a single microservice can follow the hexagonal architecture.

The idea is to separate the core business logic from the external services (decoupling).

  • Flexible → components can be swapped easily (ex: MySQL → MongoDB)
  • Testable

The hexagon is made up of:

  • core: the business logic
  • ports: entry/exit points between the application and the outside world. External systems are connected via dedicated adapters ![[Pasted image 20260121161942.png|500]]
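The port/adapter split can be sketched as follows; the OrderRepository port and its in-memory adapter are illustrative, but swapping the adapter for a MySQL- or MongoDB-backed one would not touch the core:

```python
from abc import ABC, abstractmethod

# Port: an exit point from the core toward persistence.
class OrderRepository(ABC):
    @abstractmethod
    def save(self, order): ...

# Adapter: one concrete implementation of the port (here in-memory;
# a database-backed adapter would implement the same interface).
class InMemoryAdapter(OrderRepository):
    def __init__(self):
        self.orders = []

    def save(self, order):
        self.orders.append(order)

# Core: business logic, unaware of the storage technology behind the port.
class OrderCore:
    def __init__(self, repo: OrderRepository):
        self.repo = repo

    def place(self, item):
        order = {"item": item, "status": "placed"}
        self.repo.save(order)
        return order
```

This is also what makes the architecture testable: tests exercise OrderCore against the in-memory adapter without any external service.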

Example: Migrate app to microservice

![[Screenshot 2026-01-21 alle 16.30.19.png]]

Step 1: Identifying System Operations

Define the specific operations the system must support, independent from the technical development of the system. Operation → Task/Process.

FTGO: operations are placing an order, updating an order, process payment.

Step 2: Identifying Services

Step 1 → [Functional requirements] → Step 2

Map each operation to a dedicated microservice.

FTGO: create two microservices: one for orders and the other for payment. More precisely:

  • Accounting service
  • Supplier Management:
    • Restaurant service
    • Courier service: couriers management is separated to enhance their distinct role in the scenario
    • Consumer service
    • Kitchen service
    • Delivery service: merges both Courier availability and Delivery management because they are strictly related

![[Pasted image 20260121171459.png|400]]


Another way to define services is based on DDD (see [[#Cloud Native Modeled with Business Domain Principle]]). Basically the domain is broken into subdomains and one service is created for each subdomain.

FTGO: ![[Pasted image 20260122101516.png|400]]


Each service defined must satisfy the [[#Single Responsibility Principle (SRP)]] and the [[#Common Closure Principle (CCP)]]. Having many services might increase network latency. Possible solutions:

  • batch API calls
  • combine services that need frequent interaction

In a microservice scenario, maintaining consistency across services is critical, especially when updating data concurrently. The solution is to:

  • use SAGA #tolink

For each service, one can define:

  • operations: functions of the current service
  • collaborators: other services called by the current one’s operations

FTGO: ![[Pasted image 20260122102717.png|500]] ![[Pasted image 20260122102743.png|500]]

Step 3: Define Service APIs and Collaborations

API Semantic Versioning

It is a good practice to version the APIs, and to increment the version when changes are made. The format to follow is MAJOR.MINOR.PATCH:

  • MAJOR: increment for non backward-compatible API changes
  • MINOR: increment for backward-compatible API feature additions
  • PATCH: increment for backward-compatible API bugfixes

Note

Backward-compatible: a newer version of the API is compatible with already existing clients.


APIs with different versions can exist at the same time, so there must be some way to call a specific API version:

  • URI Versioning: version is embedded in the URI (ex: GET /v2/orders/xyz)
    • potential breakage of existing links
  • MIME-Type Versioning: version is included in the HTTP headers, via content negotiation
    • requires client support to read/request it
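The MAJOR.MINOR.PATCH rules above can be turned into a small compatibility check (a simplified sketch that ignores pre-release and build metadata):

```python
# Parse a MAJOR.MINOR.PATCH string into a tuple of ints.
def parse(version):
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

# A client built against `client` keeps working against `server` when the
# MAJOR numbers match and the server's MINOR.PATCH is at least as high:
# MINOR/PATCH bumps are backward-compatible, MAJOR bumps are not.
def backward_compatible(client, server):
    c, s = parse(client), parse(server)
    return s[0] == c[0] and s[1:] >= c[1:]
```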

Microservice Inter-Process Communication (IPC)

Practices that enable microservices to communicate and collaborate on processing requests.

Synchronous Communication

Client sends a request and waits for a response.

Remote Procedure Invocation (RPI)

The request is done by using a proxy interface implemented by an adapter class. Components involved are:

  • RPI Proxy: routes requests to the service
  • RPI Server Adapter: handles received requests by invoking the business logic
  • Response Flow: not an entity, but the process by which the service handles a request and sends back a response

Service discovery mechanisms might be required to locate services or to handle partial failures.

REST

Resources are identified using URIs and data is represented using JSON or XML formats. HTTP requests use different verbs:

  • GET: retrieve a resource
  • POST: create new resources
  • PUT: update existing resources
  • DELETE: delete existing resources

Service discovery mechanisms might be required to locate services.

REST APIs are often documented using tools such as

  • OpenAPI: standardized Interface Definition Language (IDL) that allows developers to define APIs in a human-readable format.
    • Some tools allow the automatic openAPI document generation
    • Helps in automatically generate client stubs and server skeletons
    • Initially part of the Swagger project
  • Swagger: tool for writing OpenAPI definitions; provides a GUI for API testing ![[Pasted image 20260122112744.png|150]]

The main sections of an OpenAPI document are:

  • Info: title, version, description, license
  • Servers: one or more base URLs
  • Paths → Operations:
    • endpoints and respective HTTP method
    • parameters (see below)
    • request body (see below)
    • responses (see below)
  • Components: reusable entities
    • schemas
    • parameters
    • responses
    • callbacks

gRPC

gRPC uses:

  • HTTP/2 for data transport
  • Protocol Buffers for message serialization: a binary format more efficient than text-based formats (like the JSON used in [[#REST]]), but also more rigid in its definitions

Communication between gRPC entities can follow different approaches:

  • Unary RPC: single request, single response
  • Server Streaming RPC: single request, stream of responses
  • Client Streaming RPC: stream of requests, single response
  • Bidirectional Streaming RPC: stream of requests, stream of responses (suitable for real-time interaction)

Handling Partial Failures

![[Pasted image 20260122114557.png|400]] Partial failures occur when the client blocks (ex: waiting for an unresponsive service), which can lead to cascading failures. Several patterns, or combinations of them, are used.

Circuit Breaker Pattern

In the Circuit Breaker Pattern, a proxy between the consumer and the microservice is used:

  1. it trips when failures (ex: timeouts) exceed a failure threshold
  2. it prevents further request attempts by opening the circuit
  3. after a timeout, the client can retry and, if successful, the circuit is closed

Additional recovery strategies can be used like:

  • use network timeouts
  • impose a cap on the number of requests
  • return meaningful errors
  • use fallback responses (ex: cached data)
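A minimal circuit breaker sketch in Python (class name, thresholds and API are invented for illustration; production systems use libraries such as resilience4j):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (not production-grade)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        # Open circuit: reject immediately until the timeout elapses
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

While open, the breaker fails fast instead of letting the client block on an unresponsive service.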
Bulkhead Pattern

Recall the [[#Cloud Native Isolate Failure Principle (IFP)]] as a basis for the Bulkhead pattern: when a microservice fails, it won’t impact the ability of others to respond.

Retry Pattern

Faults are broken into:

  • Transient Faults: likely to succeed upon repetition, so the client retries failed service operations
  • Non-Transient Faults: unlikely to succeed upon repetition, so frequent retries are avoided by using an exponential backoff
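Retry with exponential backoff can be sketched as follows (function name and parameters are invented; jitter is added to avoid synchronized retries):

```python
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1):
    """Retry an operation with exponential backoff and jitter.

    Suitable for transient faults only; non-transient faults should be
    re-raised immediately by the caller's own classification logic.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Exponential backoff: base, 2x base, 4x base, ... plus jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```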
Service Discovery

Since autoscaling, failures, upgrades are frequent in a microservices scenario, automatic service discovery mechanisms are required because static/hardcoding service endpoints could be a pain (recall the [[#Cloud Native Location-Independent Principle (LIP)]]).

A Service Registry is used: a database that holds the network location of services.

  • referenced by each service instance
  • can provide a list of available services
  • it is updated on services start, stop, fail
  • responds to clients’ queries for services’ instance locations, routing requests accordingly
  • can perform health checks

The Service Registry can be Application-Level… ![[Pasted image 20260122121817.png|400]] …or Platform-Provided ![[Pasted image 20260122122409.png|400]]
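A toy in-memory sketch of a service registry with heartbeat-based expiry (production systems use dedicated registries such as Eureka, Consul or the platform’s DNS):

```python
import time

class ServiceRegistry:
    """In-memory service registry sketch with TTL-based liveness."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl       # entries expire if not refreshed within ttl seconds
        self.instances = {}  # service name -> {instance id -> (address, last heartbeat)}

    def register(self, service, instance_id, address):
        # Called on service start, and again on each heartbeat
        self.instances.setdefault(service, {})[instance_id] = (address, time.monotonic())

    def deregister(self, service, instance_id):
        # Called on graceful shutdown
        self.instances.get(service, {}).pop(instance_id, None)

    def lookup(self, service):
        """Return addresses of live (recently heartbeated) instances."""
        now = time.monotonic()
        return [addr for addr, seen in self.instances.get(service, {}).values()
                if now - seen < self.ttl]
```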

Asynchronous Messaging

Clients exchange messages through channels. Messages are made up of:

  • Header: metadata
  • Body: data in text or binary format

Messages can be:

  • Documents: data
  • Commands: specify an operation
  • Events: often represent a state change

Channels can be:
  • Point-to-Point: messages are delivered to one client ![[Pasted image 20260122151543.png|400]]
    • used for commands
    • Algorithm:
      1. client sends a command message, with a msgId (MessageId)
      2. if message was requesting data then
        1. service replies with data and corresponding msgId (CorrelationId)
        2. else it was a One-Way Notification sent by the client to the service, so no response is expected
  • Publish-Subscribe: messages are delivered to all the clients subscribed to the topic ![[Pasted image 20260122151628.png|400]]
    • used for events
    • patterns:
      • Publish/Subscribe: producer publishes messages to a channel that multiple consumers can read
      • Publish/Async Responses: client publishes a message specifying a reply channel
Brokerless Messaging

Services can exchange messages without an intermediary.

Pros:

  • Less latency
  • Less traffic

Cons:

  • Each service must know the other ones

Example: ZeroMQ
Broker-based Messaging

Pros:

  • Loose coupling
  • Message buffering

Cons:

  • Single point of failure vulnerability

Examples: [[#RabbitMQ]], Kafka
Advanced Message Queuing Protocol (AMQP)

It allows different applications and platforms to communicate via messaging.

  • reliable: guarantees message delivery with ack, routing, etc.
  • flexible: supports different patterns

Core components:

  • Queue: a buffer that stores messages
    • durable: stored on disk
    • non-durable: stored in memory
  • Exchange: receives messages from producers and routes them to queues
  • Binding: defines how messages are routed
  • Message: data being transferred
    • persistent
    • transient

Possible messages flows:

  • Producer → Exchange: producer sends the message
  • Exchange → Queue: exchange routes the messages to the correct queue
  • Queue → Consumer: consumer processes the message

Types of exchange:

  • Direct Exchange: routes messages based on an exact match between the message routing key and the queue’s binding key.
    • supports multicast
  • Fanout Exchange: routes messages to all queues bound to the exchange, ignoring routing keys.
    • supports broadcast
  • Topic Exchange: routes messages based on pattern matching between routing key and binding key.
    • Example bindings:
      • logs.* → logs.error, logs.info
      • logs.# → logs.error.db, logs.system.cpu
  • Headers Exchange: routes messages based on message headers instead of routing keys.
    • Headers act like parameters for selecting clients
    • Used for more complex routing
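Topic-style pattern matching (`*` matches exactly one word, `#` matches zero or more words, words separated by dots) can be sketched as follows; this is an illustrative reimplementation, not RabbitMQ code:

```python
def topic_matches(binding_key, routing_key):
    """AMQP-style topic matching: '*' = exactly one word, '#' = zero or more."""
    def match(pattern, words):
        if not pattern:
            return not words          # pattern consumed: match iff no words left
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' may absorb any number of words (including none)
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        if head == "*" or head == words[0]:
            return match(rest, words[1:])
        return False
    return match(binding_key.split("."), routing_key.split("."))
```

This reproduces the bindings above: `logs.*` matches `logs.error` but not `logs.error.db`, while `logs.#` matches both.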

AMQP supports the Point-to-Point channel type for messaging (see [[#Asynchronous Messaging]] → Channel Types):

  1. the producer sends a message with a routing key
  2. the message is stored in the queue where queue name == routing key
  3. only one consumer will consume messages from the queue
RabbitMQ

It is based on the [[#Advanced Message Queuing Protocol (AMQP)]].

Remote Procedure Calling (see [[#Remote Procedure Invocation (RPI)]]) can be implemented over RabbitMQ: ![[Pasted image 20260122162837.png]]

  1. The client sends a request message to a specific rpc_queue, by specifying a correlation_id as the id for the communication and a callback queue where the server will store the result.
  2. The server processes the request and sends the result to the callback queue, keeping the correlation_id

In general, in a RabbitMQ-enabled scenario, the actors are:

  • Producers (P): send tasks to a queue
  • Consumers (C_1, C_2): process tasks from the queue

RabbitMQ can thus distribute tasks across the available workers, preventing any single one from being overloaded. Queues and messages can be made persistent to survive system crashes. ![[Pasted image 20260126155405.png | 200]]

Increasing the number of service instances can increase throughput, but may disrupt the expected order of message processing (competing receivers scenario). In that case a sharded message channel is used, which maintains message order under concurrent processing.
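A sharded message channel can be sketched with a simplified in-memory model (real systems shard inside the broker, e.g. Kafka partitions): messages with the same shard key always land in the same partition, so one consumer per partition preserves per-key ordering.

```python
import hashlib

class ShardedChannel:
    """In-memory sketch of a sharded message channel."""

    def __init__(self, num_shards=4):
        self.shards = [[] for _ in range(num_shards)]

    def _shard_for(self, key):
        # Deterministic hash: the same key always maps to the same shard
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def publish(self, key, message):
        self.shards[self._shard_for(key)].append(message)

    def consume(self, shard_index):
        """Each consumer owns one shard and drains it in FIFO order."""
        shard = self.shards[shard_index]
        while shard:
            yield shard.pop(0)
```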
AsyncAPI

It is an open standard used to document asynchronous and event-driven APIs (the analogue of Swagger for [[#REST]]). As with Swagger, server and client stubs can be generated from the documentation. The AsyncAPI Generator is used to automatically create documentation, while AsyncAPI Studio is a web-based GUI that can help developers. ![[Pasted image 20260122164744.png|150]] ![[Pasted image 20260122165312.png|150]]

  • Servers: defines one or more message brokers, their details and security policies
  • Channels: represent the communication paths through which messages are exchanged
    • topics
    • queues
    • routing keys
  • Channels → Channel: summary, description, messages
  • Channels → Messages: define the structure of the data
    • headers
    • payload
  • Operations: define the action taken on a channel
    • send/receive
    • channel reference
    • message reference
  • Tags and External Docs
  • Components: reusable entities
    • Schemas
    • Messages
    • Parameters
    • Security Schemes
    • Bindings for other protocols
    • Traits
    • Correlation IDs
Asynchronous Messaging | Handling Duplicated Messages

Duplicated messages might occur because (FTGO):

  1. an “Order Created” event is processed, then followed by an “Order Cancelled” event
  2. if “Order Created” isn’t acknowledged, both “Order Created” and “Order Cancelled” must be redelivered
  3. but “Order Created” might be the only one re-delivered

Solutions can be:

  • Idempotent Message Handlers: multiple invocations of the same operation have no additional effect
  • Track messages and discard duplicates
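Both countermeasures can be sketched together: an idempotent wrapper that tracks and discards duplicate message IDs (kept in memory here; a real service would persist the IDs, e.g. in a PROCESSED_MESSAGES table):

```python
class IdempotentHandler:
    """Discard duplicate deliveries by remembering processed message IDs."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_ids = set()

    def handle(self, message_id, payload):
        if message_id in self.seen_ids:
            return False  # duplicate delivery: ignore
        self.handler(payload)
        self.seen_ids.add(message_id)
        return True
```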
Asynchronous Messaging | Availability

Asynchronous messaging decouples services, so resilience and availability are improved.

With synchronous communication, like [[#REST]], availability is reduced because all services must be available at the same time.

System availability is calculated by multiplying all the services’ availabilities: $$ avail_{tot}=\prod_{i} avail_{service,i} $$
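For example, a synchronous chain of three services each with 99.5% availability drops to roughly 98.5% end to end:

```python
def total_availability(service_availabilities):
    """Availability of a synchronous chain: the product of the parts."""
    total = 1.0
    for a in service_availabilities:
        total *= a
    return total

# Three services at 99.5% each yield roughly 98.5% overall
chain = [0.995, 0.995, 0.995]
print(round(total_availability(chain), 4))
```

Every extra synchronous dependency multiplies in another factor below 1, which is why long synchronous chains hurt availability.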

Full Asynchronous Interaction

Order Service receives a request from the client, so it will interact asynchronously with the Consumer and Restaurant Service to process the request.

  • No service is blocked waiting for a response
  • However, the API gateway will use a synchronous protocol such as REST ![[Pasted image 20260122174622.png|500]]

Event-Driven Replication

Some services might persist data received from other services to avoid an excess of requests. However, storage and data aging problems must be handled. ![[Pasted image 20260122174952.png|500]]


The final flow is the following one: ![[Pasted image 20260122175430.png|500]] Basically, when the client requests a createOrder operation, it is executed and an immediate response is returned with a PENDING state, without waiting for further processing. After the other services asynchronously perform their tasks, the order will get a VALIDATED state.

API Composition Pattern

In a monolithic approach, data is in a single database so queries can use a single select to collect and join data. In microservice architectures data is scattered across multiple services, so multiple data sources.

The idea of the API Composition Pattern is to locate the API composition logic in a specific component of the architecture called the API Composer. There are less and more structured approaches to implement the API Composer:

  • Client composition: the composer is embedded in the web application itself.
  • Gateway composition: the composer is combined with the API gateway; the component will both expose external APIs and centralize API composition logic.
  • Service composition: the composer is a standalone service; usually used for more complex API composition logic. ![[Pasted image 20260129122031.png]]

The composer can perform query calls in two ways:

  • Parallel Calls: latency is minimized because multiple services are invoked simultaneously
  • Sequential Calls: needed when a service call depends on the result of another one
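Parallel composition can be sketched with asyncio (the downstream calls are stubbed with sleeps; a real composer would issue HTTP/gRPC requests, and the field names are invented):

```python
import asyncio

# Hypothetical downstream service calls, stubbed for illustration.
async def fetch_order(order_id):
    await asyncio.sleep(0.01)
    return {"orderId": order_id, "state": "VALIDATED"}

async def fetch_delivery(order_id):
    await asyncio.sleep(0.01)
    return {"orderId": order_id, "eta": "15min"}

async def get_order_details(order_id):
    # Parallel calls: both services are invoked simultaneously,
    # so latency is max(t1, t2) instead of t1 + t2.
    order, delivery = await asyncio.gather(fetch_order(order_id),
                                           fetch_delivery(order_id))
    return {**order, **delivery}

result = asyncio.run(get_order_details("42"))
```

A sequential call would instead `await` the first service and feed its result into the second.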

Pros

  • simplicity
  • reusability
  • flexibility

Cons

  • increased overhead ← scattered data across data sources
  • reduced availability ← more services involved
  • risk of data inconsistency between data sources
  • developers may spend more time on improving performance than on business logic features

FTGO: implementing findOrderHistory() query ![[Pasted image 20260129124215.png|450]] Of course, the query will be filtered on:

  • consumerId
  • page for pagination
  • and more: maxOrderAge, restaurantName, etc.

Data needed to perform filtering is - of course - scattered across multiple services.

The API Composition Pattern allows multiple strategies to handle joins:

  • In-Memory Join: all data is fetched from the services and the composer performs the join
    • inefficient for large data sets
  • Bulk Fetch by ID: data is queried from the services using appropriate parameters
    • services must expose bulk fetch APIs
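An in-memory join can be sketched as follows (the sample rows mimic FTGO-style Order and Restaurant services; field names are invented for illustration):

```python
# Hypothetical per-service result sets; in reality these would come
# from the services' bulk fetch APIs.
orders = [
    {"orderId": 1, "restaurantId": 10, "consumerId": 7},
    {"orderId": 2, "restaurantId": 11, "consumerId": 7},
]
restaurants = [
    {"restaurantId": 10, "name": "Ajanta"},
    {"restaurantId": 11, "name": "Gordo's"},
]

def in_memory_join(orders, restaurants):
    """The composer joins the two result sets by restaurantId."""
    by_id = {r["restaurantId"]: r for r in restaurants}
    return [{**o, "restaurantName": by_id[o["restaurantId"]]["name"]}
            for o in orders]

history = in_memory_join(orders, restaurants)
```

For large result sets this quickly becomes inefficient, which is what motivates Bulk Fetch by ID.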

Coordination and Orchestration

One of the critical aspects of using microservices is transaction management: data is spread across different services because each one of them has its own database, but we still want to implement transactional operations. A service often needs to:

  • update the database
  • publish messages

Recall the [[#Asynchronous Messaging Availability]] issue: all services must be available for the transaction to complete.

The approach is to prefer the availability over consistency of the system (see [[#CAP Theorem]]).

CAP Theorem

A system can only have two out of three properties:

  • Consistency: all nodes see the same data at the same time
  • Availability: every request receives a response (success or failure)
  • Partition Tolerance: the system continues to operate, despite network splits

One might suggest using the distributed transactions pattern, but it is not suitable for microservices.

The Outbox Pattern

![[Pasted image 20260126163810.png|600]] The outbox table is used as dedicated database table that acts as a temporary message queue.

  1. When the service finishes processing, it inserts results into the database and a message into the outbox table, within a single transaction.
  2. A separate process, the message relay, polls the outbox table - following the Polling Publisher Pattern - and sends the messages to a message broker (ex: Kafka, RabbitMQ).
  3. Once the events have been sent, the corresponding rows are deleted from the outbox table.

There is a variant where the message relay reads from the Transaction Log of the database, without using an additional table. This is called the Transaction Log Tailing Pattern.

  • Scales better than the [[#The Outbox Pattern]] - polling can be a bottleneck in high-traffic systems
  • More complex to implement
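A minimal sketch of the Outbox Pattern using SQLite (table and event names are invented for illustration; a real relay would run continuously and publish to a broker):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT)")

def create_order(order_id):
    # One local ACID transaction updates business data AND enqueues the event
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        conn.execute("INSERT INTO outbox (event) VALUES (?)",
                     (f"OrderCreated:{order_id}",))

def relay_poll(publish):
    # The message relay polls the outbox, publishes, then deletes the rows
    rows = conn.execute("SELECT id, event FROM outbox ORDER BY id").fetchall()
    for row_id, event in rows:
        publish(event)
        conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    conn.commit()

published = []
create_order(1)
relay_poll(published.append)
```

Because the business write and the outbox insert share one transaction, the event is never lost and never published for a rolled-back update.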

SAGA Pattern

It is a distributed design pattern for managing long-running transactions across microservices. It ensures consistency even without relying on distributed transactions.

All steps in a saga pattern-enabled process are triggered by specific events or commands. If a step of the chain fails, preceding actions are reversed to maintain consistency by using appropriate [[#Compensating Transactions]].

SAGA | Architecture

![[Pasted image 20260126165936.png|400]] Unlike ACID transactions, SAGA cannot rollback automatically, so compensating transactions must be defined explicitly. Their behavior can differ from one scenario to another.

  • Compensatable Transactions: transactions that need a compensating transaction to roll back state - when some write operations were made. They precede the pivot
  • Pivot Transaction: if it succeeds, the saga will complete
  • Retriable Transactions: they are guaranteed to succeed after the pivot

In general, each transaction can be executed on a different participant (microservice).

FTGO: ![[Pasted image 20260126170148.png|600]]

SAGA | Coordination

The whole process can be coordinated in two ways:

  • Choreography: coordination is decentralized, so services listen for/trigger each other and acts accordingly
  • Orchestration: a saga orchestrator service manages the whole flow, so sends commands to each service, in sequence
Choreography-based Coordination

Again, services publish events when creating/updating objects. Loose coupling still holds because queues are used as proxies, so participants that use a specific queue do not need to know each other, but just the queue.

  • Prone to cyclic dependencies
  • not scalable → participants might need to subscribe to a lot of events

FTGO: success flow ![[Pasted image 20260127105100.png|400]] FTGO: error flow ![[Pasted image 20260127105611.png|400]] In the example:

  • OrderService exposes POST /orders APIs
  • messages are exchanged using [[#RabbitMQ]]
  • routing keys are consistent: <service>.<event>
  • AccountingService is the pivot
Orchestration-based Coordination

FTGO: ![[Pasted image 20260127105939.png|400]]

Here a saga orchestrator is used: it sends commands to each service, in sequence, waiting for their asynchronous replies. A saga orchestrator can be represented using a State Machine: a set of states and transitions between them. Each - state machine’s - transition is triggered by the completion of a local transaction, and it determines the next state. It acts as a good compact representation for the whole scenario.

  • it ensures that both the database update and message publication occur - atomically - together or not at all
  • retries requests if necessary
  • tracks the saga process

The other participants must send a reply message to the orchestrator once they updated their local DBs. The reply message triggers the next step.

Pros:

  • cyclic dependencies are avoided since participants communicate with the coordinator and not with each other
  • less coupling: participants must know just the coordinator
  • improved SoC because the saga process is entirely managed by the coordinator

Cons:

  • the orchestrator might become too complex

FTGO: ![[Pasted image 20260126175231.png|600]]

SAGA | Handling Lack of Isolation

Saga lacks isolation: concurrent sagas can access data mid-execution, causing anomalies.

SAGA | Anomalies
  • Lost Updates: one saga overwrites changes made by another one without first reading them
  • Dirty Reads: one saga reads data from another saga that hasn’t - successfully - completed its updates
  • Fuzzy/Nonrepeatable Reads: different steps of one saga read the same data, but receive different values because they have been changed by another saga
SAGA | Countermeasures

The concurrency-handling mechanism can be chosen based on the business risk associated with each request.

  • Semantic Lock: a compensatable transaction sets a flag (ex: _PENDING) in the record being updated, acting as a lock for the record not fully committed. The flag will be cleared by a retriable transaction (saga completes) or a compensating transaction (saga rolls back). When a record is read as locked, options are:
    • Fail and retry
    • Block until lock is released → more isolation, deadlock management is required
  • Commutative Updates: ensure operation can be executed in any order, without affecting the final outcome. Commutative operations are easier to roll back.
  • Pessimistic View: ensure a logical sequence of the operations in the saga to minimize business risk.
  • Reread Value: prevents lost updates by checking one more time the record’s state before making changes, in case another saga updated it after the first read.
  • Version: a version field is used to track all the updates made on the record, with the associated timestamp

Business Logic Organization Patterns

The organization of business logic inside a microservice determines its scalability and maintainability. Often the decision to make is to choose between the following ones.

Transaction Script

It is a procedural approach, so suitable for simple, linear logic.

Each operation is handled by a separate and dedicated script or method, living in service classes, while data access objects (DAOs) handle persistence.

Applications:

  • used for CRUD-style microservices
  • suitable for light microservices

FTGO:

  • OrderService → logic
  • OrderDao → data access
  • Order → pure data model

Business Capability Pattern

Domain Model

It is an object-oriented approach, so suitable for structured scenarios that require reusable logic.

Here data and logic coexist within domain classes. The purpose is to model real-world concepts.

Requires much more design effort, but pays off for maintainability, testability, scalability, evolving business logic.

Cons:

  • some relationships and boundaries can be left implicit

Domain-Driven Design

Refines the Domain Model and is used in complex domains. Each microservice defines a bounded context taken from the business domain. The idea is to align the entire software model with the business model.

Main components are:

  • Entity: actor in the scenario defined by its unique identity and not by its attributes, with a dedicated lifecycle
  • Value Object: actor in the scenario defined by its attributes
  • Factory: creates objects
  • Repository: handles persistence and retrieval
  • Service: contains domain logic not belonging to a particular Entity/Value Object
DDD | Aggregates

All the objects related to a single transactional unit are grouped into an Aggregate. This is an addition w.r.t. [[#Domain Model]]. The aggregate entity acts as a root entity because it is the only entrypoint to access single items.

Aggregates allow clear transaction boundaries definition, so they avoid complex coordination logic across services and they are focused on ensuring local consistency.

A scale trade-off must be considered:

  • too small aggregates → too many connections
  • too large aggregates → performance degradation
DDD | Aggregating Rules
  1. Reference Only the Aggregate Root: only the root of an aggregate (object) can be accessed from external classes or services
  2. Inter-Aggregate References by Primary Key: aggregates reference each other by primary key (atomic field) only, not by reference
  3. One Transaction per Aggregate: each transaction can create/update only one aggregate. Of course, operations involving multiple aggregates use the [[#SAGA Pattern]]. ![[Pasted image 20260127125454.png|400]]
    • One transaction per Aggregate
    • Multiple transactions per Service

Now that the aggregate concept is clear, Services can be thought of as the (only) entrypoint to the specific aggregate’s business logic.

DDD | Domain Events

Given an instant in the whole timespan, each aggregate has its own state. A Domain Event represents a significant change of state. Domain Events names are usually past-participle verbs (ex: OrderCreated) and have properties that contain information about the event. All the data is contained by a wrapper object of the event.

FTGO: the Order is an aggregate of Order Items and it is the entrypoint to the items. ![[Pasted image 20260127124227.png|400]]

Final FTGO Example

![[Pasted image 20260127170007.png|400]]

  • OrderService: is the primary entrypoint to the system logic. It manages two aggregates:
    • Order ← persistence managed by OrderRepository
    • Restaurant ← persistence managed by RestaurantRepository
  • Inbound adapters:
    • User → [ REST APIs ] → OrderService
    • Listen to RestaurantService → [ OrderEventConsumer ] → update Restaurant data
    • OrderCommandHandlers
    • SagaReplyAdapter → handles OrderSaga responses and triggers saga actions
  • Outbound Adapters:
    • OrderRepository → [ DBAdapter ] → Database

Event-Driven Architecture

It is an asynchronous architecture where components communicate via events. Core components are:

  • Event Producer: detects and generates events, sending them to Brokers
  • Event Broker: filters and routes events to the appropriate Consumers
  • Event Consumer: reacts to and processes events, asynchronously (enables decoupling)
  • Event Persistence: stores events for durability and replay

Features:

  • Asynchronous communication → services operate independently, decoupled → low latency
  • Loose coupling → services do not need to know each other → better maintainability
  • Scalability
  • Event Persistence → enables auditability
  • Real-time event processing
  • Event stream → continuous stream of data

Event Processing Styles

  • Single Event Processing (SEP): an event immediately triggers an action (Trigger-Action)
  • Event Stream Processing (ESP): filter and processes stream of events (Real-Time Flow)
    • Kafka is often used
  • Complex Event Processing (CEP): filter and analyzes stream of events to identify meaningful patterns (Multi-Event Analysis)
    • Used for logistics, energy management, finance

Change Data Capture (CDC)

The idea is to track incremental changes - INSERT, UPDATE, DELETE - made to data in a data store. This approach is very useful to synchronize changes across repositories. All the changes are read from transaction logs, timestamps or triggers. ![[Pasted image 20260127181516.png | 400]]

Data streams triggered by CDC events can be processed in real-time: this is called Event Streaming. Changes are delivered as streams to allow immediate processing. Core components are:

  • Event Storage: stores data chronologically
  • Streaming Technology (Kafka, Kinesis, etc.)
  • Stream Processors: enrich, transform, analyze incoming data

Event Streaming is used to:

  • capture real-time insights for applications
  • track user actions in an e-commerce ![[Pasted image 20260127181558.png|500]]

Event streaming can be combined with message queues for seamless and real-time integration.

Event Sourcing

In this pattern, state changes are stored as a series of events in event stores, instead of just performing the change. The current state is built by replaying the whole history of events. Moreover, it is possible to revert to a previous state, just by replaying the history up to a specific event.

  • Useful for auditing and maintenance

Each time an aggregate is requested, it is loaded using the following algorithm:

  1. load all the events for the aggregate from the EVENT table
  2. create an instance using the aggregate’s default constructor
  3. apply all the events fetched to the instance, using apply()

Pros

  • User and time are tracked for each state change, so for each event
  • Auditing, historical queries, temporal analysis are facilitated
  • Events are easier to (de)serialize w.r.t. ORM-based objects

Cons

  • Idempotence must be implemented
  • Data deletion is more difficult, especially because of GDPR → encryption/pseudonymization can be solutions
  • queries made on the whole aggregate state can be complex

FTGO: each order is persisted as multiple rows in the EVENTS table, each one representing a state change (ex: OrderCreated, OrderApproved, OrderShipped). ![[Pasted image 20260128150342.png|400]]

Event Sourcing vs Traditional approach

In the traditional approach, persistence is implemented using:

  • Class-to-Table Mapping: each application class → database table
  • Field-to-Column Mapping: each application class attribute → database table column
  • Each object instance → database table row ![[Pasted image 20260128145118.png|400]]

Cons

  • Can lead to complex mappings
  • History not automatically available
  • Difficult auditing
  • Object-Relational impedance mismatch

Event Sourcing | Command Methods

Again, each command generates a state change, hence a sequence of events, without modifying the current state directly. Splitting the work between the following two methods ensures the [[#Separation of Concerns (SoC)]] principle.

  • process(command): creates and returns a list of events representing the intended state changes.
    • takes the command object which contains the requests’ data (FTGO: new order details can be passed when calling OrderRevisionProposed)
    • can throw exceptions if validation fails, preventing invalid state changes
    • SoC: process() handles validation and event generation
  • apply(event): processes each generated event to update the aggregate’s state, so to update the aggregate’s fields based on the event given
    • SoC: apply() handles state update ![[Pasted image 20260128151344.png|300]]
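A minimal sketch of the process()/apply() split, assuming a simple Order aggregate with invented command and event names:

```python
class Order:
    """Event-sourced aggregate sketch: process() validates and emits events,
    apply() folds an event into the aggregate's state."""

    def __init__(self):
        self.state = None

    def process(self, command):
        # Validation happens here; no state is mutated yet
        if command["type"] == "CreateOrder":
            return [{"type": "OrderCreated"}]
        if command["type"] == "ApproveOrder":
            if self.state != "PENDING":
                raise ValueError("only pending orders can be approved")
            return [{"type": "OrderApproved"}]
        raise ValueError(f"unknown command {command['type']}")

    def apply(self, event):
        # State transitions only; no validation, so replay is always safe
        if event["type"] == "OrderCreated":
            self.state = "PENDING"
        elif event["type"] == "OrderApproved":
            self.state = "APPROVED"

def replay(events):
    """Rebuild an aggregate from its event history (the loading algorithm)."""
    order = Order()          # default constructor
    for e in events:
        order.apply(e)
    return order
```

Because apply() never validates, the full history can be replayed at load time without re-triggering business rules.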

Event Sourcing | Handling Concurrent Updates

When multiple requests try to update the same aggregate, one might overwrite another one’s changes. This anomaly can be prevented with Optimistic Locking: a version column is used to keep track of changes made since the first read of the aggregate. Each aggregate will have a version.

  • If version is still the same → no change was made by other transactions → the current transaction updates version field
  • If version was changed (by another transaction) → subsequent transaction fails to prevent overwrites

Event Sourcing | Snapshots

Some long-lived aggregates can accumulate a large number of events over time, making it inefficient to load and replay all events. Snapshots are used to periodically capture and save the current state of an aggregate, at a specific event version N.

With snapshots, the algorithm to load an aggregate changes:

  1. retrieve the most recent snapshot
  2. load only the events occurred after the snapshot
  3. apply all the events fetched in previous step to the instance, using apply() ![[Pasted image 20260128160553.png|400]]
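The snapshot-based loading algorithm can be sketched as follows (the state shape and reducer are invented for illustration):

```python
def load_aggregate(snapshot, events_after_snapshot, apply):
    """Load an aggregate from its latest snapshot plus newer events only."""
    state = dict(snapshot)  # start from the captured state, not from scratch
    for event in events_after_snapshot:
        state = apply(state, event)
    return state

def apply(state, event):
    # Hypothetical reducer: each event simply overwrites the order state
    return {**state, "state": event["state"], "version": event["version"]}

# Snapshot taken at version 100; only events 101+ need to be replayed
snapshot = {"orderId": 42, "state": "APPROVED", "version": 100}
newer = [{"state": "SHIPPED", "version": 101}]
current = load_aggregate(snapshot, newer, apply)
```

Replay cost now depends on the number of events since the last snapshot, not on the aggregate's whole lifetime.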

Event Sourcing | Idempotent Message Processing

Message brokers may deliver the same message multiple times, so it is better for the service to handle duplicates with idempotence:

  • RDBMS-based event stores → a PROCESSED_MESSAGES table is used to record message IDs, so if a message ID already exists it can be discarded
  • NoSQL-based event stores → the message ID is embedded in generated events to detect duplicates. Of course, events are not always generated, so using a pseudo event might be handy.

Event Sourcing | Event Schema Changes

During the lifetime of an event-based application, some changes could be necessary, at different levels of the architecture:

  • Schema Level: adding (backward-compatible) or renaming aggregates
  • Aggregate Level: adding (backward-compatible) or removing attributes from the aggregates
  • Event Level: adding (backward-compatible), modifying or removing fields from an event Of course, some migration operations are needed to switch from the old to the new structure.

Event Sourcing w/ Saga

Services using Event Sourcing often require SAGAs to maintain data consistency across multiple (micro)services.

  • RDBMS-based event stores → supports ACID → seamless SAGA integration
  • NoSQL-based event stores → lack of ACID → alternative approaches to implement SAGA

Event Sourcing w/ Saga | Choreography

In such scenario the flow is:

  1. Aggregates emit events when updated
  2. Event Handlers consume the events and update related aggregates, so new events are emitted

Event Sourcing w/ Saga | Orchestration

The saga orchestrator coordinates multi-step workflows across services.

Saga participants must ensure idempotence to prevent duplicate processing - by recording and checking a message ID - and must always send reply messages to the saga orchestrator. Saga commands might not change the aggregate state, in which case no event will be emitted.

The saga orchestrator is just created - a SagaOrchestratorCreated event is triggered - and updated when participants reply to it - a SagaOrchestratorUpdated event is triggered. It records message IDs as well, to ensure idempotence. Since it is an event-based scenario, the Event Store will keep all the events needed to reconstruct the orchestrator. The orchestrator will just:

  1. emit a SagaCommandEvent, containing all the data needed (channel, payload)
  2. then the service’s Event Handler will process the event and reply to the orchestrator ![[Pasted image 20260129114645.png|400]]

FTGO: ![[Pasted image 20260129113232.png|400]]

Command Query Responsibility Segregation (CQRS)

CQRS is a design that separates commands (CUD) from queries (read) to improve scalability and efficiency. Here each service has its dedicated model and data store for:

  • Command Side database: publishes domain events on data changes like create, update, delete operations
  • Query Side database: processes nontrivial queries with optimized databases or data models. It stores pre-materialized views of data, which are updated by subscribing to the domain events published by the command side

![[Pasted image 20260129130329.png|500]]

Query-Only Services

These services expose APIs for query operations only. The databases used are kept synchronized because services are subscribed to events from other services. These services are mainly used to implement:

  • Standalone Queries: (FTGO) the Available Restaurant Service maintains data about available restaurants - exposed with the findAvailableRestaurants() query - and it is subscribed to the Restaurant Service
  • Cross-Service Views: (FTGO) subscribes to events from Order, Delivery and Accounting Services and keeps a database for queries such as findOrderHistory() ![[Pasted image 20260206104513.png|400]]

A better representation can be: ![[Pasted image 20260206110000.png|400]]

  • View Database stores data in an optimized way/structure
    • The database technology must be suitable for the type of data to store and the type of queries to perform
    • SQL vs NoSQL: complex reporting vs flexible data model and stability
  • Data Access handles database logic
    • Ensure idempotent updates → handle concurrency
      • Pessimistic Locking mechanism: rows to be updated are locked
      • Optimistic Locking mechanism: allows recurrent access but data is verified before commit
    • If the layer is not idempotent → risk of corruption and inconsistent state
      • Reliable Event Tracking:
        • SQL: record event IDs in a PROCESSED_EVENTS dedicated table
        • NoSQL: record event IDs in the updated record itself
      • Optimized Event Tracking: use monotonically increasing IDs
  • Event Handlers subscribed to other services’ events and update the View Database, through Data Access layer
  • Query API exposes operations to clients

Usual problems due to event-based architecture arise again:

  • (large quantity of) events must be stored somewhere, like on [[#S3]], and processed, like by Spark
  • large quantity of events must be processed → periodically take snapshots

Handling Eventual Consistency

A client querying immediately after a command might not see its update due to message delays - the query request arrives at the server before the command finishes processing.

Event tokens can be used as a solution, where everything is based on eventId and max(eventId):

  1. the command-side returns an event ID token to the client
  2. the query-side checks whether the view has processed the event
    1. returns the query result if updated
    2. else returns an error
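The token handshake above can be sketched in a few lines (all names hypothetical): the command side hands back the ID of the event it emitted, and the query side answers only once its view has processed at least that event.

```python
# Sketch of the event-token handshake between command side and query side.
view_last_processed_event_id = 0        # updated by the view's event handler

def command_create_order():
    """Command side: perform the write, return the emitted event ID as token."""
    return {"order_id": "order-1", "event_token": 7}

def query_order(order_id, event_token):
    """Query side: serve the request only if the view caught up with the token."""
    if view_last_processed_event_id >= event_token:
        return {"order_id": order_id, "status": "CREATED"}
    raise RuntimeError("view not up to date, retry later")

result = command_create_order()
try:
    query_order(result["order_id"], result["event_token"])
    fresh = True
except RuntimeError:
    fresh = False                       # view has not processed event 7 yet

view_last_processed_event_id = 7        # the event handler catches up
answer = query_order(result["order_id"], result["event_token"])
```

With monotonically increasing event IDs, a single `>=` comparison is enough to decide whether the view is stale.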

Aggregate Only vs Event-Sourcing

The Aggregate Only scenario does not use event stacking: just one latest updated aggregate exists, with no history stored. It can be faster, but reconstruction of previous states is not possible. Each SQL row is stateful.

The Event-Sourcing scenario tracks every - immutable - event that is made on the starting aggregate and a snapshot is taken every N events. The approach is more complex, but allows some useful history-wise operations on data.

Reactive API Composition

It often happens that a single query involves multiple aggregates. The components needed are:

  • Gateway that wraps the externally called API in a Reactive Stream where multiple services’ operations are invoked
    • Services are called in an async fashion, with a normalized fallback for non-critical services: their values are left null
  • Per-Service Circuit Breaker Layer to handle failures

FTGO:

  1. Client calls API GET /orders/{id}
  2. Gateway wraps the workflow in a Reactive Stream that will emit a result
  3. Gateway spawns 4 async tasks and collects the results
  4. Gateway emits the final result
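The fan-out in steps 3-4 can be sketched with asyncio (all service names and payloads are hypothetical): the gateway awaits all calls in parallel and normalizes failed non-critical calls to null values.

```python
import asyncio

# Sketch of reactive API composition in a gateway (hypothetical services).
async def order_service(oid):      return {"status": "EN_ROUTE"}
async def kitchen_service(oid):    return {"ticket": "PREPARING"}
async def accounting_service(oid): return {"paid": True}
async def delivery_service(oid):   raise TimeoutError("delivery service down")

async def get_order_details(oid):
    tasks = [order_service(oid), kitchen_service(oid),
             accounting_service(oid), delivery_service(oid)]
    # return_exceptions=True collects failures instead of aborting the whole call
    results = await asyncio.gather(*tasks, return_exceptions=True)
    keys = ["order", "kitchen", "accounting", "delivery"]
    # Normalized fallback: failed non-critical calls become None (null)
    return {k: (None if isinstance(r, Exception) else r)
            for k, r in zip(keys, results)}

details = asyncio.run(get_order_details("order-1"))
```

Here the delivery call fails, yet the client still receives a complete response with `delivery` set to null.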

Pros:

  • The gateway becomes an orchestrator and can evolve to support more services/endpoints
  • Async composition solves the domain problem, so components are more decoupled

External API Patterns

Recalling the FTGO example, multiple types of client can be found:

  • Inside the Firewall: clients that can access high-performance LAN
    • Backend service
  • Outside the Firewall: other clients that must use a lower-performance network
    • Browser
    • Mobile
    • third-party applications

Direct Client-Service Interaction

![[Pasted image 20260206121715.png|400]] The naive approach is the one in figure above, where each client independently talks to the backend.

Cons:

  • multiple independent requests → no caching
  • developer overhead
  • lack of encapsulation
  • outside the firewall clients might not be able to use the same protocols used inside the firewall

API Gateway Pattern

![[Pasted image 20260206122631.png|400]] An additional component is used: the API Gateway. It acts as the single entry point into a micro-service based application.

Some features are:

  • Request Routing: API requests are processed by the proper service
  • API Composition (see [[#API Composition Pattern]]): combines responses from multiple services
  • Protocol Translation: translates external-friendly protocols into internal-friendly ones and vice versa, like REST ←→ gRPC
  • Security and performance: Authentication, Authorization, Rate Limiting
  • Optimization and monitoring: Caching, Auditing

![[Pasted image 20260206123649.png|400]] API Gateway can be broken into:

  • API Layer: contains independent modules for each specific type of client supported
  • Common Layer: contains independent shared functions (such as authentication, rate limiting)

API Gateway can work with:

  • Synchronous I/O: each network connection is handled by a single thread
  • Asynchronous I/O: a single thread processes all network connections

API Gateway | Ownership models

  • Centralized Ownership: the whole API gateway is managed by a dedicated team
  • Decentralized Ownership:
    • the common layer only of the gateway is managed by a dedicated team
    • the API layer is managed by a dedicated team per type of client, which handles the client as well ![[Pasted image 20260206123945.png|400]] In the centralized model, the entire gateway module is often owned by a dedicated API Gateway Team that implements APIs requested by the other teams that develop client-specific modules.

Backends for Frontends Pattern (BFF)

![[Pasted image 20260206124448.png|400]] This structure defines ownership of each module even more clearly, with enhanced scalability and reduced complexity. Each team is responsible for its client and the respective API gateway.


API Gateway | Challenges

Of course, the API Gateway might become the main bottleneck of the application. Other common API Gateway challenges are:

  • Sequential calls can lead to high latency → invoke services in parallel
  • (Some) Requests might fail → [[#Circuit Breaker Pattern]]
  • API Gateway must identify services dynamically
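The per-service circuit breaker mentioned above can be sketched as follows (class name and thresholds are hypothetical): after too many consecutive failures the breaker opens, and subsequent calls fail fast instead of waiting on the unhealthy service.

```python
# Minimal circuit breaker sketch (hypothetical names and thresholds).
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
            self.failures = 0          # a success closes the breaker again
            return result
        except Exception:
            self.failures += 1         # count consecutive failures
            raise

def flaky_service():
    raise TimeoutError("service unavailable")

breaker = CircuitBreaker(max_failures=2)
outcomes = []
for _ in range(4):
    try:
        breaker.call(flaky_service)
    except TimeoutError:
        outcomes.append("timeout")     # the real call was attempted and failed
    except RuntimeError:
        outcomes.append("fast-fail")   # the breaker short-circuited the call
```

Real implementations also add a timeout after which the breaker moves to a half-open state and retries; that is omitted here for brevity.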

API Gateway | Implementation

  • AWS API Gateway: routing, authentication
  • AWS Application Load Balancer: just routing
  • Kong
  • Traefik

API Gateway | GraphQL Implementation

The purpose is to map the server-side data as a graph of objects with fields and relationships. The client has full control over which data it retrieves from the gateway, which avoids overfetching. The standard used is GraphQL, created by Facebook; Apollo GraphQL is a popular implementation of it.

![[Pasted image 20260206131452.png|400]] The components are:

  • GraphQL Schema (left white square): defines the server-side data model and the queries supported
    • Object Types: entities (FTGO: Consumer, Order, Restaurant, DeliveryInfo)
    • Fields
    • Enums
  • Resolver Functions (right white square): maps the schema to the backend services
    • [[#API Composition Pattern]] is supported, by combining data from multiple sources
    • receives this set of parameters:
      • Object: parent (or root) object for top-level queries
      • Query Arguments
      • Context such as user info, or global data in general
  • Proxy Classes: handle interactions with the (FTGO) application

The Query Language used lets the client have the full control over the data returned by the query:

  • it is better to select just interested fields, to make the query more efficient
  • it is possible to retrieve related data from a single request, so API calls are minimized
    • FTGO: fetch a consumer, its orders, etc.

Example: query

query {
  consumer(consumerId: 1) {
    id
    firstName
    lastName
    orders {
      orderId
      restaurant {
        id
        name
      }
      deliveryInfo {
        estimatedDeliveryTime
        status
      }
    }
  }
}

Example: query with aliases

query {
	c1: consumer(consumerId: 1) { id, firstName, lastName }
	c2: consumer(consumerId: 2) { id, firstName, lastName }
}

Example: order resolver

Given the Order Service, a resolver that defines multiple queries can be created as follows.

const resolvers = {
	Query: {
		orders: resolveOrders, // Fetch orders by consumerId
		order: resolveOrder, // Fetch a single order
		consumer: resolveConsumer // Fetch a consumer
	},
	
	Order: {
		restaurant: resolveOrderRestaurant, // Resolve nested restaurant
		deliveryInfo: resolveOrderDeliveryInfo // Resolve nested delivery info
	},
	
	Consumer: {
		orders: resolveConsumerOrders // Resolve consumer.orders
	}
};

![[Pasted image 20260206155757.png|600]] As we can see from the picture above, execution starts from the top resolver, resolveConsumer(), then the nested one (resolveConsumerOrders()), and finally the lower level (resolveOrderRestaurant() and resolveOrderDeliveryInfo())

GraphQL | Mutations

A mutation is a write operation, so it can create, update or delete data. It still returns structured data.

Example: mutation

type Mutation {
	createOrder(
		consumerId: Int!
		restaurantId: Int!
		items:[OrderItemInput!]!
	): Order!
}

input OrderItemInput {
	sku: String!
	qty: Int!
}

mutation{
	createOrder(
		consumerId:1
		restaurantId:10
		items:[
			{sku:"PIZZA_MARGHERITA", qty:1}
			{sku:"COLA_CAN", qty:2}
		]
	){
		orderId
		status
		items { sku qty }
		kitchen { status }
		deliveryInfo { status estimatedDeliveryTime }
		payment { status amount }
	}
}
  1. createOrder is called
  2. the resolver performs the HTTP POST to the Order Service
  3. the resolver returns the created object into the GraphQL pipeline, so it can resolve additional info (Order.restaurant, Order.deliveryInfo, Order.payment)

Two main tools are used:

  • Strawberry: has a code-first approach for Python. The schema is generated from Python classes, and it automatically creates the GraphQL schema and explorer. Example

@strawberry.type
class Order:
	orderId: str
	status: str
	items: List[OrderItem]

  • Ariadne: has a schema-first approach for Python: the GraphQL schema is written first, then bound to resolvers. Example

type Order {
	orderId: ID
	status: String
	items: [OrderItem]
	kitchen: KitchenTicket
	delivery: Delivery
	payment: Payment
}

Of course, they have different implementations but expose the same GraphQL API, so they are interchangeable.

Big Data

The main characteristic of the big data approach is managing large-scale amounts of data that cannot be processed by just a few nodes. A big data infrastructure must provide

  • horizontal scaling based on the workload requirements:
    • from a Short-term high demand scenario where a lot of nodes perform heavy computation
    • to a Long-term steady processing scenario where few nodes perform light computation.
  • compatibility for on-site, cloud-based or hybrid use, and is tailored for the specific need.
  • flexibility: it must accept structured, semi-structured and unstructured data.
  • security by providing encryption, access control and auth
  • a storage layer
    • scalable horizontally to increase capacity and speed
    • with disaster recovery
    • data clean-up policies
  • a resource management system
    • with task prioritization mechanisms
    • dashboard
    • cost tracking
  • metadata discovery using a central repository
    • that automates datasources to crawl
    • that allows searchability and modification
  • reporting with rich visualizations
  • monitoring of the system itself
  • testing of the system
  • lifecycle management of the data and the system through phases like planning, design, development, maintenance, deprecation

Big Data | Evolution

Two decades of evolution have had a meaningful impact on networking and data storage. Definitions of Big Data have also evolved with time:

  • 1997 - Big Data used for remote and satellite sensor collection
  • 1998 - Growth in storage needs and distributed systems
  • 2004 - [[#MapReduce]] introduction

MapReduce

It is a new approach that simplifies the processing of large datasets with:

  • Map Function: generates intermediary key-value pairs
  • Reduce Function: merges values into output key-value pairs

The basic idea is to allow big data processing on a computing cluster with many computing nodes. Every node performs a part of the total computation, in order to minimize the amount of data transferred between nodes.

Example: word count

  • Map phase: iterate through the document content and emit (word, count) key-value pairs.
  • Reduce phase: group identical keys, sum their values and return (word, total_count) key-value pairs.

![[Pasted image 20260206181049.png|600]] The phases in the image above are:

  • Splitting: data is broken into M pieces
  • Mapping: pull data locally (on a node) and run the map function that will emit intermediary key-value results
  • Partitioning: result is split into R partitions so each partition can be assigned to the proper reducer
  • Shuffling and Sorting
  • Reducing: intermediary results are processed and final key-value pairs are returned

All the nodes are managed by a master node that assigns tasks and monitors the system. Tasks are idempotent, so if a node fails → the task is reassigned to the same node or to another one. If the master fails → computation is restored from the last snapshot.
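The word-count flow described above can be sketched in pure Python (the cluster machinery - splitting across nodes, partitioning, fault handling - is omitted; only the map/shuffle/reduce logic is shown):

```python
from collections import defaultdict

# Pure-Python sketch of the word-count MapReduce flow.
def map_phase(document):
    """Emit an intermediary (word, 1) pair for every word in the split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediary values by key, as the shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values to obtain the final (word, total) pairs."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["the cat sat", "the dog sat"]          # Splitting: M pieces
intermediate = []
for split in splits:                             # Mapping, one split per node
    intermediate.extend(map_phase(split))
counts = reduce_phase(shuffle(intermediate))     # Shuffling + Reducing
```

In a real deployment each `map_phase` call runs on the node that stores the split, and `shuffle` moves data across the network only once.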

Big Data | Architecture

![[Pasted image 20260207100959.png|400]]

  • Ingestion: collect data from various sources like IoT devices, web apps, logs, etc.
    • Extraction: data is being collected
    • Transformation: data gets cleaned
    • Loading: data is loaded into target system
  • Storage: data lakes and warehouses store structured and unstructured data
    • Two approaches are used:
      • Distributed Key-Value Stores → high availability
      • Columnar Database → high query performance
    • Eventual Consistency → faster writes
  • Computation: data is processed with clustering, regression, pattern recognition algorithms in online or offline fashion
  • Presentation: results are presented to stakeholders through dashboards or exposed via APIs

Example: Microservices x MapReduce #TODO

Big Data | On-Premise

On-Premise means that the organization deploys the entire Big Data platform in an internal data center. [[#Hadoop]] is generally a good choice to contain costs.

Costs:

  • High starting costs for hardware and maintenance. Old hardware can be considered to lower initial investment.
  • Potentially lower long-term costs

Big Data | In the Cloud

Pros:

  • compliance to regulations
  • backups and history managed
  • providers offer different prices w.r.t. the data access frequency needed

Cons:

  • pay for data retrieval → backup outputs

Big Data | Hybrid Solution

It combines [[#Big Data On-Premise]] cost-efficiency and [[#Big Data In the Cloud]] compliance to regulations. Still [[#Big Data Object Storage]] is used.

  • Store rarely accessed data off-premise (in the cloud)
  • Use cloud for heavy ETL*
  • SLA** guarantees by the provider

* ETL: Extract, Transform, Load
** SLA: Service Level Agreement

Used as:

  • primary data warehouse
  • shared data marts
  • long-term storage for non critical/sensitive data

Big Data | Applications

  • Querying: large amounts of data with complex ad-hoc SQL queries
  • Reporting: create interactive dashboards for business insight or third-party sharing
  • Alerting: detect anomalies using thresholds or ML

Advanced techniques can be:

  • Mining: using clustering, classification, collaborative filtering for recommender systems
  • Modeling: develop and train ML models
  • Impact: use results for decision-making systems

Big Data | Storage Patterns

There are various patterns available, one for each business case. They can also work together: data lakes usually feed data warehouses and marts after data processing.

Data Lake

It can store all types of data: structured, semi-structured and unstructured. It uses schema-on-read approach: data schema is applied only when accessing data, while data is stored raw.

Used:

  • as a sandbox
  • for ML and exploratory analysis

Data Warehouse

It uses schema-on-write approach: data schema is applied each time data is written. Data is aggregated, so a single source of truth is created for the organization. Usually, data comes from [[#Data Lake]]s.

Used for business intelligence dashboards, oriented to decision making. Can be easily integrated with [[#Big Data Object Storage]].

Pros:

  • supports standard SQL for querying massive data storage

Cons:

  • cannot scale as easily as [[#Data Lake]]
  • requires metadata and documentation

Data is stored following the Column-Oriented Storage Structure:

Names: Mary, Paulina, John, Suzanne, GZ
Ages: 21, 43, 45, 11, 15

Pros:

  • more efficient for queries on a specific column
    • more space efficient by grouping similar data types together

Cons:

  • slower updates
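The column layout above can be illustrated in a few lines of Python: a query that only needs Ages scans one compact array instead of touching every full row.

```python
# Sketch contrasting row-oriented and column-oriented layouts.
rows = [("Mary", 21), ("Paulina", 43), ("John", 45), ("Suzanne", 11), ("GZ", 15)]

# Column-oriented layout: one array per column, as in the example above.
columns = {
    "Names": [name for name, _ in rows],
    "Ages":  [age for _, age in rows],
}

# An analytical query on a single column reads only that column's data;
# in a row store it would have to scan every (name, age) record.
avg_age = sum(columns["Ages"]) / len(columns["Ages"])
```

Grouping values of the same type also compresses better, which is why columnar formats dominate analytical workloads.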

Provisioned Data Warehouse

![[Pasted image 20260210125055.png|400]]

  • Leader Node coordinates queries and data distribution
  • Compute Nodes store data and execute queries

Data is distributed across nodes and identified using keys.

Serverless Data Warehouse

![[Pasted image 20260210125130.png|400]] Distributed queries can be executed using the tree architecture where

  • Leaf Nodes fetch and process data
  • Intermediate Nodes aggregate results

The system can auto-scale if the query requires it, to speed up query processing.

Virtual Data Warehouse

![[Pasted image 20260210125153.png|400]]

  • Storage Layer
    • uses [[#Big Data Object Storage]] approach for everything, from table data to intermediary results and outputs.
  • Compute Layer
    • built with virtual warehouses, i.e. resource clusters that can be dynamically created, resized, destroyed
    • external clients interact with abstract clusters, not with concrete nodes underneath
  • Service Layer allows
    • authentication and access control
    • query optimization
    • transaction management

Pros:

  • this approach allows the application of [[#Separation of Concerns (SoC)]] principle

Data Mart

They are [[#Data Warehouse]]s dedicated to specific business areas, such as just for sales, customers, partners, etc.

Big Data | Object Storage

With this approach, data is stored as objects where:

  • data: raw content in any format (structured or not)
  • metadata: timestamps, ownership, access control, etc.
  • unique identifier used to access data

Using unique identifiers allows easier data sharing

Data Processing

Offline Processing

It’s a type of - usually bulk - data processing where user intervention is not needed and there are no strict time bounds: delays of hours or days are often acceptable. Scheduling tasks is required.

Steps in general are:

  1. clean
  2. transform
  3. aggregate
  4. consolidate

For further details, see [[#Big Data Offline Processing]].

Stream Big Data Processing

This approach processes, in real time, data coming from (e.g.) IoT devices, social media, online transactions, etc. Cleaning, filtering and categorizing steps are applied to the data without storing it, so low latency and agility are ensured.

Used for anomaly detection and real-time analysis.

Data-source applications send a continuous stream of data in fixed-size blocks. The Hadoop client buffers data locally until a data chunk can be created. Of course, data must be aggregated in small or big chunks before processing:

  • Time Window: data in a specific time range is aggregated
  • Count Window: a specific amount of data is aggregated

Then data is written (actually, sent to the NameNode, see [[#Hadoop Distributed File System (HDFS)]]).

In order to extract data from the source, Message Brokers are used, then data is processed by Stream Processing Engines.
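A count window can be sketched as a generator that buffers the stream locally and emits a chunk only when the window is full (window size and sample values are hypothetical):

```python
# Sketch of count-window aggregation over a data stream.
def count_window(stream, size):
    """Buffer records and yield a chunk every `size` records."""
    buffer = []
    for record in stream:
        buffer.append(record)
        if len(buffer) == size:      # window full: emit the chunk
            yield list(buffer)
            buffer.clear()
    if buffer:                       # trailing partial window
        yield list(buffer)

readings = [3, 7, 1, 9, 4, 2, 8]     # e.g. IoT sensor values
chunks = list(count_window(readings, size=3))
```

A time window works the same way, except the flush condition is a timestamp range rather than a count.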

Hadoop

It is a tool created in 2004, inspired by Google’s [[#MapReduce]]. It is used to process large datasets on commodity hardware, moving computation closer to where data is stored.

Hadoop 1.0

In the first version, Hadoop architecture is a tightly coupled stack:

  • Storage Layer is powered by the [[#Hadoop Distributed File System (HDFS)]]
  • Compute Layer is based on the [[#MapReduce]] approach, to allow data locality

Hadoop Distributed File System (HDFS)

It is optimized for large-scale systems with streaming data access. It scales efficiently even with commodity hardware. Access model is write-once, read-many.

HDFS 1.0 had the following architecture, with:

  • NameNode (Master)
    • manages file system namespace and metadata
    • maintains block mapping and DataNode location
  • DataNodes (Slaves)
    • store actual data blocks, with redundancy (see [[#HDFS Data Replication]])
    • serve read/write requests from clients
    • report their status to NameNode
    • reduce costs with Just a Bunch Of Disks (JBOD): storage architecture that combines multiple drives into a single logical unit, without data redundancy ![[Pasted image 20260209103033.png|400]]
HDFS | Read and Write workflow

Read algorithm

  1. client → NameNode to gather metadata
  2. NameNode → client with the locations of the file blocks scattered across the DataNodes
  3. client → DataNode opens a connection with the specific node
  4. DataNode → client data streaming, without involving NameNode

Write algorithm

  1. client → NameNode requests file creation
  2. NameNode selects DataNode targets for data block allocation
  3. client → first DataNode in the pipeline
  4. first DataNode → other DataNodes in the pipeline, to allow data replication (see [[#HDFS Data Replication]])
  5. NameNode → client returns metadata
HDFS | Metadata vs Data

Metadata is conceptually more valuable than the actual data: data blocks are meaningless without the structure given by metadata.

  • metadata represents the control plane of the file system
  • data represents the content plane

Metadata must already exist before creating/reading content.

If data is unavailable → content is lost. If metadata is unavailable → meaning is lost.

HDFS | Data Replication

Data replication is implemented such that:

  • files → split into fixed-size blocks
  • each block → multiple replicas on different DataNodes, as many as the replication factor value says.

Of course, replicas are placed across different nodes and racks so that:

  • system is resilient to rack-level failures
  • network efficiency is increased: when a client requests data, the closest node will respond

The NameNode decides the replica placement and continuously updates it.
HDFS | Handling Failures

DataNodes periodically send heartbeats and self-reports. If a DataNode becomes unavailable →

  • the DataNode is marked as failed and the data blocks that were hosted on it are considered missing
  • replication factor must be restored, so a new replica is created on a different DataNode

Data blocks are protected using checksums.

If the NameNode becomes unavailable → the entire system becomes unavailable. High availability is not supported in HDFS 1.0.

JobTracker and DataNode Workflow

![[Pasted image 20260209122915.png|350]] JobTracker is the module inside the NameNode that assigns tasks to TaskTrackers, the corresponding modules inside DataNodes that execute MapReduce tasks. In the Hadoop 1.0 architecture there is just one JobTracker, making it a potential bottleneck.

Hadoop 2.0

The newer version of Hadoop introduces a clear separation between storage and resource management. There is no longer a single master → metadata and job execution support higher availability. There is just one Active NameNode that serves requests, while the Standby NameNodes keep metadata synchronized; if the Active one fails, a Standby one is promoted. Moreover, the new version keeps track of metadata changes on shared edit logs. ![[Pasted image 20260209123357.png|500]]

  • Storage is managed by [[#Hadoop Distributed File System (HDFS)]] version 2, with support for Active and Standby NameNodes → high availability.
  • Resources are managed by [[#Yet Another Resource Negotiator (YARN)]]

In general, version 2.0 is:

  • more available
  • more scalable
  • fault tolerant at both storage and resource levels
  • multi-framework support, beyond MapReduce

Yet Another Resource Negotiator (YARN)

![[Pasted image 20260209124302.png|400]] The YARN container represents a set of bounded resources - CPU and memory - and is the fundamental unit in resource allocation. However a YARN container is NOT an OS-level or Docker container (see [[#Containers]]); it is just a logical abstraction and corresponds to a process of the system.

As can be seen above, each node can run multiple containers that are relative to a specific application. The application in YARN is a set of jobs, requested by a client.

The main components are:

  • ResourceManager (one per system): it is the global master of the whole system
    • receives jobs from clients
    • allocates resources for applications based on
      • priority
      • resource locality
      • resource requirements
      • number of containers needed
    • monitors health and capacity of the system
    • DOES NOT execute tasks
  • ApplicationMaster (one per application)
    • negotiates resources with the ResourceManager
    • runs inside a container of a node
    • monitors the execution of the application
  • NodeManager(s) (one per cluster node)
    • responsible for launching, monitoring, terminating containers
    • each container can execute just one mapper or reducer task

![[Pasted image 20260210122703.png]]

In general, tasks are executed following the Relaxed Locality approach, where data necessary to the computation can be fetched from different nodes: if the first node is busy, another can respond.

Hadoop | Base Setup and example

In an empty directory, just create:

  • docker-compose.yaml to define the services needed to run Hadoop
    • namenode: metadata storage
    • datanodes x2: data block storage
    • resourcemanager
  • config file with env variables
    • HDFS config
    • MapReduce config
    • YARN config
services:
  namenode:
    image: apache/hadoop:3
    hostname: namenode
    ports:
      - 9870:9870
    env_file:
      - ./config
    environment:
      ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
    command: ["hdfs", "namenode"]

  datanode_1:
    image: apache/hadoop:3
    command: ["hdfs", "datanode"]
    env_file:
      - ./config

  datanode_2:
    image: apache/hadoop:3
    command: ["hdfs", "datanode"]
    env_file:
      - ./config

  resourcemanager:
    image: apache/hadoop:3
    hostname: resourcemanager
    command: ["yarn", "resourcemanager"]
    ports:
      - 8088:8088
    env_file:
      - ./config

  nodemanager:
    image: apache/hadoop:3
    command: ["yarn", "nodemanager"]
    env_file:
      - ./config

Example: MapReduce for $\pi$ estimation The idea is to generate random points, some of them inside and the others outside a circle. Then, estimate: $$ \pi \approx 4 \, \frac{\text{points inside circle}}{\text{total points}} $$ To execute the example, just run:

docker compose exec -ti namenode /bin/bash

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-HADOOP_VERSION.jar pi 10 15

The command above will generate a total of $10 \times 15 = 150$ random points where:

  • $10$ is the number of mappers
  • $15$ is the number of random points generated by each mapper

Of course, the more the points, the more accurate the estimation.

Mapper:

  1. generates random points
  2. assigns 1 for points inside and 0 for points outside the circle
  3. assigns the same key to all points

Output:
Key Value
1   1 (Inside the circle)
1   0 (Outside the circle)
1   1 (Inside the circle)
1   0 (Outside the circle)
...

Reducer: aggregates all mapper outputs

  1. calculates the ratio of the points under the same key, so all of them
  2. estimates $\pi$

Output:
Key Values
1   [1, 0, 1, 0, ...]
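The mapper/reducer logic above can be sketched in pure Python (the YARN/cluster machinery is omitted; point counts and the seed are arbitrary):

```python
import random

# Pure-Python sketch of the pi-estimation MapReduce job.
def mapper(num_points, rng):
    """Emit (1, inside_flag) pairs: same key for all points."""
    pairs = []
    for _ in range(num_points):
        x, y = rng.random(), rng.random()        # point in the unit square
        inside = 1 if x * x + y * y <= 1 else 0  # inside the quarter circle?
        pairs.append((1, inside))
    return pairs

def reducer(pairs):
    """Single reducer: ratio of inside points, scaled by 4."""
    values = [v for _, v in pairs]
    return 4 * sum(values) / len(values)         # pi ≈ 4 * inside / total

rng = random.Random(0)                           # fixed seed for repeatability
pairs = []
for _ in range(10):                              # 10 "mappers" x 1000 points
    pairs.extend(mapper(1000, rng))
pi_estimate = reducer(pairs)
```

With 10,000 points the estimate lands near 3.14; more points tighten it further, exactly as the note above says.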

Big Data | Formulas

Data Storage Formula

$$ dS = c \cdot r \cdot a \cdot \frac{(1+g)^y}{(1-i)(1-b)} $$ where:

  • $dS$: total storage needed [TB or PB]
  • $c \in [0,1]$: compression ratio
  • $r \in \mathbb N$: replication factor (default 3)
  • $a \in \mathbb R$: initial data size
  • $g \in [0,1]$: yearly growth rate
  • $y \in \mathbb N$: planning period (in years)
  • $i \in [0,1]$: intermediate storage for temporary data
  • $b \in [0,1]$: buffer for unexpected spikes or peak loads
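A worked example with hypothetical inputs (100 TB raw, 50% compression, 3 replicas, 20% yearly growth over 3 years, 10% intermediate storage, 20% buffer):

```python
# Worked example of the storage formula dS = c*r*a*(1+g)^y / ((1-i)(1-b)).
c, r, a = 0.5, 3, 100   # compression ratio, replication factor, initial TB
g, y = 0.2, 3           # 20% yearly growth over a 3-year planning period
i, b = 0.1, 0.2         # 10% intermediate storage, 20% buffer

dS = c * r * a * (1 + g) ** y / ((1 - i) * (1 - b))
# 0.5 * 3 * 100 * 1.728 / 0.72 = 360 TB
```

Note how replication and growth multiply the need, while the intermediate-storage and buffer factors in the denominator inflate it further.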

DataNode Formula

$$ dN = \frac{c \cdot r \cdot a}{d \cdot n \cdot (1-i)(1-b)} $$ where:

  • $dN$: total number of DataNodes needed
  • $c \in [0,1]$: compression ratio
  • $r \in \mathbb N$: replication factor (default 3)
  • $a \in \mathbb R$: total raw data size, before hardware considerations
  • $d \in \mathbb R$: disk size per node
  • $n \in \mathbb N$: number of disks per node
  • $i \in [0,1]$: intermediate storage for temporary data
  • $b \in [0,1]$: buffer for unexpected spikes or peak loads
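A worked example with hypothetical hardware (1000 TB raw, 50% compression, 3 replicas, nodes with 12 disks of 8 TB each, 10% intermediate storage, 20% buffer), rounding up since nodes are whole:

```python
import math

# Worked example of the DataNode formula dN = c*r*a / (d*n*(1-i)(1-b)).
c, r, a = 0.5, 3, 1000   # compression ratio, replication factor, raw TB
d, n = 8, 12             # 12 disks of 8 TB per node
i, b = 0.1, 0.2          # 10% intermediate storage, 20% buffer

dN = math.ceil(c * r * a / (d * n * (1 - i) * (1 - b)))
# 1500 / 69.12 ≈ 21.7 → 22 DataNodes
```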

Big Data Offline Processing

Apache Hive

Hive is a SQL-based data warehouse layer that sits on top of [[#Hadoop Distributed File System (HDFS)]]: the user writes queries in HiveQL and Hive compiles them for distributed execution (MapReduce/Apache Tez/Spark). It basically allows the user to query HDFS-based storage where:

  • Query → Hive
  • Resources → YARN
  • Storage level → Hadoop (so HDFS)

Features:

  • Hive provides the user a logical schema over all the files stored, so queries are easier to create
  • works with managed and external tables
  • supports various drivers to work with the Hadoop ecosystem

![[Pasted image 20260210183904.png|400]]

  • Hive Clients
    • CLI
    • Thrift server for cross language communication
    • JDBC/ODBC drivers
  • Hive Driver
    • Compiler: queries → parsed tree, semantic analysis
    • Execution Engine: logical plans → execution-ready workflows
  • Metastore: stores tables, partitions, database information. Ensures a relation between data to be queried and actual directories
    • File-based storage → basic use
    • DB-based storage → scalable

Apache Hive | Data Model

Main components are:

  • Tables: correspond to directories in HDFS with one or more files. Hive does not modify data; it just stores schema and metadata.
  • Partitions: correspond to subdirectories in HDFS. Filtering on partition columns → skipping subdirectories
  • Buckets: each bucket corresponds to a physical file inside the Partition and has an assigned hash. Bucketing is performed at load time.

![[Pasted image 20260210182658.png|400]]

Apache Hive | Data Format

Both row and columnar formats are supported:

  • Row-Based Format ← choose for frequent writes, real time updates
    • TEXTFILE: plain text, human readable
      • inefficient
    • SEQUENCEFILE: binary with compression
      • suitable for ETL workflows
    • AVRO: supports schema evolution
      • balanced
  • Columnar-Based Format ← choose for large-scale analytics, compression
    • ORC (Optimized Row Columnar): high compression
      • fast for analytical queries
    • PARQUET: schema-based
      • efficient for nested data

Apache Hive | Example

  • Input data is stored as CSV, on HDFS
  • Tables defined using schema-on-read
  • Execution Engine set to MapReduce
  1. HiveQL query is submitted via JDBC client
  2. Hive Driver parses the query → consults Metastore for metadata → generates an execution plan
  3. Execution plan → one or more MapReduce stages

Apache Spark

It is a distributed computing framework for big-data processing. It interfaces directly with various data stores, such as [[#Hadoop Distributed File System (HDFS)]], [[#Apache Hive]], Cassandra, [[#S3]].

  • Supports various APIs for Java, Python, etc.
  • Integrates with [[#Yet Another Resource Negotiator (YARN)]], Mesos, etc.

Apache Spark | Architecture

It follows a Master-Slave Architecture ![[Pasted image 20260210192334.png|300]]

  • Driver:
    • manages and distributes tasks execution
    • creates executors
  • Executors
    • provide resources to execute tasks
    • report status and results
  • Cluster Manager
    • Two schedulers available to choose from
      • Standalone Scheduler: default scheduler when Hadoop is not installed
      • YARN: default scheduler for Hadoop, optimized for MapReduce
    • mediates resource allocation
    • manages the state of physical resources
  • Spark UI (not in picture): provides a web interface to monitor execution

Apache Spark | Operation Modes

  • Local Mode: runs both driver and executors on a single JVM on the client. Used for quick development
  • Client Mode: runs driver on a single JVM and executors on the cluster Used when local data configuration is needed
  • Cluster Mode: runs both driver and executors on the cluster. Used in production

Apache Spark | Commands

  • spark-sql: write queries for structured datasets
  • spark-submit: submit applications written in Java, Scala, Python, R

Examples

  • Local Mode: ./bin/spark-submit --master local[8] /examples.jar 100
  • Standalone Cluster: ./bin/spark-submit --master spark://host:7077 --executor-memory 20G /examples.jar
  • YARN Cluster Mode: ./bin/spark-submit --master yarn --deploy-mode cluster --num-executors 50 /examples.jar

SparkContext

It is the core execution engine of Spark and responsible for connecting to the cluster and coordinating resources. Owned by the Driver and manages communication between Driver and Cluster (see [[#Apache Spark Architecture]]).

SparkSession

It is the main entry point for a Spark application and provides a unified interface to interact with Spark. Internally, it manages a [[#SparkContext]] that can be accessed through the SparkSession itself. Applications should create a SparkSession instead of creating a SparkContext directly.

Resilient Distributed Dataset (RDD)

It is Spark’s core data abstraction: a read-only collection of objects that is broken into partitions to enable parallel processing by independent tasks.

An RDD contains just a description of how its data is derived, not the data itself:

  • Lineage information: points to parent RDD and tracks all the transformation made on it
  • Computation logic: function to compute each partition from its parents
  • Partition metadata: number of partitions and their indices
  • Preferred location: hints about data locality used to schedule tasks

Some Spark terminology:

  • A new collection is created every time a transformation is applied.
  • Transformations are lazy and define just how to process data.
  • A job is created every time an action (code function) is invoked and is the logical unit of work
  • Running a job means executing a Directed Acyclic Graph (DAG): it represents the logical execution plan.
    • DAG components
      • nodes are the transformations
        • narrow transformations → each output partition depends on at most one input partition, so data exchange across the cluster is NOT required ![[Pasted image 20260211125051.png|500]] In the scenario in the image, both A and B options are equivalent
        • wide transformations → an output partition depends on multiple input partitions, so data exchange across the cluster IS required ![[Pasted image 20260211125931.png|500]]
      • edges are the dependencies
    • DAG is broken into smaller execution units, called stages. A stage is a set of tasks (see below) that can be executed in parallel, across the cluster, without shuffling data → stage boundaries are placed at wide transformations (see above)
  • before actually executing actions and producing results, Spark first lazily loads all the transformations, to guarantee an optimized execution
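As a minimal illustration (plain Python, not Spark internals), stage boundaries can be derived from a linear chain of transformations by cutting at each wide (shuffle) transformation. This is a simplification: in real Spark the map-side of a shuffle still runs in the earlier stage.

```python
# Sketch: split a chain of transformations into stages.
# A new stage starts at every wide (shuffle) transformation.
def split_into_stages(ops):
    """ops: list of (name, is_wide) tuples in execution order."""
    stages, current = [], []
    for name, is_wide in ops:
        if is_wide and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

pipeline = [("filter", False), ("map", False), ("reduceByKey", True)]
print(split_into_stages(pipeline))  # [['filter', 'map'], ['reduceByKey']]
```

Narrow transformations such as filter and map are pipelined into the same stage; the wide reduceByKey opens a new one.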

If a partition gets lost → it is recomputed using lineage: the history of transformations.

Features:

  • Resilient because of the lineage mechanism (the recorded history of transformations)
  • Immutable: transformations always create a new RDD, without ever changing the original one
  • Partitioned: data is split into logical partitions
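The lineage-based recovery idea can be sketched in plain Python (hypothetical class and names, not the Spark API): each RDD stores only its parent and the function that derives it, so a lost result can always be rebuilt by walking the lineage.

```python
# Sketch: lineage-based recomputation (illustrative only, not the Spark API).
class LineageRDD:
    def __init__(self, parent, fn):
        self.parent = parent      # parent RDD (None for a source)
        self.fn = fn              # how to derive this RDD from its parent
        self._cache = None        # materialized data, may be "lost"

    def compute(self):
        if self._cache is None:   # lost or never materialized: walk the lineage
            source = self.parent.compute() if self.parent else []
            self._cache = self.fn(source)
        return self._cache

source = LineageRDD(None, lambda _: list(range(1, 10)))
evens = LineageRDD(source, lambda data: [x for x in data if x % 2 == 0])

print(evens.compute())   # [2, 4, 6, 8]
evens._cache = None      # simulate losing the partition...
print(evens.compute())   # ...recomputed from the lineage: [2, 4, 6, 8]
```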

Cons:

  • code can be difficult to understand
  • spark cannot further optimize operations inside lambdas
  • slower for Python/R w.r.t. Java version

Create an RDD

Define a RDD from numbers

numberRDD = spark.sparkContext.parallelize(range(1, 10))
numberRDD.collect() # Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Transform one RDD with filtering using a lambda function: a small anonymous function with just one expression, often used for short-lived functionality

evenNumberRDD = numberRDD.filter(lambda num: num % 2 == 0)
evenNumberRDD.collect() # Output: [2, 4, 6, 8]

Define a RDD from a textfile

logRDD = spark.sparkContext.textFile("/FileStore/tables/sampleFile.log")

Transform by filtering INFO and ERROR lines, then map each line to a (log_level, 1) key-value pair and reduce to (log_level, count) pairs

resultRDD = logRDD\
	.filter(lambda line: line.split(" - ")[2] in ["INFO", "ERROR"])\
	.map(lambda line: (line.split(" - ")[2], 1))\
	.reduceByKey(lambda x, y: x + y)

Here the pipeline chains 3 transformations:

  • filter
  • map
  • reduceByKey

Only reduceByKey is wide: filter and map are pipelined into a single stage, and the shuffle introduced by reduceByKey starts a second stage (see the stage definition above).

Export results

resultRDD.collect() # Output: [('INFO', 2), ('ERROR', 3)]

Here the result is a Pair RDD because of the key-value structure.
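For intuition, the same pipeline can be reproduced in plain Python with `functools.reduce` (the sample log lines below are invented, but follow the same `" - "`-separated format with the level in the third field):

```python
from functools import reduce

# Plain-Python equivalent of the filter/map/reduceByKey pipeline above.
lines = [
    "2026-02-11 - app - INFO - started",
    "2026-02-11 - app - ERROR - boom",
    "2026-02-11 - app - DEBUG - noise",
    "2026-02-11 - app - ERROR - boom again",
]

filtered = [l for l in lines if l.split(" - ")[2] in ["INFO", "ERROR"]]
pairs = [(l.split(" - ")[2], 1) for l in filtered]

def reduce_by_key(acc, pair):
    key, value = pair
    acc[key] = acc.get(key, 0) + value
    return acc

result = reduce(reduce_by_key, pairs, {})
print(result)  # {'INFO': 1, 'ERROR': 2}
```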

Apache Spark | Monitoring

Spark provides a built in UI for monitoring on default port 4040. It can be handy to monitor, optimize performance and manage Spark resources.

![[Pasted image 20260211122254.png|600]]

  • Jobs: displays DAG, stages, tasks
  • Stages: detailed information on the stage
  • Storage: show partitions
  • Environment: Spark config
  • Executors: resource usage
  • SQL: insights into SQL queries

Apache Spark | Functions

When calling a function in Spark, it can be:

  • transformation: an already existing RDD is transformed and the resulting RDD is outputted
    • narrow (see [[#Resilient Distributed Dataset (RDD)]])
    • wide (see [[#Resilient Distributed Dataset (RDD)]])
  • action: produces a result or writes data

Apache Spark | Data Selection Actions

  • take(N): returns first N elements of RDD
  • first(): returns first element of RDD

Apache Spark | Sorting Actions

  • top(N): returns the N largest elements of the RDD, in descending order

Apache Spark | Writing and Aggregating Actions

  • saveAsTextFile(dir)
  • reduce(lambda): combines the elements of the RDD into a single aggregated result using the lambda passed
  • foreach(lambda): executes lambda for each element of RDD

Apache Spark | Narrow Transformations

  • map(lambda): applies lambda to each element of RDD, returning the RDD with respective outputs
  • flatMap(lambda): applies map (with lambda passed) and flattening
  • filter(lambda): keeps only the elements that satisfy lambda
  • RDD1.union(RDD2): combines two RDDs

Apache Spark | Wide Transformations

  • distinct(): removes duplicates and returns a RDD with unique elements
  • sortBy(lambda): sorts a RDD
  • RDD1.intersection(RDD2): finds common elements between RDDs
  • RDD1.subtract(RDD2): removes RDD2 elements from RDD1

Apache Spark | Grouping Transformations

  • groupByKey(): just groups elements with same key (no aggregation)
  • reduceByKey(lambda): aggregates values with same key using lambda
    • better than using groupByKey(): uses local aggregation so data is initially reduced inside their own node and then shuffled across the cluster
  • sortByKey(): sorts key-value elements w.r.t. the key
  • RDD1.join(RDD2): joins two RDDs on their keys
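The advantage of reduceByKey over groupByKey can be sketched in plain Python (the "nodes" below are hypothetical): each node pre-aggregates its own pairs locally, so only the partial counts cross the network before the final merge.

```python
# Sketch: why reduceByKey shuffles less than groupByKey.
# Each "node" pre-aggregates its own pairs before anything crosses the network.
def local_combine(partition, fn):
    out = {}
    for key, value in partition:
        out[key] = fn(out[key], value) if key in out else value
    return out

node_a = [("INFO", 1), ("ERROR", 1), ("INFO", 1)]
node_b = [("ERROR", 1), ("ERROR", 1)]

add = lambda x, y: x + y
combined_a = local_combine(node_a, add)   # {'INFO': 2, 'ERROR': 1}
combined_b = local_combine(node_b, add)   # {'ERROR': 2}

# Only the per-node partial counts are "shuffled" and merged:
final = local_combine(list(combined_a.items()) + list(combined_b.items()), add)
print(final)  # {'INFO': 2, 'ERROR': 3}
```

With groupByKey, every raw (key, 1) pair would be shuffled instead of the two small partial dictionaries.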

Apache Spark | Caching Actions

Caching enables storing frequently used RDDs in memory or on disk to improve performance.

There are several storage levels for RDD caching:

  • MEMORY_ONLY: deserialized data in memory
  • MEMORY_ONLY_SER: serialized data in memory
  • MEMORY_AND_DISK: deserialized data in memory, spilled to disk when it does not fit
  • MEMORY_AND_DISK_SER: serialized data in memory, spilled to disk when it does not fit
  • DISK_ONLY: data on disk only
  • OFF_HEAP: serialized RDD stored in off-heap memory

Actions:

  • cache(): cache RDD
  • unpersist(): clear cached RDD

Apache Spark | Partitioning Actions

Partitions define the degree of parallelism in Spark.

  • repartition(N): creates N partitions from the starting RDD, shuffling data across the nodes
  • coalesce(N): reduces the number of partitions to N, avoiding a full shuffle where possible (cheaper than repartition when shrinking); note that the slides give an incorrect definition
  • partitionBy(N): partitions a pair RDD into N partitions based on the key (hash partitioning by default)
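A rough sketch of how key-based partitioning assigns records (plain Python, mimicking the default hash partitioner, not the Spark implementation):

```python
# Sketch: hash partitioning, as done by partitionBy(N) on a pair RDD.
def partition_by(pairs, n):
    partitions = [[] for _ in range(n)]
    for key, value in pairs:
        partitions[hash(key) % n].append((key, value))  # same key -> same partition
    return partitions

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_by(pairs, 2)

# All records sharing a key always land in the same partition:
for part in parts:
    print(part)
```

Because records with equal keys always map to the same partition, key-based operations like reduceByKey can run per-partition after this step.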

Apache Spark | SQL on DataFrames Actions

  • createOrReplaceTempView(name): registers the DataFrame as a temporary view called name, so it can be queried with spark.sql()
  • createGlobalTempView(name): registers a global view that persists across Spark sessions within the same application (accessible through the global_temp database)

Apache Spark | Variable Types

Apache Spark | Shared Variables

Used to enable - limited - data sharing across tasks. Two types are available in Spark:

  • Broadcast Variables: provide read-only large data sharing across nodes
    • cached on each node that uses them
    • avoid redundant data transfer
    • unpersist() for reuse
    • destroy() for permanent deletion
  • Accumulators: provide an easy way to aggregate results from various tasks, back to the driver. Supports only additive operations

Apache Spark | DataFrames

They are distributed collections of data organized into rows and columns. A DataFrame is an abstraction on top of RDDs, used for efficient processing of structured data. Sizes can range from KBs to PBs.

A DataFrame can be generated from:

  • files
  • storage media
  • RDDs

Common actions:

  • printSchema(): displays the schema of a DataFrame in a tree format
  • select(*cols): selects some specific columns from a DataFrame
  • filter(condition): filters rows w.r.t. a column condition
  • groupBy(*cols): groups rows by the given columns

Example: load data from file

sales_df = spark.read.option("sep", "\t").option("header", "true") \
	.csv("hdfs:///sample_data/sales-data-sample.csv")

where:

  • option("sep", "\t") specifies the delimiter for a tab-separated file
  • option("header", "true") specifies that the file contains a header row

Example: save data to Parquet. Parquet is a columnar storage format that preserves the schema.

# Save as Parquet format
sales_df.write.parquet("sales.parquet")

# Read from Parquet and query
parquetSales = spark.read.parquet("sales.parquet")
parquetSales.createOrReplaceTempView("parquetSales")
ip = spark.sql("SELECT ip FROM parquetSales WHERE id BETWEEN 10 AND 19")
ip.show()

Example: working with JSONs

# Save as JSON format
sales_df.write.json("sales.json")

# Read from JSON and query
jsonSales = spark.read.json("sales.json")
jsonSales.createOrReplaceTempView("jsonSales")
ip = spark.sql("SELECT ip FROM jsonSales WHERE id BETWEEN 10 AND 19")
ip.show()

Apache Spark | Datasets

Datasets are strongly-typed collections of objects, suitable for relational and functional transformations. Available in Scala and Java only.

Available functions are:

  • transformations
    • map()
    • flatMap()
    • filter()
    • select()
    • aggregate()
  • actions
    • count()
    • show()
    • save()

Example: load data from CSV into dataset

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
	.appName("Dataset Example")
	.getOrCreate()

// Load dataset from a CSV file
val salesDS = spark.read
	.option("sep", "\t")
	.option("header", "true")
	.csv("hdfs:///sample_data/sales-data-sample.csv")
	.as[SalesRecord] // Convert DataFrame to Dataset using a case class

// Show the content of the Dataset
salesDS.show()

// Define the case class for strong typing
case class SalesRecord(product: String, quantity: Int, price: Double)

Of course, schema mismatches can occur. To fix that Spark provides some strategies:

  • PERMISSIVE: invalid fields are replaced with nulls
  • DROPMALFORMED: invalid records - with invalid fields - are dropped
  • FAILFAST: invalid records trigger process abortion
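The three modes can be mimicked in plain Python to see the difference (hypothetical two-column parser, not Spark internals):

```python
# Sketch: how PERMISSIVE / DROPMALFORMED / FAILFAST treat a malformed record.
def parse(records, mode):
    out = []
    for rec in records:
        try:
            product, quantity = rec.split(",")
            out.append((product, int(quantity)))
        except ValueError:
            if mode == "PERMISSIVE":
                out.append((rec, None))   # keep the record, null the bad field
            elif mode == "DROPMALFORMED":
                continue                  # silently drop the record
            elif mode == "FAILFAST":
                raise                     # abort the whole job

    return out

rows = ["pen,3", "desk,not-a-number"]
print(parse(rows, "PERMISSIVE"))      # [('pen', 3), ('desk,not-a-number', None)]
print(parse(rows, "DROPMALFORMED"))   # [('pen', 3)]
```

Calling `parse(rows, "FAILFAST")` raises on the malformed record, mirroring process abortion.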

Example: deal with schema mismatches

val salesDS = spark.read
	.option("sep", "\t")
	.option("header", "true")
	.option("mode", "PERMISSIVE") // Set the schema mismatch mode
	.csv("hdfs:///sample_data/sales-data-sample.csv")
	.as[SalesRecord]

salesDS.show()

Apache Spark | Full example

Create the example.py file:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.appName('Sales Application').getOrCreate()
sales_df = spark.read.option("header", "true").csv("/user/data/sales.csv")

# NOTE: this line was truncated in the notes; the column name and aggregation below are a guess
result = sales_df.groupBy("COUNTRY_CODE").count().orderBy(desc("count"))
result.show()

Then run the command:

spark-submit \
	--master yarn \
	--deploy-mode client \
	--num-executors 3 \
	--executor-memory 2g \
	--total-executor-cores 1 \
	example.py

Localstack

LocalStack is a fully functional local cloud environment that emulates AWS services. Useful for local testing and development.

It emulates a variety of AWS services, can be used with provisioning tools (Terraform, AWS CDK, Pulumi) and runs inside a container.

Docker

Platform for developing, shipping and running apps in isolated containers: a container packages an application with its dependencies, ensuring it runs consistently across environments.

  • Dockerfile: a script that contains directives on how to build a Docker Image (see [[#Dockerfile]])
  • Docker Image: snapshot of an application and its dependencies (see [[#Docker Image]])
  • Docker Container: a running instance of a Docker Image

[!example] Manage containers

  • docker run
  • docker ps
  • docker stop
  • docker rm
  • docker exec -it [container-id] bash

Dockerfile

Directives to build images are broken into the following components:

  • base image to use
  • commands to run at startup
  • files to copy
  • software to install
  • env variables
  • ports to expose

Everything is translated to the keywords:

[!example]

  • FROM <image>
  • WORKDIR <path>
  • COPY <host-path> <image-path>
  • RUN <cmd>
  • ENV <key> <value>
  • EXPOSE <port-number>
  • USER <username-or-uid>
  • CMD ["<cmd>", "<arg1>", ...]
# Use the official Ubuntu 18.04 base image from the ECR Public Gallery
FROM public.ecr.aws/docker/library/ubuntu:18.04

# Install dependencies and Apache
RUN apt-get update && \
    apt-get -y install apache2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Write the Hello World message
RUN echo 'Hello World!' > /var/www/html/index.html

# Configure the startup script for Apache
RUN echo '. /etc/apache2/envvars' > /root/run_apache.sh && \
    echo 'mkdir -p /var/run/apache2' >> /root/run_apache.sh && \
    echo 'mkdir -p /var/lock/apache2' >> /root/run_apache.sh && \
    echo '/usr/sbin/apache2 -D FOREGROUND' >> /root/run_apache.sh && \
    ln -sf /proc/self/fd/1 /var/log/apache2/access.log && \
    chmod 755 /root/run_apache.sh

# Expose port 80
EXPOSE 80

# Launch the script when the container starts
CMD ["/bin/bash", "/root/run_apache.sh"]

Once the Dockerfile is defined, the image can be built with:

[!example] Build Docker image

docker build -t localstack-ecr-image .

Docker Image

Once an image is built, it can be pushed to a Docker Registry.

When pushing to a self-hosted registry, the image name must follow the pattern:

[registry-host]:[registry-port]/[repository-name]:[tag-name]

Before pushing, the image must be tagged using the command:

docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]

The examples below show how to tag and push a Docker Image to a LocalStack Docker Registry (useful for [[#ECR]]).

[!example] Push Docker Image to Registry

  • docker tag localstack-ecr-image 000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/localstack-ecr-repository to tag image
  • docker push 000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/localstack-ecr-repository to push image

Docker Compose

It simplifies the management of multi-container Docker applications. The whole infrastructure is defined in the docker-compose.yml file and the main commands are:

[!example] Manage composable

  • docker compose up
  • docker compose down

Example of compose

version: "3"

services:
  web:
    image: nginx
    ports:
      - "80:80"
  db:
    image: mysql
    environment:
      MYSQL_ROOT_PASSWORD: example

LocalStack Web Application

Used to monitor the LocalStack instance and all the services connected to it.

AWS CLI Local

It’s just a thin wrapper around the AWS CLI, used to interact with LocalStack.

S3

The Amazon Simple Storage Service (S3) provides a scalable and secure object storage service.

[!example] Manage S3 buckets

  • awslocal s3api list-buckets
  • awslocal s3 mb s3://NAME*
  • awslocal s3 cp PATH_TO_FILE s3://NAME **
  • awslocal s3api list-objects --bucket NAME
  • awslocal s3api get-object --bucket NAME --key FILE_NAME PATH_TO_DOWNLOAD
  • awslocal s3api delete-object --bucket NAME --key FILE_NAME
  • awslocal s3api delete-bucket --bucket NAME

* mb → make bucket
** cp → like the Linux command to copy files

There is a tiny difference between s3 and s3api:

  • s3: it’s designed to feel like working with local files, so a single command can use multiple API calls
  • s3api: every command maps almost 1:1 to AWS S3 API calls, so more low level than s3 command

EC2

Elastic Compute Cloud (EC2) is an AWS service that provides on-demand, scalable computing capacity. EC2 instances act as virtual servers that users can manage.

EC2 instances access is secured by a key pair:

  • public key: stored by Amazon EC2
  • private key: stored by the user

Warning

In order to allow EC2 to create new containers on the host, "/var/run/docker.sock:/var/run/docker.sock" must be mapped in the volumes.

[!example] Manage SSH keys

  • ssh-keygen -t rsa -b 2048 *
  • awslocal ec2 import-key-pair --key-name KEY_NAME --public-key-material fileb://PATH_TO_PUBKEY **

Both commands differ slightly from the ones in the slides: * AWS supports 2048-bit RSA key pairs ** fileb:// is a workaround for file://

Before creating an EC2 instance, a Virtual Private Cloud (VPC) must be set up.

  • logically isolated section of the AWS cloud:
  • defines a pool of IPs and allows subnets (6 created by default)
  • each resource runs inside a subnet and must be attached to at least one security group: defines its network rules
    • subnets are tied to a specific Availability Zone (AZ)
      • in reality → they give geographic redundancy
      • in localstack → they are simulated
    • a security group acts as a virtual firewall, controlling inbound and outbound traffic

[!example] Manage VPC and Subnets

  • awslocal ec2 describe-vpcs to list all VPCs and gather VPC_ID
  • awslocal ec2 describe-availability-zones
  • awslocal ec2 describe-subnets --filters "Name=vpc-id,Values=VPC_ID" to list subnets in a vpc with id equals to VPC_ID

[!example] Manage VPC Security Group

  • awslocal ec2 authorize-security-group-ingress --group-id default --protocol tcp --port 8000 --cidr 0.0.0.0/0 to authorize incoming tcp traffic on port 8000 from any source (defined with Classless Inter-Domain Routing)
  • awslocal ec2 describe-security-groups --filters "Name=vpc-id,Values=VPC_ID" to fetch the security groups in a specific VPC

Now an EC2 instance can be created and a generic script user_script.sh can be fed to it.

[!example] Manage EC2 instance

  • awslocal ec2 run-instances --image-id ami-df5de72bdb3b --count 1 --instance-type t3.nano --key-name my-key --security-group-ids SECURITY_GROUP_ID --user-data file://./user_script.sh
  • ssh -p 22 -i ~/.ssh/id_rsa root@127.0.0.1

where:

  • --instance-type t3.nano determines the hardware used; t3 means general purpose
  • --image-id ami-df5de72bdb3b is given externally by Amazon AWS or the organization and is the template for the OS to boot on the EC2 instance. Must be compatible with instance type selected

Moreover, the EC2 instance can be monitored via the Resource Browser by accessing Instance→System Status in the LocalStack web app.

EC2 VM Managers

EC2 machines can be emulated with two different methods:

  • Mock/CRUD
  • Pro

This configuration can be changed from the EC2_VM_MANAGER option:

  • docker→ Docker containers are created (see [[#EC2 VM Managers Docker]])
  • libvirt→ real virtual machines are created
  • mock→ no real virtualization, stores all the resources as in-memory representation

EC2 VM Managers | Docker

Docker engine is used to emulate EC2 instances, so Docker socket must be mounted to use it. EC2 instances emulated will be subject to Docker containers limitations.

  • Containers created will be named as localstack-ec2.INSTANCE_ID.
  • User data (like the Python script used before)
    • can be found in the container at /var/lib/cloud/instances/INSTANCE_ID
    • can be sent via CLI with the UserData or ModifyInstanceAttribute variables
  • Logs can be found in the container at path /var/log/cloud-init-output.log.
  • Network Address is contained in PublicIpAddress

[!example] A summary of emulated actions (APIs) that can be made with EC2 instances is:

  • CreateImage: capture snapshot of a running instance into a new AMI (uses docker commit under the hood)
  • DescribeImages: list of AMIs
  • DescribeInstances, where docker instances are marked with ec2_vm_manager: docker
  • RunInstances (used before)
  • StopInstances → pause
  • StartInstances → resume
  • TerminateInstances → stop

Warning

Actions can be written into:

  • PascalCase: keywords used when interacting with AWS instance via REST APIs (using classic AWS)
    https://ec2.amazonaws.com/?Action=ModifyInstanceAttribute &InstanceId=i-1234567890abcdef0 &InstanceType.Value=m1.small &AUTHPARAMS
    
  • kebab-case: keywords used when interacting with AWS instance via CLI (LocalStack)
EC2 Docker Manager | Elastic Block Store

Such EC2 instances can be used as an Elastic Block Store (EBS) block device:

  • EBS Volumes
  • EBS Snapshots

On LocalStack the volume size unit used is MiB (GiB in AWS).

#TODO Example: create an ext3 file system on the block device

ECR

The Amazon Elastic Container Registry (ECR) is a fully managed service for storing and managing Docker images. It can integrate with Lambda, Elastic Container Service ([[#ECS]]) and Elastic Kubernetes Service (EKS).

The service is private by default and lives in a specific AWS region. On LocalStack, everything is emulated and local.

See [[#Docker]] chapter to see how Docker images can be created and managed.

[!example] Manage ECR

  • awslocal ecr create-repository --repository-name localstack-ecr-repo --image-scanning-configuration scanOnPush=true to create a repository where newly pushed images are scanned for vulnerabilities

ECR instances can be monitored from the LocalStack web app under Status→ECR section.

ECS

The Amazon Elastic Container Service (ECS) is a fully managed service to orchestrate Docker images.

Features:

  • Logical Separation for containerized apps; useful to implement various stages - dev/test/prod - of the same app
  • Resource Management and Organization
  • Run multiple copies (replicas) of the same task
  • Scale apps for traffic spikes
  • Load balance traffic with Elastic Load Balancing (ELB)
  • pay-as-you-go

Containers orchestration can be done using two approaches:

  • AWS Fargate: serverless and infrastructure-free; the user just defines the container task and Fargate provisions compute on demand
    • focus on the application
    • auto-scaling
    • no idle cost
    • ideal for event driven apps or for spiky workloads
  • [[#EC2]]: full control; a cluster of EC2 instances is created
    • manual instance type and resources
    • auto-scale or manual scale
    • pay for EC2 instance 24/7
    • ideal for predictable workload
    • ECS Agent runs inside each EC2 instance and is responsible for task management and reporting

![[Pasted image 20260214111520.png|400]]

[!example] Create ECS

  • awslocal ecs create-cluster --cluster-name my-cluster

Now, a task can be defined in a JSON file:

{
  "containerDefinitions": [
    {
      "name": "server",
      "cpu": 10,
      "memory": 10,
      "image": "000000000000.dkr.ecr.us-east-1.localhost.localstack.cloud:4566/localstack-ecr-repository:latest",
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp",
          "appProtocol": "http"
        }
      ],
      "essential": true,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "myloggroup",
          "awslogs-stream-prefix": "myprefix",
          "awslogs-region": "us-east-1"
        }
      }
    }
  ],
  "family": "my-family",
  "requiresCompatibilities": [
    "FARGATE"
  ]
}

where:

  • family is a custom identifier, used to name different versions of the same task. If two tasks of the same family are registered, the version number increases

[!example] Create ECS task

  • awslocal ecs register-task-definition --cli-input-json file://task_definition.json to create a task definition (it is NOT running yet) and the associated logs

[!example] Run ECS task

  • awslocal ecs create-service --service-name my-service --cluster my-cluster --task-definition my-family --launch-type FARGATE --desired-count 1 --network-configuration "awsvpcConfiguration={subnets=[\"<SUBNET-ID>\"],securityGroups=[\"SG-ID\"],assignPublicIp=\"ENABLED\"}" to create a task
  • awslocal ecs update-service --service my-service --cluster my-cluster --desired-count 2 to update a service configuration

where:

  • to get SUBNET_ID see [[#EC2]]
  • to get SG_ID see [[#EC2]]

[!example] Manage ECS logs

  • awslocal logs filter-log-events --log-group-name my-log-group to access all logs
  • awslocal logs filter-log-events --log-group-name my-log-group | select -first 20 to show only the first 20 events (select is PowerShell; use head -n 20 on Linux)
  • awslocal logs filter-log-events --log-group-name my-log-group --query "events[].message" to show only the log messages

API Gateway

It is an AWS managed service to create, deploy and manage REST/HTTP/WebSocket APIs. Can integrate with [[#AWS Lambda]], [[#EC2]], Cognito, etc.

Features:

  • RESTful APIs creation based on HTTP and stateless
  • main front door for
    • workloads running EC2
    • [[#AWS Lambda]]
    • web apps
    • real-time apps

AWS Lambda

It allows running code without worrying about infrastructure.

  • Functions (code) run only when needed, automatically
  • pay-as-you-go
  • multiple runtimes supported

Example: API Gateway to route REST APIs to Lambda functions

  1. start LocalStack
  2. create Lambda function, zip and upload it
   'use strict'

const apiHandler = (event, context, callback) => {
    const name = event.queryStringParameters.name;
    const surname = event.queryStringParameters.surname;

    callback(null, {
        statusCode: 200,
        body: JSON.stringify({
            message: `Hello from Lambda: ${name} ${surname}`
        }),
    });
};

module.exports = {
    apiHandler,
};

where:

  • event (object) contains information from the invoker
  • context (object) contains details about the invocation, function and the execution environment
  • callback (function) can be used in non-async handlers to send a response

[!example] Upload Lambda function

 awslocal lambda create-function --function-name apigw-lambda \
--runtime nodejs16.x --handler lambda.apiHandler --memory-size 128 \
--zip-file fileb://function.zip --role arn:aws:iam::111111111111:role/apigw

where:

  • --handler is the name of the method within our code that Lambda calls to run our function (see source code above)
  3. create the REST API and the associated resource

    [!example] Manage REST API Create the REST API

awslocal apigateway create-rest-api --name 'API Gateway Lambda integration'

Retrieve all APIs

awslocal apigateway get-resources --rest-api-id <REST_API_ID>

where the REST_API_ID is given from the successful execution of the creation command.

Create the resource

awslocal apigateway create-resource \
--rest-api-id <REST_API_ID> \
--parent-id <PARENT_ID> \
--path-part "hello"

Associate GET method to the resource

awslocal apigateway put-method \
--rest-api-id <REST_API_ID> \
--resource-id <RESOURCE_ID> \
--http-method GET \
--request-parameters file://parameters.json \
--authorization-type "NONE"

where parameters.json is:

{
	"httpMethod":"GET",
	"authorizationType":"NONE",
	"apiKeyRequired":false,
	"requestParameters":{
		"method.request.querystring.name":true,
		"method.request.querystring.surname":true
	}
}

here requestParameters can have different values of method.request.{location}.{name} where {location}:

  • querystring
  • path
  • header
  4. connect to Lambda

[!example] Connect resource to lambda

awslocal apigateway put-integration \
--rest-api-id <REST_API_ID> --resource-id <RESOURCE_ID> \
--http-method GET --type AWS_PROXY \
--integration-http-method POST \
--uri arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/arn:aws:lambda:us-east-1:000000000000:function:apigw-lambda/invocations \
--passthrough-behavior WHEN_NO_MATCH

where:

  • --type AWS_PROXY means that the API Gateway will pass the request and return the response given with no transformations
  • --passthrough-behavior WHEN_NO_MATCH means that when there is no Content-Type matching, the API Gateway will forward the request anyway
  • --uri used to interact with backend. It has a specific structure:
    • arn:aws:apigateway:{region}:lambda
      • ARN prefix
      • region used
      • AWS service used (lambda)
    • path/2015-03-31 version of the gateway, active since that date
    • functions/{lambda_arn}
    • invocations just a suffix
  • --integration-http-method POST because while the API gateway is exposing a GET, it interacts with backend using a POST
  5. Deploy and test

    [!example] Deploy the API

awslocal apigateway create-deployment \
--rest-api-id <REST_API_ID> \
--stage-name test

where:

  • --stage-name can be dev/prod/test

    [!example] Test the API

curl "http://localhost:4566/restapis/cor3o5oeci/test/_user_request_/hello?name=Mario&surname=Rossi"

where the URL must be quoted (an unquoted & would background the shell command) and the API endpoint follows the pattern: http://localhost:4566/restapis/{REST_API_ID}/{STAGE}/_user_request_/{PATH}?{KEY-VALUE-PAIRS}.
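For reference, the endpoint can be assembled programmatically from its components (plain Python; the identifier values are the ones used in the example above):

```python
# Build the LocalStack user-request URL from its components.
def api_url(rest_api_id, stage, path, params):
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return (f"http://localhost:4566/restapis/{rest_api_id}"
            f"/{stage}/_user_request_/{path}?{query}")

url = api_url("cor3o5oeci", "test", "hello", {"name": "Mario", "surname": "Rossi"})
print(url)
# http://localhost:4566/restapis/cor3o5oeci/test/_user_request_/hello?name=Mario&surname=Rossi
```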

The expected output is:

{
	"message":"Hello from Lambda: Mario Rossi"
}

#TODO Exercise 1 #TODO Exercise 2

Step Functions

It is a serverless workflow engine for orchestrating multiple AWS services. It is programmed using Amazon States Language (ASL).

Can be managed from Resource Browser as well.

[!example] Create a state machine

awslocal stepfunctions create-state-machine \
--name "CreateAndListBuckets" \
--definition file://definition.json \
--role-arn "arn:aws:iam::000000000000:role/stepfunctions-role"

where:

  • definition.json represents the state machine itself and it is written into ASL
{
  "Comment": "Create bucket and list buckets",
  "StartAt": "CreateBucket",
  "States": {
    "CreateBucket": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:createBucket",
      "Parameters": {
        "Bucket": "new-sfn-bucket"
      },
      "Next": "ListBuckets"
    },
    "ListBuckets": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:listBuckets",
      "End": true
    }
  }
}

where at each state it is checked whether it is an End State; if it is not, the execution continues to the next state.

  • States, StartAt are mandatory
  • Comment, Version, TimeoutSeconds are optional

[!example] [ASL] Create Bucket

"CreateBucket":{
	"Type":"Task",
	"Resource":"arn:aws:states:::aws-sdk:s3:createBucket",
	"Parameters":{"Bucket":"new-sfn-bucket"},
	"Next":"ListBuckets"
}

[!example] Manage state machine Start an execution

awslocal stepfunctions start-execution \
--state-machine-arn "arn:aws:states:us-east-1:000000000000:stateMachine:CreateAndListBuckets"

Monitor the state machine

awslocal stepfunctions describe-execution --execution-arn "<YOUR-EXECUTION-ARN>"

Example: branched execution ![[Screenshot 2026-02-16 alle 08.04.34.png|500]] The idea is to create three independent lambdas and connect them through the state machine's branching logic.

  1. Create the lambdas source code, zip it and upload to LocalStack (see [[#AWS Lambda]])
  2. Create the workflow definition, create state machine and upload (see commands above)
{
  "Comment": "Branched state machine example",
  "StartAt": "FirstState",
  "States": {
    "FirstState": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:000000000000:function:generator"
      },
      "OutputPath": "$.Payload",
      "Next": "ChoiceState"
    },
    "ChoiceState": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.number",
          "NumericEquals": 0,
          "Next": "EvenState"
        },
        {
          "Variable": "$.number",
          "NumericEquals": 1,
          "Next": "OddState"
        }
      ],
      "Default": "DefaultState"
    },
    "EvenState": {
      "Type": "Pass",
      "Result": "The number was zero (even).",
      "End": true
    },
    "OddState": {
      "Type": "Pass",
      "Result": "The number was one (odd).",
      "End": true
    },
    "DefaultState": {
      "Type": "Fail",
      "Error": "InvalidInput",
      "Cause": "The number was neither 0 nor 1."
    }
  }
}

[!example] Branching logic

...
"ChoiceState": {
	"Type": "Choice",
    "Choices": [
        {
          "Variable": "$.number",
          "NumericEquals": 0,
          "Next": "EvenState"
        },
        {
          "Variable": "$.number",
          "NumericEquals": 1,
          "Next": "OddState"
        }
      ],
      "Default": "DefaultState"
    },
...

where:

  • ChoiceState adds branching

Example: parallel execution ![[Screenshot 2026-02-16 alle 08.05.19.png|500]] The procedure is the same:

  1. Create the lambdas source code, zip it and upload to LocalStack (see [[#AWS Lambda]])
  2. Create the workflow definition, create state machine and upload (see commands above)

[!example] Parallel logic

...
"Parallel State": {
	"Type": "Parallel",
	"Next": "combine",
	"Branches": [
		{
			"StartAt": "lambda1",
			"States": {...}
		},
		{
			"StartAt": "lambda2",
			"States": {...}
		}
	]
}

where the state machine waits until every branch has terminated its execution before moving on to the Next state.
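This wait-for-all behaviour can be mimicked locally with a thread pool (an analogy, not the Step Functions runtime): the results list is only handed to the next state once both hypothetical branch lambdas have returned.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical branch lambdas, both fed the same input
def lambda1(data):
    return {"branch": 1, "value": data * 2}

def lambda2(data):
    return {"branch": 2, "value": data + 10}

def parallel_state(data):
    # Branches run concurrently; results are collected in branch order,
    # and the "combine" step only runs after ALL branches have finished.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, data) for f in (lambda1, lambda2)]
        results = [f.result() for f in futures]  # blocks until each branch is done
    return results  # this list is what the Next state ("combine") receives

print(parallel_state(5))  # [{'branch': 1, 'value': 10}, {'branch': 2, 'value': 15}]
```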

Example: input/output execution ![[Screenshot 2026-02-16 alle 08.05.51.png|300]] where:

  • InputPath: filters the incoming input and…

    [!example] InputPath

"InputPath":"$.customer"
  • Parameters: …transforms it before passing it to the Task

    [!example] Parameters

"Parameters":{
	"orderId.$":"$.orderId",
	"timestamp.$":"$$.State.EnteredTime",
	"fixedValue":"StaticData"
}
  • ResultSelector: transforms the Task output
  • ResultPath: merges the result with the original state and…

    [!example] ResultPath

"ResultPath":"$.shippingDetails"
  • OutputPath: …is passed to the NextState

    [!example] OutputPath

"OutputPath":"$.shippingDetails"

If a filter is not defined, the entire input passes through it unchanged. For a complete execution example, see the slides.

Keep in mind that:

  • $ refers to the State Input, so the data being passed from the previous state into the current one.
  • $$ refers to the Context Object, so the metadata about the execution itself that isn't part of the payload.
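Putting the filters together, the data flow for a single Task can be traced on a plain dict — a simplified local model handling only the single-key paths used above (the shipping-details Task output is hypothetical):

```python
def apply_task_io(state_input):
    """Simplified trace of InputPath -> Parameters -> Task -> ResultPath -> OutputPath."""
    # "InputPath": "$.customer" keeps only that subtree of the state input
    data = state_input["customer"]

    # "Parameters" builds the task payload; a ".$" key suffix marks dynamic values
    payload = {"name": data["name"], "fixedValue": "StaticData"}

    # Hypothetical Task: pretend a Lambda computes shipping details
    result = {"carrier": "DHL", "for": payload["name"]}

    # "ResultPath": "$.shippingDetails" merges the result into the ORIGINAL input
    merged = dict(state_input)
    merged["shippingDetails"] = result

    # "OutputPath": "$.shippingDetails" passes only that subtree to the next state
    return merged["shippingDetails"]

print(apply_task_io({"customer": {"name": "Ada"}, "orderId": 42}))
# {'carrier': 'DHL', 'for': 'Ada'}
```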

SQS

The Amazon Simple Queue Service (SQS) enables communication based on message queues between two decoupled applications.

Features:

  • stores messages sent by the producer until the consumer is ready to process them, in order of arrival
  • Types of queues:
    • Standard Queue (default): high throughput, with at-least-once delivery, which can lead to duplicated or out-of-order messages
    • FIFO Queue: strict ordering and exactly-once delivery
  • Visibility Timeout: when a message is consumed, if it isn't deleted within the visibility timeout, it becomes visible again, allowing another consumer to pick it up
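The visibility-timeout behaviour can be modelled with a small in-memory toy (not the SQS API): a received message stays hidden for the timeout window and becomes receivable again if it was never deleted.

```python
import time

class ToyQueue:
    """In-memory sketch of the SQS visibility timeout (a toy, not the real API)."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # body -> timestamp until which it is invisible

    def send(self, body):
        self.messages[body] = 0.0  # visible immediately

    def receive(self):
        now = time.monotonic()
        for body, invisible_until in self.messages.items():
            if now >= invisible_until:
                # hide the message for the duration of the timeout
                self.messages[body] = now + self.visibility_timeout
                return body
        return None  # everything is in flight (or the queue is empty)

    def delete(self, body):
        self.messages.pop(body, None)

q = ToyQueue(visibility_timeout=0.1)
q.send("Hello World")
print(q.receive())   # Hello World
print(q.receive())   # None: message is in flight, hidden from other consumers
time.sleep(0.15)     # the consumer never deleted it...
print(q.receive())   # Hello World -> redelivered after the timeout
```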

[!example] Manage queues

  • Create a queue
awslocal sqs create-queue --queue-name my-queue

The QueueUrl is returned

  • List queues
awslocal sqs list-queues
  • Get queue attributes*
awslocal sqs get-queue-attributes --attribute-names All \
--queue-url  http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/my-queue
  • Send message to queue
awslocal sqs send-message --message-body "Hello World" \
--queue-url http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/my-queue
  • Receive message from queue
awslocal sqs receive-message \
--queue-url http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/my-queue
  • Delete message from queue
awslocal sqs delete-message --receipt-handle <RECEIPT_HANDLE> \
--queue-url http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/my-queue

where RECEIPT_HANDLE is returned when receiving a message

  • Purge the queue (delete all)
    awslocal sqs purge-queue \
    --queue-url http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/my-queue
    

* Queue attributes can be:

  • ApproximateNumberOfMessages returns the number of messages available
  • MaximumMessageSize
  • MessageRetentionPeriod
  • VisibilityTimeout

Example: Integrate SQS with [[#AWS Lambda]]:

[!example] Event Source Mapping

awslocal lambda create-event-source-mapping \
--function-name my-function \
--event-source-arn <QUEUE_ARN>

In the lambda, the queue is accessible by:

exports.lambda_handler = async (event) => {
  for (const record of event.Records) {
    await processMessageAsync(record);
  }
  
  return {
    statusCode: 200,
    body: JSON.stringify({ message: "Processing complete" })
  };
};

async function processMessageAsync(record) {
  try {
    console.log(`Processed message ${record.body}`);
    
    // Placeholder for actual async work
    await Promise.resolve(1); 
  } catch (err) {
    console.error("An error occurred:", err.message);
    throw err;
  }
}
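When SQS invokes the function, the event argument wraps messages in a Records array; a trimmed sketch of the shape the handler above iterates (values are illustrative, and real records carry additional fields such as attributes and eventSourceARN):

```json
{
  "Records": [
    {
      "messageId": "example-id-1",
      "receiptHandle": "example-receipt-handle",
      "body": "Hello World",
      "eventSource": "aws:sqs"
    }
  ]
}
```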

SNS

The Amazon Simple Notification Service (SNS) is a serverless messaging service that can distribute a message to multiple subscribers.

[!example] Manage SNS

  • Create a topic
awslocal sns create-topic --name my-topic

TopicArn is returned

  • List all topics
awslocal sns list-topics
  • Set topic attributes
awslocal sns set-topic-attributes \
--topic-arn <TOPIC_ARN> \
--attribute-name DisplayName \
--attribute-value MyTopicDisplayName
  • Get topic attributes
awslocal sns get-topic-attributes \
--topic-arn <TOPIC_ARN>
  • Publish message to the topic
awslocal sns publish --message file://message.txt \
--topic-arn <TOPIC_ARN>

messageId is returned

[[#SNS]] + [[#SQS]]

  1. create a queue and a topic
  2. subscribe the queue to the topic

    [!example]

    • Subscribe queue to topic
awslocal sns subscribe --topic-arn <TOPIC_ARN> \
--protocol sqs --notification-endpoint <QUEUE_ARN>

SubscriptionArn is returned

  • List subscriptions
awslocal sns list-subscriptions
  • Unsubscribe
awslocal sns unsubscribe \
--subscription-arn <SUBSCRIPTION_ARN>
  3. send a message to the queue, via the topic, with sns publish
  4. check that the message has arrived with sqs receive-message ![[Pasted image 20260216101702.png|300]] This represents the fanout scenario, where a message published to an SNS topic is delivered to multiple endpoints.
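The fan-out wiring can be modelled in memory (toy objects, not the real SNS/SQS APIs): every queue subscribed to the topic receives its own copy of each published message.

```python
class Topic:
    """Toy SNS topic: delivers a copy of each message to every subscriber."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        for queue in self.subscribers:
            queue.append(message)  # each endpoint receives its own copy

topic = Topic()
orders, audit = [], []   # two "SQS queues" as plain lists
topic.subscribe(orders)
topic.subscribe(audit)
topic.publish("OrderCreated#1")

print(orders)  # ['OrderCreated#1']
print(audit)   # ['OrderCreated#1']
```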

CloudFormation

It is an Infrastructure as Code service by Amazon, driven through JSON or YAML template files.

Features:

  • the declarative template specifies resources, configurations, dependencies, etc. and the service automates all the implementation process
  • once the template is executed (across multiple AWS accounts or regions) → becomes a stack that can be managed
  • preview of changes before actually modifying the stack
  • offers rollback and self-healing
  • repeatable and scalable

[!example] Create a stack template

{
  "Resources": {
    "LocalBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": "cfn-quickstart-bucket"
      }
    }
  }
}
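Because the template is plain JSON, it can be sanity-checked locally before deploying — for instance, listing each logical ID with its resource type (a quick inspection sketch, not a CloudFormation validator):

```python
import json

# The quickstart template from above, as a string
template = """
{
  "Resources": {
    "LocalBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": { "BucketName": "cfn-quickstart-bucket" }
    }
  }
}
"""

stack = json.loads(template)
for logical_id, resource in stack["Resources"].items():
    print(logical_id, "->", resource["Type"])
# LocalBucket -> AWS::S3::Bucket
```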

[!example] Manage stack templates

  • Deploy a stack
awslocal cloudformation deploy --stack-name my-stack \
--template-file "./stack.json"
  • Delete a stack
awslocal cloudformation delete-stack --stack-name my-stack

CDK

It is another Infrastructure as Code service by Amazon, whose purpose is to enable AWS infrastructure development in general-purpose programming languages. The user writes the code for a custom AWS infrastructure, deploys it, and [[#CloudFormation]] allocates the necessary resources to execute the project. Supported languages are: TS/JS, Python, Java, .NET, Go.

The core building blocks are the constructs:

  • L1 constructs: low-level, direct mapping to AWS [[#CloudFormation]]
  • L2 constructs: provide some abstractions for common patterns
  • L3 constructs: high-level, provides entire solutions

Features:

  • it provides higher abstraction on top of [[#CloudFormation]]
  • modular code
  • ensures AWS best-practices

First, the tool must be installed separately:

[!example] Install CDK CLI and check version

npm install -g aws-cdk-local aws-cdk
cdklocal --version

[!example] Init a CDK project

  • Initialize and deploy
cdklocal init sample-app --language=javascript
  • Bootstrap the project: essential to connect it to the AWS instance because it prepares all required resources
cdklocal bootstrap

If the project is meant to run on multiple regions or accounts, it must be bootstrapped for each one of them.

cdklocal bootstrap aws://<ACCOUNT_ID>/<REGION>
  • Preview the deployment
cdklocal synth
  • Deployment to AWS
cdklocal deploy
  • Destroy deployment
cdklocal destroy

The initialized project follows this structure:

my-cdk-app/
├── bin/
│   └── my-cdk-app.js       # Entry point for the CDK application
├── lib/
│   └── my-cdk-app-stack.js # Main stack definition for AWS resources
├── cdk.json                # CDK app configuration
├── package.json            # Project dependencies and scripts
├── README.md               # Project documentation
└── .gitignore              # Files to ignore in git
  • bin/ will be the entrypoint for the app

    [!example] my-cdk-app.js

#!/usr/bin/env node

const cdk = require('aws-cdk-lib');
const { QueueSubscriptionStack } = require('../lib/queue_subscription-stack');

const app = new cdk.App();

// Ensure there is a space between 'new' and your Stack class name
new QueueSubscriptionStack(app, 'QueueSubscriptionStack', {
  /*
   * If you don't specify 'env', this stack will be environment-agnostic.
   * Account/Region details can be passed here if needed.
   */
});
  • lib/ holds the core business logic, organized in .js files with classes that extend cdk.Stack

    [!example] my-cdk-app-stack.js

const cdk = require('aws-cdk-lib');
const sqs = require('aws-cdk-lib/aws-sqs');

class QueueSubscriptionStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    // Define the SQS Queue
    const queue = new sqs.Queue(this, 'QueueSubscriptionQueue', {
      queueName: 'my-queue',
      visibilityTimeout: cdk.Duration.seconds(300)
    });

    // ... additional resources (SNS Subscriptions, etc.) go here
  }
}

module.exports = { QueueSubscriptionStack };
  • cdk.json defines how to run the CDK app with additional configuration
  • package.json lists the node dependencies, scripts and metadata

Example: CDK + [[#AWS Lambda]]

#TODO

DynamoDB

It is a fully managed key-value NoSQL database provided by AWS. It allows:

  • scalability
  • encryption
  • replication
  • auto-scaling
  • backup

[!example] Manage DynamoDB instance

  • Create instance
awslocal dynamodb create-table --table-name my-table \
--key-schema AttributeName=id,KeyType=HASH \
--attribute-definitions AttributeName=id,AttributeType=S \
--billing-mode PAY_PER_REQUEST --region us-east-1
  • List tables
awslocal dynamodb list-tables
  • Insert items
awslocal dynamodb put-item --table-name my-table \
--item file://table-item.json
  • Query num of items
awslocal dynamodb describe-table --table-name my-table \
--query 'Table.ItemCount'
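The put-item command above reads its item from table-item.json, which must use DynamoDB's typed attribute-value format. A possible file for the table defined above — the S-typed id matches the key schema, while title and content are illustrative extra attributes:

```json
{
  "id": {"S": "001"},
  "title": {"S": "My first note"},
  "content": {"S": "Hello DynamoDB"}
}
```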

AWS | SAGA Pattern

In AWS both [[#Orchestration-based Coordination]] and [[#Choreography-based Coordination]] of SAGA can be implemented with a combination of [[#Step Functions]] and [[#AWS Lambda]].

![[Pasted image 20260216125940.png|500]]

As said in the [[#SAGA Pattern]] chapter, each compensatable operation must have its respective compensating transaction (in red).

[!example] Compensating Transactions in Step Functions

"Catch": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "ResultPath": "$.reserveFlightError",
    "Next": "CancelFlightReservation"
  }
]

where:

  • Catch attaches error handlers to a Task state
  • ErrorEquals lists the error names to match ("States.ALL" matches any error)
  • "ResultPath": "$.reserveFlightError" specifies where the error information is stored in the state

Example: Create [[#DynamoDB]] table with [[#CDK]]

#TODO

EventBridge

It is a serverless event bus service by Amazon that enables building event-driven applications by routing events from various sources like [[#AWS Lambda]], [[#Step Functions]], [[#SQS]] queues.

Applications:

  • automating event-driven microservices or workflows (e.g.: trigger a [[#AWS Lambda]] on [[#S3]] object upload)
  • SaaS integration with third-party apps

EventBridge can work with a variety of rule types:

  • Scheduled Rules: trigger rules at regular time intervals
    • Examples
      • rate(5 minutes)→ every 5 minutes
      • cron(0 12 * * ? *)→ every day at 12:00
  • Event Pattern Rules: trigger rules when certain event occurs on an AWS component
    • Example
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["PutObject"]
  }
}
  • Custom Event Rules: trigger rules when a certain event occurs on a microservice developed by the user
  • SaaS Integration Rules: trigger rules when certain event occurs on a third-party software
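An event matches a pattern rule when, for each field in the pattern, the event's value is among the listed alternatives. A toy matcher (flat and one-level-nested string fields only, not the full EventBridge grammar):

```python
def matches(pattern, event):
    """Toy EventBridge matcher: every pattern field must list the event's value.
    Handles flat string fields and nested blocks such as "detail"."""
    for key, allowed in pattern.items():
        if isinstance(allowed, dict):                 # nested block, e.g. "detail"
            if not matches(allowed, event.get(key, {})):
                return False
        elif event.get(key) not in allowed:           # list of alternatives
            return False
    return True

pattern = {"source": ["aws.s3"], "detail": {"eventName": ["PutObject"]}}
print(matches(pattern, {"source": "aws.s3", "detail": {"eventName": "PutObject"}}))     # True
print(matches(pattern, {"source": "aws.s3", "detail": {"eventName": "DeleteObject"}}))  # False
```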

Rules can trigger actions on a variety of targets:

  • [[#AWS Lambda]]
  • [[#SNS]] topics
  • [[#SQS]] queues
  • [[#Step Functions]]

[!example] Manage EventBridge instances

  • Create a rule
awslocal events put-rule --name my-scheduled-rule \
--schedule-expression "rate(2 minutes)"

to run the rule every 2 minutes

  • Grant EventBridge permission to invoke the lambda function (required before adding it as a target)
awslocal lambda add-permission --function-name my-events \
--statement-id my-scheduled-rule \
--action "lambda:InvokeFunction" \
--principal events.amazonaws.com --source-arn <SCHEDULED_RULE_ARN>
  • Add lambda as target
awslocal events put-targets --rule my-scheduled-rule \
--targets file://targets.json

where the targets.json contains all the lambdas’ ARNs.

[!example] Check logs to test

awslocal logs describe-log-groups
awslocal logs describe-log-groups --log-group-name-prefix /aws/lambda/my-events
localstack logs

Example: create an EventBridge rule using CDK #TODO

[!example] Manage custom event bus

AWS | Microservices

As seen for [[#AWS SAGA Pattern]], implementing Microservices on AWS means combining [[#AWS Lambda]] and [[#API Gateway]].

Implement the [[#API Gateway Pattern]]

  • API GW acts as an internal proxy
  • API GW as single external endpoint with extra features such as authentication

Example: single API GW ![[Screenshot 2026-02-16 alle 16.03.23.png|500]] Example: multiple API GWs ![[Screenshot 2026-02-16 alle 16.03.39.png|500]]

Implement the Decouple Messaging Pattern

As seen in the [[#Asynchronous Messaging]] communication. ![[Screenshot 2026-02-16 alle 16.06.29.png|500]] However, business actions are still synchronous.

Implement the Publish/Subscribe Pattern

  • Approach with [[#SNS]], where each subscriber receives a copy of the message and processes it. ![[Screenshot 2026-02-16 alle 16.06.51.png|500]]
  • Approach with [[#EventBridge]] (bus) where a different tailored message is sent to every target. ![[Screenshot 2026-02-16 alle 16.07.58.png|500]]

AppSync

It is an AWS service for creating serverless GraphQL APIs to query databases, microservices, APIs, etc.

[!example] Query a DynamoDB table

  • First, get an API id to interact with a given table
awslocal appsync create-graphql-api --name NotesApi \
--authentication-type API_KEY
  • Then, create an API key
awslocal appsync create-api-key --api-id <API_ID>
  • Define a GraphQL schema to further query the database
awslocal appsync start-schema-creation --api-id <API_ID> \
--definition file://schema.graphql

see below for the GraphQL schema.

  • Create a Data Source for the DynamoDB table. The Data Source is an AWS abstraction used to interact with a table, necessary because it provides AppSync with a connector and an associated IAM Role.
awslocal appsync create-data-source --name AppSyncDB --api-id <API_ID> \
--type AMAZON_DYNAMODB \
--dynamodb-config tableName=DynamoDBNotesTable,awsRegion=us-east-1
  • Now create a Resolver (see [[#API Gateway GraphQL Implementation]]) to query the DB
awslocal appsync create-resolver --api-id <API_ID> --type-name Query \
--field-name allNotes --data-source-name AppSyncDB \
--request-mapping-template file://request-template.vtl \
--response-mapping-template file://response-template.vtl

[!example] schema.graphql

type Note {
	NoteId: ID!
	title: String
	content: String
}
type PaginatedNotes {
	notes: [Note!]!
}
type Query {
	allNotes(limit: Int): PaginatedNotes!
	getNote(NoteId: ID!): Note
}
type Mutation {
	saveNote(NoteId: ID!, title: String!, content: String!): Note
	deleteNote(NoteId: ID!): Note
}
schema {
	query: Query
	mutation: Mutation
}

#TODO format better

[!example] request-template.vtl

#set($limit = $util.defaultIfNullOrEmpty($context.arguments.limit, 10))
{
	"version": "2018-05-29",
	"operation": "Scan",
	"limit": $limit
}

[!example] response-template.vtl

#if($context.result && $context.result.errorMessage)
	$util.error($context.result.errorMessage, $context.result.errorType)
#else
{
	"notes": $util.toJson($context.result.items)
}
#end

Example: query via curl

Just create an HTTP POST request via curl, adding the x-api-key header with the API key as value, and with body:

{
	"query": "query { allNotes { notes { NoteId title content } } }"
}

Glue

It is a serverless data integration service by AWS used to implement ETL pipelines. The process follows these steps:

  • Extract: data is fetched (e.g. from [[#S3]]) in raw format
  • Transform: data gets cleaned, optimized and enriched (e.g. with Glue PySpark jobs)
  • Load: data is loaded into another database (e.g. [[#S3]] or RDS table)

The entire Glue job is encoded in the etl-script-bucket.py:

[!example] etl-script-bucket.py

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col

# INIT
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'raw_data_bucket'])
raw_data_bucket = args['raw_data_bucket']
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# EXTRACT
input_data_path = f"s3://{raw_data_bucket}/raw-data/"
print(f"Reading data from: {input_data_path}")
datasource0 = glueContext.create_dynamic_frame.from_options(
	connection_type="s3",
	connection_options={"paths": [input_data_path]},
	format="csv",
	format_options={"withHeader": True, "separator": ";"},
)

# TRANSFORM
data = datasource0.toDF()
transformed_data = data.filter(
	~(
		col("id").isNull() | (col("id") == "") |
		col("name").isNull() | (col("name") == "") |
		col("age").isNull() | (col("age") == "")
	)
)

# LOAD
dynamic_frame_transformed = DynamicFrame.fromDF(transformed_data, glueContext, "transformedDF")
glueContext.write_dynamic_frame.from_options(
	frame=dynamic_frame_transformed,
	connection_type="s3",
	connection_options={"path": f"s3://{raw_data_bucket}/transformed-data/"},
	format="csv",
	format_options={"withHeader": True, "separator": ";"}
)

[!example] Manage glue jobs

  • Create a job
awslocal glue create-job --name MyGlueETLJob \
--role <ROLE_ARN> \
--command file://command-bucket.json \
--default-arguments file://arguments.json \
--max-retries 2 --timeout 10
  • Start job
awslocal glue start-job-run --job-name MyGlueETLJob

where:

[!example] command-bucket.json

{
	"Name":"glueetl",
	"ScriptLocation":"s3://my-raw-data-bucket/scripts/etl-script-bucket.py",
	"PythonVersion":"3"
}

[!example] arguments.json

{
	"--job-bookmark-option":"job-bookmark-enable",
	"--raw_data_bucket":"my-raw-data-bucket"
}

When running the start-job-run command, the following happens under the hood:

  • a Spark environment is started, and Spark jobs are set up
  • ETL script is run
  • CloudWatch is available to monitor execution