Merge Lambda Managed Instance feature branch #947

litianningdatadog · 2025-12-01T16:34:01Z

https://datadoghq.atlassian.net/browse/SVLS-8080

Overview

Merge Lambda Managed Instance feature branch

Testing

Covered by individual commits

* update `ReportMetrics` to be an enum to allow `Elevator` metrics; this gives us a diff to check in the other components
* set metrics for lifecycle given the type of report
* only send log for `OnDemand` metrics
* send correct enhanced metrics given the type of report
* add doc comments
* fmt
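The enum shape described above can be illustrated with a minimal sketch. Only the names `ReportMetrics`, `OnDemand`, and `Elevator` come from the commit message; the payload structs, their fields, and the metric names below are assumptions made for illustration, not the extension's actual types.

```rust
/// Hypothetical payload for an on-demand REPORT; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct OnDemandReport {
    pub duration_ms: f64,
    pub billed_duration_ms: u64,
    pub max_memory_used_mb: u64,
}

/// Hypothetical payload for an elevator-mode report; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct ElevatorReport {
    pub duration_ms: f64,
}

/// One variant per report type, so callers must handle both explicitly.
#[derive(Debug, Clone, PartialEq)]
pub enum ReportMetrics {
    OnDemand(OnDemandReport),
    Elevator(ElevatorReport),
}

impl ReportMetrics {
    /// Duration is present on both variants in this sketch.
    pub fn duration_ms(&self) -> f64 {
        match self {
            ReportMetrics::OnDemand(r) => r.duration_ms,
            ReportMetrics::Elevator(r) => r.duration_ms,
        }
    }

    /// Mirrors "only send log for `OnDemand` metrics".
    pub fn should_send_log(&self) -> bool {
        matches!(self, ReportMetrics::OnDemand(_))
    }

    /// Assumed metric names, chosen by report type.
    pub fn enhanced_metric_names(&self) -> Vec<&'static str> {
        match self {
            ReportMetrics::OnDemand(_) => {
                vec!["duration", "billed_duration", "max_memory_used"]
            }
            ReportMetrics::Elevator(_) => vec!["duration"],
        }
    }
}
```

Matching on the enum instead of checking optional fields means a new report type produces compile errors wherever it is not yet handled, which is one plausible reading of "a diff to check in the other components".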

…ce) mode support with stats generation

https://datadoghq.atlassian.net/browse/SVLS-7584

Implement comprehensive LMI mode support for concurrent Lambda invocations:

* Add background periodic flusher for continuous data collection in LMI mode
* Implement PlatformReport event handling with proper stats generation
* Add LMI mode REPORT log formatting with status, duration, and error details
* Integrate StatsGenerator and StatsConcentratorService throughout event pipeline
* Add missing stats_generator field to SendingTraceProcessor for both PlatformReport and PlatformRuntimeDone events

Architecture improvements:

* Remove InvocationProcessorService wrapper, use Arc<TokioMutex> directly
* Simplify event handling by passing stats_concentrator to all event handlers
* Add #[must_use] attribute to Listener::new() for better API safety

https://datadoghq.atlassian.net/browse/SVLS-7836?atlOrigin=eyJpIjoiMWNmZTMzOGE4NGEwNDE4MTk5Njk0N2ZmMmU3MzExMjgiLCJwIjoiaiJ9

The extension neither creates SnapStart spans nor emits SnapStart metrics. This PR adds both.

When a Lambda function with SnapStart enabled is invoked for the first time, we get `Platform.RestoreStart` and `Platform.RestoreReport`. These effectively take the place of the `Platform.InitStart` and `Platform.InitReport` events, so our code flow is nearly identical to how we handle the cold start span and duration metric.

Note: when a SnapStart instance is restored, we also receive the `Platform.InitStart` and `Platform.InitReport` events in addition to `Platform.RestoreStart` and `Platform.RestoreReport`. However, these `Init` events do not come from the sandbox starting for that invoke; they are generated when the snapshot is created. This is very misleading. You can see that this [trace](https://ddserverless.datadoghq.com/serverless/aws/lambda?fromUser=false&graphType=flamegraph&group=&highlight=snapstart-java-cdk-function&panel_end=1761860524106&panel_paused=false&panel_start=1761846124106&shouldShowLegend=true&sp=%5B%7B%22p%22%3A%7B%22entityId%22%3A%22aws-lambda-functions%2Bsnapstart-java-cdk-function%2Bus-east-1%2B425362996713%22%7D%2C%22i%22%3A%22lambda-panel%22%7D%2C%7B%22p%22%3A%7B%22traceID%22%3A%225400520227836710313%22%2C%22selectedSpanID%22%3A%22644948261311059067%22%7D%2C%22i%22%3A%22trace-panel%22%7D%5D&spanID=644948261311059067&text_search=snapstart&traceID=5400520227836710313&traceQuery=&start=1761845683104&end=1761860083104&paused=false) is more than 3 hours long. The Lambda was invoked more than 3 hours after the snapshot version was created. (This is the current experience.)

I deployed my own extension with the changes and confirmed we are now getting a restore span and not an init span: [link](https://ddserverless.datadoghq.com/serverless/aws/lambda?fromUser=false&graphType=flamegraph&group=&panel_end=1761860640000&panel_paused=false&panel_start=1761846240000&shouldShowLegend=true&sp=%5B%7B%22p%22%3A%7B%22entityId%22%3A%22aws-lambda-functions%2Bsnapstart-java-function%2Bus-east-1%2B425362996713%22%7D%2C%22i%22%3A%22lambda-panel%22%7D%2C%7B%22p%22%3A%7B%22traceID%22%3A%226634828896084800457%22%2C%22selectedSpanID%22%3A%222017721198037440020%22%7D%2C%22i%22%3A%22trace-panel%22%7D%5D&spanID=2017721198037440020&text_search=snapstart&traceID=6634828896084800457&traceQuery=&start=1761845683104&end=1761860083104&paused=false).
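The preference for restore events over stale init events could be sketched roughly as follows. This is an illustration under stated assumptions, not the extension's code: the `PlatformEvent` enum, the `StartupSpan` struct, the `startup_span` function, and the span names `"restore"` and `"init"` are all hypothetical.

```rust
/// Simplified stand-ins for the telemetry events named in the description.
/// The fields are assumptions for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PlatformEvent {
    InitStart { at_ms: u64 },
    InitReport { duration_ms: f64 },
    RestoreStart { at_ms: u64 },
    RestoreReport { duration_ms: f64 },
}

/// Hypothetical startup span; `name` is "restore" or "init" in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct StartupSpan {
    pub name: &'static str,
    pub start_ms: u64,
    pub duration_ms: f64,
}

/// Picks the startup span for an invocation. If a complete restore pair is
/// present it wins, because the init events then describe snapshot creation
/// rather than this sandbox's startup.
pub fn startup_span(events: &[PlatformEvent]) -> Option<StartupSpan> {
    let mut init: (Option<u64>, Option<f64>) = (None, None);
    let mut restore: (Option<u64>, Option<f64>) = (None, None);

    for event in events {
        match *event {
            PlatformEvent::InitStart { at_ms } => init.0 = Some(at_ms),
            PlatformEvent::InitReport { duration_ms } => init.1 = Some(duration_ms),
            PlatformEvent::RestoreStart { at_ms } => restore.0 = Some(at_ms),
            PlatformEvent::RestoreReport { duration_ms } => restore.1 = Some(duration_ms),
        }
    }

    if let (Some(start_ms), Some(duration_ms)) = restore {
        return Some(StartupSpan { name: "restore", start_ms, duration_ms });
    }
    if let (Some(start_ms), Some(duration_ms)) = init {
        return Some(StartupSpan { name: "init", start_ms, duration_ms });
    }
    None
}
```

With this ordering, a function without SnapStart still gets its usual cold start span, while a restored instance never produces a span anchored at the snapshot creation time.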

litianningdatadog · 2025-12-01T17:05:18Z

/merge

dd-devflow-routing-codex · 2025-12-01T17:05:22Z

View all feedbacks in Devflow UI.

2025-12-01 17:05:22 UTC ℹ️ Start processing command /merge

2025-12-01 17:05:26 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).

2025-12-01 17:06:43 UTC ℹ️ MergeQueue: This merge request was merged

duncanista and others added 28 commits December 1, 2025 11:06

add AWS_LAMBDA_MAX_CONCURRENCY

3f8b792

subscribe to SHUTDOWN only on elevator (#11)

32dd098

`INVOKE` event subscription in elevator would crash
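As a rough illustration of the subscription change, the sketch below builds the `events` list for an extension registration request depending on the mode. `INVOKE` and `SHUTDOWN` are the lifecycle events the Lambda Extensions API accepts at registration; the `Mode` enum and both function names are hypothetical and not taken from the repository.

```rust
/// Hypothetical execution mode for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    OnDemand,
    Elevator,
}

/// Lifecycle events to subscribe to. Per the commit message, subscribing to
/// `INVOKE` in elevator mode would crash, so only `SHUTDOWN` is requested there.
pub fn extension_events(mode: Mode) -> &'static [&'static str] {
    match mode {
        Mode::OnDemand => &["INVOKE", "SHUTDOWN"],
        Mode::Elevator => &["SHUTDOWN"],
    }
}

/// Builds the JSON registration body by hand to keep the sketch dependency-free.
pub fn register_body(mode: Mode) -> String {
    let joined = extension_events(mode)
        .iter()
        .map(|event| format!("\"{event}\""))
        .collect::<Vec<_>>()
        .join(",");
    format!("{{\"events\":[{joined}]}}")
}
```

In a real client the body would more likely be serialized with a JSON library; string formatting is used here only so the example compiles without external crates.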

chore(telemetry): add ec2 capacity provider init type (#12)

5746213

* add `ec2-capacity-provider` init type needed for elevator mode
* improve debug log for telemetry error while serializing

add script to sync main from public repository to here (#13)

3390b3b

boot up into elevator using AWS_LAMBDA_INITIALIZATION_TYPE (#17)

7ef48fb

feat(telemetry): use elevator-specific API route required by elevator…

7a03cbb

… mode https://datadoghq.atlassian.net/browse/SVLS-7740

fix telemetry API subscription (#32)

dca1acd

route was changed, as opposed to schema version

Minor revision per comment

6d6b3ff

Remote debugging setup for Lambda Managed Instance project

1499eb8

Switched to the new value of AWS init type

1880f3a

check for additional network interface in managed instance mode

90bb336

fmt

b0fc85c

Refresh new version of RIE to align with AWS using the same schema

95a31db

Switch to new value of AWS_LAMBDA_INIT_TYPE Minor fix to ensure successful local testing.

feat(managed-instance): enforce continuous flush strategy with valida…

a2c38ae

…tion and logging
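One way to picture "enforce continuous flush strategy with validation and logging" is a small normalization step that maps any configured strategy to a continuous one and returns a warning when the user's choice is overridden. Everything in this sketch is an assumption: the `FlushStrategy` variants, the `ASSUMED_INTERVAL_MS` constant, the function name, and the warning text are illustrative and may differ from the actual implementation.

```rust
/// Hypothetical flush strategies for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FlushStrategy {
    Default,
    End,
    Periodically { interval_ms: u64 },
    Continuously { interval_ms: u64 },
}

/// Assumed fallback interval; the real default is not stated in this PR.
pub const ASSUMED_INTERVAL_MS: u64 = 1_000;

/// Returns the strategy to use in managed instance mode, plus an optional
/// warning to log when an explicitly configured strategy is overridden.
pub fn enforce_for_managed_instance(
    configured: FlushStrategy,
) -> (FlushStrategy, Option<String>) {
    let warning = || {
        format!(
            "flush strategy {configured:?} is not supported in managed instance mode; \
             using continuous flushing"
        )
    };
    match configured {
        // Already continuous: nothing to change, nothing to warn about.
        FlushStrategy::Continuously { .. } => (configured, None),
        // No explicit choice was made, so switch silently.
        FlushStrategy::Default => (
            FlushStrategy::Continuously { interval_ms: ASSUMED_INTERVAL_MS },
            None,
        ),
        // Keep the user's interval but change the strategy, and say so.
        FlushStrategy::Periodically { interval_ms } => (
            FlushStrategy::Continuously { interval_ms },
            Some(warning()),
        ),
        // Flushing at the end of an invocation has no interval to reuse.
        FlushStrategy::End => (
            FlushStrategy::Continuously { interval_ms: ASSUMED_INTERVAL_MS },
            Some(warning()),
        ),
    }
}
```

Returning the warning instead of logging it directly keeps the function pure, so the caller decides the log level and the behavior is easy to test.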

feat(managed-instance): immediately ship logs between invocations [SV…

cd84b63

…LS-7879] (#44)

* ship logs between invocations without request_id
* fmt
* test
* Minor change to prepare for code merge

feat(managed-instance): add request ID-based event pairing for concur…

14ae768

…rent invocations
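Request ID-based pairing for concurrent invocations can be sketched as a map from request ID to start time, so that a completion event is matched to its own start even when invocations interleave. The `InvocationTracker` type and its methods are hypothetical names for illustration; the actual data structures in the extension may differ.

```rust
use std::collections::HashMap;

/// Tracks in-flight invocations keyed by request ID (hypothetical type).
#[derive(Debug, Default)]
pub struct InvocationTracker {
    in_flight: HashMap<String, u64>,
}

impl InvocationTracker {
    /// Records the start timestamp (milliseconds) for a request.
    pub fn on_start(&mut self, request_id: &str, start_ms: u64) {
        self.in_flight.insert(request_id.to_string(), start_ms);
    }

    /// Pairs a completion event with its start and returns the elapsed
    /// milliseconds, or `None` if no start was seen for this request ID.
    pub fn on_done(&mut self, request_id: &str, end_ms: u64) -> Option<u64> {
        let start_ms = self.in_flight.remove(request_id)?;
        Some(end_ms.saturating_sub(start_ms))
    }

    /// Number of invocations that have started but not yet completed.
    pub fn active(&self) -> usize {
        self.in_flight.len()
    }
}
```

Keying by request ID avoids the "most recent invocation" assumption that holds when a sandbox handles one invocation at a time but breaks under concurrency.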

dont inherit request id in managed instances (#48)

584ffc3

feat(managed-instance): set usage enhanced metrics at the sandbox lev…

97027ff

…el [SVLS-7906] (#47)

* emit fd/threads metrics at shutdown
* pause monitoring on no active invocations
* fmt

fix: log level should be debug

09a7d77

ensure runtimeDone metric is used from span data in managed instances (…

5b0d8d2

…#52)

feat(managed-instance): send cold start span (#53)

87e3d36

* create empty context on init start to be updated on platform start/invoke
* clippy

feat(managed-instance): warn user for unsupported flush strategy conf…

4238fa4

…iguration

docs(bottlecap): update README with accurate Managed Instance impleme…

33d7e7b

…ntation details

refactor(local-tests): fetch AWS Lambda RIE from GitHub releases

1e7824a

litianningdatadog requested a review from a team as a code owner December 1, 2025 16:34

astuyve approved these changes Dec 1, 2025

View reviewed changes

dd-devflow bot added mergequeue-status: queued mergequeue-status: in_progress and removed mergequeue-status: queued labels Dec 1, 2025

dd-mergequeue bot merged commit 37f3c29 into main Dec 1, 2025
38 of 39 checks passed

dd-devflow bot added mergequeue-status: done and removed mergequeue-status: in_progress labels Dec 1, 2025

dd-mergequeue bot deleted the elevator branch December 1, 2025 17:06


6 participants
