kruskall
Member

@kruskall kruskall commented Feb 18, 2025

The Lambda extension processes data in the background during an invocation:

apm-aws-lambda/app/run.go

Lines 197 to 204 in 3f5620c

// APM Data Processing
backgroundDataSendWg.Add(1)
go func() {
	defer backgroundDataSendWg.Done()
	if err := app.apmClient.ForwardApmData(invocationCtx); err != nil {
		app.logger.Error(err)
	}
}()

Data is sent to a channel. This is fine during an invocation, when the goroutine is running, but during shutdown the send can block because no goroutine is left to read from the channel.

app.logsClient.FlushData(ctx, event.RequestID, event.InvokedFunctionArn, app.apmClient.ForwardLambdaData, true)

The solution is to bypass the channel and forward the data directly during shutdown.
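To illustrate the failure mode in isolation, here is a minimal sketch. It is not the extension's code: `sendBlocks`, `duringInvocation`, `duringShutdown`, and `forwardDirectly` are hypothetical names, and a short timer stands in for the Lambda timeout. It shows that a send on an unbuffered channel completes while a reader goroutine exists, stays pending when none does, and that a direct synchronous call needs no reader at all.

```go
package main

import (
	"fmt"
	"time"
)

// sendBlocks reports whether a plain send on ch is still pending after wait.
// When nobody receives, the sending goroutine is deliberately left blocked.
func sendBlocks(ch chan []byte, wait time.Duration) bool {
	done := make(chan struct{})
	go func() {
		ch <- []byte("log line") // blocks until some goroutine receives
		close(done)
	}()
	select {
	case <-done:
		return false
	case <-time.After(wait):
		return true
	}
}

// duringInvocation models the case where a consumer goroutine is running.
func duringInvocation() bool {
	ch := make(chan []byte)
	go func() { <-ch }() // the background reader
	return sendBlocks(ch, 100*time.Millisecond)
}

// duringShutdown models the case where the consumer has already exited.
func duringShutdown() bool {
	ch := make(chan []byte) // unbuffered, and nobody reads from it
	return sendBlocks(ch, 100*time.Millisecond)
}

// forwardDirectly bypasses the channel and calls the forwarder synchronously.
func forwardDirectly(forward func([]byte) error, data []byte) error {
	return forward(data)
}

func main() {
	fmt.Println("send blocks during invocation:", duringInvocation())
	fmt.Println("send blocks during shutdown:", duringShutdown())

	delivered := 0
	err := forwardDirectly(func([]byte) error { delivered++; return nil }, []byte("log line"))
	fmt.Println("delivered directly:", delivered, "err:", err)
}
```

The direct call trades the decoupling a channel provides for a guarantee that shutdown does not depend on another goroutine still being alive.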

do not send to the lambda data chan and potentially wait
forever causing a timeout on shutdown
@kruskall kruskall requested a review from rockdaboot February 18, 2025 13:37
@github-actions github-actions bot added the aws-λ-extension AWS Lambda Extension label Feb 18, 2025
@rockdaboot
Contributor

Can you describe the issue in the PR description and how this PR solves it?

@kruskall
Member Author

Updated! 👍

Member

@dmathieu dmathieu left a comment

Could we add tests, or is that too unpredictable?

@rockdaboot
Contributor

> Updated! 👍

Thanks! Maybe I misunderstand the code changes, but it looks to me that you replace direct writes to the channel with a function that writes to the data channel. How is that a difference in the data flow logic and don't we still have the same issue then (channel reader doesn't read any more)?

I somewhat agree with @dmathieu that we should try to create a test that reproduces the issue, and then prove that this code solves the issue. It doesn't have to be a unit test, if that is too involved. Maybe some kind of tool / setup / script, whatever is the simplest way.

@sboomsma

Just nudging here in case it gets forgotten, because the related SDH was auto-closed.

@kruskall
Member Author

kruskall commented Apr 9, 2025

> Thanks! Maybe I misunderstand the code changes, but it looks to me that you replace direct writes to the channel with a function that writes to the data channel. How is that a difference in the data flow logic and don't we still have the same issue then (channel reader doesn't read any more)?

There are three places that have been touched. The first two now forward the data directly, bypassing the channel; the last one is unchanged and just sends to a channel.
For the third case: It's fine to send to a channel inside processEvent because there's someone reading from it and in the worst case the ctx will make the func exit.
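As a rough sketch of why that third case is safe, assuming the send is guarded by a `select` on the context (the thread does not show the actual code, so `sendOrCancel` and the helpers below are hypothetical names): the send succeeds when a reader is present, and otherwise returns as soon as the context is done instead of waiting indefinitely.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sendOrCancel hands data to ch, but gives up as soon as ctx is done.
func sendOrCancel(ctx context.Context, ch chan<- []byte, data []byte) error {
	select {
	case ch <- data:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sendWithReader models the normal case: a goroutine is reading from ch.
func sendWithReader() error {
	ch := make(chan []byte)
	received := make(chan struct{})
	go func() {
		<-ch
		close(received)
	}()
	err := sendOrCancel(context.Background(), ch, []byte("event"))
	<-received // wait until the reader has actually consumed the value
	return err
}

// sendAfterCancel models the worst case: no reader, and ctx already cancelled.
func sendAfterCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return sendOrCancel(ctx, make(chan []byte), []byte("event"))
}

// cancelledCleanly reports whether the worst case returns context.Canceled.
func cancelledCleanly() bool {
	return errors.Is(sendAfterCancel(), context.Canceled)
}

func main() {
	fmt.Println("with reader, err:", sendWithReader())
	fmt.Println("no reader, cancelled cleanly:", cancelledCleanly())
}
```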

> I somewhat agree with @dmathieu that we should try to create a test that reproduces the issue, and then prove that this code solves the issue. It doesn't have to be a unit test, if that is too involved. Maybe some kind of tool / setup / script, whatever is the simplest way.

maybe we can reuse the setup from @xrmx ? I'm not sure how feasible it is to write a unit test for this.

@rockdaboot
Contributor

> For the third case: It's fine to send to a channel inside processEvent because there's someone reading from it and in the worst case the ctx will make the func exit.

Thanks for the explanation.

> maybe we can reuse the setup from @xrmx ? I'm not sure how feasible it is to write a unit test for this.

@xrmx Can you run your reproducer again with the changes here and confirm that it fixes the issue?

@dmathieu
Member

dmathieu commented Sep 5, 2025

@kruskall were you able to follow up on @rockdaboot's question above?

@xrmx
Member

xrmx commented Sep 8, 2025

The Lambda has a timeout of 8 seconds. I ran a bunch of calls and these are the results:

REPORT RequestId: ee1dd6e8-563b-4b0a-a7b2-17fc36ea3cf8	Duration: 633.98 ms	Billed Duration: 1811 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	Init Duration: 1176.12 ms	
REPORT RequestId: 1452a46f-cd2d-4649-a41e-4d79607603b0	Duration: 1046.10 ms	Billed Duration: 1047 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: 63734f0c-76b5-400d-9d1e-efdf22070c73	Duration: 1157.95 ms	Billed Duration: 1158 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: 78c1e4c3-7d79-4e14-b259-cb4e04c0860a	Duration: 1094.44 ms	Billed Duration: 1095 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: 79521443-4d7f-46a2-b5ba-a5fa4b99b3ae	Duration: 1261.75 ms	Billed Duration: 1262 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: ea6b8703-6007-4e36-aa1b-424edf8dc5a6	Duration: 1144.38 ms	Billed Duration: 1145 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: b92d3412-2b62-4bfb-b95a-5b92a707a4bd	Duration: 2209.68 ms	Billed Duration: 2210 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: ec300469-ded4-4b13-b1ac-8c394e460a09	Duration: 301.76 ms	Billed Duration: 302 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: 4150cc9e-4989-47c8-95c4-2b01361bef8d	Duration: 1356.12 ms	Billed Duration: 1357 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	
REPORT RequestId: 81c37509-ace6-4a85-94fd-6910f3f66a76	Duration: 1148.95 ms	Billed Duration: 1149 ms	Memory Size: 128 MB	Max Memory Used: 99 MB	

So it looks like it exits just fine and the timeout does not occur.

@dmathieu dmathieu merged commit 2cab1ad into main Sep 8, 2025
13 checks passed
@dmathieu dmathieu deleted the fix/log-blocking branch September 8, 2025 08:28
Labels

aws-λ-extension AWS Lambda Extension

5 participants