Skip to content

[POC] Add Collector Payload Thrift encoding / decoding transformations#251

Draft
Josh (jbeemster) wants to merge 8 commits intomasterfrom
thrift-encoding-poc
Draft

[POC] Add Collector Payload Thrift encoding / decoding transformations#251
Josh (jbeemster) wants to merge 8 commits intomasterfrom
thrift-encoding-poc

Conversation

@jbeemster
Copy link
Member

@jbeemster Josh (jbeemster) commented Jan 20, 2023

This PR adds transformations that can convert a JSON string into a Collector Payload with Thrift encoding and can also take a Thrift Payload and convert it to a JSON string.

This can be used to:

  1. Get access to Collector Payloads in a usable format without needing to deal with Thrift directly
  2. To publish custom events directly into the "raw" stream (for example as part of a streaming recovery operation)

Q: Should the CollectorPayload module live in this repo or its own repo? Should it have the Apache2 license on it as well over Community?

Copy link

@miike miike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I don't think these serializers need to be concerned about the heal context as that can get added elsewhere - likely in the function that yields the event itself.

@jbeemster Josh (jbeemster) changed the title Thrift encoding poc [POC] Add Collector Payload Thrift encoding / decoding transformations Jan 23, 2023
@colmsnowplow
Copy link
Collaborator

colmsnowplow commented Jan 23, 2023

Looks good!

Four questions/discussion points from me:

1. Re: Mike's comment:

I don't think these serializers need to be concerned about the heal context as that can get added elsewhere - likely in the function that yields the event itself.

I would argue that we should not rely on the function that yields the event to do this, because it's intended to be a user-configured custom script. We should neither rely on nor expect the everyday user to instrument that correctly IMO.

The way I see it, there are three places that it could reasonably go IMO:

  1. When we generate the bad row itself (ie in enrich and anywhere else we generate bad rows) - ie we have a 'failed event' context that gets attached to the event itself, recording some metadata about it. Then the 'should I attempt recovery' logic in this application can be 'if there's more than one failed event context, don't attempt to recover'.
  2. Where we generate the recovered event - ie the script to recover. If we want to do it here, I would suggest it's hardcoded into a wrapper around the scripting transformation itself - ie we don't rely on the user to instrument it. Even still, I'm not sure it's the cleanest (because what stops the user from just using normal JS transformation?)
  3. Where we serialise - ie this PR. I agree that it feels wrong to have this logic live in serialisation though.

2. Suggestion:

By a similar vein to my argument above, do we here rely on the usuer to return a correct 'raw event' type from their JS enrichment? I wonder if we could assist with this by defining a struct that raw event should fit, and coercing the data we received from the script into this struct.

Perhaps there's a way to fail out with an error if they get it wrong? Or some other way to ensure that we are receiving the correct data.

This might be the responsibility of another transformation, or some other tooling to assist with testing though.

3. Small question about b64:

When do we not want to b64 encode the data (ie why make it configurable)? Isn't it always b64 encoded in the raw stream?

4. Question about practicalities:

This PR is marked as a POC and is a draft, so I'm assuming that the review here is conceptual feedback and discussions at this point. Please let me know if I should begin to treat it as something to be scheduled into a release (or when). Just flagging in case my assumption is wrong. :)

@jbeemster
Copy link
Member Author

colmsnowplow on number 1 I think we have a few different options but they should be out of scope of this initial PoC. I think we could ideally have a decorator transformation that could append a context to a "raw" or "enriched" event potentially but we can think about that later!

On 2 - for sure we can add other transformations to ensure this but again for PoC I think we can write this into the JS / Lua layer to assert the idea works before improving on the whole flow here. We are likely going to need:

  1. A transformation to filter and prepare a failed_event for recovery and;
  2. A transformation to take a recovered event and prepare it for turning back into a collector payload

On 3 - the b64 part was really only added so that we could have reliable unit tests. In production you would not base64 encode at that layer as the transport to SQS / Kinesis / PubSub does this implictly based on the []byte that is passed to it which should just be the thrift encoded []byte. So entirely as an aid for portability of the result in unit tests or if using stdout (this is the same as what the Collector in stdout mode does actually).


For sure still a PoC - we want to prove that this is viable and then we can dig a bit deeper into a lot of the very good questions you have posed on how we would make this much more user friendly / what other transformations we would want to make this really simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants