Conversation

@Fiery-Fenix
Contributor

Description

This is the first part of a set of Azure Resource Log parsing implementations in the azure_encoding extension.
This part is focused on generic Azure Resource Log parsing, without any Category-specific code; that will be presented in the following PRs.
Key points for this PR:

  • The code is a heavily refactored part of pkg/translator/azurelogs and includes a comparable benchmark so performance can be compared
  • The code is performance-optimized (structs, partial parsing, etc.) so it can handle large batches of logs
  • Log parsing was refactored so that any JSON schema can be parsed, because Azure Resource Logs do not always follow the common log schema. For example, Azure ServiceBus OperationalLogs shares only one field with that schema (category); all other fields have different names
  • The parser is Category-oriented: it allows a Category-specific set of Resource Attributes, Category-specific timestamp and log level parsers, and Category-specific field-to-attribute translators (see the sketch after this list)
  • The parser uses a relatively fresh SemConv version and tries to be as SemConv-compatible as possible
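For illustration, the Category-oriented layout can be sketched roughly like this (the names are illustrative, not the exact types used in this PR):

import (
	"encoding/json"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/plog"
)

// categoryParser bundles everything that can vary per Azure log Category.
type categoryParser struct {
	// resourceAttributes holds the Category-specific Resource Attributes.
	resourceAttributes map[string]string
	// parseTimestamp handles the Category-specific timestamp field and format.
	parseTimestamp func(raw string) (pcommon.Timestamp, error)
	// parseSeverity maps Category-specific level values to severity numbers.
	parseSeverity func(raw string) plog.SeverityNumber
	// translateProperties maps Category-specific "properties" fields to log attributes.
	translateProperties func(props json.RawMessage, attrs pcommon.Map) error
}

// categoryParsers is keyed by the record's "category" field; unknown
// categories fall back to a generic parser.
var categoryParsers = map[string]categoryParser{}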

Link to tracking issue

Part of #41725.

Testing

Corresponding unit tests were added, as well as a couple of benchmark tests.

Documentation

An additional log-specific README was added.

Contributor

@constanca-m left a comment


Thank you for starting this @Fiery-Fenix! I would love for us to build this encoding extension with performance in mind from the start, since the logs translator made the mistake of not doing so. I left a couple of comments on the things I think we should improve.

// AttrPutMapIf is a helper function that sets a map attribute with the given key,
// trying to parse it from the raw value.
// If parsing fails, no attribute is set.
func AttrPutMapIf(attrs pcommon.Map, attrKey string, attrValue any) {
Contributor

any is very bad performance-wise, and given that an encoding is a hot path, this function would not work here.
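For context, a tiny benchmark along these lines (illustrative only, run with go test -bench . -benchmem) shows why: decoding into any builds a map[string]any with boxed values for every record, while a typed struct writes straight into its fields:

import (
	"encoding/json"
	"testing"
)

var sample = []byte(`{"category":"OperationalLogs","durationMs":12,"resultType":"Succeeded"}`)

// Decoding into any allocates a fresh map[string]any with boxed values per call.
func BenchmarkUnmarshalAny(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var v any
		if err := json.Unmarshal(sample, &v); err != nil {
			b.Fatal(err)
		}
	}
}

// Decoding into a typed struct avoids the intermediate map entirely.
func BenchmarkUnmarshalStruct(b *testing.B) {
	type record struct {
		Category   string `json:"category"`
		DurationMs int64  `json:"durationMs"`
		ResultType string `json:"resultType"`
	}
	for i := 0; i < b.N; i++ {
		var r record
		if err := json.Unmarshal(sample, &r); err != nil {
			b.Fatal(err)
		}
	}
}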

Contributor Author

Nice catch, this function is intended to store a parsed JSON object into an attribute, not an any value; I'll fix it.
Please keep in mind that this function is only for data that we don't know how to properly parse, i.e. we are not supporting it at the moment or we don't have a defined structure for it (for example, it has a dynamic structure).
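Something like this is what I have in mind for the fix (a minimal sketch, assuming the helper takes the raw JSON bytes instead of any):

import (
	"encoding/json"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// AttrPutMapIf parses rawValue as a JSON object and, on success, stores it
// under attrKey as a map attribute. If parsing fails, no attribute is set.
func AttrPutMapIf(attrs pcommon.Map, attrKey string, rawValue json.RawMessage) {
	// map[string]any is unavoidable here: this path is only for payloads
	// with no known structure.
	var parsed map[string]any
	if err := json.Unmarshal(rawValue, &parsed); err != nil {
		return // not a JSON object, leave attrs untouched
	}
	// FromRaw converts the generic map into a pcommon map value.
	_ = attrs.PutEmpty(attrKey).FromRaw(parsed)
}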

Comment on lines +179 to +190
if err := gojson.Unmarshal(r.Properties, &properties); err != nil {
errs = append(errs, err)
// If failed - trying to parse using a primitive value
var val any
if err = gojson.Unmarshal(r.Properties, &val); err == nil {
// Parsed, put primitive value as "properties" attribute
if err = attrs.PutEmpty(attributesAzureProperties).FromRaw(val); err != nil {
errs = append(errs, err)
// All attempts above - failed, put unparsable properties to log.Body
body.SetStr(string(r.Properties))
}
}
Contributor

This is a performance issue as well: we have no idea what properties is, according to this code. Is this correct? We should use category to determine that, instead of unmarshaling to any.

@zmoog

we have no idea what properties is according to this code. Is this correct?

Yeah, in some cases properties is a string.

We need to double-check whether properties ever changes for the same category. It may change. For example, Azure Functions emits log events with different properties values for the same category, depending on the hosting plan (Azure Functions logs are the source of the most hostile categories in the ecosystem).

I'll look for sample logs with properties other than string.

Contributor

Sorry, I don't mean we need to structure it immediately. But using []byte/json.RawMessage is already better than any. I want us to be cautious and only use any for the very specific cases where we see no other option.
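Something along these lines is what I mean (purely illustrative, the names are not a proposal for this PR):

import (
	"encoding/json"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// azureLogRecord keeps Properties as raw JSON bytes, so nothing is decoded
// into map[string]any up front.
type azureLogRecord struct {
	Time       string          `json:"time"`
	Category   string          `json:"category"`
	Properties json.RawMessage `json:"properties"`
}

// appServiceHTTPProps is a hypothetical Category-specific payload.
type appServiceHTTPProps struct {
	CsMethod string `json:"CsMethod"`
	ScStatus int64  `json:"ScStatus"`
}

// putProperties decodes Properties only once the Category is known.
func putProperties(r azureLogRecord, attrs pcommon.Map) error {
	switch r.Category {
	case "AppServiceHTTPLogs":
		var p appServiceHTTPProps
		if err := json.Unmarshal(r.Properties, &p); err != nil {
			return err
		}
		attrs.PutStr("http.request.method", p.CsMethod)
		attrs.PutInt("http.response.status_code", p.ScStatus)
		return nil
	default:
		// Unsupported categories keep the raw bytes for a later fallback path.
		return nil
	}
}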

@zmoog

Here's an example from the FunctionAppLogs category: https://github.com/elastic/integrations/blob/main/packages/azure_functions/data_stream/functionapplogs/_dev/test/pipeline/test-azure-functions-invalid-json.log

In this first case, the properties field holds a string value (which happens to be invalid JSON, but that's a separate issue).

However, in another instance of that same category https://github.com/elastic/integrations/blob/main/packages/azure_functions/data_stream/functionapplogs/_dev/test/pipeline/test-azure-functions-raw.log, the properties value is an actual object.

The schema varies depending on the function's hosting plan.

Contributor Author

Please keep in mind that this function is used only when we couldn't correctly map to a defined structure, i.e. we haven't added support yet or it simply couldn't be parsed at all, as @zmoog mentioned.
It's last-resort code; all known "properties" should be mapped to a defined structure without using any. You'll see how it's implemented in the next PRs.
BTW, it's a refactored version of the existing code in pkg/translator/azurelogs.

// This will allow us to parse Azure Log Records in both formats:
// 1) As exported to Azure Event Hub, e.g. `{"records": [ {...}, {...} ]}`
// 2) As exported to Azure Blob Storage, e.g. `[ {...}, {...} ]`
rootPath, err := gojson.CreatePath(jsonPath)
Contributor

Interesting approach, I wasn't familiar with this. Did you benchmark it? Does it perform better than just checking the first bytes (a rough sketch follows the list)?

  • Given we only offer support for resource logs and logs from a storage file:
    • Resource logs -> have records in the first bytes
    • Storage -> simply has [
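Something as small as this is what I have in mind (a rough sketch, not a proposal for the exact API):

import "bytes"

// looksLikeBareArray reports whether the payload starts with '[', i.e. the
// bare-array shape written to Blob Storage; anything else would be treated
// as the {"records": [...]} envelope.
func looksLikeBareArray(payload []byte) bool {
	trimmed := bytes.TrimLeft(payload, " \t\r\n")
	return len(trimmed) > 0 && trimmed[0] == '['
}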

Contributor Author

I will try to implement a byte-checking approach for format auto-detection; it seems we can't rely on the data source for format detection, as @zmoog highlighted.

Comment on lines +29 to +44
| `time`, `timestamp` | `log.timestamp` |
| `resourceId` | `cloud.resource_id` (resource attribute) |
| `tenantId` | `azure.tenant.id` (resource attribute) |
| `location` | `cloud.region` (resource attribute) |
| `operationName` | `azure.operation.name` (log attribute) |
| `operationVersion` | `azure.operation.version` (log attribute) |
| `category`, `type` | `azure.category` (log attribute) |
| `resultType` | `azure.result.type` (log attribute) |
| `resultSignature` | `azure.result.signature` (log attribute) |
| `resultDescription` | `azure.result.description` (log attribute) |
| `durationMs` | `azure.duration` (log attribute) |
| `callerIpAddress` | `network.peer.address` (log attribute) |
| `correlationId` | `azure.correlation_id` (log attribute) |
| `identity` | `azure.identity` (log attribute) |
| `Level` | `log.SeverityNumber` |
| `properties` | see mapping for each Category below |

Good idea on showing if the field is for a resource or a log.

Nit: If we add it as a column, that might improve readability a bit.
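For example, something like this (just a sketch of the idea, using two rows from the table above):

| Azure field | Mapped to | Scope |
|---|---|---|
| `resourceId` | `cloud.resource_id` | resource attribute |
| `operationName` | `azure.operation.name` | log attribute |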


// This will allow us to parse Azure Log Records in both formats:
// 1) As exported to Azure Event Hub, e.g. `{"records": [ {...}, {...} ]}`
// 2) As exported to Azure Blob Storage, e.g. `[ {...}, {...} ]`
@zmoog

Which categories did you test that came as [ {...}, {...} ] on Blob Storage?

I recently collected sample logs on blob storage for the following categories:

  • kube-apiserver (AKS control plane logs)
  • Policy (activity logs from a subscription)
  • FlowLogFlowEvent (VNet flow logs)

kube-apiserver and Policy came to blob storage from diagnostic settings as newline-separated objects:

{}
{}
{}

FlowLogFlowEvent came from the VNet exporter (not diagnostic settings) as records, just like what we get on Event Hubs:

{"records": [ {...}, {...} ]}

So, my current understanding is that we cannot rely on the source to determine the format (DS, DCR, VNet, etc.). Ideally, the unmarshaler should accept both {"records": [ {...}, {...} ]} and the direct log {}.
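For illustration, a detection sketch that would cover the three shapes above (the names are mine, not a proposal for the actual implementation):

import (
	"bytes"
	"encoding/json"
	"errors"
)

// splitRecords auto-detects the payload shape and returns the individual
// records: a bare JSON array, a {"records": [...]} envelope, or one or more
// newline-delimited objects.
func splitRecords(data []byte) ([]json.RawMessage, error) {
	trimmed := bytes.TrimLeft(data, " \t\r\n")
	if len(trimmed) == 0 {
		return nil, errors.New("empty payload")
	}
	switch trimmed[0] {
	case '[':
		// Bare array: [ {...}, {...} ]
		var records []json.RawMessage
		err := json.Unmarshal(trimmed, &records)
		return records, err
	case '{':
		// Either a {"records": [...]} envelope or newline-delimited objects.
		var envelope struct {
			Records []json.RawMessage `json:"records"`
		}
		if err := json.Unmarshal(trimmed, &envelope); err == nil && envelope.Records != nil {
			return envelope.Records, nil
		}
		// Fall back to newline-delimited JSON objects (this also covers a single object).
		var records []json.RawMessage
		dec := json.NewDecoder(bytes.NewReader(trimmed))
		for dec.More() {
			var rec json.RawMessage
			if err := dec.Decode(&rec); err != nil {
				return nil, err
			}
			records = append(records, rec)
		}
		return records, nil
	default:
		return nil, errors.New("unrecognized payload format")
	}
}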

Contributor Author

Hm, that's interesting...
I was testing it on Application Gateway logs (ApplicationGatewayFirewallLog and ApplicationGatewayAccessLog).
I will try to implement some auto-detection functionality here, as it seems we can't really rely on the data source for format detection :(

Comment on lines +115 to +123
// Filter out categories based on provided configuration
if _, exclude := r.excludeCategories[logCategory]; exclude {
continue
}
if hasIncludes {
if _, include := r.includeCategories[logCategory]; !include {
continue
}
}

Interesting. I've never had to filter log categories. What's the specific driver here? (No objections to the option, though!)

Contributor Author

We faced this issue while using a shared Event Hub.
There was an already existing export flow to the Event Hub for SIEM purposes, but we don't need all the logs from it, only some specific categories.
Of course, this could be done with a filter processor later in the collector pipeline, but it's obviously very inefficient to parse logs and then discard them later, instead of simply discarding them before parsing.
