chore: introduce metrics change#307

Open
josecorella wants to merge 9 commits into master from jocorell/metrics-specification

@josecorella josecorella commented Sep 15, 2025

For the best reading and commenting experience, I suggest splitting your window in two: the review page and the rendered page.
Here are the rendered files:

Goals for 9-15-2025 Spec Review:

  • Agreement on optional metrics agents and how they will impact existing APIs
  • Agreement that Metrics Agent Interface and Implementation will be implemented in Dafny but only as wrappers and provide extern implementations to make moving off of Dafny easier.
  • Agreement on interface supported operations.

NOTE: name change from metrics-agent -> metrics-worker

Goals for 2-4-2025 Spec Review:

  • Agreement on interface supported operations.
  • Agreement that Metrics Agent Interface and Implementation will be implemented in Dafny but only as wrappers and provide extern implementations to make moving off of Dafny easier.

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Check any applicable:

  • Were any files moved? Moving files changes their URL, which breaks all hyperlinks to the files.

@josecorella josecorella requested a review from a team as a code owner September 15, 2025 16:05
metrics. Customers can then ask for updates to the implementations
CT provides or customers can go an implement their own interfaces that are fine-tuned
to their use cases.

Contributor

Is there a middle ground? For example, logging information locally without uploading it anywhere.

@texastony
Contributor

@josecorella I have a meta/macro question.
Is there a proposal doc that accompanies these change docs?

I appreciate that the Background doc highlights issues and alternatives, but I feel like we are missing a "User Stories" document that can be used to establish the success criteria and the table stakes of this work.

It is also possible I just missed such a proposal doc; but without it, it is difficult to work backwards.

operation AddDate {
input: AddDateInput,
output: AddOutput,
errors: [MetricsPutError]
Contributor

Nit: You say that the interface should not error, but you have errors here.

Comment on lines +234 to +235
// Common output structure
structure AddOutput {}
Contributor

I get why you might optimize this. But is this really the best choice? Why not have an output per operation?

Contributor Author

revisit this in discussion on 2026-02-04

@texastony
Contributor

Potential Issue/Alternative/User Story:

Users of the MPL/ESDK/DB-ESDK are also users of the AWS SDKs. The AWS SDKs have established logging and metric interfaces.
AWS Crypto Tools products are AWS Products, just like the AWS SDKs.

There likely is an implicit customer expectation that Crypto Tools products behave and appear to be consistent with the AWS SDKs.

Therefore, I suggest we carefully evaluate whether we can use the SDKs' metric and logging tooling and offer a customer experience that closely mimics the SDKs' experience.

The current collection of docs does not state this as a goal, but it does leave it open as an opportunity.

For example, the proposed metric interface could wrap an SDK metric class.
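
To make the wrapping idea concrete, here is a minimal sketch of the adapter shape. Everything in it is an assumption for illustration: `MetricsWorker` with a single `addCount` method is a simplified stand-in for the proposed interface, and `SdkStyleMetricSink` is a placeholder for whatever metric class an SDK exposes, not a real AWS SDK type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified shape of the proposed interface.
interface MetricsWorker {
    void addCount(String label, long value);
}

// Placeholder for an SDK-provided metric class. NOT a real AWS SDK type.
interface SdkStyleMetricSink {
    void record(String name, double value);
}

// Adapter: the Crypto Tools facing interface delegates to the SDK-style sink.
final class SdkBackedMetricsWorker implements MetricsWorker {
    private final SdkStyleMetricSink sink;

    SdkBackedMetricsWorker(SdkStyleMetricSink sink) {
        this.sink = sink;
    }

    @Override
    public void addCount(String label, long value) {
        sink.record(label, (double) value);
    }
}

// In-memory sink so the sketch can be exercised without any SDK dependency.
final class RecordingSink implements SdkStyleMetricSink {
    final List<String> lines = new ArrayList<>();

    @Override
    public void record(String name, double value) {
        lines.add(name + "=" + value);
    }
}
```

With this shape, a customer who already configured SDK metrics would only need to hand the existing sink to the adapter; the library itself would depend only on the narrow interface.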

Client->>Client: Content Encryption
end
Client<<->>CMM: GetEncryption/Decryption Materials
CMM<<->>Keyring: OnEncrypt/OnDecrypt
Contributor

I think the doc implies this, but we can go deeper down the stack.

i.e: Keyring <<->> KMS or Keyring <<->> Branch Key Store.

Technically, the H-Keyring is:
Keyring <<->> (Cache | Branch Key Store <<->> (DDB & KMS))

Contributor Author

I agree that we could dive deeper; however, for this change I don't want to go beyond painting the most basic model for the reader.

Contributor

But we need the CMC here.

- Metrics: Throughout this document and other related documents the word, "metrics" is used extensively.
For Crypto Tools' libraries metrics means two things.

1. Measuring application performance, (e.g. api requests, cache performance, latency).
Contributor

nit: this doesn't render as a list, it renders as a code block

Contributor Author

vscode render preview lied to me....

Contributor

I disagree about implementing this in Dafny.
I am concerned that putting additional synchronization blocks into our code base will negatively impact performance, and I do not see how a metric interface could exist in Dafny without a synchronization block.

Contributor

What do you mean by "synchronization blocks"?

I don't think we'd need sync blocks - why can't externs fire (a new thread) and forget?
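
A rough sketch of what "fire and forget" could look like on the native (Java) side of an extern. The assumptions are loud ones: `MetricsWorker` with one `addCount` method is a simplified stand-in for the proposed interface, `FireAndForgetWorker` and its `dropped` counter are hypothetical names, and the sketch uses one long-lived background thread rather than a new thread per metric. Nothing here shows how Dafny itself would call into it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified shape of the proposed interface.
interface MetricsWorker {
    void addCount(String label, long value);
}

// Hands each metric to a background thread so the caller never waits on the sink.
final class FireAndForgetWorker implements MetricsWorker, AutoCloseable {
    private final MetricsWorker delegate;
    private final ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "metrics-publisher");
        thread.setDaemon(true); // do not keep the JVM alive just for metrics
        return thread;
    });
    // Number of metrics that could not be published.
    final AtomicLong dropped = new AtomicLong();

    FireAndForgetWorker(MetricsWorker delegate) {
        this.delegate = delegate;
    }

    @Override
    public void addCount(String label, long value) {
        try {
            executor.execute(() -> {
                try {
                    delegate.addCount(label, value);
                } catch (RuntimeException e) {
                    dropped.incrementAndGet(); // a sink failure never reaches the caller
                }
            });
        } catch (RejectedExecutionException e) {
            dropped.incrementAndGet(); // the worker was already closed
        }
    }

    // Stops accepting metrics and waits briefly for queued ones to flush.
    @Override
    public void close() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The caller only pays for an enqueue. The executor's queue is unbounded in this sketch, so a real implementation would likely want a bounded queue and a drop policy.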

versions of these libraries have no logging or metrics publishing
to either a local application or to an observability service like AWS CloudWatch.

As client side encryption libraries emitting metrics must be done carefully as
Contributor

Suggested change
As client side encryption libraries emitting metrics must be done carefully as
As client side encryption libraries, emitting metrics must be done carefully as

Contributor

I wonder about errors. How, and to what extent, should errors show up? Can we log the stack trace, or is that an anti-pattern?

| ESDK | T.B.D | [ESDK.smithy](https://github.com/aws/aws-encryption-sdk/blob/mainline/AwsEncryptionSDK/dafny/AwsEncryptionSdk/Model/esdk.smithy) |
| MPL | T.B.D | [material-provider.smithy](https://github.com/aws/aws-cryptographic-material-providers-library/blob/main/AwsCryptographicMaterialProviders/dafny/AwsCryptographicMaterialProviders/Model/material-provider.smithy) |
| DB-ESDK | T.B.D | [DynamoDbEncryption.smithy](https://github.com/aws/aws-database-encryption-sdk-dynamodb/blob/main/DynamoDbEncryption/dafny/DynamoDbEncryption/Model/DynamoDbEncryption.smithy) |

Contributor

S3EC..?

Contributor Author

I forgot S3EC.

Comment on lines +34 to +36
A popular feature request has been for in depth insights into CT libraries. Many customers
ask for suggestions on how to reduce network calls to AWS Key Management Service (AWS KMS) and
followup questions around cache performance.
Contributor

Issue: "has been for" is a rough phrase.
Ideally,
we would quantify the customer demand (x internal services, y external customers) but that takes time and is not worth it;
we all know that the demand is there.

Suggested change
A popular feature request has been for in depth insights into CT libraries. Many customers
ask for suggestions on how to reduce network calls to AWS Key Management Service (AWS KMS) and
followup questions around cache performance.
There is customer demand for in depth insights into CT libraries. Many customers
ask for suggestions on how to reduce network calls to AWS Key Management Service (AWS KMS) and
followup questions around cache performance.


### Issue 2: Should Data Plane APIs fail if metrics fail to publish?

#### No (recommended)
Member

I might want to at least throw a warning?
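
One possible reading of "don't fail, but warn", sketched under assumptions: `MetricsWorker` is a simplified stand-in for the proposed interface, `SafeMetrics` and `BestEffortClient` are hypothetical names, and the `encrypt` body is a placeholder rather than real encryption. The warning goes through `java.util.logging` and records only the exception type, since how much error detail is safe to log is still an open question in this review.

```java
import java.util.logging.Logger;

// Hypothetical, simplified shape of the proposed interface.
interface MetricsWorker {
    void addCount(String label, long value);
}

// Best-effort publishing: a failing worker produces a warning, never an exception.
final class SafeMetrics {
    private static final Logger LOG = Logger.getLogger(SafeMetrics.class.getName());

    private SafeMetrics() {}

    // Returns true if the metric was published, false if it was skipped or dropped.
    static boolean tryAddCount(MetricsWorker worker, String label, long value) {
        if (worker == null) {
            return false; // metrics are optional
        }
        try {
            worker.addCount(label, value);
            return true;
        } catch (RuntimeException e) {
            LOG.warning("Dropping metric " + label + ": " + e.getClass().getSimpleName());
            return false;
        }
    }
}

// Toy data plane client; the "encryption" is a placeholder.
final class BestEffortClient {
    private final MetricsWorker worker;

    BestEffortClient(MetricsWorker worker) {
        this.worker = worker;
    }

    String encrypt(String plaintext) {
        String result = new StringBuilder(plaintext).reverse().toString();
        SafeMetrics.tryAddCount(worker, "encrypt.calls", 1);
        return result; // returned whether or not the metric was published
    }
}
```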


## Requirements

The interface should have three requirements.
Contributor

There are only 2 listed?


This allows customers to test how their applications behave when they start to emit
metrics. Customers can then ask for updates to the implementations
CT provides or customers can go an implement their own interfaces that are fine-tuned
Contributor

Suggested change
CT provides or customers can go an implement their own interfaces that are fine-tuned
CT provides or customers can go and implement their own interfaces that are fine-tuned

metrics to this one worker and to only sometimes capture metrics to this
other worker.

#### No (recommended)

I'd say yes to allow metrics on client construction. It's painful (and easy to forget) to supply the agent on every call. We should keep it optional, but should provide a way to set once and forget.
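
A sketch of "set once and forget" that still leaves room for a per-call worker. All names are assumptions: `MetricsWorker` is a simplified stand-in for the proposed interface, `ConfiguredClient` is a toy client and not S3EC, and the per-call overload is only one way the two options could coexist.

```java
// Hypothetical, simplified shape of the proposed interface.
interface MetricsWorker {
    void addCount(String label, long value);
}

// Toy client: metrics are optional, set once on the builder, overridable per call.
final class ConfiguredClient {
    private final MetricsWorker defaultWorker; // null means "no metrics"

    private ConfiguredClient(Builder builder) {
        this.defaultWorker = builder.metricsWorker;
    }

    static Builder builder() {
        return new Builder();
    }

    // Uses the worker supplied at construction, if any.
    String encrypt(String plaintext) {
        return encrypt(plaintext, null);
    }

    // A non-null per-call worker takes precedence over the client-level one.
    String encrypt(String plaintext, MetricsWorker perCallWorker) {
        MetricsWorker worker = perCallWorker != null ? perCallWorker : defaultWorker;
        if (worker != null) {
            worker.addCount("encrypt.calls", 1);
        }
        return "ciphertext(" + plaintext + ")"; // placeholder, no real encryption
    }

    static final class Builder {
        private MetricsWorker metricsWorker;

        Builder metricsWorker(MetricsWorker worker) {
            this.metricsWorker = worker;
            return this;
        }

        ConfiguredClient build() {
            return new ConfiguredClient(this);
        }
    }
}
```

Usage would look like `ConfiguredClient.builder().metricsWorker(worker).build()`, after which plain `encrypt(plaintext)` calls report to that worker without the caller passing it again.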

this._bufferSize = builder._bufferSize;
this._instructionFileConfig = builder._instructionFileConfig;
this._commitmentPolicy = builder._commitmentPolicy;
+ this._metricsWorker = builder._metricsWorkerl
Contributor

Suggested change
+ this._metricsWorker = builder._metricsWorkerl
+ this._metricsWorker = builder._metricsWorker;


## Requirements

The interface should have three requirements.
Contributor Author

Suggested change
The interface should have three requirements.
The interface should have two requirements.

| ESDK | T.B.D | [ESDK.smithy](https://github.com/aws/aws-encryption-sdk/blob/mainline/AwsEncryptionSDK/dafny/AwsEncryptionSdk/Model/esdk.smithy) |
| MPL | T.B.D | [material-provider.smithy](https://github.com/aws/aws-cryptographic-material-providers-library/blob/main/AwsCryptographicMaterialProviders/dafny/AwsCryptographicMaterialProviders/Model/material-provider.smithy) |
| DB-ESDK | T.B.D | [DynamoDbEncryption.smithy](https://github.com/aws/aws-database-encryption-sdk-dynamodb/blob/main/DynamoDbEncryption/dafny/DynamoDbEncryption/Model/DynamoDbEncryption.smithy) |

Contributor

S3EC..?


### count

A count is a long value
Contributor

Suggested change
A count is a long value
A count is a long value.

Comment on lines +121 to +123
of the issues that are described above, (e.g. handling failing requests, perform
blocking requests to CT libraries, use a separate thread/thread pool that handles
these request). By providing a wrapper around a language's most popular logging
Contributor

Issue:

> Agreement that Metrics Agent Interface and Implementation will be implemented in Dafny but only as wrappers and provide extern implementations to make moving off of Dafny easier.

I am concerned that Dafny will not allow for non-blocking requests as it does not have async syntax, nor does it have concurrent syntax.

This list is not exhaustive. Any downstream consumer of any API or client configuration SHOULD
also be updated as part of this proposed changed.

| API/ Configuration |
Contributor

Issue: This list is missing the Cryptographic Materials Cache's APIs/Configuration.
I strongly believe that we should offer insights into cache hits/misses, and therefore need to support the CMC's operations.

@required
materials: EncryptionMaterials,

+ metricsWorker: aws.cryptography.materialProviders#MetricsWorkerReference
Contributor

I wonder if this should be a Metrics instance rather than the worker?

@extendable
resource MetricsWorker {
operations: [
AddDate,

How do we integrate these with something like CoralMetricsWorker?

// Common output structure
structure AddOutput {}

@aws.polymorph#reference(resource: MetricsWorker)

Do we want to use these custom traits given that we are planning to move away from custom smithy-dafny?

transactionId: String
}

// Common output structure
Contributor

In at least Coral Metrics and SLF4J (not sure about other languages) all of these API operations return void. I figure we're doing this for future extensibility, but it might be a little annoying for implementors, as now they need to return something instead of void. This could set up NPEs way down the line.
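
To illustrate the trade-off, here is a sketch of one way to keep an output type for extensibility while letting implementors write the void form. It assumes a Java rendering in which the empty `structure AddOutput {}` becomes an empty class; how the code generator would actually render it is not shown in this PR. `OutputReturningWorker`, `VoidWorker`, and `VoidWorkerAdapter` are hypothetical names.

```java
// Assumed Java rendering of the empty `structure AddOutput {}`.
final class AddOutput {
}

// Interface shape implied by the model: every operation returns an output.
interface OutputReturningWorker {
    AddOutput addCount(String label, long value);
}

// Shape implementors of typical logging/metrics libraries are used to.
interface VoidWorker {
    void addCount(String label, long value);
}

// Adapter: implementors write the void form, callers always get a non-null output.
final class VoidWorkerAdapter implements OutputReturningWorker {
    private static final AddOutput EMPTY = new AddOutput();
    private final VoidWorker delegate;

    VoidWorkerAdapter(VoidWorker delegate) {
        this.delegate = delegate;
    }

    @Override
    public AddOutput addCount(String label, long value) {
        delegate.addCount(label, value);
        return EMPTY; // never null, so callers cannot hit an NPE on the result
    }
}
```

If the library shipped an adapter like this, a custom implementation that forgets to return a value would be a compile-time non-issue rather than a null surfacing later.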

7 participants