Skip to content

Conversation

@avilagaston9
Copy link
Contributor

@avilagaston9 avilagaston9 commented Nov 22, 2024

Improve Telemetry Spans

Motivation

We found that sometimes our Batcher tries to cancel batches that were actually included in the net, calling the batcherTaskCreationFailed endpoint, which finalizes the trace and prevents the Aggregator from registering its spans in the trace.

Description

  • Stops finalizing the trace when a batcherTaskCreationFailed occurs.
  • Converts all events into spans to visualize them in real-time, eliminating the need to wait for the parent span to finish.

Observations

On a real batcherTaskCreationFailed, the Aggregator won't receive the new task, and the trace will remain unfinished. Furthermore, the trace metadata won't be removed from the Telemetry server store. Despite that, we will be able to visualize the orphans spans with a warning that their parent ID is invalid.

#1477 was created to address this issue.

How To Test

  1. Check that everything is working normally:

Run anvil, all Aligned components with one or more operators and start telemetry:

make telemetry_full_start

Go to jaeger and explore the generated traces.

image

  1. Check the scenario addressed in this PR:

Change the Batcher create_new_task_retryable function in batcher/aligned-batcher/src/retry/batcher_retryables.rs:165 to return an error after receiving the receipt:

 // timeout to prevent a deadlock while waiting for the transaction to be included in a block.
    let _result = timeout(Duration::from_millis(transaction_wait_timeout), pending_tx)
        .await
        .map_err(|e| {
            warn!("Error while waiting for batch inclusion: {e}");
            RetryError::Permanent(BatcherError::ReceiptNotFoundError)
        })?
        .map_err(|e| {
            warn!("Error while waiting for batch inclusion: {e}");
            RetryError::Permanent(BatcherError::ReceiptNotFoundError)
        })?
        .ok_or(RetryError::Permanent(BatcherError::ReceiptNotFoundError));
    Err(RetryError::Permanent(BatcherError::ReceiptNotFoundError))

Then, start all components again and you should be able to see the Aggregator spans even when the Batcher sends Batcher - Task Creation Failed

image

Type of change

Please delete options that are not relevant.

  • Bug fix
  • Refactor

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operator for both versions and do not change batcher/docs/examples
    • This PR updates batcher and docs/examples to the newer version. This requires the operator are already updated to be compatible

@avilagaston9 avilagaston9 changed the base branch from testnet to staging November 22, 2024 13:35
@avilagaston9 avilagaston9 marked this pull request as ready for review November 22, 2024 16:12
@avilagaston9 avilagaston9 self-assigned this Nov 22, 2024
@PatStiles
Copy link
Contributor

Ran on local devnet and observed the Aggregator spans even when Batcher - Task Creation Failed was thrown.

@Oppen
Copy link
Contributor

Oppen commented Nov 22, 2024

I don't think a span for every event is the right solution here. I believe we still want real spans, just not a single one. The times for the spans seem off as well, and it looks like the aggregator sees the future here:
image
Still, if it works better than what we have I think we should merge it and review the solution later.

@avilagaston9
Copy link
Contributor Author

avilagaston9 commented Nov 22, 2024

I don't think a span for every event is the right solution here. I believe we still want real spans, just not a single one. The times for the spans seem off as well, and it looks like the aggregator sees the future here: image Still, if it works better than what we have I think we should merge it and review the solution later.

The problem with the events is that they are associated with a parent span, but if for some reason we don't finalize it, we lose all the associated events.
It seems that the task event arrives at the Aggregator before the receipt arrives at the Batcher. This behavior is also observed in staging, but I don't consider this a problem.
IMO, we should separate this into two traces in the future, one for each component, removing the dependency between these two. I have created #1477 to revisit this solution later.
@Oppen

Copy link
Member

@MarcosNicolau MarcosNicolau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on macos and it is working well.

@uri-99 uri-99 added this pull request to the merge queue Nov 27, 2024
@JuArce JuArce removed this pull request from the merge queue due to a manual request Nov 27, 2024
@avilagaston9 avilagaston9 marked this pull request as draft November 28, 2024 17:22
@avilagaston9
Copy link
Contributor Author

Addressed in #1670 .

@JuArce JuArce deleted the improve-telemetry-spans branch January 2, 2025 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants