@nishanthp

Summary

Adds a to_parquet() method to OtelTracesSqlEngine that exports trace data in Apache Parquet format, enabling efficient storage and high-performance analytics.

Motivation

Currently, trace data can only be exported to SQL databases or kept in memory as pandas DataFrames. Long-term storage, archival, and sharing of trace datasets call for a performant columnar file format.

Changes

  • Added to_parquet() method with:
    • Configurable compression algorithms (snappy, gzip, brotli, lz4, zstd)
    • Optional partitioning support for efficient filtering
    • Automatic date column extraction for time-based partitioning
  • Comprehensive test suite with 10+ test cases covering:
    • Basic export functionality
    • Multiple compression algorithms
    • Partitioning by service and date
    • Data integrity validation
    • Edge cases (empty dataframes, file size efficiency)
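
As a rough illustration of the behavior listed above, here is a minimal sketch of such an export. It is not the code from this PR: the helper names add_partition_date and export_traces_to_parquet are hypothetical, and it assumes start_time holds microseconds since the Unix epoch (as in the diff under review). Writing Parquet through pandas also requires an engine such as pyarrow to be installed.

```python
from typing import Optional, Sequence

import pandas as pd

# Codecs named in the PR description.
SUPPORTED_COMPRESSION = {"snappy", "gzip", "brotli", "lz4", "zstd"}


def add_partition_date(
    df: pd.DataFrame, partition_cols: Optional[Sequence[str]]
) -> pd.DataFrame:
    """Return a copy of df with a 'date' column if 'date' is a partition key."""
    out = df.copy()  # leave the caller's DataFrame untouched
    if partition_cols and "date" in partition_cols:
        # Assumption: start_time is in microseconds since the Unix epoch.
        out["date"] = pd.to_datetime(out["start_time"], unit="us").dt.date
    return out


def export_traces_to_parquet(
    df: pd.DataFrame,
    path: str,
    compression: str = "snappy",
    partition_cols: Optional[Sequence[str]] = None,
) -> None:
    """Write df to Parquet, optionally partitioned (hypothetical helper)."""
    if compression not in SUPPORTED_COMPRESSION:
        raise ValueError(
            f"Unsupported compression {compression!r}; "
            f"expected one of {sorted(SUPPORTED_COMPRESSION)}"
        )
    out = add_partition_date(df, partition_cols)
    # With partition_cols set, pandas writes a directory of key=value folders
    # instead of a single file.
    out.to_parquet(
        path,
        compression=compression,
        partition_cols=list(partition_cols) if partition_cols else None,
        index=False,
    )
```

Rejecting an unknown codec before touching the filesystem keeps a typo from leaving a partially written dataset behind.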

Member

@AstraBert AstraBert left a comment

The change is OK, but its usefulness is not clear to me: the to_parquet method is not used within the Streamlit application. I imagined that you wanted to use it to download the observability data, but as it stands it is just an additional method with no direct value for the user.

Comment on lines +168 to +170
# Add date column for partitioning if needed
if partition_cols and "date" in partition_cols:
    df["date"] = pd.to_datetime(df["start_time"], unit="us").dt.date
Member

Why is adding the "date" column needed? Can't we just convert the start_time one to datetime?
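
One consideration relevant to this question, sketched with made-up data: Parquet partitioning creates one directory per distinct value of the partition column, so partitioning on a full-resolution timestamp could yield close to one partition per span, while a day-granularity column yields one per day. The example assumes start_time holds microseconds since the Unix epoch; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical spans: three on one day, one on the following day.
df = pd.DataFrame(
    {
        "start_time": [
            1_700_000_000_000_000,
            1_700_000_000_500_000,
            1_700_000_001_000_000,
            1_700_086_400_000_000,
        ]
    }
)

timestamps = pd.to_datetime(df["start_time"], unit="us")

# Number of partitions if the converted timestamp were used directly.
partitions_by_timestamp = timestamps.nunique()

# Number of partitions with a day-granularity column, as in the diff.
partitions_by_date = timestamps.dt.date.nunique()

# Alternative that keeps a datetime64 dtype: truncate each value to midnight.
partitions_by_normalized = timestamps.dt.normalize().nunique()

print(partitions_by_timestamp, partitions_by_date, partitions_by_normalized)
```

Either the dt.date column or a normalized datetime column gives day-level partitions; the choice mainly affects the dtype stored in the partition key.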

Member

Plus, there is no validation of the partition columns, meaning that they could also include columns that are not in the DataFrame.
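
A minimal sketch of the kind of check this comment asks for, assuming "date" is the only column the export derives on its own. The function name validate_partition_cols and the DERIVED_COLUMNS set are hypothetical and not part of the PR.

```python
from typing import List, Optional, Sequence

import pandas as pd

# Assumption: 'date' is derived from start_time during export, so it may be
# requested even though it is not yet a column of the DataFrame.
DERIVED_COLUMNS = {"date"}


def validate_partition_cols(
    df: pd.DataFrame, partition_cols: Optional[Sequence[str]]
) -> List[str]:
    """Return partition_cols as a list, raising if any column is unknown."""
    if not partition_cols:
        return []
    missing = [
        col
        for col in partition_cols
        if col not in df.columns and col not in DERIVED_COLUMNS
    ]
    if missing:
        raise ValueError(
            f"Unknown partition columns {missing}; "
            f"available columns: {list(df.columns)}"
        )
    return list(partition_cols)
```

Calling this before any column is added or any file is written means a bad column name fails with a clear message rather than an error from the Parquet engine.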

@@ -0,0 +1,32 @@
import pandas as pd
Member

The PR description says there are 10+ test cases, but here I only see one: is there another test file you did not commit?

Author

Yeah, it will be in the next patch.

@nishanthp
Author

The change is OK, but its usefulness is not clear to me: the to_parquet method is not used within the Streamlit application. I imagined that you wanted to use it to download the observability data, but as it stands it is just an additional method with no direct value for the user.

I wanted to get your opinion on the approach before adding it to the Streamlit application.

If the overall approach looks good, I will add the rest of the changes in the next patch.
