We're changing database #408

@samuelcolvin

Rollout

We're gradually rolling out queries to the new database now. If you're affected, you'll see a banner like this:

[Screenshot: warning banner]

If you notice queries taking longer or returning errors or different results, please let us know below or contact us via email or Slack.

If you need to continue querying the old database, you can do so by right-clicking on your profile picture in the top right and setting the query engine to 'TS' (Timescale, the old database):

[Screenshot: query engine setting in the profile menu]

To get rid of the warning banner, set the query engine to 'TS' and then back to 'FF' (FusionFire, the new database) again.

We will be increasing the percentage of users whose default query engine is FF over time and monitoring the impact. We may decrease it again if we notice problems. If you set a query engine explicitly to either TS or FF, this won't affect you. Otherwise, your query engine may switch back and forth. For most users, there shouldn't be a noticeable difference.

Most queries should be faster with FF, especially if they aggregate lots of data over a long time period. If your dashboards were timing out before with TS, try using FF. However, some specific queries that are very fast with TS are slower with FF. In particular, TS can look up trace and span IDs almost instantly without needing a specific time range. If you click on a link to a trace/span ID in a table, it will open the live view with a time range of 30 days because it doesn't know any better. If this doesn't load, reduce the time range.

Summary

We're changing the database that stores observability data in the Logfire platform from Timescale to a custom database built on Apache Datafusion.

This should bring big improvements in performance, but will lead to some SQL compatibility issues initially (details below).

Background

Timescale is great: it can be really performant when you know the kind of queries you regularly run (so you can set up continuous aggregates) and when you can enable its compression features (which both save money and make queries faster).

Unfortunately we can't use either of those features:

  • our users can query their data however they like using SQL, so continuous aggregates aren't that helpful
  • Timescale's compression features are incompatible with row level permissions, and in Timescale/PostgreSQL we have to have row level permissions since we're running users' SQL directly against the database

Earlier this year, as the volume of data the Logfire platform received increased in the beta, these limitations became clearer and clearer.

The other more fundamental limitation of Timescale was their open/closed source business model.

The ideal data architecture for us (and any analytics database I guess) is separated storage and compute: data is stored in S3/GCS as parquet (or equivalent), with an external index used by the query/compute nodes. Timescale has this, but it's completely closed source. So we can either get a scalable architecture but be forced to use their SaaS, or run Timescale as a traditional "coupled storage and compute" database ourselves.

For lots of companies either of those solutions would be satisfactory, but if Logfire scales as we hope it does, we'd be scuppered with either.
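As a rough illustration of the separated storage and compute idea (a toy sketch, not how Timescale or Fusionfire actually work), the snippet below keeps rows in a stand-in "object store" and uses a small external index of per-file time ranges to decide which files a query has to read. All names here (`OBJECT_STORE`, `INDEX`, `FileStats`, `files_for_range`, `find_trace`) and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Index entry for one stored file: the time range its rows cover."""
    min_ts: int
    max_ts: int


# Hypothetical "object store": file name -> rows of (timestamp, trace_id, message).
# In a real system these would be Parquet files in S3/GCS.
OBJECT_STORE = {
    "part-0.parquet": [(1, "a1", "start"), (5, "b2", "query")],
    "part-1.parquet": [(10, "c3", "render"), (14, "a1", "finish")],
    "part-2.parquet": [(20, "d4", "start"), (25, "e5", "error")],
}

# Hypothetical external index, kept outside the data files themselves.
INDEX = {
    name: FileStats(min(r[0] for r in rows), max(r[0] for r in rows))
    for name, rows in OBJECT_STORE.items()
}


def files_for_range(start: int, end: int) -> list[str]:
    """Return only the files whose time range overlaps [start, end]."""
    return [
        name
        for name, stats in INDEX.items()
        if stats.max_ts >= start and stats.min_ts <= end
    ]


def find_trace(trace_id: str, start: int, end: int) -> tuple[list[tuple], int]:
    """Look up a trace ID; return the matching rows and how many files were read."""
    names = files_for_range(start, end)
    rows = [
        row
        for name in names
        for row in OBJECT_STORE[name]
        if row[1] == trace_id and start <= row[0] <= end
    ]
    return rows, len(names)


# A wide time range forces every file to be read; a narrow one reads a single file.
wide_rows, wide_files = find_trace("a1", 0, 30)
narrow_rows, narrow_files = find_trace("a1", 10, 15)
print(wide_files, narrow_files)
```

In this toy model a lookup by trace ID alone cannot skip any files, because the index only knows about time ranges. That is one plausible reason why narrowing the time range helps this kind of query.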

Datafusion

We settled on Datafusion as the foundation for our new database for a few reasons:

  1. It's completely open source so we can build the separated storage and compute solution we want
  2. It's all Rust, and quite a few of our team are comfortable writing Rust, so the database isn't just a black box: we can dive in and improve it as we wish. As an example, Datafusion didn't have JSON querying support until we implemented it in datafusion-functions-json. Since starting to use Datafusion, our team has contributed 20 or 30 pull requests to Datafusion and associated projects like arrow-rs and sqlparser-rs
  3. Datafusion is extremely extensible: we can adjust the SQL syntax and how queries are planned and run, and build indexes exactly as we need them
  4. Datafusion's SQL parser has pretty good compatibility with Postgres, and again, it's just Rust so we can improve it fairly easily
  5. The project is excellently run, part of Apache, leverages the Arrow/Parquet ecosystem, and is used by large organizations like InfluxDB, Apple and Nvidia

Transition

For the last couple of months we've been double-writing to Timescale and Fusionfire (our cringey internal name for the new Datafusion-based database), working on improving reliability and performance of Fusionfire for all types of queries.

Fusionfire is now significantly (sometimes >10x) faster than Timescale for most queries. There are a few low-latency queries on very recent data which are still faster on Timescale; we're working on improving those.

Currently, the live view, explore view, dashboards and alerts use Timescale by default. You can try Fusionfire now for everything except alerts by right-clicking on your profile picture in the top right and selecting "FF" as the query engine.

In the next couple of weeks we'll migrate fully to Fusionfire and retire Timescale.

We're working hard to make Fusionfire more compatible with PostgreSQL (see apache/datafusion-sqlparser-rs#1398, apache/datafusion-sqlparser-rs#1394, apache/datafusion-sqlparser-rs#1360, apache/arrow-rs#6211, apache/datafusion#11896, apache/datafusion#11876, apache/datafusion#11849, apache/datafusion#11321, apache/arrow-rs#6319, apache/arrow-rs#6208, apache/arrow-rs#6197, apache/arrow-rs#6082, apache/datafusion#11307), but there are still a few expressions which currently don't run correctly (a lot related to intervals):

If you notice any other issues, please let us know on this issue or a new issue, and we'll let you know how quickly we can fix it.
