Data Formulator 0.2 now supports working with large datasets, powered by a backend database!
Demonstration: Exploration of Metacritic's Best Games and Reviews - 2025
- This Kaggle dataset contains 13k+ games and 1.6M+ reviews of the best games according to Metacritic.
- Data source: https://www.kaggle.com/datasets/davutb/metacritic-games
- Exploration:
- What's the relation between user scores and critic scores?
- Which games have very high user scores but very low critic scores?
- How does the score distribution compare between critics and users?
df-demo-game-reviews.mp4
Release details: data visualization with large datasets.
Data Formulator integrates DuckDB as a local backend database to support data exploration with large datasets (millions of rows). You can also connect external databases through DuckDB. Not all connection types are supported yet, but this is a start!
- Upload large datasets to the local database, or connect to existing databases (MySQL or PostgreSQL) to work with large data.
- A sample of the data is pulled to the frontend for exploration. You can roll the dice 🎲 or sort the data by different columns to view different samples.
- Manage local database with the Database manager.
- Interaction with Data Formulator as usual:
- Use drag and drop to specify a chart, and Data Formulator dynamically generates a SQL query to fetch the data needed to instantiate it. This process is quite fast!
- Specify new visualization fields or provide natural-language (NL) instructions as usual, and the newly introduced NL2SQL agents generate SQL queries from your instructions to prepare the data and create visualizations.
- Anchor a dataset, follow up, join some tables, and you can dive deep into insights pretty fast!
- (Minor feature updates)
- Updated how derived concepts work in Data Formulator: the data transformation is executed in the backend, and the updated data is appended to the new dataset. New concepts can be applied directly to the new dataset in one click.
- Improved system performance with configurable sandboxing options (main process versus subprocess) for LLM-generated code (~3s interaction time reduction).
- Configurable default visualization size in the main panel.
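As a rough illustration of the drag-and-drop step described above, the sketch below shows how a chart encoding (an x field, a y field, and an aggregate) could be turned into a SQL aggregation query. This is not Data Formulator's actual implementation: the `chart_to_sql` helper, the `reviews` table, its columns, and the toy rows are all hypothetical, and Python's built-in `sqlite3` stands in for DuckDB so the example runs without extra dependencies.

```python
import sqlite3

ALLOWED_AGGS = {"AVG", "SUM", "MIN", "MAX", "COUNT"}


def quote_ident(name):
    """Quote an identifier so field names cannot break out of the query."""
    return '"' + name.replace('"', '""') + '"'


def chart_to_sql(table, x, y, agg="AVG"):
    """Hypothetical helper: build a GROUP BY query for a chart with
    x as the grouping field and an aggregate of y as the measure."""
    agg = agg.upper()
    if agg not in ALLOWED_AGGS:
        raise ValueError(f"unsupported aggregate: {agg}")
    alias = quote_ident(f"{y}_{agg.lower()}")
    return (
        f"SELECT {quote_ident(x)}, {agg}({quote_ident(y)}) AS {alias} "
        f"FROM {quote_ident(table)} "
        f"GROUP BY {quote_ident(x)} ORDER BY {quote_ident(x)}"
    )


# Toy placeholder rows (not taken from the Metacritic dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (platform TEXT, user_score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("pc", 8.0), ("pc", 6.0), ("console", 9.0)],
)

query = chart_to_sql("reviews", x="platform", y="user_score", agg="AVG")
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Because the aggregation runs inside the database, only the grouped result is returned to the caller, which is the general reason pushing chart queries to a backend helps with large tables.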
More explorations on the demo dataset:
- What are your favorite games, and how have their reviews changed over time?
- Which franchise has consistently improved its reviews?
- Which games have the most divergent reviews across platforms?
- Which games have many positive critic reviews but no users who bothered to play them?
- What about review trends for No Man's Sky?
Well, it is time to upgrade Data Formulator and play with it! Let us know what you come up with :)