PyQuery is a local-first data operating system that auto-heals broken CSVs, includes a native Code Editor, and processes 100GB+ files without breaking a sweat. ⚡
We built a suite of tools so perfect it hurts.
| Path | Vibe | Description | Link |
|---|---|---|---|
| CLI | 🏎️ Speedrun | The Headless Beast. Run data pipelines in your sleep. | CLI Manual |
| UI | 🎨 Creative | The Visual Studio. Drag, drop, analyze, visualize. | UI Guide |
| API | 📡 Backbone | The Server. Build your own apps on our engine. | API Docs |
| SDK | 🐍 Sorcery | The Python Library. For the code wizards. | SDK Guide |
✨ New Drop: Headless Ghost Mode 👻 PyQuery now supports total Headless Automation. Run massive pipelines in CI/CD, schedule tasks, and bypass the UI entirely with the re-architected `run` command.
- Install it: `pip install pyquery-polars` (Don't be basic).
- Run it: `pyquery ui` (Visuals) or `pyquery run` (Speedrun/Headless).
- The Flex: It's a local-first, privacy-focused engine that eats Excel sheets and CSVs for breakfast using Rust.
Long ago, the Data World was mid. Analysts lived in fear of the MemoryError. They bowed before the single-threaded tyranny of the Old Gods (Pandas). They accepted their fate of freezing screens, crashing kernels, and waiting 4 hours for a simple groupby.
But I refused.
From the depths of the Rusty abyss, PyQuery has awakened. I am not just an ETL tool anymore. I am the entire war room. I am here to obliterate your bottlenecks and ratio your old benchmarks.
- Lazy Execution: Nothing computes until you say "Export". This optimizes memory and speed so your hardware doesn't scream.
- Zero-Copy: Data is processed efficiently without redundant copies. We don't waste bits.
- Strict & Clean: Enforces strict typing and argument validation. No ambiguous magic, just pure logic.
- Automation First: While the UI is gorgeous, PyQuery is built to run alone in the dark.
We don't usually punch down, but you handed us the gloves.
| Feature | ⚡ PyQuery (The Chad) | 🐢 Power Query (The Virgin) |
|---|---|---|
| Speed | Rust-Powered. Processes millions of rows before you blink. | Single-Threaded. Spends 20 mins saying "Loading Data..." just to crash. |
| Language | Python/SQL/Polars. The languages of gods. | M-Code. A language invented to punish humanity. |
| AI/ML | Built-in. Random Forests, Clustering, & Monte Carlo Sims. | Non-existent. You need a generic "AI Plugin" that costs extra. |
| Vibe | Dark Mode CLI & Streamlit. Cyberpunk aesthetic. | Corporate Grey. It sucks the soul out of your body. |
| Price | Free & Open Source. | Requires an Office 365 License (Subscription L). |
| Boot XP | Cinematic CLI with Themes & Logs | Static Spinner of Doom |
| Broken CSVs | Auto-healed at ingest | Crashes silently |
| One Bad File | Isolated & corrected | Pipeline dead |
| Headless | Full CLI Automation. Designed for CI/CD pipelines. | UI Dependent. Good luck automating that in a Linux shell. |
This is not a command line. This is a startup ritual.
Every time PyQuery boots, it behaves like a data OS coming online.
The CLI dynamically switches color gradients, borders, and mood based on your selected boot mode. Each theme announces itself during startup. You feel it before you run anything.
- Cyberpunk: (Default) Neon main-character energy.
- Rustacean: Pure Polars lore.
- Matrix: Hacker-core, green text supremacy.
- Villain Arc: Purple & gold. No mercy.
The CLI has been completely re-architected for Automation Supremacy. The run command is your primary entry point for headless operations.
```bash
# Basic Speedrun
pyquery run --source data.csv --output results.parquet

# Project Mode (Load the whole squad)
pyquery run --project daily_report.pyquery --output dist/
```
- Source Mode (`--source`): Quick ad-hoc processing of single files, SQL queries, or APIs.
- Project Mode (`--project`): Load a predefined `.pyquery` project file containing multiple datasets and recipes.
Note: These flags are mutually exclusive. Choose your path.
Real-time kernel-style logs with cinematic pacing. It doesn't say "loading"... It declares intent.
- Timestamped steps.
- Module icons (⚡ Engine, 💾 IO, 🧠 Planner).
- Your terminal doesn't just start PyQuery. It witnesses it.
Sidebars are for tourists. PyQuery loads data through dedicated modal dialogs, because loading data is a moment, not a side quest.
- Blazing-Fast & Optimistic: The dialog opens instantly.
- Lazy Preview: We scan 100k+ files without freezing the UI (see the sketch after this list).
- Recent Paths: We remember so you don't have to.
- Preview Before Commit: See matched files and sheets before you import. You don't guess anymore; you confirm with intent.
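To make the Lazy Preview bullet concrete, here is a minimal sketch of how a lazy preview can work on the Polars engine PyQuery is built on. The file path and glob are illustrative, and this is not PyQuery's internal code.

```python
import polars as pl

# Build a lazy query over every CSV in a folder; nothing is read from disk yet.
# (Path and glob pattern are illustrative.)
preview_plan = pl.scan_csv("exports/*.csv")

# Materialize only the first 100 rows for the preview dialog.
# Polars pushes the limit down, so multi-GB folders stay untouched on disk.
preview_df = preview_plan.head(100).collect()
print(preview_df)
```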
We built an empire so you can rule yours. This isn't just software; it's a lifestyle.
"Most tools describe the past. PyQuery predicts the future."
EDA is no longer just "looking at data". It's hunting.
We scan your data's soul.
- Missing Cells: We don't just count nulls; we judge them. (<1% is excellence, >10% is sloppy).
- Cardinality Checks: Instantly know if a column is categorical or continuous.
- Duplicate Detection: We find the clones and eliminate them.
- Strategic Brief: A "Top 3 Insights" card that ranks every signal in your data. It whispers: "The money is here."
- Automated Drivers: It finds the hidden variables controlling your target.
- "Why is Churn high? It's not Price. It's Customer Support Wait Time > 5m." -> Boom. Solved.
- Correlation Matrix: Pearson, Cramér's V, and F-Tests calculated automatically. We know the relationships better than you know your own situationship.
- Auto-Pilot Mode: Trains an army of models (Random Forest, Lasso, Ridge) to find the best fit. You sit back and look busy.
- Clustering (Unsupervised Rizz): Elbow Plots & Silhouette Scores optimization. We even name the segments for you ("Cluster 1 = High Spend, Low Age").
- Explainable Anomalies: Uses Isolation Forests to catch the weirdos and fraudsters instantly, with a Contextual Profiler to tell you why they are weird.
- "What-If" Sliders: Change variables in real-time. "If I raise Price by 10% and lower ad spend, do I still profit?"
- Monte Carlo Sims: Run 1,000+ simulations. We don't guess; we calculate the probability of your success. (A sketch follows this list.)
- Waterfall Analysis: The Model breaks down exactly why the prediction changed.
- Holt-Winters Forecasting: Predicting the future with confidence intervals.
- Decomposition: Splitting data into Trend, Seasonality, and Noise.
- Cohort Comparison: Volcano Plots visualizing "Effect Size" vs "Significance." We bring the science.
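For a flavor of what the Monte Carlo bullet above boils down to, here is a standalone NumPy sketch. The scenario, distributions, and numbers are made up for illustration; they are not PyQuery's actual simulator.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_sims = 10_000

# Hypothetical what-if scenario: raise price 10%, cut ad spend, demand gets noisy.
price = 110.0                                           # raised from 100
units = rng.normal(loc=950, scale=120, size=n_sims)     # uncertain demand
ad_spend = rng.normal(loc=8_000, scale=500, size=n_sims)
unit_cost = 60.0

profit = units * (price - unit_cost) - ad_spend

print(f"P(profit > 0)      : {(profit > 0).mean():.1%}")
print(f"Median profit      : {np.median(profit):,.0f}")
print(f"5th-95th percentile: {np.percentile(profit, [5, 95]).round(0)}")
```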
For those who speak the language of the gods (Python/SQL), we built a React-based Code Editor right inside the UI.
- Embedded Ace Editor: Syntax highlighting, line numbers, and active line focus. Feels like VS Code, lives in your browser.
- Intelligent Auto-Completions: Context-aware suggestions for `pl`, `np`, `math`. Type `col`, get `col("name")`. It knows your schema.
- Sandboxed Custom Scripts (a sketch of the validation idea follows this list):
  - AST-Validated Security: We parse your code before execution.
  - Blocked: `import os`, private attributes, system calls.
  - Allowed: `numpy`, `scipy`, `sklearn`. Pure math and logic only.
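Here is a minimal sketch of the AST-validation idea described above. The allowlist, blocklist, and function name are simplified stand-ins; PyQuery's real rule set may differ.

```python
import ast

ALLOWED_IMPORTS = {"numpy", "scipy", "sklearn", "math"}   # per the list above
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}

def validate_script(source: str) -> None:
    """Walk the AST and reject disallowed imports, dunder access, and blocked calls."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    raise ValueError(f"import blocked: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                raise ValueError(f"import blocked: {node.module}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError(f"private attribute blocked: {node.attr}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in BLOCKED_CALLS:
            raise ValueError(f"call blocked: {node.func.id}")

validate_script("import numpy as np\nresult = np.log1p(10)")   # passes
try:
    validate_script("import os\nos.remove('data.csv')")        # rejected
except ValueError as err:
    print("rejected:", err)
```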
For when the GUI is too easy and you want to flex raw SQL. This isn't SQLite. This is High-Performance Lazy SQL.
- Zero-Lag Querying: Run `SELECT *` on a 50GB file? It pulls a preview instantly. The engine effectively cheats physics.
- Cross-Dataset Joins: Join `sales.csv` with `targets.xlsx` using standard SQL (see the sketch after this list).
- Materialize: Execute complex queries, then save as a new dataset.
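In Polars terms, the cross-dataset join described above can be sketched with `pl.SQLContext`. The column names and aggregation are illustrative, the Excel read needs an Excel reader dependency installed, and this is not PyQuery's internal code.

```python
import polars as pl

# Register both sources; the CSV side is fully lazy, nothing loads yet.
ctx = pl.SQLContext(
    sales=pl.scan_csv("sales.csv"),
    targets=pl.read_excel("targets.xlsx").lazy(),  # requires an Excel reader dependency
)

# Standard SQL across two different file formats (columns are illustrative).
over_target = ctx.execute("""
    SELECT sales.region,
           SUM(sales.amount) AS revenue,
           MAX(targets.target) AS target
    FROM sales
    JOIN targets ON sales.region = targets.region
    GROUP BY sales.region
""")  # returns a LazyFrame: planned, not yet executed

# "Materialize": run the plan and save as a new dataset.
over_target.collect().write_parquet("revenue_vs_target.parquet")
```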
Backend I/O that actually understands real-world data. Real data is cursed. We planned for that.
- 🧬 Advanced Auto-Encoding Healer (see the sketch after this list):
  - Scans the first bytes of every CSV to automatically fix `UnicodeDecodeError`.
  - Stream-Based Healing: Processes multi-GB files in 4MB chunks. Memory usage stays flat.
  - Sanitization: Strips `Null Bytes`, normalizes newlines, and replaces garbage.
- 🧩 Mixed-Encoding Folder Handling:
  - If a folder contains files with different encodings, PyQuery detects it and switches strategy automatically.
  - We isolate. We adapt. We continue.
- 📁 Recursive Folder Globbing (Upgraded):
  - Patterns like `data/**/*.csv` work even when schemas differ slightly or headers are misaligned.
- 🏗️ Staging Ground (Infrastructure Rizz):
  - Control your intermediate storage. If your `%TEMP%` partition is small, tell PyQuery where the real space is using the `PYQUERY_STAGING_DIR` environment variable.

    ```bash
    # Linux/Mac Power Move
    export PYQUERY_STAGING_DIR="/mnt/fast_ssd/pyquery_cache"
    pyquery run ...
    ```
- 🔍 Advanced File Filtering (Precision Strikes):
  - Multiple Filter Types: `Glob`, `Regex`, `Contains`, `Not Contains`, `Exact`, `Is Not`.
  - Stackable Logic: Must contain `sales` + Must NOT contain `backup` + Must match regex `\d{4}`.
  - This is surgical file selection. No more loading junk and cleaning later.
- 📊 Excel Handling That Respects Your Sanity:
  - Multi-Sheet Selection: Load one sheet, many sheets, or only the ones that matter.
  - Template-Based Mapping: Pick a base file, preview its sheets, and apply that selection across all matching files.
  - Sheet Name Filtering: Regex-powered selection like `Q[1-4]_Data`.
- ✨ Source Awareness & Cleanliness:
  - Metadata Injection: Automatically add `__source_path__` and `__source_name__`.
  - Auto Type Inference: Samples data, infers dtypes, and instantly appends a Clean & Cast step.
- ✨ Auto-Typecast: One click scans rows and forcibly converts `Strings` to `Int`, `Float`, or `Date`.
- 🎭 PII Incinerator: Detects and obfuscates credit cards and SSNs. Secrets remain secret.
- 🩹 Smart Impute: Fill the voids. Forward fill, backward fill, median, or specific value injection. No null survives.
- 💥 Explode & Coalesce: Flatten lists and merge columns like a boss.
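The auto-encoding healer above is easiest to picture with a small sketch. This is a simplified stand-in, not PyQuery's actual implementation; the fallback list is an assumption, and only the chunked, stream-based approach mirrors the bullets.

```python
from pathlib import Path

FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")
CHUNK_SIZE = 4 * 1024 * 1024  # process in large chunks so memory stays flat

def sniff_encoding(path: Path) -> str:
    """Try the first bytes of the file against each candidate encoding."""
    head = path.read_bytes()[:CHUNK_SIZE]
    for enc in FALLBACK_ENCODINGS:
        try:
            head.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "latin-1"  # latin-1 never raises; worst case the text is odd but loadable

def heal_csv(src: Path, dst: Path) -> None:
    """Stream-decode, sanitize, and re-write the file as clean UTF-8."""
    enc = sniff_encoding(src)
    # Universal newline mode normalizes \r\n and \r to \n while reading.
    with src.open("r", encoding=enc, errors="replace") as fin, \
         dst.open("w", encoding="utf-8", newline="\n") as fout:
        while chunk := fin.read(CHUNK_SIZE):
            fout.write(chunk.replace("\x00", ""))  # strip null bytes

heal_csv(Path("cursed_export.csv"), Path("healed.csv"))
```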
This isn't just a library. It's a weapon system.
The Old Gods (Pandas) are Eager. They try to swallow the ocean (RAM) whole. They choke. PyQuery is Lazy. It waits. It plans.
- Scan: "It's a 100GB file. Interesting."
- Plan: Filters, joins, math. Nothing executes until the final blow.
- Stream: Data flows in chunks. Process. Write. Destroy.
- Result: Processing 100GB on a MacBook Air. The laws of physics are optional.
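The Scan, Plan, Stream loop above maps almost one-to-one onto the Polars lazy API that PyQuery sits on. A minimal sketch, with an illustrative file and column names:

```python
import polars as pl

# 1. Scan: build a lazy plan; the 100GB file is not read here.
plan = (
    pl.scan_csv("orders_100gb.csv")
    # 2. Plan: filters and aggregations are recorded, nothing executes yet.
    .filter(pl.col("status") == "paid")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("lifetime_value"))
)

# 3. Stream: execute the optimized plan in chunks and write straight to disk,
#    so peak RAM stays far below the input size.
plan.sink_parquet("lifetime_value.parquet")
```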
Most engines think in datasets. PyQuery thinks in files.
- Individual File Processing: Forces the engine to load files one-by-one instead of bulk scanning.
- Why it matters: One corrupted CSV no longer nukes the entire pipeline. We fix schemas and clean data before concatenation. This is how PyQuery survives enterprise-grade mess.
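A hedged sketch of the file-level granularity idea, in plain Polars. The schema fix shown here is deliberately trivial; real repair logic would be more involved.

```python
import glob
import polars as pl

frames = []
for path in glob.glob("exports/*.csv"):
    try:
        df = pl.read_csv(path, infer_schema_length=1000)
    except pl.exceptions.ComputeError as err:
        # One cursed file no longer nukes the pipeline: isolate it and move on.
        print(f"skipping {path}: {err}")
        continue
    # Per-file fixes before concatenation (illustrative: normalize column names).
    frames.append(df.rename({c: c.strip().lower() for c in df.columns}))

# 'diagonal' concat aligns columns by name and fills gaps with nulls,
# so slightly different schemas can still be stacked.
combined = pl.concat(frames, how="diagonal")
print(combined.shape)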
We rewired the backend for scale.
- True Streaming Discovery: Uses generators and lazy iteration. Point at 100k files without crashing.
- Partial Globbing: Simple text filters convert to filesystem-level globs. Python never even sees irrelevant files.
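In spirit, true streaming discovery is just lazy iteration over the filesystem. A tiny sketch with illustrative names:

```python
from pathlib import Path
from typing import Iterator

def discover(root: str, pattern: str = "*.csv", must_contain: str = "") -> Iterator[Path]:
    """Yield matching files one at a time; 100k paths never sit in a list."""
    for path in Path(root).rglob(pattern):   # rglob is itself a generator
        if must_contain in path.name:
            yield path

# Simple text filters become part of the glob / name check, so irrelevant
# files are skipped during the walk instead of being loaded and discarded.
for i, path in enumerate(discover("data", "*.csv", must_contain="sales")):
    if i >= 5:
        break
    print(path)
```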
Python is dynamic (chaotic). PyQuery imposes Order.
- Every step is backed by a Pydantic Model.
- If a `String` tries to infiltrate a `Float` column, it is terminated before execution.
- No runtime surprises. Only calculated victories.
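What "backed by a Pydantic Model" looks like in miniature. The step name and fields below are made up for illustration; they are not PyQuery's actual step schema.

```python
from pydantic import BaseModel, Field, ValidationError

class FillNullsParams(BaseModel):
    """Hypothetical params for a 'Fill Nulls' step; every step gets a model like this."""
    column: str
    strategy: str = Field(pattern="^(forward|backward|median|value)$")
    value: float | None = None

# Valid step: constructed, validated, ready for the engine.
print(FillNullsParams(column="price", strategy="median"))

# Invalid step: a string trying to infiltrate a float field dies before execution.
try:
    FillNullsParams(column="price", strategy="value", value="oops")
except ValidationError as err:
    print(err)
```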
We don't post without proof. We mog the competition.
| Metric | 🐼 Pandas (Legacy) | ⚡ PyQuery (Polars) | The Diff |
|---|---|---|---|
| Load 10GB CSV | `MemoryError` (Crash) 💥 | 0.2s (Lazy Scan) ⚡ | Infinite |
| Filter Rows | 15.4s (Slow) | 0.5s (Parallel) | 30x Faster |
| Group By | 45s (Painful) | 2.1s (Instant) | 20x Faster |
| RAM Usage | 12GB+ (Bloated) | 500MB (Lean) | 95% Less |
Benchmarks run on a standard dev laptop. Results may vary but the vibe remains consistent.
We don't limit you. Dominate however you choose.
```bash
pip install pyquery-polars
```
For when you want to click things, see pretty charts, and feel like a data scientist in a sci-fi movie.
- Visual Recipe Builder: Nodes and edges of pure logic.
- Native File Picker: Access local filesystem directly.
```bash
pyquery ui
# Launches the Web App on localhost:8501 🚀
```
Building a machine? Run PyQuery as the engine.
- Swagger Docs: Auto-generated at `/docs`.
- Async: Fire-and-forget jobs via `POST /recipes/run` (see the example after the snippet below).
```bash
pyquery api
# Serving high-performance ETL over HTTP at localhost:8000 📡
```
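Calling the async endpoint from Python might look like this. The `POST /recipes/run` route comes from the bullets above; the payload shape is an assumption, so treat the live Swagger UI at `/docs` as the source of truth.

```python
import requests

# Payload fields are illustrative; check /docs for the real schema.
job = requests.post(
    "http://localhost:8000/recipes/run",
    json={"recipe": "daily_report.json", "output": "dist/report.parquet"},
    timeout=10,
)
job.raise_for_status()
print(job.json())  # fire-and-forget: grab the job status and move on
```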
For automation. No interface. Just speed.
```bash
pyquery run -s input.csv -r recipe.json -o output.parquet
# Task complete. ⚡
```
For the developers who want to weave PyQuery into their own code.
```python
from pyquery_polars.backend.engine import PyQueryEngine

# Full programmatic control over the recipe engine.
# You are the architect now.
```
Packed with every tool needed to clear the map.
| Category | The Tools | Why it slaps |
|---|---|---|
| Cleaning | Fill Nulls, Mask PII, Smart Extract, Regex | Turns garbage data into gold. ✨ |
| Analytics | Rolling Agg, Time Bin, Rank, Diff, Z-Score | High-frequency trading vibes. 📈 |
| Combining | Smart Join, Concat, Pivot, Unpivot | Merge datasets without the headache. 🤝 |
| Math | Log, Exp, Clip, Date Offset | For the scientific girlies. 👩‍🔬 |
| Text | Slice, Case, Replace, One-Hot | String manipulation on steroids. 💪 |
| I/O | CSV, Parquet, Excel, JSON, IPC | Speaks every language. 🗣️ |
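For a feel of what steps like these compile down to on the engine side, here is roughly what a Rolling Agg plus Z-Score combination means in raw Polars. Column names and data are illustrative; this is not PyQuery's generated code.

```python
import polars as pl

df = pl.DataFrame({
    "day": range(1, 11),
    "sales": [10.0, 12, 9, 14, 30, 11, 13, 12, 10, 15],
})

out = df.with_columns(
    # Rolling Agg: 3-day rolling mean of sales.
    pl.col("sales").rolling_mean(window_size=3).alias("sales_3d_avg"),
    # Z-Score: how unusual each day is relative to the whole column.
    ((pl.col("sales") - pl.col("sales").mean()) / pl.col("sales").std()).alias("sales_z"),
)
print(out)
```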
We aren't stopping here. We are aiming for the moon. 🚀
- Phase 1: Native App Supremacy (Rust + Tauri): The browser has limits. The Native App will have none. GPU-accelerated plotting (10M points at 144Hz) and OLED black themes.
- Phase 2: Big Data Devourer: Cloud connectors (S3, GCS, Azure). We drink their milkshakes.
You want to contribute? Good. We need strong allies.
1. Backend Implementation:
   - Define Params: Create a Pydantic model (`src/pyquery_polars/core/params.py`).
   - Backend Logic: Write a pure Polars function (`src/pyquery_polars/backend/transforms/`).
   - Register: Add the step to `register_all_steps()` in `registry.py`.
2. Frontend Implementation:
   - Create a Renderer Function (`src/pyquery_polars/frontend/steps/`).
   - Register: Add the step to `register_frontend()` in `registry_init.py`.
It appears in the CLI, API, and UI automatically. 🤯
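To make the recipe concrete, here is a hedged sketch of what a new backend step could look like. Every name below (the params class, the transform function, the `register_step` call) is hypothetical and only mirrors the structure of the steps above.

```python
import polars as pl
from pydantic import BaseModel

# 1. Define Params (would live in src/pyquery_polars/core/params.py).
class ClipOutliersParams(BaseModel):
    column: str
    lower_quantile: float = 0.01
    upper_quantile: float = 0.99

# 2. Backend Logic: a pure Polars function (src/pyquery_polars/backend/transforms/).
def clip_outliers(lf: pl.LazyFrame, params: ClipOutliersParams) -> pl.LazyFrame:
    col = pl.col(params.column)
    lo = col.quantile(params.lower_quantile)
    hi = col.quantile(params.upper_quantile)
    clipped = (
        pl.when(col < lo).then(lo)
        .when(col > hi).then(hi)
        .otherwise(col)
        .alias(params.column)
    )
    return lf.with_columns(clipped)

# Quick smoke test on a toy frame.
lf = pl.LazyFrame({"price": [1.0, 2, 3, 4, 1000]})
print(clip_outliers(lf, ClipOutliersParams(column="price")).collect())

# 3. Register: hook it into register_all_steps() in registry.py (call is illustrative).
# register_step("clip_outliers", ClipOutliersParams, clip_outliers)
```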
```bash
# Only certified ballers contribute code.
# Are you up for it?
```
GPL-3.0. Open source forever. 🔓
Made with ❤️, 🦀 (Rust), and 🐍 by Sudharshan TK