⚡ PyQuery: The Main Character of Data Stacks 💫

ETL. EDA. ML. SQL. IDE.

🚩 Stop letting Pandas hold you back.
The single-threaded era is over.

PyQuery is a local-first data operating system that auto-heals broken CSVs, includes a native Code Editor, and processes 100GB+ files without breaking a sweat. ⚡

Feature Request · Report Bug

🎮 The Ecosystem (Choose Your Path)

We built a suite of tools so perfect it hurts.

Path	Vibe	Description	Link
CLI	🏎️ Speedrun	The Headless Beast. Run data pipelines in your sleep.	CLI Manual
UI	🎨 Creative	The Visual Studio. Drag, drop, analyze, visualize.	UI Guide
API	📡 Backbone	The Server. Build your own apps on our engine.	API Docs
SDK	🐍 Sorcery	The Python Library. For the code wizards.	SDK Guide

🧠 TL;DR (For the goldfish attention spans)

✨ New Drop: Headless Ghost Mode 👻
PyQuery now supports total Headless Automation. Run massive pipelines in CI/CD, schedule tasks, and bypass the UI entirely with the re-architected run command.

Install it: pip install pyquery-polars (Don't be basic).
Run it: pyquery ui (Visuals) or pyquery run (Speedrun/Headless).
The Flex: It's a local-first, privacy-focused engine that eats Excel sheets and CSVs for breakfast using Rust.

⛩️ The Awakening (Lore)

Long ago, the Data World was mid. Analysts lived in fear of the MemoryError. They bowed before the single-threaded tyranny of the Old Gods (Pandas). They accepted their fate of freezing screens, crashing kernels, and waiting 4 hours for a simple groupby.

But I refused.

From the depths of the Rusty abyss, PyQuery has awakened. I am not just an ETL tool anymore. I am the entire war room. I am here to obliterate your bottlenecks and ratio your old benchmarks.

The Core Philosophy (Our Ninja Way) 🥷

Lazy Execution: Nothing computes until you say "Export". This optimizes memory and speed so your hardware doesn't scream.
Zero-Copy: Data is processed efficiently without redundant copies. We don't waste bits.
Strict & Clean: Enforces strict typing and argument validation. No ambiguous magic, just pure logic.
Automation First: While the UI is gorgeous, PyQuery is built to run alone in the dark.

Welcome to your Villain Arc. 👹

🧾 PyQuery vs. Power Query: The Roast

We don't usually punch down, but you handed us the gloves.

Feature	⚡ PyQuery (The Chad)	🐢 Power Query (The Virgin)
Speed	Rust-Powered. Processes millions of rows before you blink.	Single-Threaded. Spends 20 mins saying "Loading Data..." just to crash.
Language	Python/SQL/Polars. The languages of gods.	M-Code. A language invented to punish humanity.
AI/ML	Built-in. Random Forests, Clustering, & Monte Carlo Sims.	Non-existent. You need a generic "AI Plugin" that costs extra.
Vibe	Dark Mode CLI & Streamlit. Cyberpunk aesthetic.	Corporate Grey. It sucks the soul out of your body.
Price	Free & Open Source.	Requires an Office 365 License (Subscription L).
Boot XP	Cinematic CLI with Themes & Logs	Static Spinner of Doom
Broken CSVs	Auto-healed at ingest	Crashes silently
One Bad File	Isolated & corrected	Pipeline dead
Headless	Full CLI Automation. Designed for CI/CD pipelines.	UI Dependent. Good luck automating that in a Linux shell.

🖥️ The Main Character CLI (The Experience)

This is not a command line. This is a startup ritual.

Every time PyQuery boots, it behaves like a data OS coming online.

⚡ Adaptive Theme Engine

The CLI dynamically switches color gradients, borders, and mood based on your selected boot mode. Each theme announces itself during startup. You feel it before you run anything.

Cyberpunk: (Default) Neon main-character energy.
Rustacean: Pure Polars lore.
Matrix: Hacker-core, green text supremacy.
Villain Arc: Purple & gold. No mercy.

👻 Headless Revamp: The `run` Command

The CLI has been completely re-architected for Automation Supremacy. The run command is your primary entry point for headless operations.

# Basic Speedrun
pyquery run --source data.csv --output results.parquet

# Project Mode (Load the whole squad)
pyquery run --project daily_report.pyquery --output dist/

🛠️ Execution Modes:

Source Mode (--source): Quick ad-hoc processing of single files, SQL queries, or APIs.
Project Mode (--project): Load a predefined .pyquery project file containing multiple datasets and recipes.

Note: These flags are mutually exclusive. Choose your path.
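
The "choose your path" rule can be sketched with Python's standard argparse module. This is a minimal illustration, not PyQuery's actual CLI code: only the flag names (--source/-s, --project, --output/-o) come from this README, while build_parser and parse_mode are hypothetical helpers invented for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the real parser; only the flag names are from the README."""
    parser = argparse.ArgumentParser(prog="pyquery run")
    # Exactly one of --source / --project must be given.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--source", "-s", help="single file, SQL query, or API")
    mode.add_argument("--project", help="path to a .pyquery project file")
    parser.add_argument("--output", "-o", required=True, help="output file or directory")
    return parser


def parse_mode(argv: list[str]) -> tuple[str, str]:
    """Return which execution mode was chosen and its argument."""
    args = build_parser().parse_args(argv)
    if args.source is not None:
        return ("source", args.source)
    return ("project", args.project)
```

Passing both --source and --project makes argparse exit with a usage error, which is the behavior the note above describes.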

📟 Sequential Boot Logs

Real-time kernel-style logs with cinematic pacing. It doesn’t say "loading"... It declares intent.

Timestamped steps.
Module icons (⚡ Engine, 💾 IO, 🧠 Planner).
Your terminal doesn’t just start PyQuery. It witnesses it.

🧩 Focused UI (Modal Upgrade)

Sidebars are for tourists. PyQuery loads data through dedicated modal dialogs—because loading data is a moment, not a side quest.

Blazing-Fast & Optimistic: The dialog opens instantly.
Lazy Preview: We scan 100k+ files without freezing the UI.
Recent Paths: We remember so you don't have to.
Preview Before Commit: See matched files and sheets before you import. You don't guess anymore; you confirm with intent.

💪 The Flex (Capabilities)

We built an empire so you can rule yours. This isn't just software; it's a lifestyle.

🎯 EDA: The Crystal Ball (Expanded)

"Most tools describe the past. PyQuery predicts the future."

EDA is no longer just "looking at data". It's hunting.

1. 🧬 Dataset DNA & Health Check

We scan your data's soul.

Missing Cells: We don't just count nulls; we judge them. (<1% is excellent, >10% is sloppy).
Cardinality Checks: Instantly know if a column is categorical or continuous.
Duplicate Detection: We find the clones and eliminate them.
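
To make the health check concrete, here is a small pure-Python sketch of the three checks above. It is an assumption-laden illustration, not PyQuery's implementation: the function names are invented, the "acceptable" label for the 1% to 10% band is a guess (the README only names the two extremes), and empty strings are treated as missing.

```python
def missing_ratio(rows: list[dict], column: str) -> float:
    """Share of rows where the column is None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def grade_missing(ratio: float) -> str:
    """Apply the thresholds from the README; the middle label is an assumption."""
    if ratio < 0.01:
        return "excellent"
    if ratio > 0.10:
        return "sloppy"
    return "acceptable"


def cardinality(rows: list[dict], column: str) -> int:
    """Number of distinct values; low counts hint at a categorical column."""
    return len({row.get(column) for row in rows})


def duplicate_count(rows: list[dict]) -> int:
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


sample = [
    {"plan": "pro", "note": None},
    {"plan": "pro", "note": None},
    {"plan": "free", "note": "x"},
    {"plan": "team", "note": "y"},
]
```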

2. 🚀 The Action Engine (ML Strategist)

Strategic Brief: A "Top 3 Insights" card that ranks every signal in your data. It whispers: "The money is here."
Automated Drivers: It finds the hidden variables controlling your target.
- "Why is Churn high? It's not Price. It's Customer Support Wait Time > 5m." -> Boom. Solved.
Correlation Matrix: Pearson, Cramér’s V, and F-Tests calculated automatically. We know the relationships better than you know your own situationship.
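
As a rough idea of how driver ranking could work, the sketch below scores numeric features by absolute Pearson correlation with a target and keeps the top three. It covers Pearson only (no Cramér's V or F-tests), and pearson, rank_drivers, and the sample column names are hypothetical, not part of PyQuery.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_drivers(features: dict[str, list[float]], target: list[float], top: int = 3):
    """Sort features by |correlation with target|, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)[:top]


# Toy data: churn tracks support wait time, not price.
features = {
    "price": [10, 10, 11, 10, 10],
    "wait_minutes": [1, 2, 3, 4, 5],
}
churn = [2, 4, 6, 8, 10]
ranking = rank_drivers(features, churn)
```

Correlation is a screening signal, not proof of causation, so a ranking like this is a starting point for the "why" question rather than the answer.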

3. 🧪 ML Laboratory (The Brain)

Auto-Pilot Mode: Trains an army of models (Random Forest, Lasso, Ridge) to find the best fit. You sit back and look busy.
Clustering (Unsupervised Rizz): Elbow Plot & Silhouette Score optimization. We even name the segments for you ("Cluster 1 = High Spend, Low Age").
Explainable Anomalies: Uses Isolation Forests to catch the weirdos and fraudsters instantly, with a Contextual Profiler to tell you why they are weird.

4. 🎮 Decision Simulator (The Time Machine)

"What-If" Sliders: Change variables in real-time. "If I raise Price by 10% and lower ad spend, do I still profit?"
Monte Carlo Sims: Run 1,000+ simulations. We don't guess; we calculate the probability of your success.
Waterfall Analysis: The model breaks down exactly why the prediction changed.
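
The "What-If" question above can be framed as a Monte Carlo run. The sketch below is only an illustration under stated assumptions: demand responds linearly to a price change through an uncertain elasticity drawn from a normal distribution, and every name and parameter here is invented for the example rather than taken from PyQuery.

```python
import random


def profit_probability(
    base_price: float,
    base_units: float,
    unit_cost: float,
    price_change: float,      # 0.10 means "raise price by 10%"
    elasticity_mean: float,   # assumed: % change in units per % change in price
    elasticity_sd: float,
    runs: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of simulated scenarios where profit beats the current baseline."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    baseline = (base_price - unit_cost) * base_units
    new_price = base_price * (1 + price_change)
    wins = 0
    for _ in range(runs):
        elasticity = rng.gauss(elasticity_mean, elasticity_sd)
        units = max(0.0, base_units * (1 + elasticity * price_change))
        if (new_price - unit_cost) * units > baseline:
            wins += 1
    return wins / runs
```

For example, profit_probability(100, 1000, 60, 0.10, -1.5, 0.5) estimates how often a 10% price increase still beats today's profit when elasticity is uncertain around -1.5.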

5. 📈 Time Series & Visuals That Don't Miss

Holt-Winters Forecasting: Predicting the future with confidence intervals.
Decomposition: Splitting data into Trend, Seasonality, and Noise.
Cohort Comparison: Volcano Plots visualizing "Effect Size" vs "Significance." We bring the science.

💻 The Integrated IDE (Code is Power)

For those who speak the language of the gods (Python/SQL), we built a React-based Code Editor right inside the UI.

Embedded Ace Editor: Syntax highlighting, line numbers, and active line focus. Feels like VS Code, lives in your browser.
Intelligent Auto-Completions: Context-aware suggestions for pl, np, math. Type col, get col("name"). It knows your schema.
Sandboxed Custom Scripts:
- AST-Validated Security: We parse your code before execution.
- Blocked: import os, private attributes, system calls.
- Allowed: numpy, scipy, sklearn. Pure math and logic only.
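
A rough sketch of what AST-based validation can look like, using Python's standard ast module. This is not PyQuery's validator: find_violations and ALLOWED_IMPORTS are hypothetical names, the allow-list simply mirrors the three packages named above, and a production sandbox would need many more checks than these.

```python
import ast

# Assumption: the allow-list mirrors the packages named in the README.
ALLOWED_IMPORTS = {"numpy", "scipy", "sklearn"}


def find_violations(source: str) -> list[str]:
    """Parse the script (without running it) and list blocked constructs."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of '{alias.name}' is blocked")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import from '{node.module}' is blocked")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute '{node.attr}' is blocked")
    return problems
```

Because the check runs on the parsed tree, a script containing import os is rejected before a single line executes.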

🧪 SQL Lab: The Codex (God Mode)

For when the GUI is too easy and you want to flex raw SQL. This isn't SQLite. This is High-Performance Lazy SQL.

Zero-Lag Querying: Run SELECT * on a 50GB file? It pulls a preview instantly. The engine effectively cheats physics.
Cross-Dataset Joins: Join sales.csv with targets.xlsx using standard SQL.
Materialize: Execute complex queries, then save as a new dataset.

🧹 The Forge (Ruthless ETL)

Backend I/O that actually understands real-world data. Real data is cursed. We planned for that.

🧬 Advanced Auto-Encoding Healer:
- Scans the first bytes of every CSV to detect its encoding, so a UnicodeDecodeError never reaches you.
- Stream-Based Healing: Processes multi-GB files in 4MB chunks. Memory usage stays flat.
- Sanitization: Strips Null Bytes, normalizes newlines, and replaces garbage.
🧩 Mixed-Encoding Folder Handling:
- If a folder contains files with different encodings, PyQuery detects it and switches strategy automatically.
- We isolate. We adapt. We continue.
📂 Recursive Folder Globbing (Upgraded):
- Patterns like data/**/*.csv work even when schemas differ slightly or headers are misaligned.
🏗️ Staging Ground (Infrastructure Rizz):
- Control your intermediate storage. If your %TEMP% partition is small, tell PyQuery where the real space is using the PYQUERY_STAGING_DIR environment variable.
```
# Linux/Mac Power Move
export PYQUERY_STAGING_DIR="/mnt/fast_ssd/pyquery_cache"
pyquery run ...
```
🔍 Advanced File Filtering (Precision Strikes):
- Multiple Filter Types: Glob, Regex, Contains, Not Contains, Exact, Is Not.
- Stackable Logic: Must contain sales + Must NOT contain backup + Must match regex \d{4}.
- This is surgical file selection. No more loading junk and cleaning later.
📊 Excel Handling That Respects Your Sanity:
- Multi-Sheet Selection: Load one sheet, many sheets, or only the ones that matter.
- Template-Based Mapping: Pick a base file, preview its sheets, and apply that selection across all matching files.
- Sheet Name Filtering: Regex-powered selection like Q[1-4]_Data.
✨ Source Awareness & Cleanliness:
- Metadata Injection: Automatically add __source_path__ and __source_name__.
- Auto Type Inference: Samples data, infers dtypes, and instantly appends a Clean & Cast step.
✨ Auto-Typecast: One click scans rows and forcibly converts Strings to Int, Float, or Date.
🎭 PII Incinerator: Detects and obfuscates credit cards and SSNs. Secrets remain secret.
🩹 Smart Impute: Fill the voids. Forward fill, backward fill, median, or specific value injection. No null survives.
💥 Explode & Coalesce: Flatten lists and merge columns like a boss.
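
The Stream-Based Healing and Sanitization bullets above can be approximated with the standard library. The sketch below is a guess at the general technique, not the actual healer: it takes the encoding as a parameter instead of detecting it, replaces undecodable bytes with U+FFFD, and the names heal_stream and heal_bytes are invented. Only the 4MB chunk size comes from this README.

```python
import codecs
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, as described in the README


def heal_stream(src, dst, encoding: str = "utf-8", chunk_size: int = CHUNK_SIZE) -> None:
    """Copy a binary stream to a text stream in fixed-size chunks.

    Invalid bytes become U+FFFD, NUL characters are dropped, and
    \r\n or lone \r line endings are normalized to \n.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    carry = ""  # holds a trailing \r so a \r\n split across chunks stays one newline
    while True:
        chunk = src.read(chunk_size)
        text = carry + decoder.decode(chunk, final=not chunk)
        carry = ""
        if chunk and text.endswith("\r"):
            text, carry = text[:-1], "\r"
        text = text.replace("\x00", "").replace("\r\n", "\n").replace("\r", "\n")
        dst.write(text)
        if not chunk:
            break


def heal_bytes(data: bytes, encoding: str = "utf-8", chunk_size: int = CHUNK_SIZE) -> str:
    """Convenience wrapper for in-memory data, mainly for trying the idea out."""
    out = io.StringIO()
    heal_stream(io.BytesIO(data), out, encoding, chunk_size)
    return out.getvalue()
```

Only one chunk plus a one-character carry is held at a time, which is why memory stays flat no matter how large the input file is.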

🧠 The Tech Stack (Forbidden Knowledge) 🐐

This isn't just a library. It's a weapon system.

1. 🌊 The "Infinite Stream" Glitch (Lazy Execution)

The Old Gods (Pandas) are Eager. They try to swallow the ocean (RAM) whole. They choke. PyQuery is Lazy. It waits. It plans.

Scan: "It's a 100GB file. Interesting."
Plan: Filters, joins, math. Nothing executes until the final blow.
Stream: Data flows in chunks. Process. Write. Destroy.
Result: Processing 100GB on a MacBook Air. The laws of physics are optional.
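
The Scan, Plan, Stream idea can be sketched in plain Python with generators. This is a toy model for intuition only: PyQuery delegates the real work to Polars, and the LazyPlan class, its methods, and the sample data are all hypothetical.

```python
from typing import Callable, Iterable, Iterator


class LazyPlan:
    """Records steps and runs none of them until stream() is consumed."""

    def __init__(self, source: Callable[[], Iterable[dict]]):
        self._source = source  # called only at execution time
        self._steps: list[tuple[str, Callable]] = []

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyPlan":
        self._steps.append(("filter", predicate))
        return self

    def map(self, fn: Callable[[dict], dict]) -> "LazyPlan":
        self._steps.append(("map", fn))
        return self

    def stream(self) -> Iterator[dict]:
        rows: Iterable[dict] = self._source()  # the "scan" happens here
        for kind, fn in self._steps:
            # Built-in filter/map are lazy, so rows still flow one at a time.
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return iter(rows)


scan_log: list[str] = []


def numbers() -> Iterator[dict]:
    scan_log.append("scan")
    return ({"n": i} for i in range(5))


plan = LazyPlan(numbers).filter(lambda r: r["n"] % 2 == 0).map(lambda r: {"n": r["n"] * 10})
scans_after_planning = len(scan_log)  # building the plan touched no data
result = list(plan.stream())          # execution starts only here
```

Each row passes through every step and is then released, so peak memory depends on the row size rather than the dataset size.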

2. ⚙️ File-Level Execution Control

Most engines think in datasets. PyQuery thinks in files.

Individual File Processing: Forces the engine to load files one-by-one instead of bulk scanning.
Why it matters: One corrupted CSV no longer nukes the entire pipeline. We fix schemas and clean data before concatenation. This is how PyQuery survives enterprise-grade mess.
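
Here is a small sketch of why per-file loading helps, using only the csv module and in-memory text. It is illustrative rather than PyQuery's logic: it isolates a bad file and aligns columns but does not attempt to repair anything, and read_one, load_individually, and the sample files are made up. The __source_name__ column mirrors the metadata field mentioned in The Forge.

```python
import csv
import io


def read_one(text: str, required: set[str]) -> list[dict]:
    """Parse one CSV and keep only the required columns, in a fixed order."""
    reader = csv.DictReader(io.StringIO(text))
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{col: row[col] for col in sorted(required)} for row in reader]


def load_individually(files: dict[str, str], required: set[str]):
    """Load files one by one; a failing file is recorded, not fatal."""
    rows, failed = [], {}
    for name, text in files.items():
        try:
            file_rows = read_one(text, required)
        except (ValueError, csv.Error) as exc:
            failed[name] = str(exc)
            continue
        for row in file_rows:
            row["__source_name__"] = name
        rows.extend(file_rows)  # schemas are aligned before concatenation
    return rows, failed


files = {
    "a.csv": "id,amount\n1,10\n",
    "bad.csv": "oops\nx\n",
    "b.csv": "amount,id,extra\n20,2,z\n",
}
rows, failed = load_individually(files, {"id", "amount"})
```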

3. 🚀 Streaming I/O Architecture

We rewired the backend for scale.

True Streaming Discovery: Uses generators and lazy iteration. Point at 100k files without crashing.
Partial Globbing: Simple text filters convert to filesystem-level globs. Python never even sees irrelevant files.
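
Generator-based discovery combined with the stackable filters described in The Forge might look roughly like this. It is a sketch under assumptions: discover and apply_filters are hypothetical names, filters are matched against the file name only, and the "convert text filters to filesystem-level globs" optimization is not implemented here.

```python
import fnmatch
import os
import re
from typing import Iterable, Iterator, Optional


def discover(root: str) -> Iterator[str]:
    """Lazily yield every file path under root; no full listing is built."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def apply_filters(
    paths: Iterable[str],
    glob: Optional[str] = None,
    contains: Iterable[str] = (),
    not_contains: Iterable[str] = (),
    regex: Optional[str] = None,
) -> Iterator[str]:
    """Yield only paths whose file name passes every stacked filter."""
    pattern = re.compile(regex) if regex else None
    contains, not_contains = list(contains), list(not_contains)
    for path in paths:
        name = os.path.basename(path)
        if glob and not fnmatch.fnmatch(name, glob):
            continue
        if any(part not in name for part in contains):
            continue
        if any(part in name for part in not_contains):
            continue
        if pattern and not pattern.search(name):
            continue
        yield path


candidates = [
    "data/2023/sales_2023.csv",
    "data/sales_backup_2022.csv",
    "data/targets.csv",
    "data/sales_q1.csv",
]
matched = list(
    apply_filters(candidates, glob="*.csv", contains=["sales"],
                  not_contains=["backup"], regex=r"\d{4}")
)
```

In real use, apply_filters(discover("data"), ...) keeps the whole chain lazy, so pointing it at a huge folder costs one path at a time.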

4. 🛡️ Type Safety (Absolute Order)

Python is dynamic (chaotic). PyQuery imposes Order.

Every step is backed by a Pydantic Model.
If a String tries to infiltrate a Float column, it is terminated before execution.
No runtime surprises. Only calculated victories.
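
To show the "validate before execution" idea without extra dependencies, this sketch swaps Pydantic for a standard-library dataclass with manual checks. That swap is deliberate and is not how PyQuery does it. ClipParams and its fields are hypothetical; only the existence of a Clip tool comes from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipParams:
    """Hypothetical parameters for a Clip step (stand-in for a Pydantic model)."""

    column: str
    lower: float
    upper: float

    def __post_init__(self) -> None:
        if not isinstance(self.column, str) or not self.column:
            raise TypeError("column must be a non-empty string")
        for name in ("lower", "upper"):
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"{name} must be a number, got {type(value).__name__}")
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")


def is_valid_step(raw: dict) -> bool:
    """True if the raw step config passes validation; nothing is executed."""
    try:
        ClipParams(**raw)
    except (TypeError, ValueError):
        return False
    return True
```

A config such as {"column": "price", "lower": "0", "upper": 100} is rejected at this stage, before any data is touched.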

🧾 The Receipts (Benchmarks)

We don't post without proof. We mog the competition.

Metric	🐼 Pandas (Legacy)	⚡ PyQuery (Polars)	The Diff
Load 10GB CSV	`MemoryError` (Crash) 💥	0.2s (Lazy Scan, schema only) ⚡	Infinite
Filter Rows	15.4s (Slow)	0.5s (Parallel)	30x Faster
Group By	45s (Painful)	2.1s (Instant)	20x Faster
RAM Usage	12GB+ (Bloated)	500MB (Lean)	95% Less

Benchmarks run on a standard dev laptop. Results may vary but the vibe remains consistent.

🎮 Choose Your Fighter (4 Paths to Power)

We don't limit you. Dominate however you choose.

📦 Installation

pip install pyquery-polars

1. 🌊 The GUI (God Mode)

For when you want to click things, see pretty charts, and feel like a data scientist in a sci-fi movie.

Visual Recipe Builder: Nodes and edges of pure logic.
Native File Picker: Access local filesystem directly.

pyquery ui
# Launches the Web App on localhost:8501 🚀

2. 🤖 The API (Headless Beast)

Building a machine? Run PyQuery as the engine.

Swagger Docs: Auto-generated at /docs.
Async: Fire and forget jobs via POST /recipes/run.

pyquery api
# Serving high-performance ETL over HTTP at localhost:8000 📡

3. ⚡ The Batch Runner (Speedrun)

For automation. No interface. Just speed.

pyquery run -s input.csv -r recipe.json -o output.parquet
# Task complete. ⚡

4. 🧙‍♂️ The Sorcerer (Python SDK)

For the developers who want to weave PyQuery into their own code.

from pyquery_polars.backend.engine import PyQueryEngine
# Full programmatic control over the recipe engine.
# You are the architect now.

🧰 The Loadout (Arsenal)

Packed with every tool needed to clear the map.

Category	The Tools	Why it slaps
Cleaning	`Fill Nulls`, `Mask PII`, `Smart Extract`, `Regex`	Turns garbage data into gold. ✨
Analytics	`Rolling Agg`, `Time Bin`, `Rank`, `Diff`, `Z-Score`	High-frequency trading vibes. 📈
Combining	`Smart Join`, `Concat`, `Pivot`, `Unpivot`	Merge datasets without the headache. 🤝
Math	`Log`, `Exp`, `Clip`, `Date Offset`	For the scientific girlies. 👩‍🔬
Text	`Slice`, `Case`, `Replace`, `One-Hot`	String manipulation on steroids. 💪
I/O	`CSV`, `Parquet`, `Excel`, `JSON`, `IPC`	Speaks every language. 🗣️

🗺️ The Roadmap (Manifesting Destiny) 🔮

We aren't stopping here. We are aiming for the moon. 🚀

Phase 1: Native App Supremacy (Rust + Tauri): The browser has limits. The Native App will have none. GPU-accelerated plotting (10M points at 144Hz) and OLED black themes.
Phase 2: Big Data Devourer: Cloud connectors (S3, GCS, Azure). We drink their milkshakes.

🧑‍💻 Join the Cult (Developer Guide)

You want to contribute? Good. We need strong allies.

The Blooding (Adding a Transform) 🖐️

1. Backend Implementation:

Define Params: Create a Pydantic model (src/pyquery_polars/core/params.py).
Backend Logic: Write a pure polars function (src/pyquery_polars/backend/transforms/).
Register: Add step to register_all_steps() in registry.py.

2. Frontend Implementation:

Create a Renderer Function (src/pyquery_polars/frontend/steps/).
Register: Add step to register_frontend() in registry_init.py.

It appears in the CLI, API, and UI automatically. 🤯

# Only certified ballers contribute code.
# Are you up for it?

📜 License

GPL-3.0. Open source forever. 💖

Made with ☕, 🦀 (Rust), and 💖 by Sudharshan TK
