PyQuery is a local-first data operating system built on lazy execution that processes 100GB+ files while you doomscroll. No cap. 🧢

PyQuery-HQ/pyquery-legacy

⚡ PyQuery: The Main Character of Data Stacks 💫

ETL. EDA. ML. SQL. IDE.

🚩 Stop letting Pandas hold you back.
The single-threaded era is over.

PyQuery is a local-first data operating system that auto-heals broken CSVs, includes a native Code Editor, and processes 100GB+ files without breaking a sweat. ⚡

Feature Request · Report Bug


🎮 The Ecosystem (Choose Your Path)

We built a suite of tools so perfect it hurts.

| Path | Vibe | Description | Link |
| --- | --- | --- | --- |
| CLI | 🏎️ Speedrun | The Headless Beast. Run data pipelines in your sleep. | CLI Manual |
| UI | 🎨 Creative | The Visual Studio. Drag, drop, analyze, visualize. | UI Guide |
| API | 📡 Backbone | The Server. Build your own apps on our engine. | API Docs |
| SDK | 🐍 Sorcery | The Python Library. For the code wizards. | SDK Guide |

🧠 TL;DR (For the goldfish attention spans)

✨ New Drop: Headless Ghost Mode 👻 PyQuery now supports total Headless Automation. Run massive pipelines in CI/CD, schedule tasks, and bypass the UI entirely with the re-architected run command.

  1. Install it: pip install pyquery-polars (Don't be basic).
  2. Run it: pyquery ui (Visuals) or pyquery run (Speedrun/Headless).
  3. The Flex: It's a local-first, privacy-focused engine that eats Excel sheets and CSVs for breakfast using Rust.

⛩️ The Awakening (Lore)

Long ago, the Data World was mid. Analysts lived in fear of the MemoryError. They bowed before the single-threaded tyranny of the Old Gods (Pandas). They accepted their fate of freezing screens, crashing kernels, and waiting 4 hours for a simple groupby.

But I refused.

From the depths of the Rusty abyss, PyQuery has awakened. I am not just an ETL tool anymore. I am the entire war room. I am here to obliterate your bottlenecks and ratio your old benchmarks.

The Core Philosophy (Our Ninja Way) 🥷

  • Lazy Execution: Nothing computes until you say "Export". This optimizes memory and speed so your hardware doesn't scream.
  • Zero-Copy: Data is processed efficiently without redundant copies. We don't waste bits.
  • Strict & Clean: Enforces strict typing and argument validation. No ambiguous magic, just pure logic.
  • Automation First: While the UI is gorgeous, PyQuery is built to run alone in the dark.

Welcome to your Villain Arc. 👹


🧾 PyQuery vs. Power Query: The Roast

We don't usually punch down, but you handed us the gloves.

| Feature | ⚡ PyQuery (The Chad) | 🐢 Power Query (The Virgin) |
| --- | --- | --- |
| Speed | Rust-Powered. Processes millions of rows before you blink. | Single-Threaded. Spends 20 mins saying "Loading Data..." just to crash. |
| Language | Python/SQL/Polars. The languages of gods. | M-Code. A language invented to punish humanity. |
| AI/ML | Built-in. Random Forests, Clustering, & Monte Carlo Sims. | Non-existent. You need a generic "AI Plugin" that costs extra. |
| Vibe | Dark Mode CLI & Streamlit. Cyberpunk aesthetic. | Corporate Grey. It sucks the soul out of your body. |
| Price | Free & Open Source. | Requires an Office 365 License (Subscription L). |
| Boot XP | Cinematic CLI with Themes & Logs | Static Spinner of Doom |
| Broken CSVs | Auto-healed at ingest | Crashes silently |
| One Bad File | Isolated & corrected | Pipeline dead |
| Headless | Full CLI Automation. Designed for CI/CD pipelines. | UI Dependent. Good luck automating that in a Linux shell. |

🖥️ The Main Character CLI (The Experience)

This is not a command line. This is a startup ritual.

Every time PyQuery boots, it behaves like a data OS coming online.

⚡ Adaptive Theme Engine

The CLI dynamically switches color gradients, borders, and mood based on your selected boot mode. Each theme announces itself during startup. You feel it before you run anything.

  • Cyberpunk: (Default) Neon main-character energy.
  • Rustacean: Pure Polars lore.
  • Matrix: Hacker-core, green text supremacy.
  • Villain Arc: Purple & gold. No mercy.

👻 Headless Revamp: The run Command

The CLI has been completely re-architected for Automation Supremacy. The run command is your primary entry point for headless operations.

# Basic Speedrun
pyquery run --source data.csv --output results.parquet

# Project Mode (Load the whole squad)
pyquery run --project daily_report.pyquery --output dist/

🛠️ Execution Modes:
  • Source Mode (--source): Quick ad-hoc processing of single files, SQL queries, or APIs.
  • Project Mode (--project): Load a predefined .pyquery project file containing multiple datasets and recipes.

Note: These flags are mutually exclusive. Choose your path.
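
A minimal sketch of how a CLI can enforce that rule, using Python's standard argparse. This is an illustration, not PyQuery's actual implementation: only the --source, --project, and --output flag names come from the commands above, and the parser name and help strings are made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `run` flags."""
    parser = argparse.ArgumentParser(prog="pyquery-run-sketch")
    # Exactly one input mode must be chosen: a single source or a project file.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--source", help="single file, SQL query, or API to process")
    mode.add_argument("--project", help="path to a .pyquery project file")
    parser.add_argument("--output", required=True, help="output file or directory")
    return parser


def parse_run_args(argv: list[str]) -> argparse.Namespace:
    return build_parser().parse_args(argv)


args = parse_run_args(["--source", "data.csv", "--output", "results.parquet"])
print(args.source, args.output)
```

Passing both --source and --project makes argparse print a usage error and exit, which is the behavior you want in a CI job: fail fast, before any data is touched.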

📟 Sequential Boot Logs

Real-time kernel-style logs with cinematic pacing. It doesn't say "loading"... It declares intent.

  • Timestamped steps.
  • Module icons (⚡ Engine, 💾 IO, 🧠 Planner).
  • Your terminal doesn't just start PyQuery. It witnesses it.

🧩 Focused UI (Modal Upgrade)

Sidebars are for tourists. PyQuery loads data through dedicated modal dialogs, because loading data is a moment, not a side quest.

  • Blazing-Fast & Optimistic: The dialog opens instantly.
  • Lazy Preview: We scan 100k+ files without freezing the UI.
  • Recent Paths: We remember so you don't have to.
  • Preview Before Commit: See matched files and sheets before you import. You don't guess anymore; you confirm with intent.

💪 The Flex (Capabilities)

We built an empire so you can rule yours. This isn't just software; it's a lifestyle.

🎯 EDA: The Crystal Ball (Expanded)

"Most tools describe the past. PyQuery predicts the future."

EDA is no longer just "looking at data". It's hunting.

1. 🧬 Dataset DNA & Health Check

We scan your data's soul.

  • Missing Cells: We don't just count nulls; we judge them. (<1% is excellence, >10% is sloppy).
  • Cardinality Checks: Instantly know if a column is categorical or continuous.
  • Duplicate Detection: We find the clones and eliminate them.
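
The three checks above can be sketched in plain Python. This is a toy stand-in over a list of dicts, not PyQuery's profiler: the function names and the "acceptable" middle grade are assumptions, while the <1% and >10% thresholds come from the list.

```python
from collections import Counter


def grade_missing(ratio: float) -> str:
    """Grade a column by its share of missing cells."""
    if ratio < 0.01:
        return "excellent"
    if ratio > 0.10:
        return "sloppy"
    return "acceptable"  # assumption: the text does not name the middle band


def health_check(rows: list[dict]) -> dict:
    columns = list(rows[0]) if rows else []
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        missing = sum(v is None for v in values)
        present = [v for v in values if v is not None]
        report[col] = {
            "missing_grade": grade_missing(missing / len(rows)),
            "cardinality": len(set(present)),  # distinct non-null values
        }
    # A row counts as a duplicate when every column matches an earlier row.
    counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {"columns": report, "duplicate_rows": duplicates}


rows = [
    {"city": "Oslo", "spend": 10.0},
    {"city": "Oslo", "spend": 10.0},
    {"city": "Lima", "spend": None},
    {"city": "Pune", "spend": 7.5},
]
print(health_check(rows))
```

A low cardinality relative to the row count hints at a categorical column; a cardinality close to the row count hints at a continuous one or an identifier.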

2. 🚀 The Action Engine (ML Strategist)

  • Strategic Brief: A "Top 3 Insights" card that ranks every signal in your data. It whispers: "The money is here."
  • Automated Drivers: It finds the hidden variables controlling your target.
    • "Why is Churn high? It's not Price. It's Customer Support Wait Time > 5m." -> Boom. Solved.
  • Correlation Matrix: Pearson, Cramér's V, and F-Tests calculated automatically. We know the relationships better than you know your own situationship.

3. 🧪 ML Laboratory (The Brain)

  • Auto-Pilot Mode: Trains an army of models (Random Forest, Lasso, Ridge) to find the best fit. You sit back and look busy.
  • Clustering (Unsupervised Rizz): Elbow plot and silhouette score optimization. We even name the segments for you ("Cluster 1 = High Spend, Low Age").
  • Explainable Anomalies: Uses Isolation Forests to catch the weirdos and fraudsters instantly, with a Contextual Profiler to tell you why they are weird.

4. 🎮 Decision Simulator (The Time Machine)

  • "What-If" Sliders: Change variables in real-time. "If I raise Price by 10% and lower ad spend, do I still profit?"
  • Monte Carlo Sims: Run 1,000+ simulations. We don't guess; we calculate the probability of your success.
  • Waterfall Analysis: The Model breaks down exactly why the prediction changed.
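
To make the Monte Carlo idea concrete, here is a small sketch using only the standard library. The profit model, every number in it, and the function names are invented for illustration; PyQuery's simulator works from your trained model, not from this toy formula.

```python
import random


def simulate_profit(price_change: float, ad_spend_change: float,
                    rng: random.Random) -> float:
    """One draw from a toy profit model (all constants are made up)."""
    base_price, base_units, unit_cost, base_ads = 20.0, 1_000, 12.0, 3_000.0
    elasticity = rng.gauss(-1.2, 0.3)  # % change in units per % change in price
    ad_effect = rng.gauss(0.2, 0.05)   # % change in units per % change in ad spend
    units = base_units * (1 + elasticity * price_change + ad_effect * ad_spend_change)
    price = base_price * (1 + price_change)
    ads = base_ads * (1 + ad_spend_change)
    return max(units, 0.0) * (price - unit_cost) - ads


def probability_of_profit(price_change: float, ad_spend_change: float,
                          runs: int = 1_000, seed: int = 42) -> float:
    rng = random.Random(seed)  # seeded so repeated calls give the same answer
    wins = sum(simulate_profit(price_change, ad_spend_change, rng) > 0
               for _ in range(runs))
    return wins / runs


# "If I raise Price by 10% and lower ad spend by 20%, do I still profit?"
print(probability_of_profit(price_change=0.10, ad_spend_change=-0.20))
```

The point of the pattern: uncertain inputs are drawn from distributions, the scenario is replayed many times, and the answer is a probability rather than a single guess.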

5. 📈 Time Series & Visuals That Don't Miss

  • Holt-Winters Forecasting: Predicting the future with confidence intervals.
  • Decomposition: Splitting data into Trend, Seasonality, and Noise.
  • Cohort Comparison: Volcano Plots visualizing "Effect Size" vs "Significance." We bring the science.

💻 The Integrated IDE (Code is Power)

For those who speak the language of the gods (Python/SQL), we built a React-based Code Editor right inside the UI.

  • Embedded Ace Editor: Syntax highlighting, line numbers, and active line focus. Feels like VS Code, lives in your browser.
  • Intelligent Auto-Completions: Context-aware suggestions for pl, np, math. Type col, get col("name"). It knows your schema.
  • Sandboxed Custom Scripts:
    • AST-Validated Security: We parse your code before execution.
    • Blocked: import os, private attributes, system calls.
    • Allowed: numpy, scipy, sklearn. Pure math and logic only.
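
A rough sketch of what AST-based validation can look like, using Python's standard ast module. The allowlist and blocklist below are assumptions modeled on the bullets above ("math" and the blocked builtins are guesses), and this is not PyQuery's actual validator.

```python
import ast

ALLOWED_IMPORTS = {"numpy", "scipy", "sklearn", "math"}            # "math" is an assumption
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}  # assumption


def validate_script(source: str) -> list[str]:
    """Return a list of violations; an empty list means the script passed."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"blocked import: {name}")
        # Private and dunder attributes (obj._x, obj.__class__) are rejected.
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute: {node.attr}")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            problems.append(f"blocked call: {node.func.id}")
    return problems


print(validate_script("import numpy as np\nx = np.sqrt(4)"))
print(validate_script("import os\nos.system('ls')"))
```

The script is parsed but never executed during validation, so a rejected script has no side effects. Static checks like these reduce risk; they are not a complete sandbox on their own.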

🧪 SQL Lab: The Codex (God Mode)

For when the GUI is too easy and you want to flex raw SQL. This isn't SQLite. This is High-Performance Lazy SQL.

  • Zero-Lag Querying: Run SELECT * on a 50GB file? It pulls a preview instantly. The engine effectively cheats physics.
  • Cross-Dataset Joins: Join sales.csv with targets.xlsx using standard SQL.
  • Materialize: Execute complex queries, then save as a new dataset.

🧹 The Forge (Ruthless ETL)

Backend I/O that actually understands real-world data. Real data is cursed. We planned for that.

  • 🧬 Advanced Auto-Encoding Healer:
    • Scans the first bytes of every CSV to automatically fix UnicodeDecodeError.
    • Stream-Based Healing: Processes multi-GB files in 4MB chunks. Memory usage stays flat.
    • Sanitization: Strips Null Bytes, normalizes newlines, and replaces garbage.
  • 🧩 Mixed-Encoding Folder Handling:
    • If a folder contains files with different encodings, PyQuery detects it and switches strategy automatically.
    • We isolate. We adapt. We continue.
  • 📂 Recursive Folder Globbing (Upgraded):
    • Patterns like data/**/*.csv work even when schemas differ slightly or headers are misaligned.
  • ๐Ÿ—๏ธ Staging Ground (Infrastructure Rizz):
    • Control your intermediate storage. If your %TEMP% partition is small, tell PyQuery where the real space is using the PYQUERY_STAGING_DIR environment variable.
    # Linux/Mac Power Move
    export PYQUERY_STAGING_DIR="/mnt/fast_ssd/pyquery_cache"
    pyquery run ...
  • 🔍 Advanced File Filtering (Precision Strikes):
    • Multiple Filter Types: Glob, Regex, Contains, Not Contains, Exact, Is Not.
    • Stackable Logic: Must contain sales + Must NOT contain backup + Must match regex \d{4}.
    • This is surgical file selection. No more loading junk and cleaning later.
  • 📊 Excel Handling That Respects Your Sanity:
    • Multi-Sheet Selection: Load one sheet, many sheets, or only the ones that matter.
    • Template-Based Mapping: Pick a base file, preview its sheets, and apply that selection across all matching files.
    • Sheet Name Filtering: Regex-powered selection like Q[1-4]_Data.
  • ✨ Source Awareness & Cleanliness:
    • Metadata Injection: Automatically add __source_path__ and __source_name__.
    • Auto Type Inference: Samples data, infers dtypes, and instantly appends a Clean & Cast step.
  • ✨ Auto-Typecast: One click scans rows and forcibly converts Strings to Int, Float, or Date.
  • 🎭 PII Incinerator: Detects and obfuscates credit cards and SSNs. Secrets remain secret.
  • 🩹 Smart Impute: Fill the voids. Forward fill, backward fill, median, or specific value injection. No null survives.
  • 💥 Explode & Coalesce: Flatten lists and merge columns like a boss.
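
For the stream-based healing described in the list above, here is a minimal standard-library sketch. It assumes the encoding has already been detected (detection is not shown), and the function name and chunk parameter are hypothetical. Chunks are counted in characters rather than bytes, so "4MB" is approximate.

```python
import io

CHUNK_CHARS = 4 * 1024 * 1024  # roughly the 4MB chunk size quoted above


def heal_stream(src, dst, encoding: str = "utf-8",
                chunk_chars: int = CHUNK_CHARS) -> None:
    """Copy a binary stream to clean UTF-8 without loading the whole file."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising;
    # newline=None makes the wrapper translate \r\n and lone \r into \n.
    reader = io.TextIOWrapper(src, encoding=encoding, errors="replace", newline=None)
    try:
        while True:
            text = reader.read(chunk_chars)
            if not text:
                break
            dst.write(text.replace("\x00", "").encode("utf-8"))  # strip null bytes
    finally:
        reader.detach()  # leave the caller's stream open


src = io.BytesIO(b"id,name\r\n1,caf\xff\x00\r\n2,ok\r")
dst = io.BytesIO()
heal_stream(src, dst, chunk_chars=3)  # tiny chunks to exercise the boundaries
print(dst.getvalue())
```

Because only one chunk is held at a time, memory use depends on the chunk size, not the file size. With real files, pass open(path, "rb") as src and open(out_path, "wb") as dst.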

🧠 The Tech Stack (Forbidden Knowledge)

This isn't just a library. It's a weapon system.

1. 🌊 The "Infinite Stream" Glitch (Lazy Execution)

The Old Gods (Pandas) are Eager. They try to swallow the ocean (RAM) whole. They choke. PyQuery is Lazy. It waits. It plans.

  • Scan: "It's a 100GB file. Interesting."
  • Plan: Filters, joins, math. Nothing executes until the final blow.
  • Stream: Data flows in chunks. Process. Write. Destroy.
  • Result: Processing 100GB on a MacBook Air. The laws of physics are optional.
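
The Scan, Plan, Stream loop above can be imitated with plain Python generators. This is a conceptual stand-in, not the Polars engine PyQuery runs on: the LazyPlan class and its methods are hypothetical, and a real query optimizer does far more than chain steps.

```python
import csv
import io
from typing import Callable, Iterable, Iterator

Row = dict
Step = Callable[[Iterator[Row]], Iterator[Row]]


class LazyPlan:
    """Records steps now and runs them only when export() is called."""

    def __init__(self, source: Callable[[], Iterable[Row]]):
        self._source = source            # not called until export()
        self._steps: list[Step] = []

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyPlan":
        self._steps.append(lambda rows: (r for r in rows if predicate(r)))
        return self

    def with_column(self, name: str, fn: Callable[[Row], object]) -> "LazyPlan":
        self._steps.append(lambda rows: ({**r, name: fn(r)} for r in rows))
        return self

    def export(self, out) -> int:
        rows = iter(self._source())
        for step in self._steps:
            rows = step(rows)            # still lazy: generators wrapping generators
        writer, written = None, 0
        for row in rows:                 # one row flows through at a time
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            written += 1
        return written


def read_sales() -> Iterable[Row]:
    data = "region,amount\nEU,120\nUS,80\nEU,45\n"
    return csv.DictReader(io.StringIO(data))


plan = (LazyPlan(read_sales)
        .filter(lambda r: r["region"] == "EU")
        .with_column("amount_cents", lambda r: int(r["amount"]) * 100))
out = io.StringIO()
print(plan.export(out))  # the source is read only here
```

Building the plan costs nothing; the data is touched once, during export, and each row is dropped as soon as it is written.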

2. ⚙️ File-Level Execution Control

Most engines think in datasets. PyQuery thinks in files.

  • Individual File Processing: Forces the engine to load files one-by-one instead of bulk scanning.
  • Why it matters: One corrupted CSV no longer nukes the entire pipeline. We fix schemas and clean data before concatenation. This is how PyQuery survives enterprise-grade mess.
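
A sketch of the per-file idea in plain Python: each file is parsed and aligned to the expected schema on its own, and a failure is recorded instead of aborting the run. In-memory strings stand in for files, and the function names and the failure dictionary are assumptions, not PyQuery's API.

```python
import csv
import io


def load_one(name: str, text: str, expected: list[str]) -> list[dict]:
    """Parse one CSV and align it to the expected schema, or raise ValueError."""
    reader = csv.DictReader(io.StringIO(text))
    header = [h.strip().lower() for h in (reader.fieldnames or [])]
    missing = [c for c in expected if c not in header]
    if missing:
        raise ValueError(f"{name}: missing columns {missing}")
    reader.fieldnames = header  # use the normalized header for the data rows
    return [{c: row[c] for c in expected} for row in reader]


def load_all(files: dict[str, str], expected: list[str]):
    rows, failures = [], {}
    for name, text in files.items():   # one file at a time, never a bulk scan
        try:
            rows.extend(load_one(name, text, expected))
        except (ValueError, csv.Error) as exc:
            failures[name] = str(exc)  # quarantine this file and keep going
    return rows, failures


files = {
    "jan.csv": "ID,Amount\n1,10\n",
    "feb.csv": "id, amount ,note\n2,20,late\n",
    "bad.csv": "totally,unrelated\nx,y\n",
}
rows, failures = load_all(files, expected=["id", "amount"])
print(rows)
print(failures)
```

Two files with slightly different headers end up in one clean table, and the unusable file is reported by name rather than taking the pipeline down with it.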

3. 🚀 Streaming I/O Architecture

We rewired the backend for scale.

  • True Streaming Discovery: Uses generators and lazy iteration. Point at 100k files without crashing.
  • Partial Globbing: Simple text filters convert to filesystem-level globs. Python never even sees irrelevant files.
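
Both ideas can be sketched with the standard library: a generator that yields paths as directories are walked, and stackable name filters applied on the fly. The helper names are hypothetical, and turning a text filter into a glob such as `*sales*.csv` is shown only as a pattern argument, not as PyQuery's actual conversion logic.

```python
import fnmatch
import os
import re
import tempfile
from typing import Callable, Iterable, Iterator


def discover(root: str, pattern: str = "*.csv") -> Iterator[str]:
    """Yield matching paths directory by directory; no full list is built."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def contains(text: str) -> Callable[[str], bool]:
    return lambda name: text in name


def not_contains(text: str) -> Callable[[str], bool]:
    return lambda name: text not in name


def matches(regex: str) -> Callable[[str], bool]:
    compiled = re.compile(regex)
    return lambda name: compiled.search(name) is not None


def select(paths: Iterable[str], filters: list) -> Iterator[str]:
    for path in paths:
        name = os.path.basename(path)
        if all(check(name) for check in filters):  # every stacked filter must pass
            yield path


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "reports"))
    for name in ["sales_2024.csv", "sales_backup_2024.csv", "sales_old.csv", "notes.txt"]:
        open(os.path.join(root, "reports", name), "w").close()
    # The "contains sales" filter is pushed down into the glob pattern itself.
    candidates = discover(root, pattern="*sales*.csv")
    stacked = [not_contains("backup"), matches(r"\d{4}")]
    print(sorted(os.path.basename(p) for p in select(candidates, stacked)))
```

Because both stages are generators, downstream code can start processing the first match while the walk is still in progress.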

4. 🛡️ Type Safety (Absolute Order)

Python is dynamic (chaotic). PyQuery imposes Order.

  • Every step is backed by a Pydantic Model.
  • If a String tries to infiltrate a Float column, it is terminated before execution.
  • No runtime surprises. Only calculated victories.
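
To show the "validate before execution" idea without extra dependencies, the sketch below uses a standard-library dataclass as a stand-in for a Pydantic model. The ClipParams class and its fields are hypothetical and do not mirror any real PyQuery step definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipParams:
    """Hypothetical parameters for a clip step, validated on construction."""
    column: str
    lower: float
    upper: float

    def __post_init__(self):
        if not isinstance(self.column, str) or not self.column:
            raise TypeError("column must be a non-empty string")
        for name in ("lower", "upper"):
            value = getattr(self, name)
            # bool is a subclass of int, so it is excluded explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"{name} must be a number, got {type(value).__name__}")
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")


print(ClipParams(column="price", lower=0, upper=100))

try:
    ClipParams(column="price", lower="0", upper=100)  # a string in a float slot
except TypeError as exc:
    print(f"rejected before execution: {exc}")
```

The bad parameter is caught when the step is defined, so the pipeline never starts with an invalid recipe.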

🧾 The Receipts (Benchmarks)

We don't post without proof. We mog the competition.

Metric ๐Ÿผ Pandas (Legacy) โšก PyQuery (Polars) The Diff
Load 10GB CSV MemoryError (Crash) ๐Ÿ’ฅ 0.2s (Lazy Scan) โšก Infinite
Filter Rows 15.4s (Slow) 0.5s (Parallel) 30x Faster
Group By 45s (Painful) 2.1s (Instant) 20x Faster
RAM Usage 12GB+ (Bloated) 500MB (Lean) 95% Less

Benchmarks run on a standard dev laptop. Results may vary but the vibe remains consistent.
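
The Diff column is rounded. Recomputing it from the table's own figures (and treating 12GB as 12,000MB, which is an assumption about the units) gives:

```python
def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the second timing is."""
    return before_s / after_s


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved going from `before` to `after`."""
    return 100 * (1 - after / before)


print(round(speedup(15.4, 0.5), 1))          # Filter Rows -> 30.8
print(round(speedup(45, 2.1), 1))            # Group By    -> 21.4
print(round(reduction_pct(12_000, 500), 1))  # RAM in MB   -> 95.8
```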


🎮 Choose Your Fighter (4 Paths to Power)

We don't limit you. Dominate however you choose.

📦 Installation

pip install pyquery-polars

1. 🌊 The GUI (God Mode)

For when you want to click things, see pretty charts, and feel like a data scientist in a sci-fi movie.

  • Visual Recipe Builder: Nodes and edges of pure logic.
  • Native File Picker: Access local filesystem directly.

pyquery ui
# Launches the Web App on localhost:8501 🚀

2. 🤖 The API (Headless Beast)

Building a machine? Run PyQuery as the engine.

  • Swagger Docs: Auto-generated at /docs.
  • Async: Fire and forget jobs via POST /recipes/run.

pyquery api
# Serving high-performance ETL over HTTP at localhost:8000 📡
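
A hedged sketch of calling that endpoint from Python with only the standard library. The host, port, and /recipes/run path come from the text above; the JSON body is purely hypothetical, so check the Swagger page at /docs for the real request schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address printed by `pyquery api`


def build_run_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /recipes/run."""
    return urllib.request.Request(
        url=f"{BASE_URL}/recipes/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical body: these field names are placeholders, not the real schema.
request = build_run_request({"source": "input.csv", "output": "output.parquet"})
print(request.get_method(), request.full_url)

# To submit it against a running server:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read().decode("utf-8"))
```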

3. ⚡ The Batch Runner (Speedrun)

For automation. No interface. Just speed.

pyquery run -s input.csv -r recipe.json -o output.parquet
# Task complete. ⚡
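
For schedulers and CI jobs written in Python, the same command can be wrapped with subprocess. The flags and the PYQUERY_STAGING_DIR variable come from this README; the helper names are hypothetical, and the sketch assumes the CLI exits with a non-zero code on failure.

```python
import os
import subprocess
from typing import Optional


def build_run_command(source: str, recipe: str, output: str) -> list[str]:
    # Flags as used in the command above: -s source, -r recipe, -o output.
    return ["pyquery", "run", "-s", source, "-r", recipe, "-o", output]


def run_pipeline(source: str, recipe: str, output: str,
                 staging_dir: Optional[str] = None) -> int:
    env = os.environ.copy()
    if staging_dir:
        env["PYQUERY_STAGING_DIR"] = staging_dir  # where intermediate files go
    # check=True raises CalledProcessError on a non-zero exit, failing the job.
    completed = subprocess.run(build_run_command(source, recipe, output),
                               env=env, check=True)
    return completed.returncode


print(build_run_command("input.csv", "recipe.json", "output.parquet"))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.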

4. 🧙‍♂️ The Sorcerer (Python SDK)

For the developers who want to weave PyQuery into their own code.

from pyquery_polars.backend.engine import PyQueryEngine
# Full programmatic control over the recipe engine.
# You are the architect now.

🧰 The Loadout (Arsenal)

Packed with every tool needed to clear the map.

| Category | The Tools | Why it slaps |
| --- | --- | --- |
| Cleaning | Fill Nulls, Mask PII, Smart Extract, Regex | Turns garbage data into gold. ✨ |
| Analytics | Rolling Agg, Time Bin, Rank, Diff, Z-Score | High-frequency trading vibes. 📈 |
| Combining | Smart Join, Concat, Pivot, Unpivot | Merge datasets without the headache. 🤝 |
| Math | Log, Exp, Clip, Date Offset | For the scientific girlies. 👩‍🔬 |
| Text | Slice, Case, Replace, One-Hot | String manipulation on steroids. 💪 |
| I/O | CSV, Parquet, Excel, JSON, IPC | Speaks every language. 🗣️ |

🗺️ The Roadmap (Manifesting Destiny) 🔮

We aren't stopping here. We are aiming for the moon. 🚀

  • Phase 1: Native App Supremacy (Rust + Tauri): The browser has limits. The Native App will have none. GPU-accelerated plotting (10M points at 144Hz) and OLED black themes.
  • Phase 2: Big Data Devourer: Cloud connectors (S3, GCS, Azure). We drink their milkshakes.

🧑‍💻 Join the Cult (Developer Guide)

You want to contribute? Good. We need strong allies.

The Blooding (Adding a Transform) 🖐️

1. Backend Implementation:

  • Define Params: Create a Pydantic model (src/pyquery_polars/core/params.py).
  • Backend Logic: Write a pure polars function (src/pyquery_polars/backend/transforms/).
  • Register: Add step to register_all_steps() in registry.py.

2. Frontend Implementation:

  • Create a Renderer Function (src/pyquery_polars/frontend/steps/).
  • Register: Add step to register_frontend() in registry_init.py.

It appears in the CLI, API, and UI automatically. 🤯
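
The backend half of that flow (params, pure function, registration) follows a common registry pattern. The sketch below shows the pattern in plain Python with a dataclass standing in for the Pydantic model and a list of dicts standing in for a Polars frame. STEP_REGISTRY, register_step, and run_step are hypothetical names; the real signatures live in registry.py and may differ.

```python
from dataclasses import dataclass
from typing import Callable

STEP_REGISTRY: dict[str, tuple[type, Callable]] = {}  # hypothetical registry


def register_step(name: str, params_cls: type, fn: Callable) -> None:
    if name in STEP_REGISTRY:
        raise ValueError(f"step already registered: {name}")
    STEP_REGISTRY[name] = (params_cls, fn)


@dataclass(frozen=True)
class UppercaseParams:  # 1. define params (a Pydantic model in the real project)
    column: str


def uppercase(rows: list[dict], params: UppercaseParams) -> list[dict]:
    # 2. backend logic: a pure function that returns new rows and mutates nothing
    return [{**row, params.column: str(row[params.column]).upper()} for row in rows]


register_step("uppercase", UppercaseParams, uppercase)  # 3. register


def run_step(name: str, raw_params: dict, rows: list[dict]) -> list[dict]:
    """Look a step up by name, validate its params, and apply it."""
    params_cls, fn = STEP_REGISTRY[name]
    return fn(rows, params_cls(**raw_params))


print(run_step("uppercase", {"column": "city"}, [{"city": "oslo"}]))
```

Because every surface looks steps up by name in one registry, adding an entry once is enough for all of them to see it.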

# Only certified ballers contribute code.
# Are you up for it?

📜 License

GPL-3.0. Open source forever. 💖


Made with ☕, 🦀 (Rust), and 💖 by Sudharshan TK
