PyQuery Core is the headless, high-performance ETL and Analytics engine that powers the PyQuery ecosystem.
Previously hidden inside a generic monorepo, it has now been extracted into its own pure-Python library. It handles the heavy lifting: file I/O, data transformation, statistical analytics, and machine learning.
It has no UI. It has no CLI. It is just raw, unadulterated Polars power wrapped in a strict, type-safe architecture.
- 🚀 Lazy-First Architecture: Built on Polars LazyFrames. Nothing executes until you say so.
- 🛡️ Strict Type Safety: Every transform, every parameter, and every I/O operation is validated with Pydantic models. No more stringly-typed chaos.
- 🔌 Universal I/O:
  - Readers: CSV, Parquet, Excel, JSON, IPC.
  - Healers: Auto-detects encoding issues and "heals" broken CSVs on the fly.
- 🧪 Analytics Module:
  - Built-in `scikit-learn` integration for Clustering and Regression.
  - Automatic "What-If" simulation engines.
- 🔧 Transform Registry: A modular plugin system for registering data transformation steps.
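The Transform Registry bullet describes a plugin system but not its shape; a common way to build one is a decorator that maps string keys to transform functions. The sketch below is a minimal stdlib-only illustration of that pattern (all names hypothetical, not the actual pyquery-core API):

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a transform name to its implementation.
TRANSFORMS: Dict[str, Callable] = {}

def register(name: str):
    """Decorator that registers a transform function under a string key."""
    def decorator(fn: Callable) -> Callable:
        TRANSFORMS[name] = fn
        return fn
    return decorator

@register("upper")
def to_upper(rows: List[dict], column: str) -> List[dict]:
    # Upper-case one column of a list-of-dicts table.
    return [{**r, column: r[column].upper()} for r in rows]

rows = [{"region": "north"}, {"region": "south"}]
print(TRANSFORMS["upper"](rows, "region"))
# → [{'region': 'NORTH'}, {'region': 'SOUTH'}]
```

Registering by name is what lets pipeline steps be described as plain data (`{"type": "filter", ...}`) rather than code.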
```shell
pip install pyquery-core
```

This is a library for builders. Use it to construct your own data pipelines.
The `PyQueryEngine` is the orchestrator.
```python
from pyquery_core.core import PyQueryEngine
from pyquery_core.io.files import FileLoader

# Initialize
engine = PyQueryEngine()

# Load Data (Lazy)
df = FileLoader.read_csv("massive_data.csv")

# Register a Pipeline
pipeline = [
    {"type": "filter", "params": {"column": "revenue", "operator": ">", "value": 1000}},
    {"type": "group_by", "params": {"by": "region", "agg": {"revenue": "sum"}}},
]

# Execute
result = engine.run(df, pipeline)
print(result.collect())
```

Run complex statistical analysis without the boilerplate.
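The dict-based steps above suggest a dispatch loop inside the engine: look up each step's `type`, apply it, pass the result onward. As a rough stdlib-only sketch of that idea (operating on a list of dicts instead of a Polars LazyFrame; names and semantics are assumptions, not the real `PyQueryEngine.run`):

```python
import operator
from collections import defaultdict

# Hypothetical operator table for filter steps.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def run(rows, pipeline):
    """Apply filter / group_by steps, mirroring the dict recipe format."""
    for step in pipeline:
        p = step["params"]
        if step["type"] == "filter":
            op = OPS[p["operator"]]
            rows = [r for r in rows if op(r[p["column"]], p["value"])]
        elif step["type"] == "group_by":
            # Sketch handles a single "sum" aggregation per step.
            key, (col, agg) = p["by"], next(iter(p["agg"].items()))
            sums = defaultdict(float)
            for r in rows:
                sums[r[key]] += r[col]
            rows = [{key: k, col: v} for k, v in sums.items()]
    return rows

data = [
    {"region": "EU", "revenue": 500},
    {"region": "EU", "revenue": 2000},
    {"region": "US", "revenue": 3000},
]
pipeline = [
    {"type": "filter", "params": {"column": "revenue", "operator": ">", "value": 1000}},
    {"type": "group_by", "params": {"by": "region", "agg": {"revenue": "sum"}}},
]
print(run(data, pipeline))
# → [{'region': 'EU', 'revenue': 2000.0}, {'region': 'US', 'revenue': 3000.0}]
```

The real engine builds a Polars lazy query plan instead of executing eagerly, but the recipe format is the same.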
```python
from pyquery_core.analytics.ml import ClusterEngine

# Auto-Clustering
model = ClusterEngine(data=df, n_clusters=3)
segments = model.fit_predict()
print(segments)
```

The library is structured for modularity:
| Module | Description |
|---|---|
| `pyquery_core.io` | Input/Output. Smart loaders for Excel, CSV, Parquet, and SQL. |
| `pyquery_core.transforms` | Logic. Atomic data manipulation steps (Filter, Sort, Mutate). |
| `pyquery_core.analytics` | Intelligence. Statistical tests, ML models, and forecasting. |
| `pyquery_core.recipes` | Orchestration. JSON-serializable pipeline definitions. |
| `pyquery_core.jobs` | Async Workers. Background task management for long-running ops. |
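Since recipes are described as JSON-serializable pipeline definitions, a pipeline is plain data and should round-trip through standard `json` losslessly. A quick sketch of what that property buys you (save a pipeline to disk, version-control it, reload it later):

```python
import json

# A pipeline definition is plain data: dicts, lists, strings, numbers.
pipeline = [
    {"type": "filter", "params": {"column": "revenue", "operator": ">", "value": 1000}},
    {"type": "group_by", "params": {"by": "region", "agg": {"revenue": "sum"}}},
]

recipe = json.dumps(pipeline, indent=2)  # write this string to a .json file
restored = json.loads(recipe)            # read it back, feed it to the engine
assert restored == pipeline              # lossless round-trip
```

Keeping pipelines as data rather than code is what makes them shareable and schedulable by the `jobs` workers.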
This is the Core. Code quality here is paramount.
- Fork it.
- Branch it (`git checkout -b feature/fancy-algo`).
- Test it. (If it breaks the engine, we break your PR.)
- Push it.
GPL-3.0. Open source forever. 💖
Made with ☕, 🦀 (Rust), and 💖 by Sudharshan TK
