
Commit 4049034

Merge pull request #2 from legout/refactor
Refactor
2 parents f9f75c1 + 8b12348

40 files changed: +13373 −780 lines

README.md

Lines changed: 138 additions & 90 deletions
# PyDala2

<p align="center">
  <img src="logo.jpeg" width="400" alt="PyDala2">
</p>

[![PyPI version](https://badge.fury.io/py/pydala2.svg)](https://badge.fury.io/py/pydala2)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Documentation](https://img.shields.io/badge/docs-latest-blue.svg)](https://pydala2.readthedocs.io)

## Overview 📖

PyDala2 is a high-performance Python library for managing Parquet datasets with advanced metadata capabilities. Built on Apache Arrow, it provides an efficient, user-friendly interface for large-scale data operations, with features including:

- Smart dataset management with metadata optimization
- Multi-format support (Parquet, CSV, JSON)
- Multi-backend integration (Polars, PyArrow, DuckDB, Pandas)
- Advanced querying with predicate pushdown
- Schema management with automatic validation
- Performance optimization with caching and partitioning
- A catalog system for centralized dataset management
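
Predicate pushdown over Parquet metadata is the idea behind several of these features: instead of opening every file, a reader consults per-file min/max column statistics and skips files that cannot match the filter. Here is a minimal, dependency-free sketch of that idea — the `FileStats` class and `prune` function are illustrative only, not part of the PyDala2 API:

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file column statistics, like those stored in Parquet footers."""
    path: str
    min_value: int
    max_value: int


def prune(files: list[FileStats], lower_bound: int) -> list[str]:
    """Return only files whose [min, max] range can contain rows with
    value > lower_bound; every other file is skipped without being read."""
    return [f.path for f in files if f.max_value > lower_bound]


files = [
    FileStats("part-0.parquet", 0, 49),
    FileStats("part-1.parquet", 50, 99),
    FileStats("part-2.parquet", 100, 149),
]

# Filtering on "value > 60" skips part-0 entirely.
print(prune(files, 60))  # ['part-1.parquet', 'part-2.parquet']
```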

## ✨ Key Features

- **🚀 High Performance**: Built on Apache Arrow with optimized memory usage and processing speed
- **📊 Smart Dataset Management**: Efficient Parquet handling with metadata optimization and caching
- **🔄 Multi-backend Support**: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
- **🔍 Advanced Querying**: SQL-like filtering with predicate pushdown for maximum efficiency
- **📋 Schema Management**: Automatic validation, evolution, and tracking of data schemas
- **⚡ Performance Optimization**: Built-in caching, compression, and intelligent partitioning
- **🛡️ Type Safety**: Comprehensive validation and error handling throughout the library
- **🏗️ Catalog System**: Centralized dataset management across namespaces

## 🚀 Quick Start

### Installation

```bash
# Install PyDala2
pip install pydala2

# Install with all optional dependencies
pip install pydala2[all]

# Install with specific backends
pip install pydala2[polars,duckdb]
```

### Basic Usage

```python
from pydala import ParquetDataset
import pandas as pd

# Create a dataset
dataset = ParquetDataset("data/my_dataset")

# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category']
)

# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")

# Export to different formats
df_polars = result.table.to_polars()  # or use the shortcut: result.t.pl
df_pandas = result.table.df           # or result.t.df
duckdb_rel = result.table.ddb         # or result.t.ddb
```
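
Writing with `partition_cols` lays the dataset out hive-style: one `column=value` directory per distinct partition value, so filters on that column only touch the matching directories. A small, dependency-free sketch of how such partition paths are derived (illustrative only, not PyDala2 internals):

```python
from collections import defaultdict


def hive_partition_paths(rows: list[dict], partition_col: str) -> dict[str, list[dict]]:
    """Group rows into hive-style partition directories:
    one 'column=value' directory per distinct partition value."""
    partitions: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        partitions[f"{partition_col}={row[partition_col]}"].append(row)
    return dict(partitions)


rows = [
    {"id": 0, "category": "A", "value": 0},
    {"id": 1, "category": "B", "value": 2},
    {"id": 2, "category": "C", "value": 4},
    {"id": 3, "category": "A", "value": 6},
]
layout = hive_partition_paths(rows, "category")
print(sorted(layout))  # ['category=A', 'category=B', 'category=C']
```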

### Using Different Backends

```python
import polars as pl

# PyDala2 provides automatic backend selection.
# Just access the data in your preferred format:

# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)

# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM dataset
    GROUP BY category
""").to_arrow()

# PyArrow Table (for columnar operations)
table = dataset.table.arrow  # or dataset.t.arrow

# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df  # or dataset.t.df

# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()
```
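
For comparison, the same filter-then-aggregate step can be expressed directly in pandas on the 100-row example frame from Basic Usage — pandas only, no PyDala2 required:

```python
import pandas as pd

# The same example frame as in Basic Usage
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})

# Equivalent of: filter value > 100, then mean(value) per category
avg = (
    data[data["value"] > 100]
    .groupby("category")["value"]
    .mean()
)
print(avg.to_dict())  # {'A': 150.0, 'B': 149.0, 'C': 151.0}
```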

### Catalog Management

```python
from pydala import Catalog

# Create a catalog from a YAML configuration
catalog = Catalog("catalog.yaml")

# Example catalog.yaml:
# tables:
#   sales_2023:
#     path: "/data/sales/2023"
#     filesystem: "local"
#   customers:
#     path: "/data/customers"
#     filesystem: "local"

# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")

# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
```
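
Conceptually, a catalog is just a named registry that maps table names to dataset locations, which a query engine then resolves at query time. A minimal, dependency-free stand-in for that concept — illustrative only, not PyDala2's `Catalog` implementation:

```python
class MiniCatalog:
    """Toy registry mapping table names to dataset paths."""

    def __init__(self) -> None:
        self._tables: dict[str, str] = {}

    def register(self, name: str, path: str) -> None:
        """Make a dataset addressable by a short table name."""
        self._tables[name] = path

    def resolve(self, name: str) -> str:
        """Look up a table's storage path by its registered name."""
        try:
            return self._tables[name]
        except KeyError:
            raise KeyError(f"unknown table: {name}") from None


catalog = MiniCatalog()
catalog.register("sales_2023", "/data/sales/2023")
catalog.register("customers", "/data/customers")
print(catalog.resolve("customers"))  # /data/customers
```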

## 📚 Documentation

Comprehensive documentation is available at [pydala2.readthedocs.io](https://pydala2.readthedocs.io):

### Getting Started

- [Installation Guide](https://pydala2.readthedocs.io/getting-started)
- [Quick Start Tutorial](https://pydala2.readthedocs.io/quick-start)

### User Guide

- [Basic Usage](https://pydala2.readthedocs.io/user-guide/basic-usage)
- [Data Operations](https://pydala2.readthedocs.io/user-guide/data-operations)
- [Performance Optimization](https://pydala2.readthedocs.io/user-guide/performance)
- [Catalog Management](https://pydala2.readthedocs.io/user-guide/catalog-management)
- [Schema Management](https://pydala2.readthedocs.io/user-guide/schema-management)

### API Reference

- [Core Classes](https://pydala2.readthedocs.io/api/core)
- [Dataset Classes](https://pydala2.readthedocs.io/api/datasets)
- [Table Operations](https://pydala2.readthedocs.io/api/table)
- [Metadata Management](https://pydala2.readthedocs.io/api/metadata)
- [Catalog System](https://pydala2.readthedocs.io/api/catalog)
- [Filesystem](https://pydala2.readthedocs.io/api/filesystem)
- [Utilities](https://pydala2.readthedocs.io/api/utilities)

### Advanced Topics

- [Performance Tuning](https://pydala2.readthedocs.io/advanced/performance-tuning)
- [Integration Patterns](https://pydala2.readthedocs.io/advanced/integration)
- [Deployment Guide](https://pydala2.readthedocs.io/advanced/deployment)
- [Troubleshooting](https://pydala2.readthedocs.io/advanced/troubleshooting)

## 🤝 Contributing

Contributions are welcome! Please see our [Contributing Guide](https://pydala2.readthedocs.io/contributing) for details.

## 📝 License

[MIT License](LICENSE)