Commit 5cb26cc

ENH: Add Polars engine to read_csv
- Add PolarsParserWrapper class for polars CSV parsing
- Update type annotations to include 'polars' as valid engine
- Add polars compatibility checks and imports
- Update readers.py to integrate polars engine
- Add comprehensive test suite for polars engine
- Add validation for unsupported options
- Add documentation and implementation notes

Closes #61813
1 parent 728be93 commit 5cb26cc

9 files changed, 744 additions and 8 deletions

POLARS_ENGINE_IMPLEMENTATION.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# Polars Engine Implementation for pandas read_csv

This document summarizes the implementation of the polars engine for pandas' `read_csv` function.

## Files Modified/Created

### 1. Core Implementation
- **`pandas/io/parsers/polars_parser_wrapper.py`** - New file implementing the PolarsParserWrapper class
- **`pandas/_typing.py`** - Updated CSVEngine type to include "polars"
- **`pandas/io/parsers/readers.py`** - Updated to include polars engine support

### 2. Compatibility Support
- **`pandas/compat/polars.py`** - New file for polars compatibility checks
- **`pandas/compat/__init__.py`** - Updated to export HAS_POLARS

### 3. Test Infrastructure
- **`pandas/tests/io/parser/conftest.py`** - Updated to include PolarsParser class and test fixtures
- **`pandas/tests/io/parser/test_polars_engine.py`** - New test file for polars engine specific tests

## Key Features Implemented

### Basic Functionality
- ✅ Reading CSV files with polars engine
- ✅ Converting polars DataFrame to pandas DataFrame
- ✅ Support for file paths and file-like objects
- ✅ Lazy evaluation using polars scan_csv when possible

### Supported Options
- ✅ `sep` - Field delimiter
- ✅ `header` - Row number(s) to use as column names
- ✅ `skiprows` - Lines to skip at start of file
- ✅ `na_values` - Additional strings to recognize as NA/NaN
- ✅ `names` - List of column names to use
- ✅ `usecols` - Return subset of columns (string names only)
- ✅ `nrows` - Number of rows to read
- ✅ `quotechar` - Character used to quote fields
- ✅ `comment` - Character(s) to treat as comment
- ✅ `encoding` - Text encoding to use when reading the file
- ✅ `dtype` - Data type for data or columns (dict mapping)

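As a rough illustration of the supported options, the sketch below translates a few of them into keyword arguments for polars' `read_csv`. The helper name `to_polars_kwargs` is hypothetical and is not taken from the commit. The polars parameter names (`separator`, `null_values`, `n_rows`, `quote_char`, `comment_prefix`, `columns`, `new_columns`, `has_header`, `skip_rows`) are those of recent polars releases. The sketch assumes `header` is an integer or `None`; the actual wrapper may map these options differently.

```python
from __future__ import annotations

from typing import Any

# pandas option -> polars read_csv keyword, for options that map one-to-one
_DIRECT = {
    "sep": "separator",
    "na_values": "null_values",
    "nrows": "n_rows",
    "quotechar": "quote_char",
    "comment": "comment_prefix",
    "usecols": "columns",
    "names": "new_columns",
}


def to_polars_kwargs(**pandas_kwargs: Any) -> dict[str, Any]:
    """Translate a subset of pandas read_csv options (hypothetical helper)."""
    out: dict[str, Any] = {}
    for name, value in pandas_kwargs.items():
        if name in _DIRECT and value is not None:
            out[_DIRECT[name]] = value

    # Assumption: header is an int or None; skiprows is an int or missing.
    header = pandas_kwargs.get("header", 0)
    skiprows = pandas_kwargs.get("skiprows") or 0
    if header is None:
        out["has_header"] = False
        out["skip_rows"] = skiprows
    else:
        # pandas counts the header row after skiprows has been applied
        out["has_header"] = True
        out["skip_rows"] = skiprows + header
    return out
```

For example, `to_polars_kwargs(sep=";", header=None, names=["a", "b"])` yields `separator`, `new_columns`, `has_header=False` and `skip_rows=0`.
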
### Unsupported Options (raises ValueError)
- ❌ `chunksize` - Not supported (similar to pyarrow)
- ❌ `iterator` - Not supported (similar to pyarrow)
- ❌ `skipfooter` - Not supported
- ❌ `float_precision` - Not supported
- ❌ `thousands` - Not supported
- ❌ `memory_map` - Not supported
- ❌ `dialect` - Not supported
- ❌ `quoting` - Not supported
- ❌ `lineterminator` - Not supported
- ❌ `converters` - Not supported
- ❌ `dayfirst` - Not supported
- ❌ `skipinitialspace` - Not supported
- ❌ `low_memory` - Not supported
- ❌ Callable `usecols` - Not supported
- ❌ Dict `na_values` - Not supported

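The validation of unsupported options could look roughly like the sketch below. The function names `find_unsupported` and `validate_polars_options` and the error wording are assumptions made for illustration and do not come from the commit. The sketch only flags options that are passed explicitly, whereas a real engine would have to compare each option against its pandas default.

```python
from __future__ import annotations

from typing import Any

# Options this document lists as unsupported by the polars engine
_UNSUPPORTED = frozenset({
    "chunksize", "iterator", "skipfooter", "float_precision", "thousands",
    "memory_map", "dialect", "quoting", "lineterminator", "converters",
    "dayfirst", "skipinitialspace", "low_memory",
})


def find_unsupported(**kwargs: Any) -> list[str]:
    """Return the names of passed options the polars engine cannot honor."""
    found = sorted(name for name in kwargs if name in _UNSUPPORTED)
    # Two options are only unsupported in a particular form
    if callable(kwargs.get("usecols")):
        found.append("usecols (callable)")
    if isinstance(kwargs.get("na_values"), dict):
        found.append("na_values (dict)")
    return found


def validate_polars_options(**kwargs: Any) -> None:
    """Raise ValueError listing every unsupported option (hypothetical helper)."""
    found = find_unsupported(**kwargs)
    if found:
        raise ValueError(
            "The 'polars' engine does not support: " + ", ".join(found)
        )
```

Collecting all offending names before raising lets a single error message list every unsupported parameter at once.
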
## Performance Benefits

The polars engine is designed to provide:

1. **Fast CSV parsing** - Polars ships a multithreaded CSV reader implemented in Rust
2. **Memory efficiency** - Lazy evaluation where possible
3. **Parallel processing** - Polars can utilize multiple CPU cores
4. **Column pruning** - Only read requested columns when using `usecols`
5. **Predicate pushdown** - A possible future optimization for row filtering (not yet implemented)

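To show what lazy evaluation and column pruning mean in practice, here is a minimal sketch using polars' lazy API directly. The function name `read_csv_lazily` is hypothetical and this is not the code of `PolarsParserWrapper`. It assumes polars and pandas are installed, and `to_pandas()` typically also needs pyarrow.

```python
from __future__ import annotations


def read_csv_lazily(path: str, usecols: list[str] | None = None,
                    nrows: int | None = None):
    """Read a CSV through polars' lazy API and return a pandas DataFrame."""
    import polars as pl  # imported here so this module loads without polars

    lazy = pl.scan_csv(path)         # builds a query plan; no data is read yet
    if usecols is not None:
        lazy = lazy.select(usecols)  # lets polars skip the other columns
    if nrows is not None:
        lazy = lazy.head(nrows)      # the row limit becomes part of the plan
    # Execute the plan, then convert the result for pandas users
    return lazy.collect().to_pandas()
```

Nothing is parsed until `collect()` runs, which is what allows polars to restrict the work to the requested columns and rows.
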
## Usage Examples

```python
import pandas as pd

# Basic usage
df = pd.read_csv("data.csv", engine="polars")

# With options
df = pd.read_csv("data.csv",
                 engine="polars",
                 usecols=["name", "age"],
                 nrows=1000,
                 na_values=["NULL", "N/A"])

# Custom column names
df = pd.read_csv("data.csv",
                 engine="polars",
                 names=["col1", "col2", "col3"],
                 header=None)
```

## Error Handling

The implementation includes comprehensive error handling:

1. **Missing polars dependency** - Graceful ImportError with suggestion to install polars
2. **Unsupported options** - Clear ValueError messages listing unsupported parameters
3. **Polars parsing errors** - Wrapped in pandas ParserError with context
4. **File handling errors** - Proper cleanup and error propagation

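The first and third points can be sketched as follows. This is an illustration under stated assumptions, not the commit's code: `import_engine` and `parse_with_context` are hypothetical names, the message wording is invented, and `ParserError` here is a local stand-in for `pandas.errors.ParserError` (which also subclasses `ValueError`) so that the sketch runs without pandas.

```python
from __future__ import annotations

import importlib
from types import ModuleType


class ParserError(ValueError):
    """Stand-in for pandas.errors.ParserError so the sketch runs without pandas."""


def import_engine(name: str = "polars") -> ModuleType:
    """Import the optional engine or fail with an actionable message."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"{name} is required for engine='{name}'. "
            f"Install it with: pip install {name}"
        ) from err


def parse_with_context(parse, source: str):
    """Run a parse callable and re-raise its failures as ParserError."""
    try:
        return parse(source)
    except Exception as err:
        # Chain the original exception so the polars traceback is preserved
        raise ParserError(
            f"Error parsing {source!r} with polars: {err}"
        ) from err
```

Using `raise ... from err` keeps the underlying exception attached, so users see both the pandas-level error and the original cause.
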
## Testing

A comprehensive test suite has been implemented covering:

- Basic functionality tests
- Option validation tests
- Error condition tests
- Comparison with other engines
- Edge cases and compatibility

## Future Enhancements

Potential improvements for future versions:

1. **Enhanced dtype mapping** - Better support for pandas-specific dtypes
2. **Date parsing** - Leverage polars' built-in date parsing capabilities
3. **Index handling** - More sophisticated index column processing
4. **Streaming support** - Large file processing with minimal memory usage
5. **Schema inference** - Automatic optimal dtype detection

## Documentation Updates

The implementation includes updated documentation:

- Engine parameter documentation in `read_csv` docstring
- Version notes indicating experimental status
- Clear listing of supported and unsupported options

## Implementation Notes

### Design Decisions

1. **Lazy evaluation preferred** - Uses `scan_csv` for file paths when possible
2. **Pandas compatibility first** - All results are converted to a pandas DataFrame
3. **Error parity** - Similar error handling to existing engines
4. **Test infrastructure reuse** - Leverages the existing parser test framework

### Limitations

1. **Experimental status** - Marked as experimental, similar to the pyarrow engine
2. **Option subset** - Supports only a subset of pandas `read_csv` options
3. **Polars dependency** - Requires polars to be installed
4. **Performance trade-off** - Conversion to pandas may negate some performance benefits

This implementation provides a solid foundation for using polars as a high-performance CSV parsing engine within pandas while maintaining compatibility with the existing pandas API.

pandas/_typing.py

Lines changed: 1 addition & 1 deletion
@@ -374,7 +374,7 @@ def closed(self) -> bool:
 WindowingRankType: TypeAlias = Literal["average", "min", "max"]
 
 # read_csv engines
-CSVEngine: TypeAlias = Literal["c", "python", "pyarrow", "python-fwf"]
+CSVEngine: TypeAlias = Literal["c", "python", "pyarrow", "polars", "python-fwf"]
 
 # read_json engines
 JSONEngine: TypeAlias = Literal["ujson", "pyarrow"]

pandas/compat/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@
     WASM,
 )
 from pandas.compat.numpy import is_numpy_dev
+from pandas.compat.polars import HAS_POLARS
 from pandas.compat.pyarrow import (
     HAS_PYARROW,
     pa_version_under12p1,

pandas/compat/polars.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+"""support polars compatibility across versions"""
+
+from __future__ import annotations
+
+from pandas.util.version import Version
+
+try:
+    import polars as pl
+
+    _plv = Version(Version(pl.__version__).base_version)
+    HAS_POLARS = _plv >= Version("0.20.0")  # Minimum version for to_pandas compatibility
+except ImportError:
+    HAS_POLARS = False
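
The effect of comparing on `base_version` is that pre-release and dev suffixes are dropped before the comparison, so a build such as `0.20.0rc1` is treated as `0.20.0`. The standard-library sketch below imitates that behaviour for illustration only. `base_release` and `meets_minimum` are hypothetical helpers, not part of pandas, and they ignore details such as version epochs that `pandas.util.version.Version` handles.

```python
from __future__ import annotations

import re


def base_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '0.20.0rc1' -> (0, 20, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "0.20.0") -> bool:
    """Compare release segments only, padding with zeros to equal length."""
    a, b = base_release(installed), base_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

With this sketch, `meets_minimum("0.20.0rc1")` is true while `meets_minimum("0.19.19")` is false.
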
