143 changes: 143 additions & 0 deletions POLARS_ENGINE_IMPLEMENTATION.md
@@ -0,0 +1,143 @@
# Polars Engine Implementation for pandas read_csv

This document summarizes the implementation of the polars engine for pandas' `read_csv` function.

## Files Modified/Created

### 1. Core Implementation
- **`pandas/io/parsers/polars_parser_wrapper.py`** - New file implementing the PolarsParserWrapper class
- **`pandas/_typing.py`** - Updated CSVEngine type to include "polars"
- **`pandas/io/parsers/readers.py`** - Updated to include polars engine support

### 2. Compatibility Support
- **`pandas/compat/polars.py`** - New file for polars compatibility checks
- **`pandas/compat/__init__.py`** - Updated to export HAS_POLARS

### 3. Test Infrastructure
- **`pandas/tests/io/parser/conftest.py`** - Updated to include PolarsParser class and test fixtures
- **`pandas/tests/io/parser/test_polars_engine.py`** - New test file for polars engine specific tests

## Key Features Implemented

### Basic Functionality
- ✅ Reading CSV files with polars engine
- ✅ Converting polars DataFrame to pandas DataFrame
- ✅ Support for file paths and file-like objects
- ✅ Lazy evaluation using polars `scan_csv` when possible

### Supported Options
- ✅ `sep` - Field delimiter
- ✅ `header` - Row number(s) to use as column names
- ✅ `skiprows` - Lines to skip at start of file
- ✅ `na_values` - Additional strings to recognize as NA/NaN
- ✅ `names` - List of column names to use
- ✅ `usecols` - Return subset of columns (string names only)
- ✅ `nrows` - Number of rows to read
- ✅ `quotechar` - Character used to quote fields
- ✅ `comment` - Character(s) to treat as comment
- ✅ `encoding` - Text encoding of the input file (e.g. UTF-8)
- ✅ `dtype` - Data type for data or columns (dict mapping)
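
As a rough illustration of how the supported options could be translated, the sketch below maps `read_csv` keywords to polars keywords. The helper `translate_options` is hypothetical, and the polars keyword names (`separator`, `skip_rows`, `null_values`, `columns`, `n_rows`, `quote_char`, `comment_prefix`, `schema_overrides`, `has_header`, `new_columns`) are assumptions based on recent polars releases, not taken from the PR; several of them were renamed across polars versions.

```python
from __future__ import annotations

from typing import Any

# Assumed pandas -> polars keyword names (recent polars releases).
# The wrapper in the PR may use a different mapping.
_PANDAS_TO_POLARS = {
    "sep": "separator",
    "skiprows": "skip_rows",
    "na_values": "null_values",
    "usecols": "columns",
    "nrows": "n_rows",
    "quotechar": "quote_char",
    "comment": "comment_prefix",
    "encoding": "encoding",
    # dtype values would still need converting to polars dtypes; omitted here.
    "dtype": "schema_overrides",
}


def translate_options(pandas_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: convert read_csv keywords to polars keywords."""
    polars_kwargs: dict[str, Any] = {}
    for name, value in pandas_kwargs.items():
        if value is None or name in ("header", "names"):
            continue  # unset, or handled together below
        # Unknown option names raise KeyError in this sketch.
        polars_kwargs[_PANDAS_TO_POLARS[name]] = value

    # header=None means "no header row"; polars expresses this as has_header=False.
    polars_kwargs["has_header"] = pandas_kwargs.get("header", 0) is not None
    if pandas_kwargs.get("names") is not None:
        polars_kwargs["new_columns"] = list(pandas_kwargs["names"])
    return polars_kwargs
```

Keeping the mapping in a single table makes it easy to adjust when a polars release renames a keyword.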

### Unsupported Options (raises ValueError)
- ❌ `chunksize` (as with the pyarrow engine)
- ❌ `iterator` (as with the pyarrow engine)
- ❌ `skipfooter`
- ❌ `float_precision`
- ❌ `thousands`
- ❌ `memory_map`
- ❌ `dialect`
- ❌ `quoting`
- ❌ `lineterminator`
- ❌ `converters`
- ❌ `dayfirst`
- ❌ `skipinitialspace`
- ❌ `low_memory`
- ❌ Callable `usecols`
- ❌ Dict `na_values`
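
A minimal sketch of how such a check might look, assuming a hypothetical helper `check_unsupported` that receives only the options the caller set explicitly. Treating `None` as "not set" is a simplification; the actual wrapper would more likely compare each value against the pandas default. The error message wording is illustrative only.

```python
from __future__ import annotations

from typing import Any

# Option names copied from the list above.
_UNSUPPORTED = frozenset(
    {
        "chunksize", "iterator", "skipfooter", "float_precision",
        "thousands", "memory_map", "dialect", "quoting", "lineterminator",
        "converters", "dayfirst", "skipinitialspace", "low_memory",
    }
)


def check_unsupported(kwargs: dict[str, Any]) -> None:
    """Hypothetical validator: raise ValueError naming every unsupported option."""
    # Assumption: a value of None means the caller did not set the option.
    bad = sorted(
        name
        for name, value in kwargs.items()
        if name in _UNSUPPORTED and value is not None
    )
    # Two options are only unsupported for particular value types.
    if callable(kwargs.get("usecols")):
        bad.append("callable usecols")
    if isinstance(kwargs.get("na_values"), dict):
        bad.append("dict na_values")
    if bad:
        raise ValueError("The 'polars' engine does not support: " + ", ".join(bad))
```

Collecting all offending names before raising lets the user fix every problem in one pass instead of one error at a time.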

## Performance Benefits

The polars engine is designed to provide:

1. **Fast CSV parsing** - Polars parses CSV with a native reader implemented in Rust
2. **Memory efficiency** - Lazy evaluation where possible
3. **Parallel processing** - Polars can utilize multiple CPU cores
4. **Column pruning** - Only read requested columns when using `usecols`
5. **Predicate pushdown** - Not implemented yet; a possible future optimization for row filtering

## Usage Examples

```python
import pandas as pd

# Basic usage
df = pd.read_csv("data.csv", engine="polars")

# With options
df = pd.read_csv(
    "data.csv",
    engine="polars",
    usecols=["name", "age"],
    nrows=1000,
    na_values=["NULL", "N/A"],
)

# Custom column names
df = pd.read_csv(
    "data.csv",
    engine="polars",
    names=["col1", "col2", "col3"],
    header=None,
)
```

## Error Handling

The implementation handles four classes of errors:

1. **Missing polars dependency** - Graceful ImportError with suggestion to install polars
2. **Unsupported options** - Clear ValueError messages listing unsupported parameters
3. **Polars parsing errors** - Wrapped in pandas ParserError with context
4. **File handling errors** - Proper cleanup and error propagation
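
The first and third items could be sketched roughly as follows. Both helper names are hypothetical, and the local `ParserError` class is a stand-in for `pandas.errors.ParserError` so that the example runs without pandas installed. Catching a bare `Exception` is a simplification; a real wrapper would presumably catch the specific polars exception types.

```python
from __future__ import annotations

import importlib
from types import ModuleType
from typing import Any, Callable


class ParserError(ValueError):
    """Stand-in for pandas.errors.ParserError (also a ValueError subclass)."""


def import_optional(name: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"The '{name}' package is required for this engine. "
            f"Install it with pip or conda."
        ) from err


def parse_with_context(parse: Callable[[], Any], source: str) -> Any:
    """Run a parse callable and re-raise failures as ParserError."""
    try:
        return parse()
    except Exception as err:  # simplification: real code would be narrower
        raise ParserError(
            f"Error parsing {source!r} with the polars engine: {err}"
        ) from err
```

Chaining with `from err` keeps the original polars traceback available while presenting the exception type that pandas users already handle.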

## Testing

The test suite covers:

- Basic functionality tests
- Option validation tests
- Error condition tests
- Comparison with other engines
- Edge cases and compatibility

## Future Enhancements

Potential improvements for future versions:

1. **Enhanced dtype mapping** - Better support for pandas-specific dtypes
2. **Date parsing** - Leverage polars' built-in date parsing capabilities
3. **Index handling** - More sophisticated index column processing
4. **Streaming support** - Large file processing with minimal memory usage
5. **Schema inference** - Automatic optimal dtype detection

## Documentation Updates

The implementation includes updated documentation:

- Engine parameter documentation in `read_csv` docstring
- Version notes indicating experimental status
- Clear listing of supported and unsupported options

## Implementation Notes

### Design Decisions

1. **Lazy evaluation preferred** - Uses `scan_csv` for file paths when possible
2. **Pandas compatibility first** - All results converted to pandas DataFrame
3. **Error parity** - Similar error handling to existing engines
4. **Test infrastructure reuse** - Leverages existing parser test framework
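
The first two decisions might combine along these lines. This is a sketch under assumptions, not the code from the PR: `is_path_like` and `read_to_pandas` are hypothetical names, and the example assumes polars (plus pyarrow, which `to_pandas` relies on) is installed when `read_to_pandas` is called.

```python
from __future__ import annotations

import os
from typing import Any


def is_path_like(source: Any) -> bool:
    """True for str / os.PathLike inputs, False for open file objects."""
    return isinstance(source, (str, os.PathLike))


def read_to_pandas(source: Any, usecols: list[str] | None = None):
    """Hypothetical reader: lazy scan for paths, eager read for buffers."""
    import polars as pl  # imported lazily because polars is optional

    if is_path_like(source):
        lazy = pl.scan_csv(source)
        if usecols is not None:
            # Selecting on the lazy frame lets polars skip unused columns.
            lazy = lazy.select(usecols)
        frame = lazy.collect()
    else:
        # File-like objects cannot be scanned lazily, so read them eagerly.
        frame = pl.read_csv(source, columns=usecols)
    return frame.to_pandas()  # always hand back a pandas DataFrame
```

Importing polars inside the function keeps the module importable when polars is absent, which matches the optional-dependency approach described for `pandas/compat/polars.py`.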

### Limitations

1. **Experimental status** - Marked as experimental similar to pyarrow engine
2. **Option subset** - Only supports subset of pandas read_csv options
3. **Polars dependency** - Requires polars to be installed
4. **Performance trade-off** - Conversion to pandas may negate some performance benefits

This implementation lets pandas use polars as a CSV parsing engine while keeping the existing `read_csv` API.
2 changes: 1 addition & 1 deletion pandas/_typing.py
@@ -374,7 +374,7 @@ def closed(self) -> bool:
WindowingRankType: TypeAlias = Literal["average", "min", "max"]

# read_csv engines
-CSVEngine: TypeAlias = Literal["c", "python", "pyarrow", "python-fwf"]
+CSVEngine: TypeAlias = Literal["c", "python", "pyarrow", "polars", "python-fwf"]

# read_json engines
JSONEngine: TypeAlias = Literal["ujson", "pyarrow"]
1 change: 1 addition & 0 deletions pandas/compat/__init__.py
@@ -23,6 +23,7 @@
    WASM,
)
from pandas.compat.numpy import is_numpy_dev
+from pandas.compat.polars import HAS_POLARS
from pandas.compat.pyarrow import (
    HAS_PYARROW,
    pa_version_under12p1,
13 changes: 13 additions & 0 deletions pandas/compat/polars.py
@@ -0,0 +1,13 @@
"""support polars compatibility across versions"""

from __future__ import annotations

from pandas.util.version import Version

try:
    import polars as pl

    _plv = Version(Version(pl.__version__).base_version)
    HAS_POLARS = _plv >= Version("0.20.0")  # Minimum version for to_pandas compatibility
except ImportError:
    HAS_POLARS = False