| 1 | +# Polars Engine Implementation for pandas read_csv |
| 2 | + |
| 3 | +This document summarizes the implementation of the polars engine for pandas' `read_csv` function. |
| 4 | + |
| 5 | +## Files Modified/Created |
| 6 | + |
| 7 | +### 1. Core Implementation |
| 8 | +- **`pandas/io/parsers/polars_parser_wrapper.py`** - New file implementing the PolarsParserWrapper class |
| 9 | +- **`pandas/_typing.py`** - Updated CSVEngine type to include "polars" |
| 10 | +- **`pandas/io/parsers/readers.py`** - Updated to include polars engine support |
| 11 | + |
| 12 | +### 2. Compatibility Support |
| 13 | +- **`pandas/compat/polars.py`** - New file for polars compatibility checks |
| 14 | +- **`pandas/compat/__init__.py`** - Updated to export HAS_POLARS |
| 15 | + |
| 16 | +### 3. Test Infrastructure |
| 17 | +- **`pandas/tests/io/parser/conftest.py`** - Updated to include PolarsParser class and test fixtures |
| 18 | +- **`pandas/tests/io/parser/test_polars_engine.py`** - New test file for polars engine specific tests |
| 19 | + |
| 20 | +## Key Features Implemented |
| 21 | + |
| 22 | +### Basic Functionality |
| 23 | +- ✅ Reading CSV files with polars engine |
| 24 | +- ✅ Converting polars DataFrame to pandas DataFrame |
| 25 | +- ✅ Support for file paths and file-like objects |
| 26 | +- ✅ Lazy evaluation via polars' `scan_csv` when the input is a file path |
| 27 | + |
| 28 | +### Supported Options |
| 29 | +- ✅ `sep` - Field delimiter |
| 30 | +- ✅ `header` - Row number(s) to use as column names |
| 31 | +- ✅ `skiprows` - Lines to skip at start of file |
| 32 | +- ✅ `na_values` - Additional strings to recognize as NA/NaN |
| 33 | +- ✅ `names` - List of column names to use |
| 34 | +- ✅ `usecols` - Return subset of columns (string names only) |
| 35 | +- ✅ `nrows` - Number of rows to read |
| 36 | +- ✅ `quotechar` - Character used to quote fields |
| 37 | +- ✅ `comment` - Character(s) to treat as comment |
| 38 | +- ✅ `encoding` - Encoding used to decode the file (e.g. `"utf-8"`) |
| 39 | +- ✅ `dtype` - Data type for data or columns (dict mapping) |
| 40 | + |
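The supported options have to be renamed before they reach polars. The sketch below shows one way that translation might look. The mapping table and the `translate_options` helper are hypothetical (the PR's actual code is not shown here), and the polars keyword names are those of `polars.read_csv` in recent releases, so older versions may differ.

```python
from typing import Any

# Hypothetical pandas -> polars keyword table (not taken from the PR).
_PANDAS_TO_POLARS = {
    "sep": "separator",
    "skiprows": "skip_rows",
    "na_values": "null_values",
    "names": "new_columns",
    "usecols": "columns",
    "nrows": "n_rows",
    "quotechar": "quote_char",
    "comment": "comment_prefix",
    "encoding": "encoding",
    "dtype": "schema_overrides",
}


def translate_options(pandas_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Rename pandas-style read_csv options to polars-style keywords."""
    polars_kwargs: dict[str, Any] = {}

    # pandas uses header=None for "no header row"; polars uses a boolean.
    if "header" in pandas_kwargs:
        polars_kwargs["has_header"] = pandas_kwargs["header"] is not None

    for name, value in pandas_kwargs.items():
        if name == "header" or value is None:
            continue  # already handled, or not set by the caller
        if name not in _PANDAS_TO_POLARS:
            raise ValueError(f"no polars equivalent known for option {name!r}")
        # Values pass through unchanged; converting dtype values to polars
        # dtypes is out of scope for this sketch.
        polars_kwargs[_PANDAS_TO_POLARS[name]] = value
    return polars_kwargs
```

A real wrapper would also need to convert the values themselves (for example NumPy dtypes to polars dtypes); this sketch only covers the renaming step.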
| 41 | +### Unsupported Options (raise `ValueError`) |
| 42 | +- ❌ `chunksize` - Not supported (similar to pyarrow) |
| 43 | +- ❌ `iterator` - Not supported (similar to pyarrow) |
| 44 | +- ❌ `skipfooter` - Not supported |
| 45 | +- ❌ `float_precision` - Not supported |
| 46 | +- ❌ `thousands` - Not supported |
| 47 | +- ❌ `memory_map` - Not supported |
| 48 | +- ❌ `dialect` - Not supported |
| 49 | +- ❌ `quoting` - Not supported |
| 50 | +- ❌ `lineterminator` - Not supported |
| 51 | +- ❌ `converters` - Not supported |
| 52 | +- ❌ `dayfirst` - Not supported |
| 53 | +- ❌ `skipinitialspace` - Not supported |
| 54 | +- ❌ `low_memory` - Not supported |
| 55 | +- ❌ Callable `usecols` - Not supported |
| 56 | +- ❌ Dict `na_values` - Not supported |
| 57 | + |
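The check for unsupported options could be sketched roughly as follows. The helper names and the error message are hypothetical; the option names come from the list above, and the default values are the documented `pandas.read_csv` defaults, so an option only counts as "used" when the caller changed it.

```python
from typing import Any

# Options the polars engine rejects, paired with their pandas defaults.
_UNSUPPORTED_DEFAULTS: dict[str, Any] = {
    "chunksize": None,
    "iterator": False,
    "skipfooter": 0,
    "float_precision": None,
    "thousands": None,
    "memory_map": False,
    "dialect": None,
    "quoting": 0,  # csv.QUOTE_MINIMAL
    "lineterminator": None,
    "converters": None,
    "dayfirst": False,
    "skipinitialspace": False,
    "low_memory": True,
}


def find_unsupported(kwargs: dict[str, Any]) -> list[str]:
    """Return the unsupported options that were set to a non-default value."""
    bad = [
        name
        for name, default in _UNSUPPORTED_DEFAULTS.items()
        if name in kwargs and kwargs[name] != default
    ]
    # Two options are supported only in some forms.
    if callable(kwargs.get("usecols")):
        bad.append("usecols (callable)")
    if isinstance(kwargs.get("na_values"), dict):
        bad.append("na_values (dict)")
    return bad


def validate_options(kwargs: dict[str, Any]) -> None:
    """Raise ValueError naming every unsupported option in one message."""
    bad = find_unsupported(kwargs)
    if bad:
        raise ValueError(
            f"The 'polars' engine does not support these options: {', '.join(bad)}"
        )
```

Collecting every offending option before raising lets the user fix all of them in one pass instead of hitting one error per option.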
| 58 | +## Performance Benefits |
| 59 | + |
| 60 | +The polars engine is designed to provide: |
| 61 | + |
| 62 | +1. **Fast CSV parsing** - Polars' CSV reader is implemented in Rust and is multithreaded |
| 63 | +2. **Memory efficiency** - Lazy evaluation (`scan_csv`) where possible |
| 64 | +3. **Parallel processing** - Polars can use multiple CPU cores |
| 65 | +4. **Column pruning** - Only the columns requested via `usecols` are read |
| 66 | +5. **Predicate pushdown** - Not yet implemented; a possible future optimization for row filtering |
| 67 | + |
| 68 | +## Usage Examples |
| 69 | + |
| 70 | +```python |
| 71 | +import pandas as pd |
| 72 | + |
| 73 | +# Basic usage |
| 74 | +df = pd.read_csv("data.csv", engine="polars") |
| 75 | + |
| 76 | +# With options |
| 77 | +df = pd.read_csv("data.csv", |
| 78 | +                 engine="polars", |
| 79 | +                 usecols=["name", "age"], |
| 80 | +                 nrows=1000, |
| 81 | +                 na_values=["NULL", "N/A"]) |
| 82 | + |
| 83 | +# Custom column names |
| 84 | +df = pd.read_csv("data.csv", |
| 85 | +                 engine="polars", |
| 86 | +                 names=["col1", "col2", "col3"], |
| 87 | +                 header=None) |
| 88 | +``` |
| 89 | + |
| 90 | +## Error Handling |
| 91 | + |
| 92 | +The implementation handles the following error cases: |
| 93 | + |
| 94 | +1. **Missing polars dependency** - Graceful ImportError with suggestion to install polars |
| 95 | +2. **Unsupported options** - Clear ValueError messages listing unsupported parameters |
| 96 | +3. **Polars parsing errors** - Wrapped in pandas ParserError with context |
| 97 | +4. **File handling errors** - Proper cleanup and error propagation |
| 98 | + |
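As an illustration of the first and third points, the sketch below shows an import guard and an error wrapper. Everything here is an assumption rather than the PR's code: `import_optional` and `parse_with_context` are hypothetical names, and `ParserError` is a local stand-in for `pandas.errors.ParserError` so the example runs without pandas installed.

```python
import importlib
from types import ModuleType
from typing import Any, Callable


class ParserError(ValueError):
    """Stand-in for pandas.errors.ParserError (which also subclasses ValueError)."""


def import_optional(name: str = "polars") -> ModuleType:
    """Import an optional dependency, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"Missing optional dependency '{name}'. "
            f"Use pip or conda to install {name}."
        ) from err


def parse_with_context(parse: Callable[[Any], Any], source: Any) -> Any:
    """Run a parse callable and re-raise any failure as ParserError."""
    try:
        return parse(source)
    except Exception as err:
        # A real wrapper would catch polars' own exception types here
        # instead of the broad Exception used for this sketch.
        raise ParserError(
            f"Error while parsing CSV with the polars engine: {err}"
        ) from err
```

Chaining with `from err` keeps the original polars traceback available to the user while presenting the pandas-style exception type.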
| 99 | +## Testing |
| 100 | + |
| 101 | +The test suite covers: |
| 102 | + |
| 103 | +- Basic functionality tests |
| 104 | +- Option validation tests |
| 105 | +- Error condition tests |
| 106 | +- Comparison with other engines |
| 107 | +- Edge cases and compatibility |
| 108 | + |
| 109 | +## Future Enhancements |
| 110 | + |
| 111 | +Potential improvements for future versions: |
| 112 | + |
| 113 | +1. **Enhanced dtype mapping** - Better support for pandas-specific dtypes |
| 114 | +2. **Date parsing** - Leverage polars' built-in date parsing capabilities |
| 115 | +3. **Index handling** - More sophisticated index column processing |
| 116 | +4. **Streaming support** - Large file processing with minimal memory usage |
| 117 | +5. **Schema inference** - Automatic optimal dtype detection |
| 118 | + |
| 119 | +## Documentation Updates |
| 120 | + |
| 121 | +The implementation includes updated documentation: |
| 122 | + |
| 123 | +- Engine parameter documentation in `read_csv` docstring |
| 124 | +- Version notes indicating experimental status |
| 125 | +- Clear listing of supported and unsupported options |
| 126 | + |
| 127 | +## Implementation Notes |
| 128 | + |
| 129 | +### Design Decisions |
| 130 | + |
| 131 | +1. **Lazy evaluation preferred** - Uses `scan_csv` for file paths when possible |
| 132 | +2. **Pandas compatibility first** - All results converted to pandas DataFrame |
| 133 | +3. **Error parity** - Similar error handling to existing engines |
| 134 | +4. **Test infrastructure reuse** - Leverages existing parser test framework |
| 135 | + |
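The first design decision amounts to a dispatch on the input type. A minimal sketch, assuming the wrapper picks `scan_csv` for paths and falls back to the eager `read_csv` for file-like objects; the `pick_polars_reader` helper is hypothetical and returns only the function name, so it runs without polars installed.

```python
import io
import os
from typing import Any


def pick_polars_reader(source: Any) -> str:
    """Return the name of the polars function a wrapper might call for `source`."""
    if isinstance(source, (str, os.PathLike)):
        # File paths can be scanned lazily, which allows column pruning.
        return "scan_csv"
    if hasattr(source, "read"):
        # Open handles and in-memory buffers are read eagerly.
        return "read_csv"
    raise TypeError(f"unsupported CSV source type: {type(source).__name__}")


print(pick_polars_reader("data.csv"))                 # prints "scan_csv"
print(pick_polars_reader(io.StringIO("a,b\n1,2\n")))  # prints "read_csv"
```

With polars installed, the lazy branch would end with `LazyFrame.collect()`, and both branches would finish with `DataFrame.to_pandas()` so that callers always receive a pandas DataFrame.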
| 136 | +### Limitations |
| 137 | + |
| 138 | +1. **Experimental status** - Marked as experimental, like the pyarrow engine |
| 139 | +2. **Option subset** - Supports only a subset of pandas `read_csv` options |
| 140 | +3. **Polars dependency** - Requires polars to be installed |
| 141 | +4. **Performance trade-off** - Converting the result to pandas may negate some of the performance benefits |
| 142 | + |
| 143 | +This implementation lets pandas use polars as a CSV parsing engine while keeping the existing `read_csv` API unchanged. |