This file provides guidance to coding agents collaborating on this repository.
Lance is a modern columnar data format optimized for ML workflows and datasets. It provides:
- High-performance random access
- Vector search
- Zero-copy, automatic versioning
- Ecosystem integrations
Lance aims to be the de facto standard columnar data format for machine learning and large language models.
- Always use English in code, examples, and comments.
- Features should be implemented concisely, maintainably, and efficiently.
- Code is written not only to be executed, but also to be read.
- Only add meaningful comments and tests.
The project is organized as a Rust workspace with Python and Java bindings:
- `rust/lance/` - Main Lance library implementing the columnar format
- `rust/lance-arrow/` - Apache Arrow integration layer
- `rust/lance-core/` - Core types, traits, and utilities
- `rust/lance-encoding/` - Data encoding and compression algorithms
- `rust/lance-file/` - File format reading/writing
- `rust/lance-index/` - Vector and scalar indexing implementations
- `rust/lance-io/` - I/O operations and object store integration
- `rust/lance-linalg/` - Linear algebra operations for vector search
- `rust/lance-table/` - Table format and operations
- `rust/lance-datafusion/` - DataFusion query engine integration
- `python/` - Python bindings using PyO3/maturin
- `java/` - Java bindings using JNI
- Check for build errors: `cargo check --all --tests --benches`
- Run tests: `cargo test`
- Run a specific test: `cargo test -p <package> <test_name>`
- Lint: `cargo clippy --all --tests --benches -- -D warnings`
- Format: `cargo fmt --all`
Use the makefile for most actions:
- Build: `maturin develop`
- Test: `make test`
- Run a single test: `pytest python/tests/<test_file>.py::<test_name>`
- Doctest: `make doctest`
- Lint: `make lint`
- Format: `make format`
```bash
# Start required services
cd test_data && docker compose up -d

# Run S3/DynamoDB tests
AWS_DEFAULT_REGION=us-east-1 pytest --run-integration python/tests/test_s3_ddb.py

# Performance profiling
maturin develop --release -m python/Cargo.toml -E benchmarks
python python/benchmarks/test_knn.py --iterations 100
```
- Async-first Architecture: Heavy use of tokio and async/await throughout the Rust codebase
- Arrow-native: All data operations work directly with Apache Arrow arrays
- Version Control: Every write creates a new version with manifest tracking
- Indexing: Supports both vector indices (for similarity search) and scalar indices (BTree, inverted)
- Encoding: Custom encodings optimized for ML data patterns
- Object Store: Unified interface for local, S3, Azure, GCS storage
- All public APIs should have comprehensive documentation with examples
- Performance-critical code uses SIMD optimizations where available
- Always rebuild the Python extension after Rust changes using `maturin develop`
- Integration tests require Docker for local S3/DynamoDB emulation
- Use feature flags to control dependencies (e.g., `datafusion` for SQL support)
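As an illustrative sketch of the feature-flag pattern (the version number is an assumption, not the actual Lance manifest), an optional dependency is gated in `Cargo.toml` like this:

```toml
# Illustrative fragment, not the real Lance manifest.
[dependencies]
datafusion = { version = "43", optional = true }  # version is a placeholder

[features]
default = []
# Enabling this feature pulls in the optional dependency for SQL support.
datafusion = ["dep:datafusion"]
```

Code that needs the dependency is then gated with `#[cfg(feature = "datafusion")]`, so downstream users who don't need SQL support avoid the extra compile time.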
Code standards:
- Be mindful of memory use:
  - When dealing with streams of `RecordBatch`, avoid collecting all data into memory whenever possible.
  - Use `RoaringBitmap` instead of `HashSet<u32>`.
Tests:
- When writing unit tests, prefer using the `memory://` URI instead of creating a temporary directory.
- Use rstest to generate parameterized tests to cover more cases with fewer lines of code.
  - Use the syntax `#[case::{name}(...)]` to provide human-readable names for each case.
- For backwards compatibility, use the `test_data` directory to check in datasets written with older library versions.
  - Check in a `datagen.py` that creates the test data. It should assert the version of Lance used as part of the script.
  - Use `pip install pylance=={version}` and then run `python datagen.py` to create the dataset. The data files should be checked into git.
  - Use `copy_test_data_to_tmp` to read this data in Lance tests.
- Avoid using `ignore` in doctests. For APIs with complex inputs, like methods on `Dataset`, instead write Rust doctests that just compile a function. This guarantees that the example code compiles and is in sync with the API. For example:

  ```rust
  /// ```
  /// # use lance::{Dataset, Result};
  /// # async fn test(dataset: &Dataset) -> Result<()> {
  /// dataset.delete("id = 25").await?;
  /// # Ok(())
  /// # }
  /// ```
  ```
Please consider the following when reviewing code contributions.
- Design public APIs so they can be evolved easily in the future without breaking changes. Often this means using builder patterns or options structs instead of long argument lists.
- For public APIs, prefer inputs that use `Into<T>` or `AsRef<T>` traits to allow more flexible inputs. For example, use `name: Into<String>` instead of `name: String`, so we don't have to write `func("my_string".to_string())`.
- Ensure all new public APIs have documentation and examples.
- Ensure that all bugfixes and features have corresponding tests. We do not merge code without tests.
- New features must include updates to the rust documentation comments. Link to relevant structs and methods to increase the value of documentation.
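To make the first two review points concrete, here is a std-only sketch; `ScanOptions` is a hypothetical type, not a real Lance API:

```rust
// Hypothetical sketch: an options struct with a builder-style interface,
// so new knobs can be added later without breaking existing call sites.
#[derive(Debug, Default, Clone)]
pub struct ScanOptions {
    pub columns: Vec<String>,
    pub limit: Option<usize>,
}

impl ScanOptions {
    pub fn new() -> Self {
        Self::default()
    }

    /// Accepts anything convertible into `String`, so callers can pass
    /// `&str` directly instead of writing `"id".to_string()`.
    pub fn column(mut self, name: impl Into<String>) -> Self {
        self.columns.push(name.into());
        self
    }

    pub fn limit(mut self, limit: usize) -> Self {
        self.limit = Some(limit);
        self
    }
}
```

A caller writes `ScanOptions::new().column("id").limit(10)`; adding a new option later is then a non-breaking change, unlike extending a positional argument list.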