This file provides guidance to coding agents collaborating on this repository.
Lance is a modern columnar data format optimized for ML workflows and datasets. It provides:
- High-performance random access
- Vector search
- Zero-copy, automatic versioning
- Ecosystem integrations
Lance aims to be the de facto standard columnar data format for machine learning and large language models.
- Always use English in code, examples, and comments.
- Features should be implemented concisely, maintainably, and efficiently.
- Code is not just for execution, but also for readability.
- Only add meaningful comments and tests.
The project is organized as a Rust workspace with Python and Java bindings. Rust crates (workspace members unless noted) include:
- `rust/examples/` - Sample binaries and demonstrations.
- `rust/lance/` - Main Lance library implementing the columnar format.
- `rust/lance-arrow/` - Apache Arrow integration layer.
- `rust/lance-core/` - Core types, traits, and utilities.
- `rust/lance-datagen/` - Data generation helpers for tests and benchmarks.
- `rust/lance-encoding/` - Data encoding and compression algorithms.
- `rust/lance-file/` - File format reading/writing.
- `rust/lance-geo/` - Geospatial data support.
- `rust/lance-index/` - Vector and scalar indexing implementations.
- `rust/lance-io/` - I/O operations and object store integration.
- `rust/lance-linalg/` - Linear algebra operations for vector search.
- `rust/lance-namespace/` - Namespace/catalog interfaces.
- `rust/lance-namespace-impls/` - Concrete namespace/catalog implementations.
- `rust/lance-table/` - Table format and operations.
- `rust/lance-test-macros/` - Procedural macros for testing.
- `rust/lance-testing/` - Shared test utilities.
- `rust/lance-tools/` - CLI and developer tooling.
- `rust/compression/bitpacking/` - Bit-packing codec implementation.
- `rust/compression/fsst/` - Fast string compression (FSST).
- `rust/lance-datafusion/` - DataFusion integration helpers (present in repo; built separately from the default workspace).
- `python/` - Python bindings using PyO3/maturin.
- `java/` - Java bindings using JNI.
- Check for build errors: `cargo check --workspace --tests --benches`
- Run tests: `cargo test --workspace`
- Run a specific test: `cargo test -p <package> <test_name>`
- Lint: `cargo clippy --all --tests --benches -- -D warnings`
- Format: `cargo fmt --all`
Use the makefile for most actions:
- Build: `maturin develop`
- Test: `make test`
- Run a single test: `pytest python/tests/<test_file>.py::<test_name>`
- Doctest: `make doctest`
- Lint: `make lint`
- Format: `make format`
- Start required services: `cd test_data && docker compose up -d`
- Run S3/DynamoDB tests: `AWS_DEFAULT_REGION=us-east-1 pytest --run-integration python/tests/test_s3_ddb.py`
- Performance profiling: `maturin develop --release -m python/Cargo.toml -E benchmarks`, then `python python/benchmarks/test_knn.py --iterations 100`
- Async-first architecture: Heavy use of tokio and async/await throughout the Rust codebase
- Arrow-native: All data operations work directly with Apache Arrow arrays
- Version Control: Every write creates a new version with manifest tracking
- Indexing: Supports both vector indices (for similarity search) and scalar indices (BTree, inverted)
- Encoding: Custom encodings optimized for ML data patterns
- Object Store: Unified interface for local, S3, Azure, GCS storage
- All public APIs should have comprehensive documentation with examples
- Performance-critical code uses SIMD optimizations where available
- Always rebuild the Python extension after Rust changes using `maturin develop`
- Integration tests require Docker for local S3/DynamoDB emulation
- Use feature flags to control dependencies (e.g., `datafusion` for SQL support)
Code standards:
- Be mindful of memory use:
  - When dealing with streams of `RecordBatch`, avoid collecting all data into memory whenever possible.
  - Use `RoaringBitmap` instead of `HashSet<u32>`.
Tests:
- When writing unit tests, prefer using the `memory://` URI instead of creating a temporary directory.
- Use rstest to generate parameterized tests to cover more cases with fewer lines of code.
  - Use the syntax `#[case::{name}(...)]` to provide human-readable names for each case.
- For backwards compatibility, use the `test_data` directory to check in datasets written with older library versions.
  - Check in a `datagen.py` that creates the test data. It should assert the version of Lance used as part of the script.
  - Use `pip install pylance=={version}` and then run `python datagen.py` to create the dataset. The data files should be checked into git.
  - Use `copy_test_data_to_tmp` to read this data in Lance.
- Avoid using `ignore` in doctests. For APIs with complex inputs, like methods on `Dataset`, instead write Rust doctests that just compile a function. This guarantees that the example code compiles and is in sync with the API. For example:

```rust
/// ```
/// # use lance::{Dataset, Result};
/// # async fn test(dataset: &Dataset) -> Result<()> {
/// dataset.delete("id = 25").await?;
/// # Ok(())
/// # }
/// ```
```
Please consider the following when reviewing code contributions.
- Design public APIs so they can be evolved easily in the future without breaking changes. Often this means using builder patterns or options structs instead of long argument lists.
- For public APIs, prefer inputs that use `Into<T>` or `AsRef<T>` traits to allow more flexible inputs. For example, use `name: impl Into<String>` instead of `name: String`, so we don't have to write `func("my_string".to_string())`.
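A small illustration of this guideline (the `Field` struct and its constructor are hypothetical examples, not Lance APIs):

```rust
/// Hypothetical struct, used only to illustrate the pattern.
struct Field {
    name: String,
}

impl Field {
    /// Accepting `impl Into<String>` lets callers pass a &str, an owned
    /// String, or anything else convertible, without `.to_string()`.
    fn new(name: impl Into<String>) -> Self {
        Field { name: name.into() }
    }
}

fn main() {
    let a = Field::new("id");               // &str works directly
    let b = Field::new(String::from("id")); // so does an owned String
    assert_eq!(a.name, b.name);
}
```

The same idea applies to path-like parameters, where `impl AsRef<Path>` accepts `&str`, `String`, and `PathBuf` alike.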
- Ensure all new public APIs have documentation and examples.
- Ensure that all bugfixes and features have corresponding tests. We do not merge code without tests.
- New features must include updates to the Rust documentation comments. Link to relevant structs and methods to increase the value of documentation.