This library brings a disciplined structure to data reconciliation, providing deterministic conflict resolution, data validation and quality checks, clean merging primitives, and a trustworthy source of truth—so engineers can focus on building systems, not untangling inconsistent or unreliable records. It exposes the properties and attributes of datasets to make differences, inconsistencies, and risks explicit.
In its initial versions, the library focuses on reconciliation, producing deterministic insights and reports that explain how disparate data silos relate to one another and how they can be merged into a single, unified source of truth. This enables data store migrations and integrations to surface gaps when discovered and resolve them efficiently, before trust is lost downstream.
In Mesopotamian (Sumerian) mythology, Nisaba is the goddess of writing, accounting, and the orderly keeping of records, entrusted with maintaining clarity across ledgers and knowledge archives.
- Reconciliation-first architecture: Establishes dataset equivalence across systems as the strongest guarantee of correctness, using LanceDB for vector persistence and similarity search and FastEmbed for embedding generation.
- Deterministic reconciliation engine: Produces order-independent, repeatable results suitable for CI and automated workflows.
- Cross-store data support: Unified handling of tabular data across SQL (MySQL, PostgreSQL, SQLite), NoSQL (MongoDB), and file formats (CSV, Excel, Parquet).
- Store-agnostic internal data model: Logical representation of data decoupled from physical storage or format.
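The determinism guarantee above can be illustrated with a small, library-independent sketch (the `Column` type and function names here are illustrative stand-ins, not part of the nisaba API): canonicalizing column sets before comparison makes the equivalence check independent of the order in which columns were discovered.

```rust
use std::collections::BTreeMap;

// Illustrative only: a logical column description decoupled from any store.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Column {
    name: String,
    ty: String,
}

// Canonicalize a schema into an ordered map so the comparison does not
// depend on discovery order.
fn canonicalize(cols: &[Column]) -> BTreeMap<String, String> {
    cols.iter().map(|c| (c.name.clone(), c.ty.clone())).collect()
}

fn schemas_equivalent(a: &[Column], b: &[Column]) -> bool {
    canonicalize(a) == canonicalize(b)
}

fn main() {
    let csv = vec![
        Column { name: "id".into(), ty: "INTEGER".into() },
        Column { name: "email".into(), ty: "TEXT".into() },
    ];
    // Same columns, discovered in a different order.
    let parquet = vec![
        Column { name: "email".into(), ty: "TEXT".into() },
        Column { name: "id".into(), ty: "INTEGER".into() },
    ];
    assert!(schemas_equivalent(&csv, &parquet));
    println!("equivalent");
}
```

This is why the results are suitable for CI: the same inputs always produce the same verdict, regardless of traversal order.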
To get started, add the crate to your `Cargo.toml`:

```toml
[dependencies]
nisaba = { version = "0.2.0" }
```

Prefer starting from the example and the generated docs, or:
```rust
use nisaba::{
    AnalyzerConfig, DistanceType, EmbeddingModel, FileStoreType, SchemaAnalyzer, ScoringConfig,
    SimilarityConfig, Source,
};

#[tokio::main]
async fn main() {
    let config = AnalyzerConfig::builder()
        .sample_size(10)
        .scoring(ScoringConfig {
            type_weight: 0.65,
            structure_weight: 0.35,
        })
        .similarity(SimilarityConfig {
            threshold: 0.59,
            top_k: Some(7),
            algorithm: DistanceType::Cosine,
        })
        .build();

    // Build the analyzer over a primary CSV source and an additional Parquet source.
    let analyzer = SchemaAnalyzer::builder()
        .name("nisaba")
        .config(config)
        .embedding_model(EmbeddingModel::MultilingualE5Small)
        .source(
            Source::files(FileStoreType::Csv)
                .path("./assets/csv")
                .num_rows(10)
                .has_header(true)
                .build()
                .unwrap(),
        )
        .sources(vec![
            Source::files(FileStoreType::Parquet)
                .path("./assets/parquet")
                .num_rows(10)
                .build()
                .unwrap(),
        ])
        .build()
        .await
        .unwrap();

    let _result = analyzer.analyze().await.unwrap();
}
```

Assume a data engineer discovers multiple schemas/sources with several tables that have long been ignored, and wants to deduce how they are connected and related, both to one another and to the contemporary data store. The engineer would:
- Map out the sources and relevant credentials
- Set up Nisaba StorageConfigs
- Set up the SchemaAnalyzer
- Run the analyzer with the storage configs
- Review the results/report for reconciliation hints
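The final review step can be sketched independently of the library: a minimal, self-contained example of filtering and ranking candidate column matches by the similarity threshold configured above (0.59 in the example). The `MatchHint` type and `hints_above` function are hypothetical illustrations, not nisaba types.

```rust
// Illustrative only: how a reviewer might keep just the hints worth acting on.
#[derive(Debug)]
struct MatchHint {
    left: &'static str,
    right: &'static str,
    score: f32,
}

// Keep candidates at or above the configured similarity threshold,
// sorted so the highest-confidence hints come first.
fn hints_above(candidates: Vec<MatchHint>, threshold: f32) -> Vec<MatchHint> {
    let mut kept: Vec<MatchHint> = candidates
        .into_iter()
        .filter(|m| m.score >= threshold)
        .collect();
    kept.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    kept
}

fn main() {
    let candidates = vec![
        MatchHint { left: "users.email", right: "contacts.mail", score: 0.82 },
        MatchHint { left: "users.id", right: "contacts.zip", score: 0.31 },
    ];
    let kept = hints_above(candidates, 0.59);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].left, "users.email");
}
```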
Successive releases will add further data quality and validation features, as documented in the roadmap.
As with most Rust crates, this library is versioned according to Semantic Versioning. Breaking changes will only be made with good reason, and as infrequently as is feasible. Such changes will generally be made in releases where the major version number is increased (note Cargo's caveat for pre-1.x versions), although limited exceptions may occur. Increases in the minimum supported Rust version (MSRV) are not considered breaking, but will result in a minor version bump.
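Concretely, Cargo's pre-1.0 caveat means the leftmost nonzero version component acts as the compatibility boundary:

```toml
[dependencies]
# "0.2.0" is shorthand for "^0.2.0": Cargo will accept 0.2.x updates
# (e.g. 0.2.1) but will NOT silently upgrade to 0.3.0, because before
# 1.0 the leftmost nonzero component is treated as the major version.
nisaba = { version = "0.2.0" }
```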
See also the changelog for details about changes in recent versions.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.