Commit 9db83fa (2 parents: 2e86947 + e0a28aa)

feat: save collection to disk via file write (#85)

File tree: 8 files changed, +251 −28 lines

Cargo.lock (1 addition & 1 deletion)

Generated lockfile; diff not rendered.

Cargo.toml (1 addition & 1 deletion)

```diff
@@ -1,6 +1,6 @@
 [package]
 name = "oasysdb"
-version = "0.4.5"
+version = "0.5.0"
 edition = "2021"
 license = "Apache-2.0"
 readme = "readme.md"
```

docs/guide.md (1 addition & 1 deletion)

```diff
@@ -56,7 +56,7 @@ I made this decision to make the indexing algorithm more efficient and performan
 
 By default, due to the nature of the vector indexing algorithm, OasysDB stores the vector record data in memory via the collection interface. This means that unless persisted to disk via the database save collection method, the data will be lost when the program is closed.
 
-Under the hood, OasysDB serializes the collection using [Serde](https://github.com/serde-rs/serde) and saves it to the database file using [Sled](https://github.com/spacejam/sled). Because of this, **whenever you modify a collection, you need to save the collection back to the database to persist the changes to disk.**
+Under the hood, OasysDB serializes the collection to bytes using [Serde](https://github.com/serde-rs/serde) and writes it to a file. The reference to the file is then saved, along with other details, to the database powered by [Sled](https://github.com/spacejam/sled). Because of this, **whenever you modify a collection, you need to save the collection back to the database to persist the changes to disk.**
 
 When opening the database, OasysDB doesn't automatically load the collections from the database file into memory as this would be inefficient if you have many collections you don't necessarily use all the time. Instead, you need to load the collections you want to use into memory manually using the get collection method.
 
```

docs/migrations/0.4.5-to-0.5.0.md (new file, 66 additions)

# Migrating from v0.4.5 to v0.5.0

Due to the breaking changes to the persistence system introduced in v0.5.0, you might need to update your codebase to make it compatible with the new version. This is not required if you are starting a new project from scratch.

## What happened?

In v0.5.0, we introduced a new persistence system that is better optimized for rapidly changing data. Previously, we used Sled to store the serialized collection blobs. We found that it was not the best option for our use case, as each blob could be anywhere between 100 MB and 10 GB.

When the data changes rapidly, collections need to be saved periodically to avoid data loss, which means reserializing them and rewriting them into Sled. The dirty IO buffers during these operations caused storage issues, bloating the space required to store a collection to up to 100x its size.

The new system is better optimized for our use case: the serialized collection data is now written directly to a dedicated file on disk, and Sled is only used to store the collection metadata and the path to where the collection is stored.

## How to migrate?

To migrate OasysDB from v0.4.5 to v0.5.0, I recommend creating a new Rust project and running the migration from there. This migration project reads the data from the old database and writes it to the new database, so it needs access to the database files.

If you are using OasysDB from Python, you might still want to use Rust for the migration, as Rust makes it easy to install both versions of OasysDB in the same project, which the migration requires. I can promise you that the migration process is quite simple and straightforward.

**Friendly Reminder**: Make sure to create a backup of your database files before proceeding 😉

### 1. Install both versions of OasysDB

After setting up the new project, install both versions of OasysDB by specifying the package and the version in the `Cargo.toml` file:

```toml
[dependencies]
odb4 = { package = "oasysdb", version = "0.4.5" }
odb5 = { package = "oasysdb", version = "0.5.0" }
```

### 2. Migrate the database

The following script reads the collections from the old database and writes them to the new database, which is all that's needed for the migration:

```rust
use odb4::prelude::Database;
use odb5::prelude::Database as NewDatabase;

fn main() {
    // Change the path to the database accordingly.
    let db = Database::open("database").unwrap();
    let mut new_db = NewDatabase::new("new-database").unwrap();

    // Collection names you want to migrate.
    let names = vec!["collection_a", "collection_b"];

    // This will read the collections from the old
    // database and write them to the new database.
    for name in names {
        let collection = db.get_collection(name).unwrap();
        new_db.save_collection(name, &collection).unwrap();
    }
}
```

### 3. Verify the migration

After running the script, you can verify the migration by checking the new database files. The new database path should contain a sub-directory called `collections` which stores the serialized collection data. The number of files in this directory should equal the number of collections you migrated.

After the migration, don't forget to point your application to the new database path, or rename the new database directory to the old path, so that your application uses the new database.

## Conclusion

If all the steps are followed correctly, you should have successfully migrated your OasysDB database from v0.4.5 to v0.5.0. If you face any issues during the migration, feel free to reach out to me on our [Discord](https://discord.gg/bDhQrkqNP4).

I will be happy to personally assist you with the migration process 😁

readme.md (22 additions & 8 deletions)

```diff
@@ -32,17 +32,23 @@ OasysDB is very flexible! You can use it for systems related with vector similar
 
 ### Core Features
 
-🔸 **Embedded Database**: Zero setup & no server required.\
-🔸 **Optional Persistence**: In-memory or disk-based collection.\
-🔸 **Incremental Ops**: Modify vectors without rebuilding indexes.\
-🔸 **Flexible Schema**: Store additional metadata for each vector.
+🔸 **Embedded Database**: Zero setup and no dedicated server or process required.
+
+🔸 **Optional Persistence**: In-memory vector collections that can be persisted to disk.
+
+🔸 **Incremental Ops**: Insert, modify, and delete vectors without rebuilding indexes.
+
+🔸 **Flexible Schema**: Store additional and flexible metadata for each vector record.
 
 ### Technical Features
 
-🔹 **Fast HNSW**: Efficient approximate vector similarity search.\
-🔹 **Configurable Metric**: Use Euclidean, Cosine, or other metric.\
-🔹 **Parallel Processing**: Multi-threaded & SIMD optimized calculation.\
-🔹 **Built-in Incremental ID**: No headache vector record management.
+🔹 **Fast HNSW**: Efficient vector similarity search with state-of-the-art algorithm.
+
+🔹 **Configurable Metric**: Use Euclidean, Cosine, or other metric for your specific use-case.
+
+🔹 **Parallel Processing**: Multi-threaded & SIMD-optimized vector distance calculation.
+
+🔹 **Built-in Incremental ID**: No headache record management and efficient storage.
 
 ## Design Philosophy
 
```

````diff
@@ -121,6 +127,14 @@ fn main() {
 }
 ```
 
+## Feature Flags
+
+OasysDB provides several feature flags to enable or disable certain features. You can do this by adding the feature flags to your project `Cargo.toml` file. Below are the available feature flags and their descriptions:
+
+- `json`: Enables easy Serde's JSON conversion from and to the metadata type. This feature is very useful if you have a complex metadata type or if you use APIs that communicate using JSON.
+
+- `gen`: Enables the vector generator trait and modules to extract vector embeddings from your contents using OpenAI or other embedding models. This feature allows OasysDB to handle vector embedding extraction for you without separate dependencies.
+
 # 🚀 Quickstart with Python
 
 ![Python-Banner.png](https://i.postimg.cc/rp1qjBZJ/Python-Banner.png)
````
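For reference, the two flags described in the new readme section would be enabled like any other Cargo feature; a hypothetical snippet (crate version taken from this commit):

```toml
[dependencies]
oasysdb = { version = "0.5.0", features = ["json", "gen"] }
```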

src/db/database.rs (153 additions & 15 deletions)

```diff
@@ -1,10 +1,29 @@
 use super::*;
 
+/// The directory where collections are stored in the database.
+const COLLECTIONS_DIR: &str = "collections";
+
+/// The database record for the persisted vector collection.
+#[derive(Serialize, Deserialize, Debug)]
+pub struct CollectionRecord {
+    /// Name of the collection.
+    pub name: String,
+    /// File path where the collection is stored.
+    pub path: String,
+    /// Number of vector records in the collection.
+    pub count: usize,
+    /// Timestamp when the collection was created.
+    pub created_at: usize,
+    /// Timestamp when the collection was last updated.
+    pub updated_at: usize,
+}
+
 /// The database storing vector collections.
 #[cfg_attr(feature = "py", pyclass(module = "oasysdb.database"))]
 pub struct Database {
     collections: Db,
     count: usize,
+    path: String,
 }
 
 /// Python only methods.
@@ -31,32 +50,62 @@ impl Database {
 #[cfg_attr(feature = "py", pymethods)]
 impl Database {
     /// Gets a collection from the database.
-    /// * `name` - Name of the collection.
+    /// * `name`: Name of the collection.
     pub fn get_collection(&self, name: &str) -> Result<Collection, Error> {
-        let value = self.collections.get(name)?;
-        match value {
-            Some(value) => Ok(bincode::deserialize(&value)?),
-            None => Err(Error::collection_not_found()),
-        }
+        // Retrieve the collection record from the database.
+        let record: CollectionRecord = match self.collections.get(name)? {
+            Some(value) => bincode::deserialize(&value)?,
+            None => return Err(Error::collection_not_found()),
+        };
+
+        self.read_from_file(&record.path)
     }
 
     /// Saves new or update existing collection to the database.
-    /// * `name` - Name of the collection.
-    /// * `collection` - Vector collection to save.
+    /// * `name`: Name of the collection.
+    /// * `collection`: Vector collection to save.
     pub fn save_collection(
         &mut self,
         name: &str,
         collection: &Collection,
     ) -> Result<(), Error> {
+        // This variable is required since some operations require
+        // the write_to_file method to succeed.
        let mut new = false;
 
+        let mut record: CollectionRecord;
+        let path: String;
+
        // Check if it's a new collection.
        if !self.collections.contains_key(name)? {
            new = true;
+            path = self.create_new_collection_path(name)?;
+
+            // Create a new collection record.
+            let timestamp = self.get_timestamp();
+            record = CollectionRecord {
+                name: name.to_string(),
+                path: path.clone(),
+                count: collection.len(),
+                created_at: timestamp,
+                updated_at: timestamp,
+            };
+        } else {
+            let bytes = self.collections.get(name)?.unwrap().to_vec();
+            record = bincode::deserialize(&bytes)?;
+            path = record.path.clone();
+
+            // Update the record values.
+            record.count = collection.len();
+            record.updated_at = self.get_timestamp();
        }
 
-        let value = bincode::serialize(collection)?;
-        self.collections.insert(name, value)?;
+        // Write the collection to a file.
+        self.write_to_file(&path, collection)?;
+
+        // Insert or update the collection record in the database.
+        let bytes = bincode::serialize(&record)?;
+        self.collections.insert(name, bytes)?;
 
        // If it's a new collection, update the count.
        if new {
@@ -67,8 +116,17 @@ impl Database {
    }
 
    /// Deletes a collection from the database.
-    /// * `name` - Collection name to delete.
+    /// * `name`: Collection name to delete.
    pub fn delete_collection(&mut self, name: &str) -> Result<(), Error> {
+        let record: CollectionRecord = match self.collections.get(name)? {
+            Some(value) => bincode::deserialize(&value)?,
+            None => return Err(Error::collection_not_found()),
+        };
+
+        // Delete the collection file first before removing
+        // the reference from the database.
+        self.delete_file(&record.path)?;
+
        self.collections.remove(name)?;
        self.count -= 1;
        Ok(())
@@ -101,26 +159,106 @@ impl Database {
 impl Database {
    /// Re-creates and opens the database at the given path.
    /// This method will delete the database if it exists.
-    /// * `path` - Directory to store the database.
+    /// * `path`: Directory to store the database.
    pub fn new(path: &str) -> Result<Self, Error> {
        // Remove the database dir if it exists.
        if Path::new(path).exists() {
            remove_dir_all(path)?;
        }
 
+        // Setup the directory where collections will be stored.
+        Self::setup_collections_dir(path)?;
+
        // Using sled::Config to prevent name collisions
        // with collection's Config.
        let config = sled::Config::new().path(path);
        let collections = config.open()?;
-        Ok(Self { collections, count: 0 })
+        Ok(Self { collections, count: 0, path: path.to_string() })
    }
 
    /// Opens existing or creates new database.
    /// If the database doesn't exist, it will be created.
-    /// * `path` - Directory to store the database.
+    /// * `path`: Directory to store the database.
    pub fn open(path: &str) -> Result<Self, Error> {
        let collections = sled::open(path)?;
        let count = collections.len();
-        Ok(Self { collections, count })
+        Self::setup_collections_dir(path)?;
+        Ok(Self { collections, count, path: path.to_string() })
    }
+
+    /// Serializes and writes the collection to a file.
+    /// * `path`: File path to write the collection to.
+    /// * `collection`: Vector collection to write.
+    fn write_to_file(
+        &self,
+        path: &str,
+        collection: &Collection,
+    ) -> Result<(), Error> {
+        let data = bincode::serialize(collection)?;
+
+        let file = OpenOptions::new()
+            .create(true)
+            .write(true)
+            .truncate(true)
+            .open(path)?;
+
+        let mut writer = BufWriter::new(file);
+        writer.write_all(&data)?;
+        Ok(())
+    }
+
+    /// Reads and deserializes the collection from a file.
+    /// * `path`: File path to read the collection from.
+    fn read_from_file(&self, path: &str) -> Result<Collection, Error> {
+        let file = OpenOptions::new().read(true).open(path)?;
+        let mut reader = BufReader::new(file);
+        let mut data = Vec::new();
+        reader.read_to_end(&mut data)?;
+
+        // Deserialize the collection.
+        let collection = bincode::deserialize(&data)?;
+        Ok(collection)
+    }
+
+    /// Deletes a file at the given path.
+    fn delete_file(&self, path: &str) -> Result<(), Error> {
+        remove_file(path)?;
+        Ok(())
+    }
+
+    /// Returns the path where the collection will be stored.
+    /// * `name`: Name of the collection.
+    fn create_new_collection_path(&self, name: &str) -> Result<String, Error> {
+        // Hash the collection name to create a unique filename.
+        let mut hasher = DefaultHasher::new();
+        name.hash(&mut hasher);
+        let filename = hasher.finish();
+
+        let path = Path::new(&self.path)
+            .join(COLLECTIONS_DIR)
+            .join(filename.to_string())
+            .to_str()
+            .unwrap()
+            .to_string();
+
+        Ok(path)
+    }
+
+    /// Creates the collections directory on the path if it doesn't exist.
+    fn setup_collections_dir(path: &str) -> Result<(), Error> {
+        let collections_dir = Path::new(path).join(COLLECTIONS_DIR);
+        if !collections_dir.exists() {
+            create_dir_all(&collections_dir)?;
+        }
+
+        Ok(())
+    }
+
+    /// Returns the UNIX timestamp in milliseconds.
+    fn get_timestamp(&self) -> usize {
+        let now = SystemTime::now();
+        // We can unwrap safely since UNIX_EPOCH is always valid.
+        let timestamp = now.duration_since(UNIX_EPOCH).unwrap();
+        timestamp.as_millis() as usize
    }
 }
```
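The `create_new_collection_path` helper in the diff above boils down to hashing the collection name into a filename under the `collections` directory. A std-only sketch of that mapping (the function name and paths here are illustrative, not the crate's API):

```rust
use std::hash::{DefaultHasher, Hash, Hasher};
use std::path::{Path, PathBuf};

/// Map a collection name to its on-disk path, mirroring the
/// hashing scheme used by the commit's create_new_collection_path.
fn collection_path(db_path: &str, name: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    // The 64-bit hash becomes the filename inside the collections dir.
    Path::new(db_path)
        .join("collections")
        .join(hasher.finish().to_string())
}

fn main() {
    let a = collection_path("database", "vectors");
    let b = collection_path("database", "vectors");
    // Deterministic within one build: same name, same file.
    assert_eq!(a, b);
    println!("{}", a.display());
}
```

One caveat worth knowing: `DefaultHasher` does not guarantee the same hash across Rust releases, so the derived filename is only stable for a given toolchain.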

src/db/mod.rs (5 additions & 1 deletion)

```diff
@@ -3,9 +3,13 @@ pub mod database;
 
 use crate::collection::*;
 use crate::func::err::Error;
+use serde::{Deserialize, Serialize};
 use sled::Db;
-use std::fs::remove_dir_all;
+use std::fs::{create_dir_all, remove_dir_all, remove_file, OpenOptions};
+use std::hash::{DefaultHasher, Hash, Hasher};
+use std::io::{BufReader, BufWriter, Read, Write};
 use std::path::Path;
+use std::time::{SystemTime, UNIX_EPOCH};
 
 #[cfg(feature = "py")]
 use pyo3::prelude::*;
```

src/func/err.rs (2 additions & 1 deletion)

```diff
@@ -1,8 +1,9 @@
+use std::fmt::{Display, Formatter, Result};
+
 // Other error types.
 use bincode::ErrorKind as BincodeError;
 use sled::Error as SledError;
 use std::error::Error as StandardError;
-use std::fmt::{Display, Formatter, Result};
 use std::io::Error as IOError;
 
 #[cfg(feature = "py")]
```
