Commit 9db83fa (2 parents: 2e86947 + e0a28aa)

feat: save collection to disk via file write (#85)

File tree: 8 files changed, +251 −28 lines

Cargo.lock (1 addition & 1 deletion)

Generated lockfile; diff not rendered.

Cargo.toml (1 addition & 1 deletion)

```diff
@@ -1,6 +1,6 @@
 [package]
 name = "oasysdb"
-version = "0.4.5"
+version = "0.5.0"
 edition = "2021"
 license = "Apache-2.0"
 readme = "readme.md"
```

docs/guide.md (1 addition & 1 deletion)

```diff
@@ -56,7 +56,7 @@ I made this decision to make the indexing algorithm more efficient and performan
 
 By default, due to the nature of the vector indexing algorithm, OasysDB stores the vector record data in memory via the collection interface. This means that unless persisted to disk via the database save collection method, the data will be lost when the program is closed.
 
-Under the hood, OasysDB serializes the collection using [Serde](https://github.com/serde-rs/serde) and saves it to the database file using [Sled](https://github.com/spacejam/sled). Because of this, **whenever you modify a collection, you need to save the collection back to the database to persist the changes to disk.**
+Under the hood, OasysDB serializes the collection to bytes using [Serde](https://github.com/serde-rs/serde) and writes it to a file. The reference to the file is then saved, along with other details, to the database powered by [Sled](https://github.com/spacejam/sled). Because of this, **whenever you modify a collection, you need to save the collection back to the database to persist the changes to disk.**
 
 When opening the database, OasysDB doesn't automatically load the collections from the database file into memory as this would be inefficient if you have many collections you don't necessarily use all the time. Instead, you need to load the collections you want to use into memory manually using the get collection method.
 
```

docs/migrations/0.4.5-to-0.5.0.md (new file, 66 additions)

# Migrating from v0.4.5 to v0.5.0

Due to the breaking changes to the persistence system introduced in v0.5.0, you might need to update your codebase to make it compatible with the new version. This is not required if you are starting a new project from scratch.

## What happened?

In v0.5.0, we introduced a new persistence system that is better optimized for rapidly changing data. Previously, we used Sled to store the serialized collection blobs. We found that it was not the best option for our use case, as each blob could be anywhere between 100 MB and 10 GB.

When the data changes rapidly, collections need to be saved periodically to avoid data loss, which means reserializing them and rewriting them into Sled. The dirty IO buffers during these operations caused storage issues, bloating the space required to store a collection to up to 100x its size.

The new system is better optimized for our use case: the serialized collection data is now written directly to a dedicated file on disk, and Sled is only used to store the collection metadata and the path to where the collection is stored.

## How to migrate?

To migrate OasysDB from v0.4.5 to v0.5.0, I recommend creating a new Rust project and running the migration from there. This migration project reads the data from the old database and writes it to the new database, so it needs access to the database files.

If you are using OasysDB from Python, you might still want to use Rust for the migration, as Rust makes it easy to install both versions of OasysDB in the same project, which the migration requires. I can promise you that the migration process is quite simple and straightforward.

**Friendly Reminder**: Make sure to create a backup of your database files before proceeding 😉

### 1. Install both versions of OasysDB

After setting up the new project, install both versions of OasysDB by specifying the package and the version in the `Cargo.toml` file:

```toml
[dependencies]
odb4 = { package = "oasysdb", version = "0.4.5" }
odb5 = { package = "oasysdb", version = "0.5.0" }
```

### 2. Migrate the database

The following script reads the collections from the old database and writes them to the new database, which is all that's needed for the migration:

```rust
use odb4::prelude::Database;
use odb5::prelude::Database as NewDatabase;

fn main() {
    // Change the path to the database accordingly.
    let db = Database::open("database").unwrap();
    let mut new_db = NewDatabase::new("new-database").unwrap();

    // Collection names you want to migrate.
    let names = vec!["collection_a", "collection_b"];

    // This will read the collections from the old
    // database and write them to the new database.
    for name in names {
        let collection = db.get_collection(name).unwrap();
        new_db.save_collection(name, &collection).unwrap();
    }
}
```

### 3. Verify the migration

After running the script, you can verify the migration by checking the new database files. The new database path should contain a sub-directory called `collections` which stores the serialized collection data. The number of files in this directory should equal the number of collections you migrated.

After the migration, don't forget to point your application to the new database path, or rename the new database directory to the old path, so that your application uses the new database.

## Conclusion

If all the steps are followed correctly, you should have successfully migrated your OasysDB database from v0.4.5 to v0.5.0. If you face any issues during the migration, feel free to reach out to me on our [Discord](https://discord.gg/bDhQrkqNP4).

I will be happy to personally assist you with the migration process 😁

readme.md (22 additions & 8 deletions)

```diff
@@ -32,17 +32,23 @@ OasysDB is very flexible! You can use it for systems related with vector similar
 
 ### Core Features
 
-🔸 **Embedded Database**: Zero setup & no server required.\
-🔸 **Optional Persistence**: In-memory or disk-based collection.\
-🔸 **Incremental Ops**: Modify vectors without rebuilding indexes.\
-🔸 **Flexible Schema**: Store additional metadata for each vector.
+🔸 **Embedded Database**: Zero setup and no dedicated server or process required.
+
+🔸 **Optional Persistence**: In-memory vector collections that can be persisted to disk.
+
+🔸 **Incremental Ops**: Insert, modify, and delete vectors without rebuilding indexes.
+
+🔸 **Flexible Schema**: Store additional and flexible metadata for each vector record.
 
 ### Technical Features
 
-🔹 **Fast HNSW**: Efficient approximate vector similarity search.\
-🔹 **Configurable Metric**: Use Euclidean, Cosine, or other metric.\
-🔹 **Parallel Processing**: Multi-threaded & SIMD optimized calculation.\
-🔹 **Built-in Incremental ID**: No headache vector record management.
+🔹 **Fast HNSW**: Efficient vector similarity search with state-of-the-art algorithm.
+
+🔹 **Configurable Metric**: Use Euclidean, Cosine, or other metric for your specific use-case.
+
+🔹 **Parallel Processing**: Multi-threaded & SIMD-optimized vector distance calculation.
+
+🔹 **Built-in Incremental ID**: No headache record management and efficient storage.
 
 ## Design Philosophy
 
```

````diff
@@ -121,6 +127,14 @@ fn main() {
 }
 ```
 
+## Feature Flags
+
+OasysDB provides several feature flags to enable or disable certain features. You can do this by adding the feature flags to your project `Cargo.toml` file. Below are the available feature flags and their descriptions:
+
+- `json`: Enables easy Serde's JSON conversion from and to the metadata type. This feature is very useful if you have a complex metadata type or if you use APIs that communicate using JSON.
+
+- `gen`: Enables the vector generator trait and modules to extract vector embeddings from your contents using OpenAI or other embedding models. This feature allows OasysDB to handle vector embedding extraction for you without separate dependencies.
+
 # 🚀 Quickstart with Python
 
 ![Python-Banner.png](https://i.postimg.cc/rp1qjBZJ/Python-Banner.png)
````
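For reference, the two flags described in the new readme section would be enabled like any other Cargo feature; a hypothetical snippet (crate version taken from this commit):

```toml
[dependencies]
oasysdb = { version = "0.5.0", features = ["json", "gen"] }
```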

src/db/database.rs (153 additions & 15 deletions)

```diff
@@ -1,10 +1,29 @@
 use super::*;
 
+/// The directory where collections are stored in the database.
+const COLLECTIONS_DIR: &str = "collections";
+
+/// The database record for the persisted vector collection.
+#[derive(Serialize, Deserialize, Debug)]
+pub struct CollectionRecord {
+    /// Name of the collection.
+    pub name: String,
+    /// File path where the collection is stored.
+    pub path: String,
+    /// Number of vector records in the collection.
+    pub count: usize,
+    /// Timestamp when the collection was created.
+    pub created_at: usize,
+    /// Timestamp when the collection was last updated.
+    pub updated_at: usize,
+}
+
 /// The database storing vector collections.
 #[cfg_attr(feature = "py", pyclass(module = "oasysdb.database"))]
 pub struct Database {
     collections: Db,
     count: usize,
+    path: String,
 }
 
 /// Python only methods.
@@ -31,32 +50,62 @@ impl Database {
 #[cfg_attr(feature = "py", pymethods)]
 impl Database {
     /// Gets a collection from the database.
-    /// * `name` - Name of the collection.
+    /// * `name`: Name of the collection.
     pub fn get_collection(&self, name: &str) -> Result<Collection, Error> {
-        let value = self.collections.get(name)?;
-        match value {
-            Some(value) => Ok(bincode::deserialize(&value)?),
-            None => Err(Error::collection_not_found()),
-        }
+        // Retrieve the collection record from the database.
+        let record: CollectionRecord = match self.collections.get(name)? {
+            Some(value) => bincode::deserialize(&value)?,
+            None => return Err(Error::collection_not_found()),
+        };
+
+        self.read_from_file(&record.path)
     }
 
     /// Saves new or update existing collection to the database.
-    /// * `name` - Name of the collection.
-    /// * `collection` - Vector collection to save.
+    /// * `name`: Name of the collection.
+    /// * `collection`: Vector collection to save.
     pub fn save_collection(
         &mut self,
         name: &str,
         collection: &Collection,
     ) -> Result<(), Error> {
+        // This variable is required since some operations require
+        // the write_to_file method to succeed.
        let mut new = false;
 
+        let mut record: CollectionRecord;
+        let path: String;
+
        // Check if it's a new collection.
        if !self.collections.contains_key(name)? {
            new = true;
+            path = self.create_new_collection_path(name)?;
+
+            // Create a new collection record.
+            let timestamp = self.get_timestamp();
+            record = CollectionRecord {
+                name: name.to_string(),
+                path: path.clone(),
+                count: collection.len(),
+                created_at: timestamp,
+                updated_at: timestamp,
+            };
+        } else {
+            let bytes = self.collections.get(name)?.unwrap().to_vec();
+            record = bincode::deserialize(&bytes)?;
+            path = record.path.clone();
+
+            // Update the record values.
+            record.count = collection.len();
+            record.updated_at = self.get_timestamp();
        }
 
-        let value = bincode::serialize(collection)?;
-        self.collections.insert(name, value)?;
+        // Write the collection to a file.
+        self.write_to_file(&path, collection)?;
+
+        // Insert or update the collection record in the database.
+        let bytes = bincode::serialize(&record)?;
+        self.collections.insert(name, bytes)?;
 
        // If it's a new collection, update the count.
        if new {
@@ -67,8 +116,17 @@ impl Database {
    }
 
    /// Deletes a collection from the database.
-    /// * `name` - Collection name to delete.
+    /// * `name`: Collection name to delete.
    pub fn delete_collection(&mut self, name: &str) -> Result<(), Error> {
+        let record: CollectionRecord = match self.collections.get(name)? {
+            Some(value) => bincode::deserialize(&value)?,
+            None => return Err(Error::collection_not_found()),
+        };
+
+        // Delete the collection file first before removing
+        // the reference from the database.
+        self.delete_file(&record.path)?;
+
        self.collections.remove(name)?;
        self.count -= 1;
        Ok(())
@@ -101,26 +159,106 @@ impl Database {
 impl Database {
    /// Re-creates and opens the database at the given path.
    /// This method will delete the database if it exists.
-    /// * `path` - Directory to store the database.
+    /// * `path`: Directory to store the database.
    pub fn new(path: &str) -> Result<Self, Error> {
        // Remove the database dir if it exists.
        if Path::new(path).exists() {
            remove_dir_all(path)?;
        }
 
+        // Setup the directory where collections will be stored.
+        Self::setup_collections_dir(path)?;
+
        // Using sled::Config to prevent name collisions
        // with collection's Config.
        let config = sled::Config::new().path(path);
        let collections = config.open()?;
-        Ok(Self { collections, count: 0 })
+        Ok(Self { collections, count: 0, path: path.to_string() })
    }
 
    /// Opens existing or creates new database.
    /// If the database doesn't exist, it will be created.
-    /// * `path` - Directory to store the database.
+    /// * `path`: Directory to store the database.
    pub fn open(path: &str) -> Result<Self, Error> {
        let collections = sled::open(path)?;
        let count = collections.len();
-        Ok(Self { collections, count })
+        Self::setup_collections_dir(path)?;
+        Ok(Self { collections, count, path: path.to_string() })
    }
+
+    /// Serializes and writes the collection to a file.
+    /// * `path`: File path to write the collection to.
+    /// * `collection`: Vector collection to write.
+    fn write_to_file(
+        &self,
+        path: &str,
+        collection: &Collection,
+    ) -> Result<(), Error> {
+        let data = bincode::serialize(collection)?;
+
+        let file = OpenOptions::new()
+            .create(true)
+            .write(true)
+            .truncate(true)
+            .open(path)?;
+
+        let mut writer = BufWriter::new(file);
+        writer.write_all(&data)?;
+        Ok(())
+    }
+
+    /// Reads and deserializes the collection from a file.
+    /// * `path`: File path to read the collection from.
+    fn read_from_file(&self, path: &str) -> Result<Collection, Error> {
+        let file = OpenOptions::new().read(true).open(path)?;
+        let mut reader = BufReader::new(file);
+        let mut data = Vec::new();
+        reader.read_to_end(&mut data)?;
+
+        // Deserialize the collection.
+        let collection = bincode::deserialize(&data)?;
+        Ok(collection)
+    }
+
+    /// Deletes a file at the given path.
+    fn delete_file(&self, path: &str) -> Result<(), Error> {
+        remove_file(path)?;
+        Ok(())
+    }
+
+    /// Returns the path where the collection will be stored.
+    /// * `name`: Name of the collection.
+    fn create_new_collection_path(&self, name: &str) -> Result<String, Error> {
+        // Hash the collection name to create a unique filename.
+        let mut hasher = DefaultHasher::new();
+        name.hash(&mut hasher);
+        let filename = hasher.finish();
+
+        let path = Path::new(&self.path)
+            .join(COLLECTIONS_DIR)
+            .join(filename.to_string())
+            .to_str()
+            .unwrap()
+            .to_string();
+
+        Ok(path)
+    }
+
+    /// Creates the collections directory on the path if it doesn't exist.
+    fn setup_collections_dir(path: &str) -> Result<(), Error> {
+        let collections_dir = Path::new(path).join(COLLECTIONS_DIR);
+        if !collections_dir.exists() {
+            create_dir_all(&collections_dir)?;
+        }
+
+        Ok(())
+    }
+
+    /// Returns the UNIX timestamp in milliseconds.
+    fn get_timestamp(&self) -> usize {
+        let now = SystemTime::now();
+        // We can unwrap safely since UNIX_EPOCH is always valid.
+        let timestamp = now.duration_since(UNIX_EPOCH).unwrap();
+        timestamp.as_millis() as usize
    }
 }
```
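The `create_new_collection_path` helper in the diff above boils down to hashing the collection name into a filename under the `collections` directory. A std-only sketch of that mapping (the function name and paths here are illustrative, not the crate's API):

```rust
use std::hash::{DefaultHasher, Hash, Hasher};
use std::path::{Path, PathBuf};

/// Map a collection name to its on-disk path, mirroring the
/// hashing scheme used by the commit's create_new_collection_path.
fn collection_path(db_path: &str, name: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    // The 64-bit hash becomes the filename inside the collections dir.
    Path::new(db_path)
        .join("collections")
        .join(hasher.finish().to_string())
}

fn main() {
    let a = collection_path("database", "vectors");
    let b = collection_path("database", "vectors");
    // Deterministic within one build: same name, same file.
    assert_eq!(a, b);
    println!("{}", a.display());
}
```

One caveat worth knowing: `DefaultHasher` does not guarantee the same hash across Rust releases, so the derived filename is only stable for a given toolchain.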

src/db/mod.rs (5 additions & 1 deletion)

```diff
@@ -3,9 +3,13 @@ pub mod database;
 
 use crate::collection::*;
 use crate::func::err::Error;
+use serde::{Deserialize, Serialize};
 use sled::Db;
-use std::fs::remove_dir_all;
+use std::fs::{create_dir_all, remove_dir_all, remove_file, OpenOptions};
+use std::hash::{DefaultHasher, Hash, Hasher};
+use std::io::{BufReader, BufWriter, Read, Write};
 use std::path::Path;
+use std::time::{SystemTime, UNIX_EPOCH};
 
 #[cfg(feature = "py")]
 use pyo3::prelude::*;
```

src/func/err.rs (2 additions & 1 deletion)

```diff
@@ -1,8 +1,9 @@
+use std::fmt::{Display, Formatter, Result};
+
 // Other error types.
 use bincode::ErrorKind as BincodeError;
 use sled::Error as SledError;
 use std::error::Error as StandardError;
-use std::fmt::{Display, Formatter, Result};
 use std::io::Error as IOError;
 
 #[cfg(feature = "py")]
```
