feat: add FileStore implementation for cache (#427) #443
Conversation
…r read after remove
```rust
};

if expiry.is_expired(None) {
    return Ok(false);
```
should we not remove the file here (early on), right when we find out the file has expired, making this a single source of truth?
Yes, that would be more efficient. Initially, I was aiming to match the eviction policy where only read() and contains_key() may delete. I agree it would be better to eagerly remove it once expired.
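The eager-removal idea could look roughly like this. This is a simplified, synchronous sketch using std::fs and a plain little-endian u64 expiry header; the function name, the header layout, and the "0 means no expiry" convention are assumptions for illustration, not the PR's actual code (which uses tokio and the crate's error types):

```rust
use std::fs;
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Reads the 8-byte expiry header and, if the entry has expired,
/// removes the file right away so every caller sees the same state
/// (the "single source of truth" idea from the review comment).
fn check_and_evict(path: &Path) -> std::io::Result<bool> {
    let bytes = fs::read(path)?;
    // assumption: bytes 0..8 hold the expiry as a little-endian u64
    let header: [u8; 8] = bytes[..8].try_into().expect("file shorter than header");
    let expires_at = u64::from_le_bytes(header);
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_secs();
    // assumption: 0 means "no expiry"
    if expires_at != 0 && expires_at <= now {
        // expired entries are deleted on discovery, not lazily later
        fs::remove_file(path)?;
        return Ok(false);
    }
    Ok(true)
}
```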
```rust
let mut buffer = Vec::new();

// advances cursor by the expiry header offset
file.seek(SeekFrom::Start(8))
```
We seem to seek here again after doing so when we parse the expiry. Does that not make this seek redundant?
Thinking about it again, I agree
I don't agree. First, the parse_expiry function resets the cursor back to the beginning (as it should), so this seek is needed. Second, it should stay here for correctness purposes: we can't assume the cursor on the file will always be at the beginning.
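The cursor-handling argument can be sketched as follows. This is a simplified, synchronous stand-in over any Read + Seek source; the function names and the 8-byte header offset mirror the discussion but are illustrative assumptions, not the PR's code:

```rust
use std::io::{Read, Seek, SeekFrom};

const EXPIRY_HEADER_LEN: u64 = 8; // assumed header size from the discussion

/// Mirrors the parse step: it never assumes where the cursor is,
/// and it rewinds to the start before returning.
fn parse_expiry<R: Read + Seek>(r: &mut R) -> std::io::Result<u64> {
    r.seek(SeekFrom::Start(0))?; // don't trust the current position
    let mut header = [0u8; 8];
    r.read_exact(&mut header)?;
    r.seek(SeekFrom::Start(0))?; // reset for the next reader
    Ok(u64::from_le_bytes(header))
}

/// Because parse_expiry resets the cursor, this absolute seek past
/// the header is required, not redundant.
fn read_payload<R: Read + Seek>(r: &mut R) -> std::io::Result<Vec<u8>> {
    r.seek(SeekFrom::Start(EXPIRY_HEADER_LEN))?;
    let mut buf = Vec::new();
    r.read_to_end(&mut buf)?;
    Ok(buf)
}
```

Seeking to an absolute offset rather than relying on the previous reader's final position keeps each function correct in isolation, which is the point made in the reply.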
```rust
}

#[cfg(test)]
mod tests {
```
For the failing nightly tests, I suspect it might be broken upstream; I have that pinned in my PR (cot/.github/workflows/rust.yml, line 20 in 26ef6bb).
m4tx left a comment
Thank you for your patience waiting for my review! This looks very good already, but there are a few things I'd like to see clarified before we merge.
```rust
//! # use std::path::PathBuf;
//! # #[tokio::main]
//! # async fn main() {
//! let path = PathBuf::from("./cache_data");
//! let store = FileStore::new(path).expect("Failed to initialize store");
//!
//! let key = "example_key".to_string();
//! let value = serde_json::json!({"data": "example_value"});
//!
//! store.insert(key.clone(), value.clone(), Default::default()).await.unwrap();
//!
//! let retrieved = store.get(&key).await.unwrap();
//! assert_eq!(retrieved, Some(value));
//!
//! # }
```

nitpick: formatting

```suggestion
//! # use std::path::PathBuf;
//! # #[tokio::main]
//! # async fn main() {
//!
//! let path = PathBuf::from("./cache_data");
//! let store = FileStore::new(path).expect("Failed to initialize store");
//!
//! let key = "example_key".to_string();
//! let value = serde_json::json!({"data": "example_value"});
//!
//! store.insert(key.clone(), value.clone(), Default::default()).await.unwrap();
//!
//! let retrieved = store.get(&key).await.unwrap();
//! assert_eq!(retrieved, Some(value));
//! # }
```
```rust
use std::path::Path;

use chrono::{DateTime, Utc};
use md5::{Digest, Md5};
```
Maybe it's worth using sha2 instead? I know md5 is already in our indirect dependencies - but sha2 is a direct dependency (used in the auth), and probably less prone to collisions. After all it doesn't matter that much, but it might be worth sticking to one algorithm as broadly as possible.
Got it!
Actually, for caching purposes we don't need a classic cryptographically secure hashing algorithm; we should focus on a fast one instead. I'd suggest using BLAKE3 for that case, and it has an official Rust implementation. If we trust their benchmarks, it's much faster than SHA2.
@seqre only if we (or any of our dependencies) use it already. I don't see much value in having the fastest function for hashing the cache keys (which will typically be small).
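Whichever algorithm wins this discussion, the pattern under debate is the same: derive a fixed-length, filesystem-safe file name from the cache key. A minimal dependency-free sketch of that pattern is below; it uses std's DefaultHasher purely so the example compiles without sha2/BLAKE3, and it is explicitly NOT the collision-resistant choice the thread is arguing for:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative stand-in for the store's create_key_hash: maps an
/// arbitrary key to a fixed-length hex file name. A real store would
/// swap DefaultHasher for sha2 or BLAKE3 as discussed in the review,
/// since a 64-bit non-cryptographic hash is far easier to collide.
fn create_key_hash(key: &str) -> String {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    // 16 lowercase hex digits: safe on every common filesystem
    format!("{:016x}", hasher.finish())
}
```

The point of hashing at all is that raw keys may contain path separators or exceed filename length limits, while the hex digest is always a valid file name.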
```rust
    key: &str,
) -> CacheStoreResult<Option<(tokio::fs::File, std::path::PathBuf)>> {
    let key_hash = FileStore::create_key_hash(key);
    let path = self.dir_path.join(&key_hash);
```
What do we do in case there's a collision when hashing?
Currently, there's no collision resolution. My idea is to embed the real key name in the file as a header (this is simpler but incurs an extra syscall only when the real name doesn't match the file). Another option would be to implement a jump table and sync it to a file whenever a new hash is pushed.
I don't think we really need collision discovery unless it'd be simple to implement and fast at runtime. If we switch to another algorithm, the collision resistance is pretty high; for BLAKE3, for example, it's 2**128. At that level of resistance, getting any collision is a feat, and getting a specific collision for an attack seems almost impossible.
```rust
let data = serde_json::to_string(&value)
    .map_err(|e| FileCacheStoreError::Serialize(Box::new(e)))?;

let mut buffer: Vec<u8> = Vec::with_capacity(8 + data.len());
```
Let's extract the magic number to a named constant to clarify what it means.
And since we essentially create a custom binary format, it might be worth documenting this at the module level.
Got it!
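The named-constant suggestion might come out something like this. The constant name, function name, and the exact on-disk layout comment are assumptions sketching the custom binary format the review asks to document, not the PR's final code:

```rust
/// Hypothetical on-disk entry layout for the file cache:
///   bytes 0..8  - expiry as a little-endian u64 unix timestamp
///   bytes 8..   - the serialized JSON value
const EXPIRY_HEADER_LEN: usize = 8;

/// Builds the byte buffer for one cache entry: expiry header
/// followed by the serialized payload.
fn encode_entry(expires_at: u64, data: &str) -> Vec<u8> {
    let mut buffer: Vec<u8> = Vec::with_capacity(EXPIRY_HEADER_LEN + data.len());
    buffer.extend_from_slice(&expires_at.to_le_bytes());
    buffer.extend_from_slice(data.as_bytes());
    buffer
}
```

With the constant in place, the magic `8` disappears from both the writer and the reader's seek offset, and the module-level doc comment becomes the single description of the format.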
```rust
if let Ok(meta) = entry.metadata().await
    && meta.is_file()
{
    total_size += meta.len();
```
This will return the number of bytes; not the number of entries, no?
Yes, this would return the total bytes. I was under the impression that approx_size depends on the cache type to track its quantity in its own unit. If this is changed to the number of entries, should this byte-aggregation function be kept around for future use (maybe for monitoring)?
I wouldn't say so, until we add functionality for that for all cache stores, it would be just dead code. We can always recreate it!
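The entries-not-bytes version the thread settles on could look like this synchronous sketch (the real code iterates with tokio::fs and the crate's error types; the function name here is hypothetical):

```rust
use std::fs;
use std::path::Path;

/// Counts cache entries (regular files in the cache directory)
/// rather than summing their sizes, so approx_size reports a
/// number of entries like the other cache stores.
fn approx_entry_count(dir: &Path) -> std::io::Result<u64> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.metadata()?.is_file() {
            count += 1;
        }
    }
    Ok(count)
}
```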
seqre left a comment
Thank you for your contribution, it's a great start! There are some things that need changing before we merge it, though.
```rust
//! let key = "example_key".to_string();
//! let value = serde_json::json!({"data": "example_value"});
//!
//! store.insert(key.clone(), value.clone(), Default::default()).await.unwrap();
```
nit: I'd change it to Timeout::default() so that people reading the docs know what the 3rd argument is.
```rust
use crate::config::Timeout;
use crate::error::error_impl::impl_into_cot_error;

const ERROR_PREFIX: &str = "file based cache store error:";
```
nit:

```suggestion
const ERROR_PREFIX: &str = "file-based cache store error:";
```
```rust
#[error("{ERROR_PREFIX} file dir creation error: {0}")]
DirCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during temp file creation
#[error("{ERROR_PREFIX} file temp file creation error: {0}")]
TempFileCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during write/stream file
#[error("{ERROR_PREFIX} file io error: {0}")]
```
nit: "file" is already included in ERROR_PREFIX

```suggestion
#[error("{ERROR_PREFIX} dir creation error: {0}")]
DirCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during temp file creation
#[error("{ERROR_PREFIX} temp file creation error: {0}")]
TempFileCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during write/stream file
#[error("{ERROR_PREFIX} io error: {0}")]
```
```rust
store.create_dir_sync_root()?;

Ok(store)
}

fn create_dir_sync_root(&self) -> CacheStoreResult<()> {
```
nit: I think sync should be a suffix to show it's another version of an existing async function

```suggestion
store.create_dir_root_sync()?;

Ok(store)
}

fn create_dir_root_sync(&self) -> CacheStoreResult<()> {
```
```rust
    &self,
    file: &mut tokio::fs::File,
) -> CacheStoreResult<Option<Value>> {
    if !self.parse_expiry(file).await? {
```
I'm not fully convinced parse_expiry is the best name; I had to spend a few seconds parsing this line mentally. It's what the function does technically, but logically it checks whether the file has expired. I wonder if it would be better to rename it to check_expiry, is_expired, or similar.
```rust
    Ok((temp_file, temp_path))
}

async fn file_open(
```
nit: maybe open_file_for_reading instead? It seems more aligned with what it does, or at least it says more about what it specifically does.
```rust
}

async fn contains_key(&self, key: &str) -> CacheStoreResult<bool> {
    let Ok(Some(mut file_tuple)) = self.file_open(key).await else {
```
Please deconstruct the tuple here with pattern matching into specific parts, so you don't have to use .0 and .1 below.
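The suggested destructuring, shown on a simplified synchronous stand-in (the stub names and std::fs usage are assumptions; the real code is async and returns the crate's error type):

```rust
use std::fs::File;
use std::path::PathBuf;

/// Hypothetical stand-in for the store's file_open: returns the
/// opened file together with its path, or None if it doesn't exist.
fn file_open_stub(path: PathBuf) -> std::io::Result<Option<(File, PathBuf)>> {
    match File::open(&path) {
        Ok(f) => Ok(Some((f, path))),
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}

/// Destructuring the tuple inside the let-else pattern gives the
/// body named bindings instead of opaque .0 / .1 accesses.
fn contains_key_stub(path: PathBuf) -> bool {
    let Ok(Some((_file, _path))) = file_open_stub(path) else {
        return false;
    };
    // _file and _path would be used here by name in the real method
    true
}
```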
Pull Request to Issue #427