Conversation

@Swatinem (Contributor) commented Jul 1, 2025

No description provided.

@Swatinem self-assigned this Jul 1, 2025
@Swatinem requested a review from a team as a code owner July 1, 2025 13:20
@jan-auer (Member) left a comment

Some questions and ideas in comments, none of them blocking.


#[derive(Clone)]
pub struct StorageId {
    pub id: Uuid,

@jan-auer (Member):
I noticed we have several variants for this StorageId now, from protobuf and the manual ones. Should we consolidate?

In a follow-up we should define whether we want to encode the use case and scope (org) into the ID. I think that would be a valuable addition.

@Swatinem (Contributor, Author):
Putting all of that into a unified ID sounds good to me. We might as well just encode all of it as a path like {usecase}/{scope}/{id}.
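
A minimal sketch of what that unified ID could look like, assuming hypothetical usecase and scope fields on StorageId; only the {usecase}/{scope}/{id} path shape comes from the discussion above:

use uuid::Uuid;

#[derive(Clone)]
pub struct StorageId {
    pub usecase: String, // hypothetical, e.g. a use case name
    pub scope: String,   // hypothetical, e.g. an org id or slug
    pub id: Uuid,
}

impl StorageId {
    /// Encode the ID as a {usecase}/{scope}/{id} path.
    pub fn as_path(&self) -> String {
        format!("{}/{}/{}", self.usecase, self.scope, self.id)
    }
}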


impl StorageService {
    pub fn new(path: &Path) -> Self {
        let db_path = path.join("db.sqlite");

@jan-auer (Member):
Why not write the blobs directly onto the FS? I know this is just for dev, but we know that sqlite is pretty bad with large blobs. Compression could be detected via magic bytes, and the metadata needs to go somewhere else anyway.
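
For reference, a minimal sketch of the FS-backed variant suggested here; the write_blob helper and the two-character sharding scheme are assumptions, not code from this PR:

use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Write a blob directly onto the filesystem instead of into sqlite.
// Hypothetical helper: sharding on the first two characters of the id
// keeps any single directory from growing too large.
fn write_blob(root: &Path, id: &str, data: &[u8]) -> io::Result<PathBuf> {
    let dir = root.join(&id[..2]); // assumes ids are at least two chars, e.g. a UUID string
    fs::create_dir_all(&dir)?;
    let path = dir.join(id);
    fs::write(&path, data)?;
    Ok(path)
}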

Comment on lines +47 to +48
segment BLOB NOT NULL,
segment_offset INTEGER NOT NULL

@jan-auer (Member):
I would not add an offset at this point and would assume every blob stands on its own. If the application decides to chunk files, those would be separate blobs; for dev there's no need to chunk. As for the scalable solution, we should ideally test without chunking first.
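
A minimal sketch of the simplified schema this suggests, with the offset column dropped so every row is a standalone blob; the table name and the id column are assumptions for illustration:

// Simplified schema without segment_offset: every blob stands on its own.
const CREATE_BLOBS: &str = "
    CREATE TABLE IF NOT EXISTS blobs (
        id      TEXT PRIMARY KEY,
        segment BLOB NOT NULL
    );
";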

@Swatinem force-pushed the swatinem/write-local-segments branch from f9cfaf4 to 0a59dd7 on July 1, 2025 14:47
@Swatinem (Contributor, Author) commented Jul 1, 2025

My idea with that segmentation/chunking is like this:

  • Clients are splitting large files into parts, and those parts are being uploaded/stored as multi-part requests.
  • Internally, the server then groups those parts into segments. The reason for this is to better optimize for small files.
  • Depending on where the segment is located (locally, or remote in GCS), we can do a range request to fetch the relevant part.

Yes, this means that large files are split up, only to then be grouped together again into completely different segments.
This might seem weird, but I think it actually makes sense: clients that use our client SDK can parallelize downloads according to these internal parts and re-assemble the file locally.
A client without the SDK will just request the whole file via a plain HTTP GET, and we will then stream those parts one after the other.

That was my idea. Whether it makes sense to do is another thing, but I think it's worth a try.
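
A minimal sketch of the part/segment bookkeeping described above, assuming a hypothetical in-memory PartIndex; local segment files stand in for GCS objects, where the seek would become an HTTP range request:

use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

// Where one uploaded part ended up after being grouped into a segment.
// Hypothetical names, for illustration only.
struct PartLocation {
    segment: PathBuf, // local path; for GCS this would be an object key
    offset: u64,
    len: u64,
}

// Maps client-visible part ids to their location inside a segment.
type PartIndex = HashMap<String, PartLocation>;

// Fetch a single part by seeking into its segment. For a remote segment
// this would turn into an HTTP range request (Range: bytes=offset..offset+len-1).
fn read_part(index: &PartIndex, part_id: &str) -> io::Result<Vec<u8>> {
    let loc = index
        .get(part_id)
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "unknown part"))?;
    let mut file = File::open(&loc.segment)?;
    file.seek(SeekFrom::Start(loc.offset))?;
    let mut buf = vec![0u8; loc.len as usize];
    file.read_exact(&mut buf)?;
    Ok(buf)
}

// Re-assemble a whole file by streaming its parts in order, which is what
// the plain HTTP GET fallback would do for non-SDK clients.
fn read_file(index: &PartIndex, part_ids: &[String]) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    for id in part_ids {
        out.extend(read_part(index, id)?);
    }
    Ok(out)
}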

@Swatinem merged commit f8ff20f into master Jul 1, 2025
3 checks passed
@Swatinem deleted the swatinem/write-local-segments branch July 1, 2025 14:53