Content addressed data access
Why?
- To de-duplicate files between sites.
- Allow better site archiving
- Avoid data loss on site moderation/changes
What?
Store and access files based on the file's hash (or merkle root for big files)
How?
File storage
data/__static__/[download_date]/[filename].[ext]
- Possible alternative #1:
data/__static__/[download_date]/[hash].[ext]
- Possible alternative #2:
data/__static__/[download_date]/[partial_hash].[ext]
- Possible alternative #3:
data/__static__/[partial_hash]/[hash].[ext]
Possible alternative to the static content root directory (instead of data/__static__/):
- data-static/
- data/__immutable__/
Variables:
- download_date (example: 2019-09-05): To avoid the per-directory file number limit and make the files easier to find.
- hash: The merkle root of the file (sha512t256)
- partial_hash: The first 8 characters of the hash, to keep the path length short (an incremental postfix could be required on file name collision)
- filename: File name (the first one requested; may vary between sites) (an incremental postfix could be required on file name collision)
- ext: File extension (the first one requested; may vary between sites)
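A minimal sketch of how a file's storage path could be derived from these variables, assuming sha512t256 means SHA-512 truncated to its first 256 bits (64 hex characters, matching the example hashes below; for big files the hash would be the merkle root instead). The helper names are illustrative:

import hashlib
import time

def sha512t256(data):
    # SHA-512 truncated to the first 256 bits (64 hex characters)
    return hashlib.sha512(data).hexdigest()[:64]

def storage_path(data, ext):
    # Layout of possible alternative #1: data/__static__/[download_date]/[hash].[ext]
    download_date = time.strftime("%Y-%m-%d")  # example: 2019-09-05
    return "data/__static__/%s/%s.%s" % (download_date, sha512t256(data), ext)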
URL access
http://127.0.0.1:43110/f/[hash].[ext] (for non-big files)
http://127.0.0.1:43110/bf/[hash].[ext] (for big files)
The file name can optionally be added, but the hash does not depend on it:
http://127.0.0.1:43110/f/[hash]/[anyfilename].[ext]
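A rough sketch of parsing these URL forms; the route pattern and function name are assumptions, not existing ZeroNet code:

import re

# /f/[hash].[ext], /bf/[hash].[ext], or /f/[hash]/[anyfilename].[ext]
# (the optional file name is ignored for the lookup)
STATIC_URL_RE = re.compile(
    r"^/(?P<prefix>f|bf)/(?P<hash>[0-9a-f]{64})"
    r"(?:\.(?P<ext>[A-Za-z0-9]+)|/(?P<filename>[^/]+))$"
)

def parse_static_url(path):
    match = STATIC_URL_RE.match(path)
    return match.groupdict() if match else None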
File upload
- Create an interface similar to the big file upload (XMLHttpRequest based)
- Scan directory data/__static__/__add__: copy files into this directory, visit the ZeroHello Files tab, then click "Hash added files" (sketched below)
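A minimal sketch of the "Hash added files" step, reusing the storage_path() helper from the sketch above:

import os
import shutil

def hash_added_files(add_dir="data/__static__/__add__"):
    # Move every file dropped into __add__ to its content-addressed location
    for filename in os.listdir(add_dir):
        source = os.path.join(add_dir, filename)
        if not os.path.isfile(source):
            continue
        with open(source, "rb") as f:
            data = f.read()
        ext = filename.rsplit(".", 1)[-1] if "." in filename else "dat"
        target = storage_path(data, ext)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(source, target)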
File download process
- Find possible peers with site-local findHashId/getHashfield / trackers
- For big files: Download piecefield.msgpack
- Use normal getFile to download the file/pieces (use sha512 in the request instead of the site/inner_path)
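The same steps as an illustrative sketch; peer.find_hash_id, peer.get_piecefield and peer.get_file are hypothetical stand-ins for the findHashId, piecefield and getFile commands above:

def download_by_hash(peers, sha512_hash, is_big_file=False):
    # Step 1: find peers that have announced this hash
    candidates = [peer for peer in peers if peer.find_hash_id(sha512_hash)]
    if not candidates:
        raise LookupError("No peers found for %s" % sha512_hash)
    if is_big_file:
        # Step 2 (big files only): fetch piecefield.msgpack to learn which
        # pieces each peer holds (piece scheduling is omitted here)
        piecefields = {peer: peer.get_piecefield(sha512_hash) for peer in candidates}
    # Step 3: normal getFile request, keyed by the sha512 hash instead of
    # the usual site address + inner_path pair
    return candidates[0].get_file(sha512_hash)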
Directory upload
For directory uploads we need to generate a content.json that contains references to the other files.
Basically these would be sites where the content.json is authenticated by its sha512t hash instead of the public address of the owner.
Example:
{
"title": "Directory name",
"files_link": {
"any_file.jpg": {"link": "/f/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754", "size": 3242},
"other_dir/any_file.jpg": {"link": "/bf/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754", "size": 3821232}
}
}
These directories can be accessed on the web interface using http://127.0.0.1:43110/d/{sha512t hash of generated content.json}/any_file.jpg
(file list can be displayed on directory access)
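A sketch of generating such a content.json and deriving the hash used in the /d/ URL; the json.dumps settings and the sha512t truncation are assumptions:

import hashlib
import json

def build_directory_content_json(title, files_link):
    # files_link maps relative path -> {"link": ..., "size": ...},
    # exactly as in the example above
    content = {"title": title, "files_link": files_link}
    raw = json.dumps(content, sort_keys=True).encode("utf-8")
    directory_hash = hashlib.sha512(raw).hexdigest()[:64]  # sha512t
    return raw, directory_hash  # files served under /d/<directory_hash>/<path>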
Downloaded files and the content.json are stored in the data/__static__/[download_date]/{Directory name} directory.
Each file in the directory is also accessible using
http://127.0.0.1:43110/f/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754/any_file.jpg
As an optimization, if the files are accessed using a directory reference, the peer list can be fetched using findHashId/getHashId from other peers without accessing the trackers.
Possible problems
Too many tracker requests
Announcing and keeping track of peers for a large number (10k+) of files can be problematic.
Solution #1
Send tracker requests only for large (10MB+) files.
To get the peer list for smaller files we use the current getHashfield / findHashId solution.
Cons:
- It could be hard or impossible to find peers for small files if you are not connected to a site where that file is popular.
- Hash collisions, as we use only the first 4 characters of the hash in the hashfield (see the sketch below)
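For reference, a hashfield id keeps only the first 4 hex characters of the hash, roughly like this (modeled on the current getHashId behavior; treat the exact code as an assumption):

def get_hash_id(hash_hex):
    # 4 hex characters = 16 bits, so only 65,536 distinct ids;
    # collisions are unavoidable once many files are tracked
    return int(hash_hex[0:4], 16)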
Solution #2
Announce all files to zero:// trackers and reduce the re-announce time to e.g. 4 hours (re-announce within 1 minute if a new file is added).
(Sending this amount of requests to BitTorrent trackers could be problematic.)
Don't store peers for files that you have 100% downloaded.
Request size for 10k files: 32 bytes * 10k = 320 kB (optimal case)
Possible optimization #1:
Change the tracker communication to request a client id token and only communicate hash additions/deletions until the expiry time.
The token expiry time extends with every request.
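A sketch of the token-based delta announce; the message fields and the tracker.request call are assumptions:

def announce(tracker, all_hashes, token=None, added=(), removed=()):
    if token is None:
        # First request: send the full hash list, receive a client id token
        return tracker.request({"hashes": list(all_hashes)})
    # Follow-up requests: only send additions/deletions; every request
    # also extends the token's expiry time on the tracker side
    return tracker.request({"token": token, "add": list(added), "remove": list(removed)})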
Possible optimization #2:
Take some risk of hash collision and allow the tracker to specify how many characters it needs from the hashes
(based on how many hashes it stores).
Estimated request size to announce 22k files:
- Full hash (32 bytes): 770k
- First 6 bytes (should be good until 10m hashes): 153k
- First 7 bytes (should be good until 2560m hashes): 175k
- First 8 bytes (should be good until 655360m hashes): 197k
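A client-side sketch of optimization #2; the raw sizes here exclude per-entry protocol overhead, which is presumably why the estimates above are somewhat larger:

def announce_payload(hashes_hex, prefix_bytes):
    # The tracker specifies prefix_bytes based on how many hashes it stores;
    # the client sends only that prefix of each hash
    return b"".join(bytes.fromhex(h)[:prefix_bytes] for h in hashes_hex)

# 22,000 files with 6-byte prefixes: 22,000 * 6 = 132,000 bytes of raw hash data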
Cons:
- Depends on the zero:// trackers
- Heavy requests, more CPU/BW load to trackers
Download all optional files / help initial seed for specific user
Downloading all optional files in a site, or all files uploaded by a specific user, won't be possible anymore:
the optional files will no longer be stored in the files_optional node of the user's content.json file.
Solution #1
Add a files_link node to content.json that stores the files uploaded in the last X days
(with sha512, ext, size, date_added nodes).
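An illustrative shape for such a node, reusing the example hash from above (the field values are made up, and date_added as a Unix timestamp is an assumption):

{
  "files_link": {
    "any_file.jpg": {
      "sha512": "602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754",
      "ext": "jpg",
      "size": 3242,
      "date_added": 1567646400
    }
  }
}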