
Proposal: Content addressed data #2192

@HelloZeroNet

Description


Content addressed data access

Why?

  • To de-duplicate files between sites.
  • Allow better site archiving
  • Avoid data loss on site moderation/changes

What?

Store and access files based on the file's hash (or the merkle root for big files)

How?

File storage

data/static/[download_date]/[filename].[ext]

Possible alternatives to the static content root directory (instead of data/__static__/):

  • data-static/
  • data/__immutable__/

Variables (a path-construction sketch follows this list):

  • download_date (example: 2019-09-05): To avoid the per-directory file number limit and make the files easier to find.
  • hash: The merkle root of the file (sha512t256)
  • partial_hash: The first 8 characters of the hash, to keep the path length short (an incremental postfix could be required on file name collision)
  • filename: File name (the first requested; may vary between sites) (an incremental postfix could be required on file name collision)
  • ext: File extension (the first requested; may vary between sites)
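
A minimal sketch of how such a storage path could be built, assuming the data/static/ root from the template above and that partial_hash is prepended to the stored file name (the exact combination of partial_hash and filename is not specified in the proposal):

import hashlib
import os
import time

STATIC_ROOT = "data/static"  # root from the template above; data/__static__/ is listed as an alternative

def static_path(file_bytes, filename):
    # sha512t256: sha512 truncated to 256 bits (first 64 hex characters)
    file_hash = hashlib.sha512(file_bytes).hexdigest()[:64]
    partial_hash = file_hash[:8]                # first 8 characters, keeps the path short
    download_date = time.strftime("%Y-%m-%d")   # example: 2019-09-05
    # An incremental postfix would be appended on file name collision (not shown)
    return os.path.join(STATIC_ROOT, download_date, "%s_%s" % (partial_hash, filename))

print(static_path(b"example content", "any_file.jpg"))
# prints something like data/static/<download_date>/<partial_hash>_any_file.jpg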

Url access

http://127.0.0.1:43110/f/[hash].[ext] (for non-big file)
http://127.0.0.1:43110/bf/[hash].[ext] (for big file)

The file name can optionally be appended, but the hash does not depend on the filename:

http://127.0.0.1:43110/f/[hash]/[anyfilename].[ext]
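
A rough sketch of parsing these URL forms (the /d/ prefix for directories appears later in the proposal); only the path shapes come from the examples above, the regex itself is an assumption:

import re

ROUTE_RE = re.compile(
    r"^/(?P<type>f|bf|d)"                  # f: normal file, bf: big file, d: directory
    r"/(?P<hash>[0-9a-f]{64})"             # sha512t256 hash, hex encoded
    r"(?:\.(?P<ext>[A-Za-z0-9]+))?"        # optional .ext directly after the hash
    r"(?:/(?P<filename>[^/]+))?$"          # optional /anyfilename.ext
)

def parse_content_url(path):
    match = ROUTE_RE.match(path)
    return match.groupdict() if match else None

print(parse_content_url("/f/" + "0" * 64 + ".jpg"))
print(parse_content_url("/bf/" + "0" * 64 + "/any_file.jpg"))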

File upload

  • Create an interface similar to the big file upload (XMLHttpRequest based)
  • Scan directory: data/__static__/__add__: copy files to this directory, visit the ZeroHello Files tab and click on "Hash added files" (a sketch of this step follows the list)
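
A sketch of the "Hash added files" step, assuming the dated data/static/ layout from above; only the __add__ directory name comes from the proposal:

import hashlib
import os
import shutil
import time

ADD_DIR = "data/__static__/__add__"
STATIC_ROOT = "data/static"

def hash_added_files():
    date_dir = os.path.join(STATIC_ROOT, time.strftime("%Y-%m-%d"))
    os.makedirs(date_dir, exist_ok=True)
    for filename in os.listdir(ADD_DIR):
        src = os.path.join(ADD_DIR, filename)
        with open(src, "rb") as f:
            file_hash = hashlib.sha512(f.read()).hexdigest()[:64]  # sha512t256 (big files would use a merkle root)
        shutil.move(src, os.path.join(date_dir, filename))
        ext = os.path.splitext(filename)[1].lstrip(".")
        print("added: http://127.0.0.1:43110/f/%s.%s" % (file_hash, ext))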

File download process

  • Find possible peers with site-local findHashId/getHashfield / trackers
  • For big files: Download piecefield.msgpack
  • Use the normal getFile to download the file/pieces (use the sha512 in the request instead of the site/inner_path; a request sketch follows this list)
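
A sketch of what the modified getFile request could look like. The current request identifies a file by site and inner_path; the proposal only says the sha512 replaces them, so the key name below is an assumption:

import msgpack

def build_getfile_request(file_hash, location=0, req_id=1):
    # Same cmd/req_id/params envelope as the existing protocol, but the file is
    # identified by its sha512t256 hash instead of site + inner_path
    # (assumed key name: "sha512")
    return msgpack.packb(
        {"cmd": "getFile", "req_id": req_id,
         "params": {"sha512": file_hash, "location": location}},
        use_bin_type=True
    )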

Directory upload

For directory uploads we need to generate a content.json that contains references to the other files.
Basically these would be sites where the content.json is authenticated by its sha512t hash instead of the public address of the owner.

Example:

{
	"title": "Directory name",
	"files_link": {
		"any_file.jpg": {"link": "/f/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754", "size": 3242},
		"other_dir/any_file.jpg": {"link": "/bf/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754", "size": 3821232}
	}
}
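
A minimal sketch of generating such a content.json for a directory and deriving the hash that addresses it; the 10MB big-file threshold, the JSON serialization and the flat hashing of big files (which would really use a merkle root) are assumptions:

import hashlib
import json
import os

BIG_FILE_LIMIT = 10 * 1024 * 1024  # assumed threshold for /bf/ (big file) links

def build_directory_manifest(directory):
    manifest = {"title": os.path.basename(directory), "files_link": {}}
    for root, dirs, files in os.walk(directory):
        for filename in files:
            path = os.path.join(root, filename)
            rel_path = os.path.relpath(path, directory).replace("\\", "/")
            with open(path, "rb") as f:
                file_hash = hashlib.sha512(f.read()).hexdigest()[:64]  # sha512t256
            size = os.path.getsize(path)
            prefix = "/bf/" if size >= BIG_FILE_LIMIT else "/f/"
            manifest["files_link"][rel_path] = {"link": prefix + file_hash, "size": size}
    return manifest

def manifest_hash(manifest):
    # The directory is addressed by the sha512t hash of the generated content.json
    return hashlib.sha512(json.dumps(manifest, sort_keys=True).encode("utf8")).hexdigest()[:64]

The resulting directory would then be reachable under the /d/{manifest hash}/ URL described below.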

These directories can be accessed on the web interface using http://127.0.0.1:43110/d/{sha512t hash of the generated content.json}/any_file.jpg
(a file list can be displayed on directory access)

Downloaded files and the content.json are stored in the data/static/[download_date]/{Directory name} directory.

Each file in the directory is also accessible using
http://127.0.0.1:43110/f/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754/any_file.jpg

As an optimization, if the files are accessed using a directory reference, the peer list can be fetched using
findHashId/getHashId from other peers without accessing the trackers.

Possible problems

Too many tracker requests

Announcing and keeping track of peers for a large number (10k+) of files can be problematic.

Solution #1

Send tracker requests only for large (10MB+) files.
To get the peer list for smaller files we use the current getHashfield / findHashId solution.

Cons:

  • It could be hard/impossible to find peers for small files if you are not connected to a site where that file is popular.
  • Hash collisions, as we use only the first 4 characters of the hash in the hashfield

Solution #2

Announce all files to zero:// trackers and reduce the re-announce time to e.g. 4 hours (re-announce within 1 minute if a new file is added).
(Sending this amount of requests to bittorrent trackers could be problematic.)
Don't store peers for files that you have 100% downloaded.

Request size for 10k files: 32 bytes * 10k = 320k (optimal case)

Possible optimization #1:

Change the tracker communication to request a client id token and only communicate hash additions / deletions until the expiry time.
The token expiry time is extended with every request.
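
A sketch of what this token-based delta announce could look like; none of these field names are specified in the proposal, they are only meant to illustrate the idea:

# First announce: full (partial) hash list, the tracker replies with a token and an expiry time
first_announce = {"cmd": "announce",
                  "params": {"port": 15441, "hashes": ["<all hash prefixes>"]}}
# reply: {"token": "abc123", "expiry": 14400}

# Later announces only carry additions/deletions and reference the token;
# every such request extends the token's expiry time
delta_announce = {"cmd": "announce",
                  "params": {"token": "abc123",
                             "add": ["<new hash prefix>"],
                             "remove": ["<removed hash prefix>"]}}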

Possible optimization #2:

Take some risk of hash collisions and allow the tracker to specify how many characters it needs from the hashes
(based on how many hashes it stores).
Estimated request sizes to announce 22k files:

  • Full hash (32 bytes): 770k
  • First 6 bytes (should be good until 10m hashes): 153k
  • First 7 bytes (should be good until 2560m hashes): 175k
  • First 8 bytes (should be good until 655360m hashes): 197k

Cons:

  • Depends on the zero:// trackers
  • Heavy requests, more CPU/BW load to trackers

Download all optional files / help initial seed for a specific user

Downloading all optional files in a site, or all files uploaded by a specific user, won't be possible anymore:
the optional files will no longer be stored in the files_optional node of the user's content.json.

Solution #1

Add a files_link node to content.json that lists the files uploaded in the last X days
(with sha512, ext, size, date_added nodes).
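
A sketch of what such a files_link node could look like, reusing the hash from the earlier example; only the field names (sha512, ext, size, date_added) come from the proposal, the surrounding structure and the timestamp format are assumptions:

files_link_example = {
    "files_link": {
        "any_file.jpg": {
            "sha512": "602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754",
            "ext": "jpg",
            "size": 3242,
            "date_added": 1567666800  # assumed unix timestamp
        }
    }
}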
