
Proposal: Content addressed data #2192

@HelloZeroNet

Description


Content addressed data access

Why?

  • To de-duplicate files between sites.
  • Allow better site archiving
  • Avoid data loss on site moderation/changes

What?

Store and access files based on the file's hash (or the merkle root for big files)

How?

File storage

data/static/[download_date]/[filename].[ext]

Possible alternatives to the static content root directory (instead of data/__static__/):

  • data-static/
  • data/__immutable__/

Variables (a path-construction sketch follows this list):

  • download_date (example: 2019-09-05): To avoid the per-directory file number limit and make the files easier to find.
  • hash: The merkle root of the file (sha512t256)
  • partial_hash: The first 8 characters of the hash, to keep the path length short (an incremental postfix could be required on file name collision)
  • filename: File name (the first requested; may vary between sites) (an incremental postfix could be required on file name collision)
  • ext: File extension (the first requested; may vary between sites)
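
A minimal sketch of how such a storage path could be built, assuming the data/static/ root from the template above and that partial_hash is prepended to the stored file name (the exact combination of partial_hash and filename is not specified in the proposal):

import hashlib
import os
import time

STATIC_ROOT = "data/static"  # root from the template above; data/__static__/ is listed as an alternative

def static_path(file_bytes, filename):
    # sha512t256: sha512 truncated to 256 bits (first 64 hex characters)
    file_hash = hashlib.sha512(file_bytes).hexdigest()[:64]
    partial_hash = file_hash[:8]                # first 8 characters, keeps the path short
    download_date = time.strftime("%Y-%m-%d")   # example: 2019-09-05
    # An incremental postfix would be appended on file name collision (not shown)
    return os.path.join(STATIC_ROOT, download_date, "%s_%s" % (partial_hash, filename))

print(static_path(b"example content", "any_file.jpg"))
# prints something like data/static/<download_date>/<partial_hash>_any_file.jpg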

Url access

http://127.0.0.1:43110/f/[hash].[ext] (for non-big file)
http://127.0.0.1:43110/bf/[hash].[ext] (for big file)

The file name can optionally be appended, but the hash does not depend on the filename:

http://127.0.0.1:43110/f/[hash]/[anyfilename].[ext]
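
A rough sketch of parsing these URL forms (the /d/ prefix for directories appears later in the proposal); only the path shapes come from the examples above, the regex itself is an assumption:

import re

ROUTE_RE = re.compile(
    r"^/(?P<type>f|bf|d)"                  # f: normal file, bf: big file, d: directory
    r"/(?P<hash>[0-9a-f]{64})"             # sha512t256 hash, hex encoded
    r"(?:\.(?P<ext>[A-Za-z0-9]+))?"        # optional .ext directly after the hash
    r"(?:/(?P<filename>[^/]+))?$"          # optional /anyfilename.ext
)

def parse_content_url(path):
    match = ROUTE_RE.match(path)
    return match.groupdict() if match else None

print(parse_content_url("/f/" + "0" * 64 + ".jpg"))
print(parse_content_url("/bf/" + "0" * 64 + "/any_file.jpg"))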

File upload

  • Create an interface similar to the big file upload (XMLHttpRequest based)
  • Scan directory: data/__static__/__add__: copy files to this directory, visit the ZeroHello Files tab and click on "Hash added files" (a sketch of this step follows the list)
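
A sketch of the "Hash added files" step, assuming the dated data/static/ layout from above; only the __add__ directory name comes from the proposal:

import hashlib
import os
import shutil
import time

ADD_DIR = "data/__static__/__add__"
STATIC_ROOT = "data/static"

def hash_added_files():
    date_dir = os.path.join(STATIC_ROOT, time.strftime("%Y-%m-%d"))
    os.makedirs(date_dir, exist_ok=True)
    for filename in os.listdir(ADD_DIR):
        src = os.path.join(ADD_DIR, filename)
        with open(src, "rb") as f:
            file_hash = hashlib.sha512(f.read()).hexdigest()[:64]  # sha512t256 (big files would use a merkle root)
        shutil.move(src, os.path.join(date_dir, filename))
        ext = os.path.splitext(filename)[1].lstrip(".")
        print("added: http://127.0.0.1:43110/f/%s.%s" % (file_hash, ext))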

File download process

  • Find possible peers with site-local findHashId/getHashfield / trackers
  • For big files: Download piecefield.msgpack
  • Use the normal getFile to download the file/pieces (use the sha512 in the request instead of the site/inner_path; a request sketch follows this list)
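
A sketch of what the modified getFile request could look like. The current request identifies a file by site and inner_path; the proposal only says the sha512 replaces them, so the key name below is an assumption:

import msgpack

def build_getfile_request(file_hash, location=0, req_id=1):
    # Same cmd/req_id/params envelope as the existing protocol, but the file is
    # identified by its sha512t256 hash instead of site + inner_path
    # (assumed key name: "sha512")
    return msgpack.packb(
        {"cmd": "getFile", "req_id": req_id,
         "params": {"sha512": file_hash, "location": location}},
        use_bin_type=True
    )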

Directory upload

For directory uploads we need to generate a content.json that contains references to the other files.
Basically these would be sites where the content.json is authenticated by its sha512t hash instead of the public address of the owner.

Example:

{
	"title": "Directory name",
	"files_link": {
		"any_file.jpg": {"link": "/f/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754", "size": 3242},
		"other_dir/any_file.jpg": {"link": "/bf/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754", "size": 3821232}
	}
}
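
A minimal sketch of generating such a content.json for a directory and deriving the hash that addresses it; the 10MB big-file threshold, the JSON serialization and the flat hashing of big files (which would really use a merkle root) are assumptions:

import hashlib
import json
import os

BIG_FILE_LIMIT = 10 * 1024 * 1024  # assumed threshold for /bf/ (big file) links

def build_directory_manifest(directory):
    manifest = {"title": os.path.basename(directory), "files_link": {}}
    for root, dirs, files in os.walk(directory):
        for filename in files:
            path = os.path.join(root, filename)
            rel_path = os.path.relpath(path, directory).replace("\\", "/")
            with open(path, "rb") as f:
                file_hash = hashlib.sha512(f.read()).hexdigest()[:64]  # sha512t256
            size = os.path.getsize(path)
            prefix = "/bf/" if size >= BIG_FILE_LIMIT else "/f/"
            manifest["files_link"][rel_path] = {"link": prefix + file_hash, "size": size}
    return manifest

def manifest_hash(manifest):
    # The directory is addressed by the sha512t hash of the generated content.json
    return hashlib.sha512(json.dumps(manifest, sort_keys=True).encode("utf8")).hexdigest()[:64]

The resulting directory would then be reachable under the /d/{manifest hash}/ URL described below.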

These directories can be accessed on the web interface using http://127.0.0.1:43110/d/{sha512t hash of the generated content.json}/any_file.jpg
(a file list can be displayed on directory access)

Downloaded files and the content.json are stored in the data/static/[download_date]/{Directory name} directory.

Each file in the directory is also accessible using
http://127.0.0.1:43110/f/602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754/any_file.jpg

As an optimization, if the files are accessed using a directory reference, the peer list can be fetched using
findHashId/getHashId from other peers without accessing the trackers.

Possible problems

Too many tracker requests

Announcing and keeping track of peers for a large number (10k+) of files can be problematic.

Solution #1

Send tracker requests only for large (10MB+) files.
To get the peer list for smaller files we use the current getHashfield / findHashId solution.

Cons:

  • It could be hard/impossible to find peers for small files if you are not connected to a site where that file is popular.
  • Hash collisions, as we use only the first 4 characters of the hash in the hashfield

Solution #2

Announce all files to zero:// trackers and reduce the re-announce time to e.g. 4 hours (re-announce within 1 minute if a new file is added).
(Sending this amount of requests to bittorrent trackers could be problematic.)
Don't store peers for files that you have 100% downloaded.

Request size for 10k files: 32 bytes * 10k = 320k (optimal case)

Possible optimization #1:

Change the tracker communication to request a client id token and only communicate hash additions / deletions until the expiry time.
The token expiry time is extended with every request.
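
A sketch of what this token-based delta announce could look like; none of these field names are specified in the proposal, they are only meant to illustrate the idea:

# First announce: full (partial) hash list, the tracker replies with a token and an expiry time
first_announce = {"cmd": "announce",
                  "params": {"port": 15441, "hashes": ["<all hash prefixes>"]}}
# reply: {"token": "abc123", "expiry": 14400}

# Later announces only carry additions/deletions and reference the token;
# every such request extends the token's expiry time
delta_announce = {"cmd": "announce",
                  "params": {"token": "abc123",
                             "add": ["<new hash prefix>"],
                             "remove": ["<removed hash prefix>"]}}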

Possible optimization #2:

Take some risk of hash collisions and allow the tracker to specify how many characters it needs from the hashes
(based on how many hashes it stores).
Estimated request sizes to announce 22k files:

  • Full hash (32 bytes): 770k
  • First 6 bytes (should be good until 10m hashes): 153k
  • First 7 bytes (should be good until 2560m hashes): 175k
  • First 8 bytes (should be good until 655360m hashes): 197k

Cons:

  • Depends on the zero:// trackers
  • Heavy requests, more CPU/BW load to trackers

Download all optional files / help initial seed for a specific user

Downloading all optional files in a site, or all files uploaded by a specific user, won't be possible anymore:
the optional files will no longer be stored in the files_optional node of the user's content.json.

Solution #1

Add a files_link node to content.json that lists the files uploaded in the last X days
(with sha512, ext, size, date_added nodes).
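
A sketch of what such a files_link node could look like, reusing the hash from the earlier example; only the field names (sha512, ext, size, date_added) come from the proposal, the surrounding structure and the timestamp format are assumptions:

files_link_example = {
    "files_link": {
        "any_file.jpg": {
            "sha512": "602b8a1e5f3fd9ab65325c72eb4c3ced1227f72ba855bef0699e745cecec2754",
            "ext": "jpg",
            "size": 3242,
            "date_added": 1567666800  # assumed unix timestamp
        }
    }
}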
