Conversation
flea89 left a comment
This is coming together 🎉
I did a really quick review pass and left some high-level comments for you.
Also noticed some types still need updating across the board.
this.MAX_DAG_SIZE = 1024 * 1024 * 1024 * 32 // don't try to transfer a DAG that's bigger than 32GB
this.log = debug('backup:pins')
What happens to these DAGs?
Is there a different approach we have in mind for those?
We don't... but it's a pattern that is found in other areas of the system. So might be worth flagging.
The comment is not pointing to the right line anymore; it should be the line above: https://github.com/web3-storage/web3.storage/pull/1844/files#diff-4df8cd6f6a1194703527832b6e009d1515b1d8f708afe9a5af7c31f446f9aadeR25
@alanshaw you might know more about this.
My thinking was along the lines of that being the max in our ToS and wanting to have a cap at some point on the size of data we're willing to have uploaded/pinned. If you uploaded more than what we've said is the max allowed then we shouldn't be obliged to store that.
But we don't have that limit for pins do we? I can't find the ToS for pins, but I might just be missing it?
Assuming there isn't one, there's a chance content bigger than that is stored there and users should expect it to be migrated?
I've commented out the size check part, because, unless we have strong reasons not to, we should move all the files to eipfs.
Add logging to keep track of enormous dags
I added 2 pieces of logging:
- every time a DAG export fails, for whatever reason, we log the number of bytes read up to that point
- If a successful export is greater than 32TiB, we log it.
32TiB or 32 GiB? i think we want to know about pins >= 32 GiB
yeah, I actually added 32 GiB in the code 👍 (see here)
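For illustration, the size check being discussed could look like this. This is a sketch only: `warnIfOversize` and `fmt` are made-up names, not the PR's actual code; only the `MAX_UPLOAD_DAG_SIZE` constant and the log message mirror the diff.

```javascript
// Sketch only: warnIfOversize and fmt are hypothetical names.
const MAX_UPLOAD_DAG_SIZE = 1024 * 1024 * 1024 * 32 // 32 GiB cap discussed above

// Format a byte count as a human-readable GiB string.
function fmt (bytes) {
  return `${(bytes / (1024 ** 3)).toFixed(1)} GiB`
}

// Log a warning when a successfully exported DAG exceeds the cap.
// `log` stands in for the debug-based logger used in the job.
function warnIfOversize (cid, bytesReceived, log = console.warn) {
  if (bytesReceived > MAX_UPLOAD_DAG_SIZE) {
    log(`⚠️ CID: ${cid} dag is greater than ${fmt(MAX_UPLOAD_DAG_SIZE)}`)
    return true
  }
  return false
}
```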
Ok, so before this is merged it needs the migration running which adds a new column.
Force-pushed 2d4715a to 6017c94
@alanshaw, while there are still a few tweaks required (i.e. some types are missing/need fixing), I wonder if you could review this PR to see if the approach is what you expected it to be.
on:
  schedule:
    - cron: '*/30 * * * *'
Why not just schedule it for the max amount of time a job can run, i.e. 6h?
If I understand this correctly the job has 2 goals:
- move historical pins to EIPFS
- keep moving new psa requests to EIPFS until we move to pickup.
For 2 I guess it's ideal to keep moving stuff as promptly as possible (30 min makes sense, maybe even less?), while we know the first runs of the job will be super slow (since they will have to go through all the historical data).
Isn't a solution to satisfy both worlds to keep the schedule as is and set concurrency on the job?
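The concurrency idea above could look something like this in the workflow file. A sketch only, assuming a GitHub Actions workflow; the group name is made up:

```yaml
on:
  schedule:
    - cron: '*/30 * * * *'

# Only one run of the job at a time: a slow historical-backfill run
# simply delays the next scheduled tick instead of overlapping with it.
concurrency:
  group: backup-pins        # hypothetical group name
  cancel-in-progress: false # let the in-flight batch finish
```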
packages/db/postgres/tables.sql
inserted_at TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL
updated_at TIMESTAMP WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL,
backup_urls TEXT[]
Might be nice if this was NOT NULL DEFAULT [] so you don't have to distinguish between null and empty.
Changed :) FWIW, I used the same definition as backup_urls in uploads. Should I change that one too?
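For reference, the suggested definition might look like this. A sketch only: the column name comes from the diff above, while the NOT NULL/default is the reviewer's suggestion:

```sql
-- Empty-array default so callers never need to distinguish NULL from [].
backup_urls TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[]
```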
Force-pushed 4edf2bf to b1fd1a9
Force-pushed e6d131b to df32027
Force-pushed df32027 to 52984d6
Force-pushed 2fe2914 to 9dec63f
@olizilla can you please give this a thorough review 🙏
let reportInterval
const libp2p = await getLibp2p()
try {
  const dagula = await Dagula.fromNetwork(libp2p, { peer })
Cache instances for same peer location.
It should be fine to keep a single libp2p instance.
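A minimal sketch of the per-peer caching suggestion. `dagulaForPeer` and `createDagula` are hypothetical names, not the PR's API:

```javascript
// Illustrative sketch only: dagulaForPeer and createDagula are made up.
const dagulaCache = new Map()

// Return a cached instance for a peer, creating one on first use, so
// repeated exports from the same peer location reuse a single connection.
async function dagulaForPeer (peer, createDagula) {
  if (!dagulaCache.has(peer)) {
    // Cache the promise (not the resolved value) so concurrent calls
    // for the same peer don't dial twice.
    dagulaCache.set(peer, createDagula(peer))
  }
  return dagulaCache.get(peer)
}
```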
throw (err)
} finally {
  if (bytesReceived > this.MAX_UPLOAD_DAG_SIZE) {
    this.log(`⚠️ CID: ${cid} dag is greater than ${this.fmt(this.MAX_UPLOAD_DAG_SIZE)}`)
Add a per-batch summary:
- failed CIDs (and their size)
- successful CIDs bigger than MAX_UPLOAD_DAG_SIZE
Remove all unnecessary per-CID logging; log just errors by default (leave the rest behind a higher debug level).
Done!
Default logging (DEBUG=backupPins:log) is quite succinct now, and the job can be run manually with a more verbose DEBUG=backupPins:* passed through workflow inputs.
Example of default logging:
❯ DEBUG=backupPins:log npm test --workspace=packages/cron
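The per-batch summary asked for above could be accumulated along these lines. A hypothetical sketch, not the actual implementation; every name here is made up:

```javascript
// Hypothetical per-batch summary accumulator; no names come from the PR.
function createBatchSummary (log = console.log) {
  const failed = []   // { cid, bytesReceived } for exports that errored
  const oversize = [] // { cid, size } for successful exports over the cap
  return {
    fail (cid, bytesReceived) { failed.push({ cid, bytesReceived }) },
    big (cid, size) { oversize.push({ cid, size }) },
    // One succinct line per batch; details stay behind a verbose debug level.
    flush () {
      log(`batch done: ${failed.length} failed, ${oversize.length} oversize`)
      return { failed, oversize }
    }
  }
}
```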
* @returns {Promise<number | undefined>}
*/
// Given for PIN requests we never limited files size we shouldn't check this. ie.
Bucket: bucketName,
Key: key,
Body: bak.content,
Metadata: { structure: 'Complete' }
Look into sending a checksum for the file; reject the upload if the bytes of the CAR don't match the CID.
I wonder if this is required.
Looking at the headers sent by the client
'content-type': 'application/xml',
'content-length': '11985',
Expect: '100-continue',
host: '127.0.0.1',
'x-amz-user-agent': 'aws-sdk-js/3.53.1',
'user-agent': 'aws-sdk-js/3.53.1 os/darwin/21.6.0 lang/js md/nodejs/16.14.0 md/crt-avail api/s3/3.53.1',
'amz-sdk-invocation-id': '25110079-acaf-425f-8933-3527fd8366c7',
'amz-sdk-request': 'attempt=1; max=3',
'x-amz-date': '20221031T122520Z',
'x-amz-content-sha256': 'ebd8a0f42b66a7756aaee73e6275d918143525f125137b584e6b079b364a6b5f',
authorization: 'AWS4-HMAC-SHA256 Credential=minioadmin/20221031/us-east-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-request;content-length;content-type;host;x-amz-content-sha256;x-amz-date;x-amz-user-agent, Signature=90453c633c07234480f3319eb3c1b058d25b39a077eb3653063165c5bc137722'
}
you can see it sends x-amz-content-sha256, which is the SHA-256 of the payload and implies the payload is signed (see the docs).
If every chunk is hashed and verified I don't think we need the overall one? Or am I missing something?


psa_pins_request table to add the new backup_urls column.

About this job
This new cron job, which runs every 4 hours, grabs 10,000 pin requests, gets the CAR file and uploads it to S3.
It behaves in a similar way to nftstorage/backup.
It uses Dagula to grab the CAR file from the IPFS peer.
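The byte-counting behaviour discussed in the review (logging how many bytes were read when an export fails) can be sketched generically. `blockSource` stands in for the Dagula block stream and `onBytes` for the logger; neither name comes from the PR:

```javascript
// Generic sketch: consume a stream of blocks while tracking a running
// byte total, so a caller can report bytes read if the export fails.
async function exportWithByteCount (blockSource, onBytes = () => {}) {
  let bytesReceived = 0
  for await (const block of blockSource) {
    bytesReceived += block.bytes.length
    onBytes(bytesReceived) // running total so far
  }
  return bytesReceived
}
```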