Haystack hardware requirement for massive document set #6140
TheRabidWolverine started this conversation in General (1 comment, 2 replies)
-
Hi, as far as I understand, Haystack has to be trained with a document set so that it can use an encoder-decoder model for extracting answers to questions asked about those documents. Can anyone explain the hardware requirements for the entire setup? Suppose there are one hundred billion web article documents (they can be anything, from a Wikipedia article to a BBC article lambasting Trump to a NYT article explaining climate change). Obviously they need to be fed to Haystack for training before it is ready for inference, i.e. answering questions whose answers can be found somewhere among those 100 billion documents.
Questions:
1. What will be the disk requirement for hosting and inference? Those 100 billion documents are stored in object storage (assuming an average document size of 12 kB, the total storage size is 1.2 PB); they are downloaded serially and fed into Haystack during the training process. Will the newly required disk space be just as big, i.e. 1.2 PB? Or more? Or less?
2. What are the GPU and memory requirements for answering questions in under a second from this 100B document set? Will it require one full A100 node (8 GPUs, each with 40 GB of GPU RAM)? Or more/less?
3. How much can performance suffer if the memory requirements are not fully met? Can a large part of the model reside on disk and be searched at runtime along with the data held in memory? If so, what kind of time delay are we talking about?
I know these questions are not perfectly answerable from this information; I just want a rough idea from estimates that experienced users can give based on what they have observed.
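The sizing in question 1 can at least be made explicit with a few lines of arithmetic. This is only a sketch: the 100 billion count and the 12 kB average come from the question above, while the assumption that the document store keeps a full copy of the text is hypothetical and depends entirely on the chosen document store and preprocessing.

```python
# Sanity check of the raw corpus size from the question, plus the quantity the
# disk question really hinges on: how much of that text ends up copied into the
# document store behind Haystack, on top of the embedding index.
NUM_DOCS = 100e9        # 100 billion documents (from the question)
AVG_DOC_KB = 12         # average document size in kB (from the question)

raw_corpus_pb = NUM_DOCS * AVG_DOC_KB * 1e3 / 1e15
print(f"Raw corpus in object storage: {raw_corpus_pb:.1f} PB")   # 1.2 PB

# Hypothetical: if the document store re-stores the full (preprocessed) text,
# that copy alone is roughly another 1.2 PB before any embeddings are added.
STORED_TEXT_FRACTION = 1.0   # assumed fraction of the raw text re-stored
print(f"Text copied into the document store: "
      f"{raw_corpus_pb * STORED_TEXT_FRACTION:.1f} PB")
```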
-
These are some hard questions, @TheRabidWolverine. IIRC, when I stored 16 million Wikipedia paragraphs, the storage took on the order of tens of GB. I know it was in GB, but I don't remember exactly how much. The vectors had 768 dimensions. Back-of-the-napkin calculation below; please redo it carefully yourself: 768 dimensions per vector × 2 bytes/dimension (assuming float16) ≈ 1.5 KB per paragraph, so 16 million paragraphs come to roughly 24 GB. Add some vector DB overhead to that and it's easily 25 GB. Perhaps that can serve as a starting calculation for your requirements.
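Written out as code, the same estimate looks like this. The 768 dimensions, 2 bytes per dimension (float16), and 16 million paragraphs come from the comment above; the passages-per-document factor used to project the estimate onto the 100-billion-document corpus is a made-up illustration, not a recommendation.

```python
def embedding_storage_bytes(num_vectors: int, dims: int = 768, bytes_per_dim: int = 2) -> int:
    """Raw size of the stored vectors only, before any vector-DB overhead."""
    return num_vectors * dims * bytes_per_dim

# The reply's example: 16 million Wikipedia paragraphs, 768-dim float16 vectors.
wiki_bytes = embedding_storage_bytes(16_000_000)
print(f"16M paragraphs: {wiki_bytes / 1e9:.1f} GB")   # ~24.6 GB raw, ~25 GB with overhead

# Hypothetical extension to the 100B-document corpus, ASSUMING each document is
# split into about 5 passages (that split factor is purely illustrative).
PASSAGES_PER_DOC = 5
web_bytes = embedding_storage_bytes(int(100e9) * PASSAGES_PER_DOC)
print(f"100B docs x {PASSAGES_PER_DOC} passages: {web_bytes / 1e15:.2f} PB")   # ~0.77 PB of raw vectors
```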