Haystack hardware requirement for massive document set #6140
TheRabidWolverine started this conversation in General (1 comment, 2 replies)
-
Hi, as far as I understand, Haystack has to be trained with a document set so that it can use an encoder-decoder model for extracting answers to questions asked about those documents. Can anyone explain the hardware requirements for the entire setup? Suppose there are one hundred billion web article documents (they can be anything, from a Wikipedia article to a BBC article lambasting Trump to a NYT article explaining climate change). Obviously they need to be fed to Haystack for training before it is ready for inference, i.e. answering questions whose answers can be found somewhere among those 100 billion documents.
Questions:
1. What will be the disk requirement for hosting and inference? Those 100 billion documents are stored in object storage (assuming an average document size of 12 kB, the total storage size is 1.2 PB); they are downloaded serially and fed into Haystack during the training process. Will the newly required disk space be just as big, i.e. 1.2 PB? Or more? Or less?
2. What are the GPU and memory requirements for answering questions in under a second from this 100B document set? Will it require one full A100 node (8 GPUs, each with 40 GB of GPU RAM)? Or more/less?
3. How much can performance suffer if the memory requirements are not fully met? Can a large part of the model reside on disk and be searched at runtime along with the data held in memory? If so, what kind of time delay are we talking about?
I know these questions are not perfectly answerable from this information; I just want a rough idea from estimates that experienced users can give based on what they have observed.
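The sizing in question 1 can at least be made explicit with a few lines of arithmetic. This is only a sketch: the 100 billion count and the 12 kB average come from the question above, while the assumption that the document store keeps a full copy of the text is hypothetical and depends entirely on the chosen document store and preprocessing.

```python
# Sanity check of the raw corpus size from the question, plus the quantity the
# disk question really hinges on: how much of that text ends up copied into the
# document store behind Haystack, on top of the embedding index.
NUM_DOCS = 100e9        # 100 billion documents (from the question)
AVG_DOC_KB = 12         # average document size in kB (from the question)

raw_corpus_pb = NUM_DOCS * AVG_DOC_KB * 1e3 / 1e15
print(f"Raw corpus in object storage: {raw_corpus_pb:.1f} PB")   # 1.2 PB

# Hypothetical: if the document store re-stores the full (preprocessed) text,
# that copy alone is roughly another 1.2 PB before any embeddings are added.
STORED_TEXT_FRACTION = 1.0   # assumed fraction of the raw text re-stored
print(f"Text copied into the document store: "
      f"{raw_corpus_pb * STORED_TEXT_FRACTION:.1f} PB")
```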
-
These are some hard questions, @TheRabidWolverine. IIRC, when I stored 16 million Wikipedia paragraphs, the storage took on the order of tens of GB. I know it was in GB, but I don't remember exactly how much. The vectors had 768 dimensions. Back-of-the-napkin calculation below; please redo it carefully yourself: 768 dimensions per vector × 2 bytes/dimension (assuming float16) ≈ 1.5 KB per paragraph, so 16 million paragraphs come to roughly 24 GB. Add some vector DB overhead to that and it's easily 25 GB. Perhaps that can serve as a starting calculation for your requirements.
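Written out as code, the same estimate looks like this. The 768 dimensions, 2 bytes per dimension (float16), and 16 million paragraphs come from the comment above; the passages-per-document factor used to project the estimate onto the 100-billion-document corpus is a made-up illustration, not a recommendation.

```python
def embedding_storage_bytes(num_vectors: int, dims: int = 768, bytes_per_dim: int = 2) -> int:
    """Raw size of the stored vectors only, before any vector-DB overhead."""
    return num_vectors * dims * bytes_per_dim

# The reply's example: 16 million Wikipedia paragraphs, 768-dim float16 vectors.
wiki_bytes = embedding_storage_bytes(16_000_000)
print(f"16M paragraphs: {wiki_bytes / 1e9:.1f} GB")   # ~24.6 GB raw, ~25 GB with overhead

# Hypothetical extension to the 100B-document corpus, ASSUMING each document is
# split into about 5 passages (that split factor is purely illustrative).
PASSAGES_PER_DOC = 5
web_bytes = embedding_storage_bytes(int(100e9) * PASSAGES_PER_DOC)
print(f"100B docs x {PASSAGES_PER_DOC} passages: {web_bytes / 1e15:.2f} PB")   # ~0.77 PB of raw vectors
```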