
Commit b318d61

benchmark readme updates (#508)
* benchmark readme updates
* benchmark image update
* benchmark text update

Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent 907ae08 commit b318d61

File tree: 3 files changed (+5, -1 lines)


README.md

Lines changed: 5 additions & 1 deletion
@@ -187,7 +187,11 @@ The following figure shows that the use of different data curation modules imple
 <img src="./docs/user-guide/assets/zeroshot_ablations.png" alt="drawing" width="700"/>
 </p>

-In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.1 Trillion token Red Pajama dataset in 1.8 hours with 64 NVIDIA A100 Tensor Core GPUs.
+In terms of scalability and compute performance, using the combination of RAPIDS and Dask fuzzy deduplication enabled us to deduplicate the 1.96 Trillion token subset of the RedPajama V2 dataset in 0.5 hours with 32 NVIDIA H100 GPUs.
+
+Processing Time | Comparison to Alternative Libraries
+:-------------------------:|:---------------------------------------:
+![](./docs/user-guide/assets/readme/fuzzy-dedup-processing-time.png) | ![](./docs/user-guide/assets/readme/fuzzy-dedup-processing-optimization-16x.png)

 Additionally, using the CPU-based modules, the following table shows the time required and resulting data size reduction for each processing step [Common Crawl snapshot from November/December of 2020](https://commoncrawl.org/2020/12/nov-dec-2020-crawl-archive-now-available/) using 30 CPU nodes (with hardware similar to the `c5.24xlarge` [Amazon AWS C5 instance](https://aws.amazon.com/ec2/instance-types/c5/)).
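For context on the GPU path benchmarked in the diff above, the sketch below shows roughly how the RAPIDS/Dask-backed fuzzy deduplication can be driven from Python. It is a minimal, illustrative example only: the class and parameter names (`FuzzyDuplicates`, `FuzzyDuplicatesConfig`, `DocumentDataset.read_json`, the `id`/`text` field names) are assumptions based on NeMo Curator's documented API and may differ between releases, and all paths are placeholders; check the repository docs for exact signatures.

```python
# Illustrative sketch of GPU fuzzy deduplication with Dask-CUDA + NeMo Curator.
# Class/parameter names are assumptions based on NeMo Curator's documented API
# and may differ between releases; all paths below are placeholders.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

if __name__ == "__main__":
    # One Dask worker per visible GPU; multi-node runs would scale this out
    # with dask-cuda workers instead of a local cluster.
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # JSONL documents with "id" and "text" fields, read into cuDF-backed partitions.
    dataset = DocumentDataset.read_json("/path/to/jsonl_dir", backend="cudf")

    config = FuzzyDuplicatesConfig(
        cache_dir="/path/to/fuzzy_dedup_cache",  # MinHash/LSH intermediates land here
        id_field="id",
        text_field="text",
        jaccard_threshold=0.8,  # documents above this similarity are grouped as duplicates
    )
    duplicates = FuzzyDuplicates(config=config)(dataset)

    # Each row maps a document id to a duplicate group; a downstream step would
    # keep one document per group and drop the rest.
    duplicates.df.to_parquet("/path/to/duplicate_ids")

    client.close()
    cluster.close()
```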

Binary files (benchmark images): 115 KB and 74.2 KB
