Commit b517728

Adds citation (huggingface#101)
* added citation
* added to toc
* updated author list
1 parent 1e7315c commit b517728

2 files changed: +39 -0 lines changed

CITATION.cff

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+cff-version: 1.2.0
+title: 'DataTrove: large scale data processing'
+message: >-
+  If you use this software, please cite it using the metadata from this file.
+type: software
+authors:
+  - given-names: Guilherme
+    family-names: Penedo
+  - given-names: Alessandro
+    family-names: Cappelli
+  - given-names: Thomas
+    family-names: Wolf
+  - given-names: Mario
+    family-names: Sasko
+repository-code: 'https://github.com/huggingface/datatrove'
+abstract: "DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality. DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory usage and multiple step design makes it ideal for large workloads, such as to process an LLM's training data."
+keywords:
+  - deep-learning
+  - pytorch
+  - transformers
+  - llms
+  - data
+  - scale
+license: Apache-2.0
+version: 0.0.1

README.md

Lines changed: 14 additions & 0 deletions
@@ -32,6 +32,7 @@ Local, remote and other file systems are supported through [fsspec](https://file
     + [Custom function](#custom-function)
     + [Custom block](#custom-block)
 - [Contributing](#contributing)
+- [Citation](#citation)
 
 <!-- tocstop -->
 
@@ -399,3 +400,16 @@ Run the tests:
 ```bash
 pytest -sv ./tests/
 ```
+
+## Citation
+
+```bibtex
+@misc{penedo2024datatrove,
+  author = {Penedo, Guilherme and Cappelli, Alessandro and Wolf, Thomas and Sasko, Mario},
+  title = {DataTrove: large scale data processing},
+  year = {2024},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  url = {https://github.com/huggingface/datatrove}
+}
+```
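
The BibTeX entry added to README.md repeats the metadata in CITATION.cff by hand, so the two can drift apart. As a rough sketch of how they could be kept in sync, the snippet below renders the same entry from a dictionary copied from the CFF fields. The function name `cff_to_bibtex`, the `META` dictionary, and the citation-key scheme are illustrative assumptions and not part of this commit. The year is passed in explicitly because the CFF file in this diff has no release date field.

```python
# Hypothetical helper: not part of the datatrove repository.
# META mirrors the fields of the CITATION.cff added in this commit.
META = {
    "title": "DataTrove: large scale data processing",
    "authors": [
        {"given-names": "Guilherme", "family-names": "Penedo"},
        {"given-names": "Alessandro", "family-names": "Cappelli"},
        {"given-names": "Thomas", "family-names": "Wolf"},
        {"given-names": "Mario", "family-names": "Sasko"},
    ],
    "repository-code": "https://github.com/huggingface/datatrove",
}


def cff_to_bibtex(meta: dict, year: int) -> str:
    """Render a @misc BibTeX entry from CITATION.cff-style metadata."""
    # BibTeX expects "Family, Given" names joined by " and ".
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in meta["authors"]
    )
    # Assumed key scheme: first author's family name + year + first title word.
    first_word = meta["title"].split(":")[0].split()[0].lower()
    key = f"{meta['authors'][0]['family-names'].lower()}{year}{first_word}"
    fields = [
        ("author", authors),
        ("title", meta["title"]),
        ("year", str(year)),
        ("publisher", "GitHub"),
        ("journal", "GitHub repository"),
        ("url", meta["repository-code"]),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


if __name__ == "__main__":
    print(cff_to_bibtex(META, 2024))
```

In a real workflow the dictionary would presumably be loaded from CITATION.cff with a YAML parser rather than copied by hand. That step is left out here to keep the sketch free of third-party dependencies.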
