This repository contains the code for the paper "TeraGram: A Structured Longitudinal Dataset of the Telegram Messenger".
If you use the dataset, please cite it as follows:
@misc{TeraGram,
title = {TeraGram: A Structured Longitudinal Dataset of the Telegram Messenger},
author = {Golovin, Anastasia and Mohr, Sebastian B. and Gottwald, Arne I. and Hvid, Ulrik and Trivedi, Srushhti and Neto, Joao P. and Schneider, Andreas C. and Priesemann, Viola},
note = {in review},
}
As the paper is currently under review, the dataset is not yet available for the public. A preview of the dataset in CSV format is available here.
We use Poetry to manage Python environments and dependencies. To install Poetry on Linux, Windows or macOS, go to the documentation. Then, use
poetry installto create a new environment for the project and install all dependencies.
To install topic modeling dependencies, you need to install additional GPU libraries with poetry install -E cu130. To plot the entity-relation diagram, you need to install GraphViz on your system and then call poetry insatll -E erd.
To connect to a locally running copy of the database, you need to provide your credentials in the .env file. The file example.env provides the environment variables that need to be set. Copy the file, rename it to .env and set the correct credentials.
The variables OUTPUT_FOLDER and SCRATCH_FOLDER set the directories where the results will be stored. The OUTPUT_FOLDER contains final results and the SCRATCH_FOLDER is used for caching intermediate states. The cache can get large in size or contain many files that would overload the backup infrastructure, so it makes sense to separate those folders on a cluster. If those two things are not an issue in your environment, both folders can be set to local subfolders of the project.