dorianquelle/Lost-In-Translation

Data

The Data is available on Zenodo:

FullData.csv.gz: Contains links to all claims in the data-set, with the following columns:

publishing_date: Date on which the fact-check was published.
claim_date: Date on which the claim was made.
verdict: Rating given by the fact-checking organisation.
language: Language of the claim.
cluster_{threshold}: ID of the cluster that the claim belongs to at the given threshold; there is one such column per threshold. An entry of "0" means that the claim is a singleton and is not clustered with any other claim.
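
As a minimal sketch of how these columns might be read, the snippet below loads a gzip-compressed CSV with the standard library and counts the claims per cluster, skipping singletons (cluster ID 0). The column name `cluster_0.9` used in the usage note is hypothetical; the actual threshold suffixes should be taken from the file header. The helpers `read_claims` and `cluster_sizes` are illustrative and not part of the repository.

```python
import csv
import gzip
from collections import Counter


def read_claims(path):
    """Read a gzip-compressed CSV file into a list of dict rows."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def cluster_sizes(rows, cluster_col):
    """Count claims per cluster, skipping singletons (cluster ID 0)."""
    sizes = Counter()
    for row in rows:
        # Assumption: cluster IDs are numeric; accept both "2" and "2.0".
        cluster_id = int(float(row[cluster_col]))
        if cluster_id != 0:
            sizes[cluster_id] += 1
    return sizes
```

Assuming the file has been downloaded from Zenodo, `cluster_sizes(read_claims("FullData.csv.gz"), "cluster_0.9")` would return a `Counter` mapping each non-singleton cluster ID to its number of claims.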

Embeddings.npy: Contains a dictionary linking each claim to its embedding calculated with LaBSE.
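
A dictionary stored in a `.npy` file is typically written with `np.save`, which wraps it in a zero-dimensional object array. Assuming the file was created that way, the sketch below shows one way to load it and compare two claims by cosine similarity. The helper names are illustrative, and the key format (how each claim is identified) is not specified here, so inspect the keys after loading.

```python
import numpy as np


def load_embeddings(path):
    """Load a dictionary that was saved to a .npy file with np.save."""
    # np.save stores a dict as a 0-d object array; .item() unwraps it.
    # allow_pickle=True can execute arbitrary code during loading,
    # so only use it on files from a source you trust.
    return np.load(path, allow_pickle=True).item()


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For example, `embeddings = load_embeddings("Embeddings.npy")` followed by `cosine_similarity(embeddings[key_a], embeddings[key_b])`, where `key_a` and `key_b` are two keys taken from the loaded dictionary.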

File Descriptions:

00_FactCheckersMap.ipynb - Creates maps visualising the number of fact-checks and fact-checking organisations per country.
01_CreateData.ipynb - Parses the scraped fact-checks and the Data Commons fact-check dump. Removes duplicates and cleans the data.
02_CleanClaims.ipynb - Data cleaning for the claim entries in the data-set.
03_EmbeddClaims.ipynb - Embeds the claims and builds an Annoy index for similarity comparison. Exports an edge list of similar claims.
04_Clustering.ipynb - Creates clusters of the most similar claims for each threshold and runs the analysis.
05_Translate.ipynb - Translates all claims to English.
06_Tokenanalysis.ipynb - Analyzes token usage to identify and differentiate long-lasting and transient terms in claims across languages.
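
To illustrate how an edge list of similar claims can be turned into threshold-based clusters, here is a minimal sketch that keeps only edges at or above a similarity threshold and labels the connected components with a union-find structure. Using connected components is an assumption made for illustration; the notebooks may cluster differently. The function `cluster_by_threshold` and the `(i, j, similarity)` edge format are hypothetical. Singletons receive ID 0, matching the convention of the `cluster_{threshold}` columns.

```python
from collections import defaultdict


def cluster_by_threshold(n_claims, edges, threshold):
    """Label claims by connected component over edges with similarity >= threshold.

    edges: iterable of (i, j, similarity) with claim indices in range(n_claims).
    Returns a list of cluster IDs; 0 marks a singleton, clusters count from 1.
    """
    parent = list(range(n_claims))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, similarity in edges:
        if similarity >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Group claims by the root of their component.
    members = defaultdict(list)
    for claim in range(n_claims):
        members[find(claim)].append(claim)

    labels = [0] * n_claims
    next_id = 1
    for group in members.values():
        if len(group) > 1:  # singletons keep the ID 0
            for claim in group:
                labels[claim] = next_id
            next_id += 1
    return labels
```

Running the function once per threshold yields one label list per threshold, which mirrors the layout of one `cluster_{threshold}` column per threshold in the data.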