The data is available on Zenodo:
FullData.csv.gz: Contains links to all claims in the data-set.
- publishing_date: Date on which the fact-check was published.
- claim_date: Date on which the claim was made.
- verdict: Rating given by the fact-checking organisation.
- language: Language of the claim.
- cluster_{threshold}: ID of the cluster that the claim belongs to at the given similarity threshold (one column per threshold). Entry "0" means that the claim is a singleton and is not clustered with any other claim.
Embeddings.npy: Contains a dictionary linking each claim to its embedding calculated with LaBSE.
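A minimal sketch of how the two files might be loaded, assuming `pandas` and `numpy` are installed and that the dictionary was stored with `np.save` (which pickles it into a 0-dimensional object array). The helper names `load_claims`, `load_embeddings` and `clustered_only` are illustrative and not part of this repository, and the keys of the embedding dictionary are not specified here, so check them after loading.

```python
import numpy as np
import pandas as pd


def load_claims(csv_path):
    """Read the gzipped CSV; pandas infers gzip compression from the .gz suffix."""
    return pd.read_csv(csv_path)


def load_embeddings(npy_path):
    """Load a dictionary that was saved with np.save.

    Assumption: the dict was pickled into a 0-d object array, so
    allow_pickle=True is needed and .item() unwraps the dict.
    Only unpickle files from a source you trust.
    """
    return np.load(npy_path, allow_pickle=True).item()


def clustered_only(df, cluster_col):
    """Keep rows that belong to a cluster, dropping singletons (cluster ID 0).

    Assumption: the cluster IDs were parsed as integers by pandas.
    """
    return df[df[cluster_col] != 0]
```

For example, `[c for c in df.columns if c.startswith("cluster_")]` lists the available threshold columns, any of which can be passed as `cluster_col`.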
File Descriptions:
- 00_FactCheckersMap.ipynb - Creates maps visualising the number of fact-checks and fact-checking organisations per country.
- 01_CreateData.ipynb - Parses the scraped fact-checks and the Data Commons fact-check dump. Removes duplicates and cleans the data.
- 02_CleanClaims.ipynb - Data cleaning for the claim entries in the data-set.
- 03_EmbeddClaims.ipynb - Embeds the claims and structures the data for similarity comparison using Annoy indexing. Exports an edge list of similar claims.
- 04_Clustering.ipynb - Creates clusters of the most similar claims for each threshold and runs the analysis.
- 05_Translate.ipynb - Translates all claims to English.
- 06_Tokenanalysis.ipynb - Analyzes token usage to identify and differentiate long-lasting and transient terms in claims across languages.
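
To illustrate how an edge list of similar claims can be turned into per-threshold cluster IDs, here is a rough sketch. It assumes that clusters are the connected components of the graph formed by edges whose similarity is at or above the threshold; the notebooks may use a different procedure. The function name `cluster_by_threshold`, the `(claim_a, claim_b, similarity)` edge format and the example values are hypothetical. Singletons receive ID 0, matching the `cluster_{threshold}` convention described above.

```python
def cluster_by_threshold(n_claims, edges, threshold):
    """Assign cluster IDs via union-find over edges with similarity >= threshold.

    edges: iterable of (claim_a, claim_b, similarity) with integer claim indices.
    Returns a list of cluster IDs, where 0 marks a singleton claim.
    """
    parent = list(range(n_claims))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every pair that is similar enough.
    for a, b, sim in edges:
        if sim >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    # Count component sizes so that singletons can be labelled 0.
    sizes = {}
    for i in range(n_claims):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    # Number the remaining components 1, 2, ... in order of first appearance.
    ids = {}
    labels = []
    for i in range(n_claims):
        root = find(i)
        if sizes[root] == 1:
            labels.append(0)
        else:
            labels.append(ids.setdefault(root, len(ids) + 1))
    return labels


# Made-up example: six claims and three similarity edges.
example_edges = [(0, 1, 0.95), (1, 2, 0.80), (3, 4, 0.92)]
print(cluster_by_threshold(6, example_edges, 0.90))  # [1, 1, 0, 2, 2, 0]
print(cluster_by_threshold(6, example_edges, 0.75))  # [1, 1, 1, 2, 2, 0]
```

Lowering the threshold admits more edges, so clusters can only merge or grow, which is why the same claim may carry different IDs in different `cluster_{threshold}` columns.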