A curated list of Turkish AI models, datasets, and papers
The purpose of this repo is to share and spread information about Turkish AI models, datasets, and papers. These Turkish resources are few in number and scattered across the web, so this repo aims to bring a curated selection of them together. It is a selection, not a list of every Turkish NLP/LLM model or dataset, so not every BERT- or LLaMA-based model is going to make it here. The same applies to low-quality Google Translate translations of datasets. We aim for each entry to have something unique about it. This can be model performance, uniqueness of the task, highlighting the groups/companies behind it (not everyone shares their stuff, so why not appreciate it!), etc. If you want to add anything, you are welcome 😏; please check out the contributing section.
- ytu-ce-cosmos/Turkish-Llama
- Trendyol/Llama-3-Trendyol-LLM-8b-chat-v2.0
- Trendyol/Trendyol-LLM-7B-chat-v4.1.0
- TURKCELL/Turkcell-LLM-7b-v1
- KOCDIGITAL/Kocdigital-LLM-8b-v0.1
- WiroAI/OpenR1-Qwen-7B-Turkish Reasoning model
- WiroAI/wiroai-turkish-llm-9b
- ytu-ce-cosmos/Turkish-Gemma-9b-v0.1
- Trendyol/Trendyol-LLM-8B-T1 Qwen3 finetune, has thinking mode
- ytu-ce-cosmos/Turkish-Gemma-9b-T1
- vngrs-ai/Kumru-2B Uses the Mistral architecture. It's a model trained from scratch (not a finetune).
- Trendyol/tybert
- Trendyol/tyroberta
- ytu-ce-cosmos/turkish-base-bert-uncased
- ytu-ce-cosmos/turkish-colbert
- ytu-ce-cosmos/turkish-gpt2-large
- dbmdz/bert-base-turkish-128k-uncased
- TURKCELL/bert-offensive-lang-detection-tr
- asafaya/kanarya-2b
- boun-tabi-LMG/TURNA
- Helsinki-NLP group Lots of translation models for Turkish
- VRLLab/TurkishBERTweet Tweet sentiment analysis
- akdeniz27/bert-base-turkish-cased-ner
- Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0 Turkish and multilingual embeddings
- artiwise-ai/modernbert-base-tr-uncased
- ytu-ce-cosmos/turkish-e5-large Turkish retrieval model
To be added
- kesimeg/lora-turkish-clip CLIP model finetuned on a Turkish dataset
- merve/turkish_instructions Instruction tuning dataset
- BrewInteractive/alpaca-tr Instruction tuning dataset
- Metin/WikiRAG-TR
- MBZUAI/Bactrian-X
- Helsinki-NLP group Lots of translation datasets for Turkish
- turkish-nlp-suite/turkish-wikiNER
- turkish-nlp-suite/InstrucTurca
- WiroAI/dolphin-r1-turkish Reasoning dataset
- allenai/c4 Web scrape
- HPLT/HPLT2.0_cleaned Web scrape
- unimelb-nlp/wikiann NER
- TUR2SQL Text to SQL query dataset
- dolphin-r1-turkish Reasoning dataset
- emre/ct_tree_of_thought_turkish Turkish Tree of Thoughts (ToT) dataset
- evreny/prompt_injection_tr Turkish prompts for prompt injection
- HuggingFaceFW/fineweb-2 Has ~95 million Turkish text entries
- TURSpider Text-to-SQL dataset
- vngrs-ai/vngrs-web-corpus Pretraining data which is a collection of different datasets crawled from the internet
- HuggingFaceFW/finetranslations Has 58 million Turkish-English text pairs for translation. The translations were generated with Gemma3-27B (from the original Turkish dataset into English).
- ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0 Pretraining data which is a collection of different datasets crawled from the internet
- ytu-ce-cosmos/Turkish-LLaVA-Finetune
- ytu-ce-cosmos/Turkish-LLaVA-Pretrain
- ytu-ce-cosmos/turkce-kitap
- 99eren99/LLaVA1.5-Data-Turkish
- TasvirEt
- nezahatkorkmaz/turkish-medical-vqa-evaluated Medical image question and answer dataset
- nezahatkorkmaz/unsloth-pmc-vqa-tr Medical image question answering dataset. Translated from the PMC-VQA dataset. Requires access to the images from the original dataset.
- BosphorusSign22k Sign recognition
- FinePDFs Has 1.7 million Turkish entries. A PDF dataset that can be great for pretraining and RAG benchmark curation.
- ituperceptron/image-captioning-turkish Image captioning dataset. 200k long, 100k short captions
- mozilla-foundation/common_voice_17_0 Older versions of this dataset (v16, v15, etc.) are also available.
- malhajar/OpenLLMTurkishLeaderboard_v0.2
- KUIS-AI/Cetvel
- kesimeg/Turkish-rewardbench Reward model comparison
- TurkBench/TurkBench
- newmindai/Mezura Has RAG, human evaluation (Elo score), and other benchmark scores. It also includes the benchmarks in malhajar/OpenLLMTurkishLeaderboard_v0.2
- newmindai/Mizan Embedding model leaderboard. Compares abilities of embedding models on tasks such as retrieval, clustering etc.
- AYueksel/TurkishMMLU
- alibayram/turkish_mmlu
- ytu-ce-cosmos/gsm8k_tr
- Holmeister's Collections A collection of 17 datasets for 11 different tasks (truthfulness, fairness, summarization, etc.). For more, see the paper
- CohereLabs/Global-MMLU MMLU for multiple languages including Turkish
- mrlbenchmarks/global-piqa-nonparallel Cultural commonsense benchmark.
- ytu-ce-cosmos/gpqa-extended_tr Graduate level science questions.
- CohereLabsCommunity/multilingual-reward-bench Reward benchmark (preference prediction)
- CohereLabs/m-WildVision
- CohereLabs/AyaVisionBench
- kesimeg/MMStar_tr
- metu-yks/yksbench A visual benchmark based on the university entrance exam. Questions include visuals related to mathematics, geometry, physics, chemistry, biology, and geography
- Cosmos-LLaVA: Chatting with the Visual
- Introducing cosmosGPT: Monolingual Training for Turkish Language Models
- TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
- TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study
- How do LLMs perform on Turkish? A multi-faceted multi-prompt evaluation Performances of various LLMs in Turkish
- Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
- YKSBench: Stress-Testing Multimodal Models with Exam-Style Questions Paper for the YKSBench benchmark
- TurkBench: A Benchmark for Evaluating Turkish Large Language Models Paper for the TurkBench benchmark
- Glosbe
- Wiktionary
- Zemberek Some Turkish NLP tools
- 3rt4nm4n/turkish-apis A list of Turkish APIs
- THY-MCP
- borsa-mcp MCP Server for Istanbul Stock Exchange and Turkish Investment Fund Data
- yargi-mcp MCP Server for Turkish Legal Databases
- mevzuat-mcp MCP Server for Searching Turkish Legislation
- yoktez-mcp MCP Server for Turkish Thesis Database
- yokatlas-mcp MCP Server for YOK Atlas
- KUIS-AI YouTube channel
- TR-AI YouTube channel
- Trendyol Tech YouTube channel Has videos about their AI products and how they integrate AI
- Mukayese: Turkish NLP Strikes Back
- Mukayese GitHub repo
- Wikipedia dumps Can be used as a dataset
- Turkish Encoder-only Models List A collection of encoder-only Turkish models
- Turkish Instruction Datasets List A collection of Turkish instruction datasets
- Turkish Vision-Language Datasets List A collection of Turkish vision-language datasets
- Cosmos App The Cosmos AI Research group's app, which hosts their Cosmos model. (Also has an iOS version)
- ITU NLP Research Tools and Resources
If you have anything to add here, just make a pull request! Before making a pull request, please consider whether the model/dataset/etc. has enough quality or uniqueness. Hugging Face is crowded with finetunes of LLaMA and BERT, and the same applies to datasets: many of them exist in multiple machine-translated versions. This makes it hard to find good-quality sources. We want to keep this list as curated as possible while still covering enough sources.