UNLP Kaggle Competition Research Report

This repository contains my experimental work for the UNLP Challenge, a text classification competition on Kaggle focused on identifying manipulation techniques in textual content.

Approach

I explored multiple classification approaches, focusing on both traditional NLP methods and modern LLM-based techniques. The primary goal was to determine the most effective approach given resource constraints and the nature of the classification problem, particularly considering class imbalance issues.

Techniques and Results

Approach	Description	F1 Score
XLM-RoBERTa (large)	Fine-tuned transformer model with classification head, best handling of imbalanced classes	0.392
XLM-RoBERTa (base)	Smaller version of XLM-RoBERTa with classification head	0.368
TF-IDF with Linear Regression	Traditional NLP approach enhanced with synthetic data for minority classes generated using Mistral Large	0.359
TF-IDF with SVM	Classical machine learning approach (synthetic data actually reduced performance)	0.310
RAG with Mistral Nemo	Retrieval-augmented generation searching for similar labeled messages from training data	0.309
Mistral Nemo Fine-tuning	Fine-tuned on training dataset using question-answer format	0.280
Gemma 1B Fine-tuning	Fine-tuned using question-answer format	0.280
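
To make the F1 column easier to interpret under class imbalance, here is a minimal sketch of a macro-averaged F1 computation. It assumes macro averaging and single-label predictions; the competition's exact metric variant is not stated in this report, and the labels `common` and `rare` are hypothetical toy values.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never appears
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data (hypothetical): a model that always predicts the majority class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case accuracy looks high while macro F1 is much lower, because the minority class contributes an F1 of zero. Under that assumption, scores in the 0.3 to 0.4 range mostly reflect how well each model handles the rare classes.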

Key Findings

Best Performer: XLM-RoBERTa (large) outperformed other approaches, likely due to its pre-training on cross-lingual data and better handling of imbalanced classes.
Traditional NLP: TF-IDF based methods performed surprisingly well, particularly when augmented with synthetic data for minority classes. However, generating synthetic data proved complex and requires careful implementation; in my case the generated data was of low quality, and for the SVM it actually reduced performance.
Small LLMs: Models like Gemma 1B underperformed, primarily due to not being optimized for classification tasks. These models frequently produced hallucinations and irrelevant outputs.
RAG Approach: While innovative, the retrieval-augmented generation approach with Mistral Nemo didn't achieve competitive results compared to fine-tuned transformer models.
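
The retrieval step of the RAG approach can be sketched as follows. This is a simplified illustration, not the code used in the experiments: it assumes TF-IDF cosine similarity as the retriever (the actual retriever is not described here), and the function names, prompt wording, messages, and labels (`technique_a`, `technique_b`, `none`) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_index(texts):
    """Compute smoothed IDF weights and a sparse TF-IDF vector per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    df = Counter(tok for d in docs for tok in d)  # document frequency
    n = len(docs)
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vecs = [{tok: tf * idf[tok] for tok, tf in d.items()} for d in docs]
    return idf, vecs


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, texts, labels, k=2):
    """Return the k labeled training messages most similar to the query."""
    idf, vecs = build_index(texts)
    q = {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(texts)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return [(texts[i], labels[i]) for i in ranked[:k]]


def build_prompt(query, examples):
    """Assemble a few-shot prompt from retrieved examples (hypothetical wording)."""
    lines = ["Label the message with a manipulation technique.", ""]
    for text, label in examples:
        lines.append(f"Message: {text}\nLabel: {label}\n")
    lines.append(f"Message: {query}\nLabel:")
    return "\n".join(lines)


# Toy training data (hypothetical messages and label names).
train_texts = [
    "they always lie to you",
    "experts say prices will rise",
    "the weather is nice today",
]
train_labels = ["technique_a", "technique_b", "none"]

query = "they lie to you every day"
examples = retrieve(query, train_texts, train_labels, k=2)
print(build_prompt(query, examples))
```

The resulting prompt would then be sent to the language model, whose answer is parsed into a label. The quality of the retrieved neighbors bounds what the model can do, which may partly explain why this approach trailed the fine-tuned classifiers.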

Resources and Limitations

The experiments were conducted using:

Google Colab free tier
Kaggle free tier notebooks

These computational constraints influenced model selection and training strategies, potentially limiting the performance of more resource-intensive approaches.

Conclusion

XLM-RoBERTa large proved to be the most effective model for this classification task, demonstrating superior performance in handling imbalanced classes. Traditional TF-IDF approaches remain competitive when properly implemented, especially when enhanced with synthetic data. Small language models showed significant limitations for classification tasks in this context.

Future work could explore ensemble methods combining the strengths of different approaches or more sophisticated data augmentation techniques to address class imbalance issues.
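
One possible shape for such an ensemble is a weighted average of per-class probabilities. The sketch below is an assumption about how this could be done, not something I have run on the competition data: the function name, the class labels, and the probability values are hypothetical, and the F1 scores from the results table are reused as weights purely for illustration.

```python
def ensemble_predict(model_probs, weights):
    """Weighted soft voting.

    model_probs: {model_name: {label: probability}}
    weights:     {model_name: non-negative weight}
    Returns (best_label, combined_probabilities).
    """
    total = sum(weights[name] for name in model_probs)
    combined = {}
    for name, probs in model_probs.items():
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + weights[name] * p / total
    best = max(combined, key=combined.get)
    return best, combined


# Weights taken from the F1 column of the results table (illustrative choice).
weights = {"xlmr_large": 0.392, "tfidf": 0.359}

# Hypothetical per-class probabilities for a single message.
model_probs = {
    "xlmr_large": {"technique_a": 0.6, "technique_b": 0.4},
    "tfidf": {"technique_a": 0.3, "technique_b": 0.7},
}

best, combined = ensemble_predict(model_probs, weights)
print(best, combined)
```

Here the two models disagree, and the more confident TF-IDF vote tips the combined prediction even though its weight is slightly lower. In practice the weights would be better tuned on a held-out validation split.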
