paumartinez1/llm-data-augmentation
Authors

Paula Martinez, Christian Plan, Joshua Simon, Nico Ting

Project Description

This project addresses the challenge of data scarcity in artificial intelligence by developing a proof-of-concept that uses large language models (LLMs) and interpretability techniques to generate high-quality synthetic textual data. With projections that the stock of high-quality public language data could be exhausted around 2026, our approach combines LLMs such as ChatGPT and Claude with the SHapley Additive exPlanations (SHAP) algorithm to create synthetic datasets that improve emotion detection models.
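As a rough illustration of the pipeline described above, SHAP attributions from an existing emotion classifier can be used to select the most influential tokens for a class, which then seed a generation prompt for the LLM. This is a minimal sketch, not code from this repository: the function names, the example tokens, and the attribution values are all hypothetical.

```python
# Hypothetical sketch: seed an LLM augmentation prompt with SHAP-ranked tokens.
# All names and values here are illustrative, not taken from the repository.

def top_tokens(shap_values, tokens, k=5):
    """Return the k tokens with the largest positive SHAP attribution."""
    ranked = sorted(zip(tokens, shap_values), key=lambda p: p[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

def build_augmentation_prompt(emotion, keywords, n_examples=3):
    """Compose a prompt asking the LLM for synthetic examples of one emotion."""
    kw = ", ".join(keywords)
    return (
        f"Write {n_examples} short, distinct sentences expressing the emotion "
        f"'{emotion}'. Where natural, use words such as: {kw}."
    )

# Example: token attributions as a SHAP explainer might produce for one sentence
tokens = ["I", "am", "thrilled", "about", "the", "wonderful", "news"]
values = [0.01, 0.00, 0.62, 0.02, 0.00, 0.48, 0.11]
prompt = build_augmentation_prompt("joy", top_tokens(values, tokens, k=2))
print(prompt)
```

The prompt string would then be sent to the chosen LLM (ChatGPT or Claude), and the returned sentences labeled with the target emotion and added to the training pool.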

Key Takeaways

  1. Model Performance Improvement: The model's predictive accuracy improved consistently as more training data was added, regardless of its source. This supports the well-established principle that a larger, more diverse training set generally leads to better model performance.

  2. Equivalence of Synthetic and Real Data: The performance of the model using synthetic data generated by LLMs was almost indistinguishable from that using real data. The model reached a peak accuracy of 78.50% with synthetic data compared to 78.80% with real data, highlighting the effectiveness of synthetic data in mimicking real-world distributions.

  3. Significance of Synthetic Data Generation: The close alignment between the performance curves for synthetic and real data augmentation suggests that synthetic data generated by LLMs successfully captures essential characteristics and patterns of real data. This finding underscores the potential of LLMs to produce high-quality synthetic data that can serve as a viable substitute for real data in various applications.

  4. Superiority of Claude.AI: In our experiments, Claude.AI was more effective than ChatGPT at generating synthetic data, suggesting stronger capabilities for prompt-driven data generation tailored to emotion prediction tasks.
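The comparison behind takeaways 1-3 can be sketched as a learning-curve experiment: train on growing slices of a data pool (real or synthetic) and record held-out accuracy at each size. The toy nearest-centroid bag-of-words classifier and the tiny datasets below are purely illustrative stand-ins for the project's emotion model and corpora.

```python
# Hypothetical learning-curve harness: compare accuracy when training on
# growing amounts of real vs. synthetic data. Classifier and data are toys.
from collections import Counter

def featurize(text):
    """Bag-of-words feature vector as a Counter."""
    return Counter(text.lower().split())

def centroid(bags):
    """Sum of bag-of-words vectors for one class."""
    total = Counter()
    for b in bags:
        total.update(b)
    return total

def similarity(a, b):
    """Unnormalized dot product of two bag-of-words vectors."""
    return sum(a[w] * b[w] for w in a)

def accuracy(train, test):
    """Accuracy of a nearest-centroid classifier fit on (text, label) pairs."""
    by_label = {}
    for text, label in train:
        by_label.setdefault(label, []).append(featurize(text))
    cents = {lab: centroid(bags) for lab, bags in by_label.items()}
    correct = 0
    for text, label in test:
        f = featurize(text)
        pred = max(cents, key=lambda lab: similarity(f, cents[lab]))
        correct += pred == label
    return correct / len(test)

def learning_curve(pool, test, steps):
    """Held-out accuracy after training on the first n examples, for each n."""
    return [accuracy(pool[:n], test) for n in steps]

real = [("i am so happy today", "joy"), ("this makes me furious", "anger"),
        ("what a joyful surprise", "joy"), ("i am angry and upset", "anger")]
synthetic = [("feeling happy and glad", "joy"), ("so furious right now", "anger"),
             ("pure joy and delight", "joy"), ("angry beyond words", "anger")]
test = [("happy joyful day", "joy"), ("furious angry mood", "anger")]

print(learning_curve(real, test, [2, 4]))
print(learning_curve(synthetic, test, [2, 4]))
```

Plotting the two curves side by side, as the project does with its 78.80% (real) vs. 78.50% (synthetic) peaks, shows how closely the synthetic pool tracks the real one.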

About

Using LLMs to address textual data scarcity
