This project implements an Automated Essay Scoring (AES) system trained on the Learning Agency Lab – AES 2 Kaggle dataset. The system evaluates essays based on linguistic richness, coherence, structure, and semantic quality to predict their final human-assigned score.
📂 Project Structure
AES-Project/
│
├── Dataset/
│   ├── train.zip
│   └── test.csv
│
├── Notebook/
│   └── Model_Usage.ipynb
│
├── References/
│   └── CEP.pdf
│
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
The dataset comes from the Learning Agency Lab Automated Essay Scoring 2 challenge (Kaggle). It contains thousands of student essays, each scored by human graders.
Included in the dataset:
- Essay text
- Human-assigned scores
- Essay set information
- Training & testing partitions
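
As a rough illustration of reading the training data, the sketch below pulls essay text and scores out of a CSV stored inside `Dataset/train.zip`. The member name `train.csv` and the column names `full_text` and `score` are assumptions made for this example, not something this README confirms; adjust them to match the actual file.

```python
import csv
import io
import zipfile


def load_essays(zip_path, member="train.csv", text_col="full_text", score_col="score"):
    """Read (essay_text, score) pairs from a CSV stored inside a zip archive.

    The member and column names are assumed defaults, not confirmed by the project.
    """
    essays = []
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # newline="" lets the csv module handle line breaks inside quoted essays
            text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text_stream):
                essays.append((row[text_col], float(row[score_col])))
    return essays
```

A call such as `load_essays("Dataset/train.zip")` would then return a list of `(text, score)` tuples ready for preprocessing.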
🧹 Preprocessing Pipeline
- Lowercasing & normalization
- Removing special characters
- Tokenization (NLTK + spaCy)
- Lemmatization
- Stopword removal
- Sentence segmentation
- Grammar/spelling cleanup
- Removal of extremely short essays
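
A minimal sketch of several of these steps, using only the Python standard library as a stand-in for the NLTK and spaCy calls used in the notebook. The stopword list, the `min_tokens` threshold, and all function names are hypothetical choices for illustration; lemmatization and grammar/spelling cleanup are left out because they need the real NLP libraries.

```python
import re

# Tiny illustrative stopword list; a real run would use the NLTK or spaCy list.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "that"}


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean_tokens(text):
    """Lowercase, replace special characters with spaces, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def preprocess(essays, min_tokens=5):
    """Return one token list per essay, skipping essays shorter than min_tokens."""
    kept = []
    for essay in essays:
        tokens = clean_tokens(essay)
        if len(tokens) >= min_tokens:
            kept.append(tokens)
    return kept
```

Sentence segmentation is kept separate from token cleaning because it has to run on the raw text, before punctuation is stripped.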
Feature engineering covers:
- Lexical Features: word count, unique words, vocabulary richness
- Syntactic Features: POS ratios, sentence lengths
- Semantic Features: TF-IDF, transformer embeddings
- Error-Based Features: grammar & spelling errors
- Structural Features: paragraph count, transitions
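
The lexical and structural features above could be computed roughly as follows. This is a sketch under assumptions: vocabulary richness is taken to be the type-token ratio, the transition word list is a hypothetical sample, and the function name is made up for this example. POS ratios, TF-IDF, embeddings, and error counts are omitted because they depend on external libraries.

```python
import re

# Hypothetical transition list used only for this illustration.
TRANSITIONS = {"however", "therefore", "moreover", "furthermore", "consequently"}


def extract_features(essay):
    """Compute a few lexical, syntactic, and structural features for one essay."""
    # Paragraphs are separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    # Naive sentence split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", essay.strip()) if s]
    words = re.findall(r"[a-z']+", essay.lower())
    unique = set(words)
    return {
        "word_count": len(words),
        "unique_words": len(unique),
        # Vocabulary richness, assumed here to mean unique words / total words.
        "type_token_ratio": len(unique) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "paragraph_count": len(paragraphs),
        "transition_count": sum(1 for w in words if w in TRANSITIONS),
    }
```

Each essay yields one dictionary, so a list of these dictionaries can be turned into a feature matrix for the classical models.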
🔥 Model Comparison
| Model | R² Score | Notes |
|---|---|---|
| Linear Regression | 0.40–0.45 | Simple baseline |
| Random Forest | ~0.68 | Strong classical model |
| Gradient Boosting | ~0.70 | Handles non-linear patterns |
| XGBoost | ~0.75 | High performance |
| BERT / RoBERTa Regression Model | 0.80–0.82 | Best overall results |
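
For readers unfamiliar with the metric in the table, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true scores. The function below is a plain-Python sketch of that definition; the notebook may well use a library implementation instead, and the function name here is only illustrative.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean score.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true scores are identical")
    return 1 - ss_res / ss_tot
```

A value of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean human score.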
📘 Notebook: Model_Usage.ipynb
The notebook performs:
- Dataset loading
- Full preprocessing pipeline
- Feature engineering
- ML + Transformer model training
- Model evaluation
- Generating predictions
Make sure Python 3.8+ is installed, then run:

    pip install -r requirements.txt

This project uses spaCy for text preprocessing. Install the required model:

    python -m spacy download en_core_web_sm

Run the main project notebook, Notebook/Model_Usage.ipynb, from your preferred environment:
- Google Colab
- Jupyter Notebook
- VS Code (Jupyter Extension)

✔️ After completing these steps, the system is ready for use.
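
To confirm the setup before opening the notebook, a small check along these lines may help. It is a sketch, not part of the project: the helper names are hypothetical, and the module list passed in is an assumption based on the tools named in this README (spaCy, its `en_core_web_sm` model, and NLTK).

```python
import importlib.util
import sys


def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed dependency list; compare against requirements.txt.
    missing = missing_modules(["spacy", "en_core_web_sm", "nltk"])
    print("Python version OK:", python_ok())
    print("Missing modules:", missing if missing else "none")
```

An empty list of missing modules suggests the `pip install` and spaCy model download steps both succeeded.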
This project uses the dataset from the Learning Agency Lab – Automated Essay Scoring 2 competition.
📄 License
This project is licensed under the MIT License.
📎 References
- Kaggle AES 2 Dataset
- CEP PDF (in /References/)
- Standard NLP & AES literature