
Decoding Cognitive States: Using Transformers to Find the Linguistic Fingerprints of Reading Effort

A Research Project by NeuroQuantix

Abstract

Can the way we use language reveal the cognitive effort we’re expending? This project investigates whether it’s possible to distinguish between two different cognitive states—Normal Reading (NR) and Task-Specific Reading (TSR)—using only the linguistic properties of a text. Starting with classical machine learning models that achieved a baseline F1-score of ~0.74, we developed a novel hybrid approach. By combining hand-crafted syntactic features with powerful embeddings from a fine-tuned BERT transformer, our final MLP-based model successfully classified the reading state with an F1-score of 0.9474, demonstrating a strong, quantifiable link between language patterns and cognitive load.


Interactive Demo

You can test the final model live on Hugging Face Spaces. Click the screenshot below to open the demo in a new tab.

[Interactive Demo Screenshot]


Table of Contents

  1. The Scientific Question
  2. Our Approach
  3. Results
  4. Technology Stack
  5. License
  6. Contact

The Scientific Question

The human brain is not a static processor; it adapts its strategy based on the task at hand. When we read casually, our cognitive state is different from when we read to find a specific piece of information. This latter state, known as Task-Specific Reading (TSR), involves a higher degree of attention and cognitive effort.

Using the comprehensive ZuCo 2.0 dataset, which uniquely pairs text with corresponding EEG and eye-tracking data, our research posed a challenging question: Can we bypass the direct neural recordings and identify the cognitive state of the reader purely from the text they are processing? In other words, does increased cognitive effort leave a detectable “fingerprint” on the linguistic and structural properties of language?


Our Approach

1. Baseline with Classical Models

Our first phase involved a systematic evaluation of classical machine learning models (such as Logistic Regression, RandomForest, and SVM). We explored two distinct feature engineering paths, sketched in code after the list:

  • LLM-based Embeddings: We used a pre-trained sentence transformer (all-MiniLM-L6-v2) to create dense vector representations of each sentence.
  • Discrete Linguistic Features: We engineered a rich set of 19 features, including readability scores (Flesch Reading Ease), lexical diversity (TTR), and advanced syntactic complexity metrics derived using spaCy (e.g., dependency distance, POS tag counts).
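
The following minimal sketch shows both feature paths side by side. The feature subset, helper name, and example sentence are illustrative, not the project's exact code:

```python
# Illustrative sketch of the two feature paths (not the project's exact code).
import spacy
import textstat
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def discrete_features(sentence: str) -> dict:
    """A small, illustrative subset of the 19 hand-crafted features."""
    doc = nlp(sentence)
    tokens = [t for t in doc if not t.is_punct]
    types = {t.lower_ for t in tokens}
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(sentence),
        "type_token_ratio": len(types) / max(len(tokens), 1),
        # Mean linear distance between each token and its syntactic head.
        "mean_dependency_distance": sum(abs(t.i - t.head.i) for t in doc) / max(len(doc), 1),
        "noun_count": sum(t.pos_ == "NOUN" for t in doc),
    }

sentence = "The brain adapts its reading strategy to the task at hand."
embedding = embedder.encode(sentence)   # dense 384-dim MiniLM vector
features = discrete_features(sentence)  # dict of discrete linguistic features
```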

These initial models performed reasonably well, with the RandomForest classifier achieving the best baseline F1-score of approximately 0.74 for the TSR class.
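
Conceptually, the baseline setup looked like the sketch below, where X (the feature matrix from either path), y (the NR/TSR labels), and the split and forest parameters are all assumptions for illustration:

```python
# Baseline sketch: X is a feature matrix, y the NR/TSR labels (both assumed here).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # per-class precision/recall/F1
```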

2. Fine-Tuning BERT for Task-Specific Embeddings

The breakthrough came from fine-tuning a bert-base-uncased model using the Hugging Face transformers library and PyTorch. Robustly validated with 5-fold Stratified Cross-Validation, this process transformed the general-purpose BERT into a specialist whose embeddings were highly discriminative for our cognitive state classification task.
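
The sketch below shows the shape of that cross-validated fine-tuning loop. The hyperparameters are assumptions, and sentences, labels, and make_fold_dataset (a helper that tokenizes the sentences for a given index split) are hypothetical names:

```python
# Sketch of 5-fold fine-tuning; hyperparameters and make_fold_dataset are assumptions.
from sklearn.model_selection import StratifiedKFold
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(sentences, labels)):
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # NR vs. TSR
    )
    args = TrainingArguments(
        output_dir=f"bert_fold_{fold}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=make_fold_dataset(train_idx),  # hypothetical helper
        eval_dataset=make_fold_dataset(val_idx),     # hypothetical helper
    )
    trainer.train()
```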

3. The Hybrid Model

Our final and most successful model architecture was a hybrid one. We concatenated two feature sets:

  1. Fine-Tuned Embeddings (768 dimensions): capturing the semantic and contextual essence of each sentence.
  2. Scaled Enhanced Discrete Features (19 dimensions): capturing explicit signals about syntax and readability.

This combined feature vector was then fed into an advanced Multi-Layer Perceptron (MLP), architected in PyTorch with Dropout and BatchNorm to prevent overfitting.
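
A plausible PyTorch sketch of this hybrid head is shown below. The hidden sizes and dropout rate are assumptions, but the concatenation, BatchNorm, and Dropout mirror the description above:

```python
# Hybrid MLP sketch; layer widths and dropout rate are assumptions.
import torch
import torch.nn as nn

class HybridMLP(nn.Module):
    def __init__(self, emb_dim=768, feat_dim=19, hidden=256, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + feat_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, hidden // 2),
            nn.BatchNorm1d(hidden // 2),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden // 2, n_classes),
        )

    def forward(self, embeddings, discrete):
        # Concatenate the fine-tuned BERT embedding with the scaled
        # discrete feature vector before classifying.
        x = torch.cat([embeddings, discrete], dim=1)
        return self.net(x)

model = HybridMLP()
logits = model(torch.randn(8, 768), torch.randn(8, 19))  # dummy batch of 8
```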


Results

The final hybrid model, evaluated on a held-out test set, achieved a remarkable F1-score of 0.9474 for the TSR class.
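
Here, "F1-score for the TSR class" means the per-class F1 with TSR treated as the positive label, as in this minimal sketch (y_test and y_pred are placeholders):

```python
# Per-class F1 with TSR as the positive label; y_test and y_pred are placeholders.
from sklearn.metrics import f1_score

f1_tsr = f1_score(y_test, y_pred, pos_label="TSR")
```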

This significant leap in performance provides a strong answer to our initial research question. Yes, cognitive effort leaves a quantifiable and highly predictive fingerprint in our use of language. This finding has exciting implications for developing non-invasive, text-based digital biomarkers to assess cognitive load, attention, or even mental fatigue.


Technology Stack

  • Core Libraries: Pandas, NumPy, Scikit-learn, NLTK, spaCy, Textstat
  • Deep Learning: PyTorch, Hugging Face Transformers
  • Models: Logistic Regression, RandomForest, SVM, LightGBM, MLP, BERT
  • Methodology: Stratified K-Fold Cross-Validation, Fine-Tuning, Feature Engineering

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

NeuroQuantix - [email protected]
