This project evaluates current Large Language Models (LLMs) on their ability to identify American Sign Language (ASL) alphabet letters from images. The goal is to assess the current state of LLMs in the area of accessibility, specifically their vision capabilities for sign language recognition.
```
ASL/
├── asl_llm_evaluation.ipynb    # Main evaluation notebook
├── requirements.txt            # Python dependencies
├── asl_alphabet_dataset/       # Dataset folder (create this)
│   ├── A/                      # Images for letter A
│   ├── B/                      # Images for letter B
│   └── ...                     # Continue for all letters
└── evaluation_results/         # Results output folder (auto-created)
```
Install the Python dependencies:

```
pip install -r requirements.txt
```

Create the dataset folder structure and add ASL alphabet images:
```
mkdir -p asl_alphabet_dataset/{A..Z}
```

Place ASL hand sign images in their respective letter folders.
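If you are not using a POSIX shell (e.g. on Windows), the same folder structure can be created from Python. This is a small sketch equivalent to the `mkdir -p` command above; the function name is illustrative, not part of the notebook:

```python
import string
from pathlib import Path

def create_dataset_folders(root="asl_alphabet_dataset"):
    """Create one folder per letter A-Z, like `mkdir -p asl_alphabet_dataset/{A..Z}`."""
    root = Path(root)
    for letter in string.ascii_uppercase:
        # parents=True / exist_ok=True mirror `mkdir -p` semantics
        (root / letter).mkdir(parents=True, exist_ok=True)
    return root
```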
Set your API keys as environment variables:
```
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GOOGLE_API_KEY="your-google-api-key"
```

Or update them directly in the notebook's CONFIG section.
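Inside the notebook, the keys can be read from the environment with a fallback check before any API calls are made. This is a hedged sketch; the `CONFIG` field names here are hypothetical and may not match the notebook's actual CONFIG section:

```python
import os

# Hypothetical CONFIG dict mirroring the notebook's CONFIG section;
# the actual variable names in the notebook may differ.
CONFIG = {
    "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
    "anthropic_api_key": os.environ.get("ANTHROPIC_API_KEY", ""),
    "google_api_key": os.environ.get("GOOGLE_API_KEY", ""),
}

def missing_keys(config):
    """Return the names of any API keys that are still unset."""
    return [name for name, value in config.items() if not value]
```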
Open the Jupyter notebook:
```
jupyter notebook asl_llm_evaluation.ipynb
```

Follow the cells in order to:
- Load your ASL dataset
- Initialize LLM evaluators
- Run the evaluation pipeline
- Analyze results with visualizations
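The steps above can be sketched as a simple loop over the dataset. This is illustrative only (the notebook's actual pipeline may differ); `evaluator` stands in for any callable that maps an image path to a predicted letter:

```python
from pathlib import Path

def run_evaluation(dataset_root, evaluator, extensions=(".jpg", ".jpeg", ".png")):
    """Illustrative evaluation loop: send each image to an evaluator and
    record the predicted letter alongside the ground-truth folder name."""
    records = []
    for letter_dir in sorted(Path(dataset_root).iterdir()):
        if not letter_dir.is_dir():
            continue
        for image_path in sorted(letter_dir.iterdir()):
            if image_path.suffix.lower() not in extensions:
                continue
            prediction = evaluator(image_path)  # e.g. an LLM vision call
            records.append({
                "true_label": letter_dir.name,
                "predicted": prediction,
                "image": str(image_path),
            })
    return records
```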
- Multi-Model Testing: Evaluate GPT-4V, Claude 3, Gemini Pro Vision, and more
- Prompt Engineering: Compare different prompting strategies
- Comprehensive Metrics: Accuracy, confusion matrices, per-class performance
- Response Time Analysis: Measure inference speed
- Error Analysis: Identify common misclassifications
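A minimal sketch of the metrics side, assuming evaluation records with `true_label` and `predicted` fields (the notebook may compute these differently, e.g. with scikit-learn):

```python
from collections import Counter

def summarize(records):
    """Compute overall accuracy, per-class accuracy, and confusion counts."""
    confusion = Counter((r["true_label"], r["predicted"]) for r in records)
    totals = Counter(r["true_label"] for r in records)
    correct = Counter(r["true_label"] for r in records
                      if r["predicted"] == r["true_label"])
    per_class = {label: correct[label] / totals[label] for label in totals}
    accuracy = sum(correct.values()) / len(records) if records else 0.0
    return {"accuracy": accuracy, "per_class": per_class, "confusion": confusion}
```

The `confusion` counter makes common misclassifications easy to spot: the largest off-diagonal entries, e.g. `("M", "N")`, are the letter pairs the model confuses most.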
- Model comparison charts
- Confusion matrices
- Per-class accuracy breakdown
- Response time distributions
- Prompt strategy comparisons
- Supported formats: JPG, PNG, JPEG
- Clear hand signs against contrasting background
- Consistent lighting recommended
- Various hand positions and orientations for robustness
Each letter (A-Z) should have its own folder containing multiple image examples.
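Before running an evaluation, it can be worth checking the dataset against these requirements. A small validator sketch (function name is illustrative):

```python
import string
from pathlib import Path

VALID_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def validate_dataset(root="asl_alphabet_dataset"):
    """Report missing letter folders and count valid images per letter."""
    root = Path(root)
    missing, counts = [], {}
    for letter in string.ascii_uppercase:
        folder = root / letter
        if not folder.is_dir():
            missing.append(letter)
            continue
        counts[letter] = sum(
            1 for p in folder.iterdir() if p.suffix.lower() in VALID_EXTENSIONS
        )
    return missing, counts
```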
Results are automatically saved to CSV files in the evaluation_results/ folder with timestamps. Each evaluation run generates:
- Detailed predictions for each image
- Model performance metrics
- Response times
- Raw LLM responses
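The timestamped CSV output could be produced along these lines (a sketch using only the standard library; the notebook's actual saving code and column names may differ):

```python
import csv
from datetime import datetime
from pathlib import Path

def save_results(records, out_dir="evaluation_results"):
    """Write evaluation records to a timestamped CSV, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_path = out_dir / f"evaluation_{stamp}.csv"
    # Union of keys across records, so optional fields are not dropped
    fieldnames = sorted({key for r in records for key in r})
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return out_path
```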
This evaluation framework can help:
- Assess LLM vision capabilities for accessibility
- Identify areas for improvement in sign language recognition
- Compare different LLM providers and models
- Optimize prompting strategies for ASL classification
- Generate insights for future model development
Extend the evaluator classes in the notebook to add support for additional LLMs.
Edit the create_prompt() method in the BaseLLMEvaluator class to test new prompting strategies.
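The shape of such an override might look like the following. Note that `BaseLLMEvaluator` lives in the notebook; the minimal stand-in below is hypothetical, and only the `create_prompt()` hook is taken from the source:

```python
# Hypothetical minimal stand-in for the notebook's BaseLLMEvaluator,
# showing where a custom prompting strategy would plug in.
class BaseLLMEvaluator:
    def create_prompt(self):
        return ("Which ASL alphabet letter does this hand sign show? "
                "Answer with a single letter A-Z.")

class DescribeThenAnswerEvaluator(BaseLLMEvaluator):
    def create_prompt(self):
        # Example alternative strategy: ask the model to describe the
        # hand shape before committing to a letter.
        return ("Describe the hand shape, finger positions, and orientation "
                "in this image, then state which ASL alphabet letter (A-Z) it shows.")
```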
The framework can be adapted for other sign languages by modifying the class labels and dataset structure.
If you use this evaluation framework in your research, please cite appropriately and acknowledge the accessibility focus of this work.
This project is for research and educational purposes.