LogiEval - Large Language Model Logical Reasoning Evaluation

A high-performance evaluation framework for testing large language models on logical reasoning datasets including LogiQA2.0, LogiQA, and ReClor.

Features

🚀 High Performance: Optimized with vLLM for fast model inference
📊 Multiple Datasets: Support for LogiQA2.0, LogiQA, and ReClor datasets
🎯 Multiple Sampling Methods: Direct sampling, Best-of-N (BoN), and majority voting
💾 Result Persistence: Save intermediate results and outputs as JSON files
🔧 Flexible Configuration: Easy-to-use configuration system
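
As a rough illustration of the majority voting method listed above, the sketch below picks the most frequent answer among several sampled answers. The function name `majority_vote` and the tie-breaking rule (first answer seen wins) are assumptions for illustration, not necessarily what `src/sampling/majority_vote.py` does.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first.

    Hypothetical helper: the name and tie-breaking rule are assumptions.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common keeps first-encountered order among equal counts
    return counts.most_common(1)[0][0]


# Five sampled answers to one multiple-choice question
samples = ["B", "A", "B", "C", "B"]
print(majority_vote(samples))  # B
```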

Installation

pip install -r requirements.txt

Quick Start

Basic Evaluation

python evaluate.py \
    --model_name "meta-llama/Llama-2-7b-chat-hf" \
    --datasets "logiqa2" "logiqa" "reclor" \
    --sampling_method "direct" \
    --batch_size 8 \
    --output_dir "./results"

Advanced Evaluation with Multiple Sampling

python evaluate.py \
    --model_name "meta-llama/Llama-2-7b-chat-hf" \
    --datasets "logiqa2" "logiqa" "reclor" \
    --sampling_method "bon" \
    --num_samples 5 \
    --batch_size 4 \
    --use_vllm \
    --output_dir "./results"
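
The `bon` method in the command above draws `--num_samples` candidates per question and keeps one of them. This README does not say how candidates are ranked, so the sketch below takes the scoring function as a parameter; `best_of_n` and the toy scores are hypothetical and only show the selection step.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score.

    Hypothetical helper: the real ranking signal (log-probability,
    reward model, ...) is not specified in this README.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Toy scores standing in for whatever signal the framework actually uses
toy_scores = {"Answer: A": -4.2, "Answer: C": -1.3, "Answer: B": -2.8}
best = best_of_n(list(toy_scores), lambda c: toy_scores[c])
print(best)  # Answer: C
```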

Configuration

evaluate.py accepts the following command-line options:

--model_name: HuggingFace model identifier
--datasets: List of datasets to evaluate (logiqa2, logiqa, reclor)
--sampling_method: Sampling strategy (direct, bon, majority_vote)
--num_samples: Number of samples for BoN and majority voting
--batch_size: Batch size for inference
--use_vllm: Enable vLLM acceleration
--temperature: Sampling temperature
--max_tokens: Maximum generation length
--output_dir: Directory to save results
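
For readers who want to see how the options above fit together, here is a minimal `argparse` sketch. The flag names come from the list above; the types, defaults, and which flags are required are assumptions and may differ from what `evaluate.py` actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; types and defaults are assumed."""
    p = argparse.ArgumentParser(description="LogiEval options (sketch)")
    p.add_argument("--model_name", required=True)
    p.add_argument("--datasets", nargs="+", required=True,
                   choices=["logiqa2", "logiqa", "reclor"])
    p.add_argument("--sampling_method", default="direct",
                   choices=["direct", "bon", "majority_vote"])
    p.add_argument("--num_samples", type=int, default=1)      # assumed default
    p.add_argument("--batch_size", type=int, default=8)       # assumed default
    p.add_argument("--use_vllm", action="store_true")
    p.add_argument("--temperature", type=float, default=0.7)  # assumed default
    p.add_argument("--max_tokens", type=int, default=512)     # assumed default
    p.add_argument("--output_dir", default="./results")
    return p


# Same arguments as the Basic Evaluation command
args = build_parser().parse_args([
    "--model_name", "meta-llama/Llama-2-7b-chat-hf",
    "--datasets", "logiqa2", "logiqa", "reclor",
    "--sampling_method", "direct",
    "--batch_size", "8",
])
print(args.datasets, args.sampling_method, args.use_vllm)
```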

Project Structure

LogiEval/
├── src/
│   ├── __init__.py
│   ├── datasets/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── logiqa2.py
│   │   ├── logiqa.py
│   │   └── reclor.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── hf_model.py
│   │   └── vllm_model.py
│   ├── sampling/
│   │   ├── __init__.py
│   │   ├── direct.py
│   │   ├── bon.py
│   │   └── majority_vote.py
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       └── metrics.py
├── evaluate.py
├── requirements.txt
└── README.md

Results

Evaluation results are saved in JSON format with the following structure:

{
    "config": {...},
    "results": {
        "dataset_name": {
            "accuracy": 0.85,
            "total_samples": 1000,
            "correct_predictions": 850,
            "detailed_results": [...]
        }
    },
    "intermediate_outputs": [...]
}
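
A results file with this structure can be checked with a few lines of Python. The sketch below recomputes accuracy from the raw counts; the helper name `summarize` and the dataset key `logiqa2` are illustrative assumptions, and the inline JSON stands in for a file read from the output directory.

```python
import json


def summarize(report: dict) -> dict:
    """Map each dataset name to accuracy recomputed from its counts."""
    summary = {}
    for name, stats in report["results"].items():
        total = stats["total_samples"]
        summary[name] = stats["correct_predictions"] / total if total else 0.0
    return summary


# Inline example mirroring the structure shown above
report = json.loads("""
{
    "config": {},
    "results": {
        "logiqa2": {
            "accuracy": 0.85,
            "total_samples": 1000,
            "correct_predictions": 850,
            "detailed_results": []
        }
    },
    "intermediate_outputs": []
}
""")
print(summarize(report))  # {'logiqa2': 0.85}
```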
