Please note that the following has been deprecated; a much wider set of experiments has since been run, and their results have been uploaded to the repo as JSONs.
We have also moved from llama.cpp to vLLM for 16-bit inference; only quantization-related experiments are run on llama.cpp now.
Model | gsm8k-zs | gsm8k-cot | mmlu-zs | mmlu-cot | medmcqa-zs | medmcqa-cot | simpleqa-zs | simpleqa-cot |
---|---|---|---|---|---|---|---|---|
Meta-Llama-3.1-8B-Instruct-Q8_0 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
gemma-2-9b-it-Q8_0 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
Qwen2.5-7B-Instruct-Q8_0 | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
GPT-4o | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
GPT-4o mini | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
- cot_exp = Chain-of-Thought experiments
- zs_exp = Zero-Shot experiments
- Meta-Llama-3.1-8B-Instruct-Q8_0
- gemma-2-9b-it-Q8_0
- Qwen2.5-7B-Instruct-Q8_0
Fit on the GSM8K train set.
Switched to the llama.cpp C++ server for faster parallel inference instead of sequential Python.
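As an illustration, a minimal sketch of what parallel inference against the llama.cpp server could look like, using its OpenAI-compatible chat endpoint. The port, prompts, and worker count here are assumptions for the example, not the repo's actual client code:

```python
# Hypothetical sketch: fan out requests to a running llama-server instance
# via its OpenAI-compatible /v1/chat/completions endpoint.
import concurrent.futures
import requests

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default port

def ask(question: str) -> str:
    resp = requests.post(
        SERVER_URL,
        json={
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 256,
            "temperature": 0.0,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = [f"What is {i} + {i}?" for i in range(8)]  # toy prompts

# One worker per server slot (-np 8) keeps all slots busy.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(ask, questions))

for q, a in zip(questions, answers):
    print(q, "->", a.strip())
```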
Example command to download a model:
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF --include "Qwen2.5-7B-Instruct-Q8_0.gguf" --local-dir ./models
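If you prefer to do this from Python, the same download can be done with huggingface_hub; this simply mirrors the CLI call above:

```python
# Download the GGUF file programmatically instead of via huggingface-cli.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="bartowski/Qwen2.5-7B-Instruct-GGUF",
    filename="Qwen2.5-7B-Instruct-Q8_0.gguf",
    local_dir="./models",
)
```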
Example command to run the server:
# cd to the build/bin directory under the llama.cpp folder, which contains the llama-server binary (see setup below)
./llama-server -m /opt/dlami/nvme/VLMCalibration/models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 16384 -np 8 -t 8 -tb 8 -b 4096 -ub 2048 -cb --gpu-layers 300
--gpu-layers 300 is just a large number so that all layers are offloaded to the GPU; -1 didn't seem to work for some reason, but any sufficiently large value offloads everything. Some notes on the parameters:
Argument | Explanation |
---|---|
-m | Path to the model file |
-c | Context window size in tokens (divided among the -np parallel sequences, so a 16384-token context split across 8 sequences gives each one 2048 tokens) |
-np | Number of parallel sequences to decode |
-t | Number of threads to use during generation |
-tb | Number of threads to use during batch and prompt processing |
-b | Logical maximum batch size |
-ub | Physical maximum batch size |
-cb | Enable continuous batching |
--gpu-layers | Number of layers to offload to GPU |
All of these must be run from the directory of the llama-server binary:
./llama.cpp/build/bin/llama-server -m /opt/dlami/nvme/models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 32768 -np 16 -t 10 -tb 10 -b 4096 -ub 2048 -cb --gpu-layers 300 # Llama
./llama.cpp/build/bin/llama-server -m /opt/dlami/nvme/models/Qwen2.5-7B-Instruct-Q8_0.gguf -c 36864 -np 18 -t 6 -b 4096 -ub 2048 -cb --gpu-layers 300 --port 8000 # Qwen
./llama.cpp/build/bin/llama-server -m /opt/dlami/nvme/models/gemma-2-9b-it-Q8_0.gguf -c 24576 -np 12 -t 10 -tb 10 -b 4096 -ub 2048 -cb --gpu-layers 300 # Gemma
./llama.cpp/build/bin/llama-server -m /opt/dlami/nvme/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf -c 45056 -np 22 -t 6 -b 4096 -ub 2048 -cb --gpu-layers 300 --port 8000 # Qwen
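Note that in each command above -c divided by -np works out to 2048 tokens of context per parallel sequence. Once a server is up, a quick sanity check could look like the following sketch; it assumes the default port 8080, so adjust for the --port flags above:

```python
# Sketch: confirm a running llama-server responds before kicking off experiments.
import requests

BASE = "http://localhost:8080"  # assumed port; the Qwen commands above use 8000

print(requests.get(f"{BASE}/health").json())  # llama.cpp server health endpoint

resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 8},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```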
sudo apt update
sudo apt install -y build-essential cmake git curl libcurl4-openssl-dev
nvcc --version # check CUDA installed
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake -DGGML_CUDA=on -DLLAMA_CURL=on -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release --parallel $(nproc)
cd bin # the llama-server binary should now be here and can be run
UPDATE: No longer needed; watch works fine on Ubuntu 22.04 (now the default on EC2).
nvidia.py can be used to keep checking GPU usage. watch currently segfaults on Ubuntu 20.04, so this script can be run in its place to monitor:
chmod +x nvidia.py
sudo ./nvidia.py
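For reference, a rough sketch of what such a polling loop might do. This is not the actual nvidia.py from the repo, just an illustration that shells out to nvidia-smi:

```python
# Illustrative GPU monitor: poll nvidia-smi every couple of seconds.
import subprocess
import time

QUERY = "utilization.gpu,memory.used,memory.total"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)
    time.sleep(2)
```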
This is still in development but might help increase inference speed. Details are yet to be finalized, and it might not work optimally with GGUFs.
vllm serve ./models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --tokenizer meta-llama/Llama-3.1-8B-Instruct --trust-remote-code --max-model-len 4096 --host localhost --port 8080 --max-num-batched-tokens 8192
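Once the vLLM server is up, it exposes the same OpenAI-compatible API. A minimal client sketch follows; the model name is the path passed to vllm serve, and the api_key is a placeholder since vLLM does not require one by default:

```python
# Sketch: query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="./models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # served model name
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```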