
Commit a523d27

[README] Add instructions for 📏 RULER benchmarks
1 parent f74c46a commit a523d27

File tree

1 file changed: +28 −6 lines


README.md

Lines changed: 28 additions & 6 deletions
````diff
@@ -411,13 +411,35 @@ Running the command above will provide the task results reported in the GLA paper
 To perform data-parallel evaluation (where each GPU loads a separate full copy of the model), we leverage the accelerate launcher as follows:
 ```sh
 $ PATH='fla-hub/gla-1.3B-100B'
-$ accelerate launch -m evals.harness --model hf \
-    --model_args pretrained=$PATH,dtype=bfloat16 \
+$ accelerate launch -m evals.harness --model hf \
+    --model_args pretrained=$PATH,dtype=bfloat16,trust_remote_code=True \
     --tasks wikitext,lambada_openai,piqa,hellaswag,winogrande,arc_easy,arc_challenge,boolq,sciq,copa,openbookqa \
-    --batch_size 64 \
-    --num_fewshot 0 \
-    --device cuda \
-    --show_config
+    --batch_size 64 \
+    --num_fewshot 0 \
+    --device cuda \
+    --show_config \
+    --trust_remote_code
+```
+
+4. 📏 RULER Benchmark suite
+
+The RULER benchmarks are commonly used for evaluating model performance on long-context tasks.
+You can evaluate `fla` models on RULER directly using `lm-evaluation-harness`.
+
+First, install the necessary dependencies for RULER:
+```sh
+pip install lm_eval["ruler"]
+```
+Then, run evaluation by (e.g., 32k contexts):
+```sh
+accelerate launch -m lm_eval \
+    --output_path $OUTPUT \
+    --tasks niah_single_1,niah_single_2,niah_single_3,niah_multikey_1,niah_multikey_2,niah_multikey_3,niah_multiquery,niah_multivalue,ruler_vt,ruler_cwe,ruler_fwe,ruler_qa_hotpot,ruler_qa_squad \
+    --model_args pretrained=$PATH,dtype=bfloat16,max_length=32768,trust_remote_code=True \
+    --metadata='{"max_seq_lengths":[4096,8192,16384,32768]}' \
+    --batch_size 2 \
+    --show_config \
+    --trust_remote_code
 ```
 
 If a GPU can't load a full copy of the model, please refer to [this link](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#multi-gpu-evaluation-with-hugging-face-accelerate) for FSDP settings.
````
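One pitfall in the snippet above: assigning the checkpoint to a shell variable named `PATH` shadows the shell's executable search path, so later commands in the same shell session may fail to resolve. A minimal dry-run sketch, assuming a hypothetical `MODEL` variable name and only a subset of the task list for brevity, that assembles the same invocation and prints it instead of executing it:

```shell
# Dry-run sketch (assumption: the variable is renamed from PATH to MODEL
# so the shell's executable search path is not shadowed).
MODEL='fla-hub/gla-1.3B-100B'

# Build the harness invocation as a single string so it can be inspected
# before a long run; only two example tasks are listed here.
CMD="accelerate launch -m evals.harness --model hf \
--model_args pretrained=${MODEL},dtype=bfloat16,trust_remote_code=True \
--tasks wikitext,lambada_openai \
--batch_size 64 --num_fewshot 0 --device cuda --show_config --trust_remote_code"

# Print rather than execute; drop the echo wrapper to actually launch.
echo "$CMD"
```

Because the string is only echoed, this can be checked on a machine without GPUs before submitting the real job.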
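The `--metadata` flag in the RULER command takes a JSON object, which is easy to get wrong with shell quoting. A small sketch, assuming `python3` is on the path, that validates the string up front rather than discovering the problem mid-run:

```shell
# Sketch: sanity-check the RULER metadata JSON before launching.
# The sequence lengths mirror the 32k example above.
METADATA='{"max_seq_lengths":[4096,8192,16384,32768]}'

# json.load exits non-zero with a traceback on malformed input,
# so a stray quote or bracket is caught immediately.
echo "$METADATA" | python3 -c 'import json,sys; json.load(sys.stdin)' || exit 1

echo "metadata ok"
# The validated string is then passed as --metadata="$METADATA".
```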
