Running the command above will reproduce the task results reported in the GLA paper.
To perform data-parallel evaluation (where each GPU loads a separate full copy of the model), we leverage the `accelerate` launcher as follows:
```sh
# use a name other than PATH here; overwriting PATH would break the shell's executable lookup
$ MODEL='fla-hub/gla-1.3B-100B'
$ accelerate launch -m evals.harness --model hf \
    --model_args pretrained=$MODEL,dtype=bfloat16,trust_remote_code=True \
    --tasks wikitext,lambada_openai,piqa,hellaswag,winogrande,arc_easy,arc_challenge,boolq,sciq,copa,openbookqa \
    --batch_size 64 \
    --num_fewshot 0 \
    --device cuda \
    --show_config \
    --trust_remote_code
```
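By default, `accelerate launch` spawns one process per visible GPU. If you want to pin the number of model replicas explicitly, the launcher accepts `--num_processes`; a minimal sketch, assuming a node with 4 GPUs and the `$MODEL` variable from above:
```sh
# sketch: run exactly 4 data-parallel replicas (one per GPU); adjust to your GPU count
$ accelerate launch --num_processes 4 -m evals.harness --model hf \
    --model_args pretrained=$MODEL,dtype=bfloat16,trust_remote_code=True \
    --tasks piqa,hellaswag \
    --batch_size 64 \
    --trust_remote_code
```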

4. 📏 RULER Benchmark suite

The RULER benchmarks are commonly used for evaluating model performance on long-context tasks.
You can evaluate `fla` models on RULER directly using `lm-evaluation-harness`.

First, install the necessary dependencies for RULER:
```sh
pip install lm_eval["ruler"]
```
Then run the evaluation (e.g., at 32K context):
```sh
$ accelerate launch -m evals.harness \
    --output_path $OUTPUT \
    --tasks niah_single_1,niah_single_2,niah_single_3,niah_multikey_1,niah_multikey_2,niah_multikey_3,niah_multiquery,niah_multivalue,ruler_vt,ruler_cwe,ruler_fwe,ruler_qa_hotpot,ruler_qa_squad \
    --model_args pretrained=$MODEL,dtype=bfloat16,max_length=32768,trust_remote_code=True \
    --metadata='{"max_seq_lengths":[4096,8192,16384,32768]}' \
    --batch_size 2 \
    --show_config \
    --trust_remote_code
```
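Because long-context runs are expensive, it can help to sanity-check a single task at a short context before launching the full sweep; a minimal sketch, assuming the same `$MODEL` and `$OUTPUT` as above:
```sh
# sketch: quick check on one needle-in-a-haystack task at 4K context only
$ accelerate launch -m evals.harness \
    --output_path $OUTPUT \
    --tasks niah_single_1 \
    --model_args pretrained=$MODEL,dtype=bfloat16,max_length=4096,trust_remote_code=True \
    --metadata='{"max_seq_lengths":[4096]}' \
    --batch_size 2 \
    --trust_remote_code
```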

If a single GPU cannot load a full copy of the model, please refer to [this link](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#multi-gpu-evaluation-with-hugging-face-accelerate) for FSDP settings.
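Alternatively, `lm-evaluation-harness` supports naive model parallelism via `parallelize=True` in `--model_args`, which shards a single copy of the model across all visible GPUs; a minimal sketch, run without the `accelerate` launcher and assuming `evals.harness` passes these arguments through unchanged:
```sh
# sketch: shard one model copy across all visible GPUs instead of replicating it
$ python -m evals.harness --model hf \
    --model_args pretrained=$MODEL,dtype=bfloat16,parallelize=True,trust_remote_code=True \
    --tasks wikitext,lambada_openai \
    --batch_size 8 \
    --trust_remote_code
```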