Unable to Reproduce Benchmark MSE #14

@adewinmbi

I am unable to reproduce the MSE values reported in Appendix O of the paper. Here is the script I ran, which includes my config.

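#!/bin/bash
# Usage: bash <script>.sh <start_index> <end_index> <gpu_id>
#   start_index / end_index: inclusive, zero-based slice into all_models below
#   gpu_id: exported as CUDA_VISIBLE_DEVICES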
export CUDA_VISIBLE_DEVICES=$3

all_models=("FiLM" "DLinear" "Transformer" "Reformer" "Informer" "Autoformer" "FEDformer" "Nonstationary_Transformer" "Crossformer" "PatchTST" "iTransformer")
start_index=$1
end_index=$2
models=("${all_models[@]:$start_index:$((end_index - start_index + 1))}")
root_paths=("./data/Public_Health")
data_paths=("US_FLURATIO_Week.csv") 
pred_lengths=(12 24 36 48)
seeds=(2021)
use_fullmodel=0
length=${#root_paths[@]}
for seed in "${seeds[@]}"
do
  for model_name in "${models[@]}"
  do
    for ((i=0; i<$length; i++))
    do
      for pred_len in "${pred_lengths[@]}"
      do
        root_path="${root_paths[$i]}"
        data_path="${data_paths[$i]}"
        model_id=$(basename "${root_path}")

        echo "Running model $model_name with root $root_path, data $data_path, and pred_len $pred_len"
        python -u run.py \
          --task_name long_term_forecast \
          --is_training 1 \
          --root_path "$root_path" \
          --data_path "$data_path" \
          --model_id "${model_id}_${seed}_24_${pred_len}_fullLLM_${use_fullmodel}" \
          --model "$model_name" \
          --data custom \
          --features M \
          --seq_len 24 \
          --label_len 12 \
          --pred_len "$pred_len" \
          --des 'Exp' \
          --seed "$seed" \
          --type_tag "#F#" \
          --text_len 4 \
          --prompt_weight 0.1 \
          --pool_type "avg" \
          --save_name "results/result_health_gpt2_all.txt" \
          --llm_model GPT2 \
          --huggingface_token 'NA' \
          --use_fullmodel "$use_fullmodel"
      done
    done
  done
done
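For reference, a sample invocation (assuming the script is saved as week_health_gpt2.sh, a name I chose for this issue): the first two arguments are the inclusive, zero-based slice of all_models, so 0 and 10 run all eleven models, and the third argument selects the GPU.

bash week_health_gpt2.sh 0 10 0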

The paper states that GPT2 was used for the experiments in Appendix O; however, the provided sample script week_health.sh, which was meant to reproduce them, uses BERT. I used GPT2 instead.
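To be explicit, the only change relative to the sample script is the model flag; a minimal sketch, assuming week_health.sh selects the model via the same --llm_model flag shown above:

  --llm_model BERT    # as set in the provided week_health.sh
  --llm_model GPT2    # what I ran, to match the paper's description of Appendix O

My results are below: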

[Image: table of my reproduced MSE results]

Here is the percentage difference when comparing my results with those reported in the paper (Appendix O, Table 14).

[Image: table of percentage differences vs. Appendix O, Table 14]

Do my configs match what was used to produce the results in the paper? Or do you have any other ideas about where I might have gone wrong? Thanks.
