Hi,
Thanks for open-sourcing the code for your work.
While reviewing the implementation and comparing it with the details in the paper, I noticed a discrepancy regarding the optimizer used for training. The paper states that the model was trained with the AdamW optimizer, but the current codebase appears to use standard Adam. Could you confirm which optimizer was actually used to produce the reported results, and update either the code or the paper accordingly?
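For context on why this matters: Adam with L2 regularization and AdamW are not equivalent, because AdamW decouples the weight decay from the adaptive gradient scaling (Loshchilov & Hutter). If the repo uses PyTorch, the fix would likely be a swap from `torch.optim.Adam(..., weight_decay=...)` to `torch.optim.AdamW` (an assumption on my part, since I haven't confirmed the framework). A minimal pure-Python sketch of a single step under both rules, to illustrate the difference:

```python
import math

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    # Standard Adam with L2 regularization: the decay term is folded into
    # the gradient, so it also passes through the adaptive scaling.
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, t=1):
    # AdamW: weight decay is decoupled and applied directly to the weights,
    # outside the adaptive update.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w), m, v

# Same weight and gradient, one step of each rule: the resulting
# parameters differ, so the two optimizers are not interchangeable.
w_adam, _, _ = adam_step(1.0, 0.5, 0.0, 0.0)
w_adamw, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0)
print(w_adam, w_adamw)
```

With nonzero weight decay the two updates diverge over training, so results reported with AdamW generally cannot be reproduced by the Adam code path.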