I noticed that in damage_inference.py the test images are all scaled by 1.4. In damage_classification.py, however, the images are scaled by 1/255, which makes sense to me.
I've also noticed that the performance of the pretrained model changes greatly depending on the choice of scaling. When I change the scaling to 1/255, the model only predicts "no-damage", but when I keep the scaling at 1.4, the results are closer to those in the paper.
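
To illustrate how different the two settings are, here is a minimal sketch. It assumes the factor is a plain multiplier applied to 8-bit pixel values (0 to 255), which I have not confirmed from the scripts. The helper `scaled_range` is a hypothetical name of my own and is not taken from either file.

```python
# Hypothetical helper for illustration only; it is not part of
# damage_inference.py or damage_classification.py.
def scaled_range(factor, lo=0, hi=255):
    """Return the (min, max) that an 8-bit pixel range maps to
    under a multiplicative rescale factor (assumed behaviour)."""
    return (lo * factor, hi * factor)


# Factor used in damage_classification.py
train_range = scaled_range(1 / 255)
# Factor used in damage_inference.py
infer_range = scaled_range(1.4)

print("1/255 ->", train_range)
print("1.4   ->", infer_range)

# How much larger the inputs are under 1.4 than under 1/255
print("ratio :", 1.4 / (1 / 255))
```

Under that assumption, the two factors feed the network inputs that differ in magnitude by a factor of a few hundred, so a large change in predictions between the two settings is not surprising.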