Evaluation in WebArena (WA) differs from the official implementation

Why does the BrowserGym WebArena task call the evaluator at every step? The official WebArena implementation only runs the evaluator once, at the final step.

https://github.com/ServiceNow/BrowserGym/blob/ec6b802cd655f2c6a84ebd66a22a4435d8147272/browsergym/webarena/src/browsergym/webarena/task.py#L185C9-L185C11

https://github.com/web-arena-x/webarena/blob/df352854eef255b007110948f6d4f539af039717/run.py#L330
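
To make the question concrete, here is a minimal sketch of the two calling patterns. It is not code from either repository: `run_eval_every_step`, `run_eval_final_only`, and `toy_evaluator` are hypothetical names, and the evaluator is a toy stand-in that only looks at the last action.

```python
from typing import Callable, List, Tuple

# Hypothetical evaluator type: takes the trajectory so far, returns a score.
Evaluator = Callable[[List[str]], float]


def run_eval_every_step(actions: List[str], evaluator: Evaluator) -> Tuple[float, int]:
    """Call the evaluator after each action; return (last score, number of calls)."""
    trajectory: List[str] = []
    score = 0.0
    calls = 0
    for action in actions:
        trajectory.append(action)
        score = evaluator(trajectory)  # evaluated on every step
        calls += 1
    return score, calls


def run_eval_final_only(actions: List[str], evaluator: Evaluator) -> Tuple[float, int]:
    """Run all actions first, then call the evaluator once on the full trajectory."""
    trajectory = list(actions)
    score = evaluator(trajectory)  # evaluated once, at the end
    return score, 1


def toy_evaluator(trajectory: List[str]) -> float:
    """Toy stand-in: success only if the last action is 'stop'."""
    return 1.0 if trajectory and trajectory[-1] == "stop" else 0.0


if __name__ == "__main__":
    actions = ["click", "type", "stop"]
    print(run_eval_every_step(actions, toy_evaluator))
    print(run_eval_final_only(actions, toy_evaluator))
```

With this toy evaluator both patterns end on the same score and differ only in how many times the evaluator runs. Whether they can diverge with the real evaluators (for example, if an intermediate page state already satisfies the check) is the part I would like clarified.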