Why call the evaluator at every step? The official implementation only evaluates at the final step.

https://github.com/ServiceNow/BrowserGym/blob/ec6b802cd655f2c6a84ebd66a22a4435d8147272/browsergym/webarena/src/browsergym/webarena/task.py#L185C9-L185C11
https://github.com/web-arena-x/webarena/blob/df352854eef255b007110948f6d4f539af039717/run.py#L330
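
To make the concern concrete, here is a toy sketch of how the two strategies could diverge. It is not code from either repository: `run_final_only`, `run_every_step`, the `cart` state, and the assumption that the per-step variant stops at the first positive reward are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Evaluator = Callable[[State], float]


def run_final_only(states: List[State], evaluate: Evaluator) -> Tuple[float, int]:
    """Score only the last state. Returns (reward, number of evaluator calls)."""
    return evaluate(states[-1]), 1


def run_every_step(states: List[State], evaluate: Evaluator) -> Tuple[float, int]:
    """Score after each step and stop at the first positive reward (assumed behavior)."""
    calls = 0
    for state in states:
        calls += 1
        reward = evaluate(state)
        if reward > 0:
            return reward, calls
    return 0.0, calls


# Hypothetical goal: exactly one item in the cart.
def evaluate(state: State) -> float:
    return 1.0 if state["cart"] == 1 else 0.0


# The agent reaches the goal at step 2, then undoes it at step 3.
trajectory = [{"cart": 0}, {"cart": 1}, {"cart": 0}]

final_only = run_final_only(trajectory, evaluate)
every_step = run_every_step(trajectory, evaluate)
print(final_only, every_step)
```

Under these assumptions the same trajectory fails with final-only evaluation but succeeds with per-step evaluation, and the per-step variant also makes more evaluator calls. If the real per-step code behaves differently (for example, it does not terminate early), the gap may be smaller.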