Because these datasets are small, and because of the way cells are selected, most of the results we are seeing are really about cells picked from the mean image, the majority of which have no activity. I don't care much about these cells, and no one should, because they are overwhelmingly neuropil-contaminated. It is perfectly possible for an algorithm to detect a lot of these silent cells and do very well by your metrics, while doing very poorly on the 10% of cells that actually matter: the active cells.
I would suggest labelling every cell in your current ground truth as active or inactive, and then also running all the benchmarks on the active subset only. There could be a switch at the top of the website to flip to "active cells only". The definition of active should definitely subtract the neuropil signal from each ROI trace first, and only then quantify something about the variance of the trace, perhaps relative to its very high-frequency content.
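
To make the suggestion concrete, here is a rough sketch of one way such an active/inactive label could be computed. It is only an illustration under assumptions of mine, not a recommended definition: the function names, the neuropil coefficient of 0.7, and the threshold of 1.5 are all hypothetical. The high-frequency content is estimated from frame-to-frame differences of the trace, using a median absolute deviation so that large transients do not inflate the noise estimate.

```python
import numpy as np


def activity_score(roi_trace, neuropil_trace, neuropil_coeff=0.7):
    """Ratio of overall trace std to an estimate of its high-frequency noise.

    neuropil_coeff is a hypothetical scaling factor, not a value from the benchmark.
    """
    roi = np.asarray(roi_trace, dtype=float)
    neuropil = np.asarray(neuropil_trace, dtype=float)

    # Step 1: subtract the scaled neuropil signal from the ROI trace.
    corrected = roi - neuropil_coeff * neuropil

    # Step 2: estimate high-frequency noise from frame-to-frame differences.
    # For white noise with std sigma, the differences have std sigma * sqrt(2);
    # 1.4826 * MAD is a robust estimate of the std of a Gaussian.
    diffs = np.diff(corrected)
    mad = np.median(np.abs(diffs - np.median(diffs)))
    noise = 1.4826 * mad / np.sqrt(2.0)

    # Step 3: compare total variability to the noise level.
    # A pure-noise trace scores about 1; slow transients push the score up.
    return float(np.std(corrected) / max(noise, 1e-12))


def label_active(roi_traces, neuropil_traces, threshold=1.5, neuropil_coeff=0.7):
    """Return a boolean array, True where a cell counts as active.

    threshold is a hypothetical cut-off that would need tuning on real data.
    """
    scores = np.array([
        activity_score(f, n, neuropil_coeff)
        for f, n in zip(roi_traces, neuropil_traces)
    ])
    return scores > threshold
```

With `roi_traces` and `neuropil_traces` as arrays of shape (cells, frames), `label_active(roi_traces, neuropil_traces)` gives the mask that the "active cells only" switch would filter on. Skipping the subtraction in step 1 would let slow neuropil fluctuations make a silent cell look active, which is exactly the failure I am worried about.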