GSoC'25: Mesa-Frames: Stats & Event Driven Data Collection with Streamed Storage #151
-
Hey @Ben-geo, first of all, congratulations again on your acceptance to GSoC, and really great work on the proposal! Below is a comprehensive roadmap to help structure collaboration on the Mesa-Frames DataCollector. We can approach the development in well-defined phases: starting with a high-level architecture (which your diagrams already capture very well), then an abstract API interface that mirrors Mesa's conventions, and finally diving into the implementation for the MVP and examples.

1 Architecture

Your proposed architecture already lays down a solid and intuitive structure: Model → DataCollector → Storage backend. One important shift for Mesa-Frames is to use Polars LazyFrames as the main internal representation, rather than standard (eager) DataFrames. This enables deferred computation, better performance with large datasets, and compatibility with Polars-native operations. Here are some guiding principles and architectural decisions to keep in mind:
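To make the LazyFrame-centric flow concrete, here is a minimal sketch of the deferred pattern (the names `frames` and `agent_data` are illustrative only, not the Mesa-Frames API):

```python
import polars as pl

# Sketch: each collection step appends a LazyFrame describing the query;
# nothing is actually computed until results are requested at the end.
agent_data = pl.DataFrame({"wealth": [1.0, 3.0, 0.0, 5.0]})

frames: list[pl.LazyFrame] = []
for step in range(3):
    lf = (
        agent_data.lazy()  # deferred: no computation happens here
        .select(pl.col("wealth").mean().alias("mean_wealth"))
        .with_columns(step=pl.lit(step))
    )
    frames.append(lf)

# Single optimized pass over all accumulated queries:
results = pl.concat(pl.collect_all(frames))
print(results)
```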
2 API
```python
from mesa_frames import DataCollector, every_n_steps

# minimal example
dc = DataCollector(
    model_reporters={
        "total_wealth": lambda m: m.agents["wealth"].sum(),
        "wealth": ["mean", "max", "count:nz"],
        "gini": lambda lf: gini_expr(lf["wealth"]),
    },
    agent_reporters={
        "wealth": "wealth",
    },
    trigger=every_n_steps(10),
    storage="parquet:./runs/exp42/*.parquet",
)
```

We don't need the …

2.2 Trigger Helpers

We provide two primary ways to trigger data collection: time-based triggers such as `every_n_steps(10)`, and conditional predicate-based triggers evaluated against the model (sketched below).
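As an illustrative sketch of what these two helpers could look like (`every_n_steps` is the name used in the example above; `when` is a hypothetical name for the predicate variant, not a committed API):

```python
from typing import Callable

def every_n_steps(n: int) -> Callable:
    """Return a trigger that fires on every n-th step."""
    def trigger(model) -> bool:
        return model.time % n == 0
    return trigger

def when(predicate: Callable) -> Callable:
    """Hypothetical helper: fire whenever the predicate on the model holds."""
    def trigger(model) -> bool:
        return bool(predicate(model))
    return trigger

# Usage (mirrors the minimal example above):
# trigger=every_n_steps(10)
# trigger=when(lambda m: m.sheep_count < 10)
```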
However, we are deferring the implementation of more complex event-driven collection for now. Mesa-Frames currently doesn't support events natively, and Mesa core's event hook design is still evolving. We can revisit this in a later phase if there's time.

2.3 Stats Configuration

Users can specify computed statistics either via named presets or custom callables. Working with `model_reporters`:
"wealth": ["mean", "max", "count:nz"],
"gini": lambda lf: gini_expr(lf["wealth"])
}
This approach ensures that computations are composable, lazy, and expressive. It also keeps the API clean while letting advanced users write powerful metrics.

2.4 Storage URI Schema

We support a unified URI scheme to specify where data should go:
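For illustration, a few URIs in this style; only the `parquet:` form appears elsewhere in this thread, so the other schemes here are plausible extensions rather than committed backends:

```python
# Scheme prefix selects the backend; the remainder is backend-specific.
storage = "parquet:./runs/exp42/{step}.parquet"  # used in the examples above

# Hypothetical further backends (assumptions, not a committed API):
# storage = "csv:./runs/exp42/data.csv"
# storage = "memory:"  # keep results in RAM only
# storage = "postgres://user@host/dbname?table=results"
```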
2.5 Memory Management & Materialization Strategy

We can implement two main strategies for managing LazyFrames during simulation runs, especially for long simulations with many steps. Each approach makes different trade-offs between memory usage and computational efficiency.

In-Memory Accumulation with Deferred Execution
```python
# Default strategy: keep all LazyFrames in memory, collect only at the end
dc = DataCollector(...)
model.run(1000)  # collect LazyFrames every step/trigger

# After the run is complete:
dfs = dc.get_data()  # triggers pl.collect_all([lf1, lf2, ...])
```

This approach leverages Polars' ability to optimize across all accumulated LazyFrames: when `get_data()` is called, the whole batch of queries is materialized in a single optimized pass.
This is optimal for computational efficiency but requires enough memory to hold all DataFrames at once.

Periodic Materialization with Collect-Write-Lazy Pattern
```python
# Periodic materialization mode: collect, write, and create a new LazyFrame base
dc = DataCollector(
    ...,
    storage="parquet:./runs/exp42/{step}.parquet",
    flush_every=50,  # materialize and flush every 50 steps
)
model.run(1000)
```

For very large simulations or memory-constrained environments, we'll implement a periodic materialization approach that collects the current batch, writes it to storage, and continues from a fresh lazy base:
```python
# Example implementation within DataCollector._flush()
# self._current_batch contains LazyFrames from both agent and model reporters
dfs = pl.collect_all(self._current_batch)  # optimize across all reporter LazyFrames in the batch
for i, df in enumerate(dfs):
    df.write_parquet(self._path_template.format(step=self._current_step + i))
    # Keep the last DataFrame as a LazyFrame to continue operations
    if i == len(dfs) - 1:
        self._base_frame = df.lazy()
self._current_batch = []  # clear the batch buffer
```
This approach maintains the best of both worlds: batch-level optimization with bounded memory usage.

3 Stats Layer (Built on Polars Expr)

We define a small catalog of commonly used statistics, implemented as Polars expressions, as sketched below.
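A minimal sketch of such a catalog; the preset names follow the strings used in the examples above, and treating `count:nz` as a count of non-zero values is my assumption:

```python
from typing import Callable
import polars as pl

# Hypothetical preset catalog (illustrative, not the project's committed API):
# each entry maps a preset name to a function turning a column name into a
# Polars expression.
STATS: dict[str, Callable[[str], pl.Expr]] = {
    "mean": lambda col: pl.col(col).mean().alias(f"{col}_mean"),
    "max": lambda col: pl.col(col).max().alias(f"{col}_max"),
    # assumption: "count:nz" counts non-zero values
    "count:nz": lambda col: (pl.col(col) != 0).sum().alias(f"{col}_count_nz"),
}

def resolve(col: str, specs: list[str]) -> list[pl.Expr]:
    """Translate preset names like ["mean", "count:nz"] into expressions."""
    return [STATS[s](col) for s in specs]

# Usage sketch: model.agents.lazy().select(resolve("wealth", ["mean", "count:nz"]))
```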
We also want to allow user-extensible statistics. Developers can register their own custom functions that return a valid `pl.Expr`:

```python
def gini_expr(s: pl.Expr) -> pl.Expr:
    # Note: use Expr.len() rather than Python's len(), which does not work on expressions
    n = s.len()
    return (2 * s.rank().sum() / (n * s.sum())) - (n + 1) / n

dc.register_stat("gini", gini_expr)
```

4 Implementation Guidelines

4.1 Collect-all algorithm (single scan)
```python
lf = (
    model.agents.lazy()  # this will be lazy by default in the future
    .select([*required_cols, *exprs])
    .with_columns(step=pl.lit(model.time))
)
dc._frames.append(lf)
```

DataCollector can run a background loop (thread or asyncio) that batches N steps or T seconds, optimizing the balance between memory usage and computation efficiency.

5 Event-Driven Collection (future work)

Mesa core currently does not expose explicit event hooks, and Mesa-Frames lacks an internal event mechanism or lifecycle-based pub/sub system. Introducing full event-driven data collection would require infrastructure that does not yet exist in either project. To keep Phase 1 focused and feasible, we will postpone event-based collection mechanisms, such as tracking custom agent-level transitions or listening to internal simulation milestones, until a later stage. For now, we'll rely on periodic or conditional predicate-based collection (e.g., `every_n_steps(10)` or a model predicate).

6 Deliverables
Stretch goals may include event-driven collection or integration with dashboards for real-time insight. These can be explored if time allows, but are not required for successful completion of the core project.
-
After further discussion, we've decided on the following changes:
Additional context on the data collected
-
Overview
This proposal outlines my plan to significantly enhance Mesa-Frames’ data collection capabilities during Google Summer of Code 2025. The focus is on developing a flexible, efficient, and scalable framework tailored for advanced researchers working with large-scale agent-based simulations.
The core enhancements include:

- Selective statistics: users specify exactly which statistics (`mean`, `max`, `min`, `count`, etc.) they want to collect, reducing memory usage and computational overhead.
- Conditional collection: predicate-based triggers (e.g., `lambda model: model.sheep_count < 10`) or time-based triggers (`every_n_steps=50`) ensure researchers capture only important insights while avoiding unnecessary storage.

Each of these ideas, along with their motivations and potential benefits, is explored in greater depth in my proposal. I encourage you to take a look: proposal
Code draft from proposal: