GSoC'25: Mesa-Frames: Stats & Event Driven Data Collection with Streamed Storage #151
-
Hey @Ben-geo, first of all, congratulations again on your acceptance to GSoC, and really great work on the proposal! Below is a comprehensive roadmap to help structure collaboration on the Mesa-Frames DataCollector. We can approach the development in well-defined phases: starting with a high-level architecture (which your diagrams already capture very well), then an abstract API interface that mirrors Mesa's conventions, and finally diving into the implementation for the MVP and examples.

1 Architecture

Your proposed architecture already lays down a solid and intuitive structure: Model → DataCollector → Storage backend. One important shift for Mesa-Frames is to use Polars LazyFrames as the main internal representation, rather than standard (eager) DataFrames. This enables deferred computation, better performance with large datasets, and compatibility with Polars-native operations. Here are some guiding principles and architectural decisions to keep in mind:
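To make the LazyFrame-centric flow concrete, here is a minimal sketch of the deferred pattern (the names `frames` and `agent_data` are illustrative only, not the Mesa-Frames API):

```python
import polars as pl

# Sketch: each collection step appends a LazyFrame describing the query;
# nothing is actually computed until results are requested at the end.
agent_data = pl.DataFrame({"wealth": [1.0, 3.0, 0.0, 5.0]})

frames: list[pl.LazyFrame] = []
for step in range(3):
    lf = (
        agent_data.lazy()  # deferred: no computation happens here
        .select(pl.col("wealth").mean().alias("mean_wealth"))
        .with_columns(step=pl.lit(step))
    )
    frames.append(lf)

# Single optimized pass over all accumulated queries:
results = pl.concat(pl.collect_all(frames))
print(results)
```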
2 API
```python
from mesa_frames import DataCollector, every_n_steps

# minimal example
dc = DataCollector(
    model_reporters={
        "total_wealth": lambda m: m.agents["wealth"].sum(),
        "wealth": ["mean", "max", "count:nz"],
        "gini": lambda lf: gini_expr(lf["wealth"]),
    },
    agent_reporters={
        "wealth": "wealth",
    },
    trigger=every_n_steps(10),
    storage="parquet:./runs/exp42/*.parquet",
)
```

We don't need the …

2.2 Trigger Helpers

We provide two primary ways to trigger data collection: time-based triggers such as `every_n_steps(10)`, and conditional predicate-based triggers evaluated against the model (sketched below).
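As an illustrative sketch of what these two helpers could look like (`every_n_steps` is the name used in the example above; `when` is a hypothetical name for the predicate variant, not a committed API):

```python
from typing import Callable

def every_n_steps(n: int) -> Callable:
    """Return a trigger that fires on every n-th step."""
    def trigger(model) -> bool:
        return model.time % n == 0
    return trigger

def when(predicate: Callable) -> Callable:
    """Hypothetical helper: fire whenever the predicate on the model holds."""
    def trigger(model) -> bool:
        return bool(predicate(model))
    return trigger

# Usage (mirrors the minimal example above):
# trigger=every_n_steps(10)
# trigger=when(lambda m: m.sheep_count < 10)
```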
However, we are deferring the implementation of more complex event-driven collection for now. Mesa-Frames currently doesn't support events natively, and Mesa core's event hook design is still evolving. We can revisit this in a later phase if there's time.

2.3 Stats Configuration

Users can specify computed statistics either via named presets or custom callables. Working with `model_reporters`:
"wealth": ["mean", "max", "count:nz"],
"gini": lambda lf: gini_expr(lf["wealth"])
}
This approach ensures that computations are composable, lazy, and expressive. It also keeps the API clean while letting advanced users write powerful metrics.

2.4 Storage URI Schema

We support a unified URI scheme to specify where data should go:
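For illustration, a few URIs in this style; only the `parquet:` form appears elsewhere in this thread, so the other schemes here are plausible extensions rather than committed backends:

```python
# Scheme prefix selects the backend; the remainder is backend-specific.
storage = "parquet:./runs/exp42/{step}.parquet"  # used in the examples above

# Hypothetical further backends (assumptions, not a committed API):
# storage = "csv:./runs/exp42/data.csv"
# storage = "memory:"  # keep results in RAM only
# storage = "postgres://user@host/dbname?table=results"
```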
2.5 Memory Management & Materialization Strategy

We can implement two main strategies for managing LazyFrames during simulation runs, especially for long simulations with many steps. Each approach makes different trade-offs between memory usage and computational efficiency.

In-Memory Accumulation with Deferred Execution
```python
# Default strategy: keep all LazyFrames in memory, collect only at the end
dc = DataCollector(...)
model.run(1000)  # collect LazyFrames every step/trigger

# After the run is complete:
dfs = dc.get_data()  # triggers pl.collect_all([lf1, lf2, ...])
```

This approach leverages Polars' ability to optimize across all accumulated LazyFrames: when `get_data()` is called, the whole batch of queries is materialized in a single optimized pass.
This is optimal for computational efficiency but requires enough memory to hold all DataFrames at once.

Periodic Materialization with Collect-Write-Lazy Pattern
```python
# Periodic materialization mode: collect, write, and create a new LazyFrame base
dc = DataCollector(
    ...,
    storage="parquet:./runs/exp42/{step}.parquet",
    flush_every=50,  # materialize and flush every 50 steps
)
model.run(1000)
```

For very large simulations or memory-constrained environments, we'll implement a periodic materialization approach that collects the current batch, writes it to storage, and continues from a fresh lazy base:
```python
# Example implementation within DataCollector._flush()
# self._current_batch contains LazyFrames from both agent and model reporters
dfs = pl.collect_all(self._current_batch)  # optimize across all reporter LazyFrames in the batch
for i, df in enumerate(dfs):
    df.write_parquet(self._path_template.format(step=self._current_step + i))
    # Keep the last DataFrame as a LazyFrame to continue operations
    if i == len(dfs) - 1:
        self._base_frame = df.lazy()
self._current_batch = []  # clear the batch buffer
```
This approach maintains the best of both worlds: batch-level optimization with bounded memory usage.

3 Stats Layer (Built on Polars Expr)

We define a small catalog of commonly used statistics, implemented as Polars expressions, as sketched below.
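A minimal sketch of such a catalog; the preset names follow the strings used in the examples above, and treating `count:nz` as a count of non-zero values is my assumption:

```python
from typing import Callable
import polars as pl

# Hypothetical preset catalog (illustrative, not the project's committed API):
# each entry maps a preset name to a function turning a column name into a
# Polars expression.
STATS: dict[str, Callable[[str], pl.Expr]] = {
    "mean": lambda col: pl.col(col).mean().alias(f"{col}_mean"),
    "max": lambda col: pl.col(col).max().alias(f"{col}_max"),
    # assumption: "count:nz" counts non-zero values
    "count:nz": lambda col: (pl.col(col) != 0).sum().alias(f"{col}_count_nz"),
}

def resolve(col: str, specs: list[str]) -> list[pl.Expr]:
    """Translate preset names like ["mean", "count:nz"] into expressions."""
    return [STATS[s](col) for s in specs]

# Usage sketch: model.agents.lazy().select(resolve("wealth", ["mean", "count:nz"]))
```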
We also want to allow user-extensible statistics. Developers can register their own custom functions that return a valid `pl.Expr`:

```python
def gini_expr(s: pl.Expr) -> pl.Expr:
    # Note: use Expr.len() rather than Python's len(), which does not work on expressions
    n = s.len()
    return (2 * s.rank().sum() / (n * s.sum())) - (n + 1) / n

dc.register_stat("gini", gini_expr)
```

4 Implementation Guidelines

4.1 Collect-all algorithm (single scan)
```python
lf = (
    model.agents.lazy()  # this will be lazy by default in the future
    .select([*required_cols, *exprs])
    .with_columns(step=pl.lit(model.time))
)
dc._frames.append(lf)
```

DataCollector can run a background loop (thread or asyncio) that batches N steps or T seconds, optimizing the balance between memory usage and computation efficiency.

5 Event-Driven Collection (future work)

Mesa core currently does not expose explicit event hooks, and Mesa-Frames lacks an internal event mechanism or lifecycle-based pub/sub system. Introducing full event-driven data collection would require infrastructure that does not yet exist in either project. To keep Phase 1 focused and feasible, we will postpone event-based collection mechanisms, such as tracking custom agent-level transitions or listening to internal simulation milestones, until a later stage. For now, we'll rely on periodic or conditional predicate-based collection (e.g., `every_n_steps(10)` or a model predicate).

6 Deliverables
Stretch goals may include event-driven collection or integration with dashboards for real-time insight. These can be explored if time allows, but are not required for successful completion of the core project.
-
After further discussion, we've decided on the following changes:
Additional context on the data collected
-
Overview
This proposal outlines my plan to significantly enhance Mesa-Frames’ data collection capabilities during Google Summer of Code 2025. The focus is on developing a flexible, efficient, and scalable framework tailored for advanced researchers working with large-scale agent-based simulations.
The core enhancements include:

- Selective statistics: users specify exactly which statistics (`mean`, `max`, `min`, `count`, etc.) they want to collect, reducing memory usage and computational overhead.
- Conditional collection: predicate-based triggers (e.g., `lambda model: model.sheep_count < 10`) or time-based triggers (`every_n_steps=50`) ensure researchers capture only important insights while avoiding unnecessary storage.

Each of these ideas, along with their motivations and potential benefits, is explored in greater depth in my proposal. I encourage you to take a look: proposal
Code draft from proposal: