MeanFlow with U-Net on CIFAR-10 by juanwulu · Pull Request #8 · PurdueDigitalTwin/research

juanwulu · 2025-12-01T22:06:13Z

Description

This pull request introduces several important refactorings and improvements to the core training and evaluation pipeline, with a focus on modularity, extensibility, and improved logging. The main changes include replacing references to the old `data` module with `datamodule`, updating the evaluation and training step interfaces, enhancing logging and error handling, and adding support for new dependency targets. Key changes are as follows:

Refactor to use `datamodule` instead of `data`

All references to the old `data` module have been updated to use `datamodule` throughout the codebase, including imports, type annotations, and Bazel dependencies. This improves clarity and modularity in the data handling pipeline.

Training and evaluation loop improvements

The training and evaluation loops (`train.py` and `evaluate.py`) now accept explicit `training_step` and `evaluation_step` callables instead of relying on model methods, allowing for more flexible and decoupled step function definitions.
Enhanced logging and error handling have been added, including stack traces on exceptions and more informative status messages during compilation and evaluation.
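
A minimal sketch of the decoupled loop described above. The function name `fit`, the step signature `(state, batch) -> (state, metrics)`, and the toy `sgd_step` are assumptions made for illustration; the actual signatures in `train.py` are not shown in this pull request.

```python
import logging
import traceback
from typing import Any, Callable, Dict, Iterable, Tuple

# Hypothetical step signature: (state, batch) -> (new_state, scalar metrics).
StepFn = Callable[[Any, Any], Tuple[Any, Dict[str, float]]]


def fit(state: Any, batches: Iterable[Any], training_step: StepFn) -> Tuple[Any, int]:
    """Run `training_step` over `batches`; return the final state and an exit status."""
    status = 0
    try:
        for step, batch in enumerate(batches):
            # The loop only knows the callable, not the model that produced it.
            state, metrics = training_step(state, batch)
            logging.info("step=%d metrics=%s", step, metrics)
    except Exception as e:
        # Log the message and the full stack trace, then signal failure.
        logging.error("Exception occurred during training: %s", e)
        logging.error("Stack trace:\n%s", traceback.format_exc())
        status = 1
    return state, status


def sgd_step(state: float, batch: float) -> Tuple[float, Dict[str, float]]:
    """Toy step: one gradient-descent update on the loss (state - batch) ** 2."""
    grad = 2.0 * (state - batch)
    new_state = state - 0.1 * grad
    return new_state, {"loss": (state - batch) ** 2}


final_state, status = fit(0.0, [1.0, 1.0, 1.0], sgd_step)
```

Because the loop receives the step as an argument, a different step function can be swapped in without touching the loop or the model class.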

Model interface refactor

The `Model` class interface has been refactored: the `training_step`, `evaluation_step`, and `predict_step` methods have been replaced with more generic `compute_loss` and `forward` methods. The `StepOutputs` container now also supports histogram outputs. This makes the model API more extensible and explicit.
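
As a rough illustration of the refactored interface, the sketch below pairs a `StepOutputs` container with a protocol exposing `compute_loss` and `forward`. The field names `scalars`, `images`, and `histograms` follow the attributes accessed in `evaluate.py`; the method signatures, the `LinearModel` class, and the free-standing `training_step` are hypothetical and not taken from `src/core/model.py`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Protocol, Tuple


@dataclass
class StepOutputs:
    """Container for values a step wants logged; every field is optional."""

    scalars: Optional[Dict[str, Any]] = None
    images: Optional[Dict[str, Any]] = None
    histograms: Optional[Dict[str, Any]] = None


class Model(Protocol):
    """Hypothetical protocol; the signatures are assumptions, not the repository's."""

    def forward(self, params: Any, inputs: Any) -> Any: ...

    def compute_loss(self, params: Any, batch: Any) -> Tuple[float, StepOutputs]: ...


class LinearModel:
    """Toy model y = w * x that satisfies the protocol structurally."""

    def forward(self, params: float, inputs: float) -> float:
        return params * inputs

    def compute_loss(self, params: float, batch: Tuple[float, float]) -> Tuple[float, StepOutputs]:
        x, y = batch
        residual = self.forward(params, x) - y
        loss = residual ** 2
        return loss, StepOutputs(
            scalars={"loss": loss},
            histograms={"residual": [residual]},
        )


def training_step(model: Model, params: float, batch: Tuple[float, float]) -> Tuple[float, StepOutputs]:
    """A free-standing step built on `compute_loss` alone."""
    return model.compute_loss(params, batch)


loss, outputs = training_step(LinearModel(), 2.0, (3.0, 5.0))
```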

Evaluation and logging enhancements

The evaluation loop now logs histograms, flushes the writer more frequently, and names metrics consistently (e.g., using `_step` and `_epoch` suffixes).
Error reporting in evaluation now includes stack traces, and the writer is always closed so that resources are cleaned up properly.
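
The naming and cleanup behavior can be sketched as follows. `RecordingWriter` is a stand-in invented for this example (it only records calls), and the `evaluate` signature is an assumption; only the `eval/..._step` and `eval/..._epoch` key pattern and the `try`/`except`/`finally` shape mirror what the pull request describes.

```python
import logging
import traceback
from typing import Any, Callable, Dict, Iterable, List


class RecordingWriter:
    """Stand-in for a metric writer; records calls instead of writing to disk."""

    def __init__(self) -> None:
        self.scalars: List[Dict[str, float]] = []
        self.closed = False

    def write_scalars(self, step: int, scalars: Dict[str, float]) -> None:
        self.scalars.append(dict(scalars))

    def flush(self) -> None:
        pass

    def close(self) -> None:
        self.closed = True


def evaluate(
    batches: Iterable[Any],
    evaluation_step: Callable[[Any], Dict[str, float]],
    writer: RecordingWriter,
    step: int = 0,
) -> int:
    """Log per-batch and averaged metrics; always close the writer."""
    status = 0
    epoch_metrics: Dict[str, List[float]] = {}
    try:
        for batch in batches:
            metrics = evaluation_step(batch)
            # Per-batch values carry the `_step` suffix.
            writer.write_scalars(
                step=step,
                scalars={f"eval/{k}_step": v for k, v in metrics.items()},
            )
            writer.flush()
            for k, v in metrics.items():
                epoch_metrics.setdefault(k, []).append(v)
        # Averages over the whole pass carry the `_epoch` suffix.
        writer.write_scalars(
            step=step,
            scalars={f"eval/{k}_epoch": sum(v) / len(v) for k, v in epoch_metrics.items()},
        )
        writer.flush()
    except Exception as e:
        logging.error("Exception occurred during evaluation: %s", e)
        logging.error("Stack trace:\n%s", traceback.format_exc())
        status = 1
    finally:
        # Runs on success and on failure, so the writer is never leaked.
        writer.close()
    return status


writer = RecordingWriter()
status = evaluate([1.0, 3.0], lambda b: {"loss": b}, writer)
```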

Bazel and dependency updates

Added a new `ml_infra_mps_3_10` pip dependency target to `MODULE.bazel` for MPS (Apple Silicon) support, and updated Bazel build dependencies for clarity and correctness.

These changes collectively improve the maintainability, extensibility, and robustness of the core ML infrastructure.

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>


Copilot

Pull request overview

This pull request introduces a comprehensive refactoring of the ML training infrastructure to support MeanFlow generative models with U-Net architecture on CIFAR-10. The changes improve modularity by decoupling training/evaluation loops from model implementations and add support for Apple Silicon (MPS) hardware.

Key Changes

Module refactoring: Renamed data module to datamodule throughout the codebase for clarity
Training interface improvements: Replaced model-bound training_step/evaluation_step methods with explicit callable functions, enabling more flexible training pipelines
New U-Net implementation: Added complete U-Net architecture for score-based generative modeling with attention mechanisms and skip connections
MPS support: Added infrastructure for Apple Silicon GPU acceleration through jax-metal dependency

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 11 comments.


| File | Description |
| --- | --- |
| `MODULE.bazel` | Added MPS pip dependency target for Apple Silicon support |
| `third_party/requirements.in` | Downgraded chex version; added jax-metal requirements |
| `third_party/defs.bzl` | Added MPS platform selection logic and conditional jax-metal dependency |
| `third_party/BUILD` | Added MPS config setting and requirements compilation target |
| `src/core/model.py` | Refactored interface: replaced step methods with `compute_loss` and `forward` |
| `src/core/train.py` | Accepts explicit training/evaluation step functions; improved error handling and logging |
| `src/core/evaluate.py` | Updated to use explicit evaluation step function; enhanced metric logging |
| `src/core/config.py` | Updated imports from `data` to `datamodule`; added checkpoint frequency config |
| `src/data/huggingface.py` | Major refactor: improved dataset caching, removed `seed` in favor of `rng`, simplified data loading |
| `src/projects/generative/meanflow.py` | Complete rewrite using U-Net backbone; improved timestamp conditioning and loss computation |
| `src/projects/generative/model/unet.py` | New U-Net implementation with ResNet blocks, attention, and up/downsampling |
| `src/projects/generative/model/refinenet.py` | Updated type hints and documentation; fixed deprecated JAX APIs |
| `src/projects/generative/main.py` | New training entry point with configuration support and proper device setup |
| `src/utilities/visualization.py` | New utility for creating image grids for visualization |

Comments suppressed due to low confidence (1)

src/core/evaluate.py:140

The condition at line 93 checks `outputs.scalars is not None`, but the code at line 136 accesses `outputs.scalars.items()` without first checking whether `outputs` itself is `None`. If `outputs` is `None` (which could happen if no batches were processed), this raises an `AttributeError`. Consider guarding with `if outputs is not None and outputs.scalars is not None:` before accessing the scalars.

                if outputs.scalars is not None:
                    writer.write_scalars(
                        step=step,
                        scalars={
                            f"eval/{k}_step": sum(v) / len(v)
                            for k, v in outputs.scalars.items()
                        },
                    )
                if outputs.images is not None:
                    writer.write_images(
                        step=step,
                        images={
                            f"eval/{k}_step": v
                            for k, v in outputs.images.items()
                        },
                    )
                if outputs.histograms is not None:
                    writer.write_histograms(
                        step=step,
                        arrays={
                            f"eval/{k}_step": v
                            for k, v in outputs.histograms.items()
                        },
                    )
                writer.flush()

            # logging at the end of evaluation
            logging.rank_zero_info("Evaluation done.")
            scalar_output = {
                f"eval/{k.replace('_', ' ')}_epoch": sum(v) / len(v)
                for k, v in eval_metrics.items()
            }
            writer.write_scalars(
                step=step,
                scalars=scalar_output,
            )
            writer.flush()

        except Exception as e:
            logging.rank_zero_error(
                "Exception occurred during evaluation: %s", e
            )
            error_trace = traceback.format_exc()
            logging.rank_zero_error("Stack trace:\n%s", error_trace)
            _status = 1
        finally:
            writer.close()
            logging.rank_zero_info(
                "Evaluation done. Exit with code %d.",
                _status,
            )
    return _status
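
One way to apply the guard suggested in the comment above is to fold both `None` checks into a small helper. This is only a sketch: `step_scalars` and the simplified `StepOutputs` dataclass are hypothetical names introduced here, not code from `src/core/evaluate.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StepOutputs:
    """Simplified stand-in holding only the scalar lists used in this example."""

    scalars: Optional[Dict[str, List[float]]] = None


def step_scalars(outputs: Optional[StepOutputs]) -> Dict[str, float]:
    """Return averaged per-step scalars, or an empty dict when there is nothing to log."""
    # Check the container first, then the field, so neither access can fail on None.
    if outputs is None or outputs.scalars is None:
        return {}
    return {f"eval/{k}_step": sum(v) / len(v) for k, v in outputs.scalars.items()}


no_batches = step_scalars(None)
averaged = step_scalars(StepOutputs(scalars={"loss": [1.0, 2.0]}))
```

Returning an empty dictionary lets the caller skip the `write_scalars` call when no batches were processed.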


src/projects/generative/model/unet.py

src/projects/generative/meanflow.py

src/core/train.py

src/projects/generative/main.py

src/core/evaluate.py

src/utilities/visualization.py

src/core/train.py


juanwulu and others added 30 commits November 19, 2025 04:40

hotfix: Fixed typo in core module

ccfa898

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation of RefineNet

0810627

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed typo in core module

b60e746

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation of HuggingFace datamodule

95be2cb

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed build target for utility module

2e923ce

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed typo in build targets

7ca6e1e

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation of MeanFlow

35ad184

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Added main entrypoint for training and evaluation of generative models

8ca3ed3

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated configuration for training U-Net meanflow on CIFAR-10

0ab6cfc

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed issue with version of chex

f36fbeb

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Updated main train logic

e9816c8

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Improve the log frequency for meanflow on CIFAR-10

c2fb1ba

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated the main logic for training step in MeanFlow

a2f8fbe

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation of MeanFlow

91e96b3

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation of MeanFlow

dd4df9b

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Added checkpoint frequency attribute to trainer config

d585b66

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation of train logic

41223e0

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated the main logic for training step in MeanFlow

4357a02

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated the model protocol

b7d1967

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated the training logic

60db887

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Implemented the new model protocol for MeanFlow

c435f02

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Implemented the new main entrypoint with train logic

3b12acb

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation for huggingface dataset

cbfaa3c

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed error in huggingface datamodule

d368e52

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Updated checkpoint frequency

d91518f

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Added visualization utility to create a grid of images

07ab0aa

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed wrong implementation of t\neq{r} in meanflow

f462f55

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: symmetric mean flow

3649da1

hotfix: Switch back to original meanflow loss

4bad082

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Added dependencies for running on MPS framework

63b4cdf

Signed-off-by: Juanwu <juanwu@purdue.edu>

juanwulu added 15 commits November 29, 2025 21:32

feat: Updated grid visualization function to use jax array

a42758a

Signed-off-by: Juanwu <juanwu@purdue.edu>

feat: Implements the evaluation step for meanflow with visualization

cef9f5a

Signed-off-by: Juanwu <juanwu@purdue.edu>

feat: Moved evaluation to before the training inner loop

de1f95c

Signed-off-by: Juanwu <juanwu@purdue.edu>

feat: Added random left-right flip in training loop

74e69ce

Signed-off-by: Juanwu <juanwu@purdue.edu>

feat: Updated implementation for MeanFlow network and remove label conditions

000f1a6

Signed-off-by: Juanwu <juanwu@purdue.edu>

hotfix: Fixed error raised by wrong shape checking

d8e4b76

Signed-off-by: Juanwu <juanwu@purdue.edu>

feat: Fixed training collapse by adding fc layers for timestamp conditions

a50c12e

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed wrong implementation of timestamp conditioning in forward pass of meanflow model

1c4f404

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Added histogram attribute to the model step output

551c366

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Added histogram logging for training and evaluation

a2060a7

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation for meanflow model to take arbitrary tuple of timestamps

06ee4c5

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed error in logging histograms

50df9aa

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation for U-Net model in MeanFlow

ef2bcac

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Increased data loading batch size to 1024 for CIFAR-10

2b97614

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

feat: Updated implementation for evaluation step

0635517

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

juanwulu added this to the 2025.10 milestone Dec 1, 2025

juanwulu self-assigned this Dec 1, 2025

juanwulu added the enhacements New features or enhancements to existing ones. label Dec 1, 2025

juanwulu requested a review from Copilot December 1, 2025 22:06

Copilot started reviewing on behalf of juanwulu December 1, 2025 22:07

Copilot finished reviewing on behalf of juanwulu December 1, 2025 22:08

Copilot AI reviewed Dec 1, 2025


hotfix: Fixed typo

c4f14b9

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

This was referenced Dec 2, 2025

Fix typo in UNet docstring: "feaatures" → "features" #9

Closed

Resolve stale docstring reference in train.py #10

Closed

PurdueDigitalTwin deleted a comment from Copilot AI Dec 2, 2025

juanwulu added 2 commits December 1, 2025 19:20

hotfix: Fixed infinite outer loop in training

abcf902

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

hotfix: Fixed conflict in naming of batch

29296d1

Signed-off-by: Juanwu Lu <juanwu@purdue.edu>

juanwulu merged commit 2b93386 into master Dec 2, 2025

juanwulu merged 67 commits into master from meanflow
