Migration Strategy: iDG GPS Codebase

C++17 · Python 3.8+ · Proto3 · Production Test Suite

Branch: iDG | Date: 2026-02-10 | Author: Senior Research Scientist / Principal Engineer


Preamble

This document is the authoritative migration strategy for the Minimax Iterative Dynamic Game GPS implementation. Every finding cited here was verified by exhaustive static analysis of all 237 Python files, 41 C++ files, 5 build-system files, the proto schema, and 133+ experiment hyperparameter files. Exact file paths and line numbers are given for every issue.

The strategy is structured as six sequential phases. Each phase has a hard gate: a test suite milestone that must pass before the next phase begins. There is no phase that defers testing — tests are written concurrently with or before the code changes they cover. This is non-negotiable: the only way to prove that a migration is correct is to run it against a well-specified oracle.

Overall go/no-go criterion: before any code is merged into main/iDG, the following must be true:

  1. pytest --tb=short exits 0 with coverage ≥ 70 % on python/gps/
  2. catkin_make -DCMAKE_CXX_STANDARD=17 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-Wall -Wextra -Werror" exits 0 with zero warnings
  3. GitHub Actions CI is green on the Python 3.8, 3.10, 3.11 matrix
  4. All soak tests pass an 8-hour overnight run with RSS growth < 10 MB/hour
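
Criterion 4 only works as a gate if "RSS growth" has a fixed definition. One possible reading, assuming the soak harness records periodic (elapsed seconds, RSS bytes) samples, is the least-squares slope of RSS over time converted to MB/hour. The helper name and the sample format below are hypothetical and not part of the codebase; this is a sketch of the arithmetic, not the harness itself.

```python
"""Sketch: estimate RSS growth in MB/hour from periodic samples (hypothetical helper)."""
from typing import Sequence, Tuple

MB = 1024 * 1024


def rss_growth_mb_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of RSS (bytes) against time (seconds), in MB/hour.

    `samples` holds (elapsed_seconds, rss_bytes) pairs. How they are collected
    (for example via psutil) is left to the soak harness.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0.0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    bytes_per_second = cov / var_t
    return bytes_per_second * 3600.0 / MB


if __name__ == "__main__":
    # Synthetic 8-hour run sampled every 10 minutes, growing by 2 MB/hour.
    run = [(s, 500 * MB + 2 * MB * s / 3600.0) for s in range(0, 8 * 3600 + 1, 600)]
    growth = rss_growth_mb_per_hour(run)
    print(f"{growth:.2f} MB/hour, gate {'passes' if growth < 10.0 else 'fails'}")
```

A slope over the whole run is less sensitive to a single allocation spike than comparing only the first and last sample, which is why it is used here.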

Failure Inventory (What Breaks First on Python 3.8+)

The following failures are ordered by when they occur, not by severity. Any migration that does not address them in this order will hit cascading failures.

| Order | File | Line | Failure | Kind |
|---|---|---|---|---|
| 1 | python/gps/gui/ps3_config.py | 25 | AttributeError: 'dict' object has no attribute 'iteritems' (raised on module import, before any logic runs) | Py2 API |
| 2 | python/gps/sample/sample_list.py | 2 | ModuleNotFoundError: No module named 'cPickle' | Py2 API |
| 3 | All 7 ABC files | various | __metaclass__ silently ignored → abstract enforcement lost | Py2 semantics |
| 4 | python/gps/algorithm/policy_opt/policy_opt_pytorch.py | 55 | NameError: name 'use_cuda' is not defined | Logic bug |
| 5 | python/gps/algorithm/policy_opt/policy_opt_pytorch.py | 61 | NameError: name 'net' is not defined | Logic bug |
| 6 | python/gps/algorithm/cost/cost_state.py | 68 | return inside loop; only the first data type is evaluated | Logic bug |
| 7 | python/gps/algorithm/traj_opt/traj_opt_pi2.py | 76 | NameError: name 'xrange' is not defined | Py2 API |
| 8 | python/gps/algorithm/policy/lin_gauss_init.py | 4 | TypeError: catching classes that do not inherit from BaseException is not allowed | Wrong import |
| 9 | gps_agent_pkg/src/positioncontroller.cpp | 154–167 | UB: bool is_finished() returns nothing on TASK_SPACE path | C++ UB |
| 10 | gps_agent_pkg/src/robotplugin.cpp | 263, 586 | Infinite spinlock on trylock(); misses real-time deadline | C++ logic |
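
Failure 3 is the hardest to notice because nothing raises. The self-contained illustration below uses class names invented for the demo, not taken from the codebase. It shows that Python 3 treats `__metaclass__` as an ordinary class attribute, so an incomplete class instantiates without complaint, whereas the `abc.ABC` form restores enforcement.

```python
import abc


class LegacyCost(object):
    """Python 2 idiom: under Python 3 this is just a class attribute."""
    __metaclass__ = abc.ABCMeta

    @abc.abstractmethod
    def eval(self, sample):
        raise NotImplementedError


class FixedCost(abc.ABC):
    """Python 3 idiom: ABCMeta is applied, so abstract methods are enforced."""

    @abc.abstractmethod
    def eval(self, sample):
        raise NotImplementedError


def can_instantiate(cls):
    """Return True if cls() succeeds, False if ABC enforcement blocks it."""
    try:
        cls()
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    # The legacy class instantiates even though eval() is abstract.
    print("legacy instantiable:", can_instantiate(LegacyCost))
    # The abc.ABC class refuses instantiation with a TypeError.
    print("fixed instantiable:", can_instantiate(FixedCost))
```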

Phase 0 — Test Infrastructure Bootstrap (prerequisite to all phases)

Purpose: establish the scaffolding, fixtures, and CI pipeline so that every subsequent code change is immediately validated. No production code is modified in this phase.

0.1 Install test dependencies

Add to requirements-dev.txt (new file, not requirements.txt):

pytest>=7.4
pytest-cov>=4.1
pytest-timeout>=2.1
pytest-xdist>=3.3          # parallel test execution
pytest-mock>=3.11
memory-profiler>=0.61
psutil>=5.9
numpy>=1.24,<2.0
scipy>=1.10                # 1.11+ requires Python 3.9, but the CI matrix includes 3.8
torch>=2.1
protobuf>=4.23

0.2 pytest configuration

Create pytest.ini at repo root:

[pytest]
testpaths = python/tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = --tb=short --strict-markers -q
markers =
    unit: pure unit tests with no external dependencies
    integration: requires simulator (Box2D or MuJoCo)
    load: performance / memory tests
    soak: long-running endurance tests (>1 hour)
    fault: fault-injection / adversarial input tests
    gpu: requires CUDA device
    ros: requires ROS 1 environment
filterwarnings =
    error::DeprecationWarning
    error::RuntimeWarning

0.3 conftest.py with shared fixtures

Create python/tests/conftest.py:

"""Shared pytest fixtures for the GPS test suite."""
import random

import numpy as np
import pytest

SEED = 42


@pytest.fixture(autouse=True)
def fix_random_seed():
    """Enforce deterministic numpy/random state for every test."""
    np.random.seed(SEED)
    random.seed(SEED)
    yield

@pytest.fixture
def minimal_hyperparams():
    """Minimal valid hyperparams dict for algorithm instantiation."""
    return {
        'conditions': 1,
        'T': 10,
        'dU': 2,
        'dX': 4,
        'dV': 2,
        'iterations': 2,
        'num_samples': 2,
        'kl_step': 1.0,
        'min_step_mult': 0.01,
        'max_step_mult': 10.0,
        'initial_state_var': 1e-6,
        'fit_dynamics': False,
        'sample_on_policy': False,
    }

@pytest.fixture
def tiny_trajectory(minimal_hyperparams):
    """Random N=2 trajectories: X is (N, T, dX), U is (N, T, dU), V is (N, T, dV)."""
    hp = minimal_hyperparams
    N, T, dX, dU, dV = 2, hp['T'], hp['dX'], hp['dU'], hp['dV']
    X = np.random.randn(N, T, dX)
    U = np.random.randn(N, T, dU)
    V = np.random.randn(N, T, dV)
    return X, U, V

0.4 GitHub Actions CI

Create .github/workflows/ci.yml:

name: CI

on: [push, pull_request]

jobs:
  python:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        python-version: ["3.8", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      - name: Lint (black + isort)
        run: |
          pip install black isort
          black --check python/gps/ python/tests/
          isort --check-only python/gps/ python/tests/
      - name: Unit + integration tests
        run: |
          pytest -m "unit or integration" \
            --cov=python/gps --cov-report=xml \
            --cov-fail-under=70 \
            -x --timeout=120
      - name: Upload coverage
        uses: codecov/codecov-action@v4

  cpp:
    runs-on: ubuntu-22.04
    # ROS Noetic is released for Ubuntu 20.04 (focal) only, so build inside the
    # official ROS Docker image rather than installing it on the 22.04 runner.
    container: ros:noetic-ros-base
    steps:
      - uses: actions/checkout@v4
      - name: Install catkin tools and build dependencies
        run: |
          apt-get update
          apt-get install -y python3-catkin-tools libboost-all-dev libeigen3-dev libprotobuf-dev
      - name: Build with C++17 and -Werror
        shell: bash
        run: |
          source /opt/ros/noetic/setup.bash
          catkin build gps_agent_pkg \
            --cmake-args -DCMAKE_CXX_STANDARD=17 \
                         -DCMAKE_CXX_STANDARD_REQUIRED=ON \
                         -DCMAKE_BUILD_TYPE=Release \
                         -DCMAKE_CXX_FLAGS="-Wall -Wextra -Werror"

  docker:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image smoke test
        # Docker repository names must be lowercase.
        run: docker build -t gps-idg:smoke --target builder .

Gate for Phase 0: the CI pipeline runs end to end on the unmodified codebase, and the only failing steps are the expected Python 2 failures listed in the Failure Inventory. All test infrastructure code passes black and isort.


Phase 1 — Python 3.8+ Compatibility (Language-Level Fixes)

Goal: the codebase imports without error and all existing algorithms run to completion on Python 3.8+ with no behavioural regressions.

1.1 Complete fix list with locations

| # | File | Line(s) | Issue | Fix |
|---|---|---|---|---|
| P3-01 | python/gps/agent/agent.py | 15 | __metaclass__ = abc.ABCMeta | class Agent(abc.ABC): |
| P3-02 | python/gps/algorithm/algorithm.py | 20 | same | class Algorithm(abc.ABC): |
| P3-03 | python/gps/algorithm/cost/cost.py | 7 | same | class Cost(abc.ABC): |
| P3-04 | python/gps/algorithm/dynamics/dynamics.py | 9 | same | class Dynamics(abc.ABC): |
| P3-05 | python/gps/algorithm/policy/policy.py | 7 | same | class Policy(abc.ABC): |
| P3-06 | python/gps/algorithm/policy_opt/policy_opt.py | 7 | same | class PolicyOpt(abc.ABC): |
| P3-07 | python/gps/algorithm/traj_opt/traj_opt.py | 7 | same | class TrajOpt(abc.ABC): |
| P3-08 | python/gps/algorithm/traj_opt/traj_opt_pi2.py | 76, 79, 81, 83, 85, 137, 162 | xrange | range |
| P3-09 | python/gps/algorithm/cost/cost_binary_region.py | 56 | xrange | range |
| P3-10 | python/gps/gui/config.py | 38, 68, 71 | .iteritems() | .items() |
| P3-11 | python/gps/gui/ps3_config.py | 25, 50 | .iteritems() | .items() |
| P3-12 | python/gps/gui/action_panel.py | 63, 78 | .iteritems() | .items() |
| P3-13 | python/gps/algorithm/policy_opt/tf_utils.py | 128, 144 | .iteritems() | .items() |
| P3-14 | python/gps/sample/sample_list.py | 2 | import cPickle | import pickle |
| P3-15 | python/gps/sample/sample_list.py | 79 | cPickle.dump(data_file, samples) | pickle.dump(samples, data_file) |
| P3-16 | python/gps/utility/data_logger.py | 3–6 | try: import cPickle | import pickle (drop try/except) |
| P3-17 | gps_agent_pkg/test/run_train.py | 8 | import cPickle | import pickle |
| P3-18 | python/gps/gps_main.py | 6, 546 | import imp, imp.load_source() | importlib.util.spec_from_file_location |
| P3-19 | python/gps/gps_main.py | 18, 502 | '/'.join(str.split(...)) | pathlib.Path(__file__).parent.parent |
| P3-20 | python/gps/gps_main.py | 608 | Wrong arg order to GPSMain() | GPSMain(config, closeloop, robust, quit_on_end=args.quit) |
| P3-21 | python/gps/algorithm/cost/cost_action.py | 129 | os._exit(string) | raise ValueError(...) |
| P3-22 | python/gps/algorithm/cost/cost_state.py | 24–25 | Hardcoded self.gamma=5, self.mode='antagonist' | Read from self._hyperparams |
| P3-23 | python/gps/algorithm/cost/cost_state.py | 68 | return inside loop | Dedent return past loop |
| P3-24 | python/gps/algorithm/policy/lin_gauss_init.py | 4 | import numpy.linalg as LinAlgError | from numpy.linalg import LinAlgError |
| P3-25 | python/gps/algorithm/traj_opt/traj_opt_lqr_python.py | 17–18 | Circular import via AlgorithmBADMM, AlgorithmMDGPS | Break cycle with TYPE_CHECKING guard or lazy import |
| P3-26 | python/gps/utility/data_logger.py | 25, 30 | open() without context manager | with open(...) as f: |
| P3-27 | python/tests/tests_tensorflow/test_policy_opt_tf.py | 225, 234 | print statement syntax | print(...) |
| P3-28 | gps_agent_pkg/test/check_jac.py | 40 | pdb.set_trace() | Remove |
| P3-29 | gps_agent_pkg/test/run_train.py | 55 | pdb.set_trace() | Remove |
| P3-30 | python/gps/gps_main.py (all os._exit(1) calls) | 135, 192, 198, 237, 469 | os._exit bypasses atexit handlers | sys.exit(1) in threads, raise SystemExit elsewhere (note: in a non-main thread sys.exit ends only that thread, so the main thread must still be told to shut down) |
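
P3-18 names `importlib.util.spec_from_file_location` as the replacement for `imp.load_source()` but does not spell out the remaining steps. A minimal sketch of a drop-in helper follows. The helper name `load_source`, the module name `hyperparams` and the temporary file are illustrative assumptions, not the actual call site in gps_main.py.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_source(module_name, path):
    """Load a Python file as a module; stands in for imp.load_source()."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module is importable by name afterwards.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as td:
        # Stand-in for an experiment hyperparameter file.
        hyper_file = Path(td) / "hyperparams.py"
        hyper_file.write_text("config = {'iterations': 2, 'num_samples': 2}\n")
        hyperparams = load_source("hyperparams", hyper_file)
        print(hyperparams.config["iterations"])
```

The explicit `None` check matters because `spec_from_file_location` returns `None` when it cannot pick a loader for the path, for instance when the file has no recognised suffix.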

1.2 Test suite — Phase 1 gate

Write these tests before or alongside making the fixes. Each test is written to fail on the unmodified Python 2 codebase and pass after the fix.

python/tests/unit/test_imports.py

"""Verify all GPS modules import cleanly under Python 3.8+.

These tests fail on the unmodified codebase and pass after Phase 1 fixes.
"""
import importlib
import pytest


GPS_MODULES = [
    "gps.agent.agent",
    "gps.algorithm.algorithm",
    "gps.algorithm.cost.cost",
    "gps.algorithm.dynamics.dynamics",
    "gps.algorithm.policy.policy",
    "gps.algorithm.policy_opt.policy_opt",
    "gps.algorithm.traj_opt.traj_opt",
    "gps.sample.sample_list",
    "gps.utility.data_logger",
    "gps.gui.ps3_config",
    "gps.gui.config",
    "gps.algorithm.cost.cost_binary_region",
    "gps.algorithm.traj_opt.traj_opt_pi2",
    "gps.algorithm.policy_opt.tf_utils",
]


@pytest.mark.unit
@pytest.mark.parametrize("module", GPS_MODULES)
def test_module_importable(module):
    """Every GPS module must import without error on Python 3."""
    importlib.import_module(module)


@pytest.mark.unit
def test_abstract_base_classes_enforce_methods():
    """After P3-01 through P3-07: ABCs must refuse instantiation when
    abstract methods are not implemented."""
    import abc
    from gps.algorithm.cost.cost import Cost
    from gps.algorithm.dynamics.dynamics import Dynamics
    from gps.algorithm.policy.policy import Policy

    with pytest.raises(TypeError, match="Can't instantiate abstract class"):
        Cost({})  # type: ignore

    with pytest.raises(TypeError, match="Can't instantiate abstract class"):
        Dynamics({})  # type: ignore

    with pytest.raises(TypeError, match="Can't instantiate abstract class"):
        Policy()  # type: ignore

python/tests/unit/test_sample.py

"""Unit tests for Sample and SampleList data structures."""
import pickle
import tempfile

import numpy as np
import pytest


@pytest.mark.unit
class TestSampleList:
    def _make_fake_agent(self, T=10, dX=4, dU=2, dV=2, dO=4, dM=0):
        """Minimal agent-like namespace for Sample construction."""
        import types
        from gps.proto.gps_pb2 import ACTION, ACTION_V
        agent = types.SimpleNamespace(
            T=T, dX=dX, dU=dU, dV=dV, dO=dO, dM=dM,
            x_data_types=[ACTION],
            obs_data_types=[ACTION],
            meta_data_types=[],
        )
        agent.pack_data_x = lambda mat, data, data_types, axes=None: None
        agent.pack_data_obs = lambda mat, data, data_types, axes=None: None
        agent.pack_data_meta = lambda mat, data, data_types, axes=None: None
        return agent

    def test_sample_list_get_U_V_shape(self):
        from gps.sample.sample import Sample
        from gps.sample.sample_list import SampleList
        from gps.proto.gps_pb2 import ACTION, ACTION_V
        agent = self._make_fake_agent()
        N, T, dX, dU, dV = 3, agent.T, agent.dX, agent.dU, agent.dV
        samples = []
        for _ in range(N):
            s = Sample(agent)
            s.set(ACTION, np.random.randn(T, dU))
            s.set(ACTION_V, np.random.randn(T, dV))
            samples.append(s)
        sl = SampleList(samples)
        assert sl.get_U().shape == (N, T, dU)
        assert sl.get_V().shape == (N, T, dV)

    def test_pickle_sample_writer_correct_arg_order(self):
        """After P3-15: pickle.dump(samples, file) not (file, samples)."""
        from gps.sample.sample_list import PickleSampleWriter
        # A temporary directory is removed on exit, so no stray .pkl is left behind.
        with tempfile.TemporaryDirectory() as td:
            path = td + '/samples.pkl'
            writer = PickleSampleWriter(path)
            writer.write([1, 2, 3])
            with open(path, 'rb') as f:
                result = pickle.load(f)
        assert result == [1, 2, 3]

python/tests/unit/test_data_logger.py

"""Unit tests for DataLogger pickle/unpickle with context manager safety."""
import os
import tempfile

import pytest


@pytest.mark.unit
class TestDataLogger:
    def test_pickle_unpickle_round_trip(self):
        from gps.utility.data_logger import DataLogger
        logger = DataLogger()
        payload = {'key': [1, 2, 3], 'nested': {'a': 1.0}}
        with tempfile.TemporaryDirectory() as td:
            path = os.path.join(td, 'test.pkl')
            logger.pickle(path, payload)
            recovered = logger.unpickle(path)
        assert recovered == payload

    def test_unpickle_missing_file_returns_none(self):
        from gps.utility.data_logger import DataLogger
        logger = DataLogger()
        result = logger.unpickle('/nonexistent/path/file.pkl')
        assert result is None

    def test_no_fd_leak(self):
        """open() must be used as context manager — no file descriptor leak."""
        import psutil  # Process.num_fds() is POSIX-only
        from gps.utility.data_logger import DataLogger
        logger = DataLogger()
        proc = psutil.Process(os.getpid())
        fds_before = proc.num_fds()
        with tempfile.TemporaryDirectory() as td:
            path = os.path.join(td, 'fdleak.pkl')
            for _ in range(100):
                logger.pickle(path, {'x': 1})
                logger.unpickle(path)
        fds_after = proc.num_fds()
        assert fds_after - fds_before < 5, \
            f"Possible file descriptor leak: {fds_before} -> {fds_after}"

python/tests/unit/test_cost_state.py

"""Verify cost_state evaluates ALL data types (not just first)."""
import numpy as np
import pytest


@pytest.mark.unit
class TestCostState:
    def _make_sample(self, T=10, dX=4, dU=2, dV=2, dO=4):
        import types
        from gps.sample.sample import Sample
        from gps.proto.gps_pb2 import ACTION, ACTION_V, JOINT_ANGLES, JOINT_VELOCITIES
        agent = types.SimpleNamespace(
            T=T, dX=dX, dU=dU, dV=dV, dO=dO, dM=0,
            x_data_types=[JOINT_ANGLES, JOINT_VELOCITIES],
            obs_data_types=[],
            meta_data_types=[],
            _x_data_idx={JOINT_ANGLES: list(range(2)),
                         JOINT_VELOCITIES: list(range(2, 4))},
        )
        def pack_data_x(mat, data, data_types, axes=None):
            pass
        agent.pack_data_x = pack_data_x
        agent.pack_data_obs = lambda *a, **kw: None
        agent.pack_data_meta = lambda *a, **kw: None
        s = Sample(agent)
        s.set(ACTION, np.zeros((T, dU)))
        s.set(ACTION_V, np.zeros((T, dV)))
        s.set(JOINT_ANGLES, np.random.randn(T, 2))
        s.set(JOINT_VELOCITIES, np.random.randn(T, 2))
        return s

    def test_return_outside_loop(self):
        """After P3-23: cost evaluated for ALL data types, not just the first."""
        from gps.proto.gps_pb2 import JOINT_ANGLES, JOINT_VELOCITIES
        from gps.algorithm.cost.cost_state import CostState
        hp = {
            'data_types': {
                JOINT_ANGLES: {
                    'wp': np.ones(2),
                    'target_state': np.zeros(2),
                },
                JOINT_VELOCITIES: {
                    'wp': np.ones(2),
                    'target_state': np.zeros(2),
                },
            },
            'ramp_option': 0,
            'wp_final_multiplier': 1.0,
            'l1': 0.0,
            'l2': 1.0,
            'alpha': 1e-2,
        }
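        cost = CostState(hp)
        sample = self._make_sample()
        l, lx, lu, lxx, luu, lux = cost.eval(sample)
        # Re-evaluate with only the first data type. If the totals match, the
        # second data type never contributed, which is the premature-return bug.
        hp_first = dict(hp, data_types={JOINT_ANGLES: hp['data_types'][JOINT_ANGLES]})
        l_first = CostState(hp_first).eval(sample)[0]
        assert not np.allclose(l, l_first), \
            "JOINT_VELOCITIES did not change the cost; premature return likely"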

Phase 1 gate: pytest -m unit --cov=python/gps/agent --cov=python/gps/algorithm/cost --cov=python/gps/sample --cov=python/gps/utility --cov-fail-under=60 -x exits 0.
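
The defect behind P3-23 is easy to reproduce in isolation. The toy functions below are not the `CostState.eval` code: the weights and the quadratic term are placeholders. They only show how an indented `return` ends the loop after the first data type and how dedenting it accumulates every term.

```python
def total_cost_buggy(terms):
    """`return` sits inside the loop, so only the first term is counted."""
    total = 0.0
    for _name, (weight, error) in terms.items():
        total += 0.5 * weight * error ** 2
        return total  # bug: exits on the first iteration
    return total


def total_cost_fixed(terms):
    """`return` dedented past the loop, so every term contributes."""
    total = 0.0
    for _name, (weight, error) in terms.items():
        total += 0.5 * weight * error ** 2
    return total


if __name__ == "__main__":
    # (weight, error) per data type; the keys mirror the names used in the tests.
    terms = {"JOINT_ANGLES": (1.0, 0.0), "JOINT_VELOCITIES": (2.0, 1.0)}
    print(total_cost_buggy(terms), total_cost_fixed(terms))
```

With a zero error on the first data type the buggy version reports zero total cost, which is the same setup the regression tests rely on: make the first term vanish so that a skipped second term becomes visible.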


Phase 2 — PyTorch Policy Optimizer Rewrite

Goal: policy_opt_pytorch.py is a complete, self-contained PyTorch nn.Module with zero TensorFlow references.

2.1 Root cause analysis

The file is a near-verbatim copy of policy_opt_tf.py: PyTorch layer definitions were pasted into __init__, but the execution layer (update, prob, save, restore) was never converted. Specific broken references confirmed by grep:

| Line | Symbol | Origin | Status |
|---|---|---|---|
| 55 | use_cuda (bare name) | undefined | NameError |
| 61 | net | undefined | NameError |
| 61 | self.obs_tensor, self.act_op, self.feat_op | TF graph ops | Never set |
| 76–77 | F.ReLU(...) | should be F.relu(...) | AttributeError |
| 146–162 | self.sess, feed_dict, self.solver.get_last_conv_values | TF 1.x session | Non-existent |
| 173–175 | Variable(obs) | pre-0.4 PyTorch API (deprecated 2018) | Deprecated; now returns a plain Tensor, so the wrapper should be removed |
| 177 | self.critetion | typo; not defined | AttributeError |
| 227–228 | tf.device(...), self.sess.run(...) | TF 1.x | ImportError / AttributeError |
| 242, 245 | self.saver.save, self.saver.restore | TF 1.x | AttributeError |
| 269–270 | from tensorflow.python.framework import ops | TF internal API | ImportError |
| 282 | '' | stray literal | Potential SyntaxError |

2.2 Rewrite specification

The rewritten class must:

  • Inherit from PolicyOpt (policy_opt.py) and nn.Module (torch)
  • __init__: build layers, init optimizer, set device correctly
  • forward(x): proper F.relu() activations
  • update(obs, tgt_mu, tgt_prc, tgt_wt): pure PyTorch training loop using loss.backward(); optimizer.step()
  • prob(obs): forward pass returning (output, pol_sigma, pol_prec, pol_det_sigma)
  • save_model(fname) / restore_model(fname): torch.save / torch.load
  • __getstate__ / __setstate__: serialize via io.BytesIO + torch.save
  • No import tensorflow anywhere in the file
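
The `__getstate__` / `__setstate__` requirement is the least obvious item in the list above, so here is a minimal sketch of the io.BytesIO + torch.save pattern. `TinyPolicyNet` is a hypothetical stand-in, not the real `PolicyOptPyTorch`: it omits the `PolicyOpt` base class, the optimizer and device handling. The layer sizes and the choice of `weights_only=True` are likewise assumptions.

```python
import io
import pickle

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPolicyNet(nn.Module):
    """Hypothetical stand-in for the policy network (illustration only)."""

    def __init__(self, dO, dU, hidden=32):
        super().__init__()  # must run before any layer is assigned
        self.dims = (dO, dU, hidden)
        self.fc1 = nn.Linear(dO, hidden)
        self.fc2 = nn.Linear(hidden, dU)

    def forward(self, x):
        # F.relu is the functional form; torch.nn.functional has no ReLU.
        return self.fc2(F.relu(self.fc1(x)))

    def __getstate__(self):
        # Serialize the weights to bytes so the pickle holds no live tensors.
        buf = io.BytesIO()
        torch.save(self.state_dict(), buf)
        return {"dims": self.dims, "weights": buf.getvalue()}

    def __setstate__(self, state):
        # pickle bypasses __init__, so rebuild the layers before loading weights.
        self.__init__(*state["dims"])
        weights = torch.load(
            io.BytesIO(state["weights"]), map_location="cpu", weights_only=True
        )
        self.load_state_dict(weights)


if __name__ == "__main__":
    net = TinyPolicyNet(dO=14, dU=7)
    clone = pickle.loads(pickle.dumps(net))
    obs = torch.randn(5, 14)
    with torch.no_grad():
        print(torch.allclose(net(obs), clone(obs)))
```

Storing constructor arguments next to the weight bytes keeps the pickle independent of the device the network was trained on, at the cost of requiring every constructor argument to be picklable.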

2.3 Test suite — Phase 2 gate

python/tests/unit/test_policy_opt_pytorch.py

"""Unit tests for the rewritten PyTorch policy optimizer.

All tests must pass with zero TF imports and PyTorch >= 2.1.
"""
import io
import pickle
import tempfile

import numpy as np
import pytest
import torch


@pytest.mark.unit
class TestPolicyOptPyTorch:
    """Tests are structured as: construct → forward → update → prob → serialize."""

    def _make_policy_opt(self, dO=14, dU=7, dV=7):
        from gps.algorithm.policy_opt.policy_opt_pytorch import PolicyOptPyTorch
        from gps.algorithm.policy_opt.config import POLICY_OPT_PYTORCH
        hp = dict(POLICY_OPT_PYTORCH)
        hp['network_params'] = {
            'obs_include': [],
            'sensor_dims': {},
            'obs_image_data': [],
        }
        return PolicyOptPyTorch(hp, dO, dU, dV)

    def test_no_tensorflow_import(self):
        """The module must not import tensorflow at any point."""
        import sys
        # Ensure tf is not loaded as side-effect of importing policy_opt_pytorch
        tf_loaded_before = 'tensorflow' in sys.modules
        from gps.algorithm.policy_opt import policy_opt_pytorch  # noqa: F401
        tf_loaded_after = 'tensorflow' in sys.modules
        if not tf_loaded_before:
            assert not tf_loaded_after, \
                "policy_opt_pytorch imported tensorflow as a side-effect"

    def test_instantiation(self):
        opt = self._make_policy_opt()
        assert opt is not None
        assert isinstance(opt, torch.nn.Module)

    def test_forward_output_shape(self):
        opt = self._make_policy_opt(dO=14, dU=7)
        x = torch.randn(5, 14)
        out = opt.forward(x)
        assert out.shape == (5, 7), f"Expected (5,7), got {out.shape}"

    def test_forward_no_nan(self):
        opt = self._make_policy_opt(dO=14, dU=7)
        x = torch.randn(100, 14)
        out = opt.forward(x)
        assert not torch.isnan(out).any(), "forward() produced NaN outputs"

    def test_relu_not_ReLU(self):
        """Verify F.relu (function) is used, not F.ReLU (class constructor call)."""
        import inspect
        from gps.algorithm.policy_opt.policy_opt_pytorch import PolicyOptPyTorch
        src = inspect.getsource(PolicyOptPyTorch.forward)
        assert 'F.relu(' in src or 'torch.relu(' in src or 'nn.functional.relu(' in src, \
            "forward() must use F.relu() not F.ReLU()"
        assert 'F.ReLU(' not in src, "F.ReLU( is incorrect — use F.relu("

    def test_update_returns_policy(self):
        opt = self._make_policy_opt(dO=14, dU=7)
        N, T = 4, 10
        obs = np.random.randn(N, T, 14)
        tgt_mu = np.random.randn(N, T, 7)
        tgt_prc = np.tile(np.eye(7), (N, T, 1, 1))
        tgt_wt = np.ones((N, T))
        policy = opt.update(obs, tgt_mu, tgt_prc, tgt_wt)
        assert policy is not None

    def test_prob_output_shapes(self):
        opt = self._make_policy_opt(dO=14, dU=7)
        # update first so policy.scale and policy.bias are set
        N, T = 4, 10
        obs = np.random.randn(N, T, 14)
        tgt_mu = np.random.randn(N, T, 7)
        tgt_prc = np.tile(np.eye(7), (N, T, 1, 1))
        tgt_wt = np.ones((N, T))
        opt.update(obs, tgt_mu, tgt_prc, tgt_wt)
        output, pol_sigma, pol_prec, pol_det_sigma = opt.prob(obs)
        assert output.shape == (N, T, 7)
        assert pol_sigma.shape == (N, T, 7, 7)
        assert pol_prec.shape == (N, T, 7, 7)
        assert pol_det_sigma.shape == (N, T)

    def test_prob_no_nan(self):
        opt = self._make_policy_opt(dO=14, dU=7)
        N, T = 2, 5
        obs = np.random.randn(N, T, 14)
        tgt_mu = np.random.randn(N, T, 7)
        tgt_prc = np.tile(np.eye(7), (N, T, 1, 1))
        tgt_wt = np.ones((N, T))
        opt.update(obs, tgt_mu, tgt_prc, tgt_wt)
        output, _, _, _ = opt.prob(obs)
        assert not np.isnan(output).any()

    def test_pickle_round_trip(self):
        """__getstate__ / __setstate__ must round-trip weights correctly."""
        opt = self._make_policy_opt(dO=14, dU=7)
        N, T = 4, 10
        obs = np.random.randn(N, T, 14)
        tgt_mu = np.random.randn(N, T, 7)
        tgt_prc = np.tile(np.eye(7), (N, T, 1, 1))
        tgt_wt = np.ones((N, T))
        opt.update(obs, tgt_mu, tgt_prc, tgt_wt)

        blob = pickle.dumps(opt)
        opt2 = pickle.loads(blob)

        # Verify weights survived round-trip
        out1, _, _, _ = opt.prob(obs)
        out2, _, _, _ = opt2.prob(obs)
        np.testing.assert_allclose(out1, out2, rtol=1e-5,
            err_msg="Policy outputs differ after pickle round-trip")

    def test_no_variable_deprecated(self):
        """torch.autograd.Variable must not be used (deprecated since PyTorch 0.4)."""
        import inspect
        from gps.algorithm.policy_opt import policy_opt_pytorch
        src = inspect.getsource(policy_opt_pytorch)
        assert 'Variable(' not in src, \
            "torch.autograd.Variable is deprecated; use tensors directly"

    def test_save_restore_model(self):
        opt = self._make_policy_opt(dO=14, dU=7)
        with tempfile.TemporaryDirectory() as td:
            path = td + '/model.pt'
            opt.save_model(path)
            opt2 = self._make_policy_opt(dO=14, dU=7)
            opt2.restore_model(path)
            # Compare parameter values
            for (name1, p1), (name2, p2) in zip(
                opt.named_parameters(), opt2.named_parameters()
            ):
                np.testing.assert_allclose(
                    # .cpu() keeps the comparison valid when the model lives on a CUDA device
                    p1.detach().cpu().numpy(), p2.detach().cpu().numpy(),
                    rtol=1e-5,
                    err_msg=f"Parameter {name1} differs after save/restore"
                )

Phase 2 gate: pytest python/tests/unit/test_policy_opt_pytorch.py -v exits 0 with all 11 tests passing.


Phase 3 — C++17 Modernization

Goal: all C++ compiles cleanly under -std=c++17 -Wall -Wextra -Werror with no Boost smart pointer headers required in any GPS-owned header file.

3.1 Complete change list

| # | File | Line(s) | Change |
|---|---|---|---|
| C17-01 | gps_agent_pkg/CMakeLists.txt | 1 | cmake_minimum_required(VERSION 3.14) |
| C17-02 | gps_agent_pkg/CMakeLists.txt | 14 | Remove OPTION(ENABLE_CXX11...), replace with set(CMAKE_CXX_STANDARD 17) and set(CMAKE_CXX_STANDARD_REQUIRED ON) |
| C17-03 | gps_agent_pkg/include/gps_agent_pkg/robotplugin.h | 10–11 | Remove #include <boost/scoped_ptr.hpp> and #include <boost/shared_ptr.hpp>; add #include <memory> |
| C17-04 | gps_agent_pkg/include/gps_agent_pkg/robotplugin.h | 33 | #define ros_publisher_ptr(X) macro: replace boost::scoped_ptr with std::unique_ptr |
| C17-05 | gps_agent_pkg/include/gps_agent_pkg/robotplugin.h | 63–71 | All boost::scoped_ptr<T> → std::unique_ptr<T> |
| C17-06 | gps_agent_pkg/include/gps_agent_pkg/robotplugin.h | 73–81 | All boost::shared_ptr<T> → std::shared_ptr<T> |
| C17-07 | gps_agent_pkg/include/gps_agent_pkg/options.h | 10, 30 | boost::variant<...> → std::variant<...>; remove #include <boost/variant.hpp> |
| C17-08 | gps_agent_pkg/include/gps_agent_pkg/sample.h | 12, 36 | Same variant migration |
| C17-09 | gps_agent_pkg/include/gps_agent_pkg/sensor.h | 9, 59, 61 | boost::scoped_ptr → std::unique_ptr |
| C17-10 | gps_agent_pkg/include/gps_agent_pkg/controller.h | 7, 35 | same |
| C17-11 | gps_agent_pkg/include/gps_agent_pkg/encodersensor.h | 12, 55–58, 91–93 | boost::shared_ptr → std::shared_ptr; boost::scoped_ptr → std::unique_ptr |
| C17-12 | gps_agent_pkg/include/gps_agent_pkg/trialcontroller.h | 10, 31, 33 | boost::scoped_ptr → std::unique_ptr |
| C17-13 | gps_agent_pkg/include/gps_agent_pkg/caffenncontroller.h | 22 | boost::scoped_ptr → std::unique_ptr |
| C17-14 | gps_agent_pkg/src/sensor.cpp | 13, 19 | C-style cast → static_cast<Sensor*> or factory returning std::unique_ptr<Sensor> |
| C17-15 | gps_agent_pkg/src/positioncontroller.cpp | 154–167 | Add else { return false; } / throw std::logic_error("unhandled mode") to is_finished() |
| C17-16 | gps_agent_pkg/src/sample.cpp | 38, 61, 173 | After each ROS_ERROR for out-of-bounds, add return; |
| C17-17 | gps_agent_pkg/src/robotplugin.cpp | 263, 586 | Replace while(!trylock()); with a bounded trylock pattern (10 attempts, 500 µs sleep; see 3.2) |
| C17-18 | gps_agent_pkg/src/robotplugin.cpp | 268, 280 | int d / int i loop variables → std::size_t to fix sign-compare |
| C17-19 | gps_agent_pkg/src/sample.cpp | 176, 193 | same sign-compare fix |
| C17-20 | gps_agent_pkg/src/encoderfilter.cpp | 54 | atof(...) → std::stod(...) with try/catch |
| C17-21 | gps_agent_pkg/src/pr2plugin.cpp | 235–237 | PLUGINLIB_DECLARE_CLASS → PLUGINLIB_EXPORT_CLASS |
| C17-22 | gps_agent_pkg/src/neuralnetworkcaffe.cpp | 79–81 | Raw delete weights → use std::unique_ptr<std::string> throughout |
| C17-23 | gps_agent_pkg/CMakeLists.txt | 42, 111 | Remove $ENV{GPS_ROOT_DIR} dependency; use ${CMAKE_CURRENT_SOURCE_DIR} |
| C17-24 | gps_agent_pkg/CMakeLists.txt | 48–68 | Add PROTOBUF_GENERATE_CPP target so make regenerates .pb.h on .proto changes |

3.2 Spinlock fix (C17-17) — reference implementation

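// robotplugin.cpp: replaces the busy-wait at line 263 (C17-17 lists line 586 as well).
// Needs #include <chrono> and #include <thread> for std::this_thread::sleep_for.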
{
    constexpr int kMaxTryLockAttempts = 10;
    int attempts = 0;
    while (!report_publisher_->trylock()) {
        if (++attempts >= kMaxTryLockAttempts) {
            ROS_WARN_THROTTLE(1.0, "RealtimePublisher lock contention; skipping report");
            return;
        }
        std::this_thread::sleep_for(std::chrono::microseconds(500));
    }
}

3.3 Test suite — Phase 3 gate

C++ tests are driven by CMake/CTest and the CI cpp job. In addition, the following Python-side tests check the proto-generated bindings that the Python and C++ sides share:

python/tests/unit/test_proto_roundtrip.py

"""Verify proto-generated Python bindings are consistent after proto3 migration."""
import pytest
import numpy as np


@pytest.mark.unit
class TestProtoRoundTrip:
    def test_sample_proto_fields_present(self):
        """All expected fields are accessible after proto3 migration."""
        from gps.proto.gps_pb2 import Sample
        s = Sample()
        # Fields must still be accessible — proto3 field presence semantics differ
        assert hasattr(s, 'T')
        assert hasattr(s, 'dX')
        assert hasattr(s, 'dU')
        assert hasattr(s, 'dV')
        assert hasattr(s, 'dO')

    def test_sample_type_enums_accessible(self):
        from gps.proto import gps_pb2
        assert hasattr(gps_pb2, 'ACTION')
        assert hasattr(gps_pb2, 'JOINT_ANGLES')
        assert hasattr(gps_pb2, 'JOINT_VELOCITIES')
        assert hasattr(gps_pb2, 'END_EFFECTOR_POINTS')

    def test_serialise_deserialise(self):
        from gps.proto.gps_pb2 import Sample
        s = Sample()
        s.T = 100
        s.dX = 26
        s.dU = 7
        blob = s.SerializeToString()
        s2 = Sample()
        s2.ParseFromString(blob)
        assert s2.T == 100
        assert s2.dX == 26
        assert s2.dU == 7

Phase 3 gate:

  • catkin_make -DCMAKE_CXX_STANDARD=17 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-Wall -Wextra -Werror" exits 0 with zero warnings
  • pytest python/tests/unit/test_proto_roundtrip.py -v exits 0

Phase 4 — Algorithm Core Test Suite

Goal: prove the iDG game-theoretic logic is correct by testing every algorithm module at unit and integration level. This phase produces the bulk of the test suite.

4.1 Unit tests

python/tests/unit/test_cost_functions.py

"""Unit tests for all cost function implementations."""
import numpy as np
import pytest


def make_sample(T=10, dX=4, dU=2, dV=2):
    import types
    from gps.sample.sample import Sample
    from gps.proto.gps_pb2 import ACTION, ACTION_V, JOINT_ANGLES
    agent = types.SimpleNamespace(
        T=T, dX=dX, dU=dU, dV=dV, dO=dX, dM=0,
        x_data_types=[JOINT_ANGLES],
        obs_data_types=[],
        meta_data_types=[],
        _x_data_idx={JOINT_ANGLES: list(range(dX))},
    )
    agent.pack_data_x = lambda *a, **kw: None
    agent.pack_data_obs = lambda *a, **kw: None
    agent.pack_data_meta = lambda *a, **kw: None
    s = Sample(agent)
    s.set(ACTION, np.random.randn(T, dU))
    s.set(ACTION_V, np.random.randn(T, dV))
    s.set(JOINT_ANGLES, np.random.randn(T, dX))
    return s


@pytest.mark.unit
class TestCostAction:
    def _make(self, mode='protagonist'):
        from gps.algorithm.cost.cost_action import CostAction
        return CostAction({'wu': np.ones(2), 'mode': mode, 'gamma': 1.0})

    def test_protagonist_returns_six_terms(self):
        cost = self._make('protagonist')
        result = cost.eval(make_sample())
        assert len(result) == 6

    def test_antagonist_returns_six_terms(self):
        cost = self._make('antagonist')
        result = cost.eval(make_sample())
        assert len(result) == 6

    def test_no_nan_protagonist(self):
        cost = self._make('protagonist')
        l, lx, lu, lxx, luu, lux = cost.eval(make_sample())
        assert not np.isnan(l).any()
        assert not np.isnan(lu).any()

    def test_no_nan_antagonist(self):
        cost = self._make('antagonist')
        l, lx, lu, lxx, luu, lux = cost.eval(make_sample())
        assert not np.isnan(l).any()

    def test_unknown_mode_raises(self):
        from gps.algorithm.cost.cost_action import CostAction
        cost = CostAction({'wu': np.ones(2), 'mode': 'invalid', 'gamma': 1.0})
        with pytest.raises((ValueError, SystemExit, TypeError)):
            cost.eval(make_sample())


@pytest.mark.unit
class TestCostStateDataTypes:
    def test_all_data_types_contribute(self):
        """Regression for P3-23: return inside loop."""
        from gps.proto.gps_pb2 import JOINT_ANGLES, JOINT_VELOCITIES
        from gps.algorithm.cost.cost_state import CostState
        hp = {
            'data_types': {
                JOINT_ANGLES:    {'wp': np.ones(4), 'target_state': np.zeros(4)},
                JOINT_VELOCITIES:{'wp': np.ones(4) * 2.0, 'target_state': np.zeros(4)},
            },
            'ramp_option': 0, 'wp_final_multiplier': 1.0,
            'l1': 0.0, 'l2': 1.0, 'alpha': 1e-2,
        }
        cost = CostState(hp)
        # Create sample where JOINT_VELOCITIES is nonzero but JOINT_ANGLES is zero
        s = make_sample(dX=8)
        s.set(JOINT_ANGLES, np.zeros((10, 4)))       # zero cost
        s.set(JOINT_VELOCITIES, np.ones((10, 4)))    # nonzero cost
        l, lx, lu, lxx, luu, lux = cost.eval(s)
        assert l.sum() > 0, "JOINT_VELOCITIES cost was silently skipped"

    def test_gamma_and_mode_from_hyperparams(self):
        """Regression for P3-22: gamma and mode must not be hardcoded."""
        from gps.proto.gps_pb2 import JOINT_ANGLES
        from gps.algorithm.cost.cost_state import CostState
        hp_prot = {'data_types': {JOINT_ANGLES: {'wp': np.ones(4), 'target_state': np.zeros(4)}},
                   'ramp_option': 0, 'wp_final_multiplier': 1.0,
                   'l1': 0.0, 'l2': 1.0, 'alpha': 1e-2, 'mode': 'protagonist'}
        hp_ant  = dict(hp_prot, mode='antagonist')
        c_prot  = CostState(hp_prot)
        c_ant   = CostState(hp_ant)
        assert c_prot._hyperparams.get('mode') == 'protagonist'
        assert c_ant._hyperparams.get('mode')  == 'antagonist'

python/tests/unit/test_dynamics.py

"""Unit tests for DynamicsLRPrior — standard and robust fitting."""
import numpy as np
import pytest


@pytest.mark.unit
class TestDynamicsLRPrior:
    def _make(self):
        from gps.algorithm.dynamics.dynamics_lr_prior import DynamicsLRPrior
        from gps.algorithm.dynamics.config import DYN_PRIOR_GMM
        hp = dict(DYN_PRIOR_GMM)
        hp.update({'regularization': 1e-6})
        return DynamicsLRPrior(hp)

    def test_fit_sets_Fm_fv(self):
        dyn = self._make()
        N, T, dX, dU = 5, 10, 4, 2
        X = np.random.randn(N, T, dX)
        U = np.random.randn(N, T, dU)
        dyn.fit(X, U)
        assert dyn.Fm.shape == (T - 1, dX, dX + dU)
        assert dyn.fv.shape == (T - 1, dX)
        assert not np.isnan(dyn.Fm).any()
        assert not np.isnan(dyn.fv).any()

    def test_fit_robust_sets_Fm_fv(self):
        dyn = self._make()
        N, T, dX, dU, dV = 5, 10, 4, 2, 2
        X = np.random.randn(N, T, dX)
        U = np.random.randn(N, T, dU)
        V = np.random.randn(N, T, dV)
        dyn.fit_robust(X, U, V)
        assert dyn.Fm.shape == (T - 1, dX, dX + dU + dV)
        assert not np.isnan(dyn.Fm).any()

    def test_fit_degenerate_not_nan(self):
        """Regularization must prevent NaN when X is nearly constant."""
        dyn = self._make()
        N, T, dX, dU = 3, 5, 4, 2
        X = np.ones((N, T, dX)) + 1e-10 * np.random.randn(N, T, dX)
        U = np.random.randn(N, T, dU)
        dyn.fit(X, U)
        assert not np.isnan(dyn.Fm).any()
        assert not np.isnan(dyn.fv).any()

python/tests/unit/test_traj_opt_lqr.py

"""Unit tests for TrajOptLQRPython — standard and robust passes."""
import numpy as np
import pytest


def make_algorithm_stub(T=10, dX=4, dU=2, dV=2, M=1):
    """Minimal algorithm stub for traj_opt tests."""
    import types
    from gps.algorithm.algorithm_utils import IterationData, TrajectoryInfo
    from gps.algorithm.policy.lin_gauss_policy import LinearGaussianPolicy

    alg = types.SimpleNamespace(
        T=T, dX=dX, dU=dU, dV=dV, M=M,
        cur=[IterationData() for _ in range(M)],
    )
    for m in range(M):
        alg.cur[m].traj_info = TrajectoryInfo()
        alg.cur[m].traj_info.dynamics = types.SimpleNamespace(
            Fm=np.zeros((T-1, dX, dX+dU)),
            fv=np.zeros((T-1, dX)),
            dyn_covar=np.tile(1e-3*np.eye(dX), (T-1, 1, 1)),
        )
        alg.cur[m].traj_info.x0mu = np.zeros(dX)
        alg.cur[m].traj_info.x0sigma = np.eye(dX)
        alg.cur[m].traj_info.cc = np.zeros(T)
        alg.cur[m].traj_info.cv = np.zeros((T, dX+dU))
        alg.cur[m].traj_info.Cm = np.tile(np.eye(dX+dU), (T, 1, 1))
        alg.cur[m].eta = 1.0
        alg.cur[m].step_mult = 1.0
        K = np.zeros((T, dU, dX))
        k = np.zeros((T, dU))
        pS = np.tile(np.eye(dU), (T, 1, 1))
        ipS = np.tile(np.eye(dU), (T, 1, 1))
        cpS = np.tile(np.eye(dU), (T, 1, 1))
        alg.cur[m].traj_distr = types.SimpleNamespace(
            T=T, dX=dX, dU=dU, K=K, k=k,
            pol_covar=pS, inv_pol_covar=ipS, chol_pol_covar=cpS,
        )
    return alg


@pytest.mark.unit
class TestTrajOptLQRPython:
    def _make(self):
        from gps.algorithm.traj_opt.traj_opt_lqr_python import TrajOptLQRPython
        from gps.algorithm.traj_opt.config import TRAJ_OPT_LQR
        return TrajOptLQRPython(dict(TRAJ_OPT_LQR))

    def test_update_returns_traj_distr_and_eta(self):
        traj_opt = self._make()
        alg = make_algorithm_stub()
        result = traj_opt.update(0, alg)
        assert len(result) == 2  # (traj_distr, eta)

    def test_update_no_nan(self):
        traj_opt = self._make()
        alg = make_algorithm_stub()
        traj_distr, eta = traj_opt.update(0, alg)
        assert not np.isnan(traj_distr.K).any()
        assert not np.isnan(traj_distr.k).any()
        assert not np.isnan(np.atleast_1d(eta)).any()

    def test_eta_non_negative(self):
        """Dual variable must remain non-negative."""
        traj_opt = self._make()
        alg = make_algorithm_stub()
        _, eta = traj_opt.update(0, alg)
        scalar_eta = float(np.atleast_1d(eta)[0])
        assert scalar_eta >= 0, f"eta={scalar_eta} must be non-negative"

python/tests/unit/test_gmm.py

"""Unit tests for the GMM utility (used in dynamics prior)."""
import numpy as np
import pytest


@pytest.mark.unit
class TestGMM:
    def _make(self):
        from gps.utility.gmm import GMM
        return GMM()

    def test_estep_mstep_convergence(self):
        gmm = self._make()
        N, D = 200, 4
        # Generate simple two-cluster data
        X = np.vstack([
            np.random.randn(N//2, D) + np.array([5, 0, 0, 0]),
            np.random.randn(N//2, D) - np.array([5, 0, 0, 0]),
        ])
        gmm.update(X, K=2)
        assert gmm.sigma.shape == (2, D, D)
        assert gmm.mu.shape == (2, D)
        assert abs(gmm.pi.sum() - 1.0) < 1e-9
        # Means should approximately separate the two clusters
        centers = np.sort(np.abs(gmm.mu[:, 0]))
        assert centers[1] > 2.0, "GMM did not separate clusters"

    def test_no_nan_after_update(self):
        gmm = self._make()
        X = np.random.randn(100, 6)
        gmm.update(X, K=3)
        assert not np.isnan(gmm.sigma).any()
        assert not np.isnan(gmm.mu).any()

4.2 Integration tests

python/tests/integration/test_box2d_protagonist.py

"""Integration test: 2 full GPS iterations on Box2D arm (protagonist only)."""
import numpy as np
import pytest


@pytest.mark.integration
def test_box2d_protagonist_two_iterations():
    """Run 2 complete GPS iterations, verify cost stays finite and bounded."""
    import importlib.util, os
    exp_path = os.path.join(
        os.path.dirname(__file__),
        '../../../experiments/box2d_mdgps_protagonist/hyperparams.py'
    )
    if not os.path.exists(exp_path):
        pytest.skip("Experiment hyperparams not found")

    spec = importlib.util.spec_from_file_location('hyperparams', exp_path)
    hp = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(hp)
    config = hp.config

    # Override for speed
    config['iterations'] = 2
    config['num_samples'] = 2
    config['gui_on'] = False

    from gps.gps_main import GPSMain
    gps = GPSMain(config, closeloop=False, robust=False)

    costs_per_iter = []
    original_log = gps._log_data

    def tracking_log(itr, traj_sample_lists, pol_sample_lists=None):
        costs = [
            np.mean(gps.algorithm.cur[cond].cs)
            for cond in gps.algorithm._cond_idx
        ]
        costs_per_iter.append(np.mean(costs))
        original_log(itr, traj_sample_lists, pol_sample_lists)

    gps._log_data = tracking_log
    gps.run()

    assert len(costs_per_iter) == 2, f"Expected 2 logged iterations, got {len(costs_per_iter)}"
    # Cost should not explode (weak check — does not require decrease in 2 iters)
    assert not np.isnan(costs_per_iter).any()
    assert all(c < 1e6 for c in costs_per_iter), \
        f"Cost exploded: {costs_per_iter}"


@pytest.mark.integration
def test_box2d_idg_two_iterations():
    """Run 2 complete iDG (robust) iterations, verify no crash."""
    import importlib.util, os
    exp_path = os.path.join(
        os.path.dirname(__file__),
        '../../../experiments/box2d_mdgps_idg_y0.5/hyperparams.py'
    )
    if not os.path.exists(exp_path):
        pytest.skip("iDG experiment hyperparams not found")

    spec = importlib.util.spec_from_file_location('hyperparams', exp_path)
    hp = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(hp)
    config = hp.config
    config['iterations'] = 2
    config['num_samples'] = 2
    config['gui_on'] = False

    from gps.gps_main import GPSMain
    gps = GPSMain(config, closeloop=False, robust=True)
    # Should complete without raising
    gps.run_robust()

4.3 Fault injection tests

python/tests/fault/test_fault_injection.py

"""Fault injection: verify graceful handling of bad inputs."""
import numpy as np
import pytest


@pytest.mark.fault
class TestFaultInjection:
    def test_nan_dynamics_does_not_propagate_silently(self):
        """If dynamics fitting receives NaN data, it must either raise or
        return NaN-free dynamics; it must never pass NaN on to the policies."""
        from gps.algorithm.dynamics.dynamics_lr_prior import DynamicsLRPrior
        from gps.algorithm.dynamics.config import DYN_PRIOR_GMM
        hp = dict(DYN_PRIOR_GMM, regularization=1e-6)
        dyn = DynamicsLRPrior(hp)
        N, T, dX, dU = 5, 10, 4, 2
        X = np.random.randn(N, T, dX)
        X[2, 5, 1] = np.nan   # inject a single NaN
        U = np.random.randn(N, T, dU)
        try:
            dyn.fit(X, U)
            # If it doesn't raise, the result must not be NaN
            assert not np.isnan(dyn.Fm).any(), \
                "NaN propagated silently through dynamics fit"
        except (ValueError, np.linalg.LinAlgError, RuntimeError):
            pass  # Raising is acceptable

    def test_missing_pkl_returns_none(self):
        from gps.utility.data_logger import DataLogger
        result = DataLogger().unpickle('/tmp/__nonexistent_gps_test_xyz__.pkl')
        assert result is None

    def test_gamma_zero_does_not_divide_by_zero(self):
        """gamma=0 in cost_action should not produce inf/nan."""
        from gps.algorithm.cost.cost_action import CostAction
        import types
        from gps.sample.sample import Sample
        from gps.proto.gps_pb2 import ACTION, ACTION_V
        T, dU, dV = 5, 2, 2
        agent = types.SimpleNamespace(
            T=T, dX=4, dU=dU, dV=dV, dO=4, dM=0,
            x_data_types=[], obs_data_types=[], meta_data_types=[],
        )
        agent.pack_data_x = lambda *a, **kw: None
        s = Sample(agent)
        s.set(ACTION, np.random.randn(T, dU))
        s.set(ACTION_V, np.random.randn(T, dV))
        cost = CostAction({'wu': np.ones(dU), 'mode': 'protagonist', 'gamma': 0.0})
        l, lx, lu, lxx, luu, lux = cost.eval(s)
        assert not np.isnan(l).any()
        assert not np.isinf(l).any()

    def test_gamma_inf_does_not_crash(self):
        """gamma=inf in cost_action should not crash (may produce inf cost, not NaN)."""
        from gps.algorithm.cost.cost_action import CostAction
        import types
        from gps.sample.sample import Sample
        from gps.proto.gps_pb2 import ACTION, ACTION_V
        T, dU, dV = 5, 2, 2
        agent = types.SimpleNamespace(
            T=T, dX=4, dU=dU, dV=dV, dO=4, dM=0,
            x_data_types=[], obs_data_types=[], meta_data_types=[],
        )
        agent.pack_data_x = lambda *a, **kw: None
        s = Sample(agent)
        s.set(ACTION, np.random.randn(T, dU))
        s.set(ACTION_V, np.random.randn(T, dV))
        cost = CostAction({'wu': np.ones(dU), 'mode': 'protagonist', 'gamma': np.inf})
        try:
            l, _, _, _, _, _ = cost.eval(s)
            assert not np.isnan(l).any(), "gamma=inf produced NaN (not inf)"
        except (ValueError, OverflowError, FloatingPointError):
            pass  # Raising is also acceptable
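
The first fault test above accepts either an exception or a NaN-free result from DynamicsLRPrior.fit(). One way to get the stricter "raise early" behavior is an input guard called at the top of fit(). The sketch below is an assumption about how such a guard could look; require_finite is a hypothetical helper, not an existing function in the codebase.

```python
import numpy as np


def require_finite(name, array):
    """Raise ValueError if `array` contains NaN or inf; otherwise return it as an ndarray.

    Hypothetical helper: not part of the existing GPS codebase.
    """
    arr = np.asarray(array, dtype=float)
    bad = ~np.isfinite(arr)
    if bad.any():
        # Report the first offending index so the bad sample is easy to locate.
        first = tuple(int(i) for i in np.argwhere(bad)[0])
        raise ValueError(
            f"{name} contains {int(bad.sum())} non-finite value(s); first at index {first}"
        )
    return arr


# Usage: same data layout as the fault test, with one injected NaN.
X = np.zeros((5, 10, 4))
X[2, 5, 1] = np.nan
try:
    require_finite("X", X)
except ValueError as exc:
    message = str(exc)
```

Calling the guard on X, U (and V in fit_robust) before any linear algebra would turn a silent NaN policy into an immediate, attributable error.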

Phase 4 gate: pytest -m "unit or fault" -x --cov=python/gps --cov-fail-under=65 exits 0.


Phase 5 — Proto3 Migration and Dependency Pinning

5.1 Proto2 → Proto3

gps_agent_pkg/proto/gps.proto currently uses syntax = "proto2" with:

  • optional field qualifier (proto2 only)
  • [default = 100] on field T (proto2 only)
  • [packed=true] on repeated fields (default in proto3, explicit in proto2)

Migration steps:

  1. Change syntax = "proto2"; → syntax = "proto3";
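  2. Remove optional qualifiers wherever presence is not checked (proto3 singular scalar fields default to implicit presence, i.e. no HasField()); fields that need presence tracking are handled in step 6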
  3. Remove [default = 100] — set T at call site instead
  4. Keep [packed=true] annotations (they are valid in proto3 for compatibility)
  5. Regenerate gps.pb.h and gps_pb2.py via CMake target (C17-24 above)
  6. Update Python call sites that relied on HasField(): proto3 scalar fields carry no presence information by default, so either declare those fields with the proto3 optional keyword (enabled by default since protoc 3.15) or wrap them in a oneof
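
Step 3 moves the default horizon out of the schema: a proto3 scalar field that is never assigned reads back as 0, not 100. A minimal sketch of a call-site default follows. DEFAULT_T, new_sample_message, and the stand-in class FakeSample are hypothetical names used only for illustration; in the real code the generated Sample class from gps_pb2 would be passed instead.

```python
DEFAULT_T = 100  # value formerly supplied by [default = 100] in the proto2 schema


class FakeSample:
    """Stand-in that mimics proto3 scalar defaults (unset integers read as 0)."""

    def __init__(self):
        self.T = 0
        self.dX = 0
        self.dU = 0


def new_sample_message(message_cls, T=DEFAULT_T, **fields):
    """Create a message and assign T explicitly so it never falls back to 0."""
    msg = message_cls()
    msg.T = T
    for name, value in fields.items():
        setattr(msg, name, value)
    return msg


nominal = new_sample_message(FakeSample, dX=26, dU=7)  # T comes from DEFAULT_T
short = new_sample_message(FakeSample, T=50)           # caller overrides the horizon
```

Routing every construction through one helper keeps the former schema default in a single place, which makes it easy to grep for and to change.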

5.2 Pinned requirements.txt

# python/requirements.txt  (runtime)
numpy>=1.24,<2.0
scipy>=1.10,<2.0  # SciPy 1.11+ requires Python 3.9, which would break the Python 3.8 CI job
torch>=2.1,<3.0
tensorflow>=2.13,<3.0
protobuf>=4.23,<5.0
scikit-image>=0.21
visdom>=0.2.4

# requirements-dev.txt  (test/CI only)
pytest>=7.4
pytest-cov>=4.1
pytest-timeout>=2.1
pytest-xdist>=3.3
pytest-mock>=3.11
memory-profiler>=0.61
psutil>=5.9
black>=23.0
isort>=5.12

5.3 Dockerfile modernization

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV LANG=en_US.UTF-8
ENV ROS_DISTRO=noetic

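# NOTE: ROS Noetic binaries are built for Ubuntu 20.04; on this 22.04 base the
# ros-noetic-* packages resolve only after a suitable ROS apt source is added.
RUN apt-get update && apt-get install -y \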
    python3.11 python3.11-dev python3-pip \
    ros-noetic-catkin catkin-tools \
    libboost-all-dev libeigen3-dev libprotobuf-dev protobuf-compiler \
    && rm -rf /var/lib/apt/lists/*

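RUN python3.11 -m pip install --upgrade pip
COPY python/requirements.txt /tmp/
# Install with the same interpreter that will run GPS (pip3 targets the system python3)
RUN python3.11 -m pip install -r /tmp/requirements.txt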

# Build GPS C++ package
COPY . /workspace/catkin_ws/src/gps
WORKDIR /workspace/catkin_ws
RUN . /opt/ros/noetic/setup.sh && \
    catkin build gps_agent_pkg \
      --cmake-args -DCMAKE_CXX_STANDARD=17 \
                   -DCMAKE_BUILD_TYPE=Release

Phase 5 gate: pytest python/tests/unit/test_proto_roundtrip.py -v exits 0 after proto3 migration. Docker image builds successfully and runs python -c "import gps; print('ok')".


Phase 6 — Load, Soak, and Concurrency Testing

This phase validates production-grade reliability characteristics. These tests run in CI nightly (not on every push) due to duration.
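
One way to keep the long-running suites out of a plain pytest invocation is a collection hook in conftest.py. The sketch below is an assumption about how that could be wired up: is_nightly_only and NIGHTLY_MARKERS are hypothetical names, while pytest_collection_modifyitems is a standard pytest hook and markexpr is the option that stores the -m expression.

```python
NIGHTLY_MARKERS = {"load", "soak"}  # marker names used by the Phase 6 tests


def is_nightly_only(marker_names):
    """Return True if a test carries any marker reserved for the nightly run."""
    return bool(NIGHTLY_MARKERS & set(marker_names))


def pytest_collection_modifyitems(config, items):
    """Skip load/soak tests unless the caller selected markers with -m."""
    if config.getoption("markexpr"):
        return  # e.g. `pytest -m load` or `pytest -m soak`: honor the explicit selection
    import pytest  # imported lazily so the helper above is usable without pytest

    skip = pytest.mark.skip(reason="nightly-only; select with -m load or -m soak")
    for item in items:
        if is_nightly_only(marker.name for marker in item.iter_markers()):
            item.add_marker(skip)
```

With this in place a bare pytest run reports the load and soak tests as skipped, while the nightly commands that pass -m load or -m soak are unaffected.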

6.1 Load tests

python/tests/load/test_load_trajectory_scale.py

"""Load tests: measure performance scaling with N, T, M."""
import time

import numpy as np
import pytest


@pytest.mark.load
@pytest.mark.parametrize("N,T,M", [
    (5,  100, 4),   # nominal
    (20, 100, 4),   # 4x samples
    (5,  500, 4),   # 5x horizon
    (5,  100, 16),  # 4x conditions
])
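def test_dynamics_fit_scales_linearly(N, T, M):
    """DynamicsLRPrior.fit() must stay within a fixed 2 s wall-time budget at each scale.

    This is an absolute budget, not a measured scaling exponent; M is not an
    input to fit() and appears only in the failure message.
    """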
    from gps.algorithm.dynamics.dynamics_lr_prior import DynamicsLRPrior
    from gps.algorithm.dynamics.config import DYN_PRIOR_GMM
    dX, dU = 26, 7
    hp = dict(DYN_PRIOR_GMM, regularization=1e-6)
    dyn = DynamicsLRPrior(hp)
    X = np.random.randn(N, T, dX)
    U = np.random.randn(N, T, dU)

    start = time.perf_counter()
    dyn.fit(X, U)
    elapsed = time.perf_counter() - start

    # Budget: 2 seconds max for any configuration in this parametrize set
    assert elapsed < 2.0, \
        f"fit() took {elapsed:.2f}s for N={N},T={T},M={M} — too slow"


@pytest.mark.load
def test_memory_traj_opt_no_unbounded_growth():
    """TrajOptLQRPython must not leak memory across 50 iterations."""
    import gc, os, psutil
    from tests.unit.test_traj_opt_lqr import make_algorithm_stub
    from gps.algorithm.traj_opt.traj_opt_lqr_python import TrajOptLQRPython
    from gps.algorithm.traj_opt.config import TRAJ_OPT_LQR

    traj_opt = TrajOptLQRPython(dict(TRAJ_OPT_LQR))
    process = psutil.Process(os.getpid())

    # Warm-up
    for _ in range(5):
        alg = make_algorithm_stub(T=100, dX=26, dU=7, dV=7)
        traj_opt.update(0, alg)

    gc.collect()
    rss_before = process.memory_info().rss

    for _ in range(50):
        alg = make_algorithm_stub(T=100, dX=26, dU=7, dV=7)
        traj_opt.update(0, alg)

    gc.collect()
    rss_after = process.memory_info().rss
    growth_mb = (rss_after - rss_before) / 1024 / 1024
    assert growth_mb < 50, \
        f"Memory grew {growth_mb:.1f} MB over 50 traj_opt calls — possible leak"
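
The fixed 2-second budget above does not by itself show how fit() scales. A minimal sketch of a scaling check, under the assumption that wall time follows a power law t ∝ n^p between two problem sizes, estimates p from two timings. scaling_exponent is a hypothetical helper and the timings are synthetic, not measurements from the codebase.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate p in t ~ n**p from two (size, time) measurements."""
    if min(n_small, n_large, t_small, t_large) <= 0 or n_small == n_large:
        raise ValueError("sizes and times must be positive and sizes must differ")
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Synthetic example: 4x the data points (n = N * T), 4.4x the wall time.
p = scaling_exponent(n_small=500, t_small=0.10, n_large=2000, t_large=0.44)
is_sub_quadratic = p < 2.0
```

In a real test, n would be N * T and each timing would come from time.perf_counter(), ideally the median of several repeats to damp scheduler noise.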

6.2 Soak test

python/tests/soak/test_soak_training.py

"""
Soak test: run GPS training for N_SOAK_ITERS iterations (default 10; the 8-hour overnight run uses 200).

Run with:  pytest -m soak --timeout=36000 python/tests/soak/test_soak_training.py

Set environment variable N_SOAK_ITERS=200 for overnight run.
Set N_SOAK_ITERS=10 for a quick smoke test.
"""
import gc
import logging
import os
import time

import numpy as np
import psutil
import pytest


log = logging.getLogger(__name__)
SOAK_ITERS = int(os.environ.get('N_SOAK_ITERS', 10))


@pytest.mark.soak
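@pytest.mark.timeout(36000)
@pytest.mark.filterwarnings("error:overflow:RuntimeWarning")  # enforce the overflow criterion below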
def test_soak_gps_training():
    """
    Acceptance criteria:
    - RSS growth < 10 MB/hour
    - P99 iteration latency < 3x P50
    - Zero RuntimeWarning: overflow in dynamics fitting
    - Zero NaN in any cost or traj_distr
    """
    import importlib.util
    exp_path = os.path.join(
        os.path.dirname(__file__),
        '../../../experiments/box2d_mdgps_protagonist/hyperparams.py'
    )
    if not os.path.exists(exp_path):
        pytest.skip("Experiment not found")

    spec = importlib.util.spec_from_file_location('hyperparams', exp_path)
    hp_mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(hp_mod)
    config = hp_mod.config
    config['iterations'] = SOAK_ITERS
    config['num_samples'] = 3
    config['gui_on'] = False

    from gps.gps_main import GPSMain
    gps = GPSMain(config, closeloop=False, robust=False)

    process = psutil.Process(os.getpid())
    latencies = []
    rss_samples = [(time.time(), process.memory_info().rss)]

    original_take_iter = gps._take_iteration

    def instrumented_take_iter(itr, sample_lists, **kw):
        t0 = time.perf_counter()
        original_take_iter(itr, sample_lists, **kw)
        elapsed = time.perf_counter() - t0
        latencies.append(elapsed)
        gc.collect()  # collect first so garbage awaiting collection is not counted as growth
        rss_samples.append((time.time(), process.memory_info().rss))
        # Check for NaN in current cost
        for m in range(gps.algorithm.M):
            cs = gps.algorithm.cur[m].cs
            if cs is not None:
                assert not np.isnan(cs).any(), \
                    f"NaN in cur[{m}].cs at iteration {itr}"

    gps._take_iteration = instrumented_take_iter
    gps.run()

    # --- Assertions ---
    latencies_arr = np.array(latencies)
    p50 = np.percentile(latencies_arr, 50)
    p99 = np.percentile(latencies_arr, 99)
    log.info("P50 latency: %.3fs  P99: %.3fs", p50, p99)
    assert p99 < 3 * p50, \
        f"P99 ({p99:.3f}s) > 3 * P50 ({p50:.3f}s) — latency spike detected"

    if len(rss_samples) >= 2:
        t_start, rss_start = rss_samples[0]
        t_end, rss_end = rss_samples[-1]
        elapsed_hours = max((t_end - t_start) / 3600, 1e-6)
        growth_mb_per_hour = (rss_end - rss_start) / 1024 / 1024 / elapsed_hours
        log.info("RSS growth: %.2f MB/hour", growth_mb_per_hour)
        assert growth_mb_per_hour < 10, \
            f"Memory leak detected: {growth_mb_per_hour:.2f} MB/hour"
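
The endpoint difference used above is sensitive to a single noisy RSS sample at the start or end of the run. A least-squares slope over all (timestamp, rss) pairs is one more robust alternative. The sketch below assumes samples in the same (seconds, bytes) form as rss_samples; rss_growth_mb_per_hour is a hypothetical helper name and the data is synthetic.

```python
def rss_growth_mb_per_hour(samples):
    """Least-squares slope of RSS over time, in MB/hour.

    `samples` is a sequence of (timestamp_seconds, rss_bytes) pairs.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    times = [t for t, _ in samples]
    sizes = [rss for _, rss in samples]
    t_mean = sum(times) / len(times)
    s_mean = sum(sizes) / len(sizes)
    var_t = sum((t - t_mean) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - t_mean) * (s - s_mean) for t, s in zip(times, sizes))
    bytes_per_second = cov / var_t
    return bytes_per_second * 3600 / (1024 * 1024)


# Synthetic run: one sample per minute for an hour, growing 0.1 MiB per minute.
MIB = 1024 * 1024
samples = [(60.0 * i, 500 * MIB + i * MIB / 10) for i in range(61)]
growth = rss_growth_mb_per_hour(samples)
```

The same 10 MB/hour threshold can then be applied to the fitted slope instead of the raw endpoint difference.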

6.3 Concurrency / race condition test (ThreadSanitizer)

Add to .github/workflows/ci.yml a separate tsan job:

  tsan:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Build with ThreadSanitizer
        run: |
          source /opt/ros/noetic/setup.bash
          catkin build gps_agent_pkg \
            --cmake-args -DCMAKE_CXX_STANDARD=17 \
                         -DCMAKE_BUILD_TYPE=RelWithDebInfo \
                         -DCMAKE_CXX_FLAGS="-fsanitize=thread -g"
      - name: Run TSan integration test
        run: |
          # Inject 50ms jitter, run 30-second ROS test, assert no data races
          rostest gps_agent_pkg tsan_robotplugin.test
        timeout-minutes: 5

Phase 6 gate (nightly CI):

  • Load tests exit 0 under -m load
  • Soak test exits 0 with N_SOAK_ITERS=200 (overnight)
  • TSan job exits 0 with zero data race reports

Migration Completion Checklist

| Phase | Deliverable | Gate |
|---|---|---|
| 0 | Test infra, conftest.py, pytest.ini, GitHub Actions skeleton | CI green on scaffolding |
| 1 | All 30 Python 2→3 fixes | pytest -m unit --cov-fail-under=60 exits 0 |
| 2 | policy_opt_pytorch.py complete rewrite | 11 PyTorch unit tests pass, zero TF imports |
| 3 | C++17, Boost→std, spinlock fix, missing return fix | catkin_make -Werror clean build; proto round-trip test passes |
| 4 | Algorithm core unit + integration + fault tests | pytest -m "unit or fault" --cov-fail-under=65 exits 0 |
| 5 | Proto3 migration, pinned deps, new Dockerfile | Docker build clean; proto test passes |
| 6 | Load, soak, TSan concurrency tests | Nightly CI green; RSS growth < 10 MB/hr; P99 < 3×P50 |

Summary Scorecard (Current vs Target)

| Dimension | Current | Target after all phases |
|---|---|---|
| Python 3.8+ compatibility | 2/10 | 9/10 |
| Logic correctness | 4/10 | 9/10 |
| C++17 compliance | 3/10 | 9/10 |
| Test coverage | 1/10 | 8/10 |
| CI/CD | 0/10 | 9/10 |
| Build reproducibility | 2/10 | 8/10 |
| Observability | 3/10 | 7/10 |
| Production readiness | 2/10 | 8/10 |

This document was produced from exhaustive static analysis of all 237 Python files, 41 C++ files, 5 build-system files, the proto schema, and 133+ experiment hyperparameter files. Every finding is cross-verified at exact file paths and line numbers.