Skip to content

Commit e7e9c39

Browse files
mjhuffSyntaxColoringsfoster1
authored
fix(robot-server, api): fix command history accumulation across protocol runs (#19108)
Closes [RQA-3917](https://opentrons.atlassian.net/browse/RQA-3917) <!-- Thanks for taking the time to open a Pull Request (PR)! Please make sure you've read the "Opening Pull Requests" section of our Contributing Guide: https://github.com/Opentrons/opentrons/blob/edge/CONTRIBUTING.md#opening-pull-requests GitHub provides robust markdown to format your PR. Links, diagrams, pictures, and videos along with text formatting make it possible to create a rich and informative PR. For more information on GitHub markdown, see: https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax To ensure your code is reviewed quickly and thoroughly, please fill out the sections below to the best of your ability! --> # Overview Although protocol engine is eventually dereferenced during the protocol run lifecycle, there exists a circular reference between command history and additional objects that isn't well captured by the gc lib or memray call stack analysis. Fully clearing the run's `CommandHistory` before dereferencing the run orchestrator eliminates a memory leak. <!-- Describe your PR at a high level. State acceptance criteria and how this PR fits into other work. Link issues, PRs, and other relevant resources. --> ## Test Plan and Hands on Testing - See ticket for the wonderful script made by @SyntaxColoring, which was modified to run a step-intensive protocol 25 times in a simulated environment. The following outputs were generated with memray, passing no additional flags when generating the bin file. To convert the bin file to an HTML file, the following was run: `memray flamegraph --split-threads --temporal /path/to/file`. ### Twenty-Five Simulated Protocol Runs (with PR only) <img width="1451" height="411" alt="Screenshot 2025-08-01 at 4 20 37 PM" src="https://github.com/user-attachments/assets/577ea481-8106-43e9-9ff8-8f41e3dbcae3" /> Note the total memory usage (which may vary by robot). Compare with `edge` output, below. We expect to see some increase in total heap usage up to a point as various caching occurs. After run 17, there is no more apparent memory increase. ### Twenty-Five Simulated Protocol Runs (`edge` prior to any recent memory fixes, without PR) <img width="1475" height="380" alt="Screenshot 2025-08-01 at 10 48 33 PM" src="https://github.com/user-attachments/assets/95c8b516-011e-4f8c-a890-505ad57bd1e2" /> Note that after the 25th run, total `opentrons-robot-server` heap allocation is substantially greater than the above case. ### Twenty-Five Simulated Protocol Runs (with PR), No LRU Caching, #19107 Cherry-Pick Included <img width="1464" height="358" alt="Screenshot 2025-08-01 at 4 28 56 PM" src="https://github.com/user-attachments/assets/df9ad20c-ddfd-48c4-8b15-f3f350d68a5a" /> Effectively no increase in memory utilization after initialization and the completion of the second protocol run. ### Two Real Protocol Runs (with PR), No LRU Caching Included <img width="1459" height="370" alt="Screenshot 2025-08-01 at 4 30 29 PM" src="https://github.com/user-attachments/assets/da3f2fa8-294e-4639-a551-423e86ed375f" /> The various spikes during the run are because of camera captures via HTTP. ### Six Real Protocol Runs (with PR, #19107, #19110, #19109, #19071) <img width="1458" height="361" alt="Screenshot 2025-08-01 at 11 03 41 PM" src="https://github.com/user-attachments/assets/5bd18f3f-4a78-4540-8502-b5ec0a5b2e9f" /> Run between 10-40 minutes. The end of run memory for run 2 is 504MB, which is equivalent to the end of run 6 memory. The memray HTML analysis file is too large to attach directly on github, but it's included in the ticket. <!-- Describe your testing of the PR. Emphasize testing not reflected in the code. Attach protocols, logs, screenshots and any other assets that support your testing. --> ## Changelog - Fixed command history accumulating in memory across protocol runs. <!-- List changes introduced by this PR considering future developers and the end user. Give careful thought and clear documentation to breaking changes. --> <!-- - What do you need from reviewers to feel confident this PR is ready to merge? - Ask questions. --> ## Risk assessment low - we are clearing state exactly before we dereference the run orchestrator, at which point we don't expect this state to be available, anyway. <!-- - Indicate the level of attention this PR needs. - Provide context to guide reviewers. - Discuss trade-offs, coupling, and side effects. - Look for the possibility, even if you think it's small, that your change may affect some other part of the system. - For instance, changing return tip behavior may also change the behavior of labware calibration. - How do your unit tests and on hands on testing mitigate this PR's risks and the risk of future regressions? - Especially in high risk PRs, explain how you know your testing is enough. --> [RQA-3917]: https://opentrons.atlassian.net/browse/RQA-3917?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: Max Marrone <[email protected]> Co-authored-by: Seth Foster <[email protected]>
1 parent c8d5916 commit e7e9c39

File tree

7 files changed

+37
-1
lines changed

7 files changed

+37
-1
lines changed

api/src/opentrons/protocol_engine/protocol_engine.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -645,6 +645,10 @@ def set_error_recovery_policy(self, policy: ErrorRecoveryPolicy) -> None:
645645
"""Replace the run's error recovery policy with a new one."""
646646
self._action_dispatcher.dispatch(SetErrorRecoveryPolicyAction(policy))
647647

648+
def clear_command_history(self) -> None:
649+
"""Clear command history."""
650+
self._state_store.clear_command_history()
651+
648652

649653
# TODO(tz, 7-12-23): move this to shared data when we dont relay on ErrorOccurrence
650654
def code_in_error_tree(

api/src/opentrons/protocol_engine/state/command_history.py

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -255,6 +255,19 @@ def set_command_failed(self, command: Command) -> None:
255255
self._set_most_recently_completed_command_id(command.id)
256256
self._all_failed_command_ids.append(command.id)
257257

258+
# TODO(jh, 08-01-25) Although protocol engine is garbage collected, command history persists in memory between protocol runs.
259+
# Explicitly clearing all history before dereferencing protocol engine and the run's run orchestrator eliminates
260+
# memory accumulation. Investigate further.
261+
def clear(self) -> None:
262+
"""Clear state."""
263+
self._commands_by_id.clear()
264+
self._all_command_ids.clear()
265+
self._all_failed_command_ids.clear()
266+
self._all_command_ids_but_fixit_command_ids.clear()
267+
self._queued_command_ids.clear()
268+
self._queued_setup_command_ids.clear()
269+
self._queued_fixit_command_ids.clear()
270+
258271
def _add(self, command_id: str, command_entry: CommandEntry) -> None:
259272
"""Create or update a command entry."""
260273
if command_id not in self._commands_by_id:

api/src/opentrons/protocol_engine/state/commands.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -301,6 +301,10 @@ def handle_action(self, action: Action) -> None:
301301
case _:
302302
pass
303303

304+
def clear_history(self) -> None:
305+
"""Clears CommandHistory state."""
306+
self._state.command_history.clear()
307+
304308
def _handle_queue_command_action(self, action: QueueCommandAction) -> None:
305309
# TODO(mc, 2021-06-22): mypy has trouble with this automatic
306310
# request > command mapping, figure out how to type precisely

api/src/opentrons/protocol_engine/state/state.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -317,6 +317,10 @@ def predicate() -> _ReturnT:
317317

318318
return await self._wait_for(condition=predicate, truthiness_to_wait_for=True)
319319

320+
def clear_command_history(self) -> None:
321+
"""Clear CommandHistory state."""
322+
self._command_store.clear_history()
323+
320324
async def wait_for_not(
321325
self,
322326
condition: Callable[_ParamsT, _ReturnT],

api/src/opentrons/protocol_runner/run_orchestrator.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -490,3 +490,7 @@ def _map_parse_mode_to_python_parse_mode(parse_mode: ParseMode) -> PythonParseMo
490490
return PythonParseMode.ALLOW_LEGACY_METADATA_AND_REQUIREMENTS
491491
else:
492492
raise UnknownProtocolParseMode()
493+
494+
def clear_command_history(self) -> None:
495+
"""Force cleanup of command history."""
496+
self._protocol_engine.clear_command_history()

robot-server/robot_server/runs/run_orchestrator_store.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -295,7 +295,9 @@ async def clear(self) -> RunResult:
295295
run_time_parameters = self.run_orchestrator.get_run_time_parameters()
296296
command_annotations = self.run_orchestrator.get_command_annotations()
297297

298-
self._run_orchestrator = None
298+
if self._run_orchestrator is not None:
299+
self._run_orchestrator.clear_command_history()
300+
self._run_orchestrator = None
299301

300302
return RunResult(
301303
state_summary=run_data,

robot-server/tests/runs/test_run_orchestrator_store.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -215,7 +215,12 @@ async def test_clear_engine(subject: RunOrchestratorStore) -> None:
215215
notify_publishers=mock_notify_publishers,
216216
)
217217
assert subject._run_orchestrator is not None
218+
engine = subject._run_orchestrator._protocol_engine
219+
engine.state_view.state.commands.command_history._queued_command_ids.add("1231")
218220
result = await subject.clear()
221+
assert (
222+
len(engine.state_view.state.commands.command_history._queued_command_ids) == 0
223+
)
219224

220225
assert subject.current_run_id is None
221226
assert isinstance(result, RunResult)

0 commit comments

Comments
 (0)