Contributor

@JenniferWang JenniferWang commented Oct 9, 2025

  • Fix the test fixture YAML files
  • Rewrite the logic so the round-trip test works across multiple nodes.

Specifically,

  • Move the validation function to the policy.py module, because @endpoint works almost exclusively with lambda functions.
  • Refactor the test to make setup / teardown clearer

Closes a sub-task in #411; closes #143

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Oct 9, 2025
@JenniferWang JenniferWang marked this pull request as ready for review October 9, 2025 21:23
@Jack-Khuu
Contributor

Can we close #143 once this lands too :)?

)

# Cleanup DCP directory
path = Path(TEST_DCP_DIR)
Contributor

You can use TemporaryDirectory, which handles this cleanup automatically.

Contributor Author

I had to change this because NFS does not work for the tmp directory.
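The two positions above are not mutually exclusive: `tempfile.TemporaryDirectory` accepts a `dir` argument, so the directory can be created under a caller-chosen root instead of the system tmp location. The following is a minimal sketch only. The helper name `make_dcp_dir`, the `test_dcp_` prefix, and the idea of passing in a shared root are assumptions for illustration, and a local directory stands in for the NFS mount. Whether cleanup behaves well on the actual NFS setup is not verified here.

```python
import tempfile
from pathlib import Path


def make_dcp_dir(shared_root: str) -> tempfile.TemporaryDirectory:
    """Create a self-cleaning checkpoint directory under `shared_root`.

    `shared_root` is a hypothetical path that every node can see
    (for example an NFS mount); it is not taken from the PR.
    """
    return tempfile.TemporaryDirectory(dir=shared_root, prefix="test_dcp_")


# Demo: a local temporary directory stands in for the shared root.
with tempfile.TemporaryDirectory() as stand_in_root:
    with make_dcp_dir(stand_in_root) as dcp_dir:
        created_under_root = Path(dcp_dir).parent == Path(stand_in_root)
        existed_inside = Path(dcp_dir).is_dir()
    # Leaving the inner `with` block removes the directory and its contents.
    removed_after = not Path(dcp_dir).exists()
```

With this shape the test no longer needs an explicit cleanup step, while the location of the directory stays under the test's control.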

# We only care about the final output
params.output_kind = RequestOutputKind.FINAL_ONLY
return params

Contributor

Can you explain why this has to be in policy.py? Is monarch failing to pick up/serialize the function if it's not in here?
If it has to be here, maybe prefix all the functions with _.

Contributor Author

Yes, monarch pickled the function with the module path integration_tests, but this cannot be resolved on the remote node. It seems that we need to define it in the main modules.
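The behavior described above matches how Python's standard `pickle` handles named functions: it stores a reference (module path plus function name), not the function body, so the receiving process must be able to import that module. The sketch below reproduces this with the standard library only. The module name `fake_integration_tests` and the function `validate_fn` are made up for the demo, and Monarch's own pickling layer may differ in details.

```python
import pickle
import sys
import types

# Build a throwaway module that stands in for a test-only module.
mod = types.ModuleType("fake_integration_tests")
exec("def validate_fn(x):\n    return x > 0\n", mod.__dict__)
sys.modules["fake_integration_tests"] = mod

# Pickling a named function records where to find it, not its code.
payload = pickle.dumps(mod.validate_fn)
has_module_path = b"fake_integration_tests" in payload
has_function_name = b"validate_fn" in payload

# Simulate the remote node, where the module cannot be imported.
del sys.modules["fake_integration_tests"]
try:
    pickle.loads(payload)
    missing_module = None
except ModuleNotFoundError as exc:
    missing_module = exc.name
```

Unpickling fails at import time, before the function is ever called, which is why the error surfaces while the remote side is decoding the call arguments.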

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 126 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@2a6e46f). Learn more about missing BASE report.

Files with missing lines                         Patch %   Lines
tests/integration_tests/test_policy_update.py    0.00%     82 Missing ⚠️
src/forge/actors/policy.py                       0.00%     44 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #365   +/-   ##
=======================================
  Coverage        ?   63.24%           
=======================================
  Files           ?       78           
  Lines           ?     7725           
  Branches        ?        0           
=======================================
  Hits            ?     4886           
  Misses          ?     2839           
  Partials        ?        0           

Member

@joecummings joecummings left a comment

While the test is amazing, having it in the policy.py is not really an option IMO.

From a UX / library perspective, it makes things way more difficult to understand. The argument could be made that directing users to docs alleviates this pain, but 1) our docs are incredibly incomplete and 2) we know from OSS users' experience with titan/tune that most people prefer to just go to the code.

@allenwang28
Contributor

While the test is amazing, having it in the policy.py is not really an option IMO.

Oh I completely missed that part. Is it possible to create an inherited Policy that has all of the test functionality and is kept in integration_tests?

I'm not sure I'm clear what the monarch pickling issue is

@joecummings
Member

@allenwang28 What are our options here? Surely someone else has tried to run a multi-node test via Monarch without having to clutter up an Actor?

@Jack-Khuu
Contributor

Jack-Khuu commented Oct 10, 2025

Wild thought (lmk if this errors): For test-specific methods, can we just append them to the class definition in the test? It's the same idea as inheriting, but lets you surgically change the nested PolicyWorker

Within test:

from foo import Bar

def util(self):
  ...

Bar._special_test_util = util

proxy: Bar = Bar()

I'm not sure I'm clear what the monarch pickling issue is

I'd like to see the trace too since at least 2 others have encountered this when using Monarch/Services
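A runnable version of the class-patching idea sketched in this comment might look like the following. It uses a plain Python class, so it only shows the mechanics of attaching a method at test time and removing it afterwards. `Bar` here is a hypothetical stand-in defined inline, and whether the same approach works on a Monarch actor or its endpoints is an open question in the thread, not something this sketch confirms.

```python
class Bar:
    """Hypothetical stand-in for the class under test."""

    def __init__(self):
        self.params = {"w": 1.0}


def _special_test_util(self):
    # Test-only helper: report the parameter names held by the instance.
    return sorted(self.params)


# Attach the helper for the duration of the test, then remove it so the
# patched class does not leak into other tests in the same process.
Bar._special_test_util = _special_test_util
try:
    proxy = Bar()
    result = proxy._special_test_util()
finally:
    del Bar._special_test_util

still_patched = hasattr(Bar, "_special_test_util")
```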

@JenniferWang
Contributor Author

@joecummings , @allenwang28 , @Jack-Khuu

The pickle problem is, I think, caused by how Monarch serializes arguments (in this case, passing a lambda vs. a function variable to an endpoint on the Policy service).

WARNING  forge.controller.service.replica:replica.py:257 Got failure on replica 0. Error:
A remote actor call has failed.
 Traceback of where the remote call failed (most recent call last):
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 875, in handle
    result = await instrumented()
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 872, in instrumented
    raise e
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 865, in instrumented
    result = await the_method(*args, **kwargs)
  File "/mnt/home/jiyue/forge/src/forge/actors/policy.py", line 451, in _test_validate_model_params
    return await self.policy_worker._test_validate_model_params.call(validate_fn)
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/future.py", line 138, in mark_complete
    func, value = fut.set_result, await coro
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/endpoint.py", line 147, in process
    rank, value = await r._recv()
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 728, in _recv
    return self._process(result)
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 758, in _process
    return rank, super()._process(msg)
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 737, in _process
    raise cast(Exception, payload)
monarch._src.actor.actor_mesh.ActorError: A remote actor call has failed.
 Traceback of where the remote call failed (most recent call last):
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 811, in handle
    args, kwargs = unflatten(message, local_state)
  File "/mnt/data/jiyue/lib/python3.10/site-packages/monarch/_src/actor/pickle.py", line 98, in unflatten
    return up.load()
ModuleNotFoundError: No module named 'tests.integration_tests'

@allenwang28
Contributor

@JenniferWang how do you run this workload?

@JenniferWang
Contributor Author

@JenniferWang how do you run this workload?

pytest -s tests/integration_tests/test_policy_update.py::TestWeightSync::test_sanity_check --config apps/grpo/qwen3_1_7b.yaml

@JenniferWang
Contributor Author

@allenwang28, indeed, I think it's because of how the script is invoked.

@allenwang28
Contributor

got it, I wonder if PYTHONPATH=. pytest ... works?

@JenniferWang JenniferWang force-pushed the fix-policy-update-test branch 2 times, most recently from 290e506 to 13c8499 Compare October 14, 2025 13:23
@JenniferWang
Contributor Author

got it, I wonder if PYTHONPATH=. pytest ... works?

Magic!
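One plausible reading of why the `PYTHONPATH=.` prefix helps: environment variables are inherited by child processes, and every Python interpreter started with `PYTHONPATH` set adds those entries to `sys.path`, so a package under the repository root becomes importable there too. The sketch below demonstrates that mechanism with a throwaway package. The package name `demo_tests_pkg` is made up, and it is an assumption that the workers in this thread are started as processes that inherit the launcher's environment; the thread only reports that the prefix worked.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

CHILD_CODE = "import demo_tests_pkg; print(demo_tests_pkg.VALUE)"


def run_child(cwd, extra_env):
    """Start a fresh interpreter in `cwd` with a controlled PYTHONPATH."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    env.update(extra_env)
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        cwd=cwd,
        env=env,
        capture_output=True,
        text=True,
    )


with tempfile.TemporaryDirectory() as repo_root, \
        tempfile.TemporaryDirectory() as elsewhere:
    # A tiny package that lives only under the stand-in "repository root".
    pkg = Path(repo_root) / "demo_tests_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("VALUE = 42\n")

    # The child runs from another directory, like a worker on a remote node.
    without_path = run_child(elsewhere, {})
    with_path = run_child(elsewhere, {"PYTHONPATH": repo_root})
```

The first child fails with `ModuleNotFoundError`, the same error class as in the traceback above, and the second one imports the package successfully.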

@casteryh
Contributor

got it, I wonder if PYTHONPATH=. pytest ... works?

Magic!

Can you add how to run this test to the docstring?

@JenniferWang JenniferWang force-pushed the fix-policy-update-test branch from fb47d9b to bf0289e Compare October 15, 2025 13:48
@JenniferWang JenniferWang merged commit 703f419 into main Oct 15, 2025
9 checks passed
allenwang28 pushed a commit to allenwang28/forge that referenced this pull request Oct 15, 2025
Development

Successfully merging this pull request may close these issues.

Introduce multi-worker integration tests between RLEngine and PolicyEngine.

6 participants