Conversation

@CormickKneey
Collaborator

Description

Implement Flask-based RPC server with serialization and corresponding tests

Related Issue

Task Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context



Collaborator

@garrett4wade garrett4wade left a comment

Thank you for the contribution! The implementation looks good overall, but we should take special care with error handling and data serialization support.

Comment on lines 48 to 51
# Custom extension type codes for msgspec
CUSTOM_TYPE_PICKLE = 0
CUSTOM_TYPE_CLOUDPICKLE = 1
CUSTOM_TYPE_RAW_VIEW = 2
Collaborator

Create an enum class for this.
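
A minimal sketch of what such an enum could look like. The class name CustomTypeCode is hypothetical; the three values are the constants from the snippet above. IntEnum is used here on the assumption that the codes must still behave as plain integers wherever they are passed around.

```python
from enum import IntEnum


class CustomTypeCode(IntEnum):
    """Extension type codes for msgspec (class name is hypothetical)."""

    PICKLE = 0
    CLOUDPICKLE = 1
    RAW_VIEW = 2


def parse_type_code(code: int) -> CustomTypeCode:
    """Map a raw integer code to its enum member.

    Raises ValueError for unknown codes, so an unsupported code fails
    loudly instead of falling through a chain of comparisons.
    """
    return CustomTypeCode(code)
```

Because the members are integers, existing comparisons such as code == CUSTOM_TYPE_PICKLE can be migrated one at a time.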

@CormickKneey CormickKneey force-pushed the refactor/rpc_framework branch 2 times, most recently from 7640d4d to 4c7e433 Compare November 13, 2025 11:55
Comment on lines 176 to 172
with tempfile.TemporaryDirectory() as tmpdir:
    tokenizer.save_pretrained(tmpdir)
Collaborator

Can we just use the original model path from which the tokenizer was loaded? A temp dir may not be synchronized across different nodes.

If we must use a temp dir, ensure that it is located under cluster.fileroot. See cli_args.py for details.
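
A minimal sketch of the second option, assuming the value of cluster.fileroot is available as a plain path string on storage that every node can see. The helper name shared_tmpdir and its prefix argument are hypothetical; only the standard-library tempfile API is used.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def shared_tmpdir(fileroot: str, prefix: str = "tokenizer_"):
    """Yield a temporary directory created under `fileroot`.

    `fileroot` stands in for cluster.fileroot (an assumption); the
    directory is removed again when the context exits.
    """
    os.makedirs(fileroot, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=fileroot, prefix=prefix) as tmpdir:
        yield tmpdir
```

With this helper, the snippet above would become "with shared_tmpdir(fileroot) as tmpdir:" followed by the same save_pretrained call.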

Collaborator Author

Actually, the tokenizer being serialized here may have been modified by internal logic after creation, for example by calling add_tokens() or add_special_tokens().
So my solution is to save it into a temp directory, zip that directory, and write the archive into the I/O buffer to be sent.

Comment on lines 184 to 190
compression = (
    zipfile.ZIP_STORED if total_size < 512 * 1024 else zipfile.ZIP_DEFLATED
)
with zipfile.ZipFile(
    zip_buffer, "w", compression=compression, compresslevel=6
) as zf:
    for root, _, files in os.walk(tmpdir):
        for f in files:
            zf.write(
                os.path.join(root, f),
                arcname=os.path.relpath(os.path.join(root, f), tmpdir),
            )
Collaborator

I don't think the zip/unzip round trip will decrease loading time, though.

Collaborator Author

Yeah... Let me find a replacement~

else:
    return Response(
        success=False, message=f"Unknown engine name: {config.class_name}"
    )
Collaborator

On the client side, we can automatically convert the class to a string because they are all built-in classes in AReaL. The server can accept a string path and import the class dynamically. Using enums and if-else is not very extensible.

FYI, I implemented such logic in PR #528: https://github.com/inclusionAI/AReaL/pull/528/files#diff-b441cd1ab2d7679b06541869f7a78cc4330650c65ff8dff07bbfe5b568fd337eR124-R130. You don't have to follow my original implementation, but we'd better not drop any error handling or extensibility designs.
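
To illustrate the idea (this is a sketch, not the code from PR #528): the client turns a class into a dotted path, and the server resolves that path with importlib. The function names class_to_path and import_from_string are hypothetical, and the error types are an assumption about how failures might be reported.

```python
import importlib


def class_to_path(cls: type) -> str:
    """Client side: turn a class into a dotted import path string."""
    return f"{cls.__module__}.{cls.__qualname__}"


def import_from_string(path: str):
    """Server side: resolve a dotted path such as 'pkg.module.ClassName'.

    Raises ValueError for a malformed path and ImportError when the
    module or attribute cannot be found, so the caller can turn either
    case into an error response instead of crashing.
    """
    module_name, _, attr = path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"Expected 'module.ClassName', got {path!r}")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"Cannot import module {module_name!r}") from exc
    try:
        return getattr(module, attr)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {attr!r}") from exc
```

A real server would probably also restrict which module prefixes are accepted, since importing an arbitrary client-supplied path executes that module's top-level code.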

@CormickKneey CormickKneey force-pushed the refactor/rpc_framework branch 3 times, most recently from 8a98f89 to db035ec Compare November 14, 2025 07:48
@CormickKneey
Collaborator Author

/gemini review

@CormickKneey CormickKneey marked this pull request as ready for review November 14, 2025 07:56
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new Flask-based RPC server and client, along with a custom serializer for handling complex data types like PyTorch tensors. While the new components are well-structured and the serializer itself is well-tested, there's a critical issue: the new serializer is not actually integrated into the RPC client and server. This means the new RPC implementation currently only supports JSON-serializable types, and will fail when trying to transmit complex objects like tensors, which seems to defeat the main purpose of introducing a custom serializer. I've provided detailed comments on this and a few other points.

@CormickKneey CormickKneey force-pushed the refactor/rpc_framework branch from 421dd56 to 8213732 Compare November 16, 2025 14:29
@CormickKneey CormickKneey force-pushed the refactor/rpc_framework branch from 8213732 to d4fe507 Compare November 16, 2025 14:31
@CormickKneey CormickKneey force-pushed the refactor/rpc_framework branch from d4fe507 to 616f6de Compare November 16, 2025 15:04
@CormickKneey
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant improvement by refactoring the RPC server to use Flask and implementing a robust serialization mechanism for complex data types. The new serialization module is well-designed and thoroughly tested. The RPC server is now more structured with clear endpoints and better error handling. I have a few suggestions regarding performance, maintainability, and robustness that could further enhance the implementation.

@CormickKneey CormickKneey force-pushed the refactor/rpc_framework branch from 59f60b0 to 2f72bb6 Compare November 17, 2025 03:09