- 
                Notifications
    
You must be signed in to change notification settings  - Fork 24
 
Add v2 RETURNN config serialization #601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| 
           I brought back Dim serialization in a way where RETURNN is not imported in the manager.  | 
    
| known_modules: Collection[str] = (), | ||
| sis_path_handling: SisPathHandling = None, | ||
| ): | ||
| from returnn.tensor import Dim, batch_dim, single_step_dim | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import requires returnn to be available for i6_core, I think this is not the case in many setups, maybe we should make the import conditional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't go through all of the code (until half of the _Serializer class), but here's my feedback. Feel free to fix what you think is appropriate :)
| 
               | 
          ||
| We handle those objects specially: | ||
| - primitive types (int, float, bool, str) | ||
| - Sisyphus Path objects | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding Path serialization (1/2, also see below): as far as I understand, you also seem to be handling sisyphus variables by using _instanciate_delayed_copy (as seen here). Is this correct?
If it's indeed correct, are you sure that Path handling is triggered? I assume that you added this because you found some case in which it wasn't supported. But if it's not needed, maybe you could also drop explicit support for Path. Path inherits from AbstractPath, which in turn inherits from DelayedBase (source). That would make the code easier by also dropping e.g. SisPathHandling entirely.
Note: Path.get() calls Path.get_path(), as seen here, so either call is equivalent.
| .strip() | ||
| ) | ||
| assert isinstance(_base_sys_path_list, list) and all(isinstance(p, str) for p in _base_sys_path_list) | ||
| return _base_sys_path_list | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you always use the output of this function along with the in operator. I'd recommend to convert to a set (all elements are strings anyway, so they're hashable).
| return name.isidentifier() and not iskeyword(name) | ||
| 
               | 
          ||
| 
               | 
          ||
| class ReturnnConfigWithNewSerialization(ReturnnConfig): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReturnnConfigV2 is a shorter name, and matches the directory structure. Better or not, I leave it to you... I personally think it's better 🙂
| except OSError: | ||
| # cannot get source, e.g. code is a lambda | ||
| assert name is not None | ||
| args = [ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it seems "obvious" that you're trying to serialize the function manually, the code seems extremely obscure. I'd recommend adding a source as a comment that we can look for in case of doubt about the functionality of this code.
Is it just an exhaustive enumeration of the second column in https://docs.python.org/3/library/inspect.html, type "code"? If so, here's a suggestion for a comment:
| args = [ | |
| # Dump all elements that identify this piece of code as a function. | |
| # The exhaustive enumeration below is taken from https://docs.python.org/3/library/inspect.html, type "code". | |
| args = [ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this comes from https://github.com/rwth-i6/i6_core/blob/main/returnn/config.py#L293. The code is practically equivalent, so I strongly suggest to find ways of combining both chunks of code.
For a preliminary idea: you're only using _parse_python inside ReturnnConfigWithNewSerialization. Moreover, I see that the only difference is an assertion in the code (https://github.com/rwth-i6/i6_core/blob/main/returnn/config.py#L299) which you don't seem to have (is this intended?). Because of this two factors, I suggest to make the _parse_python function a method in ReturnnConfigWithNewSerialization with the following structure:
class ReturnnConfigWithNewSerialization(ReturnnConfig):
    # Nitpick: note the double underscore at the beginning to ideally match the definition of the superclass.
    def __parse_python(...):
        try:
            return super().__parse_python(...)
        except AssertionError:
            # Thrown by the superclass method in some circumstances: https://github.com/rwth-i6/i6_core/blob/main/returnn/config.py#L299.
            # We ignore the assertion on purpose because TODO.
            return self.__parse_python(code.get())
| 
               | 
          ||
| class ReturnnConfigWithNewSerialization(ReturnnConfig): | ||
| """ | ||
| Overwrites the serialization behavior of ReturnnConfig | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Overwrites the serialization behavior of ReturnnConfig | |
| V2 RETURNN config with simpler and better serialization. This class is completely retrocompatible with respect to V1. | |
| Allows serializing lambda functions as well as sisyphus `Path` and `Variable`, among the supported data types in the base RETURNN config class. | |
| Migration is recommended. | 
If possible add more details to motivate people to migrate to this better serialization config class.
| python_prolog_code, | ||
| serialized.as_serialized_code(), | ||
| python_epilog_code, | ||
| "# -*- mode: python; tab-width: 4 -*-\n", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this supposed to be at the top of the file?
| known_modules={"returnn"}, | ||
| extra_sys_paths=extra_sys_paths, | ||
| # instanciate_delayed should already have handled it (e.g. Path), | ||
| # or if it has not, then we want to keep it as it is (e.g. PtCheckpoint), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding Path serialization (2/2): from the comments here I see that you already thought of Path being handled by instanciate_delayed. Can you provide an example of an object (e.g. PtCheckpoint as per the comments) that might not be properly handled?
| return "".join(code.py_code for code in self.code_list) | ||
| 
               | 
          ||
| 
               | 
          ||
| class _Serializer: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: I strongly prefer having basic code as much to the top of the file as possible (e.g. _Serializer), and then code which builds from such code (e.g. serialize_config).
This also includes methods from the classes, e.g. _handle_next_queue_item would precede work_queue.
But that's my preference, maybe yours differs.
| _Ref(value): name for name, value in self._internal_reserved_names.items() | ||
| } | ||
| 
               | 
          ||
| def work_queue(self): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming the function for clarity and adding a few more details as a TL;DR for the docstring?
| def work_queue(self): | |
| def resolve_serialization_values(self): | |
| """ | |
| Runs the serialization queue, making sure that all values can be properly serialized. | |
| If a value depends on another value, the latter is serialized first. | 
| 
               | 
          ||
| 
               | 
          ||
| @dataclass | ||
| class SerializedConfig: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that as_serialization_collection is not used (isn't it?). I propose dropping this class and using "".join(...) instead.
This PR adds an adapted version of the v2 serialization written by @albertz. I intend for this PR to house discussion on what features to support in v2 serialization and what not to support.
The adaptations I made mainly concern themselves around being compatible with existing RETURNN configs that use the v1 serialization mechanisms like the
Import,PartialImport, etc.