ID Generator creation is non-blocking and can cause failure for large files

**Describe the bug**

When an IdGenerator is [created](https://github.com/NVIDIA-NeMo/Curator/blob/aaf633038c71de15f6658cd1f9160c14ceab256c/nemo_curator/stages/deduplication/id_generator.py#L116) from a sufficiently large JSON file, actor creation can take a long time because the constructor parameters are slow to serialize and deserialize. It can take so long that the actor is not yet constructed and registered by the time it is used later in the pipeline, for example in FuzzyDeduplicationWorkflow.

**Steps/Code to reproduce bug**

Loading a ~3M JSON serialization file reproduces the issue about 90% of the time on my AnyScale/GCS setup. The error occurs the next time the IdGenerator is used, with the message "Failed to look up actor with name 'curator_deduplication_id_generator'".


**Expected behavior**

Since the actor was created earlier in the pipeline, looking it up by name should succeed.


**Additional context**

The cause is the following: actor creation in Ray is not synchronous. When IdGenerator.remote(....) is called, the actor is registered in Ray only once the constructor has finished. If the actor is looked up too soon, the lookup can fail. The usual fix is to add a method `def ready(self) -> bool: return True` to the actor and to call it before returning from the function that creates the actor. Will send a fix soon.


ID Generator creation is non-blocking and can cause failure for large files #1343
