ID Generator creation is non-blocking and can cause failure for large files #1343

@Chyroprase

Description

Describe the bug

When an IdGenerator is created from a sufficiently large JSON file, creating the IdGenerator actor can take a long time, because the constructor parameters are slow to serialize and deserialize. It can take so long that the actor is not yet constructed and registered when the rest of the pipeline, such as FuzzyDeduplicationWorkflow, tries to use it.

Steps/Code to reproduce bug

Loading a ~3M JSON serialization file reproduces the issue about 90% of the time on my AnyScale/GCS setup. The failure occurs the next time the IdGenerator is used, with the error "Failed to look up actor with name 'curator_deduplication_id_generator'".

Expected behavior

Since the actor was created earlier, the lookup should succeed.

Additional context

The cause is the following: actor creation in Ray is not synchronous. When IdGenerator.remote(...) is called, the actor is registered in Ray only once the constructor has finished. If the actor is looked up too soon, the lookup can fail. The usual fix is to add a method def ready() -> bool: return True to the actor, and to call it before the function that creates the actor returns. Will send a fix soon.
