Slow resolve reference for large output DB #627

@gpetretto

I have noticed that, as the output DB grows, resolving references becomes a bottleneck for job execution.
I have a DB with ~8000 job outputs in atomate2, and resolving the references for the store_inputs job of an elastic flow was taking hours. Adding a MongoDB index on the output collection with {uuid: 1, index: -1} led to a huge speedup.
Admittedly, I am not working with a very powerful DB, but I expect this kind of problem to affect more powerful machines as well once the DB grows larger.
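
For anyone who wants to try the same thing, here is a minimal sketch of how such an index could be created from Python with pymongo. The key specification is the one quoted above; the connection URI, database name, and collection name in the usage comment are placeholders I made up, so adapt them to your own store configuration.

```python
from typing import Any, List, Tuple

# Compound index from the issue text: {uuid: 1, index: -1}.
# 1 means ascending and -1 means descending (same values as
# pymongo.ASCENDING and pymongo.DESCENDING).
OUTPUT_INDEX_KEYS: List[Tuple[str, int]] = [("uuid", 1), ("index", -1)]


def ensure_output_index(collection: Any) -> str:
    """Create the compound index on the given collection.

    `collection` is expected to behave like a pymongo Collection.
    Returns the index name reported by the driver.
    """
    return collection.create_index(OUTPUT_INDEX_KEYS)


# Usage sketch (hypothetical URI, database, and collection names):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   ensure_output_index(client["my_jobflow_db"]["outputs"])
```

Calling `create_index` again with the same keys should be harmless, since MongoDB leaves an existing identical index in place.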

I am opening this issue to check whether I am the only one experiencing this problem and to ask whether there is a set of suggested indexes to add to the output DB.
Maybe there is room for some optimization in the code? At the very least, it would be good to analyze the most common queries and add a list of suggested indexes to the documentation.
