I am running the library file creation on the server, and it struggles with the Tanimoto score computation for the TopKTanimotoScores.

My impression is that the bottleneck is mostly memory footprint: for some reason it uses over 100 GB of RAM at that step. It does finish correctly, and relatively fast for negative-mode spectra only, but for the full library it took at least an hour. It would be good to allow batched computation, since we are only interested in the top 10 highest scores. This would reduce the memory footprint and would also allow printing a progress bar, which is currently lacking and makes it hard to tell whether the run has crashed or is still going.

We could do something like what I did for the MS2DeepScore computation: compute in batches and select the top k per batch to free up memory. Just make sure it handles the parallelization well and that this does not cause too much overhead.
```python
from typing import Tuple

import numpy as np
from tqdm import tqdm


def predict_top_k_ms2deepscores(
    library_embeddings: Embeddings, query_embeddings: Embeddings, batch_size: int = 500, k: int = 1
) -> Tuple[np.ndarray, np.ndarray]:
    """Memory-efficient way of calculating the highest MS2DeepScores.

    When doing large matrix multiplications, storing the full output matrix can take a lot of
    memory. E.g. for 500,000 vs 10,000 spectra this is a very large matrix, and on a laptop this
    can result in very slow run times. If only the highest MS2DeepScores are needed, processing
    in batches prevents using too much memory.

    Args:
        library_embeddings: The embeddings of the library spectra.
        query_embeddings: The embeddings of the query spectra.
        batch_size: The number of query embeddings processed at the same time.
            A lower batch_size results in a lower memory footprint.
        k: Number of highest matches to return.

    Returns:
        np.ndarray: per query embedding, the indexes of the top k highest scores.
        np.ndarray: the corresponding top k scores.
    """
    top_indexes_per_batch = []
    top_scores_per_batch = []
    num_of_query_embeddings = query_embeddings.embeddings.shape[0]
    # Loop over the query embeddings in batches to keep the score matrix small.
    for start_idx in tqdm(
        range(0, num_of_query_embeddings, batch_size),
        desc="Predicting highest ms2deepscore per batch of "
        f"{min(batch_size, num_of_query_embeddings)} embeddings",
    ):
        end_idx = min(start_idx + batch_size, num_of_query_embeddings)
        selected_query_embeddings = query_embeddings.embeddings[start_idx:end_idx]
        score_matrix = cosine_similarity_matrix(selected_query_embeddings, library_embeddings.embeddings)
        # Select the k highest scores per query, in descending order.
        top_n_idx = np.argsort(score_matrix, axis=1)[:, -k:][:, ::-1]
        top_n_scores = np.take_along_axis(score_matrix, top_n_idx, axis=1)
        top_indexes_per_batch.append(top_n_idx)
        top_scores_per_batch.append(top_n_scores)
    # TODO: refactor to use the Embeddings class and return spectrum hashes instead of indexes.
    return np.vstack(top_indexes_per_batch), np.vstack(top_scores_per_batch)
```
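To make the suggestion concrete: the same batch-then-top-k pattern could look roughly like this for Tanimoto scores. This is only a sketch, not the existing implementation; it assumes the fingerprints are available as dense 0/1 NumPy arrays (if they are RDKit bit vectors, they would need converting first), and the function name `top_k_tanimoto_in_batches` is made up here.

```python
import numpy as np


def top_k_tanimoto_in_batches(query_fps: np.ndarray, library_fps: np.ndarray,
                              batch_size: int = 500, k: int = 10):
    """Sketch: compute only the top k Tanimoto scores per query, batch by batch.

    query_fps and library_fps are binary (0/1) arrays of shape (n_spectra, n_bits).
    Tanimoto(a, b) = |a AND b| / |a OR b|, computed here via popcounts so that
    only a (batch_size x n_library) block is ever held in memory at once.
    """
    lib_counts = library_fps.sum(axis=1)  # popcount of each library fingerprint
    top_idx, top_scores = [], []
    for start in range(0, len(query_fps), batch_size):
        batch = query_fps[start:start + batch_size]
        intersection = batch @ library_fps.T  # |a AND b| for the whole batch
        union = batch.sum(axis=1)[:, None] + lib_counts[None, :] - intersection
        scores = np.divide(intersection, union,
                           out=np.zeros(intersection.shape, dtype=float),
                           where=union > 0)
        # Keep only the k best per query before moving to the next batch.
        idx = np.argsort(scores, axis=1)[:, -k:][:, ::-1]
        top_idx.append(idx)
        top_scores.append(np.take_along_axis(scores, idx, axis=1))
    return np.vstack(top_idx), np.vstack(top_scores)
```

Wrapping the batch loop in `tqdm`, as in the MS2DeepScore function above, would then give the missing progress bar essentially for free.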