I am running the library file creation on the server, and it struggles with the Tanimoto score computation for the TopKTanimotoScores.

My impression is that the bottleneck is mostly memory footprint: for some reason it uses over 100 GB of RAM at that step. It does finish correctly, and relatively fast for negative-mode spectra only, but for the full library it took at least an hour. It would be good to allow batched computation, since we are only interested in the top 10 highest scores. This would reduce the memory footprint and would also allow printing a progress bar, which is currently lacking and makes it hard to tell whether the run has crashed or is still going.

We could do something like what I did for the MS2DeepScore computation: compute in batches and select the top k per batch to free up memory. Just make sure it handles the parallelization well and that this does not cause too much overhead.
```python
from typing import Tuple

import numpy as np
from tqdm import tqdm


def predict_top_k_ms2deepscores(
    library_embeddings: Embeddings, query_embeddings: Embeddings, batch_size: int = 500, k: int = 1
) -> Tuple[np.ndarray, np.ndarray]:
    """Memory-efficient way of calculating the highest MS2DeepScores.

    When doing large matrix multiplications, storing the full output matrix can take a lot of
    memory. E.g. for 500,000 vs 10,000 spectra this is a very large matrix, and on a laptop this
    can result in very slow run times. If only the highest MS2DeepScores are needed, processing
    in batches prevents using too much memory.

    Args:
        library_embeddings: The embeddings of the library spectra.
        query_embeddings: The embeddings of the query spectra.
        batch_size: The number of query embeddings processed at the same time.
            A lower batch_size results in a lower memory footprint.
        k: Number of highest matches to return.

    Returns:
        np.ndarray: per query embedding, the indexes of the top k highest scores.
        np.ndarray: the corresponding top k scores.
    """
    top_indexes_per_batch = []
    top_scores_per_batch = []
    num_of_query_embeddings = query_embeddings.embeddings.shape[0]
    # Loop over the query embeddings in batches to keep the score matrix small.
    for start_idx in tqdm(
        range(0, num_of_query_embeddings, batch_size),
        desc="Predicting highest ms2deepscore per batch of "
        f"{min(batch_size, num_of_query_embeddings)} embeddings",
    ):
        end_idx = min(start_idx + batch_size, num_of_query_embeddings)
        selected_query_embeddings = query_embeddings.embeddings[start_idx:end_idx]
        score_matrix = cosine_similarity_matrix(selected_query_embeddings, library_embeddings.embeddings)
        # Select the k highest scores per query, in descending order.
        top_n_idx = np.argsort(score_matrix, axis=1)[:, -k:][:, ::-1]
        top_n_scores = np.take_along_axis(score_matrix, top_n_idx, axis=1)
        top_indexes_per_batch.append(top_n_idx)
        top_scores_per_batch.append(top_n_scores)
    # TODO: refactor to use the Embeddings class and return spectrum hashes instead of indexes.
    return np.vstack(top_indexes_per_batch), np.vstack(top_scores_per_batch)
```
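To make the suggestion concrete: the same batch-then-top-k pattern could look roughly like this for Tanimoto scores. This is only a sketch, not the existing implementation; it assumes the fingerprints are available as dense 0/1 NumPy arrays (if they are RDKit bit vectors, they would need converting first), and the function name `top_k_tanimoto_in_batches` is made up here.

```python
import numpy as np


def top_k_tanimoto_in_batches(query_fps: np.ndarray, library_fps: np.ndarray,
                              batch_size: int = 500, k: int = 10):
    """Sketch: compute only the top k Tanimoto scores per query, batch by batch.

    query_fps and library_fps are binary (0/1) arrays of shape (n_spectra, n_bits).
    Tanimoto(a, b) = |a AND b| / |a OR b|, computed here via popcounts so that
    only a (batch_size x n_library) block is ever held in memory at once.
    """
    lib_counts = library_fps.sum(axis=1)  # popcount of each library fingerprint
    top_idx, top_scores = [], []
    for start in range(0, len(query_fps), batch_size):
        batch = query_fps[start:start + batch_size]
        intersection = batch @ library_fps.T  # |a AND b| for the whole batch
        union = batch.sum(axis=1)[:, None] + lib_counts[None, :] - intersection
        scores = np.divide(intersection, union,
                           out=np.zeros(intersection.shape, dtype=float),
                           where=union > 0)
        # Keep only the k best per query before moving to the next batch.
        idx = np.argsort(scores, axis=1)[:, -k:][:, ::-1]
        top_idx.append(idx)
        top_scores.append(np.take_along_axis(scores, idx, axis=1))
    return np.vstack(top_idx), np.vstack(top_scores)
```

Wrapping the batch loop in `tqdm`, as in the MS2DeepScore function above, would then give the missing progress bar essentially for free.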