
Abstract

Background

Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry data (MS/MS) – called molecular networking – organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, the comparison between these different ML methods remains a challenge because there is a lack of standardization to benchmark, evaluate, and compare MS/MS similarity methods, and there are no methods that address data leakage between training and test data in order to analyze model generalizability.

Result

In this work, we present the creation of a new evaluation methodology using a train/test split that allows for the evaluation of machine learning models at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance and demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.

Conclusion

It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.


Background

Tandem mass spectrometry (MS/MS) of small molecules is a cornerstone of the metabolomics and natural products fields. The high-throughput ability to collect MS/MS spectra has presented new discovery opportunities but also creates new challenges in data interpretation. A common approach to analyzing these MS/MS spectra is the matching of MS/MS against libraries of reference MS/MS spectra for the identification of exact and near structural analogues [1]. Additionally, the molecular networking paradigm [2] has arisen as a popular strategy to organize similar molecules even without compound annotation. The key underlying algorithmic process that enables these computational techniques is the ability to compare MS/MS spectra and determine a level of chemical structural similarity, even without identification. Algorithmic approaches such as cosine, modified cosine similarity, and spectral entropy [1, 3, 4] have been used to address this challenge in cases where the underlying structures are near-neighbors. Recently, algorithmic advances have expanded this approach to multiple modifications but rely on intermediary structures present in the data [5]. In addition, machine learning (ML) approaches have become viable alternatives that hold the promise of enhanced performance [6,7,8,9].
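As a rough illustration of the simplest of these algorithmic baselines, the sketch below computes plain cosine similarity between two peak lists after binning them on the m/z axis. The bin width and toy spectra are arbitrary choices for illustration; production tools implement more careful peak matching, and modified cosine additionally allows fragment shifts by the precursor mass difference.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned

def cosine_similarity(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two binned MS/MS spectra."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical spectra score 1.0; spectra with no shared bins score 0.0.
s1 = [(100.1, 50.0), (150.2, 100.0)]
s2 = [(100.1, 50.0), (150.2, 100.0)]
s3 = [(300.5, 80.0)]
print(round(cosine_similarity(s1, s2), 3))  # 1.0
print(cosine_similarity(s1, s3))            # 0.0
```
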

However, there are key limitations in the development and evaluation of these ML approaches that hinder their impact on the field. First, there are domain-knowledge barriers for new entrants to the field seeking to leverage existing data to build new models. Second, there is currently no standard dataset or proposed set of metrics for the community to effectively measure and compare the progress being made by new publications. Third, existing models’ ability to generalize to new molecules remains under-investigated. In this manuscript we present a method that addresses these issues for the MS/MS similarity problem.

First, we introduce a metadata harmonized, machine learning ready method and dataset for training and evaluating MS/MS spectrum similarity for machine learning. This training dataset draws from public MS/MS resources: GNPS [10] and MassBank [11]. We describe the benchmark’s creation and apply it to gain insight into existing models, including an exploration of out of distribution generalization, an active area of research in the parallel fields of computer vision and natural language processing [12,13,14,15]. Next, we highlight limitations in existing metrics and provide a set of evaluation metrics and reproducible tools that allow for the evaluation of future models in domain-relevant metrics. Finally, we build on the concept of living data in mass spectrometry [10, 16] and introduce infrastructure to periodically update these ML datasets.

Results

Training/test data preparation for structural similarity prediction

We collected and harmonized public data from the GNPS and MassBank spectral libraries; a schematic of the methodology is shown in Fig. 1. As of July 2024, the input data contains 39,274 structures and 788,951 spectra. We harmonized MS/MS library metadata (including removing duplicate imported MS/MS) and canonicalized 2D structures, leaving 37,363 structures and 699,317 spectra (see Methods: Dataset Processing & Cleaning). We further removed the GNPS MS/MS libraries with heterogeneous instrument acquisition (individual GNPS user contributions library) and libraries with a large number of unexplained MS/MS fragmentation events (BMDMS), resulting in 459,250 spectra with 32,857 structures. We filtered to positive ion mode, leaving 29,702 structures and 292,810 spectra. Next, we filtered the MS/MS spectra to keep only those with a precursor m/z matching their annotations and significant fragmentation in the MS/MS (see Methods: Selecting MS/MS Data for Machine Learning, SI Fig. 1). The final dataset contained 189,467 spectra from 28,132 structures. Daily updates of the harmonized-source data that continuously integrate user uploads are available online (See Availability of Data and Materials) and are available as quarterly Zenodo depositions.
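The precursor-consistency filter above can be illustrated with a small sketch. The 20 ppm tolerance and the function name are illustrative assumptions, not the pipeline's actual parameters:

```python
def precursor_matches(observed_mz, expected_mz, ppm_tol=20.0):
    """True if the observed precursor m/z is within ppm_tol of the m/z
    expected from the structure annotation and adduct (tolerance is a
    hypothetical value for illustration)."""
    ppm_error = abs(observed_mz - expected_mz) / expected_mz * 1e6
    return ppm_error <= ppm_tol

print(precursor_matches(301.1410, 301.1406))  # True  (~1.3 ppm error)
print(precursor_matches(301.2000, 301.1406))  # False (~197 ppm error)
```
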

We developed a method to create consistent train/test splits that emphasizes a more balanced coverage of the relevant data space in two key dimensions: (1) similar and dissimilar structure pairs of MS/MS spectra (pairwise structure similarity diversity) and (2) similar and dissimilar structures between training and test (train-test similarity diversity) (See Methods: Sampling Low Train-Test Similarity Pairs). We define train-test similarity for a structure as the maximum similarity between that test structure and all training examples. To adapt this definition to the pairwise nature of the problem, we use the mean train-test similarity of both structures in a test pair as the measure of train-test similarity. The resulting “balanced test set” is important for understanding how MS/MS similarity models generalize to new data and provides a comprehensive assessment of model performance across the possible and relevant input space.
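A minimal sketch of these definitions, using sets of fingerprint bit indices as stand-ins for real molecular fingerprints (the actual pipeline presumably computes Tanimoto similarity over chemical fingerprints such as Morgan bits):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def train_test_similarity(test_fp, train_fps):
    """Max Tanimoto similarity between one test structure and all train structures."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)

def pair_train_test_similarity(fp_a, fp_b, train_fps):
    """Mean of the two structures' train-test similarities, as used for test pairs."""
    return 0.5 * (train_test_similarity(fp_a, train_fps)
                  + train_test_similarity(fp_b, train_fps))

train = [{1, 2, 3, 4}, {10, 11}]
a, b = {1, 2, 3}, {10, 11, 12}
# a vs {1,2,3,4}: 3/4 = 0.75; b vs {10,11}: 2/3 ~= 0.667; mean ~= 0.708
print(round(pair_train_test_similarity(a, b, train), 3))  # 0.708
```
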

Fig. 1

Schematic of the data processing pipeline. Data used to generate our benchmark is collected from GNPS and MassBank EU. On top of existing metadata, provenance spectral files from GNPS are integrated into the main pipeline. Data from the two sources is first harmonized and cleaned (Methods: Dataset Processing & Cleaning), and then the highest quality annotations are selected and a conservative cleaning of spectra is performed (Methods: Selecting MS/MS Data for Machine Learning). Finally, data is split into training and test subsets ensuring low structural similarity between the molecules in different sets (Methods: Sampling Low Train-Test Similarity Pairs). Logos used in this figure are licensed under the Creative Commons Attribution 4.0 International License (see Availability of Data and Materials)


The baseline methodology for constructing test/train splits is the random selection of unique chemical structures. For example, in the MS2DeepScore publication [8], 500 random unique structures (2,967 MS/MS spectra) were selected for the test set. This random selection reflects the underlying distribution of structures, but under-samples regions that are of key interest for an MS/MS similarity model, specifically in lower train-test similarity and higher pairwise structure similarity (top left and middle of SI Fig. 1A).

To build on this baseline, we introduce a method for sampling spectra and pairs designed to improve coverage of the structural similarity and train-test similarity space (Methods: Sampling Low Train-Test Similarity Pairs). This algorithm optimizes sampling of test structures over 13 train-test structural similarity bins ranging from 0.4 to 1.0 (Tanimoto similarity) by forcing structures to be placed into these bins, ensuring high coverage. This is accomplished by iteratively removing training data structures that cause test structures to have high train-test similarity. Second, to address pairwise structure similarity diversity, defined as the Tanimoto similarity between any two structures within the dataset, we apply a random walk sampling approach. Applied to the dataset, this method created a training set of 121,209 spectra and 17,339 structures, a validation set of 3,671 spectra and 500 structures, and a test set of 28,106 spectra and 4,891 structures. This division yielded more complete coverage of similar/dissimilar structure pairs and similar/dissimilar train-test distances when compared with the fully random method, even when normalizing for the number of test structures in the random method (Fig. 2). Overall, the random test set selection met the coverage threshold in 92 out of 120 bins while the introduced approach exceeded the threshold in 111 of 120 bins, a 20.7% improvement (Fig. 2). Moreover, we define a region of interest (orange box, Fig. 2) with train-test similarity (0.55, 0.85) and pairwise similarity > 0.5 that captures structurally related molecules with potential application in molecular networking but distant from the training set. We note that the introduced method reached the threshold of 200 test pairs in all 30 bins in this region, while the random selection method reached it in 16 out of 30 bins.
For the remainder of the manuscript, we will refer to this dataset as the “All-Pairs” test dataset, which contains 1,059,860,580 ordered spectrum pairs from 4,891 structures (Table 1).
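The coverage bookkeeping behind Fig. 2 can be sketched as follows. The 10x10 grid, bin ranges, and the toy threshold in the example are illustrative simplifications of the binning described above:

```python
def coverage(pairs, threshold=200, n_bins=10, lo=0.0, hi=1.0):
    """Fraction of 2-D (pairwise similarity, train-test similarity) bins
    holding at least `threshold` ordered pairs.

    `pairs` is an iterable of (pairwise_sim, train_test_sim) tuples.
    Bins are left-inclusive; the top bin also absorbs the value `hi`."""
    counts = {}
    width = (hi - lo) / n_bins
    for pw, tt in pairs:
        i = min(int((pw - lo) / width), n_bins - 1)
        j = min(int((tt - lo) / width), n_bins - 1)
        counts[(i, j)] = counts.get((i, j), 0) + 1
    covered = sum(1 for c in counts.values() if c >= threshold)
    return covered / (n_bins * n_bins)

# Toy example: three pairs land in the lowest bin, one in the highest.
pairs = [(0.05, 0.05)] * 3 + [(0.95, 0.95)]
print(coverage(pairs, threshold=2))  # 0.01 (only 1 of 100 bins reaches 2 pairs)
```
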

Fig. 2

Sampling coverage. A The number of ordered structure pairs binned by pairwise and train-test similarity for the random sampling method. Pairwise similarity represents the Tanimoto similarity between two molecular structures. For an individual structure, train-test similarity corresponds to the maximum similarity between a test structure and all train examples. The mean train-test similarity for both structures is used to represent the train-test similarity of the pair. B Same as in A but employing our new sampling method introduced in Methods. Bins are inclusive on the left side only, with the exception of the highest bin which includes 1.0. Both test sets contain 4,734 unique InChiKeys (considering the first block only). Color represents the log base 10 count of structure pairs. Cells with counts below 200 ordered pairs (100 unique pairs) are white, highlighting poor sampling coverage. The random test set selection approach achieves a coverage of 76.7% while the introduced sampling method achieves a coverage of 92.5%. Coverage is improved in the high pairwise, moderate train-test similarity region (pairwise similarity > 0.5, train-test similarity above 0.55) denoted in orange. Unthresholded heat maps and a comparison with a 500 InChiKey random sample are available in Fig. SI 2

Table 1 Dataset summary

MS2DeepScore evaluation

MS2DeepScore – reproduction

As a test case, we use the MS/MS similarity prediction method introduced with MS2DeepScore. Briefly, MS2DeepScore uses a siamese multi-layer perceptron architecture to embed two binned MS/MS spectra into two separate 200-dimensional vectors. A cosine similarity function is applied to the two embeddings to predict a structural-similarity score that approximates the Tanimoto similarity of the underlying compounds. We reproduce the training pipeline and recapitulate the relative performance reported in the original publication by retraining on the All-Pairs dataset with a test split based on random InChiKey selection (Fig. SI 3, Fig. SI 2 A&C). Given this similar performance, we maintain hyperparameters consistent with the original method.

When evaluating the performance of MS2DeepScore with the All-Pairs test dataset created using the introduced methodology, we achieve an overall root mean square error (RMSE) between cosine predictions and Tanimoto scores of 0.1743 and an RMSE for molecules with pairwise structure similarity > 0.6 of 0.2630 (Fig. 3). Standard deviation is shown across 4 different random seeds in Fig. SI 4, and training loss curves are shown in Fig. SI 5. We note that the RMSE is highest for structures with high structural similarity (Fig. 3), an observation consistent with the original publication. This is unexpected since highly similar molecular structures are expected to fragment similarly and should be easiest to predict.
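RMSE binned by ground-truth similarity, as reported in Fig. 3, can be computed along the following lines. This is a simplified sketch; bin conventions follow the figure captions, with the top bin inclusive of 1.0:

```python
import math

def binned_rmse(truths, preds, n_bins=10):
    """RMSE of predicted vs. ground-truth Tanimoto similarity, grouped into
    equal-width ground-truth bins (left-inclusive; top bin includes 1.0).
    Returns None for empty bins."""
    sq_err = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(truths, preds):
        i = min(int(t * n_bins), n_bins - 1)
        sq_err[i] += (t - p) ** 2
        counts[i] += 1
    return [math.sqrt(s / c) if c else None
            for s, c in zip(sq_err, counts)]

# Toy data: two pairs in the lowest bin, two in the highest.
truths = [0.05, 0.08, 0.95, 0.92]
preds  = [0.15, 0.18, 0.75, 0.72]
rmse = binned_rmse(truths, preds)
print(round(rmse[0], 3), round(rmse[9], 3))  # 0.1 0.2
```
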

Filtered test datasets evaluation

We hypothesized that the test construction, which used all pairs of MS/MS spectra within the test set, may lead to an inflated RMSE in high similarity structures. To investigate this, we created a new “Filtered” test dataset that constrained the MS/MS pairs to have the same ionization method, mass analyzer, and adduct. Further, we constrained the collision energy difference (when available) to less than 5 eV, and the precursor mass difference to less than 200 Daltons. These constraints are reasonable because when applying MS/MS similarity in the context of tools like molecular networking, data will come from the same dataset and thus the same instrument conditions. This filtered set reduced the All-Pairs test set to 183,347,420 spectrum pairs from 4,734 structures (Table 1). Distributions of the number of pairs per Tanimoto score bin are shown in Fig. SI 6.
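The pair filter can be expressed as a simple predicate. The metadata field names below are hypothetical stand-ins; the thresholds follow the criteria just described:

```python
def passes_filter(meta_a, meta_b, max_ce_diff=5.0, max_prec_diff=200.0):
    """True if two spectra satisfy the Filtered test-set criteria: same
    ionization method, mass analyzer, and adduct; collision-energy difference
    (when both are known) under max_ce_diff eV; and precursor mass difference
    under max_prec_diff Da. Field names are illustrative assumptions."""
    for key in ("ionization", "analyzer", "adduct"):
        if meta_a[key] != meta_b[key]:
            return False
    ce_a = meta_a.get("collision_energy")
    ce_b = meta_b.get("collision_energy")
    if ce_a is not None and ce_b is not None and abs(ce_a - ce_b) >= max_ce_diff:
        return False
    return abs(meta_a["precursor_mz"] - meta_b["precursor_mz"]) < max_prec_diff

a = {"ionization": "ESI", "analyzer": "Orbitrap", "adduct": "[M+H]+",
     "collision_energy": 20.0, "precursor_mz": 301.1}
b = dict(a, collision_energy=22.0, precursor_mz=450.2)
c = dict(a, collision_energy=40.0)
print(passes_filter(a, b))  # True: 2 eV and 149.1 Da differences pass
print(passes_filter(a, c))  # False: 20 eV collision-energy gap fails
```
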

Using this Filtered test set to evaluate the “All-Pairs model” (MS2DeepScore trained on data with no filtration criteria), we find that for structurally related molecules (Tanimoto similarity > 0.60), the RMSE decreases in this new evaluation from 0.2630 to 0.2278 (Fig. 3, blue star line in A and B; SI Table 1). This is more striking in the highest structural similarity bin (0.9-1.0 Tanimoto similarity), where the error decreases from 0.3038 to 0.1962. The RMSE in structurally related pairs is further reduced when evaluating with a strict collision energy requirement, from a bin-averaged RMSE of 0.2278 to 0.1777, although the decrease in the 0.9-1.0 bin is smaller, from 0.1962 to 0.1783 (SI Note 1: Strict Collision Energy Evaluation, Fig. SI 7).

Fig. 3

Performance relative to ground truth similarity. A Root mean squared error (RMSE) comparison between All-Pairs, Filtered, Unbiased, and Biased models on All-Pairs test data across 10 ground truth structural similarity bins. B RMSE comparison on the filtered test dataset. Across both evaluation sets, models trained on Filtered data outperform models trained on the All-Pairs set for low pairwise similarities but underperform for higher pairwise similarities. Biasing the pairwise similarity distribution during training improves pairwise similarity greater than 0.6 prediction performance. X-tick values represent the exclusive upper end of the pairwise similarity bin (e.g., 0.1 is the [0.0, 0.1) bin), with the exception of 1.0 which is inclusive


To provide a more realistic evaluation of performance, we introduce two new metrics that address ranking and retrieval performance useful in MS/MS library and analog search: Top Candidate Similarity and Top-Rank. Top Candidate Similarity reflects how well a model can retrieve highly similar structures (see Methods: Ranking and Retrieval Evaluation). This metric measures the maximum structural similarity of the retrieved structures to the structure corresponding to the query spectrum, up to rank k, as measured by Tanimoto similarity. We report Top Candidate Similarity for the top 1, 3, and 10 candidates retrieved by each model, along with the “Theoretical Maximum,” the maximum Tanimoto similarity between all queries and distinct spectra within the test set. The “All-Pairs model” evaluated on the Filtered dataset achieves an average Tanimoto similarity of 0.8505 at k = 1, 0.8619 at k = 3, and 0.8847 at k = 10, compared to an optimal score of 0.9168 when identical structures are included in the evaluation (Fig. 4A). In the setting where no identical InChiKeys are present, the model achieves a score of 0.4703 at k = 1, 0.4909 at k = 3, and 0.5456 at k = 10, lagging behind the optimal average Tanimoto score of 0.6604 (Fig. 4B). The “All-Pairs model” performs comparably with the Modified Cosine method [10], which achieves a score of 0.8812 (vs 0.8847) and 0.5401 (vs 0.5456) at k = 10 when including and excluding identical compounds, respectively. Reported metrics for MS2DeepScore are averaged across four random seeds. Standard deviations are reported in Fig. SI 8 and SI Table 2.
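A minimal sketch of Top Candidate Similarity for a single query spectrum; the real evaluation averages this quantity over all query spectra and random seeds:

```python
def top_candidate_similarity(predicted, true_tanimoto, k):
    """Max ground-truth Tanimoto similarity among the k candidates with the
    highest predicted similarity to the query.

    `predicted` and `true_tanimoto` are parallel lists over candidate spectra:
    model scores and actual Tanimoto similarities to the query structure."""
    order = sorted(range(len(predicted)),
                   key=lambda i: predicted[i], reverse=True)
    return max(true_tanimoto[i] for i in order[:k])

pred = [0.9, 0.2, 0.7, 0.4]   # model scores for four candidates
true = [0.5, 1.0, 0.8, 0.3]   # actual Tanimoto to the query structure
print(top_candidate_similarity(pred, true, 1))  # 0.5 (best-scored candidate)
print(top_candidate_similarity(pred, true, 3))  # 0.8 (best among ranks 1-3)
```
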

Fig. 4

Tanimoto similarity of top predictions on filtered data. A The maximum Tanimoto similarity of the top k compounds with highest predicted similarity averaged across test data points. The optimal structural similarity (Theoretical Maximum) that demonstrates an upper bound on performance is also included. B Tanimoto similarity excluding any compounds with an identical planar 2D structure. On average, all variants of MS2DeepScore retrieve data points with comparable Tanimoto similarity to the modified cosine similarity


While the Top Candidate Similarity reflects a model’s ability to retrieve similar structures, it is limited in its ability to assess how well a model can retrieve and rank structures relative to Tanimoto similarity. To address this, we introduce the Top-Rank metric, which measures how well a model ranks the most similar structures within the top 1, 3, and 10 candidates (see Methods: Ranking and Retrieval Evaluation). One limitation of this metric, however, is that the ranking of results corresponding to a structurally unique compound is weighted equally to those with many near-neighbors. When evaluating the “All-Pairs model” on the Filtered dataset with identical compounds included, the average rank is 37.11 at k = 1, 25.75 at k = 3, and 13.86 at k = 10 (Fig. 5A, SI Table 3). This suggests that while the model is able to retrieve identical structures, with the Top Candidate Similarity close to the optimal at all k, the “All-Pairs model” fails to correctly rank the absolute most similar retrieved structure. When removing identical InChiKeys from the evaluation, the best predicted rank increases to 862.56 at k = 1, 740.32 at k = 3, and 409.63 at k = 10 (Fig. 5B; Table 2, SI Table 3). In comparison to Modified Cosine, the “All-Pairs model” outperforms with a ranking of 453.37 versus 473.26 at k = 10. Reported metrics are averaged across four random seeds. Standard deviations are shown in Fig. SI 9. This suggests that while the “All-Pairs model” can reidentify known compounds, it fails to retrieve the most similar structures.
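Top-Rank for a single query can be sketched the same way (ranks are 1-based; the real metric averages the best predicted rank over all queries):

```python
def top_rank(predicted, true_tanimoto, k):
    """Best (lowest, 1-based) predicted rank achieved by any of the k
    candidates with the highest ground-truth Tanimoto similarity."""
    pred_order = sorted(range(len(predicted)),
                        key=lambda i: predicted[i], reverse=True)
    pred_rank = {idx: r + 1 for r, idx in enumerate(pred_order)}
    truth_order = sorted(range(len(true_tanimoto)),
                         key=lambda i: true_tanimoto[i], reverse=True)
    return min(pred_rank[i] for i in truth_order[:k])

pred = [0.9, 0.2, 0.7, 0.4]
true = [0.5, 1.0, 0.8, 0.3]
print(top_rank(pred, true, 1))  # 4: the truly closest candidate is ranked last
print(top_rank(pred, true, 2))  # 2: the second-closest candidate sits at rank 2
```
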

Fig. 5

Top predicted rank on filtered data. A The best predicted rank for the k most similar structures, averaged across four random seeds (See: Methods: Ranking and Retrieval Evaluation). The “All-Pairs model” and “Filtered model” retrieve the highest similarity structures up to ground-truth rank k = 3, at which point they are comparable to modified cosine. B The best predicted rank for the k most similar structures, excluding identical structures. The “All-Pairs model” outperforms all other methods at any maximum k value

Table 2 Comprehensive evaluation of the models

Filtered training datasets evaluation

With the same methodology, we constructed a training set of filtered pairs. This training set reduces the All-Pairs training set from 121,209 to 106,135 spectra and from 17,339 to 16,720 structures. We retrained MS2DeepScore and evaluated performance on the Filtered evaluation set. While this new training improved the average RMSE (from 0.1908 to 0.1509, SI Table 1), in Fig. 3B we note that the bin-averaged RMSE for structurally related molecules (Tanimoto score > 0.6) increases significantly from 0.2278 to 0.2569. Notably, the effect of training on filtered data on RMSE remains consistent across train-test similarities (Fig. 6). From a practical perspective, the higher structural similarity ranges are more important for downstream matching applications, e.g., MS/MS library search and molecular networking. Thus, even though the average RMSE improvement is notable, in this scenario average RMSE is likely a less useful metric for understanding how these similarity models will perform in practice (for further discussion see SI Note 2). We argue that focusing on the higher structural-similarity bins, in the > 0.6 or > 0.7 Tanimoto ranges, would be of more practical value.

Fig. 6

Effect of training methodology refinements. A The RMSE of the “All-Pairs Model” evaluated on the Filtered dataset. B Training with filtered MS/MS pairs results in a decrease in average RMSE of 0.0161. However, this decrease was not uniformly beneficial to all pairs of spectra in the test data; RMSE increased in areas with pairwise similarity higher than 0.6 and decreased for low pairwise similarity (< 0.3). C The change in RMSE for the “Biased All-Pairs model” in comparison to the unbiased “All-Pairs model” shown in (A). Biased training improves RMSE for high pairwise similarity pairs (> 0.6) and increases RMSE for low pairwise similarity pairs (< 0.4)


Biased training increases prediction accuracy in highly structurally similar molecules

Despite improved overall RMSE, training on Filtered data failed to improve performance on structurally related molecules. Therefore, we bias the training distribution for the “All-Pairs model” towards higher pairwise similarity pairs to improve performance on such pairs (See Methods: Biased Training of MS2DeepScore). Our target distribution for training data contains 50% of pairs with a pairwise similarity > 0.81 Tanimoto score, while the random sampling scheme introduced in the original MS2DeepScore publication [8] targets 19%. However, we replicate the MS2DeepScore sampling methodology, which first prioritizes complete coverage of InChiKeys and secondarily prioritizes the distribution of pairwise similarities at training time (Fig. SI 10a). During the selection of the All-Pairs training data targeting a uniform distribution, 9.75% of training samples are above the 0.81 Tanimoto similarity threshold, while during the biased sampling, 24.57% of training pairs are above the threshold (Fig. SI 10).
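A simplified sketch of biasing the sampled pairwise-similarity distribution toward a target. The actual MS2DeepScore sampler additionally guarantees full InChiKey coverage before matching the target distribution; the bins and weights here are toy values:

```python
import random

def biased_pair_sample(pairs_by_bin, target_weights, n_samples, seed=0):
    """Draw training pairs so the pairwise-similarity bins follow
    `target_weights` (weight per bin index) instead of the natural
    distribution. `pairs_by_bin` maps bin index -> candidate pairs."""
    rng = random.Random(seed)
    bins = sorted(pairs_by_bin)
    weights = [target_weights[b] for b in bins]
    sample = []
    for _ in range(n_samples):
        b = rng.choices(bins, weights=weights)[0]  # pick a bin by target weight
        sample.append(rng.choice(pairs_by_bin[b]))  # then a pair within the bin
    return sample

# Toy usage: bias heavily toward the high-similarity bin (index 1).
pairs_by_bin = {0: [("m1", "m2"), ("m3", "m4")], 1: [("m5", "m6")]}
sample = biased_pair_sample(pairs_by_bin, {0: 0.2, 1: 0.8}, n_samples=10)
print(len(sample))  # 10
```
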

For the “All-Pairs model” evaluated on the Filtered test set, biasing increases the overall RMSE from 0.1670 to 0.2587. However, this retraining results in a decrease in bin-averaged RMSE for structurally related molecules (ground truth similarity > 0.6) from 0.2630 to 0.1823 (Fig. 3A) when training and evaluating on the All-Pairs dataset and from 0.2278 to 0.1663 when training on All-Pairs and evaluating on the Filtered dataset (Fig. 3B). Further, we observe that the reduction in error for high structural similarity pairs is maintained across train-test similarity (Fig. 6). Notably, the 0.9-1.0 bin of the “All-Pairs model” evaluated on the Filtered data set achieves the largest decrease in RMSE of 0.0461.

Despite the improvement in RMSE for structurally related compounds, biasing the training data did not improve retrieval. For the “All-Pairs model,” biased training data marginally reduced Top Candidate Similarity when including identical structures, from 0.8505 to 0.8464 at k = 1 and from 0.8847 to 0.8816 at k = 10 (Fig. 4A, SI Table 2). A similar pattern holds when identical structures were excluded from the retrieval metrics, in which case the “Biased All-Pairs model” performed marginally worse (Fig. 4B; Table 2), achieving a score of 0.5396 versus 0.5456 for the “Unbiased All-Pairs model” at k = 10.

Similarly, Top-Rank performance decreased, with the average rank increasing from 13.86 to 18.02 at k = 10 (Fig. 5A, SI Table 3) when spectra from identical InChiKeys were present in the retrieval. In addition, “Biased All-Pairs model” performance was consistently worse than the Modified Cosine score (Fig. 5A, SI Table 3), achieving a rank of 18.02 versus 13.01 at k = 10. When excluding identical compounds, the performance of the “All-Pairs model” trained on biased data was comparable to Modified Cosine, 453.37 versus 473.26 at k = 10, but worse than the “Unbiased All-Pairs model,” which achieved a score of 409.63 at k = 10 (Table 2, SI Table 3).

Generalization of MS2DeepScore

Enabled by the improved sampling for higher pairwise similarity structures distant from the train set, we probe the generalization capabilities of the “All-Pairs model” on the All-Pairs test set (Fig. SI 11a) and the Filtered test set (Fig. SI 11b). On the All-Pairs test set, RMSE increased from 0.16 to 0.30 for the [0.7, 0.8) pairwise similarity bin, and similarly from 0.18 to 0.23 for the [0, 0.1) bin. Meanwhile, in the [0.3, 0.5) pairwise similarity region, there is limited change in RMSE as train-test similarity changes with RMSE decreasing from 0.16 to 0.11 in the [0.4,0.5) range (Fig. SI 11a). Similarly, when evaluating on the Filtered dataset, RMSE increased from 0.14 to 0.31 for the [0.7,0.8) pairwise similarity bin and 0.16 to 0.24 for the [0, 0.1) bin while decreasing from 0.16 to 0.11 for the [0.4, 0.5) range (Fig. SI 11b). This suggests that not only is the performance of the model out of distribution affected by the train-test similarity, but also the pairwise structural similarity of the input data. A comparison with the “Filtered model” is included in SI Figure SI 12.

Further, we investigate the role of train-test similarity in ranking and retrieval on the Filtered evaluation set using the “All-Pairs model”. For the Top Candidate Similarity metric, we report the distance to the optimal structure similarity for each train-test similarity bin (Fig. 7A, B). In the case where identical structures are included in the retrieval set (Fig. 7A), the most similar train-test bin (0.95, 1.0 Tanimoto) measured the smallest gap to the optimal structural similarity. When excluding identical structures from retrieval (Fig. 7B), we still observe that the top train-test similarity bin (0.95, 1.0) exhibits the smallest gap to the optimal similarity. However, outside of this top window, we do not observe a clear trend for the other bins with train-test similarity less than 0.95.

Fig. 7

Ranking and retrieval metrics by train-test similarity. A & B Distance between the maximum predicted Tanimoto similarity up to rank k and the optimal Tanimoto similarity, plotted for each train-test similarity bin. A When including identical compounds, the most similar bin (0.95, 1.0) outperforms all other bins (with lower bounds greater than 0.55) where the 200 pair sampling threshold was achieved (Fig. 2). B When excluding identical compounds, the same trend holds, although when considering the top 10 predictions, the bin closest to the training set (0.95, 1.0) outperforms all others. C & D The best predicted rank for the top-k most similar structures to each query when identical compounds are included in the results (C), and when excluded (D). In both cases, the model predicts better (lower) ranks for the most similar compounds as the data points become closer to the training set


In contrast, for the Top-Rank metric, there is a decrease in performance as train-test similarity decreases. When including identical compounds in retrieval, at k = 1, for the most similar train-test bin, the “All-Pairs model” predicts an average rank of 17.95, while the (0.55, 0.60) bin predicts an average rank of 234.97 (Fig. 7C). This trend holds across all k-values, with an average rank of 7.60 for the (0.95, 1.0) similarity bin and 113.62 for the (0.55, 0.60) bin at k = 10. When not considering identical compounds in retrieval, at k = 1 the (0.95, 1.0) similarity bin has an average rank of 29.30 while the (0.55, 0.60) bin has an average rank of 304.55 (Fig. 7D). For k = 10, the “All-Pairs model” predicts an average rank of 11.97 and 149.77 for the (0.95, 1.0) and (0.55, 0.60) bins respectively, indicating that the “All-Pairs model” is better able to rank highly similar compounds that are closer to the training set.

Discussion

By using an experimental condition-conscious approach to evaluation, we reduce the error for structurally related compounds and explain the anomalous trend of high error for high similarity structures (those with Tanimoto similarity greater than 0.9). Further, we use a new training method that biases towards high pairwise similarity structures and further reduces the prediction error. However, we also observe that while we enhance the prediction accuracy (measured by RMSE), other metrics measuring retrieval accuracy notably suffer. This finding suggests that an ensemble approach may improve overall performance by allowing specialized models to focus on different regions of the input space [17, 18]. We hope the complementary nature of the three metrics we include in this manuscript will provide a more comprehensive evaluation of future models.

While the introduced sampling method increases the coverage of the space greater than 0.55 train-test similarity (Fig. 2, Fig. SI 2), the entirety of the two-dimensional space is still not covered. Notably, for the region of higher pairwise similarity and low training similarity (pairwise similarity > 0.5, train-test similarity < 0.55), the introduced test-train selection method fails to meet the 200 pair criteria. This region is especially challenging as the test data is very dissimilar from the training data, which artificially forces the training dataset to be smaller. Thus, we have chosen a compromise to balance this region with an acceptable training dataset size. Since we are releasing an open-source pipeline that is continually rerun as new libraries are released, we anticipate that additional and new libraries will help increase this coverage while maintaining large training datasets.

Further, we would like to emphasize that molecular similarity is a subjective measure with thresholds varying by application [19,20,21]; for this reason, we offer a benchmark across train-test similarities and pairwise similarities, enabling end users to make an informed decision on the necessity of retraining for their specific task. In a similar vein, different levels of error may be appropriate for different tasks, and a particular use case should inform users on the necessity to retrain their models. For example, while the structural similarity of retrieved compounds is roughly independent of train-test similarity, the ranking of retrieved compounds suffers (Fig. 7A, C).

One weakness of the community data included here is the incompleteness of collision energies. While we used the original provenance files for the public library creation to collect as many collision energies as possible, 70.72% of the test spectra do not have collision energy metadata. As a compromise, we allow pairs without collision energy information to maximize the size of the training and test datasets but note that requiring all pairs to have annotated collision energies can reduce error in high-pairwise similarity structures (Filtered Evaluation, SI: Strict Collision Energy Evaluation).

Conclusions

We present a publicly available training and test dataset al.ong with evaluation techniques and metrics that target performance characteristics meaningful to the practical applications of MS/MS structural similarity. Further, we experiment with two domain-inspired training methodologies and demo

Abstract
Background
Untargeted tandem mass spectrometry serves as a scalable solution for the organization of small molecules. One of the most prevalent techniques for analyzing the acquired tandem mass spectrometry data (MS/MS) – called molecular networking – organizes and visualizes putatively structurally related compounds. However, a key bottleneck of this approach is the comparison of MS/MS spectra used to identify nearby structural neighbors. Machine learning (ML) approaches have emerged as a promising technique to predict structural similarity from MS/MS that may surpass the current state-of-the-art algorithmic methods. However, the comparison between these different ML methods remains a challenge because there is a lack of standardization to benchmark, evaluate, and compare MS/MS similarity methods, and there are no methods that address data leakage between training and test data in order to analyze model generalizability.

Result
In this work, we present the creation of a new evaluation methodology using a train/test split that allows for the evaluation of machine learning models at varying degrees of structural similarity between training and test sets. We also introduce a training and evaluation framework that measures prediction accuracy on domain-inspired annotation and retrieval metrics designed to mirror real-world applications. We further show how two alternative training methods that leverage MS specific insights (e.g., similar instrumentation, collision energy, adduct) affect method performance and demonstrate the orthogonality of the proposed metrics. We especially highlight the role that collision energy plays in prediction errors. Finally, we release a continually updated version of our dataset online along with our data cleaning and splitting pipelines for community use.

Conclusion
It is our hope that this benchmark will serve as the basis of development for future machine learning approaches in MS/MS similarity and facilitate comparison between models. We anticipate that the introduced set of evaluation metrics allows for a better reflection of practical performance.

Peer Review reports

Background
Tandem mass spectrometry (MS/MS) of small molecules is a cornerstone of the metabolomics and natural products fields. The high-throughput ability to collect MS/MS spectra has presented new discovery opportunities but also creates new challenges in data interpretation. A common approach to analyzing these MS/MS spectra is the matching of MS/MS against libraries of reference MS/MS spectra for the identification of exact and near structural analogues [1]. Additionally, the molecular networking paradigm [2] has arisen as a popular strategy to organize similar molecules even without compound annotation. The key underlying algorithmic process that enables these computational techniques is the ability to compare MS/MS spectra and determine a level of chemical structural similarity, even without identification. Algorithmic approaches such as cosine, modified cosine similarity, and spectral entropy [1, 3, 4] have been used to address this challenge in cases where the underlying structures are near-neighbors. Recently, algorithmic advances have expanded this approach to multiple modifications but rely on intermediary structures present in the data [5]. In addition, machine learning (ML) approaches have become viable alternatives that hold the promise of enhanced performance [6,7,8,9].

However, there are key limitations in the development and evaluation of these ML approaches that hinder their impact on the field. First, there are domain-knowledge barriers for new entrants to the field seeking to leverage existing data to build new models. Second, there is currently no standard dataset or proposed set of metrics for the community to effectively measure and compare the progress being made by new publications. Third, existing models' ability to generalize to new molecules remains under-investigated. In this manuscript, we present a method that aims to address these issues for the MS/MS similarity problem.

First, we introduce a metadata-harmonized, machine-learning-ready method and dataset for training and evaluating MS/MS spectrum similarity models. This training dataset draws from public MS/MS resources: GNPS [10] and MassBank [11]. We describe the benchmark’s creation and apply it to gain insight into existing models, including an exploration of out-of-distribution generalization, an active area of research in the parallel fields of computer vision and natural language processing [12,13,14,15]. Next, we highlight limitations in existing metrics and provide a set of evaluation metrics and reproducible tools that allow for the evaluation of future models in domain-relevant metrics. Finally, we build on the concept of living data in mass spectrometry [10, 16] and introduce infrastructure to periodically update these ML datasets.

Results
Training/test data preparation for structural similarity prediction
We collected and harmonized public data from the GNPS and MassBank spectral libraries; a schematic of the methodology is shown in Fig. 1. As of July 2024, the input data contains 39,274 structures and 788,951 spectra. We harmonized MS/MS library metadata (including removing duplicate imported MS/MS) and canonicalized 2D structures, leaving 37,363 structures and 699,317 spectra (see Methods: Dataset Processing & Cleaning). We further removed the GNPS MS/MS libraries with heterogeneous instrument acquisition (individual GNPS user contributions library) and libraries with a large number of unexplained MS/MS fragmentation (BMDMS), resulting in 459,250 spectra with 32,857 structures. We filtered to positive ion mode, leaving 29,702 structures and 292,810 spectra. Next, we filtered the MS/MS spectra to keep only those with a precursor m/z matching their annotations, and significant fragmentation in the MS/MS (see Methods: Selecting MS/MS Data for Machine Learning, SI Fig. 1). The final dataset contained 189,467 spectra from 28,132 structures. Daily updates of the harmonized-source data that continuously integrate user uploads are available online (See Availability of Data and Materials) and are available as quarterly Zenodo depositions.
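One of the filters above, matching the precursor m/z against the annotation, can be sketched as a simple tolerance check. This is an illustration only: the 10 ppm default and function name below are our assumptions, not the released pipeline's actual settings.

```python
def precursor_matches(observed_mz: float, expected_mz: float,
                      tol_ppm: float = 10.0) -> bool:
    """Keep a spectrum only if its measured precursor m/z agrees with the
    m/z computed from its structure annotation within tol_ppm.
    (Illustrative; the released pipeline defines the actual tolerance.)"""
    return abs(observed_mz - expected_mz) <= expected_mz * tol_ppm * 1e-6
```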

We developed a method to create consistent train/test splits that emphasizes a more balanced coverage of relevant data space in two key dimensions: (1) similar and dissimilar structure pairs of MS/MS spectra (pairwise structure similarity diversity) and (2) similar and dissimilar structures between training and test (train-test similarity diversity) (See Methods: Sampling Low Train-Test Similarity Pairs). We define train-test similarity for a structure as the maximum similarity between a test structure and all training examples. To adapt this definition to the pairwise nature of the problem, we use the mean train-test similarity for both structures in a test pair as the measure of train-test similarity. The resulting “balanced test set” is important for understanding how MS/MS similarity models generalize to new data and provides a comprehensive assessment of model performance across the possible and relevant input space.
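The two train-test similarity definitions above can be sketched in a few lines of Python, with fingerprints represented as plain sets of on-bits (in practice these would be binary molecular fingerprints, e.g. from RDKit); the function names are our own illustration:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def train_test_similarity(test_fp: set, train_fps: list) -> float:
    """Per-structure train-test similarity: the maximum Tanimoto similarity
    between one test structure and every training structure."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)

def pair_train_test_similarity(fp_a: set, fp_b: set, train_fps: list) -> float:
    """Train-test similarity of a test pair: the mean of the two
    per-structure train-test similarities."""
    return 0.5 * (train_test_similarity(fp_a, train_fps)
                  + train_test_similarity(fp_b, train_fps))
```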

Fig. 1
figure 1
Schematic of the data processing pipeline. Data used to generate our benchmark is collected from GNPS and MassBank EU. On top of existing metadata, provenance spectral files from GNPS are integrated into the main pipeline. Data from the two sources is first harmonized and cleaned (Methods: Dataset Processing & Cleaning), and then the highest quality annotations are selected and a conservative cleaning of spectra is performed (Methods: Selecting MS/MS Data for Machine Learning). Finally, data is split into training and test subsets ensuring low structural similarity between the molecules in different sets (Methods: Sampling Low Train-Test Similarity Pairs). Logos used in this figure are licensed under the Creative Commons Attribution 4.0 International License (see Availability of Data and Materials)

Full size image
The baseline methodology for constructing test/train splits is the random selection of unique chemical structures. For example, in the MS2DeepScore publication [8], 500 random unique structures (2,967 MS/MS spectra) were selected for the test set. This random selection reflects the underlying distribution of structures, but under-samples regions that are of key interest for an MS/MS similarity model—specifically in lower train-test similarity and higher pairwise structure similarity (top left and middle of SI Fig. 1A).

To build on this baseline, we introduce a method for sampling spectra and pairs designed to improve coverage of the structural similarity and train-test similarity space (Methods: Sampling Low Train-Test Similarity Pairs). This algorithm optimizes sampling of test structures over 13 train-test structural similarity bins ranging from 0.4 to 1.0 (Tanimoto similarity) by forcing structures to be placed into these bins, ensuring high coverage. This is accomplished by iteratively removing training data structures that cause test structures to have high train-test similarity. Second, to address pairwise structure similarity diversity, defined as the Tanimoto similarity between any two structures within the dataset, we apply a random walk sampling approach. Applied to the dataset, this method created a training set of 121,209 spectra and 17,339 structures, a validation set of 3,671 spectra and 500 structures, and a test set of 28,106 spectra and 4,891 structures. This division yielded a more complete coverage of the similar/dissimilar structure pairs and similar/dissimilar test-train distances when compared with the fully random method, even when normalizing for the number of test structures in the random method (Fig. 2). Overall, the random test set selection reached the threshold in 92 out of 120 bins while the introduced approach exceeded the threshold for 111 of 120 bins, a 20.7% improvement (Fig. 2). Moreover, we define a region of interest (orange box, Fig. 2) with train-test similarity (0.55, 0.85) and pairwise similarity > 0.5 that captures structurally related molecules with potential application in molecular networking, but distant from the training set. We note that the introduced method reached the threshold of 200 test pairs in all 30 bins in this region while the random selection method reached 16 out of 30 bins.
For the remainder of the manuscript, we will refer to this dataset as the “All-Pairs” test dataset which contains 1,059,860,580 ordered spectrum pairs from 4,891 structures (Table 1).
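The coverage statistic reported for Fig. 2 (the fraction of bins meeting the 200 ordered-pair threshold) can be computed with a straightforward two-dimensional binning. The bin edges and threshold are taken from the text; the function names are our own sketch:

```python
from collections import Counter

def bin_index(value, edges):
    """Left-inclusive bin index; the top bin also includes the right edge,
    matching the binning convention described for Fig. 2."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return len(edges) - 2 if value == edges[-1] else None

def coverage(pairs, pairwise_edges, tt_edges, threshold=200):
    """Fraction of (pairwise similarity, train-test similarity) bins that
    contain at least `threshold` ordered pairs. `pairs` is an iterable of
    (pairwise_similarity, train_test_similarity) tuples."""
    counts = Counter()
    for pw, tt in pairs:
        i, j = bin_index(pw, pairwise_edges), bin_index(tt, tt_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1
    n_bins = (len(pairwise_edges) - 1) * (len(tt_edges) - 1)
    return sum(1 for c in counts.values() if c >= threshold) / n_bins
```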

Fig. 2
figure 2
Sampling coverage. A The number of ordered structure pairs binned by pairwise and train-test similarity for the random sampling method. Pairwise similarity represents the Tanimoto similarity between two molecular structures. For an individual structure, train-test similarity corresponds to the maximum similarity between a test structure and all train examples. The mean train-test similarity for both structures is used to represent the train-test similarity of the pair. B Same as in A but employing our new sampling method introduced in Methods. Bins are inclusive on the left side only, with the exception of the highest bin, which includes 1.0. Both test sets contain 4,734 unique InChIKeys (considering the first block only). Color represents the log base 10 count of structure pairs. Cells with counts below 200 ordered pairs (100 unique pairs) are white, highlighting poor sampling coverage. The random test set selection approach achieves a coverage of 76.7% while the introduced sampling method achieves a coverage of 92.5%. Coverage is improved in the high pairwise, moderate train-test similarity region (pairwise similarity > 0.5, train-test similarity above 0.55) denoted in orange. Unthresholded heat maps and a comparison with a 500 InChIKey random sample are available in Fig. SI 2

Full size image
Table 1 Dataset summary
Full size table
MS2DeepScore evaluation
MS2DeepScore – reproduction
As a test case, we use the MS/MS similarity prediction method introduced with MS2DeepScore. Briefly, MS2DeepScore uses a Siamese multi-layer perceptron architecture to embed two binned MS/MS spectra into two separate 200-dimensional vectors. A cosine similarity function is applied to the two embeddings to predict a structural-similarity score that approximates the Tanimoto similarity of the underlying compounds. We reproduce the training pipeline and recapitulate the relative performance of the original publication by retraining on the All-Pairs dataset with a test split based on random InChIKey selection (shown in Fig. SI 3, SI 2 A&C). Given the similar performance, we keep the hyperparameters consistent with the original method.
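The Siamese prediction step can be summarized as follows. The binned input and 200-dimensional embeddings are from the text; `embed` is a stand-in for the shared multi-layer perceptron, and the helper names are our own:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_similarity(binned_a, binned_b, embed):
    """Siamese prediction: the SAME embed function (the shared MLP) maps
    each binned spectrum to a vector; the cosine of the two embeddings is
    the predicted structural (Tanimoto) similarity."""
    return cosine(embed(binned_a), embed(binned_b))
```

The key design point is weight sharing: both spectra pass through one embedding function, so the learned space is symmetric in its two inputs.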

When evaluating the performance of MS2DeepScore with the All-Pairs test dataset created using the introduced methodology, we achieve an overall root mean square error (RMSE) between cosine predictions and Tanimoto scores of 0.1743, and an RMSE of 0.2630 for molecules with pairwise structure similarity > 0.6 (Fig. 3). Standard deviation across 4 different random seeds is shown in Fig. SI 4 and training loss curves are shown in Fig. SI 5. We note that the RMSE is highest for structures with high structural similarity (Fig. 3), an observation consistent with the original publication. This is unexpected, since highly similar molecular structures are expected to fragment similarly and should be easiest to predict.
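The bin-wise RMSE used throughout Figs. 3 and 6 can be computed as in the sketch below, with ten equal-width ground-truth similarity bins as in the figures; the function name is our own illustration:

```python
import math
from collections import defaultdict

def binned_rmse(pairs, n_bins=10):
    """RMSE of predicted vs. ground-truth Tanimoto similarity, grouped into
    equal-width ground-truth bins ([0, 0.1), ..., [0.9, 1.0]).
    `pairs` is an iterable of (true_similarity, predicted_similarity)."""
    squared = defaultdict(list)
    for true, pred in pairs:
        b = min(int(true * n_bins), n_bins - 1)  # a true value of 1.0 falls in the top bin
        squared[b].append((pred - true) ** 2)
    return {b: math.sqrt(sum(v) / len(v)) for b, v in sorted(squared.items())}
```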

Filtered test datasets evaluation
We hypothesized that the test construction, which used all pairs of MS/MS spectra within the test set, may lead to an inflated RMSE in high similarity structures. To investigate this, we created a new “Filtered” test dataset that constrained the MS/MS pairs to have the same ionization method, mass analyzer, and adduct. Further, we constrained the collision energy difference (when available) to less than 5 eV, and the precursor mass difference to less than 200 Daltons. These constraints are reasonable because when applying MS/MS similarity in the context of tools like molecular networking, data will come from the same dataset and thus the same instrument conditions. This filtering reduced the All-Pairs test set to 183,347,420 spectrum pairs from 4,734 structures (Table 1). Distributions of the number of pairs per Tanimoto score bin are shown in Fig. SI 6.
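The Filtered-set criteria can be expressed as a single predicate over a pair of spectra. The thresholds are from the text; the metadata field names below are illustrative, not the schema of the released dataset:

```python
def passes_filter(meta_a: dict, meta_b: dict,
                  max_ce_diff: float = 5.0, max_mz_diff: float = 200.0) -> bool:
    """Filtered-set criteria: same ionization method, mass analyzer, and
    adduct; collision energy difference below 5 eV when both are annotated;
    precursor m/z difference below 200 Da."""
    for key in ("ionization", "mass_analyzer", "adduct"):
        if meta_a[key] != meta_b[key]:
            return False
    ce_a, ce_b = meta_a.get("collision_energy"), meta_b.get("collision_energy")
    if ce_a is not None and ce_b is not None and abs(ce_a - ce_b) >= max_ce_diff:
        return False
    return abs(meta_a["precursor_mz"] - meta_b["precursor_mz"]) < max_mz_diff
```

Note that, as in the text, pairs with missing collision energies are allowed through; only pairs where both energies are known are constrained.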

Using this Filtered test set to evaluate the “All-Pairs model” (MS2DeepScore trained on data with no filtration criteria), we find that for structurally related molecules (Tanimoto similarity > 0.60), the RMSE decreases in this new evaluation from 0.2630 to 0.2278 (Fig. 3, blue star line in A and B; SI Table 1). This is more striking in the highest structural similarity bin (0.9-1.0 Tanimoto similarity), where the error decreases from 0.3038 to 0.1962. The RMSE in structurally related pairs is further reduced when evaluating with a strict collision energy requirement, from a bin-averaged RMSE of 0.2278 to 0.1777, although the decrease in the 0.9-1.0 bin is smaller, from 0.1962 to 0.1783 (SI Note 1: Strict Collision Energy Evaluation, Fig. SI 7).

Fig. 3
figure 3
Performance relative to ground truth similarity. A Root mean squared error (RMSE) comparison between All-Pairs, Filtered, Unbiased, and Biased models on All-Pairs test data across 10 ground truth structural similarity bins. B RMSE comparison on the filtered test dataset. Across both evaluation sets, models trained on Filtered data outperform models trained on the All-Pairs set for low pairwise similarities but underperform for higher pairwise similarities. Biasing the pairwise similarity distribution during training improves pairwise similarity greater than 0.6 prediction performance. X-tick values represent the exclusive upper end of the pairwise similarity bin (e.g., 0.1 is the [0.0, 0.1) bin), with the exception of 1.0 which is inclusive

Full size image
To provide a more realistic evaluation of performance, we introduce two new metrics that address ranking and retrieval performance that are useful in MS/MS library and analog search: Top Candidate Similarity and Top-Rank. Top Candidate Similarity reflects how well a model can retrieve highly similar structures (see Methods: Ranking and Retrieval Evaluation). This metric measures the maximum structural similarity of the retrieved structures to the structure corresponding to the query spectrum, up to rank k, as measured by the Tanimoto similarity. We report Top Candidate Similarity for the top 1, 3, and 10 candidates retrieved by each model, along with the “Theoretical Maximum,” the maximum Tanimoto similarity between all queries and distinct spectra within the test set. The “All-Pairs model” evaluated on the Filtered dataset achieves an average Tanimoto similarity of 0.8505 at k = 1, 0.8619 at k = 3, and 0.8847 at k = 10 compared to an optimal score of 0.9168 when identical structures are included in the evaluation (Fig. 4A). In the setting where no identical InChIKeys are present, the model achieves a score of 0.4703 at k = 1, 0.4909 at k = 3, and 0.5456 at k = 10, lagging behind the optimal average Tanimoto score of 0.6604 (Fig. 4B). The “All-Pairs model” performs comparably with the Modified Cosine method [10], which achieves a score of 0.8812 (vs 0.8847) and 0.5401 (vs 0.5456) at k = 10 when including and not including identical compounds respectively. Reported metrics for MS2DeepScore are averaged across four random seeds. Standard deviations are reported in Fig. SI 8, SI Table 2.
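For a single query, Top Candidate Similarity can be sketched as below; `pred_scores` and `true_sims` are hypothetical per-candidate arrays, not names from the released code:

```python
def top_candidate_similarity(pred_scores, true_sims, k):
    """Maximum ground-truth Tanimoto similarity among the k candidates the
    model ranks highest for one query. pred_scores[i] is the model's
    predicted similarity for candidate i; true_sims[i] is the true Tanimoto
    similarity of candidate i's structure to the query's structure."""
    top_k = sorted(range(len(pred_scores)),
                   key=lambda i: pred_scores[i], reverse=True)[:k]
    return max(true_sims[i] for i in top_k)
```

The reported numbers average this quantity over all query spectra in the test set.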

Fig. 4
figure 4
Tanimoto similarity of top predictions on filtered data. A The maximum Tanimoto similarity of the top k compounds with highest predicted similarity averaged across test data points. The optimal structural similarity (Theoretical Maximum) that demonstrates an upper bound on performance is also included. B Tanimoto similarity excluding any compounds with an identical planar 2D structure. On average, all variants of MS2DeepScore retrieve data points with comparable Tanimoto similarity to the modified cosine similarity

Full size image
While the Top Candidate Similarity reflects a model’s ability to retrieve similar structures, it is limited in its ability to assess how well a model can retrieve and rank structures relative to Tanimoto similarity. To address this, we introduce the Top-Rank metric, which measures how well a model ranks the most similar structures within the top 1, 3, and 10 candidates (see Methods: Ranking and Retrieval Evaluation). One limitation of this metric, however, is that the ranking of results corresponding to a structurally unique compound is weighted equally to that of compounds with many near-neighbors. When evaluating the “All-Pairs model” on the Filtered dataset with identical compounds included, the average rank is 37.11 at k = 1, 25.75 at k = 3, and 13.86 at k = 10 (Fig. 5A, SI Table 3). This suggests that while the model is able to retrieve identical structures, with the Top Candidate Similarity close to the optimal at all k, the “All-Pairs model” fails to correctly rank the absolute most similar retrieved structure. When removing identical InChIKeys from the evaluation, the best predicted rank increases to 862.56 at k = 1, 740.32 at k = 3, and 409.63 at k = 10 (Fig. 5B; Table 2, SI Table 3). In comparison to Modified Cosine, the “All-Pairs model” outperforms, with a rank of 453.37 versus 473.26 at k = 10. Reported metrics are averaged across four random seeds. Standard deviations are shown in Fig. SI 9. This suggests that while the “All-Pairs model” can reidentify known compounds, it fails to retrieve the most similar structures.
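Top-Rank can be sketched analogously: for the k candidates with the highest ground-truth similarity, take the best (lowest) rank the model assigns any of them. As above, the argument names are our own illustration:

```python
def top_rank(pred_scores, true_sims, k):
    """Best (lowest, 1-indexed) predicted rank among the k candidates whose
    ground-truth Tanimoto similarity to the query is highest."""
    by_pred = sorted(range(len(pred_scores)),
                     key=lambda i: pred_scores[i], reverse=True)
    pred_rank = {i: r + 1 for r, i in enumerate(by_pred)}
    truly_top = sorted(range(len(true_sims)),
                       key=lambda i: true_sims[i], reverse=True)[:k]
    return min(pred_rank[i] for i in truly_top)
```

A value of 1 means the model placed at least one of the truly top-k candidates first; larger values mean the most similar structures are buried deep in the ranked list.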

Fig. 5
figure 5
Top predicted rank on filtered data. A The best predicted rank for the k most similar structures, averaged across four random seeds (See: Methods: Ranking and Retrieval Evaluation). The “All-Pairs model” and “Filtered model” retrieve the highest similarity structures up to ground-truth rank k = 3, at which point they are comparable to modified cosine. B The best predicted rank for the k most similar structures, excluding identical structures. The “All-Pairs model” outperforms all other methods at any maximum k value

Full size image
Table 2 Comprehensive evaluation of the models
Full size table
Filtered training datasets evaluation
With the same methodology, we constructed a training set of filtered pairs. This reduces the All-Pairs training set from 121,209 to 106,135 spectra and from 17,339 to 16,720 structures. We retrained MS2DeepScore and evaluated the performance on the Filtered evaluation set. While this new training improved the average RMSE (from 0.1908 to 0.1509, SI Table 1), in Fig. 3B we note that the bin-averaged RMSE for structurally related molecules (Tanimoto score > 0.6) increases significantly from 0.2278 to 0.2569. Notably, the effect of training on filtered data on RMSE remains consistent across train-test similarities (Fig. 6). From a practical perspective, the higher structural similarity ranges are more important for downstream matching applications, e.g. MS/MS library search and molecular networking. Thus, even though the average RMSE improvement is notable, in this scenario it is likely a less useful metric for understanding how these similarity models may perform in practice (for further discussion see SI Note 2). We argue that focusing on the higher structural similarity bins, in the > 0.6 or > 0.7 Tanimoto ranges, would be of more practical value.

Fig. 6
figure 6
Effect of training methodology refinements. A The RMSE of the “All-Pairs Model” evaluated on the Filtered dataset. B Training with filtered MS/MS pairs results in a decrease in average RMSE of 0.0161. However, this decrease was not uniformly beneficial to all pairs of spectra in the test data; RMSE increased in areas with pairwise similarity higher than 0.6 and decreased for low pairwise similarity (< 0.3). C The change in RMSE for the “Biased All-Pairs model” in comparison to the unbiased “All-Pairs model” shown in (A). Biased training improves RMSE for high pairwise similarity pairs (> 0.6) and increases RMSE for low pairwise similarity pairs (< 0.4)

Full size image
Biased training increases prediction accuracy in highly structurally similar molecules
Despite improved overall RMSE, training on Filtered data failed to improve performance on structurally related molecules. Therefore, we bias the training distribution for the “All-Pairs model” towards higher pairwise similarity pairs to improve performance on such pairs (See Methods: Biased Training of MS2DeepScore). Our target distribution for training data contains 50% of pairs with a pairwise similarity > 0.81 Tanimoto score while the random sampling scheme introduced in (Huber et al. 2021) targets 19%. However, we replicate the MS2DeepScore sampling methodology which first prioritizes complete coverage of InChiKeys and secondarily prioritizes the distribution of pairwise similarities at training time (Fig. SI 10a). During the selection of the All-Pairs training data targeting a uniform distribution, 9.75% of training samples are above the 0.81 Tanimoto similarity threshold while during the biased sampling, 24.57% of training pairs are above the threshold (Fig. SI 10).
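A simplified version of the biased sampling idea, targeting a fixed fraction of training pairs above the 0.81 Tanimoto threshold, is sketched below. The InChIKey-coverage prioritization used in the actual MS2DeepScore sampling scheme is omitted here, and the function name is our own:

```python
import random

def biased_pair_sample(pairs, n, high_frac=0.5, threshold=0.81, seed=0):
    """Draw n training pairs (with replacement) so that `high_frac` of them
    have Tanimoto similarity above `threshold`. Each element of `pairs` is a
    (spectrum_a, spectrum_b, tanimoto_similarity) tuple."""
    rng = random.Random(seed)
    high = [p for p in pairs if p[2] > threshold]
    low = [p for p in pairs if p[2] <= threshold]
    n_high = round(n * high_frac)
    sample = rng.choices(high, k=n_high) + rng.choices(low, k=n - n_high)
    rng.shuffle(sample)
    return sample
```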

For the “All-Pairs model” evaluated on the Filtered test set, biasing increases the overall RMSE from 0.1670 to 0.2587. However, this retraining decreases the bin-averaged RMSE for structurally related molecules (ground truth similarity > 0.6) from 0.2630 to 0.1823 when training and evaluating on the All-Pairs dataset (Fig. 3A) and from 0.2278 to 0.1663 when training on All-Pairs and evaluating on the Filtered dataset (Fig. 3B). Further, we observe that the reduction in error for high structural similarity pairs is maintained across train-test similarity (Fig. 6). Notably, the 0.9-1.0 bin of the “All-Pairs model” evaluated on the Filtered data set achieves the largest decrease in RMSE, of 0.0461.

Despite the improvement in RMSE for structurally related compounds, there was no improvement in the retrieval setting when biasing the training data. For the “All-Pairs model,” biased training data marginally reduced Top Candidate Similarity when including identical structures, from 0.8505 to 0.8464 at k = 1 and from 0.8847 to 0.8816 at k = 10 (Fig. 4A, SI Table 2). A similar pattern holds when identical structures were excluded from the retrieval metrics, in which case the “Biased All-Pairs model” marginally underperformed (Fig. 4B; Table 2), achieving a score of 0.5396 versus 0.5456 for the “Unbiased All-Pairs model” at k = 10.

Similarly, Top Rank performance decreased, with the average rank increasing from 13.86 to 18.02 at k = 10 (Fig. 5A, SI Table 3) when spectra from identical InChIKeys were present in the retrieval. In addition, “Biased All-Pairs model” performance was consistently worse than the Modified Cosine score (Fig. 5A, SI Table 3), achieving a rank of 18.02 versus 13.01 at k = 10. When excluding identical compounds, the performance of the “All-Pairs model” trained on biased data was comparable to Modified Cosine, 453.37 versus 473.26 at k = 10, but worse than the “Unbiased All-Pairs model”, which achieved a score of 409.63 at k = 10 (Table 2, SI Table 3).

Generalization of MS2DeepScore
Enabled by the improved sampling for higher pairwise similarity structures distant from the train set, we probe the generalization capabilities of the “All-Pairs model” on the All-Pairs test set (Fig. SI 11a) and the Filtered test set (Fig. SI 11b). On the All-Pairs test set, RMSE increased from 0.16 to 0.30 for the [0.7, 0.8) pairwise similarity bin, and similarly from 0.18 to 0.23 for the [0, 0.1) bin. Meanwhile, in the [0.3, 0.5) pairwise similarity region, there is limited change in RMSE as train-test similarity changes, with RMSE decreasing from 0.16 to 0.11 in the [0.4, 0.5) range (Fig. SI 11a). Similarly, when evaluating on the Filtered dataset, RMSE increased from 0.14 to 0.31 for the [0.7, 0.8) pairwise similarity bin and 0.16 to 0.24 for the [0, 0.1) bin while decreasing from 0.16 to 0.11 for the [0.4, 0.5) range (Fig. SI 11b). This suggests that the model's out-of-distribution performance is affected not only by the train-test similarity but also by the pairwise structural similarity of the input data. A comparison with the “Filtered model” is included in Fig. SI 12.

Further, we investigate the role of train-test similarity in ranking and retrieval on the Filtered evaluation set using the “All-Pairs model”. For the Top Candidate Similarity metric, we report the distance to the optimal structural similarity for each train-test similarity bin (Fig. 7A, B). In the case where identical structures are included in the retrieval set, the most similar train-test bin (0.95, 1.0 Tanimoto) measured the smallest gap to the optimal structural similarity (Fig. 7A). When excluding identical structures from retrieval (Fig. 7B), we still observe that the top train-test similarity bin (0.95, 1.0) exhibits the smallest gap to the optimal similarity. However, outside of this top window, we do not observe a clear trend for the other bins with a train-test similarity less than 0.95.

Fig. 7
figure 7
Ranking and retrieval metrics by train-test similarity. A & B Distance between the maximum predicted Tanimoto similarity up to rank k and the optimal Tanimoto similarity plotted for each train-test similarity bin. A When including identical compounds, the most similar bin (0.95, 1.0) outperforms all other bins (with lower bounds greater than 0.55) where the 200 pair sampling threshold was achieved (Fig. 2). B When excluding identical compounds, the same trend holds, although when considering the top 10 predictions, the closest bin to the test set (0.95, 1.0) outperforms all others. C & D The best predicted rank for the top-k most similar structures to each query when identical compounds are included in the results (C), and when excluded (D). In both cases, the model predicts better (lower) ranks for the most similar compounds as the data points become closer to the training set

Full size image
In contrast, for the Top Rank metric, performance decreases as train-test similarity decreases. When including identical compounds in retrieval, at k = 1, the “All-Pairs model” predicts an average rank of 17.95 for the most similar train-test bin, while the (0.55, 0.60) bin yields an average rank of 234.97 (Fig. 7C). This trend holds across all k-values, with an average rank of 7.60 for the (0.95, 1.0) similarity bin and 113.62 for the (0.55, 0.60) bin at k = 10. When excluding identical compounds from retrieval, at k = 1 the (0.95, 1.0) similarity bin has an average rank of 29.30 while the (0.55, 0.60) bin has an average rank of 304.55 (Fig. 7D). For k = 10, the “All-Pairs model” predicts average ranks of 11.97 and 149.77 for the (0.95, 1.0) and (0.55, 0.60) bins respectively, indicating that the “All-Pairs model” better ranks highly similar compounds that are closer to the training set.

Discussion
By using an experimental condition-conscious approach to evaluation, we reduce the error for structurally related compounds and explain the anomalous trend of high error for high-similarity structures (those with Tanimoto similarity greater than 0.9). Further, we use a new training method that biases towards high pairwise similarity structures and further reduces the prediction error. However, we also observe that while we enhance the prediction accuracy (measured by RMSE), other metrics measuring retrieval accuracy notably suffer. This finding suggests that an ensemble approach may improve overall performance by allowing specialized models to focus on different regions of the input space [17, 18]. We hope the complementary nature of the three metrics we include in this manuscript will provide a more comprehensive evaluation of future models.

While the introduced sampling method increases the coverage of the space above 0.55 train-test similarity (Fig. 2, Fig. SI 2), the entirety of the two-dimensional space is still not covered. Notably, for the region of high pairwise similarity and low train-test similarity (pairwise similarity > 0.5, train-test similarity < 0.55), the introduced test-train selection method fails to meet the 200-pair criterion. This region is especially challenging because making the test data very dissimilar from the training data artificially forces the training dataset to be smaller. Thus, we have chosen a compromise that balances coverage of this region with an acceptable training dataset size. Since we are releasing an open-source pipeline that is continually rerun as new libraries are released, we anticipate that additional and new libraries will help increase this coverage while maintaining large training datasets.

Further, we would like to emphasize that molecular similarity is a subjective measure with thresholds varying by application [19,20,21]. For this reason, we offer a benchmark across train-test similarities and pairwise similarities, enabling end users to make an informed decision about the necessity of retraining for their specific task. In a similar vein, different levels of error may be acceptable for different tasks, and a particular use case should inform users on the necessity to retrain their models. For example, while the structural similarity of retrieved compounds is roughly independent of train-test similarity, the ranking of retrieved compounds suffers (Fig. 7A, C).

One weakness of the community data included here is the incompleteness of collision energies. While we used the original provenance files for the public library creation to collect as many collision energies as possible, 70.72% of the test spectra do not have collision energy metadata. As a compromise, we allow pairs without collision energy information to maximize the size of the training and test datasets but note that requiring all pairs to have annotated collision energies can reduce error in high-pairwise similarity structures (Filtered Evaluation, SI: Strict Collision Energy Evaluation).

Conclusions
We present a publicly available training and test dataset along with evaluation techniques and metrics that target performance characteristics meaningful to the practical applications of MS/MS structural similarity. Further, we experiment with two domain-inspired training methodologies and demonstrate orthogonality in the proposed metrics. We anticipate that our benchmark will support the rapid development of new models and enable their comprehensive assessment. Finally, we hope that this work and dataset can complement existing efforts [22] in the field to create standardized datasets and machine learning problem formulations.
