SAR and QSAR Introduction
Structure-Activity Relationship (SAR) methods have the potential to reduce the cost of the early drug discovery pipeline.<ref>Bamborough P, Drewry D, Harper G, Smith GK, Schneider K. J Med Chem. 2008, 51 (24), 7898–7914</ref><ref>Frye SV. Chem Biol., 1999,6(1),R3–R7</ref>
This area of computational chemistry deals with building models that predict the biological activity of a molecule from its chemical structure. SAR models are increasingly used for decision making, but poor prediction quality remains a major concern.
We have developed a web-based tool for SAR and QSAR modeling to add to the services provided by charmming.org. It implements Random Forests, one of the more recent advances in machine learning.<ref name="breiman_reference">Breiman, Leo Mach.Learn.,2001,45,5</ref><ref name="Iwona_reference">Weidlich Iwona E,Filippov Igor V, Brown J, Kaushik-Basu N, Krishnan R, Nicklaus Marc C,Thorpe Ian F,Bioorg.Med.Chem. 2013,21(11), 3127-3137</ref>
The tool allows users to create their own models from submitted training SD files, which combine structures with activity information (categorical for SAR, numerical for QSAR), to track the model generation process, and to run the resulting models on new data to predict activity. The whole process is presented in a straightforward manner: each step prompts the user for the next action, so that even a first-time visitor can tell which stage of the process they have reached. Additionally, two lessons, one on SAR categorization and one on QSAR regression, are available from the charmming.org website.
The two most important features of any SAR or QSAR model are accuracy of prediction and the ability to generalize. If the model is accurate only for a small set of similar compounds, it will be less useful for novel drug discovery. We therefore recommend uploading a training data set with diverse structures.
The CHARMMing SAR/QSAR tool uses the Random Forest (RF) algorithm, which is one of the most accurate modern machine learning methods. It is an ensemble classifier, that is, a collection of simpler individual classifiers. In the case of Random Forest, the individual classifiers are decision trees. A decision tree can be thought of as an algorithmic implementation of the game of twenty questions: each node of the tree is a question such as ‘is feature N smaller than the threshold T?’, and the branches are the ‘Yes/No’ answers. The leaves are the prediction results, such as ‘Active/Inactive’. The prediction of the whole forest can be an average of the individual tree predictions.<ref name="breiman_reference" /><ref name="Iwona_reference" />
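The tree-and-forest idea described above can be illustrated with a minimal sketch. It is not the code used by the tool: the `Node` class, the three hand-built trees, and the three-bit "molecule" are hypothetical and exist only to show how a yes/no question at each node leads to a leaf, and how the forest averages the tree outputs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a toy decision tree (hypothetical structure, for illustration)."""
    feature: Optional[int] = None       # index N of the feature being questioned
    threshold: Optional[float] = None   # threshold T
    yes: Optional["Node"] = None        # branch taken when feature N < T
    no: Optional["Node"] = None         # branch taken otherwise
    prediction: Optional[float] = None  # set on leaves only: 1.0 = Active, 0.0 = Inactive


def predict_tree(node: Node, x: Sequence[float]) -> float:
    """Answer one question per node until a leaf is reached."""
    while node.prediction is None:
        node = node.yes if x[node.feature] < node.threshold else node.no
    return node.prediction


def predict_forest(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Average the individual tree predictions."""
    return sum(predict_tree(tree, x) for tree in trees) / len(trees)


active, inactive = Node(prediction=1.0), Node(prediction=0.0)
forest = [
    Node(feature=0, threshold=0.5, yes=inactive, no=active),
    Node(feature=1, threshold=0.5, yes=inactive, no=active),
    Node(feature=0, threshold=0.5, yes=inactive,
         no=Node(feature=2, threshold=0.5, yes=inactive, no=active)),
]

molecule = [1, 0, 1]  # toy fingerprint bits
score = predict_forest(forest, molecule)  # two of the three trees vote Active
```

A real forest differs mainly in scale and in how the trees are obtained: the trees are learned from training data rather than written by hand.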
This SAR/QSAR module uses 2048-bit Morgan fingerprints of radius 2 as features and employs a Random Forest model consisting of 500 trees. Morgan fingerprints are bit vectors in which bits are set according to the molecular neighborhoods of atoms. They are similar to the Extended Connectivity Fingerprints used in Accelrys Pipeline Pilot and the MNA descriptors used in PASS.
More information is available in Weidlich et al., Bioorg. Med. Chem. 2013, 21 (11), 3127-3137.<ref name="Iwona_reference" />
Preparing and uploading the data
For the purposes of machine modeling, the activity should be expressed as a binary value (Y/N) for SAR or as a numerical value for QSAR. Before uploading, it is recommended to estimate the diversity of the whole training set, as well as of the active and inactive subsets separately.
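As a rough pre-upload check, the activity values in an SD file can be read with a few lines of Python. This is only a sketch: the data field name `Activity` is an assumption (the field name the tool expects is not stated here), the two single-atom records are toy data, and the parser ignores the structure block entirely.

```python
# Toy SD text: two records, each ending with the "$$$$" delimiter.
# The field name "Activity" is hypothetical.
SD_TEXT = """\
methane
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <Activity>
Y

$$$$
water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <Activity>
N

$$$$
"""


def read_sd_field(sd_text, tag="Activity"):
    """Return the value of one data field for every record (None if absent)."""
    values = []
    for record in sd_text.split("$$$$"):
        lines = record.strip("\n").splitlines()
        if not any(line.strip() for line in lines):
            continue  # skip the empty piece after the last delimiter
        value = None
        for i, line in enumerate(lines):
            # A data header looks like "> <Tag>"; the value is on the next line.
            if line.startswith(">") and f"<{tag}>" in line and i + 1 < len(lines):
                value = lines[i + 1].strip()
        values.append(value)
    return values


def activity_kind(values):
    """Classify the activity column as categorical (SAR) or numerical (QSAR)."""
    if all(v in ("Y", "N") for v in values):
        return "categorical (SAR)"
    try:
        [float(v) for v in values]
    except (TypeError, ValueError):
        return "unrecognized"
    return "numerical (QSAR)"


values = read_sd_field(SD_TEXT)
kind = activity_kind(values)
```

A result of `"unrecognized"` would indicate missing or mixed activity values that should be fixed before the file is uploaded.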
To create a predictive SAR/QSAR model, the module uses a Python script developed in-house on top of the RDKit machine learning and chemistry modules (ML and Chem), including the RDKit Random Forest implementation. RDKit is an open-source toolkit for cheminformatics and machine learning written in C++ and Python.
It is a well-known problem that some computational models overfit the data:<ref>Tropsha A, Mol.Inf., 2010, 29, 476-488</ref> instead of making predictions, the model merely "recalls" compounds very similar to those seen in the training set. Based on tests of overfitting tendencies with the Y-randomization procedure,<ref>Y-Randomization – A Useful Tool in QSAR Validation, or Folklore? Christoph Rücker, Gerta Rücker, and Markus Meringer</ref> RDKit appears to be a top choice for avoiding overfitting. Y-randomization is a tool used in the validation of machine learning models: the performance of the original model in data description (AUC or R²) is compared with that of models built on randomly shuffled responses, using the original descriptor pool and the original model-building procedure.
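The Y-randomization idea can be sketched in a few lines. To keep the example self-contained, a one-descriptor least-squares line fit stands in for the Random Forest model, and its R² is used as the performance measure; the function names, the synthetic data, and the number of shuffling rounds are all illustrative assumptions.

```python
import random


def r_squared_of_line_fit(x, y):
    """R^2 of a least-squares line through (x, y), i.e. the squared Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_randomization(score_fn, x, y, n_rounds=20, seed=0):
    """Score models built on shuffled responses, keeping the descriptors unchanged."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)  # break the link between structure and activity
        scores.append(score_fn(x, shuffled))
    return scores


# Synthetic data: the response depends on the descriptor, plus a small alternating noise.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, v in enumerate(x)]

original = r_squared_of_line_fit(x, y)
randomized = y_randomization(r_squared_of_line_fit, x, y)
mean_randomized = sum(randomized) / len(randomized)
```

Here `original` is close to 1 while `mean_randomized` is much lower, which is the pattern expected from a model that captures a real structure-activity signal. If the shuffled models scored about as well as the original, the modeling procedure would be fitting noise.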
The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve is used to measure the performance of the model in four cases: self-prediction, cross-validation, Y-randomization, and prediction for the validation set (if present). Self-prediction means that the prediction is run on the same file that was used for training. Cross-validation is a common method of validating a SAR/QSAR model: some compounds are held out as a test set, while the remaining compounds form the training set. In our case, 5-fold cross-validation is performed by randomly splitting the training set into training (80%) and validation (20%) sets. The procedure is repeated 5 times and the average AUC is reported. It is important to verify that the Y-randomization result is significantly lower than the self-prediction or cross-validation AUC; otherwise the model is overfitting the data. A high self-prediction AUC is common and should not by itself be considered an indication of model quality; a low self-prediction AUC, however, is certainly a reason for alarm. In the absence of external test sets, cross-validation is the best indicator of the model's predictive ability. A high cross-validation AUC and a low Y-randomization AUC are indicators of a good model, though by no means a guarantee of one.
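Two of the ingredients above, the AUC and the 80%/20% splitting, can be sketched with the standard library alone. The AUC is computed here through its rank interpretation (the fraction of active/inactive pairs in which the active compound receives the higher score). The splitter assumes disjoint folds, so that every compound is held out exactly once; the tool may instead draw five independent random splits. The function names and the toy labels and scores are illustrative.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count half."""
    actives = [s for label, s in zip(labels, scores) if label == 1]
    inactives = [s for label, s in zip(labels, scores) if label == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive compound")
    wins = sum(1.0 if a > i else 0.5 if a == i else 0.0
               for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))


def k_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (training, held_out) index lists; each item is held out exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, held_out in enumerate(folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, held_out


labels = [1, 1, 0, 1, 0, 0]               # 1 = active, 0 = inactive
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]   # toy predicted scores
auc = roc_auc(labels, scores)             # 8 of the 9 active/inactive pairs are ordered correctly

splits = list(k_fold_splits(n_items=10))  # five 80%/20% splits of ten compounds
```

In a full cross-validation run, a model would be trained on each `training` list, scored on the corresponding `held_out` list with `roc_auc`, and the five AUC values averaged.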
For regression modeling, the squared correlation coefficient, R², is used to estimate model performance in the same way as AUC is used for categorization.
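Assuming that the correlation coefficient meant here is the Pearson correlation between the observed activities y_i and the predicted activities ŷ_i of the n compounds (the text does not say so explicitly), R² can be written as:

```latex
R^2 = \frac{\left[\sum_{i=1}^{n} (y_i - \bar{y})\,(\hat{y}_i - \bar{\hat{y}})\right]^2}
           {\sum_{i=1}^{n} (y_i - \bar{y})^2 \;\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}
```

Here the bars denote means over the n compounds. Under this definition R² lies between 0 and 1, and it equals 1 when the predictions are an exact linear function of the observed values.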
It is also easy to validate the model on an external validation set: if the tool finds activity scores already present in the “prediction” file, it will automatically compute precision and recall measures (for SAR) or R² (for QSAR).
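For reference, precision and recall for a categorical model can be computed as in the sketch below. It uses the standard definitions; the function name and the toy labels are illustrative, and the tool's own computation may differ in details such as the score threshold used to call a compound active.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = active, 0 = inactive)."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # Precision: how many predicted actives are truly active.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: how many true actives were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


actual = [1, 1, 1, 1, 0, 0, 0, 0]     # activities already present in the file
predicted = [1, 1, 0, 0, 1, 0, 0, 0]  # model calls after thresholding the score
precision, recall = precision_recall(actual, predicted)
```

In this toy case two of the three predicted actives are correct (precision 2/3), and two of the four true actives are recovered (recall 1/2).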
The user uploads an SD file with the structures for which activity is to be predicted. Once the RF model has identified compounds with activity scores above the threshold for active compounds, the user can confirm the compounds with the highest predicted scores experimentally. It is imperative that any SAR/QSAR computational prediction be supported by thorough experimental testing. Please contact us and share your experience with the service; this will help us improve it.
The new SAR/QSAR tool can be used as a stand-alone utility or as a supporting filter for the docking procedure.