SAR and QSAR Introduction

'''Note: this section of charmmtutorial.org is under development. It may change rapidly.'''

SAR/QSAR Introduction
Structure Activity Relationship (SAR) (http://en.wikipedia.org/wiki/Structure-activity_relationship) procedures can help reduce the costs of the early drug discovery pipeline. (Bamborough P, Drewry D, Harper G, Smith GK, Schneider K. J Med Chem. 2008, 51 (24), 7898–7914. http://pubs.acs.org/doi/abs/10.1021/jm8011036 ; Frye SV. Chem Biol. 1999, 6 (1), R3–R7. http://www.ncbi.nlm.nih.gov/pubmed/9889153 ; Lovering F, Bikker J, Humblet C. J Med Chem. 2009, 52 (21), 6752–6756. http://pubs.acs.org/doi/abs/10.1021/jm901241e)

This area of computational chemistry deals with building models that predict the biological activity of a molecule from its chemical structure. SAR models are increasingly used for decision making, but poor prediction quality is still a major concern.

We have developed a web-based tool for SAR and QSAR (http://en.wikipedia.org/wiki/Quantitative_structure-activity_relationship) modeling to add to the services provided by charmming.org. It is an implementation of one of the most recent advances in modern machine learning algorithms, Random Forests. (Breiman L. Mach. Learn. 2001, 45, 5. http://oz.berkeley.edu/~breiman/randomforest2001.pdf ; Weidlich IE, Filippov IV, Brown J, Kaushik-Basu N, Krishnan R, Nicklaus MC, Thorpe IF. Bioorg. Med. Chem. 2013, 21 (11), 3127–3137. http://www.sciencedirect.com/science/article/pii/S0968089613002460 ; http://en.wikipedia.org/wiki/Random_forest)

The tool allows a user to create his or her own models from submitted training SD files, which combine structures with activity information (either categorical (SAR) or numerical (QSAR)), to track the model generation process, and to run the created models on new data to predict activity. The whole process is presented in a straightforward, user-friendly manner: each step prompts the user for the next action, so that even a first-time visitor to the web service can tell which stage of the process he or she has reached. Additionally, two lessons, one for SAR categorization and one for QSAR regression, are available from the charmming.org website.
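As an illustration of what "structures with activity information" looks like in practice, the sketch below reads the activity data field from SD-file text. In the SD format each record ends with a `$$$$` line and data fields are introduced by a `> <FieldName>` header line followed by the value. The field name `Activity`, the compound names, and the empty placeholder structure blocks are assumptions made for this example; the source does not state which field name the tool expects.

```python
def split_sd_records(sd_text):
    """Split SD-file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def read_sd_activities(sd_text, field="Activity"):
    """Return (molecule name, raw activity string) for every record.

    'Activity' is a hypothetical field name used only for illustration.
    """
    results = []
    for lines in split_sd_records(sd_text):
        name = lines[0].strip() if lines else ""
        value = None
        for i, line in enumerate(lines):
            # A data header looks like '> <Activity>'; the value is on the next line.
            if line.startswith(">") and "<" + field + ">" in line:
                value = lines[i + 1].strip() if i + 1 < len(lines) else None
                break
        results.append((name, value))
    return results


def parse_activity(value):
    """Keep categorical labels (Y/N) as they are; convert anything else to a number."""
    if value in ("Y", "N"):
        return value
    return float(value)


# Two placeholder records (atom and bond blocks omitted for brevity).
sample_sd = "\n".join([
    "cmpd-1", "  placeholder", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <Activity>", "Y", "", "$$$$",
    "cmpd-2", "  placeholder", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "> <Activity>", "N", "", "$$$$",
])

activities = read_sd_activities(sample_sd)
print(activities)
```

A file whose activity values are all Y or N would correspond to a categorical (SAR) training set, while numerical values would correspond to a QSAR training set.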

The two most important features of any SAR or QSAR model are accuracy of prediction and the ability to generalize. If the model is accurate for only a small set of similar compounds, it will be less useful for novel drug discovery. Therefore, uploading a training data set with diverse structures is recommended.

The CHARMMing SAR/QSAR tool uses the Random Forest (RF) algorithm, which is widely regarded as one of the more accurate modern machine learning methods. It is an ensemble classifier, that is, a collection of simpler individual classifiers; in the case of Random Forest these are decision trees. A decision tree can be thought of as an algorithmic implementation of the 20 questions game: each node of the tree is a question such as ‘is feature N smaller than the threshold T?’, and the branches are the ‘Yes/No’ answers. The leaves are the prediction results, such as ‘Active/Inactive’. The prediction of the whole forest can be an average of the individual tree predictions. (Breiman L. Mach. Learn. 2001, 45, 5. http://oz.berkeley.edu/~breiman/randomforest2001.pdf ; Weidlich IE, Filippov IV, Brown J, Kaushik-Basu N, Krishnan R, Nicklaus MC, Thorpe IF. Bioorg. Med. Chem. 2013, 21 (11), 3127–3137. http://www.sciencedirect.com/science/article/pii/S0968089613002460 ; http://en.wikipedia.org/wiki/Random_forest)
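The "20 questions" picture can be made concrete with a minimal sketch. The three trees below are written by hand purely for illustration (a real forest is learned from training data and, in this tool, contains 500 trees), and the tuple representation of a node is an assumption of this example, not the tool's internal format.

```python
# A tree is either a leaf (a label string) or a question node:
# (feature_index, threshold, subtree_if_yes, subtree_if_no),
# where the question is "is feature[feature_index] smaller than threshold?".

def predict_tree(tree, features):
    """Walk from the root to a leaf, answering one question per node."""
    while isinstance(tree, tuple):
        index, threshold, if_yes, if_no = tree
        tree = if_yes if features[index] < threshold else if_no
    return tree


def predict_forest(trees, features, active_label="Active"):
    """Average the tree predictions: the fraction of trees voting 'Active'."""
    votes = [predict_tree(tree, features) for tree in trees]
    return sum(1 for vote in votes if vote == active_label) / len(votes)


# Hand-written toy forest over a 3-bit fingerprint (bits are 0 or 1,
# so "smaller than 0.5" means "bit is not set").
forest = [
    (0, 0.5, "Inactive", "Active"),
    (1, 0.5, "Active", "Inactive"),
    (2, 0.5, "Inactive", (0, 0.5, "Inactive", "Active")),
]

print(predict_forest(forest, [1, 0, 1]))  # every tree votes Active
print(predict_forest(forest, [1, 1, 0]))  # only the first tree votes Active
```

The averaged vote can be read as an activity score between 0 and 1, which is then compared with a threshold to call a compound active or inactive.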

This SAR/QSAR module uses 2048-bit Morgan fingerprints of radius 2 as features and employs a Random Forest model consisting of 500 trees. A Morgan fingerprint is a bit vector in which bits are set according to the molecular neighborhoods of atoms. (http://code.google.com/p/rdkit/wiki/ExplainingMorganFingerprintBits) It is similar to the Extended Connectivity Fingerprints used in Accelrys Pipeline Pilot (http://accelrys.com/products/datasheets/chemistry-collection.pdf) and to the MNA descriptors used in PASS. (http://bioinformatics.oxfordjournals.org/content/16/8/747.full.pdf)

More information is available in ''Bioorg. Med. Chem.'' 2013, 21 (11), 3127–3137. (http://www.sciencedirect.com/science/article/pii/S0968089613002460)

Preparing and uploading the data
For the purposes of machine modeling, the activity should be expressed as a binary value (Y/N) for SAR or as a numerical value for QSAR. Before uploading, it is recommended to estimate the diversity of the whole training set, as well as of the active and inactive subsets separately.
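The source does not prescribe a particular diversity measure. One common choice, shown here only as a sketch, is the mean pairwise Tanimoto distance between fingerprints. The example assumes each fingerprint is already available as a Python set of "on" bit positions; the compound names and bit sets are invented for illustration.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def mean_pairwise_distance(fingerprints):
    """Average of (1 - Tanimoto) over all pairs; 0 means no diversity at all."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Invented toy training set: name -> (on-bit set, activity label).
training_set = {
    "cmpd-1": ({1, 2, 3}, "Y"),
    "cmpd-2": ({2, 3, 4}, "Y"),
    "cmpd-3": ({10, 11}, "N"),
    "cmpd-4": ({10, 12}, "N"),
}

whole = mean_pairwise_distance([fp for fp, _ in training_set.values()])
actives = mean_pairwise_distance([fp for fp, y in training_set.values() if y == "Y"])
inactives = mean_pairwise_distance([fp for fp, y in training_set.values() if y == "N"])
print(whole, actives, inactives)
```

A subset whose value is much lower than that of the whole set is dominated by close analogues, which is the situation the recommendation above is meant to detect.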

Training procedure
To create a predictive SAR/QSAR model, the module uses an in-house Python script based on the Random Forest machine learning (http://en.wikipedia.org/wiki/Random_forest) and chemistry modules (ML and Chem) of RDKit (http://rdkit.org/). RDKit is an open source toolkit for cheminformatics and machine learning written in C++ and Python (http://code.google.com/p/rdkit/).

It is a well-known problem that some computational models overfit the data (Tropsha A. Mol. Inf. 2010, 29, 476–488. http://onlinelibrary.wiley.com/doi/10.1002/minf.201000061/abstract); that is, instead of making predictions, the model merely "recalls" compounds very similar to those seen in the training set. Based on tests of overfitting tendency by the Y-randomization procedure, RDKit appears to be a good choice for avoiding this problem. (http://www.mathe2.uni-bayreuth.de/markus/pdf/pub/YRandQsar.pdf) Y-randomization is a tool used in the validation of machine learning models. The performance of the original model in data description (AUC or R²) is compared with that of models built for a randomly shuffled response, using the original descriptor pool and the original model building procedure.
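The Y-randomization idea can be sketched in a few lines. The "model" here is a deliberately trivial one-descriptor threshold rule scored by training accuracy, standing in for the real Random Forest and its AUC or R² score; the function names and the toy data are assumptions of this example.

```python
import random


def stump_accuracy(values, labels):
    """Toy stand-in for model building: the best training accuracy of the rule
    'predict active (1) when value >= threshold' over all candidate thresholds."""
    thresholds = sorted(set(values)) + [float("inf")]
    best = 0.0
    for t in thresholds:
        correct = sum(1 for v, y in zip(values, labels) if (v >= t) == (y == 1))
        best = max(best, correct / len(labels))
    return best


def y_randomization(fit_and_score, values, labels, n_rounds=20, seed=0):
    """Repeat the same model-building procedure on shuffled responses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # descriptors stay fixed; only the response is permuted
        scores.append(fit_and_score(values, shuffled))
    return scores


# Invented data: one descriptor that separates actives from inactives perfectly.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

original_score = stump_accuracy(values, labels)
random_scores = y_randomization(stump_accuracy, values, labels)
mean_random = sum(random_scores) / len(random_scores)
print(original_score, mean_random)
```

If the scores obtained with shuffled responses come close to the original score, the procedure is fitting noise rather than a real structure-activity relationship.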

The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve (http://en.wikipedia.org/wiki/Receiver_operating_characteristic) is used to measure the performance of the model in four cases: self-prediction, cross-validation, Y-randomization and prediction for the validation set (if present). Self-prediction means that the prediction is run on the same file that was used for training. Cross-validation is a common method of validating a SAR/QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. In our case, 5-fold cross-validation is performed by randomly splitting the training set into training (80%) and validation (20%) sets. The procedure is repeated 5 times and the average AUC is reported. It is important to verify that the Y-randomization result is significantly lower than the self-prediction AUC and the cross-validation AUC; otherwise the model is likely overfitting the data. A high self-prediction AUC is common and should not by itself be considered an indication of model quality; a low self-prediction AUC, however, is certainly a reason for alarm. Cross-validation is the best indicator of a model's predictive ability in the absence of external test sets. A high cross-validation AUC and a low Y-randomization AUC are indicators of a good model, though by no means a guarantee of one.
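The two ingredients of this validation scheme, the AUC itself and the 80%/20% splitting, can be sketched as follows. The AUC is computed from its rank interpretation (the probability that a randomly chosen active is scored above a randomly chosen inactive). The splitting function assumes the five validation sets are disjoint folds, which is one reading of the description above; the function names and toy numbers are assumptions of this example, not the tool's code.

```python
import random


def roc_auc(scores, labels):
    """AUC as the probability that a random active (label 1) is scored higher
    than a random inactive (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every item is held out exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    for fold in range(k):
        valid = order[fold::k]  # every k-th item of the shuffled order
        held_out = set(valid)
        train = [i for i in order if i not in held_out]
        yield train, valid


perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
inverted = roc_auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
folds = list(k_fold_indices(10, k=5))
print(perfect, inverted, [len(valid) for _, valid in folds])
```

In a full run, a model would be trained on each `train` index list, scored on the matching `valid` list with `roc_auc`, and the five AUC values averaged.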

For regression modeling, the square of the correlation coefficient, R², is used to estimate model performance in the same way as AUC is used for categorization. (http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
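As a worked example, the squared Pearson correlation between observed and predicted activities can be computed as below. The function name and the numbers are invented for illustration.

```python
def pearson_r_squared(observed, predicted):
    """Square of the Pearson correlation coefficient between two value lists."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return (cov * cov) / (var_o * var_p)


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]
r2 = pearson_r_squared(observed, predicted)
print(r2)  # 0.64: covariance sum 4.0, both variance sums 5.0
```

Because the coefficient is squared, a perfectly anti-correlated prediction also gives a value of 1, so it is worth checking the sign of the correlation when a value looks suspiciously good.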

It is also easy to validate the model on an external validation set: if the tool finds activity values already present in the “prediction” file, it automatically computes the precision and recall (http://en.wikipedia.org/wiki/Precision_and_recall) measures for categorization, or R² for regression.
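For reference, precision and recall for a Y/N prediction can be computed as in the following sketch. The function name and the example labels are invented; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(actual, predicted, positive="Y"):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


actual = ["Y", "Y", "N", "N", "Y"]
predicted = ["Y", "N", "Y", "N", "Y"]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)  # two true positives, one false positive, one false negative
```

Precision answers "how many of the predicted actives are really active", while recall answers "how many of the real actives were found".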

Prediction procedure
The user uploads an SD file containing the structures whose activity is to be predicted. Once compounds with predicted activity scores above the threshold for active compounds have been identified, the user can confirm those with the highest predicted scores experimentally. It is always imperative that a SAR/QSAR computational prediction be supported by thorough experimental testing. Please contact us and share your experience with the service; this will help us improve it.
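Selecting candidates for experimental follow-up from a table of predicted scores might look like the sketch below. The threshold of 0.5, the compound names and the scores are assumptions made for illustration; the source does not state the threshold used by the tool.

```python
def shortlist(predictions, threshold=0.5, top_n=10):
    """Keep compounds scored above the threshold, highest score first."""
    hits = [(name, score) for name, score in predictions.items() if score > threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:top_n]


# Invented predicted activity scores.
predictions = {"cmpd-1": 0.91, "cmpd-2": 0.42, "cmpd-3": 0.77, "cmpd-4": 0.66}
candidates = shortlist(predictions, top_n=2)
print(candidates)
```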

The new SAR/QSAR tool can be used as a stand-alone utility or as a supporting filter for the docking procedure.