SAR Categorization Lesson
This lesson is a guided example with a detailed explanation of each step of the process and a definite right/wrong answer.
The activity should be expressed as a binary value, for example Y/N; any other pair of values is also recognized automatically for the purposes of machine modeling. More information is available in the “Uploading the data” section of the SAR/QSAR Introduction.
Step 1: Uploading the training data set
The user uploads the original training data set, which is diverse and contains an activity property (called “Activity” in this example). The accepted format is an sd file.
If the uploaded file does not contain the required properties or is not written in sd format, the user will see an error message and will have to upload a correct file.
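The kind of check described above can be reproduced locally before uploading. The following is a minimal sketch, not the tool's own validation: it assumes the standard sd layout in which records end with a `$$$$` line and each data field starts with a header line such as `> <Activity>`. The function name `sd_records_missing_property` is hypothetical.

```python
import re

def sd_records_missing_property(sd_text, prop_name):
    """Return the 1-based positions of sd records that lack the data field `prop_name`."""
    # Records are terminated by a line consisting of "$$$$".
    records = [r for r in re.split(r"(?m)^\$\$\$\$[ \t]*$", sd_text) if r.strip()]
    # A data field header is a line starting with ">" that contains "<NAME>".
    header = re.compile(r"(?m)^>.*<" + re.escape(prop_name) + r">")
    return [i for i, rec in enumerate(records, start=1) if not header.search(rec)]
```

An empty result means every record carries the property; a non-empty result lists the records that would need fixing before the upload.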
Step 2: Choosing the property in the training data set that contains the activity
The user chooses the property that contains the activity from the pull-down menu and clicks Next.
If the user chooses a property other than the activity, he/she will see an error message and will have to choose the correct property.
The QSAR job history page will inform the user about the status of the job: Running or Done.
To see the model attributes, the user needs to click “View Model”.
The SAR/QSAR tool automatically verifies new models by using well-known machine learning techniques such as cross-validation and y-randomization, so users can immediately see whether the created model is able to produce valid predictions. This is an important and often missed step in QSAR modeling. For categorical modeling, the user is presented with AUC measurements for the training set and for the y-randomized set, and with an average AUC for 5-fold cross-validation. The output of a prediction consists of a prediction score, an “active/inactive” label and the recommended threshold. The threshold is automatically recommended based on the balance between recall and precision. Recall and precision for the training set are also displayed for the user.
Recommended threshold - The Random Forest method returns a numerical “score”; to get a binary prediction, a threshold must be chosen to separate actives from inactives. The recommended threshold is calculated automatically by balancing precision and recall, that is, by choosing the value at which precision and recall are as close to each other as possible. The original score is also available for the model runs, so the user can select his/her own threshold if desired.
Self AUC - The Area Under the Curve (AUC) is used to measure the performance of the model when it predicts its own training set. A low value signifies an unsuitable model that cannot even reproduce its training values, but a high value does not necessarily warrant high confidence.
Y-randomization - This procedure tests the overfitting tendencies of the model by rebuilding it on the training set with randomly shuffled activity values. The resulting value should be close to 50% for AUC or to 0 for R2. If the value is close to the self AUC or cross-validation values, the model overfits and is unsuitable for use.
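The self AUC and y-randomization checks can be illustrated with a small sketch. It is not the tool's implementation: `fit` stands for any hypothetical training routine that returns a scoring function, and `fit_memorizer` is a deliberately bad toy model (it only memorizes its training labels) used here in place of the Random Forest method.

```python
import random

def roc_auc(scores, labels):
    """Probability that a random active outscores a random inactive (ties count half)."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive compound")
    wins = sum(1.0 if a > i else 0.5 if a == i else 0.0
               for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))

def y_randomized_auc(descriptors, labels, fit, seed=0):
    """Shuffle the labels, refit, and score the model on its own shuffled training set."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    model = fit(descriptors, shuffled)   # hypothetical: returns descriptor -> score
    scores = [model(x) for x in descriptors]
    return roc_auc(scores, shuffled)

def fit_memorizer(descriptors, labels):
    """Toy model that memorizes its training labels (an extreme overfitter)."""
    table = dict(zip(descriptors, labels))
    return lambda x: float(table[x])

# Made-up one-descriptor data set: 1 = active, 0 = inactive.
descriptors = [1.2, 3.4, 0.7, 2.8, 5.1, 4.4]
labels = [1, 0, 1, 0, 1, 0]

model = fit_memorizer(descriptors, labels)
self_auc = roc_auc([model(x) for x in descriptors], labels)
y_rand_auc = y_randomized_auc(descriptors, labels, fit_memorizer)
print(self_auc, y_rand_auc)
```

For this memorizing model both values come out at 1.0: the y-randomized AUC is as high as the self AUC, which is exactly the pattern that marks a model as overfitted.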
5-fold cross-validation - Cross-validation is performed by randomly splitting the training set into training (80%) and validation (20%) sets; the model is built on the training part and evaluated on the validation part. The procedure is repeated 5 times and the average AUC is reported.
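The splitting scheme can be sketched as follows. This assumes the conventional reading of 5-fold cross-validation, in which the shuffled training set is cut into five parts and each part serves as the 20% validation set exactly once; the tool may draw its splits differently. `auc_for_split` is a hypothetical callback that builds a model on the training indices and returns the AUC on the validation indices.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) pairs covering every item once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation

def cross_validated_auc(n_items, auc_for_split, n_folds=5, seed=0):
    """Average the AUC returned by `auc_for_split(train, validation)` over all folds."""
    aucs = [auc_for_split(train, validation)
            for train, validation in five_fold_splits(n_items, n_folds, seed)]
    return sum(aucs) / len(aucs)
```

With 10 compounds, each of the five rounds trains on 8 and validates on 2.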
Recall - Fraction of the real active compounds that are predicted active (when the recommended threshold is applied to the predicted score).
Precision - Fraction of the predicted active compounds that are real actives (when the recommended threshold is applied to the predicted score).
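Recall, precision and the balancing rule behind the recommended threshold can be sketched together. This is an illustration under stated assumptions, not the tool's algorithm: a compound counts as predicted active when its score is strictly above the threshold, candidate thresholds are taken midway between neighbouring distinct scores, and the function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores above `threshold` are called active."""
    predicted_active = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted_active, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted_active, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted_active, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def recommend_threshold(scores, labels):
    """Pick the cut-off at which precision and recall are closest to each other."""
    distinct = sorted(set(scores))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    best_gap, best_threshold = None, None
    for t in candidates:
        precision, recall = precision_recall(scores, labels, t)
        if precision == 0.0 and recall == 0.0:
            continue  # no true positives at this cut-off, not a useful balance
        gap = abs(precision - recall)
        if best_gap is None or gap < best_gap:
            best_gap, best_threshold = gap, t
    return best_threshold  # None if no usable cut-off exists

# Made-up scores and labels: 1 = active, 0 = inactive.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
threshold = recommend_threshold(scores, labels)
precision, recall = precision_recall(scores, labels, threshold)
```

For this made-up data the chosen cut-off lies between 0.4 and 0.7, where two of the three real actives are found and two of the three predicted actives are real, so precision and recall are both 2/3.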
Step 3: Prediction
At this point the user can upload an sd file with the molecules whose activity he/she would like to predict and click “Predict”.
If the user uploads a file that is not in sd format, he/she will see an error message.
The user can also upload an sd file from the “View Models” panel (“Available QSAR Models”), then click “Predict” to predict the activity.
Looking at the values, we can see that the self AUC and the cross-validation AUC are about 88% and 87%, respectively; both are high and close to each other. At the same time, the y-randomization AUC is only 70%, which is lower than the other two values. We can conclude that the model possesses some predictive ability and does not overfit the data.
The results are available to download from the “Download Results” menu.
Predicted activity and predicted score are presented in the downloaded sd file for each structure.
The predicted score (0.104) is above the recommended threshold (0.046), so the structure was predicted to be active. The name of the predicted activity property is constructed from the original property name (“Activity” in our example) by prefixing it with PREDICTED_.
To compare the built models, the user can select the “View Models” hyperlink.
To view all jobs, the user may click “View Jobs/Results”.
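The labeling and naming rule from this example can be written out as a short sketch. The function name `predicted_property` is hypothetical, and treating a score exactly equal to the threshold as inactive is an assumption; the lesson only states the outcome for a score above the threshold.

```python
def predicted_property(property_name, score, threshold):
    """Return the output property name and the active/inactive label for one structure."""
    name = "PREDICTED_" + property_name
    # Assumption: only scores strictly above the threshold are labeled active.
    label = "active" if score > threshold else "inactive"
    return name, label

name, label = predicted_property("Activity", 0.104, 0.046)
```

With the values from this lesson, `name` is `PREDICTED_Activity` and `label` is `active`.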