QSAR Regression Lesson
This lesson is a guided example that explains each step of the process in detail and has a definite right or wrong answer.
The activity should be expressed as a numerical value for the purposes of machine modeling. All values should be expressed in the same units (e.g. all in nM or all in µM). More information is available in the “Uploading the data” section of the SAR/QSAR Introduction. Note that only the property holding the activity value (STANDARD_VALUE in this example) is required for model training.
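As a minimal sketch of bringing all activities to the same units before upload, the following converts concentrations to nM. The function name and the table of units are illustrative assumptions and are not part of the QSAR tool.

```python
# Conversion factors to nanomolar (nM). The set of units is an assumption
# made for this sketch; extend it to whatever units your data uses.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "µM": 1e3, "mM": 1e6, "M": 1e9}

def to_nanomolar(value, unit):
    """Convert a concentration given in `unit` to nM.

    Raises KeyError for a unit that is not in TO_NM, so that mixed or
    unknown units are noticed instead of silently passed through.
    """
    return float(value) * TO_NM[unit]

# Hypothetical activities reported in mixed units, all converted to nM.
mixed = [(2.5, "µM"), (400, "nM"), (0.5, "mM")]
in_nm = [to_nanomolar(value, unit) for value, unit in mixed]
print(in_nm)
```

Converting once, before the sd file is written, keeps the uploaded property free of unit suffixes and therefore purely numerical.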
Step 1: Uploading the data
The user uploads the original training data set, which should be diverse and contain an activity property with numerical values. The accepted format is an sd file.
If the uploaded file does not contain the required properties or is not in sd format, the user will see an error message and will have to upload a corrected file.
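A file can be checked locally before upload. The sketch below is not the tool's own validation, which is not described here. It assumes the usual sd layout in which every record ends with a line containing $$$$ and every property is written as a "> <NAME>" header line followed by its value and a blank line. The function names and the sample records are hypothetical.

```python
import re

def read_sd_properties(text):
    """Return one {property_name: value} dict per complete record.

    Minimal parser for illustration: it ignores the structure block and
    drops a trailing record that is not closed by a $$$$ line.
    """
    records, props, lines = [], {}, text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        header = re.match(r">.*<([^<>]+)>", line)
        if line.strip() == "$$$$":          # end of the current record
            records.append(props)
            props = {}
        elif header:                        # e.g. ">  <STANDARD_VALUE>"
            values = []
            i += 1
            while i < len(lines) and lines[i].strip() and lines[i].strip() != "$$$$":
                values.append(lines[i].strip())
                i += 1
            props[header.group(1)] = "\n".join(values)
            continue                        # re-examine the line that ended the value
        i += 1
    return records

def records_missing(records, name):
    """Return 1-based numbers of the records that lack property `name`."""
    return [n for n, props in enumerate(records, start=1) if name not in props]

# Two made-up records with empty structure blocks; the second has no activity.
SAMPLE = """\
mol1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <STANDARD_VALUE>
1593

$$$$
mol2
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <COMMENT>
no activity recorded

$$$$
"""

records = read_sd_properties(SAMPLE)
print(records_missing(records, "STANDARD_VALUE"))
```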
Step 2: Choice of property
The user chooses the property with the numerical activity value from the pull-down menu and clicks Next.
If the selected property does not contain numerical values, the user will have to correct the selection.
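Whether a property is purely numerical can be checked in advance. The following is a sketch under the assumption that a value counts as numerical when it parses as a finite floating-point number; the tool's exact rule is not stated in this lesson, and the helper names are hypothetical.

```python
import math

def is_numeric(text):
    """True if `text` parses as a finite number (no NaN or infinity)."""
    try:
        return math.isfinite(float(text))
    except (TypeError, ValueError):
        return False

def non_numeric(values):
    """Return the values that would make a property unusable for regression."""
    return [v for v in values if not is_numeric(v)]

# Made-up property values as they might appear in an sd file.
values = ["1593", "2.5e3", ">10000", "inactive", ""]
print(non_numeric(values))
```

Qualified values such as ">10000" are a common reason for a property to fail this kind of check.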
The QSAR job history page will inform the user about the status of the job, specifically whether it is running or has completed.
To see the model attributes the user needs to click on “View Model”.
For regression modeling the process is similar to the categorization process, except that R2 is used instead of AUC and recall and precision are not reported.
The QSAR tool automatically verifies new models by using well-known machine learning techniques such as cross-validation and y-randomization, so users can immediately see whether the created model is able to make valid predictions. This is an important and often missed step in QSAR modeling. For regression modeling, the user is presented with R2 for the training set, for the y-randomized set, and for 5-fold cross-validation.
Self R2 - R2 is used to measure the performance of the model for self-prediction. A low value signifies an unsuitable model which cannot even predict its own training values, but a high score does not necessarily warrant high confidence.
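For reference, R2 is sketched below as the coefficient of determination, 1 - SS_res / SS_tot. This is an assumption about the definition; the tool may use a different one (for example the squared correlation coefficient), which this lesson does not specify.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all observed values are equal.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
print(r_squared(observed, observed))     # predictions identical to the data
print(r_squared(observed, [2.5] * 4))    # always predicting the mean
```

With this definition a model that reproduces its training values exactly scores 1, and a model that only ever predicts the mean activity scores 0.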
Y-randomization - This procedure tests the overfitting tendencies of the model. The value should be close to 0 for R2. If the value is close to the self R2 or cross-validation values, the model overfits and is unsuitable for use.
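The idea of y-randomization can be sketched as follows: shuffle the activity values so that they no longer correspond to their structures, refit the model, and measure R2 again. The one-descriptor straight-line model, the synthetic data, and the number of rounds are all assumptions chosen to keep the example self-contained; the tool's actual learning method is not described in this lesson.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def self_r2(xs, ys):
    """R2 of the fitted model on its own training data."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(ys, [slope * x + intercept for x in xs])

def y_randomized_r2(xs, ys, rounds=20, seed=0):
    """Average self R2 after shuffling the activities `rounds` times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        shuffled = list(ys)
        rng.shuffle(shuffled)            # break the link between x and y
        scores.append(self_r2(xs, shuffled))
    return sum(scores) / len(scores)

# Synthetic data: one descriptor with a nearly linear activity.
xs = list(range(30))
ys = [2.0 * x + 1.0 + 0.2 * ((7 * x) % 5 - 2) for x in xs]

print("self R2:", self_r2(xs, ys))
print("y-randomized R2:", y_randomized_r2(xs, ys))
```

A model with many adjustable parameters can fit shuffled activities almost as well as the real ones, which is the situation the y-randomization value is meant to expose.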
5-fold cross-validation - Cross-validation is performed by randomly splitting the training set into training (80%) and validation (20%) sets. The procedure is repeated 5 times and the average R2 is reported.
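A sketch of 5-fold cross-validation is shown below. It assumes the standard scheme in which the data are shuffled once and split into five parts, each part serving as the 20% validation set exactly once; whether the tool partitions the data this way or draws five independent random splits is not stated here. The straight-line model and the synthetic data are placeholders for illustration only.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def cross_validated_r2(xs, ys, folds=5, seed=0):
    """Average validation R2 over `folds` train/validation splits."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    scores = []
    for k in range(folds):
        valid = indices[k::folds]                  # about 1/folds of the data
        held_out = set(valid)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        observed = [ys[i] for i in valid]
        predicted = [slope * xs[i] + intercept for i in valid]
        scores.append(r_squared(observed, predicted))
    return sum(scores) / folds

# Synthetic data: one descriptor with a nearly linear activity.
xs = list(range(30))
ys = [2.0 * x + 1.0 + 0.2 * ((7 * x) % 5 - 2) for x in xs]

print("5-fold cross-validated R2:", cross_validated_r2(xs, ys))
```

Because every validation compound is predicted by a model that never saw it, this value is usually lower than the self R2 and is a better indication of how the model will behave on new structures.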
Note that in this particular example the model has performed poorly (high Y-randomization value and low cross-validation) and normally we would not recommend using such a model for prediction. For the sake of example though we will proceed with this model.
Step 3: Activity prediction
At this point the user can upload an sd file containing the structures whose activity is to be predicted and click “Predict”.
The results are available to download from the "Download Results" page.
Predicted activity is presented in the downloaded sd file for each structure.
In this example the model predicted an activity of 1593 nM for molecule number 3 in the submitted sd file. The name of the predicted activity property is constructed from the original property name (STANDARD_VALUE in our example) by prefixing it with PREDICTED_.
The user might want to validate the regression model on an external validation set. If the tool finds the activity property (STANDARD_VALUE in our example) already present in the "prediction" file, it will automatically compute R2 in addition to writing PREDICTED_STANDARD_VALUE.
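The same external R2 can be recomputed from the downloaded results as a cross-check. The sketch below assumes the property values have already been read from the sd file into one dictionary per structure, and it uses the coefficient of determination as the definition of R2, which may differ from the tool's. The function names and all activity values are invented for illustration.

```python
def predicted_name(prop):
    """Name of the prediction property, e.g. PREDICTED_STANDARD_VALUE."""
    return "PREDICTED_" + prop

def external_r2(records, prop="STANDARD_VALUE"):
    """R2 between measured and predicted values.

    Structures lacking either the measured or the predicted property
    are skipped.
    """
    pred = predicted_name(prop)
    pairs = [(float(r[prop]), float(r[pred]))
             for r in records if prop in r and pred in r]
    observed = [o for o, _ in pairs]
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in pairs)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Invented values standing in for a downloaded results file.
records = [
    {"STANDARD_VALUE": "100", "PREDICTED_STANDARD_VALUE": "110"},
    {"STANDARD_VALUE": "200", "PREDICTED_STANDARD_VALUE": "190"},
    {"STANDARD_VALUE": "300", "PREDICTED_STANDARD_VALUE": "300"},
    {"PREDICTED_STANDARD_VALUE": "250"},   # no measured value: skipped
]

print(external_r2(records))
```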
To compare built models the user can click on “View Models”.
The user can also upload the sd file for prediction from the “View Models” “Available QSAR Models” menu and click “Predict”.
To view all jobs the user clicks “View Jobs/Results”.