Pa (probability "to be active") estimates the chance that the studied compound belongs to the sub-class of active compounds (that is, it resembles the structures of the molecules most typical of the sub-set of "actives" in the PASS training set).
Pi (probability "to be inactive") estimates the chance that the studied compound belongs to the sub-class of inactive compounds (that is, it resembles the structures of the molecules most typical of the sub-set of "inactives" in the PASS training set).
IAP (Invariant Accuracy of Prediction) is the average accuracy of prediction obtained for the whole PASS training set in the leave-one-out cross-validation procedure.
IAP is numerically equal to ROC AUC.
The leave-one-out cross-validation (LOO CV) procedure is performed on the whole PASS training set to validate the quality of prediction. The biological activity spectrum is predicted for each compound using the structure-activity relationships calculated from the data for all other compounds. The prediction result is compared with the known experimental data for the studied compound. The procedure is repeated for all compounds in the PASS training set; the average Invariant Accuracy of Prediction (IAP = 1 - IEP) values are then calculated for each biological activity and for all biological activities.
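Because IAP is numerically equal to ROC AUC, it can be read as the share of (active, inactive) pairs in which the active compound receives the higher leave-one-out score. The sketch below illustrates that reading for a single activity. The function name and the score values are hypothetical and only serve the illustration; this is not the PASS implementation.

```python
from itertools import product


def invariant_accuracy(active_scores, inactive_scores):
    """Share of (active, inactive) pairs ranked correctly; ties count as half.

    This is the pairwise definition of ROC AUC, used here as a stand-in
    for IAP (the two are stated to be numerically equal).
    """
    if not active_scores or not inactive_scores:
        raise ValueError("need at least one active and one inactive score")
    wins = 0.0
    for a, i in product(active_scores, inactive_scores):
        if a > i:
            wins += 1.0
        elif a == i:
            wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical leave-one-out scores for one activity (assumed values).
loo_active = [0.9, 0.7, 0.4]
loo_inactive = [0.6, 0.3, 0.2, 0.1]

iap = invariant_accuracy(loo_active, loo_inactive)
iep = 1.0 - iap  # IAP = 1 - IEP
print(f"IAP = {iap:.3f}, IEP = {iep:.3f}")
```

Averaging such per-activity values over all activities would give the overall figure described above.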
Only activities with Pa > Pi are considered as possible for a particular compound.
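As a minimal sketch of this selection rule, the snippet below filters a predicted spectrum down to the activities with Pa > Pi. The activity names and values are made up, and sorting by descending Pa is an added convenience, not something the rule itself requires.

```python
def possible_activities(predictions):
    """Keep (name, Pa, Pi) rows with Pa > Pi, ordered by descending Pa."""
    kept = [row for row in predictions if row[1] > row[2]]
    return sorted(kept, key=lambda row: row[1], reverse=True)


# Hypothetical predicted spectrum: (activity name, Pa, Pi).
predictions = [
    ("Activity B", 0.35, 0.12),
    ("Activity C", 0.10, 0.40),  # Pa < Pi, so it is dropped
    ("Activity A", 0.82, 0.01),
]

shortlist = possible_activities(predictions)
for name, pa, pi in shortlist:
    print(f"{name}: Pa={pa:.2f}, Pi={pi:.2f}")
```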
It is important to remember that the probability Pa primarily reflects the similarity of the molecule under prediction to the structures of the molecules most typical of the sub-set of "actives" in the training set. Therefore, there is usually no direct correlation between the Pa values and the quantitative characteristics of the activities.
Even an active and potent compound whose structure is not typical of the "actives" in the training set may obtain a low Pa value, and even Pa < Pi, during the prediction. This follows from the way the functions Pa(B) and Pi(B) are constructed: the Pa values for "actives" and the Pi values for "inactives" are uniformly distributed. Taking this into account, the following interpretation of prediction results is possible.
If, for instance, the Pa value equals 0.9, then for 90% of the "actives" in the training set the B values are lower than for this compound, and for only 10% of the "actives" they are higher. If we reject the suggestion that this compound is active, we will make a wrong decision with probability 0.9.
If the Pa value is less than 0.5 but Pa > Pi, then for more than half of the "actives" in the training set the B values are higher than for this compound. If we reject the suggestion that this compound is active, we will make a wrong decision with probability less than 0.5. In this case the probability of confirming this kind of activity in an experiment is small, but if it is confirmed, there is a more than 50% chance that the structure is highly novel and may become a New Chemical Entity (NCE).
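The percentile reading of Pa given above can be sketched as follows, under the simplifying assumption that Pa is the empirical share of training "actives" whose B value lies below that of the studied compound. The B values are invented, and the actual construction of Pa(B) and Pi(B) in PASS may differ in detail.

```python
from bisect import bisect_left


def empirical_pa(b_value, b_actives):
    """Share of training 'actives' whose B value is strictly below b_value.

    Assumption: Pa is treated as an empirical distribution function over
    the B values of the 'actives'; this is an illustration, not PASS code.
    """
    if not b_actives:
        raise ValueError("need at least one B value for the actives")
    ordered = sorted(b_actives)
    return bisect_left(ordered, b_value) / len(ordered)


# Hypothetical B values of ten 'actives' from a training set.
b_actives = [-0.6, -0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.5, 0.7]

pa_typical = empirical_pa(0.6, b_actives)    # 9 of 10 actives lie below
pa_atypical = empirical_pa(-0.3, b_actives)  # only 2 of 10 actives lie below
print(pa_typical, pa_atypical)
```

In this toy setting the first compound scores higher than 90% of the "actives", while the second scores higher than only 20% of them, which corresponds to the low-Pa case discussed above.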
If the predicted biological activity spectrum is wide, the structure of the compound is quite simple and does not contain the features that are responsible for selectivity of biological action.
If it appears that the structure under prediction contains a few new MNA descriptors (in comparison with the descriptors from the compounds of the training set), then the structure has low similarity with any structure from the training set, and the results of prediction should be considered as very rough estimates.
Based on these criteria, one may choose which activities to test for the studied compounds as a compromise between the novelty of the pharmacological action and the risk of obtaining a negative result in experimental testing. One will, of course, also take into account a particular interest in certain kinds of activity, the available experimental facilities, etc.
The number of new MNA descriptors in a tested molecule may be used to estimate the applicability domain: the higher the percentage of new MNA descriptors, the less appropriate the molecular structure is for the model. The most accurate prediction is achieved for molecules without new MNA descriptors. We analyzed how the percentage of new MNA descriptors correlates with the accuracy of prediction calculated by the leave-one-out cross-validation procedure. The results are shown below:
Percent of new MNA descriptors | AUC |
0-5 % | 0.786 |
5-10 % | 0.775 |
10-15 % | 0.796 |
15-20 % | 0.756 |
20-25 % | 0.737 |
25-30 % | 0.668 |
30-35 % | 0.731 |
35-40 % | 0.667 |
40-99 % | 0.276 |
We consider that the tested molecules will be in the applicability domain if they have up to 25% new MNA descriptors.
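A minimal sketch of this applicability-domain check is given below. It assumes that MNA descriptors can be treated as plain strings and that the percentage is taken over the unique descriptors of the molecule; both the function names and the placeholder descriptor strings are hypothetical, and PASS may count descriptors differently.

```python
def percent_new_descriptors(molecule_descriptors, training_descriptors):
    """Percentage of the molecule's unique descriptors absent from the training set."""
    unique = set(molecule_descriptors)
    if not unique:
        raise ValueError("molecule has no descriptors")
    new = unique - set(training_descriptors)
    return 100.0 * len(new) / len(unique)


def in_applicability_domain(molecule_descriptors, training_descriptors,
                            threshold=25.0):
    """True if at most `threshold` percent of the descriptors are new."""
    share = percent_new_descriptors(molecule_descriptors, training_descriptors)
    return share <= threshold


# Placeholder strings standing in for MNA descriptors (not real ones).
training = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
familiar_molecule = ["d1", "d2", "d3", "x1"]  # 1 of 4 is new: 25 %
unusual_molecule = ["d1", "x1", "x2"]         # 2 of 3 are new: about 67 %

print(in_applicability_domain(familiar_molecule, training))
print(in_applicability_domain(unusual_molecule, training))
```

The 25% default mirrors the threshold stated above; a stricter cut-off could be passed through the `threshold` argument.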