GUSAR software was developed to create QSAR/QSPR models from training sets supplied as SD files that contain chemical structures and quantitative endpoint values.
QSAR MODELING ON THE BASIS OF QNA DESCRIPTORS
QNA (Quantitative Neighbourhoods of Atoms) descriptors are P and Q values calculated for each atom of a molecule. The calculation of P and Q values is based on the connectivity matrix (C) and the standard values of ionization potential (IP) and electron affinity (EA) of the atoms in the molecule [1]. The target property of a chemical compound is estimated as the mean value, over the atoms of the molecule, of a function of their P and Q values in the QNA descriptor space. We proposed approximating this function with two-dimensional Chebyshev polynomials. The independent regression variables are therefore calculated as the average values of particular two-dimensional Chebyshev polynomials of the P and Q values over the atoms of the molecule.
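The averaging step can be illustrated with a minimal sketch. It assumes that the P and Q values have already been calculated and scaled to the interval [-1, 1], and it takes products T_i(P)·T_j(Q) with 1 ≤ i + j ≤ max_degree as the "particular" polynomials. The function names, the degree cut-off and the input values are hypothetical and are not taken from GUSAR.

```python
from itertools import product


def chebyshev_t(n, x):
    """Chebyshev polynomial of the first kind T_n(x), computed with the
    recurrence T_0 = 1, T_1 = x, T_(k+1) = 2*x*T_k - T_(k-1)."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


def qna_regressors(p_values, q_values, max_degree=2):
    """Average T_i(P) * T_j(Q) over all atoms for every degree pair (i, j)
    with 1 <= i + j <= max_degree (the cut-off is an assumption)."""
    n_atoms = len(p_values)
    features = {}
    for i, j in product(range(max_degree + 1), repeat=2):
        if i + j == 0 or i + j > max_degree:
            continue
        total = sum(
            chebyshev_t(i, p) * chebyshev_t(j, q)
            for p, q in zip(p_values, q_values)
        )
        features[(i, j)] = total / n_atoms
    return features


# Hypothetical, already scaled P and Q values for a two-atom fragment.
example = qna_regressors([0.5, 0.5], [0.0, 1.0], max_degree=2)
```

Each entry of the returned dictionary is one candidate regression variable for the molecule; a different choice of degree pairs would yield a different set of variables.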
QNA descriptors and their polynomial transformations do not provide information on the shape and volume of a molecule, although this information may be important for determining structure-activity relationships. Therefore, the topological length and the volume of the molecule were added to the variables obtained from the Chebyshev polynomials. The topological length of a molecule is the maximal distance, counted as the number of bonds, between any two of its atoms (including hydrogens). The volume of a molecule is the sum of the volumes of its atoms.
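Both quantities can be sketched in a few lines. The example below assumes the molecule is given as an adjacency list with explicit hydrogens and treats the topological length as the longest shortest path, in bonds, between any two atoms. The atom volumes in the example are arbitrary placeholders chosen for illustration, not the values used by GUSAR.

```python
from collections import deque


def topological_length(adjacency):
    """Largest shortest-path distance, in bonds, between any two atoms.
    `adjacency` maps each atom index to the indices of its bonded neighbours."""
    longest = 0
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        longest = max(longest, max(distance.values()))
    return longest


def molecular_volume(elements, atom_volumes):
    """Sum of the per-atom volumes for the listed element symbols."""
    return sum(atom_volumes[element] for element in elements)


# Ethane with explicit hydrogens: atoms 0 and 1 are carbons, 2..7 hydrogens.
ethane_bonds = {
    0: [1, 2, 3, 4], 1: [0, 5, 6, 7],
    2: [0], 3: [0], 4: [0], 5: [1], 6: [1], 7: [1],
}
ethane_elements = ["C", "C", "H", "H", "H", "H", "H", "H"]
placeholder_volumes = {"C": 1.0, "H": 0.5}  # illustrative numbers only

ethane_length = topological_length(ethane_bonds)  # H-C-C-H path: 3 bonds
ethane_volume = molecular_volume(ethane_elements, placeholder_volumes)
```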
The number of initial variables for QSAR modelling depends on the number of compounds in the training set. It equals the number of Chebyshev polynomials plus the number of additional terms given by the first, second and third powers of the topological length and of the volume of the molecule.
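As a small illustration of how the initial variable list is assembled, the sketch below appends the six power terms to a list of Chebyshev averages. The function names are hypothetical, and the way the number of polynomials is tied to the training-set size is not specified here, so it is left to the caller.

```python
def shape_terms(length, volume):
    """First, second and third powers of topological length and volume."""
    return [length ** k for k in (1, 2, 3)] + [volume ** k for k in (1, 2, 3)]


def initial_variables(chebyshev_averages, length, volume):
    """Chebyshev-based variables followed by the six shape terms."""
    return list(chebyshev_averages) + shape_terms(length, volume)


# Two hypothetical Chebyshev averages plus length 3 and volume 2.0.
row = initial_variables([0.1, 0.2], 3, 2.0)
```

With n Chebyshev averages the row has n + 6 entries.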
The GUSAR algorithm varies three randomly selected parameters to generate different QSAR models based on QNA descriptors: (a) whether QNA descriptors are calculated for all atoms or only for atoms with two or more immediate neighbours; (b) the coefficient applied to the connectivity matrix; and (c) the parameters of the Chebyshev polynomials.
QSAR MODELING ON THE BASIS OF BIOLOGICAL ACTIVITY PROFILES PREDICTION USING MNA DESCRIPTORS
GUSAR also allows QSAR models to be created on the basis of predicted biological activity profiles of chemical compounds. Each chemical compound is represented as a list of MNA (Multilevel Neighbourhoods of Atoms) descriptors, which are used as input parameters [2] for predicting the biological activity profile. The PASS algorithm is used to calculate this profile.
The latest version of PASS (10.1) predicts 4130 types of biological activity with a mean prediction accuracy of about 95%. The list of predictable biological activities currently includes 501 pharmacotherapeutic effects (e.g., Antihypertensive, Hepatoprotectant, Nootropic), 3295 mechanisms of action (e.g., 5 Hydroxytryptamine antagonist, Acetylcholine M1 receptor agonist, Cyclooxygenase inhibitor), 57 adverse and toxic effects (e.g., Carcinogenic, Mutagenic, Hematotoxic), 199 metabolic terms (e.g., CYP1A inducer, CYP1A1 inhibitor, CYP3A4 substrate), 49 transporter proteins (e.g., P-glycoprotein 3 inhibitor, Nucleoside transporters inhibitors) and 29 activities related to gene expression (e.g., TH expression enhancer, TNF expression inhibitor, VEGF expression inhibitor). The results of a PASS prediction are given as a list of biological activities, and for each activity the difference between the probability of being active (Pa) and the probability of being inactive (Pi) is calculated.
To obtain different QSAR models, the Pa-Pi values of activities randomly selected from the full list of predicted biological activities were used as the independent input variables for the regression analysis. As in the QSAR analysis with QNA descriptors, the topological length and the volume of the molecule were added as variables to the biological activity profile.
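The construction of regression variables from a predicted activity profile can be sketched as follows. The example assumes that Pa and Pi are available as dictionaries keyed by activity name, that one random subset of activities is drawn per model and reused for every compound, and that the length and volume are appended as plain values. The function names, the subset size and the probabilities are hypothetical and do not come from PASS or GUSAR.

```python
import random


def select_activities(activity_names, n_selected, seed=None):
    """Draw one random subset of activity names for a single QSAR model."""
    rng = random.Random(seed)
    return rng.sample(sorted(activity_names), n_selected)


def profile_row(pa, pi, selected, length, volume):
    """Pa - Pi for each selected activity, followed by length and volume."""
    return [pa[name] - pi[name] for name in selected] + [length, volume]


# Hypothetical predicted probabilities for one compound.
pa = {"Antihypertensive": 0.75, "Nootropic": 0.5, "Mutagenic": 0.25}
pi = {"Antihypertensive": 0.25, "Nootropic": 0.25, "Mutagenic": 0.5}

selected = select_activities(pa, n_selected=2, seed=0)
row = profile_row(pa, pi, selected, length=3, volume=5.0)
```

Repeating the selection with different seeds gives different variable sets and therefore different models.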
SELF-CONSISTENT REGRESSION
GUSAR uses self-consistent regression (SCR) to build (Q)SAR models. SCR is based on the regularized least-squares method described in [1,3]. Unlike stepwise regression and other combinatorial search methods, the initial SCR model includes all regressors. The basic purpose of the SCR method is to remove the variables that describe the target value poorly [3]. The number of variables that remain in the final QSAR equation after the self-consistent regression procedure is significantly smaller than the number of initial variables.
NEAREST NEIGHBOUR'S CORRECTION
It is well known that using both global and local models for non-congeneric sets improves the quality of QSAR models [4]. We used the experimental data on the three nearest neighbours (NN) to correct the prediction values obtained from the regression model. The correction value was estimated as the average of the experimental values of the three training-set chemicals most similar to the chemical under prediction. The similarity of any pair of chemical compounds is estimated as Pearson's correlation coefficient calculated in the space of the independent variables retained after SCR. The mean experimental value of the three nearest-neighbour compounds from the training set was then averaged with the value predicted for the test compound.
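The correction step can be written as a short sketch. It assumes that each compound is described by the vector of variables retained after SCR and that the neighbour mean and the regression prediction are combined with equal weights, which is the plain reading of the description above. The function names and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sd_x * sd_y)


def nn_corrected_prediction(query_vars, regression_prediction,
                            training_vars, training_values, k=3):
    """Average the regression prediction with the mean experimental value
    of the k training compounds most similar to the query compound."""
    similarities = [pearson(query_vars, row) for row in training_vars]
    ranked = sorted(range(len(training_vars)),
                    key=lambda i: similarities[i], reverse=True)
    nearest = ranked[:k]
    nn_mean = sum(training_values[i] for i in nearest) / len(nearest)
    return (regression_prediction + nn_mean) / 2.0


# Hypothetical training set: variable vectors and experimental values.
training_vars = [[1, 2, 3], [3, 2, 1], [1, 2, 3.1], [2, 4, 6], [1, 0, 1]]
training_values = [1.0, 9.0, 2.0, 3.0, 7.0]
corrected = nn_corrected_prediction([1, 2, 3], 4.0,
                                    training_vars, training_values)
```

In this example the three most similar training compounds have experimental values 1.0, 3.0 and 2.0, so their mean of 2.0 is averaged with the regression prediction of 4.0.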