The training sets were built from data on drug-induced changes in mRNA expression and protein concentration recorded in the Comparative Toxicogenomics Database. Each set contains the structures of single, electroneutral organic molecules with molecular weights of 50 to 1250 Da, together with data on drug-induced changes in the expression of human genes.
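The inclusion criteria above can be illustrated with a minimal sketch. It is not the pipeline used to build the training sets: the `Compound` record, its precomputed fields, the carbon-based test for "organic", and the inclusive treatment of the 50 and 1250 Da bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record with properties assumed to be precomputed elsewhere."""
    name: str
    mol_weight: float   # molecular weight in Da
    net_charge: int     # net formal charge of the whole record
    n_components: int   # number of disconnected molecules in the record
    has_carbon: bool    # crude proxy for "organic" (assumption)


# Bounds taken from the text; treating them as inclusive is an assumption.
MW_MIN, MW_MAX = 50.0, 1250.0


def is_eligible(c: Compound) -> bool:
    """Single, electroneutral, organic, and within the molecular weight window."""
    return (
        c.n_components == 1
        and c.net_charge == 0
        and c.has_carbon
        and MW_MIN <= c.mol_weight <= MW_MAX
    )


# Illustrative candidates only; these are not taken from the training sets.
candidates = [
    Compound("aspirin", 180.16, 0, 1, True),
    Compound("sodium acetate", 82.03, 0, 2, True),   # salt: two components
    Compound("water", 18.02, 0, 1, False),           # inorganic and too light
    Compound("acetate anion", 59.04, -1, 1, True),   # charged
]

eligible = [c.name for c in candidates if is_eligible(c)]
print(eligible)
```

In practice the fields of `Compound` would come from a cheminformatics toolkit rather than being typed in by hand.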
The mRNA-based training set consists of 1756 compounds and supports prediction of drug-induced changes in gene expression for 1802 genes (1069 upregulations and 733 downregulations). The average accuracy (ROC AUC), calculated by leave-one-out cross-validation, is 0.853.
The protein-based training set consists of 1736 compounds and supports prediction of drug-induced changes in gene expression for 123 genes (78 upregulations and 45 downregulations). The average accuracy (ROC AUC), calculated by leave-one-out cross-validation, is 0.89.
The MCF7-based training set consists of 1024 compounds and supports prediction of drug-induced changes in gene expression for 3900 genes (1769 upregulations and 2131 downregulations). The average accuracy (ROC AUC), calculated by leave-one-out cross-validation, is 0.89.
The VCAP_6-based training set consists of 6614 compounds and supports prediction of drug-induced changes in gene expression for 16124 genes (10687 upregulations and 5437 downregulations). The average accuracy (ROC AUC), calculated by leave-one-out cross-validation, is 0.80.
The VCAP_24-based training set consists of 6534 compounds and supports prediction of drug-induced changes in gene expression for 9716 genes (6078 upregulations and 3638 downregulations). The average accuracy (ROC AUC), calculated by leave-one-out cross-validation, is 0.78.
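The accuracy figures quoted for each training set are ROC AUC values obtained by leave-one-out cross-validation and then averaged. A minimal sketch of that evaluation scheme follows. The prediction algorithm itself is not described here, so a simple nearest-neighbour scorer on Tanimoto similarity stands in for it; the feature sets, the endpoint names `GENE_A_up` and `GENE_B_down`, and the labels are all hypothetical toy data.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leave_one_out_scores(features, labels):
    """Score each compound using only the other compounds (leave-one-out).

    Stand-in scorer (assumption): similarity to the nearest active compound
    minus similarity to the nearest inactive compound.
    """
    scores = []
    for i, held_out in enumerate(features):
        sims_pos = [tanimoto(held_out, f) for j, f in enumerate(features)
                    if j != i and labels[j] == 1]
        sims_neg = [tanimoto(held_out, f) for j, f in enumerate(features)
                    if j != i and labels[j] == 0]
        scores.append(max(sims_pos, default=0.0) - max(sims_neg, default=0.0))
    return scores


# Toy data: each compound is a set of hypothetical substructure identifiers.
features = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {5, 6, 7}, {5, 6, 8}, {6, 7, 8}]

# One label vector per predicted endpoint (1 = compound changes expression).
endpoints = {
    "GENE_A_up": [1, 1, 1, 0, 0, 0],
    "GENE_B_down": [1, 0, 1, 0, 1, 0],
}

aucs = {name: roc_auc(y, leave_one_out_scores(features, y))
        for name, y in endpoints.items()}
average_auc = sum(aucs.values()) / len(aucs)
print(aucs, average_auc)
```

Each compound is scored by a model that never saw it, and the per-endpoint AUC values are averaged into a single figure, which is the kind of summary number reported for each training set above.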