Bioinformatics and Drug Discovery

* Address correspondence to this author at the Department of Biology, Faculty of Science, University of Ottawa, Ottawa, Ontario, Canada, K1N 6N5; Tel: (613) 562-5800 ext. 6886; Fax: (613) 562-5486; E-mail: xxia@uottawa.ca

Received 2016 Jun 17; Revised 2016 Sep 11; Accepted 2016 Sep 21. Copyright © 2017 Bentham Science Publishers

This is an open access article licensed under the terms of the Creative Commons Attribution-Non-Commercial 4.0 International Public License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/legalcode), which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.

Abstract

Bioinformatic analysis can not only accelerate drug target identification and drug candidate screening and refinement, but also facilitate characterization of side effects and predict drug resistance. High-throughput data such as genomic, epigenetic, genome architecture, cistromic, transcriptomic, proteomic, and ribosome profiling data have all made significant contributions to mechanism-based drug discovery and drug repurposing. Accumulation of protein and RNA structures, as well as development of homology modeling and protein structure simulation, coupled with large structure databases of small molecules and metabolites, paved the way for more realistic protein-ligand docking experiments and more informative virtual screening. I present the conceptual framework that drives the collection of these high-throughput data, summarize the utility and potential of mining these data in drug discovery, outline a few inherent limitations in the data and in the software used to mine them, point out new ways to refine analysis of these diverse types of data, and highlight commonly used software and databases relevant to drug discovery.

Keywords: Drug target, Drug candidate, Drug screening, Genomics, Epigenetics, Transcriptomics, Proteomics, Structure

1. INTRODUCTION

Drug discovery starts with diagnosis of a disease with well-characterized symptoms that reduce the quality of life. Conventionally, a desirable drug is a chemical (which could be a simple chemical or a complicated protein) or a combination of chemicals that reduces the symptoms without causing severe side effects in the patient. Other properties of a desirable drug include affordability and profit for drug companies [1, 2], a low chance of drug resistance [3], which would otherwise lead to a dramatic decrease in the commercial value of the drug, and a low deleterious effect on the environment, e.g., no re-activation by bacterial species after human use [4]. Thus, a desirable drug is one that not only is efficacious with few side effects, but also has minimal long-term negative effects on the patient, society and the environment.

This review will focus on how bioinformatics can facilitate the discovery of such desirable drugs. Bioinformatics is an interdisciplinary science spanning genomics, transcriptomics, proteomics, population genetics and molecular phylogenetics. Bioinformaticians in drug discovery use high-throughput molecular data (Fig. 1) in comparisons between symptom-carriers (patients, animal disease models, cancer cell lines, etc.) and normal controls. The key objectives of such comparisons are to 1) connect disease symptoms to genetic mutations, epigenetic modifications, and other environmental factors modulating gene expression, 2) identify drug targets that can either restore cellular function or eliminate malfunctioning cells, e.g., cancer cells, 3) predict or refine drug candidates that can act upon the drug target to achieve the designed therapeutic result and minimize side effects, and 4) assess the impact on environmental health and the potential of drug resistance.

Fig. 1. Major types of high-throughput data and their key information relevant to drug discovery. Metabolomic data belong to cheminformatics and are not included.

2. GENOMIC SEQUENCE AND EXOME DATA IN DRUG DISCOVERY

One of the early contributions from bioinformatics to drug target discovery was the identification of sequence homology between the simian sarcoma virus onc gene, v-sis, and a platelet-derived growth factor (PDGF) by simple string matching [5, 6]. This finding not only resulted in PDGF being used as a cancer drug target [7-9], but also led to two new lines of thinking. First, the viral transforming factor may work simply by changing transient expression of a growth factor to constitutive expression, suggesting growth factors as targets for anti-cancer drug development. Second, any factor modulating gene expression patterns can potentially contribute to cancer. This new conceptual framework of cancer biology contributed to the progress of mechanism-based anti-cancer drug development in the following years [10-12].

2.1. Genetic Diseases

Genomic and whole exome sequencing of patients with inherited disorders have recovered many somatic mutations which are associated with genetic diseases [13-15] and could be potential drug targets. The main difficulty concerning bioinformatic research on somatic mutations lies in the identification of disease-causing mutations among many observed genetic differences between matched patient and normal control [16]. Some diseases such as cancer exhibit high genetic heterogeneity [17], even among cells within a single tumor [18]. Many of these somatic mutations could be the consequence rather than the cause of cellular malfunction [16].

Effort must be made to distinguish three types of somatic mutations: 1) those that cause the disease and may serve as drug targets, 2) those that are closely linked to the disease gene and consequently are associated with the disease, and 3) those not associated with the disease but happen to occur in the patient group and not in the control group. The second type of mutations can be used for disease diagnosis, but not as drug targets. The third type can be excluded in two ways. The first is by increasing sample size. If thousands of breast cancers all share the same somatic mutation, then the relevance of the mutation to breast cancer is high relative to a somatic mutation occurring in only one breast cancer [19]. The second is by collecting longitudinal data, recognizing that many diseases may have a genetic determinant long before the manifestation of the disease [20]. Suppose mutation X predisposes a person to Alzheimer’s disease (AD). If we compare a group of AD patients with a non-AD control group, and if the control group includes people who already have mutation X but have not yet developed AD, we may fail to recognize the importance of mutation X simply because it is not unique to the AD group. Only if we follow patients or relevant animal models over time can we come to the conclusion that whoever has mutation X eventually develops AD.
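The sample-size argument can be illustrated with a simple recurrence count. The following is a minimal sketch, assuming each tumor sample has already been reduced to a set of mutation identifiers; the function name `mutation_recurrence` and the toy identifiers are hypothetical and are not taken from any cited study.

```python
from collections import Counter

def mutation_recurrence(samples):
    """Return the fraction of samples carrying each mutation.

    `samples` is a list of sets; each set holds the mutation identifiers
    observed in one patient sample (hypothetical labels in this sketch).
    """
    if not samples:
        return {}
    # Count each mutation at most once per sample.
    counts = Counter(m for sample in samples for m in set(sample))
    n = len(samples)
    return {m: c / n for m, c in counts.items()}

# Toy data: "mutA" recurs in every sample, "mutC" appears only once.
tumors = [{"mutA", "mutB"}, {"mutA"}, {"mutA", "mutC"}, {"mutA", "mutB"}]
freq = mutation_recurrence(tumors)

# Rank mutations from most to least recurrent (ties broken by name).
ranked = sorted(freq, key=lambda m: (-freq[m], m))
print(ranked)
```

A mutation seen in a single sample ranks last and is the most likely to be a chance co-occurrence; a real analysis would also need to account for gene length, background mutation rate and multiple testing, none of which is modeled here.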

It is much more difficult to distinguish between the first and second type of genetic differences between patient and control without an understanding of disease mechanism. A loss-of-function mutation can happen in the coding sequence (CDS), in a regulatory motif (e.g., response elements for ligand-activated nuclear receptors) or in an enhancer that could be up to 1 million bases away from the CDS. Bioinformaticians will typically take three approaches to check if the mutation has a major impact on gene function, asking 1) whether the mutation replaces an amino acid by a very different one (e.g., non-polar uncharged glycine by a positively charged arginine) at a typically conserved site, 2) whether the mutation occurs in a highly conserved non-coding sequence (which is typically checked by comparing genomes between human and non-human primates), and 3) whether the mutation occurs in a known signal (e.g., regulatory motif, splice sites, transcription initiation and termination sites) for cellular machinery (e.g., ribosome, spliceosome, degradosome). The last approach is facilitated by the availability of extensively compiled and curated databases of known regulatory motifs [21-23]. Bioinformatic tools are often used to scan genomes for regulatory motifs. Such tools include the position weight matrix (PWM) for finding the genomic location of a known motif, the Gibbs sampler for de novo motif discovery [24, 25], and support vector machines [26, 27], which can be used to extract differences between two groups of sequences (e.g., motif-present and motif-absent) and to use the resulting information to detect motifs in genomes. The regulatory motifs could be response elements of nuclear receptors whose identification often leads to refinement of drug targets [28]. Such studies are facilitated by software such as DAMBE [29] which, when given an annotated genomic sequence, can extract coding sequences, rRNAs, tRNAs, introns, exons, 5’ and 3’ splice sites, upstream or downstream sequences of gene features, etc., with just a few mouse clicks. In addition to functions for PWM, Gibbs sampler, and minimum folding energy estimation, DAMBE can also compute protein isoelectric point and indices of protein translation efficiency.
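To make the PWM approach concrete, the following minimal sketch builds a log-odds matrix from a handful of aligned sites and scans a sequence for the best-scoring window. It is not DAMBE's implementation; the uniform background frequency, the pseudocount, the function names and the toy sites are all assumptions made for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned motif sites.

    Assumes equal-length sites over the alphabet ACGT and a uniform
    background frequency (an assumption of this sketch).
    """
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [s[i] for s in sites]
        pwm.append({
            nt: math.log2(((column.count(nt) + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    w = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(w)))
        for start in range(len(sequence) - w + 1)
    ]

# Toy aligned sites (invented for illustration).
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)

hits = scan("GGCGTATAATGCGC", pwm)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 2))
```

In this toy sequence the consensus window starts at index 4, so that window receives the highest score. A genome-scale scan would additionally need a score threshold, a realistic background model and handling of both strands.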

If a deleterious mutation is identified to be a loss-of-function mutation, then bioinformatics can help identify a paralogous gene or an alternative cellular pathway that can compensate for the mutation effect. Functional redundancy or partial redundancy is common in mammals, e.g., the functions of the paralogous genes USP4 and USP15 in mice are partially redundant [30]. Human adrenoleukodystrophy (ALD) is caused by partial deletion of the 10-exon gene ABCD1, resulting in the accumulation of very long chain fatty acids (VLCFA) [31]. This suggests not only dietary restriction of VLCFA in disease management, but also activation of alternative metabolic pathways for VLCFA through regulating another gene involved in fatty acid metabolism (ABCD2) and suppression of the activity of the elongase involved in generating VLCFA [32]. Another example of activating alternative biological pathways or genes with partial functional redundancy involves sickle-cell anemia [33], caused by a single amino acid replacement in the human beta-globin gene [34, 35]. The fetal hemoglobin gene (HbF) is a promising drug target because HbF reduces hemoglobin polymerization and clumping. A drug that could revive the silenced HbF would alleviate the symptoms of sickle-cell anemia and thalassemia in adults [36, 37]. Interestingly, some β-thalassemia patients have the correct version of the β-globin gene but the gene is not expressed because of mutations that occurred far away from it [38, 39]. Such long-range gene regulation will be addressed later in the section on epigenetic modification and genome architecture.

2.2. Human Diseases Caused by Pathogens

Well-annotated genomes are essential for target-based drug discovery against pathogens. The general bioinformatic approach involves three essential steps. The first is to identify essential genes in the pathogen as drug targets. A genome, especially a well-annotated one, can facilitate identification of such essential genes. For example, genes involved in nucleotide synthesis are well known, but are often missing in pathogenic species because they use the salvage pathway instead of the de novo synthesis pathway to procure nucleotides. In Trypanosoma brucei, genes for de novo synthesis of ATP, GTP and TTP have gone missing, but the pathogen retains limited capacity for de novo synthesis of CTP [40], presumably because CTP generally has a much lower concentration than the other three nucleotides in the cell and cannot be reliably obtained through salvage. This points to the CTP synthesis pathway as a drug target. Indeed, inhibiting CTP synthesis arrests the growth and replication of the pathogen [40]. Essential genes are often highly conserved and can be revealed by genomic comparisons between pathogens and their phylogenetic relatives. Sometimes they may also be inferred from experimental data from model organisms such as Escherichia coli, Bacillus subtilis or Saccharomyces cerevisiae whose genes have been systematically and individually knocked out. Genes essential for the two bacterial species are likely to be essential in another bacterial species.

The second step in developing drugs against a pathogen is to check if such essential genes have homologues in the host. If they do, then inhibiting such essential genes in the pathogen may have an adverse effect on the function of the host homologue, and we consequently need to perform sequence and structural comparisons between the pathogen and host homologues to identify the parts unique to the pathogen homologue to assist in the design of pathogen-specific drugs.

Third, to minimize the chance of the pathogen developing drug resistance, it is important for the drug to target the specific pathogen and not its non-pathogenic phylogenetic relatives. For this reason, pathogenicity islands that are present in pathogenic bacteria but absent in their non-pathogenic relatives have increasingly become the preferred source of drug targets [41-43].

Bioinformatic analysis revealed a glutamate transport system that is present in the pathogen Clostridium perfringens but absent in mammals and birds [44]. Drugs developed against such a transport system will protect not only humans, but also domesticated mammals and fowl. In the human parasite Giardia intestinalis, the phosphoinositide-3 kinase (PI3K) signaling pathways are essential and could serve as a drug target. However, the PI3K pathway is also essential in many eukaryotes, so it is important to identify what is unique in the PI3K homologues (Gipi3k1 and Gipi3k2) in G. intestinalis relative to mammals. Sequence comparisons revealed a unique insertion only in the parasite that can serve as a pathogen-specific drug target [45]. The same approach has been used in targeting Pseudomonas aeruginosa [46]. Similarly, in developing anti-HIV-1 drugs, one can target genes involved in reverse transcription and protease digestion of its translated polyprotein because these processes not only are essential for viral survival and transmission, but also have no close homologues in humans, so their inhibition should have minimal side effects on humans.

Genomic analysis can also help in repurposing existing drugs against other pathogens. Galactofuranose (Galf) is an important constituent on the cell surface of a variety of bacterial pathogens [47, 48], and its synthesis requires UDP-galactopyranose mutase (UGM). Because Galf is absent in humans [44], UGM has been used as a desirable drug target [49]. UGM coded by gene GLF was later found in several eukaryotic unicellular pathogens [50] as well as in nematodes [51]. Can we repurpose drugs developed against bacterial pathogens to fight eukaryotic unicellular pathogens [50]? Drug repurposing is cost-effective in drug development [52]. Genomic analysis shows that eukaryotic UGMs, while similar to each other, are quite different from prokaryotic UGMs, suggesting difficulty in drug repurposing from bacterial pathogens to eukaryotic pathogens. However, if one develops an effective drug against one eukaryotic UGM, the drug would have a very good chance of being repurposed for another eukaryotic pathogen.

Genomics has also contributed to understanding drug actions. The venom protein PcFK1 of the spider Psalmopoeus cambridgei was able to inhibit the growth of Plasmodium falciparum, but the mechanism was unknown. A sequence analysis revealed sequence homology between PcFK1 and the protein substrate of the P. falciparum enzyme PfSUB1, leading to the hypothesis that PcFK1 is an antagonist of PfSUB1. Subsequent docking prediction and in vitro experiments confirmed this hypothesis, pointing to PfSUB1 as a drug target [53].

Essential cellular processes are often functionally redundant, and understanding such functional redundancy is crucial in developing effective drugs against pathogens. In Mycobacterium tuberculosis, the arabinofuranosyltransferases Mt-EmbA and Mt-EmbB contribute to the synthesis of the cell wall mycolyl-arabinogalactan-peptidoglycan complex and are targeted by the drug ethambutol. Bioinformatic analyses revealed another arabinofuranosyltransferase, Mt-AftA, which is not inhibited by ethambutol and consequently could serve as a drug target [54]. A combination of drugs against all three arabinofuranosyltransferases will not only be more effective against the pathogen, but also reduce the chance of the pathogen developing drug resistance. Activation of alternative biological pathways to satisfy the need for growth and survival has been known in bacterial species since the discovery of the lac operon and the glucose/lactose genetic switch [55], and a drug cannot be effective against a pathogen or a cancer cell unless we know which alternative pathways the cell can activate in response to the drug.

Bioinformatics, with its inherent evolutionary perspective and its integration of molecular phylogenetics [56, 57], can often contribute to resolving controversies on molecular mechanisms. One such example involves the causal interpretation of CpG methylation causing CpG deficiency through subsequent C→T mutation mediated by spontaneous deamination. A controversy arose when both Mycoplasma genitalium and M. pneumoniae genomes were found to lack DNA CpG methyltransferase, yet M. genitalium genome exhibits much stronger CpG deficiency than M. pneumoniae genome, suggesting a conclusion that the difference in CpG deficiency between the two species is irrelevant to CpG methylation [58, 59]. Such a conclusion from genomic studies without an evolutionary perspective is often wrong. A comprehensive phylogenetic study using software DAMBE [29] showed that the ancestors of the two species should have multiple CpG methyltransferases because M. pulmonis and other relatives that branch off earlier than M. genitalium and M. pneumoniae have multiple CpG methyltransferases. After the loss of the CpG methyltransferases in the ancestor of M. genitalium and M. pneumoniae, both species began to gain CpG frequency, but M. pneumoniae evolved much faster (with a much longer branch) and regained CpG much faster than M. genitalium [60]. These findings restored the validity of causal relationship between CpG-specific DNA methylation and CpG deficiency, and illustrate the importance of having an evolutionary perspective in understanding biological processes. Because many such studies involve highly diverged bacterial or viral species, and because it is often difficult to obtain reliable multiple sequence alignment with highly divergent sequences, a new phylogenetic method based on pairwise sequence alignment has recently been developed [61] to facilitate phylogenomic studies involving highly diverged species.
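CpG deficiency is commonly summarized as the ratio of the observed CpG frequency to the frequency expected from the C and G content alone. The sketch below computes this ratio for a single strand; it is a generic illustration and not necessarily the exact index used in the cited studies [58-60], and the function name `cpg_obs_exp` and the toy sequences are hypothetical.

```python
def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio for one DNA strand.

    Expected frequency is taken as P(C) * P(G); values well below 1
    indicate CpG deficiency. Returns None when the ratio is undefined.
    """
    seq = sequence.upper()
    n = len(seq)
    if n < 2:
        return None
    p_c = seq.count("C") / n
    p_g = seq.count("G") / n
    if p_c == 0 or p_g == 0:
        return None
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    p_cpg = seq.count("CG") / (n - 1)
    return p_cpg / (p_c * p_g)

# Toy sequences with identical base composition but different CpG content.
cpg_rich = cpg_obs_exp("ACGTACGTACGT")
cpg_poor = cpg_obs_exp("CATGCATGCATG")
print(cpg_rich, cpg_poor)
```

Because the two toy sequences have the same nucleotide composition, the difference in the ratio reflects dinucleotide arrangement only, which is the property the phylogenetic comparison of Mycoplasma genomes relies on.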

3. EPIGENETICS, GENOME ARCHITECTURE AND CISTROMES IN DRUG DISCOVERY

Monozygotic twins carrying the same deleterious mutations such as the aforementioned ALD mutation often differ much in phenotype [62-65]. Such observations serve to highlight the relationship between epigenetic modifications and human diseases [66, 67]. Epigenetic modification includes two interrelated events, DNA methylation and histone modification. The maintenance of DNA methylation pattern in mammals is accomplished by the mammalian DNA methyltransferase 1 (DNMT1) whose CatD domain recognizes hemi-methylated CpG sites [68] so that DNA methylation pattern can be maintained from parental to daughter cells. In mammals, the methylated CpG recruits proteins with a methyl-CpG binding domain such as MBD1, MBD2, MBD3 and MeCP2 which then recruit histone deacetylase to remove the acetyl group and restore the positive charge of lysine residues (or histone N-terminal) in histone so that the negatively charged backbone of DNA can wrap tightly around the positively charged histone to silence the gene [69]. A silenced gene is in many ways equivalent to a loss-of-function mutation. Because some cancers appear to be caused by permanent silencing of genes involved in apoptosis pathway through DNA methylation and histone deacetylation [70-71], histone deacetylase has been used as a drug target with its inhibitors aiming to reactivate the apoptosis pathway [72]. The main problem in this approach is specificity because deacetylase inhibitors often have profound effect on the regulation of many other genes, which may explain why such drugs often do not enter clinical trials [73]. Methods for precise editing of the epigenome, involving components for DNA-binding and specific sequence recognition and modification are currently being developed [74].

The conventional view that DNA methylation and histone deacetylation mainly serve the purpose of permanent gene silencing has now been replaced by a more general conceptual framework of epigenetic modification and gene regulation (Fig. 2). This conceptual shift demands integrated analysis of several types of high-throughput data: methylation pattern from bisulfite sequencing [75-76], DNA/protein binding data (cistrome) from ChIP-on-chip and ChIP-Seq [77], and genome architecture data from Hi-C [78] or its derivatives. DNA methylation alters DNA/protein binding which in turn alters genome architecture, i.e., two DNA segments far apart along the linear DNA can be brought together. Genome architecture data pave the way for studying spatial interaction between enhancers and promoters that can be up to one million bases apart. That gene expression depends on gene location on the genome has been known since 1930 through studies of translocation [79], but empirical evidence accumulated much later to demonstrate that protein/DNA interactions result in nucleosome reconfiguration and interaction between enhancer and promoter [80-84]. This has spawned the formulation of the enhancer hub model of gene regulation [85, 86]. That is, the hub contains one or more enhancers, and a gene with its promoter looping close to the hub will be expressed; deletion of such a hub will silence the expression of all genes that depend on their physical proximity to the hub to be expressed.

Fig. 2. A general framework of epigenetic effects on gene expression, through 1) DNA methylation and histone acetylation/deacetylation, 2) alteration of DNA-binding proteins and consequent protein-DNA and protein-protein interactions, and 3) alteration of long-distance interactions such as enhancer-promoter interactions. LM: laboratory method; BQ: sample bioinformatic questions.

From a bioinformatics point of view, the key question concerns what the methylation signal on DNA is and whether it is possible to modulate such a signal to alter epigenetic modifications. As mentioned earlier, some β-thalassemia patients have the correct version of the β-globin gene but the gene is not expressed because of mutations that occurred far away from it [38, 39]. One may formulate two hypotheses. First, the enhancer that controls the expression of the β-globin gene is mutated or deleted in the patient [38, 87]. Second, the enhancer that is brought close to the promoter of the β-globin gene in normal genome architecture is relocated somewhere else due to abnormal epigenetic modifications and protein/DNA binding. Testing these hypotheses, which has become possible only with the availability of high-throughput data of genome architecture, methylation patterns and cistromes (the set of all protein/DNA binding sites), would shed light on how we can reposition the enhancer and the β-globin promoter so that the gene will be expressed [88-90]. Similarly, if the β-globin gene is silenced through DNA methylation, then the knowledge of how to modulate the signal to modify the methylation pattern would bring us closer to reactivating the silenced β-globin gene. Along the same line of reasoning, if the fetal globin genes are silenced by methylation, and if reactivation of these fetal globin genes can alleviate the problem caused by mutations in adult globin genes, then the knowledge of site-specific demethylation would be highly desirable [74].

Given that some CpG sites are methylated and some are not in mammalian genomes, one straightforward bioinformatic analysis would be to compare the flanking sites of these two groups of CpG dinucleotides to detect whether flanking nucleotides contribute to methylation signals. Equivalent analyses of splice sites have revealed strong splice signals in the flanking sequences of the 5’ and 3’ splice sites [91, 92], but such comparisons of flanking regions between methylated and unmethylated CpG, although done on a limited scale [93-95], have not yielded clear-cut results. Equally disappointing is that, while the concept of imprinting center (IC) has been known for many years [96], the physical basis of IC, either at the sequence level or structural level, remains elusive.
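The flanking-site comparison described above amounts to contrasting position-specific nucleotide frequencies between the two groups of CpG sites. The following is a minimal sketch, assuming that fixed-length windows centered on each CpG have already been extracted; the function names and the toy windows are invented for illustration and do not represent real methylation data.

```python
from collections import Counter

def position_frequencies(windows):
    """Per-position nucleotide frequencies for equal-length sequences."""
    length = len(windows[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in windows)
        freqs.append({nt: counts.get(nt, 0) / len(windows) for nt in "ACGT"})
    return freqs

def frequency_differences(methylated, unmethylated):
    """Frequency difference (methylated minus unmethylated) per position."""
    fm = position_frequencies(methylated)
    fu = position_frequencies(unmethylated)
    return [
        {nt: fm[i][nt] - fu[i][nt] for nt in "ACGT"}
        for i in range(len(fm))
    ]

# Toy windows: 2 nt upstream + CG + 2 nt downstream (invented sequences).
methylated = ["AACGTT", "AACGTA", "ATCGTT", "AACGAT"]
unmethylated = ["GCCGGC", "GACGGC", "CCCGGA", "GCCGTC"]

diff = frequency_differences(methylated, unmethylated)
# Index 0 is the outermost upstream flanking site; indices 2-3 are the CpG.
print(diff[0])
```

Large differences at a flanking position would hint at a sequence signal, but a real analysis would require many sites and a formal test of significance (for example a chi-square test per position), which this sketch omits.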

Because monozygotic twins carrying the same genetic defect often differ much in manifestation of the associated disease [62-65], one naturally wishes to identify environmental contributions such as diet to epigenetic modification [97, 98]. As methylation needs S-adenosyl L-methionine (SAM) as the methyl donor, a deficiency in methionine most likely will, and indeed has been confirmed to, affect DNA methylation [99, 100]. Similarly, one would predict that any major perturbation on methionine, such as the deletion of methylthioadenosine phosphorylase (MTAP) crucial in the methionine salvage pathway, would also affect DNA methylation, gene regulation and cancer. Indeed, MTAP deletion is common in cancer cells [101]. Thus, all genes that affect methionine metabolism could be drug targets, and bioinformatics, with databases such as KEGG [102-104] can identify such genes effectively.

If a wrong DNA methylation pattern has formed, then an ideal drug (or an epigenome-repairing nano-machine) should be able to specifically identify the wrong pattern and correct it [74]. To develop such a drug or nano-machine, we first have to know the correct methylation pattern or ideally discover a set of molecules that encode such a correct pattern. Experimental results have accumulated in support of RNA’s role in epigenetic modification [105]. Given that DNA in the zygote undergoes demethylation to regain pluripotency [106], the epigenomic code is perhaps not on DNA. As proteins do not seem to be good at writing code, and because most core histones are replaced by protamine in male germ cells [107], the epigenetic codes, especially the ones that specify de novo DNA methylation, are unlikely to be found in proteins. However, such codes may exist in a set of highly conserved and structurally stable RNA molecules that might be present as early as the oocyte and sperm stage. Long noncoding RNAs (lncRNAs) can participate in epigenetic modification and regulate chromatin state. Characterization of lncRNAs bound to DNA and protein by the ChIRP-seq method [108, 109] revealed numerous sequence-specific binding sites on DNA, and the binding of lncRNAs such as HOTAIR [108, 110] and Kcnq1ot1 [111] to such sites facilitates the recruitment of Polycomb Repressive Complex 2 (PRC2) for mediating histone H3 lysine-27 trimethylation. Short RNAs can also modulate epigenetic changes. Mature sperm contain a number of small RNA species [97, 98, 112, 113], and these small RNAs do affect offspring phenotype [113, 114]. Furthermore, these small RNAs appear to contribute to epigenetic modification in offspring [97, 98, 113, 114]. The ENCODE pilot project shows that “the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts” [115]. Those non-coding transcripts may be a treasure trove for bioinformaticians to discover epigenome-modifying RNAs as drug targets.

Epigenetic modification has an early origin. Many bacterial species modify their own DNA by methylation to protect against endogenous type II restriction endonucleases. Some bacteriophages have their own methyltransferase that can modify their own genome against host restriction digestion [116], and human viral pathogens such as HIV-1 can induce profound alteration in host epigenetic pattern [117]. It is now known that some of the host defense mechanisms against pathogens are implemented through epigenetic modifications [118, 119] and many pathogens can modify host epigenetic patterns in favour of their survival and reproduction in the host [118]. The eventual fate of such pathogen-mediated epigenetically modified host cells remains unclear. Do they defeat the pathogen invasion, restore the normal epigenetic pattern and resume normal function, or do they initiate an apoptosis pathway and perish? What epigeneticists need is a model organism or a cell line in which the epigenetic pattern can be perturbed by extrinsic factors and then restored to normal.

4. TRANSCRIPTOMICS AND DRUG DISCOVERY

Transcriptomic data have been increasingly used to identify differentially regulated genes, alternatively spliced isoforms and different transcription start and termination sites between patient and matched control [120-125]. Transcriptomic data analysis contributes to drug discovery mainly in two ways, one in phenotypic screening to identify and refine drug candidates, and the other in drug target identification.

4.1. Phenotypic Screening

There have been debates on what constitutes phenotypic screening, but recently proposed definitions [12, 126] converge on five points: 1) the screening involves a large number of compounds (drug candidates) ideally chosen systematically, 2) phenotypic changes in response to each compound are monitored, 3) a criterion of desirability is formulated and used in ranking the compounds, 4) those compounds generating desirable biological effects (phenotypes) are kept as drug candidates for further testing and validation, and 5) the mechanism of action is unknown and not the focus of the screening. Phenotypic screening can be quite effective in identifying active ingredients in traditional medicine, with one of the success stories being the discovery of artemisinin, which is the most effective drug against the malaria parasite Plasmodium falciparum [127].

While the target-based approach is effective in developing drugs against diseases with relatively simple mechanisms such as single-gene genetic diseases, phenotypic screening is more effective in drug development against diseases with multiple causes such as multi-gene genetic diseases [128-129]. Cancer has a heterogeneous genetic background [17], with extremely high genetic diversity among cells within a single tumor [18]. For such complex diseases, phenotypic screening designed specifically for cancer has been used widely in cancer drug development [11]. The identification of an efficacious chemical by screening often sheds light on the molecular mechanism of action [130].

Phenotypic screening of FDA-approved drugs for drug repurposing is cost-effective because these drugs have already gone through the difficult path of regulatory approval. This approach has resulted in promising inhibitors against enteroviruses [131], anti-aging therapeutics [132], anti-cancer drugs [133], and allosteric Bcr-Abl inhibitors in the fight against chronic myeloid leukemia [134].

How does bioinformatics contribute to phenotypic screening? The answer lies in the fact that many modern phenotypic screening studies, especially in screening for anti-cancer drugs, typically define phenotype, either implicitly or explicitly, as a gene expression (transcripts or protein) profile [11] or a metabolomic profile [135-137]. From this perspective, there are two alternative approaches to treat cancer cells. The first is to restore the gene expression of cancer cells to that of normal cells. The second, when the first is not achievable, is to kill cancer cells by inducing apoptosis [11-12]. These two approaches imply two criteria in phenotypic screening for anti-cancer drugs: 1) increased similarity in gene expression between cancer cells and normal cells, and 2) increased similarity in gene expression between cancer cells and apoptotic cells.

Bioinformatics can contribute to gene expression and drug discovery by formulating an objective and rational index of drug desirability (Idd) in phenotypic screening studies with gene expression profiles as phenotypes. Such an Idd would complement therapeutic indices [138, 139] based on various pharmacokinetic models for evaluating drug effects and safety under various drug concentrations [140-142]. The lack of an explicit Idd may have contributed to the low rate of successful drugs discovered through phenotypic screen [126]. For this reason, I will take a rare step in a review article to initiate the effort of developing an index of drug desirability integrating both symptom reduction and side effect.

Designate gene expression profile of a “patient” (which could be an animal disease model or cancer cell line) as Gp, that of a normal control as Gn, and that of a patient after the use of a candidate drug as Gd. It is now easy to compute a variety of pairwise distances [143] between Gn and Gp, between Gd and Gp and between Gn and Gd (designated Dnp, Ddp, and Dnd, respectively, Fig. 3). Dnp is a measure of severity of the symptoms, and (Dnp – Dnd) a measure of symptom reduction by the application of the candidate drug, equivalent to drug efficacy (Emax) in pharmacodynamic models [141-142]. Side effect could be measured by the difference between (Dnd + Ddp) and Dnp, i.e., (Dnd + Ddp - Dnp), which implies that the side effect is greater for Drug B in Fig. (3b) than for Drug A in Fig. (3a). With these definitions, we can formulate an index of drug desirability (Idd) as:

Fig. 3. Numerical illustration of applying Idd in Eq. (1) in phenotypic screening to two sets of transcriptomic data (a) and (b). Gn, Gp and Gd refer to gene expression of normal cells, disease cells before drug application, and disease cells after drug application, respectively.
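The distances and the two component measures defined in the text, symptom reduction (Dnp - Dnd) and side effect (Dnd + Ddp - Dnp), can be computed with a few lines of code. The sketch below assumes Euclidean distance as the pairwise metric, which is only one of many possible choices [143], and it computes the two components only, not Idd itself. The function names and the two-gene toy profiles are hypothetical and are not the values shown in Fig. 3.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def drug_effect_terms(gn, gp, gd):
    """Return Dnp, Dnd, Ddp and the two component measures from the text.

    gn: normal profile, gp: patient profile before the drug,
    gd: patient profile after the drug.
    """
    dnp = euclidean(gn, gp)
    dnd = euclidean(gn, gd)
    ddp = euclidean(gd, gp)
    return {
        "Dnp": dnp,
        "Dnd": dnd,
        "Ddp": ddp,
        "symptom_reduction": dnp - dnd,   # (Dnp - Dnd)
        "side_effect": dnd + ddp - dnp,   # (Dnd + Ddp - Dnp)
    }

# Toy two-gene profiles (invented values).
gn, gp = [0.0, 0.0], [6.0, 0.0]
drug_a = drug_effect_terms(gn, gp, [2.0, 0.0])  # moves straight toward Gn
drug_b = drug_effect_terms(gn, gp, [3.0, 4.0])  # also perturbs the second gene

print(drug_a["symptom_reduction"], drug_a["side_effect"])
print(drug_b["symptom_reduction"], drug_b["side_effect"])
```

With any distance that satisfies the triangle inequality, the side-effect term is non-negative and equals zero only when the post-drug profile lies on the direct path between the patient and normal profiles, as for the first toy drug here.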