Small-molecule drug discovery (SMDD) is a multidimensional challenge that is both expensive and time-consuming. On average, it takes about 14 years and around US$2.8 billion to bring a drug from invention to market. The two major causes of failure in discovery are insufficient efficacy and toxicity. Although drug discovery is a slow process that demands high investment, pharmaceutical companies and academic institutes continue to fund it because of its high commercial potential and its benefit to society. A huge amount of experimental data has accumulated over the past decades, including in vitro (biochemical) assays, in vivo assays and clinical trials. This data is a valuable source for learning why compounds succeed or fail across the entire discovery process, and the knowledge gained can be used to predict future drug candidates in discovery experiments. Machine learning (ML) and deep learning (DL) methods are one way to generate such knowledge and hypotheses from known data. Because the accumulated data is huge (big data), proper curation, efficient mining and hypothesis building (ML/DL models) need to be implemented in the drug discovery and development pipelines of the pharmaceutical industry to increase success rates. Integrated as artificial intelligence (AI), these tools can serve drug discovery and development end to end. Combining the drug discovery process with AI can thus transform the paradigm of drug discovery.

Millions of experimental data points (known data) are available in the public domain (PubChem, ChEMBL, BindingDB, PDB, etc.). This experimental data includes in vitro and in vivo results for each disease, ADME/Tox data and much more. Only properly curated data should be used to generate precise ML/DL models. Two types of machine learning methods are widely used in small-molecule drug discovery: unsupervised and supervised. Unsupervised methods are used to cluster molecules by chemical similarity. Because the data sets are large, fast clustering methods such as k-means, k-medians, mean shift and Gaussian mixture models are a practical choice. Clustering methods are useful for identifying nearest neighbors and are widely applicable to drug repurposing and off-target prediction. Supervised machine learning methods are used to build models for data sets that carry experimental activity. These predictive methods are either quantitative (continuous) or qualitative (categorical), depending on the experimental activity of the training data. Models generated for each protein or disease type can then be applied to classify unexplored data and identify new hits. However, their precision depends mainly on the quality of the input data. Combined with molecular modeling, predictive methods can accelerate the discovery process from hit identification to lead optimization. Supervised learning methods also play a major role in identifying druggable compounds based on ADMET. Building highly predictive models for absorption, distribution, metabolism, excretion and toxicity from adequate samples, and using them to filter compounds during screening, yields the most druggable compounds for biological studies.
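
The clustering step can be illustrated with a minimal k-means sketch in plain Python. It is only an illustration under stated assumptions: the two-descriptor vectors are invented, the first k molecules are used as starting centroids to keep the run reproducible, and a production workflow would normally rely on an established library and real descriptors or fingerprints.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster descriptor vectors with plain k-means (Lloyd's algorithm)."""
    # Deterministic start (an assumption of this sketch): the first k points
    # serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each molecule joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical, already scaled two-descriptor vectors for six molecules.
molecules = [
    (0.10, 0.20), (0.90, 0.80), (0.15, 0.25),
    (0.85, 0.90), (0.12, 0.18), (0.95, 0.85),
]
labels, centroids = kmeans(molecules, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Molecules that share a label are nearest neighbors in descriptor space, which is the property that repurposing and off-target searches build on.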

Physicochemical properties, quantum mechanical properties, 2D descriptors, 3D descriptors, molecular patterns and molecular fingerprints of the training data sets are used to generate the machine learning models. Methods such as principal component regression (PCR), partial least squares (PLS), support vector machines (SVMs), naïve Bayes, random forest, neural networks and recursive partitioning are quite often used to generate ML models by correlating descriptors with experimental activity.
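
As a rough illustration of correlating descriptors with activity, the sketch below fits a Bernoulli naïve Bayes classifier to binary fingerprints using only the Python standard library. The 4-bit fingerprints, the labels and the function names are hypothetical; real fingerprints have hundreds or thousands of bits and would be computed from curated structures.

```python
import math


def train_bernoulli_nb(fingerprints, labels):
    """Fit a Bernoulli naive Bayes model with Laplace smoothing."""
    n_bits = len(fingerprints[0])
    model = {}
    for cls in set(labels):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        prior = len(rows) / len(fingerprints)
        # P(bit = 1 | class), smoothed so no probability is exactly 0 or 1.
        bit_probs = [
            (sum(fp[i] for fp in rows) + 1) / (len(rows) + 2)
            for i in range(n_bits)
        ]
        model[cls] = (prior, bit_probs)
    return model


def predict(model, fingerprint):
    """Return the class with the highest log posterior."""
    def log_posterior(cls):
        prior, bit_probs = model[cls]
        score = math.log(prior)
        for bit, p in zip(fingerprint, bit_probs):
            score += math.log(p if bit else 1.0 - p)
        return score
    return max(model, key=log_posterior)


# Hypothetical training set: three active and three inactive compounds.
train_fps = [
    [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1],
    [0, 0, 1, 1], [0, 1, 1, 0], [0, 0, 1, 0],
]
train_labels = ["active"] * 3 + ["inactive"] * 3

model = train_bernoulli_nb(train_fps, train_labels)
print(predict(model, [1, 1, 0, 0]))  # active
print(predict(model, [0, 0, 1, 1]))  # inactive
```

The same train-then-predict pattern applies to the other listed methods; only the model that links descriptors to activity changes.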

Recently, deep neural networks (DNNs) have gained importance not only in drug discovery but also in other areas of science and business. A DNN is a deep learning method that builds hierarchical internal representations of the input data using multiple hidden layers. Four major DNN architectures, convolutional neural networks (CNNs), recurrent neural networks (RNNs), deep autoencoders (AEs) and deep belief networks (DBNs), each have their own advantages for model generation. These methods are applied to the prediction of biological activity, ADMET properties and physicochemical parameters. With small training sets, conventional ML methods tend to perform as well as or better than DNNs, whereas with large data sets DNNs tend to outperform them. Overfitting is a major challenge during model generation, and regularization techniques such as Dropout and DropConnect have been developed to counter it. These methods have also seen significant development in areas such as de novo design, prediction of ligand-receptor binding energy, chemical synthesis, nanoparticles, formulations and many more.
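
To make the overfitting remedy concrete, here is a minimal sketch of dropout applied to one hidden layer's activations, written in plain Python. It uses the common "inverted" variant, which rescales the surviving units during training so that nothing needs to change at prediction time; the activation values and the rate of 0.5 are made-up examples, and deep learning frameworks provide this as a built-in layer.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At prediction time every unit is kept unchanged.
        return list(activations)
    keep = 1.0 - rate
    # Surviving units are divided by `keep` so the expected sum is preserved.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical hidden-layer activations for one molecule.
hidden = [0.5, 1.2, 0.3, 0.8, 1.0, 0.1]

dropped = dropout(hidden, rate=0.5, rng=random.Random(0))
kept = dropout(hidden, rate=0.5, training=False)
print(dropped)  # some units zeroed, the others doubled
print(kept)     # identical to `hidden`
```

Because a different random subset of units is silenced on every training pass, the network cannot rely on any single unit, which is what limits overfitting. DropConnect follows the same idea but drops individual weights rather than whole units.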

ML/DL methods can provide an end-to-end service in drug discovery, development and beyond. The new rational pipeline is expected to accelerate discovery and reduce failures along the way. In the future, knowledge-based innovation should generate new medicines that offer cost-effective treatment for chronic diseases.

*IDS has in-house ML/DL models and provides these services to our clients.
