19-10-2016, 03:37 PM
Protein-Protein Docking: A Survey on Searching Techniques and Scoring Functions, towards a Machine Learning Approach
1459918796-ProteinProteinDockingASurveyonSearchingTechniquesandScoringFunctionstowardsaMachineLearningApproach.docx (Size: 30.94 KB / Downloads: 7)
Abstract
Protein-protein docking is a higher order computational problem in which finding the best pose where the two proteins can bind with each other. Since the problem is having higher volume of search space the complexity in solving this problem is still hard in practice. Even though there are lot of technology developments, there exists a challenge in reducing the computational complexity to solve the docking problem. For every docking problem orientation and conformation will be the two major processes involved. This paper provides a layout that briefs about the available searching techniques (for orientation) and scoring functions (for conformation) used to solve the docking problem. Here we have concentrated with some of the machine learning algorithms that may helpful in reducing the computational complexity and enhancing the accuracy too. And we have suggested the appropriate techniques and algorithms that may helpful in solving the protein-protein docking problem effectively.
Introduction
Protein-protein docking comprises of two step process first step will be orientation that is identifying the correct geometrical site where the proteins can fit over. The second step is conformation which is a process of evaluating the orientation for the best pose. In general the orientation can be done by the sampling techniques, and the conformation done by some of the scoring techniques.
Here, the problem is computationally intensive since and the searching space is too large. In order to identify the best pose the proteins will undergo large set of conformational changes during the binding process which has to be stored and processed with the provided recourses. We are bringing to the fact that concerning the protein flexibility will leads to the docking problem more accurate and complex to solve. Based on the computational intensity the protein docking problem can be broadly classified into following categories they are (i) protein-rigid & ligand-rigid, (ii) protein-rigid & ligand-flexible, (iii)protein-flexible & ligand-rigid, (iv) protein- flexible & ligand-flexible. Category four can be further classified into two sub categories, (i) Rigid protein backbone and side chains are flexible, (ii) Fully flexible protein backbone and side chains.
In the past few years there are lot of researchers are trying to reduce the computational complexity of protein docking. Since the flexible docking needs randomized sampling and large set of conformational changes. So it is clear that the searching space is too large which leads to the higher order computational complexity.
Despite of the marginal improvements in computational intelligence flexible docking is still hard problem to solve with in a deterministic polynomial time (NP hard). We review several searching strategies and scoring techniques which may helpful in reducing the computational complexity of the flexible docking problem.
Materials and methods
Searching techniques
The major process in docking will be identifying the hotspot binding sites at the protein-protein interface. There are lot of searching techniques have been available for the docking process. Broadly the searching techniques can be classified into three categories such as exhaustive global search, local shape feature matching, and randomized searching technique.
Exhaustive global search
Based on three dimensional translation and rotation. The molecules are fall into two categories static molecule and moving molecule. The moving molecule will search over the translational space of the static molecule to find the binding site.
The above process will be repeated until the rotational space of the moving molecule gets completed.
FTT based search
In FFT (Fast Fourier Transform) based searching algorithms maps the protein atoms into a grid space, and the interactions are being calculated by using the scoring functions. FFT uses simple shape complementary based (Chen and Wang, 2003; Chen et.al. 2003) scoring function. Given set of proteins, static (S) and the moving (M). These proteins are mapped into the grid as follow as
S={S(l,m,n), 1≤ l, m, n≤ N} with
S(l,m,n)={█(█(1:onthesurfaceoftheprotein@ρ:insidethemolecule)@0:Outsidetheprotein)┤
M={M(l,m,n), 1≤ l, m, n≤ N} with
M(l,m,n)={█(█(1:onthesurfaceoftheprotein@δ:insidethemolecule)@0:Outsidetheprotein)┤
Where ρ is a number <<1 and δ ranges the positive value between 0 and 1.
The binding score can be calculated with respect to the static protein translation space C (p, q, r).
C={ C (p, q, r), 1≤ p, q, r ≤ N} with
C(p,q,r)=∑_(l=1)^n▒∑_(m=1)^n▒∑_(n=1)^n▒〖S(l,m,n)×M(l+p,m+q,n+r),〗
It can be completed as
C=〖FFT〗^(-1) ((FFT(S) ) ´×FFT(M) )
The above equation will clearly explains the FFT based searching algorithm uses simple shape complementary based scoring function.
Direct search in Cartesian space
Here, this algorithm maps the two proteins into Cartesian grid space. After the mapping process each grid points are associated with a value. The grid based mechanism used here will be similar to FFT based algorithms. The values of the grid can be represented as
G(v)={█(o:ifthegridnotoccupiedbytheprotein@1:ifthegridoccupiedbytheprotein)┤
Where G (v) is the value associated with the grid position. In addition to that here there are some Boolean functions and some heuristics have been used in order to enhance the speed of searching process.
Local shape feature matching
In case of local shape feature matching the proteins are being represented in form of molecular shapes for example Connolly surface. This surface can be calculated by excluding the accessible surface which is a probe or solvent molecule, after that some algorithms have been used in order to calculate the effective local shape complementary. Distance geometric algorithm and geometric hashing are the two predominant algorithms have been used in calculating the local shape complementary.
Distance geometric algorithm
There are three steps to be followed in distance geometric algorithm. First molecular surface will be calculated for receptor molecule using the DMS program. Second we have to generate the sphere points to project the binding site of receptor to the moving molecule using SPHGEN program. Third step will be constructing the possible complexes between the receptor and moving ligands. If a match is said to be successful if and only if the edges in a set of sphere points have the match with, the set of ligand atoms given a certain distance tolerance. The docking process will continue until the rational binding orientation gets complete.
Geometric hashing
Geometric hashing algorithms have the capacity to do the global searching for protein-protein docking by calculating the shape descriptors local match. For example we can take the surface patches of two proteins which can be generated by PatchDock (Duhovny et.al. 2002) program. First both proteins have to be represented in a molecular surface such as Connolly surface with the help of MS program. Second we have to use a segmentation algorithm which will identify the patches (geometric) such as flat, convex and concave. Third a graph will be constructed to represent the patches as nodes which depend on the shape function. With the help of this graph surface patches will be matched using the geometric hashing algorithm. A Pose-clustering technique is used to rank the orientation based on some score for the respective shape complementary.
Randomized search
Randomized searching algorithms have been used for both local and global search in atom level. In this type of searching technique it is not necessary to have grid or shape like molecular representations instead it uses the randomized beginning and terminal position of moving protein. Here the binding process begin with the random positions so once the initial binding pose gets ready it can be refined using some algorithms such as genetic and Monte Carlo methods. RosettaDock, ATTRACT, ICM Docking, ICM Disco are some examples for randomized searching algorithms.
RosettaDock (Gray et.al. 2003; Chaudhury and Gray 2008) is both local and global Protein-Protein docking program in which the binding site information’s are already known. By starting the random positions with respect to the random rotations followed by the translation the binding decoys will be created. A two stage Monte Carlo search program is used for the purpose of refining the binding conformation as well as the side chain conformation. In the first stage of this searching program will be low –resolution optimization. The second stage will be high-resolution refinement. The searching will be takes place still a rational number of conformations gets complete.
ATTRACT
ATTRACT (Zacharias, 2003; Bastard, 2006) is a global Protein-Protein docking program. Here each residue of the proteins represented up to 3 pseudo atoms. First backbone atoms represented using pseudo atom located Cα position. Second side chain atoms are represented using another pseudo atom at the geometric mean of side chain atoms or amino atoms. Third two pseudo atoms being used for amino acids which are larger and flexible. Then we have to generate sequence of start positions on the static protein with a probe radius that greater than the radius of the moving protein. By considering random starting positions initial orientation has been constructed with respect to the rotations. Similar to RosettaDock here also the initial orientations were optimized by the Monte Carlo search.
HADDOK
The docking process in HADDOCK (Dominguez, et.al. 2003) is based on Ambiguous Interaction Resistant which regulates a distance between the residues incorporated with the interaction. Here also the initial orientation will be randomized in which two proteins are separated by 150Å distance from each other. After the initial orientations the refinement process will be carried out in the Cartesian space using the Molecular Dynamics simulations. Then finally the results are being clustered in order to eliminate duplications available in the conformations.
ICM-Disco (Abagyan et.al., 1994) can be used to achieve global energy optimization by the multiple starting positions with respect to the moving ligand using the randomized searching algorithm as a two step process. First we have to define multiple starting positions separated by 15Å distance from each other. Then the sampling will be carried out with respect to each starting positions. It has been undergone with a Monte Carlo minimization. Then it is refined further by considering the side chain flexibility using the global energy optimization technique.
Scoring function
A function which may helpful in calculating the binding affinity of the ligand in a specific portion of the protein. It is a simple function which is used predicting how strappingly the ligand binds to the target protein. In order to increase the reliability of computational docking we are in need of the scoring function more accurate. Here, prediction is necessary to justify which mode of binding is considered to be a best one. I this section we are going to discuss about various type of scoring functions available to solve the docking problem.
Physics based scoring function
The physics based scoring function depends on the force fields available which may helpful in computing straight interaction between the proteins. They use non-bonded energies of force fields. It consists of van der Waals, electrostatic energy, and hydrogen bonding terms. This physics functions are useful in calculating potential energy which is a part of energy change during the binding process of two proteins. AMBER (Weiner et.al. 1984; Weiner et.al. 1986), COMBINE, GoldScore, and MedusaScore are the examples of energy functions. The general form of these functions will be
∆Gbinding=∆EVDW+∆Eelectrostatic+ [∆EH-bond] + ∆Gsolvant
A major benefit of physics-based methods is that they can work on the progress of modern force fields, quantum mechanics methods, solvation models, and others. If the scoring function relies on bond free energies then the physics based scoring functions are useful in practice.