Nussinov algorithm

Nucleic acid structure prediction algorithm
ClassNucleic acid structure prediction
Worst-case performance O ( n 3 ) {\displaystyle O(n^{3})}
Worst-case space complexity O ( n 2 ) {\displaystyle O(n^{2})}

The Nussinov algorithm is a nucleic acid structure prediction algorithm used in computational biology to predict the folding of an RNA molecule that makes use of dynamic programming principles.[1] The algorithm was developed by Ruth Nussinov in the late 1970s.

Background

RNA origami occurs when an RNA molecule "folds" and binds to itself. This folding often determines the function of the RNA molecule. RNA folds at different levels, this algorithm predicts the secondary structure of the RNA.

Algorithm

Scoring

We score a solution by counting the total number of paired bases. Thus, attempting to maximize the score that maximizes the total number of bonds between bases.

Motivation

Consider an RNA sequence S {\displaystyle S} whose elements are taken from the set { A , U , C , G } {\displaystyle \{A,U,C,G\}} . Let us imagine we have an optimal solution to the subproblem of folding S i {\displaystyle S_{i}} to S j 1 {\displaystyle S_{j-1}} , and an optimal solution for folding S u {\displaystyle S_{u}} to S v {\displaystyle S_{v}} i u v j 1 {\displaystyle i\leq u\leq v\leq j-1} . Now, to align S i {\displaystyle S_{i}} to S j {\displaystyle S_{j}} , we have two options:

  1. Leave S j {\displaystyle S_{j}} unpaired, and keep the structure of S i {\displaystyle S_{i}} to S j 1 {\displaystyle S_{j-1}} . The score for this alignment will be equal to the score of the alignment of S i {\displaystyle S_{i}} to S j 1 {\displaystyle S_{j-1}} , as no new base pairs were created.
  2. Pair S j {\displaystyle S_{j}} with S k {\displaystyle S_{k}} , where i k < j {\displaystyle i\leq k<j} . The score for this alignment will be the score of the base pairing, plus the score of the best alignment of S i {\displaystyle S_{i}} to S k 1 {\displaystyle S_{k-1}} and S k + 1 {\displaystyle S_{k+1}} to S j 1 {\displaystyle S_{j-1}} .

Algorithm

Consider an RNA sequence S {\displaystyle S} of length n {\displaystyle n} such that S i { A , U , C , G } {\displaystyle S_{i}\in \{A,U,C,G\}} .

Construct an n × n {\displaystyle n\times n} matrix M {\displaystyle M} . Initialize M {\displaystyle M} such that

M ( i , i ) = 0 {\displaystyle M(i,i)=0}

M ( i , i 1 ) = 0 {\displaystyle M(i,i-1)=0}

for 1 i n {\displaystyle 1\leq i\leq n} .

M ( i , j ) {\displaystyle M(i,j)} will contain the maximum score for the subsequence S i . . . S j {\displaystyle S_{i}...S_{j}} . Now, fill in entries of M {\displaystyle M} up and to the right, so that

M ( i , j ) = max i k < j { M ( i , k 1 ) + M ( k + 1 , j 1 ) + Score ( S k , S j ) M ( i , j 1 ) {\displaystyle M(i,j)=\max _{i\leq k<j}{\begin{cases}M(i,k-1)+M(k+1,j-1)+{\text{Score}}(S_{k},S_{j})\\M(i,j-1)\end{cases}}}

where Score ( S k , S j ) = { 1 , S k  and  S j  complementary 0 , otherwise. {\displaystyle {\text{Score}}(S_{k},S_{j})={\begin{cases}1,&S_{k}{\text{ and }}S_{j}{\text{ complementary}}\\0,&{\text{otherwise.}}\end{cases}}}

After this step, we have a matrix M {\displaystyle M} where M ( i , j ) {\displaystyle M(i,j)} represents the optimal score of the folding of S i . . . S j {\displaystyle S_{i}...S_{j}} .

To determine the structure of the folded RNA by traceback, we first create an empty list of pairs P {\displaystyle P} . We initialize with i = 1 , j = n {\displaystyle i=1,j=n} . Then, we follow one of three scenarios.

  1. If j i {\displaystyle j\leq i} , the procedure stops.
  2. If M ( i , j ) = M ( i , j 1 ) {\displaystyle M(i,j)=M(i,j-1)} , then set i = i , j = j 1 {\displaystyle i=i,j=j-1} and continue.
  3. Otherwise, for all k : i k < j {\displaystyle k:i\leq k<j} , if S k {\displaystyle S_{k}} and S j {\displaystyle S_{j}} are complementary and M ( i , j ) = M ( i , k 1 ) + M ( k + 1 , j 1 ) + 1 {\displaystyle M(i,j)=M(i,k-1)+M(k+1,j-1)+1} , append ( k , j ) {\displaystyle (k,j)} to P {\displaystyle P} , then traceback both with i = i , j = k 1 {\displaystyle i=i,j=k-1} and i = k + 1 , j = j 1 {\displaystyle i=k+1,j=j-1} .

When the traceback finishes, P {\displaystyle P} contains all of the paired bases.

Limitations

The Nussinov algorithm does not account for the three-dimensional shape of RNA, nor predict RNA pseudoknots.[2] Furthermore, in its basic form, it does not account for a minimum stem loop size. However, it is still useful as a fast algorithm for basic prediction of secondary structure.

References

  1. ^ Nussinov, R; Jacobson, A B (Nov 1980). "Fast algorithm for predicting the secondary structure of single-stranded RNA". Proceedings of the National Academy of Sciences of the United States of America. 77 (11): 6309–6313. Bibcode:1980PNAS...77.6309N. doi:10.1073/pnas.77.11.6309. ISSN 0027-8424. PMC 350273. PMID 6161375.
  2. ^ "RNA Structure and RNA Structure Prediction" (PDF).