Combinatorial RNA

Computer Science Level 1

RNA molecules are long chains of different repeating units called bases , which can either fold and pair with any other base on the same molecule, or remain unpaired. For short RNA chains, the number of possible arrangements can be quickly enumerated by inspection. Which arrangement the RNA tends to fold into depends on its sequence of bases: given a sequence, every arrangement can be "scored", as some pairs are more likely to form stable bonds than others.

For biological RNAs, the typical length is greater than 100 bases. As with shorter lengths, the arrangement that the RNA folds into depends on its sequence, but the number of possible arrangements to consider is outrageously large, including many common structural motifs shown below.

For an RNA molecule consisting of $N=100$ bases, there are $n$ different arrangements which it can choose from.

How many digits are in $n$ expressed in base $10?$

21 83 106 189

3 solutions

Jose Fernandez Goycoolea
Jul 21, 2018

Let's call $R_N$ the number of different arrangements of an RNA molecule consisting of $N$ bases.

Consider the very first base of the chain, there are two possibilities for it:

If it's free, then the other bases can be arranged in $R_{N-1}$ different ways.
If it's paired, there are $N-1$ possible couples for it and in each case, the remaining bases can be arranged in $R_{N-2}$ different ways.

We have a recurrence relation! starting whith $R_1=1$ and $R_2=2$ : $R_N=R_{N-1}+(N-1)R_{N-2}$

One could try here to find a closed form for the recurrence, but it is generally not easy to do so. I just wrote a little Python code to find $R_{100}=24053347438333478953622433243028232812964119825419485684849162710512551427284402176$ that has $\boxed{83}$ digits.

R=[1,1,2]
for N in range(3,101):
    R.append(R[N-1]+(N-1)*R[N-2])
print(R[100])
print(len(str(R[100])))

Log may be a bit inaccurate and its use looks cumbersome, print(len(str(R[100]))) is simpler, though takes more time.

Alexander Maly - 2 years, 10 months ago

You are right, the formula I originally write although it is correct on paper, fail in some cases when implemented. For example print(1 + log(1000)/log(10)//1) returns 3 and no 4 as it should. Do you know why this is??

Jose Fernandez Goycoolea - 2 years, 10 months ago

Due to finite precison (usually 64 bits) of the hardware floating points. In hexadecimal instead of
log(1000) / log(10) = 3 we obtain 0x1.ba18a998fffap+2 / 0x1.26bb1bbb55516p+1 = 0x1.7ffffffffffffp+1
That is, just a bit less than 3.

Alexander Maly - 2 years, 10 months ago

Blake Farrow Staff
Jul 18, 2018

This problem is an application of permutations .

Let's break down the steps required to count the number of arrangements. We'll have to

Select the number of pairs $k$ to consider.
Select $2k$ bases to form the $k$ pairs out of.
Partition the $2k$ bases into $k$ pairs, each of which have $2$ bases.

For Step 1, we'll have to sum up all possible number of pairs, from $0$ (a completely unpaired molecule of $N$ bases) to $N/2$ (every base is paired). So from this step, we'll just take some number $k$ to stand for the number of base pairs we'll be considering for the later steps.

For Step 2 , we need to choose $2k$ bases from the $N$ total bases in the molecule to participate in the $k$ base pairs. This is a pretty straightforward quantity that we can express using the binomial " $N$ choose $2k$ ":

$\binom{N}{2k} = \frac{N!}{(2k)!(n-2k)!}.$

For Step 3 , things get a bit more dicey. Since this is a partitioning problem, we seek to find the number of ways to put $2k$ interchangeable bases into $k$ pairs, so that each pair has exactly $2$ bases in it. A good first guess might be to use the multinomial, something like this:

$\binom{N}{2,...2},$

where the number of $2$ in the multinomial is $k$ , the number of pairs we wish to form. This multinomial is equivalently stated using factorials as:

$\binom{N}{2,...2} = \frac{(2k)!}{2^k},$

since $2! = 2$ .

But we'd be way over-counting if we took this term as the number of ways of performing Step 3. If we use a straight up binomial, we're assuming an ordering on the pairs even though they don’t have one, the order of the pairs that we select don't matter when we're uniquely determining an RNA structure, so we are over-counting by a factor of the number of ways of arranging the $k$ base pairs. So to correct for this factor, we can still use the multinomial but we also have to divide by a factor of $k!$ . So the number of ways of performing Step 3 is:

$\binom{N}{2,...2} \times \frac{1}{k!} = \frac{(2k)!}{2^k k!}.$

Add 'em up

So combining Steps 2 and 3, we can figure out the number of arrangements for a given number of base pairs $k$ :

$n(k) = \frac{N!}{(2k)!(N-2k)!} \frac{(2k)!}{2^k} \frac{1}{k!}$

As we mentioned for Step 1, we need to add up all the possible numbers of base pairs $n$ , from $0$ to $N/2$ . Of course if $N$ is odd, we don't want a fraction, so really we want the floor: $\left \lfloor{\frac{N}{2}}\right \rfloor .$

$n = \sum_0^{\left \lfloor{\frac{N}{2}}\right \rfloor} \frac{N!}{(N-2k)! 2^k k!}$

Now that we've got this function, we can just plug in the number of bases we're considering, $N=100$ , to yield the number of arrangements. I did this with Mathematica. It's a really big number:

$n(100) = 24053347438333478953622433243028232812964119825419485684849162710512551427284402176$

Which has 83 digits.

Side Note

For short RNA sequences, enumerating all the possible arrangements and scoring them using their sequence is practical. Given a sequence, if an $\tt A$ base pairs with a $\tt U$ base the score goes up by $1$ , as with a $\tt C$ base pairing with a $\tt G$ base. Other "wobble" pairs like $\tt U$ and $\tt G$ add a somewhat lower number to the score.

But there's no way anyone is ever going to enumerate and score all of the arrangements for a $N=100$ RNA molecule. Thanks to some enterprising work by a biologist named Ruth Nussinov in the late 1970s, we don't have to. Her algorithm uses dynamic programming to decrease the time requirements from combinatorial $O(n!)$ to low polynomial time $O(n^3)$ .

If all the solutions require programming I think a coding environment would be useful to have under the problem. After finding the recursion I got the rough approximation n(k)~(k/e)^(k/2) giving log(n(100))~(2-0.4342)*50~78

Levente Bodnár - 2 years, 10 months ago

You're right, Levente. I've added a coding environment to the problem. Cheers.

Blake Farrow Staff - 2 years, 10 months ago

It's not clear to me what the algorithm from Ruth Nussinov does in polynomial time. It scores every sequence? To score every sequence you'd need to enumerate all of them. It scores a single sequence? That can be obv be done in a linear time

Nicola Gabbia - 2 years, 10 months ago

Nussinov's algorithm scores the different arrangements as it goes, and only considered the highest scoring arrangement for a given sequence. It only ends up scoring order $N^3$ arrangements and is guaranteed to find the highest scoring arrangement, given one important caveat.

It takes advantage of a pattern in biological RNAs that I didn't mention in the problem: pairs are almost always nested . Bases "inside" of a pair will only pair with other bases inside. An example RNA is shown below, with the connectivity drawn with concentric (or nested) arcs:

This means if you can enumerate and score all the arrangements of a short subsequence and find the optimal one, that arrangement will remain the same even if you expand the subsequence. So the Nussinov algorithm scores short, easily enumerated subsequences, then expands one base at a time in both directions, only forming pairs which score highest. This nested quality of RNA arrangements makes this problem well suited to dynamic programming, which needs to be able to break the intractable problem into many simpler and independent sub-problems.

This is one of the focuses of a new chapter that's coming out in Brilliant's Computational Biology course in the next few days.

Blake

Blake Farrow Staff - 2 years, 10 months ago

Thanks, very interesting. Hence the purpose of the algorithm is finding the highest scoring sequences

Nicola Gabbia - 2 years, 10 months ago

K T
Jan 4, 2021

If we have n nodes, the number of pairs that can be formed ranges from $0$ to $\lfloor \frac{n}{2}\rfloor$ .

To place p pairs among n nodes can be done in $\frac{n!}{(n-2p)!2^p p!}$ ways.

For n=100, we sum this expression for p ranging from 0 to 50 and get

$\sum_{p=0}^{50} \frac{n!}{(n-2p)!2^p p!}=24053347438333478953622433243028232812964119825419485684849162710512551427284402176$

which has $\boxed{83}$ digits.

Combinatorial RNA

3 solutions

0 pending reports