Saving Bits

Computer Science Level 1

A biologist wants to encode a large genetic sequence consisting of the letters C , G , A , and T .

In his data, C and G were the most common letters, so he chose the encoding C -> 1, G -> 0, A -> 00, T -> 01 . For example, CCGT is encoded as 11001 .

Later, the biologist noticed 11001 in his codebase. He concluded that it represented CCGT . Is he correct?

The following code allows you encode a DNA string. Try running the code, and then change the string in line 12 to see if you can find other strings with the same encoding!

encoding = {'C' : '1', 'G' : '0', 'A' : '00', 'T': '01'}

def encode(sequence):
    #Create an empty string for the encoding of the sequence
    encoded_string = ""
    
    #For each letter, append its encoding to the encoded string
    for letter in range(len(sequence)):
        encoded_string += encoding[sequence[letter]]
    return encoded_string

print(encode('CCGT'))

Python 3

Bonus 1: Can you describe a better encoding scheme?

Bonus 2: Can you write code that returns all possible genetic sequences from a given encoding?

Yes, he is correct No, not necessarily

17 solutions

Arjen Vreugdenhil
Sep 24, 2017

Relevant wiki: Huffman Code

An alternative scheme, apart from the one Cole mentions, could be C -> 0, G -> 10, A -> 110, T -> 111 . If the frequencies of C and G are significantly higher than of A and T, this will result in improved length.

Detailed analysis

Using Huffman's algorithm we find the most effective encoding, based on the frequencies. I will assume that $f_C > f_G > f_A, f_T.$ (Here, $f_X$ stands for the expected frequency of base X.) The following graph represents all possible cases: $f_G$ is plotted horizontally and $f_C$ is plotted vertically. For instance,

U represents the limiting situation of only Cs.
V represents the situation with only Cs and Gs, at equal frequencies.
Y is the situation with $f_C = f_G = f_A + f_T$ .
X is the situation with $f_C = f_G = f_A = f_T$ .

The three lines delimiting the allowed region are defined as follows:

Black line $XYV$ : $f_C > f_G$ .
Red line $UV$ : $f_C + f_G \leq 1$ .
Green line $UZX$ : $f_G > f_A, f_T$ implies $2f_G > 1 - f_G + f_C$ and thus $f_G > \tfrac13 - \tfrac13f_C$ .

There are two possible scenarios.

Case I. $f_A + f_T > f_C$ . The Huffman algortihm tells us to use a coding such as C -> 00, G -> 01, A -> 10, T -> 11

The inequality $f_C < f_A + f_T$ implies $f_C < 1 - f_G - f_C$ or $f_C < \tfrac12 - \tfrac12f_G$ (blue line $ZY$ ). Thus we apply this algorithm in green shaded region. This region corresponds to the situation in which the bases occur with approximately the same frequencies.

Case II. $f_C > f_A + f_T$ . In this case, the Huffman algorithm suggests C -> 0, G -> 10, A -> 110, T -> 111

This applies to the yellow shaded region.

In Case I, the expected code length is obviously 2 bits per base.

In Case II, the expected code length is $n = 3 - f_G - 2f_C$ . In the graph the "iso-length" lines have been drawn for various values of $n$ . Note that in the green shaded region this code would lead to an expected code length $n > 2$ ; this shows that the Case I code is indeed more efficient in this situation.

This is such a great answer with a lot of insight to the problem. Thank you very much!

Ibrahim Akalin - 3 years, 8 months ago

Seriously impressive

Ben Mc Donough - 3 years, 8 months ago

My god Mr. Arjen Vreugdenhil, you are a genius! What book(s) did you read to come up with such an amazing solution?

Mickyas Tesfamichael - 3 years, 7 months ago

I studied math and like to be creative with it. It is always fun to figure out how something can be translated into a graph. Huffman codes I remember from Theoretical Informatics classes.

Brilliant problems are often a nice starting point for "exploring" a situation. Don't just solve it, but determine relationships, trends, and develop an intuitive understanding of what you find.

Arjen Vreugdenhil - 3 years, 7 months ago

Cole Persch
Sep 24, 2017

This is a poor coding scheme because (for example) 01 could be mistaken at GC or T. 11001 in particular (using only the single bit encoding) could just be CCGGC.

It is much better to be consistent in the number of digits each letter encodes to. Example: C -> 00 G -> 10 T ->01 A -> 11

This way, CCGT is encoded at 00001001, and there is no room to argue. Another advantage of this system is that you can easily tell the number of letters represented, just divide the length by 2.

Moderator note:

Whether it is better or not you have the same number of digits each letter depends on your situation.

If you're looking at a sequence in a random access fashion (that is, a search could hit anywhere in the string) then it's much easier to encode every letter as given in this answer and know that, for example, position $100 \times 2 - 1 = 199$ will be the start of the $100^\text{th}$ term in the sequence.

However, if you're looking for high compression (that this problem hints at, since it mentions C and G are the most common letters) which will only be uncompressed sequentially, it can be more efficient to have what's known as a variable length code . Suppose the code has letters C, G, A, and T, where C is the most likely, G is the next most likely, and A and T are the least likely. One possible encoding would be C = 0, G = 10, A = 110, and T = 111. This would reduce the expected length of any codewords. Arjen's answer goes into more detail on this.

In Challenge Master answer, T should by 111, not 110

Dale Berger - 3 years, 8 months ago

Yep, fixed.

Jason Dyer Staff - 3 years, 8 months ago

You mean 01 mistaken as GC or T, right? Because 10 isn't T, it's 01. Your example, however, is spot on for a better encoding system.

Was the problem changed at some time? Because I saw the same kind of mixup in another answer as well.

József Inczefi - 3 years, 8 months ago

You are right to point out that he meant 01 could be mistaken as GC or T. Hence, another sequence that yields the same code is CCGGC.

I have edited the solution to reflect this. The problem has not been changed.

Calvin Lin Staff - 3 years, 8 months ago

it still says 10 instead of 01 :)

József Inczefi - 3 years, 8 months ago

@József Inczefi – Ah thanks. Fixed. There were 2 of us editing at the same time, so we ended up saving over the edit.

Calvin Lin Staff - 3 years, 8 months ago

Chew-Seong Cheong
Sep 25, 2017

No, not necessarily because 11001 can also be CCAC. A better codebase can be a strictly 2-digit system such as A - 00, C - 01, G - 10, T - 11, then CCGT - 01011011 and CCAC - 01010001. All sequence then will be distinct.

all that needs to be done to choose the answer no, not necessarily is to prove another answer could be correct. for 11001, it could be CCAC. Like you said, the answer is no, not necessarily.

gh dgfhdgf - 3 years, 3 months ago

Kamlesh Kumar
Sep 25, 2017

The important thing to think about for this kind of encoding is that the code assigned to one character is not prefix of the code assigned to any other character .

Here the code for character G is 0 and 0 are present as a prefix in the code of character A and T Which will create confusion while decoding. That's why given encoding is not valid.

Kevin Gillespie
Sep 25, 2017

This problem is a perfect example of Huffman coding

Hasmik Garyaka
Sep 19, 2017

CG and T both encoded as 01.

CG is 10, not 01.

József Inczefi - 3 years, 8 months ago

Victor Dumbrava
Sep 25, 2017

CCGT ( 11001 ) is equivalent to CCAC , as A represents a run of zeros and GT is 001 , equivalent to AC .

Frank Fourlas
Oct 1, 2017

A better solution would be to encode each letter with too bytes. For example A->00 T->01 G->10 C->11. This way we encode the 4 letters very efficiently because we use the minimum amount of bytes to represent 4 different values and we have a constant length of each value (2 bytes) and we cannot mess up by counting a 2 or 3 byte character as 2 or 3 characters 1 byte long.

Beau Broad
Sep 30, 2017

Better encoding. Since you don't need a binary read out on the program don't use it use the letters instead. If you do need a binary read out then use 3 figures so you have 4 different options and you know each 3 figures represents 1 part of the base pair. Or just the numbers 1 through 4. I don't understand why you would encode dna when you would just have to decode it to look at it.

Joshua Newburn
Sep 29, 2017

Instead of "1" and "0" make them "11" and "00". That way there are no ambiguities.

Kyle Brilliant
Sep 29, 2017

A better encoding would be CCAC, this way the codes are used more efficiently and it still results in 1101

Saksham Jain
Sep 29, 2017

CCAC is also encoded same

Douglas Douglas
Sep 29, 2017

Coding in simple binary but with 2 units for each element. C = 00, G = 01, T =10, A = 11. Because each element occupies two spaces the confusion of deciding whether to decode q single integer or a double is eliminated.

Andrew Du
Sep 28, 2017

There could be multiple strings with the same encoding, as mentioned below. AT=1 CG=0 is better. DNA has only two types of pairs. I don't know Bonus 2.

Shreyas Belwalkar
Sep 28, 2017

Since you have only four combinations, i.e, A,T,G,C, and a definite code for each one of them, i.e, 0 & 1,which also repeat themselves in the coding of an individual, i.e, A is 00, you can write the numbers in at least 2 arrangements of A,T,G &C. In this case, 11001 can also be written as CCAC, besides CCGT. Hence the biologist is not necessarily correct, as it can be either of these two arrangements in the DNA

Yehonathan Shatz
Sep 27, 2017

CCAC was the first alternative option I could think of. I haven't ever done much coding, so I may have misunderstood something though.

I assumed that the sequence "11001" given in the problem started at the beginning of a letter, and that there four (full) letters in total encoded in that sequence. With this assumption, the first two letters had to both be "C", because that is the only letter whose encryption starts with a "1" (and also happens to end there.)

Now what remains is "001". This can be broken into three options (apologies for the bad formatting!): -0, 0, 1 (GGC) -00, 1 (AC) -0, 01 (GT)

The first option would make there be 5 letters in total (I already have CC from before.) The second option was my choice. The third is the choice already made in the problem, so I had to choose another.

I hope that this was not too bad of an answer. This is my first time posting here. Thanks a lot for reading!

Adem Pokvic
Sep 26, 2017

In coding, it's not allowed for a character to be the prefix of an other char expect in graycode (every char is coded like that 10-a, 100-b, 1000-c, 10000-d... in that case we know a char enden if we find a 1 ie. 1 is a separator)

Saving Bits

17 solutions

Moderator note:

0 pending reports