A biologist wants to encode a large genetic sequence consisting of the letters
C
,
G
,
A
, and
T
.
In his data, C and G were the most common letters, so he chose the encoding
C -> 1, G -> 0, A -> 00, T -> 01
. For example,
CCGT
is encoded as
11001
.
Later, the biologist noticed
11001
in his codebase. He concluded that it represented
CCGT
.
Is he correct?
The following code allows you encode a DNA string. Try running the code, and then change the string in line 12 to see if you can find other strings with the same encoding!
encoding = {'C' : '1', 'G' : '0', 'A' : '00', 'T': '01'} def encode(sequence): #Create an empty string for the encoding of the sequence encoded_string = "" #For each letter, append its encoding to the encoded string for letter in range(len(sequence)): encoded_string += encoding[sequence[letter]] return encoded_string print(encode('CCGT'))
Bonus 1: Can you describe a better encoding scheme?
Bonus 2: Can you write code that returns all possible genetic sequences from a given encoding?
This section requires Javascript.
You are seeing this because something didn't load right. We suggest you, (a) try
refreshing the page, (b) enabling javascript if it is disabled on your browser and,
finally, (c)
loading the
non-javascript version of this page
. We're sorry about the hassle.
This is such a great answer with a lot of insight to the problem. Thank you very much!
Seriously impressive
My god Mr. Arjen Vreugdenhil, you are a genius! What book(s) did you read to come up with such an amazing solution?
Log in to reply
I studied math and like to be creative with it. It is always fun to figure out how something can be translated into a graph. Huffman codes I remember from Theoretical Informatics classes.
Brilliant problems are often a nice starting point for "exploring" a situation. Don't just solve it, but determine relationships, trends, and develop an intuitive understanding of what you find.
This is a poor coding scheme because (for example) 01 could be mistaken at GC or T. 11001 in particular (using only the single bit encoding) could just be CCGGC.
It is much better to be consistent in the number of digits each letter encodes to. Example: C -> 00 G -> 10 T ->01 A -> 11
This way, CCGT is encoded at 00001001, and there is no room to argue. Another advantage of this system is that you can easily tell the number of letters represented, just divide the length by 2.
Whether it is better or not you have the same number of digits each letter depends on your situation.
If you're looking at a sequence in a random access fashion (that is, a search could hit anywhere in the string) then it's much easier to encode every letter as given in this answer and know that, for example, position 1 0 0 × 2 − 1 = 1 9 9 will be the start of the 1 0 0 th term in the sequence.
However, if you're looking for high compression (that this problem hints at, since it mentions C and G are the most common letters) which will only be uncompressed sequentially, it can be more efficient to have what's known as a variable length code . Suppose the code has letters C, G, A, and T, where C is the most likely, G is the next most likely, and A and T are the least likely. One possible encoding would be C = 0, G = 10, A = 110, and T = 111. This would reduce the expected length of any codewords. Arjen's answer goes into more detail on this.
In Challenge Master answer, T should by 111, not 110
You mean 01 mistaken as GC or T, right? Because 10 isn't T, it's 01. Your example, however, is spot on for a better encoding system.
Was the problem changed at some time? Because I saw the same kind of mixup in another answer as well.
Log in to reply
You are right to point out that he meant 01 could be mistaken as GC or T. Hence, another sequence that yields the same code is CCGGC.
I have edited the solution to reflect this. The problem has not been changed.
Log in to reply
it still says 10 instead of 01 :)
Log in to reply
@József Inczefi – Ah thanks. Fixed. There were 2 of us editing at the same time, so we ended up saving over the edit.
No, not necessarily because 11001 can also be CCAC. A better codebase can be a strictly 2-digit system such as A - 00, C - 01, G - 10, T - 11, then CCGT - 01011011 and CCAC - 01010001. All sequence then will be distinct.
all that needs to be done to choose the answer no, not necessarily is to prove another answer could be correct. for 11001, it could be CCAC. Like you said, the answer is no, not necessarily.
The important thing to think about for this kind of encoding is that the code assigned to one character is not prefix of the code assigned to any other character .
Here the code for character G is 0 and 0 are present as a prefix in the code of character A and T Which will create confusion while decoding. That's why given encoding is not valid.
CG and T both encoded as 01.
CG is 10, not 01.
CCGT
(
11001
) is equivalent to
CCAC
, as
A
represents a run of zeros and
GT
is
001
, equivalent to
AC
.
A better solution would be to encode each letter with too bytes. For example A->00 T->01 G->10 C->11. This way we encode the 4 letters very efficiently because we use the minimum amount of bytes to represent 4 different values and we have a constant length of each value (2 bytes) and we cannot mess up by counting a 2 or 3 byte character as 2 or 3 characters 1 byte long.
Better encoding. Since you don't need a binary read out on the program don't use it use the letters instead. If you do need a binary read out then use 3 figures so you have 4 different options and you know each 3 figures represents 1 part of the base pair. Or just the numbers 1 through 4. I don't understand why you would encode dna when you would just have to decode it to look at it.
Instead of "1" and "0" make them "11" and "00". That way there are no ambiguities.
A better encoding would be CCAC, this way the codes are used more efficiently and it still results in 1101
Coding in simple binary but with 2 units for each element. C = 00, G = 01, T =10, A = 11. Because each element occupies two spaces the confusion of deciding whether to decode q single integer or a double is eliminated.
There could be multiple strings with the same encoding, as mentioned below. AT=1 CG=0 is better. DNA has only two types of pairs. I don't know Bonus 2.
Since you have only four combinations, i.e, A,T,G,C, and a definite code for each one of them, i.e, 0 & 1,which also repeat themselves in the coding of an individual, i.e, A is 00, you can write the numbers in at least 2 arrangements of A,T,G &C. In this case, 11001 can also be written as CCAC, besides CCGT. Hence the biologist is not necessarily correct, as it can be either of these two arrangements in the DNA
CCAC was the first alternative option I could think of. I haven't ever done much coding, so I may have misunderstood something though.
I assumed that the sequence "11001" given in the problem started at the beginning of a letter, and that there four (full) letters in total encoded in that sequence. With this assumption, the first two letters had to both be "C", because that is the only letter whose encryption starts with a "1" (and also happens to end there.)
Now what remains is "001". This can be broken into three options (apologies for the bad formatting!): -0, 0, 1 (GGC) -00, 1 (AC) -0, 01 (GT)
The first option would make there be 5 letters in total (I already have CC from before.) The second option was my choice. The third is the choice already made in the problem, so I had to choose another.
I hope that this was not too bad of an answer. This is my first time posting here. Thanks a lot for reading!
In coding, it's not allowed for a character to be the prefix of an other char expect in graycode (every char is coded like that 10-a, 100-b, 1000-c, 10000-d... in that case we know a char enden if we find a 1 ie. 1 is a separator)
Problem Loading...
Note Loading...
Set Loading...
Relevant wiki: Huffman Code
An alternative scheme, apart from the one Cole mentions, could be
C -> 0, G -> 10, A -> 110, T -> 111
. If the frequencies of C and G are significantly higher than of A and T, this will result in improved length.Detailed analysis
Using Huffman's algorithm we find the most effective encoding, based on the frequencies. I will assume that f C > f G > f A , f T . (Here, f X stands for the expected frequency of base X.) The following graph represents all possible cases: f G is plotted horizontally and f C is plotted vertically. For instance,
U represents the limiting situation of only Cs.
V represents the situation with only Cs and Gs, at equal frequencies.
Y is the situation with f C = f G = f A + f T .
X is the situation with f C = f G = f A = f T .
The three lines delimiting the allowed region are defined as follows:
Black line X Y V : f C > f G .
Red line U V : f C + f G ≤ 1 .
Green line U Z X : f G > f A , f T implies 2 f G > 1 − f G + f C and thus f G > 3 1 − 3 1 f C .
There are two possible scenarios.
Case I. f A + f T > f C . The Huffman algortihm tells us to use a coding such as
C -> 00, G -> 01, A -> 10, T -> 11
The inequality f C < f A + f T implies f C < 1 − f G − f C or f C < 2 1 − 2 1 f G (blue line Z Y ). Thus we apply this algorithm in green shaded region. This region corresponds to the situation in which the bases occur with approximately the same frequencies.
Case II. f C > f A + f T . In this case, the Huffman algorithm suggests
C -> 0, G -> 10, A -> 110, T -> 111
This applies to the yellow shaded region.
In Case I, the expected code length is obviously 2 bits per base.
In Case II, the expected code length is n = 3 − f G − 2 f C . In the graph the "iso-length" lines have been drawn for various values of n . Note that in the green shaded region this code would lead to an expected code length n > 2 ; this shows that the Case I code is indeed more efficient in this situation.