Monkey's Typing Challenge

Probability Level 3

A monkey is typing a string of random letters, with all letters equally (and independently) likely to be typed.

Which word is more likely to appear first in the string, heart or earth?

Heart Earth Equally likely

6 solutions

Aaron Miller Staff
Oct 16, 2017

In a random string of letters, the words $\text{HEART}$ and $\text{EARTH}$ have equal probabilities, but $\text{HEART}$ is more likely to appear nearer to the beginning of the string.

The two words in question share the substring $\text{EART}.$ Imagine searching through a random string and finding the first instance of $\text{EART}.$ In the context of this question, we care about three scenarios:

$\text{EART}$ is preceded by $\text{H}$ to form $\text{HEART}$
$\text{EART}$ is followed by $\text{H},$ and is not preceded by $\text{H},$ to form $\text{EARTH}$
$\text{EART}$ is neither preceded nor followed by $\text{H},$ in which case we move on and look for the next instance of $\text{EART}.$

The probability of scenario 1. is $\frac{1}{26} \approx 0.0385$ since the letters are selected independently. Notice that it doesn't matter if $\text{EART}$ is followed by $\text{H}$ here because $\text{HEART}+\text{H}$ would still count as $\text{HEART}$ appearing first.

Similarly in scenario 2., the probability that $\text{EART}$ is followed by $\text{H}$ is $\frac{1}{26}$ since the letters are selected independently. But the letter preceding $\text{EART}$ cannot be $\text{H},$ or it would count as $\text{HEART}$ appearing first. The probability that the preceding letter is not $\text{H}$ is $\frac{25}{26},$ so the joint probability is $\frac{1}{26} \times \frac{25}{26} \approx 0.0370.$

We can see that each instance of substring $\text{EART}$ has a $0.385\%$ chance of forming the word $\text{HEART},$ but only a $0.370\%$ chance of forming the word $\text{EARTH}$ and not the word $\text{HEART}.$ Thus, $\text{HEART}$ is slightly favored to appear first in a random string.

Small flaw: you don't acknowledge the possibility that EART occurs at position zero, in which case EARTH might appear but HEART has no chance. This offsets your conclusion, but since this circumstance only happens with probability 1/26^4, your answer is still good.

Kevin Bourrillion - 3 years, 7 months ago

This small flaw (omission) becomes of paramount importance as the word length shortens down to 2 letters (HE vs. EH in Gregory Lewis' example below).

Vlad Ro - 3 years, 7 months ago

Amazing explanation !

Rishu Jaar - 3 years, 7 months ago

I don't think that's right... By the same logic, wouldn't we conclue that THEAR would probably come sooner than HEART (using the HEAR string). And then couldn't we say that RTHEA will probably come sooner than THEAR... and then we could say ARTHE will probably come sooner than RTHEA... Finally, we would claim that EARTH will probably come sooner than ARTHE...

We just made a loop! By transitivity, we just showed that EARTH will come before HEART ! I think the problem with studying the string EART is that if these are exactly the first four letters that the monkey types, there is zero probability that an H comes before, while there is 1/26 chance that the word EARTH will be spelled.

I believe the probabilities are equal.

Vivian Robert - 3 years, 7 months ago

Nice reasoning

Navin Murarka - 3 years, 7 months ago

I am adding a comment to my own because I realize now that what I wrote above wasn’t correct. I actually had a great exchange of ideas with a couple of people at brilliant.org who helped me understand how my logic wasn’t right. It was very enriching.

Even though the idea I present of forming a “loop” seemed attractive, I understand now that this is not a transitive process. It is possible for each step in the loop to be true without forming a contradiction. To keep it short, I think the best intuitive way to be convinced of the result is to play around with a much smaller alphabet and a smaller string of letters.

If the alphabet has only two letters, A and B, then we can check which one of ABB or BBA has a higher chance of showing up first. It is not 50%-50%. There are eight ways the monkey’s string may begin: AAA… AAB… ABA… ABB… BAA… BAB… BBA… BBB…

We of course notice that ABB and BBA both show up once. Unsurprisingly, there is a tie after three letters in that respect. However, if we look at the other six possibilities, they lean clearly in favor of ABB showing up first in the near future. AAA is only two letters away from completing ABB whereas it’s three letters away from completing BBA. AAB is only one letter away from completing ABB whereas it’s two letters away from completing BBA. ABA is two letters vs three letters away from the words. BAA is two letters vs three letters away. BAB is one letter vs two letters away. In fact, only BBB is now closer to completing the second word, since it’s only missing one letter to complete BBA (and three letters away from ABB). By extending the eight strings to see what happens further, we would see that it just confirms or even increases this unbalanced trend.

I know this is not a proof at all, but I just wanted to amend my own comment.

Vivian Robert - 2 years, 4 months ago

Sorry for my late response. I saw this problem on Facebook and saved it for later as I was busy. That later is now (as I sometimes don't use Facebook for weeks or even months).

A very simple example of non-transitive probabilities: we run around the circle on and on and stop at a random point. The circle has 3 equally spaced points marked A, B and C.

Now that we stopped, we start again and check whether we are more likely to encounter A or B first. There are 3 equally probable possibilities: we first encounter A, B or C. If we first encounter B, then B is before A. If we encounter A first, then A is before B. If we first encounter C, then we follow CAB, so A is before B.

Summary: the probability that A is before B is 2/3 and B before A only 1/3. So we're more likely to encounter A before B.

It's also easy to work out that it's more likely to encounter B before C. Also C before A. So a loop. And the difference is not insignificant: 2/3 vs 1/3.

The way probability transitivity works is: if X is before Y with probability p and Y before Z with probability q, then X is before Z with probability at least p+q-1. Even with p=q=2/3 that probability may (and in my example does) equal only 1/3. In the HEART/EARTH/ARTHE example probability is only marginally over 1/2, so transitivity promises almost nothing; with more values it promises nothing.

Tomaž Cedilnik - 1 year, 11 months ago

monkey will type zlatan!

Chandru Bala - 2 years, 10 months ago

Interesting part is, Probability is not "transitive".. in the sense that Heart will "beat " Earth which will beat Arthe, which will beat rthea which will beat thear which will beat heart. (IOW if two people have to choose one word (from these above 5) the second person will "win".

Gyan Mehta - 10 months, 3 weeks ago

This logic has a fatal flaw: substitute EART with E and it proves that HE is more likely to come before EH! Swap the H and EART and change the numbers a bit and you've proven that EARTH is more likely to come before HEART!

The problems are twofold: the string could begin with EART, or EARTEART could appear in the string. Both of those reduce the probability that EART is preceded by H. What makes this different than HE vs EH is that the probability of those happening is much lower.

Gregory Lewis - 3 years, 7 months ago

Josh Silverman Staff
Oct 14, 2017

Intuition

On first read this might seem like a trick question.

Suppose we go to position $i$ in any string $S$ prepared with the method described in the problem. There is, of course, an equal probability for the string $S_{i:(i+4)}$ to be $\textrm{EARTH}$ or $\textrm{HEART},$ equal to $1/26^5.$

But what is the probability for it to be $\textrm{EARTH}$ or $\textrm{HEART}$ given that neither $\textrm{EARTH}$ nor $\textrm{HEART}$ have appeared yet?

Suppose we want to know the probability of the string starting at position $i$ being $\textrm{EARTH}$ AND that it is the first time that either $\textrm{EARTH}$ or $\textrm{HEART}$ appear in the string. What we need to worry about is the following: if $S_{i:(i+4)}$ is to be the first appearance of $\textrm{EARTH},$ then $S_{(i-1)}$ cannot be $\textrm{H}$ or else we'd have $S_0S_1\cdots S_{i-2}\textrm{HEARTH}$ and $\textrm{HEART}$ would slide into the win just as $\textrm{EARTH}$ was to appear.

Similarly, if it is to be $\textrm{HEART}$ AND the first time that either $\textrm{HEART}$ or $\textrm{EARTH}$ appear in the string, then the substring $S_{(i-4):(i-1)}$ cannot be $\textrm{EART}.$

From here, it's already clear that we're less likely to have $S_{(i-4):(i-1)} = \textrm{EART}$ than we are to have $S_{(i-1)}=\textrm{H},$ so $\textrm{HEART}$ is more likely to appear first. But to get quantitative, we have to do a calculation.

Calculation

We can write these probabilities as

$P\left(\textrm{EARTH} = S_{i:(i+4)}\ \mathbf{AND}\ \textrm{H} \neq S_{(i-1)} \biggr\rvert\{\textrm{EARTH},\textrm{HEART}\}\notin S_{0:(i-1)}\right)$

and the corresponding one for $\textrm{HEART}$ as

$P\left(\textrm{HEART} = S_{i:(i+4)}\ \mathbf{AND}\ \textrm{EART} \neq S_{(i-4):(i-1)}\biggr\rvert\{\textrm{EARTH},\textrm{HEART}\}\notin S_{0:(i-1)}\right)$

Note that both of these probabilities are conditioned upon neither $\textrm{EARTH}$ nor $\textrm{HEART}$ appearing in $S_{0:(i-1)},$ so we don't have to worry about the likelihood of those events occurring. For simplicity, we'll write the first expression above as $P_i\left(\textrm{EARTH}\right)$ from here on, and likewise for the $\textrm{HEART}$ probability.

For the first expression above we need $S_{(i-1)}\neq\textrm{H},$ $S_{i:(i+3)} = \textrm{EART},$ and $S_{(i+4)}=\textrm{H}.$ Thus,

$P_i\left(\textrm{EARTH}\right) = P(S_{(i-1)}\neq\textrm{H})\times P\left(\textrm{EART}\right)\times P\left(\textrm{H}\right)$ and likewise $P_i\left(\textrm{HEART}\right) = P(S_{(i-4):(i-1)}\neq\textrm{EART})\times P\left(\textrm{H}\right)\times P\left(\textrm{EART}\right)$

Thus, the probability that $\textrm{HEART}$ appears in $S$ before $\textrm{EARTH}$ is given by

$\begin{aligned} P(\textrm{HEART before EARTH}) &= \frac{P_i\left(\textrm{HEART}\right)}{P_i\left(\textrm{HEART}\right) + P_i\left(\textrm{EARTH}\right)} \\ &= \frac{P(S_{(i-4):(i-1)}\neq\textrm{EART})\times P\left(\textrm{H}\right)\times P\left(\textrm{EART}\right)}{P(S_{(i-4):(i-1)}\neq\textrm{EART})\times P\left(\textrm{H}\right)\times P\left(\textrm{EART}\right) + P(S_{(i-1)}\neq\textrm{H})\times P\left(\textrm{EART}\right)\times P\left(\textrm{H}\right)} \\ &= \frac{P(S_{(i-4):(i-1)}\neq\textrm{EART})}{P(S_{(i-1)}\neq\textrm{H}) + P(S_{(i-4):(i-1)}\neq\textrm{EART})} \end{aligned}$

After the cancellations, the probabilities we're left with are simple. If the size of our alphabet is $A,$ then $\begin{aligned} P(S_{(i-1)}\neq\textrm{H}) &= 1-A^{-1} \\ P(S_{(i-4):(i-1)}\neq\textrm{EART}) &= 1 - A^{-4} \end{aligned}$ and $\boxed{\displaystyle P(\textrm{HEART before EARTH}) = \frac{1 - A^{-4}}{2 - A^{-1} - A^{-4}}}.$

Estimate

For the 26 letter alphabet, $A^{-4}$ is very small compared to $A^{-1},$ so we can neglect $A^{-4}$ and expand in $A^{-1}$ as follows $\begin{aligned} P(\textrm{HEART before EARTH}) &\approx \frac12\left(1 + \frac{1}{2A}\right) \\ &= \frac12 + \frac{1}{4A} \\ &\approx 0.5096\ldots \end{aligned}$

The exact probability is approximately $\num{0.50980337470366754985}.$

Generalization

We can generalize this to the case of alphabets of size $A\geq N$ where $N$ is the length of the two words. In our specific case $A$ was 26 and $N$ was 5. For general $A$ and $N,$ we have

$P(N,A) = \frac{1-A^{-(N-1)}}{2 - A^{-1} - A^{-(N-1)}}.$

Simulation

Given how surprising the result is, it is reassuring to verify it by explicitly performing the suggested typing exercise.

Below are the results when we run the experiment on small alphabets of size $A$ equal to the length of the two words. E.g. for $A=4$ , we check whether $\textrm{DABC}$ or $\textrm{ABCD}$ appears first when we enumerate random strings from the alphabet $\{\textrm{A}, \textrm{B}, \textrm{C}, \textrm{D}\}.$

$Simulation results for $N=\num{200000}$ random typing experiments.$ Simulation results for $N=\num{200000}$ random typing experiments.

There seems to be close agreement between our prediction and the simulation.

For reference, the code is below. You can also add letters not in either of the competing words by uncommenting/adding to the list extra .

import string
import random

string1 = "ABCDEF"
string2 = "FABCDE"

letters = list(string1) 
extra = list("MNOP")
alphabet = letters #+ extra

counts = {"ONE": 0, "TWO": 0}

def measure():

    my_string = "0" * len(string1)
    over = None

    while over is None:
        my_string = my_string[1:] + random.choice(alphabet)
        if my_string == string1:
            over = "ONE"
        elif my_string == string2:
            over = "TWO"
    return over

for round in range(200000):
    counts[measure()] += 1

print(counts)

@Josh Silverman , @Eli Ross - Wow!! Awesome question and awesome solution!

Christopher Criscitiello - 3 years, 7 months ago

So I assume this amazing solution is the reason for this problem being in the 'advanced' problems of the week, not the actually problem itself then :)

Arthur Conmy - 3 years, 7 months ago

You should be using a bunch of typewriters and monkeys to prove the result though :)

Dario Zanze - 3 years, 7 months ago

As long as they are interested, I can't see a problem with that.

Josh Silverman Staff - 3 years, 7 months ago

Michael Mendrin
Oct 13, 2017

In a string of indefinite length, consider all instances of the sub-string "EART" Let Z be any letter that is not H. Then we can have sub-strings ZEARTZ, HEARTZ, HEARTH, ZEARTH that can occur.

If HEARTZ is equally likely as ZEARTH to first occur, then because of some chance that HEARTH can be the first to occur, then HEART is more likely to be the first to occur.

Now that Josh Silverman has posted his full solution with the computed probability, let's look at the probability of just one H appearing either at front or back of EART, and the probability of appearing both

$A=\dfrac{1}{26} \dfrac{25}{26}$
$B=\dfrac{1}{26} \dfrac{1}{26}$

Hence, the probability of HEART appearing first should be

$\dfrac{A+B}{2A+B}=0.509804...$

which almost agrees with Josh's figure. Right now I"m not ready to say that my figure here is correct.

Edit: Josh's figure is the accurate one. The solution provide here is still only an approximation, as it does not distinguish between the two cases in that small chance of EART appears in the first few letters of the sequence. My solution neglects the small chance that the sequence begins with EART. Both figures agree to six decimal places, and then diverges.

In my solution I used linear algebra to find the exact probability. Your approximation is close within 0.02%! (Compared to the fact that HEART is about 2% more likely than EARTH.)

Arjen Vreugdenhil - 3 years, 7 months ago

Wondering if we're dead on @Arjen Vreugdenhil , the formula I arrive at gives 0.50980337470 for the probability of HEART before EARTH.

Josh Silverman Staff - 3 years, 7 months ago

Yes, I believe your boxed formula (and its generalization for $P(A,N)$ ) is identical with my solution. The denominator in your solution is likely the determinant of the matrix I used, up to some trivial factor.

Arjen Vreugdenhil - 3 years, 7 months ago

What is the likelihood that the first instance would in fact be HEART, with the next, (or 6th,) letter being an H? I just thought it would be neat if that were to happen because the addition of a single letter could make both words in the five letter format, be found in order. Although the probability of such an occurence is likely very low. Nor would this supposition change the original outcome.

Matt Bronio - 3 years, 7 months ago

Arjen Vreugdenhil
Oct 23, 2017

We can think of this process as an automaton, with the following states and transitions: (blue line: H; purple line: E; green line: other letter (X))

$\begin{array}{r|cccccc} \text{state} & H & E & A & R & T & \text{other} \\ \hline \text{start:}\ 0 & 1 & 5 & 0 & 0 & 0 & 0 \\ 1 & 1 & 2 & 0 & 0 & 0 & 0 \\ 2 & 1 & 5 & 3 & 0 & 0 & 0 \\ 3 & 1 & 5 & 0 & 4 & 0 & 0 \\ 4 & 1 & 5 & 0 & 0 & \mathbf{HEART} & 0 \\ 5 & 1 & 5 & 6 & 0 & 0 & 0 \\ 6 & 1 & 5 & 0 & 7 & 0 & 0 \\ 7 & 1 & 5 & 0 & 0 & 8 & 0 \\ 8 & \mathbf{xEARTH} & 5 & 0 & 0 & 0 & 0 \\ \hline \end{array}$

Let $P(s)$ indicate the probability that a state $s$ will eventually lead to the finals state $\mathbf{HEART}$ . The probability of $\mathbf{xEARTH}$ is then $1 - P(s)$ . Treating $P(\cdot)$ as a vector, we may write the transition probabilities as follows, where $\alpha = \tfrac1{26}$ is the probability of an individual letter: $\mathbf P(s) = \left[\begin{array}{ccccccccc} 1 - 2\alpha & \alpha & & & & \alpha & & & \\ 1 - 2\alpha & \alpha & \alpha & & & & & & \\ 1 - 3\alpha & \alpha & & \alpha & & \alpha & & & \\ 1 - 3\alpha & \alpha & & & \alpha & \alpha & & & \\ 1 - 3\alpha & \alpha & & & & \alpha & & & \\ 1 - 3\alpha & \alpha & & & & \alpha & \alpha & & \\ 1 - 3\alpha & \alpha & & & & \alpha & & \alpha & \\ 1 - 3\alpha & \alpha & & & & \alpha & & & \alpha \\ 1 - 2\alpha & & & & & \alpha & & & \end{array}\right]\ \mathbf P(s) + \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 0 \\ \alpha \\ 0 \\ 0 \\ 0 \\ 0\end{array}\right].$ This can be reduced to $\left[\begin{array}{ccccccccc} -2\alpha & \alpha & & & & \alpha & & & \\ 1 - 2\alpha & \alpha-1 & \alpha & & & & & & \\ 1 - 3\alpha & \alpha & -1 & \alpha & & \alpha & & & \\ 1 - 3\alpha & \alpha & & -1 & \alpha & \alpha & & & \\ 1 - 3\alpha & \alpha & & & -1 & \alpha & & & \\ 1 - 3\alpha & \alpha & & & & \alpha-1 & \alpha & & \\ 1 - 3\alpha & \alpha & & & & \alpha & -1 & \alpha & \\ 1 - 3\alpha & \alpha & & & & \alpha & & -1 & \alpha \\ 1 - 2\alpha & & & & & \alpha & & & -1 \end{array}\right]\ \mathbf P(s) = \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 0 \\ -\alpha \\ 0 \\ 0 \\ 0 \\ 0\end{array}\right].$

Lazily, I solved this numerically and found $\mathbf P(s) = \left[\begin{array}{c} 0.509803 \\ 0.509804 \\ 0.509831 \\ 0.510529 \\ 0.528657 \\ 0.509802 \\ 0.509774 \\ 0.509049 \\ 0.490196 \end{array}\right].$ Specifically, $P(S) = 0.509803 > \tfrac12$ , so that the final state $\mathbf{HEART}$ is more likely than $\mathbf{xEARTH}$ .

Note that this is true at any point during the process, except in state $8$ : at that point, the outcome $\mathbf{xEARTH}$ is slightly more probable.

It seems like $\mathbf{S}$ is a container for any letters not in $\mathbf{HEART},$ but what is $\mathbf{x}$ supposed to represent?

Josh Silverman Staff - 3 years, 7 months ago

You should realize that my notation $S$ , $HE$ , $xEA$ etc. does not refer to (sub)strings but to states of the automaton. As the string is feed into the automaton, each character corresponds to a transition from one state to another. Perhaps I should have numbers the states $0, 1, 2, 3, 4, \mathbf{5}, 1^\star, 2^\star, 3^\star, 4^\star, \mathbf{5^\star}$ or something like that, to avoid confusion.

S = initial state. We return to this state when we cannot possibly be in EARTH or HEART at the moment;

xEA = state obtained after a non-H, E, and A in that order.

Arjen Vreugdenhil - 3 years, 7 months ago

Ah I see, that's why $\mathbf{S}$ is its own state, it's just the restart state. Yea I think it would help to show a sample string of states for the automaton, or simply the cycles of the state space.

Josh Silverman Staff - 3 years, 7 months ago

@Josh Silverman – I will post a diagram and use numbers for the states instead of letters. That should clarify matters.

Arjen Vreugdenhil - 3 years, 7 months ago

If we replace the vector in the last equation by $\dots = \left[\begin{array}{c} -1 \\ -1 \\ -1 \\ -1 \\ -1 \\ -1 \\ -1 \\ -1 \\ -1 \end{array}\right],$ and solve, we find the expected number of moves to reach one of the two words from any given state. $\mathbf{E}(s) = \left[\begin{array}{c} 6057178 \\ 6057166 \\ 6056834 \\ 6048218 \\ 5824210 \\ 6057165 \\ 6056834 \\ 6048218 \\ 5824210 \end{array}\right].$ Thus, the monkeys will on average need slightly more than 6 million keystrokes to write either HEART or EARTH.

Naively, one might estimate this number of keystrokes as $\mathbf{E} \approx \frac{26^5}2 = 5\:940\:688,$ which is pretty close to the true value. The actual number of keystrokes is slightly bigger because the probabilities of typing EARTH and HEART are not entirely independent.

Arjen Vreugdenhil - 3 years, 7 months ago

Paul Jackson
Oct 25, 2017

No need for a bunch of math here. On average the letters EART will show up after some expected number of letters are chosen. The probability that an H will be in front of the E is the same as the probability that an H will be behind the T, but when both the letter before the E and after the T are H, HEART wins. This means there are more ways for for HEART to win than for EARTH to win.

Exactly - and quite how so many people got this one wrong is a mystery to me as it is not remotely difficult to arrive at the conclusion you have produced above. This problem is two categories out of place as a p;roblem of the week.

Thomas Sutcliffe - 3 years, 7 months ago

Robert Williams
Mar 8, 2018

If we ignore the possibility of EARTH appearing as the first characters (1 in 11,881,375) this is analogous to two people taking turns flipping a coin and the winner being the first to land heads. The person who goes first has a clear advantage: 1/2 + 1/8 + 1/32... versus 1/4 + 1/16 + 1/64...

In this case the aim is to type an H first, and the Heart team goes first, so has the advantage.

The occurrence of EART can be ignored, it is simply the trigger for each turn, but the probabilities would be the same without it —the game would just be a lot quicker.

Monkey's Typing Challenge

6 solutions

0 pending reports