Y cn rd ths jst fn

Computer Science Level 2

If you could read the title just fine, it is because the English language (like all natural languages) is redundant. This isn't to say that there are multiple words that mean the same thing (although there are), but that if you compare the number of questions you would have to ask to uniquely identify a word I'm thinking of, and the number of possible words, the second number is much bigger than the first.

For example, by the time you read the first four letters of a word starting with calc, you can be pretty sure it's going to end up being calculus, calcium, calculate, or calculation. You don't need all the extra letters to distinguish the remaining possibilities. Concretely, if we take a list of all the words of a given length that exist, and sort them into alphabetical order, we only need about $\log_2 N$ yes/no questions to identify any given word in the list (where $N$ is the number of words in the list). On the other hand, if we were making full use of the language, we could manage $26^L$ unique words of length $L$.
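The $\log_2 N$ figure can be read as a binary search over the sorted list, where each comparison counts as one yes/no question. Here is a minimal Python sketch of that idea; the function name `questions_to_identify` and the eight-word toy list are illustrative assumptions, not part of the original problem.

```python
import math

def questions_to_identify(sorted_words, target):
    """Binary-search a sorted word list, counting one yes/no question per step.

    Returns (number_of_questions, word_found).
    """
    lo, hi = 0, len(sorted_words) - 1
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1  # "Does your word come after sorted_words[mid]?"
        if target > sorted_words[mid]:
            lo = mid + 1
        else:
            hi = mid
    return questions, sorted_words[lo]

# Hypothetical toy list of five-letter words (N = 8, so log2(N) = 3).
toy_words = sorted(["apple", "bread", "chair", "doubt",
                    "eagle", "flame", "grape", "house"])

for word in toy_words:
    asked, found = questions_to_identify(toy_words, word)
    print(word, asked, found)

print("upper bound:", math.ceil(math.log2(len(toy_words))))
```

With a list whose size is not a power of two, some words need one question fewer than others, which is why the count is only approximately $\log_2 N$.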

Using the English language dictionary built into UNIX operating systems, and filtering for words of length 5, I find 10230 unique words. Taking words of length 5 as a proxy for the entire English language, how short, on average, could we make five-letter words before someone with perfect reasoning couldn't read them anymore?
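A count like this can be reproduced roughly as follows. This is only a sketch: the path `/usr/share/dict/words` is an assumption about where the system word list lives, and the filtering rules (lowercasing, ASCII letters only) are guesses, so the result will likely differ from 10230 depending on the dictionary and how words are filtered.

```python
import os

def words_of_length(lines, length=5):
    """Return the set of unique, purely alphabetic words of the given length.

    The normalisation here (lowercase, ASCII letters only) is an assumption;
    the problem does not say how its word list was filtered.
    """
    unique = set()
    for line in lines:
        word = line.strip().lower()
        if len(word) == length and word.isascii() and word.isalpha():
            unique.add(word)
    return unique

DICT_PATH = "/usr/share/dict/words"  # assumed location; varies by system

if os.path.exists(DICT_PATH):
    with open(DICT_PATH, encoding="utf-8", errors="ignore") as handle:
        print(len(words_of_length(handle)))
else:
    print("no word list found at", DICT_PATH)
```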

The answer is 2.83389.

2 solutions

Tijmen Veltman
Aug 31, 2014

As mentioned in the problem, for a word of length $L$ we have $26^L$ possibilities. In our case we have 10230 different words to make, so we need $L$ to satisfy $26^L=10230$ on average. This gives $L=\log_{26}10230 \approx \boxed{2.834}.$
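As a quick numerical check of this step, the sketch below solves $26^L = 10230$ for $L$ and also prints the neighbouring integer cases $26^2$ and $26^3$, which bracket the word count. The variable names are illustrative.

```python
import math

ALPHABET_SIZE = 26
WORD_COUNT = 10230  # five-letter word count quoted in the problem

# Solve 26**L == 10230 for L, i.e. L = log base 26 of 10230.
average_length = math.log(WORD_COUNT, ALPHABET_SIZE)
print(round(average_length, 5))

# Integer lengths on either side: 26**2 is too few strings, 26**3 is enough.
print(ALPHABET_SIZE ** 2, WORD_COUNT, ALPHABET_SIZE ** 3)
```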

Thanks to texting, I read the first word of the title as "Why".

Calvin Lin Staff - 6 years, 9 months ago

Yeah, no. The answer has to be an integer number of letters. Sorry, Josh. Thanks for playing.

Al Fargnoli - 4 years ago

@Al Fargnoli Why does an average have to be an integer?

Daniel Filreis - 3 years, 8 months ago

Because it is impossible to construct a message with a non-integer length of characters. Someplace you have to use a floor, ceiling, or rounding function.

Al Fargnoli - 3 years, 7 months ago

@Al Fargnoli – The actual words would be of integer length, but their average (which is what the exercise asked for) doesn't need to be.

Tijmen Veltman - 3 years, 7 months ago

@Tijmen Veltman – "how short, on average, could we make five letter words"

He didn't ask for the average value. "On average" is something different, and means "generally speaking". That makes the answer '3'.

Patrick Sims - 3 years, 4 months ago

@Patrick Sims – The answer 2.83389 was marked as correct, so that's a pretty strong indication that he did indeed mean the average.

Tijmen Veltman - 3 years, 4 months ago

Sorry, but I cannot agree with this because we are talking about English. In English, 40% of the letters are vowels. Certain digraphs tend to be highly prevalent--think "th" and "qu". The upshot is that the entropy of the language is reduced by roughly 60% on a per word basis. These facts are what make it possible to solve simple substitution ciphers. In any case, if we drop the first and last letter of the five-letter word, the ambiguity skyrockets. Actually, this principle applies to words of arbitrary length. Ask any professional cryptographer.

Mark Von Hendy - 2 years, 8 months ago

Vai Patel
Feb 26, 2018

Use of dictionary words shortens our word list from $26^5$ to $10230$, which decreases the number of questions required to identify a word from $\log_{2}{26^5}$ to $\log_{2}{10230}$.

So on average, we could shorten a word in proportion to the decrease in the number of questions required to identify it, giving us a length $L = 5 \cdot \frac{\log_{2}{10230}}{\log_{2}{26^5}} = \frac{5}{5} \cdot \frac{\log_{2}{10230}}{\log_{2}{26}} = \log_{26}{10230} \approx 2.83389$.
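The chain of equalities can be checked numerically by computing the two question counts in bits and scaling the word length by their ratio. This is a small sketch with illustrative variable names, not part of the original solution.

```python
import math

ALPHABET = 26
WORD_LENGTH = 5
DICTIONARY_WORDS = 10230  # count quoted in the problem

# Questions (bits) needed if every 5-letter string were a word.
bits_all_strings = math.log2(ALPHABET ** WORD_LENGTH)
# Questions (bits) needed for the actual dictionary.
bits_dictionary = math.log2(DICTIONARY_WORDS)

# Shorten the word in proportion to the drop in questions.
shortened = WORD_LENGTH * bits_dictionary / bits_all_strings
# The same quantity written directly as a base-26 logarithm.
direct = math.log(DICTIONARY_WORDS, ALPHABET)

print(bits_all_strings, bits_dictionary)
print(shortened, direct)
```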
