Y cn rd ths jst fn

If you could read the title just fine, it is because the English language (like all natural languages) is redundant. This isn't to say that there are multiple words that mean the same thing (although there are), but that if you compare the number of questions you would have to ask to uniquely identify a word I'm thinking of with the number of possible words, the second number is much bigger than the first.

For example, by the time you read the first four letters of a word starting with calc, you can be pretty sure it's going to end up being calculus, calcium, calculate, or calculation. You don't need all the extra letters to distinguish the remaining possibilities. Concretely, if we take a list of all the words of a given length that exist, and sort them into alphabetical order, we only need $\log_2 N$ questions to identify any given word in the list (where $N$ is the number of words in the list). On the other hand, if we were making full use of the language, we could manage $26^L$ unique words of length $L$.
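The question-counting idea can be sketched as a binary search over a sorted word list, where each question is "does your word come after this one?". This is a minimal illustration, not part of the original problem; the function name `questions_to_find` and the four-word list are hypothetical.

```python
def questions_to_find(sorted_words, target):
    """Count the yes/no questions a binary search needs to isolate target.

    Assumes sorted_words is sorted and contains target.
    """
    lo, hi = 0, len(sorted_words) - 1
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1  # one question: "is your word after sorted_words[mid]?"
        if target > sorted_words[mid]:
            lo = mid + 1
        else:
            hi = mid
    return questions


# Hypothetical four-word list: the words starting with "calc" from the text.
words = sorted(["calcium", "calculate", "calculation", "calculus"])
worst_case = max(questions_to_find(words, w) for w in words)
print(worst_case)  # 2 questions suffice for N = 4
```

When $N$ is not a power of two, the worst case is $\log_2 N$ rounded up to the next whole question.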

Using the English language dictionary built into UNIX operating systems, and filtering for words of length 5, I find 10230 unique words. Taking words of length 5 as a proxy for the entire English language, how short, on average, could we make five-letter words before someone with perfect reasoning couldn't read them anymore?
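A count like this could be reproduced along the following lines. This is only a sketch under stated assumptions: the problem does not say how the filtering was done, so lowercasing and dropping entries with non-letters are guesses, `words_of_length` is a hypothetical helper, and `/usr/share/dict/words` is a common but not universal location for the word list. The resulting count depends on the system's dictionary and may not be 10230.

```python
from pathlib import Path


def words_of_length(lines, length=5):
    """Return the unique, purely alphabetic, lowercased words of a given length."""
    unique = set()
    for line in lines:
        word = line.strip().lower()
        if len(word) == length and word.isalpha():
            unique.add(word)
    return unique


# Tiny stand-in for a dictionary file, so the sketch runs anywhere.
sample = ["Apple\n", "apple\n", "calc\n", "chalk\n", "don't\n", "calculus\n"]
sample_words = words_of_length(sample)

# Assumed path; the word list lives elsewhere (or is absent) on some systems.
dict_path = Path("/usr/share/dict/words")
if dict_path.exists():
    with dict_path.open(encoding="utf-8", errors="ignore") as f:
        print(len(words_of_length(f)))
```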


The answer is 2.83389.

2 solutions

Tijmen Veltman
Aug 31, 2014

As mentioned in the problem, for a word of length $L$ we have $26^L$ possibilities. In our case we have 10230 different words to make, so we need $L$ to satisfy $26^L = 10230$ on average. This gives $L = \log_{26} 10230 \approx \boxed{2.834}$.
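The arithmetic can be checked with a few lines of Python. This is a quick sketch; the constant names are illustrative, and the word count is the figure quoted in the problem.

```python
import math

ALPHABET_SIZE = 26
WORD_COUNT = 10230  # five-letter words reported in the problem

# Solve 26**L == 10230 for L using the change-of-base formula.
average_length = math.log(WORD_COUNT) / math.log(ALPHABET_SIZE)
print(round(average_length, 5))  # 2.83389

# For comparison: two letters cannot label every word, three letters can.
print(ALPHABET_SIZE ** 2, ALPHABET_SIZE ** 3)  # 676 17576
```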

Thanks to texting, I read the first word of the title as "Why".

Calvin Lin Staff - 6 years, 9 months ago

Yeah, no. The answer has to be an integer number of letters. Sorry, Josh. Thanks for playing.

Al Fargnoli - 4 years ago

@Al Fargnoli Why does an average have to be an integer?

Daniel Filreis - 3 years, 8 months ago

Because it is impossible to construct a message with a non-integer length of characters. Someplace you have to use a floor, ceiling, or rounding function.

Al Fargnoli - 3 years, 7 months ago

@Al Fargnoli The actual words would be of integer length, but their average (which is what the exercise asked for) doesn't need to be.

Tijmen Veltman - 3 years, 7 months ago

@Tijmen Veltman "how short, on average, could we make five letter words"

He didn't ask for the average value. "On average" is something different, and means "generally speaking". That makes the answer '3'.

Patrick Sims - 3 years, 4 months ago

@Patrick Sims The answer 2.83389 was marked as correct, so that's a pretty strong indication that he did indeed mean the average.

Tijmen Veltman - 3 years, 4 months ago

Sorry, but I cannot agree with this because we are talking about English. In English, 40% of the letters are vowels. Certain digraphs tend to be highly prevalent--think "th" and "qu". The upshot is that the entropy of the language is reduced by roughly 60% on a per word basis. These facts are what make it possible to solve simple substitution ciphers. In any case, if we drop the first and last letter of the five-letter word, the ambiguity skyrockets. Actually, this principle applies to words of arbitrary length. Ask any professional cryptographer.

Mark Von Hendy - 2 years, 8 months ago

Vai Patel
Feb 26, 2018

Use of dictionary words shortens our word list from $26^5$ to $10230$, which decreases the number of questions required to identify a word from $\log_2 26^5$ to $\log_2 10230$.

So on average, we could shorten a word proportionally to the decrease in the number of questions required to identify it, giving us a length $L = 5 \cdot \frac{\log_2 10230}{\log_2 26^5} = \frac{5}{5} \cdot \frac{\log_2 10230}{\log_2 26} = \log_{26} 10230 \approx 2.83389$
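As a rough numerical check of this proportional argument, the sketch below computes both question counts in bits and scales the five letters accordingly. The variable names are illustrative only.

```python
import math

WORD_LENGTH = 5
WORD_COUNT = 10230  # dictionary words of length 5, from the problem

# Questions (bits) needed with every letter string allowed vs. dictionary words only.
bits_all_strings = math.log2(26 ** WORD_LENGTH)
bits_dictionary = math.log2(WORD_COUNT)

# Shrink the word in proportion to the drop in required questions.
shortened_length = WORD_LENGTH * bits_dictionary / bits_all_strings

print(round(bits_all_strings, 2), round(bits_dictionary, 2))
print(round(shortened_length, 5))
```

The scaled length agrees with $\log_{26} 10230$ because the factor of 5 cancels against the exponent in $\log_2 26^5 = 5 \log_2 26$.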
