Infinite Monkey or Porky Pie

Introduction

I came across an interesting tweet yesterday. In summary, someone had written an article about a football: an English Premier League football club. A fan of that club read that article and spotted that if you take the first letter of every paragraph in the article, it spelled out this sentence: “I’d rather be back at the Boleyn than a soulless bowl wow ts”.

The author of the article said that that was an amazing coincidence: he didn’t intend that to happen.

Infinite Monkey Problem

The Infinite Monkey Problem is the scenario in which an infinite number of monkeys are each given a typewriter … will any of these monkeys ever type out the complete works of Shakespeare? Intuitively, most people say, yes, eventually it must happen.

Without calling the author of the article a monkey, how likely is it that he randomly typed out tat sentence in the way he says? I created a simulation in a spreadsheet that attempts to answer that question.

The Simulation

Firstly, make sure you are aware that the letters of the English alphabet in the English language are not all equally used: the letter e is used more than any other letter by far; then it’s the letter t … z is the least used letter.

I converted those frequencies into cumulative frequencies, expressed as integers, to use with my random number generator. I used RANDARRAY() (see the notes at the end of this article to see my advice on using the newer functions) to help me to generate 100,000 rows of random numbers, from which I used XLOOKUP() to convert the random number to a letter:

I did this eight times in each row so that I could generate any word or combination of up to eight letters to see if any of the words from the article could be simulated to appear.

Notice, each random number has been generated on its own. That is, every letter represents a key depression on a keyboard and it also assumes that every key depression is independent of every other key depression. Moreover, XLOOKUP() addresses the lookup table so that the letter e will appear more than any other letter … and so on

Using the CONCAT() function, I then put together the letters as you see above. Remember, I am trying to find out using this simulation if Excel will simulate the typing of the letter a, since it appears in the sentence we are exploring; will it simulate the typing of the two letter words, be, at and I’d (for this I assumed that id = I’d since I have only programmed letter and not punctuation, just to save time and space). How about. the three, four … eight letter words that appear in the sentence?

Did the Words Appear in the Simulation and how do I Know?

How do I know, from the 100,000 rows and eight columns whether the words appear in the simulation? I used the COUNTIF() function:

I put COUNTIF() in the following range, changing it from finding the letter a to finding the words be … at … the … bowl and so on

We can see that in 100,000 trials, the letter a appears 8,048 times, the word be appears 177 times … and five of the words do not appear in the simulation at all. Well, they might do since that screenshot represents the results of just one simulation. When I press the F9 key and run the simulation again, I get different results, like this one, after pressing the F9 key four times:

[Side bar: for Excel fans, I have counted the total number of words I was searching for in two ways:

=COUNT(AA9:AH12)

=LEN(AA1) - LEN(SUBSTITUTE(AA1," ",""))+1]

What Should have Happened?

So far, I have shown what happened and what might have happened but now it’s time to suggest what should have happened. That is, we got the letter a 8,140 times which is 8140/100000 = 8.244%; we got be 220 times and that is 220/100000 = 0.220% and rather appeared once, which is 1/100000 = 0.001%.

Using the frequency lookup table we can see that a should appear by itself 8.167% of the time … 8.244% is pretty close to that!

To find out how often the word be should appear, we multiply the frequencies (probabilities) of b and e together, to find

1.492% * 12.702% = 0.1895%, that is, 0.1895% * 100,000 = 189.5 or 190 times

The simulations above showed be appearing 177 and 220 times. Hmm, how good is that? Let’s press F9 another ten times and see what happens:

165

188

167

176

181

187

154

213

188

154

average (mean) 177.3

median 178.5

We are expecting be to appear 190 but we are getting a median of 179 times … close!

I have not done it, but we could carry out a Chi Squared test of my results but I am not so bothered here about pinpoint accuracy as the overall chance of the sentence we are discussing being typed randomly. Here are all of the probabilities: simulated and expected following another simulation:

If we combine all of these probabilities now to assess the chances that this sentence appeared randomly, we get an expected probability of

0.0000000000000000000000000000000000000000000000000000000001638%

Conclusion

The simulated probability is unclear for us because several words did not appear in our simulation: even though we simulated 800,000 letters at a time, we need a much bigger sample than that to ensure all of the words are returned.

My title says infinite monkey or porky pie: what do you think? On the basis of this analysis do you think that sentence fell out of that article by accident, with a probability of just 1.63803e-58?

Excel Functions Used

In this article I used both old and new functions and where I think you might not have or have used the newer functions, I suggest an alternative function to use:

New Functions

RANDARRAY() … use RANDBETWEEN() instead

XLOOKUP() … use VLOOKUP() instead

CONCAT() … use CONCATENATE() or & instead

Older Functions with no Alternative Suggested

COUNTIF()

COUNT()

LEN()

SUBSTITUTE()

Download the spreadsheet from here:


The file is large and since there are 800,000 volatile functions in it, it may be slow to refresh if you computer does not have much memory.


Duncan Williamson


7th January 2020


No comments: