Letter Distribution

The most common letters in English are, in order, ETAOIN SHRDLU; the E is thus the most common letter in English.

Historically, such distributions have been important in printing. (Linotype keyboards had their keys in this order, and the layout of modern QWERTY keyboards reduces the frequency of adjacent common letter pairs.) However, even today there are many applications, as can be seen from these other sources:
 * Wikipedia on Letter frequencies.
 * Deaf and Blind on word and letter frequencies, with different letter orders for different sources (fiction, press, religious writing and scientific articles). Deaf and Blind uses this information to improve your typing speed.
 * Data compression on statistical distribution of English texts.
 * Simon Singh on frequency analysis, geared towards deciphering coded messages.

Motivation
Mostly because I was interested in a word frequency analysis (as opposed to a letter frequency analysis), I created a few scripts to generate word and letter frequencies based on a text corpus. As the corpus I used the 120 most popular books on Project Gutenberg, a free source of books (my target was 100, but I overshot a bit to compensate for books in other languages; in the end I only deleted a German-English dictionary). For obvious reasons, this corpus contains only older texts whose copyright has expired, like Pride and Prejudice by Jane Austen (1813), A Christmas Carol by Charles Dickens (1843), The Notebooks of Leonardo Da Vinci (1452-1519), and also The Outline of Science by John Arthur Thomson (1922). Thus, expect this distribution to be slightly different from that of modern-day English (yet another way in which copyright hinders scientific analysis!).

In order to compensate for bias, the letter frequencies are not simply added, but averaged per book. This removes the bias towards long books. For word frequency, there would still be a large random factor due to this particular choice of books in the corpus. For example, A Christmas Carol contains the word Scrooge 362 times, while the other 119 books in the corpus have 0 hits for Scrooge. Thus on average, "Scrooge" occurs 3 times per book (or actually "0.0019% of all words", as I use per-book frequencies instead of totals). The median number of occurrences, however, is 0 times per book, a much more realistic number. So I use the median instead of the average.
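The difference between the two statistics can be seen in a small Python sketch; the counts below are the Scrooge example from the text, not real per-book data:

```python
from statistics import mean, median

# Hypothetical per-book counts for the word "Scrooge" in a 120-book corpus:
# one book (A Christmas Carol) has 362 occurrences, the other 119 have none.
counts = [362] + [0] * 119

print(mean(counts))    # about 3 occurrences per book
print(median(counts))  # 0 occurrences per book
```

Because fewer than half the books contain the word, the median is zero, which matches the intuition that "Scrooge" is not a common English word.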

Procedure
Conversion to UTF-8 and Unix-style line breaks was done with a shell loop like:

for f in $(ls -1 en-original); do
  iconv -t utf-8 < "en-original/$f" > "en-plaintext/$f"
  flip -u "en-plaintext/$f"
done
 * Find the 100 most popular texts in English at a free source; in this case from Project Gutenberg (www.gutenberg.org). The disadvantage of this source is that it favours an old-fashioned writing style. For example, it contains 7667 hits for "&c.", and only 1457 for "etc.", the modern equivalent.
 * Remove all headers and footers, including all transcriber notes. This removes the bias towards words that occur in the header of every file (such as "transcribed by" or "copyright").
 * Convert all texts to UTF-8 encoding, and unix-style line breaks.
 * Compute a first letter frequency distribution and manually check the files for irregularities or strange representations of accents. Read the transcription notes by the transcriber (e.g. change [=a] to ā, [)o] to ŏ, f_te to fête, ɑ-Centauri to α-Centauri, etc.)
 * Scripts: findbaddiacriticals.py, letterfreq.py


 * Remove all accents and punctuation (don't becomes dont), remove double spaces (so we will never find triplets with two consecutive spaces later, such as "a  " or "  b"), and change the text to upper case.
 * Script: text2words.py
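This normalization step could be sketched in Python roughly as follows; this is a hypothetical reimplementation for illustration, not the actual text2words.py:

```python
import re
import unicodedata

def text_to_words(text: str) -> str:
    # Strip accents: canonical decomposition, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove punctuation (anything that is not a letter, digit or whitespace);
    # underscores are also dropped, as they are never word separators.
    no_punct = re.sub(r"[^\w\s]", "", no_accents).replace("_", "")
    # Collapse runs of whitespace to a single space and change to upper case.
    return re.sub(r"\s+", " ", no_punct).strip().upper()

print(text_to_words("Don't visit the fête!"))  # DONT VISIT THE FETE
```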


 * Combine the word files into a statistical distribution of all words, and into statistical distributions of all letters, letter pairs and (optionally) letter triplets.
 * Scripts: generatewordstats.py, generateletterstats.py
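The letter counting itself could look roughly like this; an illustrative sketch, not the actual generateletterstats.py:

```python
from collections import Counter

def letter_stats(text: str, n: int = 1) -> Counter:
    # Count n-grams of characters: n=1 gives single letters,
    # n=2 letter pairs, n=3 letter triplets. Spaces are included,
    # which later allows finding word-starting and word-ending letters.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

pairs = letter_stats("THE THEME", 2)
print(pairs.most_common(3))  # TH and HE each occur twice
```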


 * Optionally remove letters or words with a median of zero (i.e. occurring in fewer than half of the books), and normalize the remaining letters or words.
 * Script: normalizestats.py
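A sketch of this filtering step (hypothetical code, not the actual normalizestats.py): entries that occur in fewer than half of the books have a per-book median of zero and are dropped, after which the remaining frequencies are rescaled to sum to one.

```python
from statistics import median

def normalize(per_book_freqs: dict[str, list[float]]) -> dict[str, float]:
    # per_book_freqs maps each letter or word to its frequency in every book.
    medians = {k: median(v) for k, v in per_book_freqs.items()}
    kept = {k: m for k, m in medians.items() if m > 0}
    total = sum(kept.values())
    return {k: m / total for k, m in kept.items()}

# Toy 3-book corpus: "SCROOGE" occurs in only one book, so its median is zero.
freqs = {"E": [0.12, 0.13, 0.11],
         "SCROOGE": [0.0019, 0.0, 0.0],
         "T": [0.09, 0.08, 0.10]}
print(normalize(freqs))  # only E and T survive, rescaled to sum to 1
```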


 * Summarize the statistics in a table or figure.
 * Scripts: stats2table.py, or generate a letter frequency graph with gnuplot:

#!/usr/bin/env gnuplot
set title "Most common letters in English texts"
set style fill solid border -1
set grid ytics
set format y "%.0f%%"
set xrange [0.5:26.5]  # Plot only the first 26 letters, adding a 0.5 margin on each side
# set xtics rotate by 90  # For the words file
plot '[[media:EN-Letters.txt|letters.txt]]' using 0:(100*$1):(0.8):xticlabels(9) with boxes title "Mean"
replot '[[media:EN-Letters.txt|letters.txt]]' using 0:(100*$2):(100*$3) with yerrorbars title "Average" linetype 3

Source
The corpus (text sources) contains the 120 most popular books from Project Gutenberg.

See also: [[media:EN-sources.txt|List of all text sources]] in the corpus.

The smallest book in the corpus, The Legend of Sleepy Hollow by Washington Irving, is 65.7 kiByte (11˙896 words, 65˙469 letters); although The Illustrated War News of November 18, 1914 contains only 9856 words, its size is larger, 68.9 kiByte. The largest book, Les Misérables by Victor Hugo, is 3.1 MiByte (570˙485 words, 3˙087˙505 letters). The total size was 77.6 MiByte (13˙777˙933 words and 75˙898˙490 letters), thus an average of 662 kiByte (114˙816 words and 632˙487 letters) per book. File sizes are larger than the character counts, since they include punctuation and because UTF-8 encoded text may use more than one byte per character. The letter counts mentioned in this paragraph include spaces and exclude punctuation.

No filtering has been done to remove the sporadic non-English word in an otherwise English text. The idea was that such uncommon words would be filtered out by using the median word frequency: since fewer than half the files contain such a word, it would not be reported.

Word Length
The whole corpus contained 13˙777˙933 words and 75˙898˙490 letters, where "letters" include spaces. Since each word is followed by one space, the average word length is 75898490/13777933 - 1 = 4.509 letters per word.

However, word boundaries are not always obvious. Two punctuation characters are particularly ambiguous: the period and the hyphen.

While the period is mostly used as a full stop at the end of a sentence, it is also used in abbreviations and as a decimal separator. Consider abbreviations such as U. S. A., U.S.A., a.m. and Mr. Since we want to treat a.m. as a single word, we simply remove all full stops. Since a full stop at the end of a sentence is always followed by a space, this still gives the required word delimitation. In this case a.m. and U.S.A. are considered one word, while U. S. A. is considered three words.

The hyphen can be used to separate syllables of a single word. In addition, a hyphen is often used in texts where a dash character is meant. Consider the quote from Jane Eyre by Charlotte Brontë below. In that text, the hyphen is used as a word-separating dash (parish--Mr. Oliver), as a hyphen joining two words (iron-foundry), and as a line-wrapper (needle-factory). While a dash -- is clearly a word separator, a line-wrapping hyphen clearly is not. A manual check of this corpus revealed that treating a hyphen as a word separator gave better results. This is likely to be different in other languages such as Dutch, where merging words is more common.


  Miss Oliver; the only daughter
  of the sole rich man in my parish--Mr. Oliver, the proprietor of a needle-
  factory and iron-foundry in the valley.
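The two rules (drop periods entirely, treat hyphens as separators) can be sketched in Python; this is an illustration of the rules above, not the actual script, and other punctuation is removed in an earlier step:

```python
import re

def tokenize(text: str) -> list[str]:
    # Drop full stops entirely, so "a.m." stays one word, while a
    # sentence-final period (always followed by a space) simply disappears.
    text = text.replace(".", "")
    # Treat hyphens (and the double-hyphen dash) as word separators.
    text = re.sub(r"-+", " ", text)
    return text.split()

print(tokenize("my parish--Mr. Oliver"))   # ['my', 'parish', 'Mr', 'Oliver']
print(tokenize("a needle-factory"))        # ['a', 'needle', 'factory']
```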

Finally, 26 out of the 13˙777˙933 words are reported by wc (the Unix word count tool) as two words. This includes words such as "παροιμα", which occurs in a footnote in Travels in the Great Desert of Sahara. No filtering for such irregularities has been done.

By varying the interpretations of the characters, we get the following letter and word counts:

(*) only listing those characters that are not obvious. For example, space and return are always word separators, while a caret (^) or underscore (_) are never word separators (but simply removed if they occur in the middle of a word).

Word Frequency
All books together contained 13˙777˙933 words, of which 157˙278 are different words. The average per book was 114˙816 words and 8785 different words. The median per book was 95˙883 words and 7659 different words.

To get an idea of the overlap or diversity between the books in the corpus, let's examine the overlap between the different words in the files. If all books had exactly the same words, the total would be 8785 different words (the per-book average). If all books had entirely different words, the total would be 1˙054˙202 different words (the sum of the per-book counts). The actual figure of 157˙278 is somewhere in between.

Some figures:

25 Most frequently occurring words

Raw data: [[media:EN-commonwords.txt|EN-commonwords.txt]], containing the 4010 words that occurred in at least half of the sources.



Letter Frequency
As expected, the majority of the books only contain the letters A-Z, and the most frequently used letters in English are E, T, A, O and N.

Interestingly, 44 out of the 120 books also contain Æ (Latin Capital Letter AE). Apparently Unicode treats this as a character in its own right instead of a ligature (otherwise the compatibility decomposition would have turned it into A + E). More importantly, it seems to be an uncommon but familiar character, in use in words such as Cæsar and Mediæval: familiar because one out of three books uses it, uncommon since its usage is only 1 in every 15˙000 characters (compare Z, the least common letter from A-Z in English, which is used 1 in 1670 characters). Other characters that were used are Œ (OE ligature, which apparently is not converted to OE by Unicode compatibility decomposition either; 5 books), Μ (Mu, 4 books), Α (Alpha, 3 books), Σ (Sigma, 3 books), other Greek characters (3 books or less), Ð (Eth, 1 book) and Þ (Thorn, 1 book).

The table below gives the letter distribution in English. For all practical purposes, consult the normalized table on the right.

Raw data: [[media:EN-Letters.txt|EN-Letters.txt]], the statistical distribution including space.



Comparison with Other Sources
The letter order we found, ETAONI SRHLDU, is slightly different from other sources:

Letter Pairs
The same script that determines letter frequency can also determine letter pairs and letter triplets: two or three letters in a row. For example, QU is a much more common letter pair than IH, even though I and H are more common letters in English than Q and U.

The 25 most common letter pairs (normalized by removing letter pairs with spaces) are listed below.

Raw data: [[media:EN-Letterpairs.txt|EN-Letterpairs.txt]] and [[media:EN-Lettertriplets.txt|EN-Lettertriplets.txt]], the statistical distribution including spaces.

Starting and Ending Letters
By examining all letter pairs that either start or end with a space, we can find the most common starting letter of a word, and the most common ending letter of a word. The following table gives the result.
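This examination of space-adjacent pairs can be sketched in Python; illustrative code, assuming the text is padded with spaces so the first and last words are counted too:

```python
from collections import Counter

def boundary_letters(text: str) -> tuple[Counter, Counter]:
    # A pair " X" marks X as a word-starting letter;
    # a pair "X " marks X as a word-ending letter.
    starts = Counter(b for a, b in zip(text, text[1:]) if a == " " and b != " ")
    ends = Counter(a for a, b in zip(text, text[1:]) if b == " " and a != " ")
    return starts, ends

starts, ends = boundary_letters(" THE CAT SAT ")
print(starts)  # T, C and S each start one word
print(ends)    # T ends two words, E ends one
```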

Interestingly, there is not a single word in the corpus that ends with a Q. And although the word "English" starts with an E, and E is clearly the most common letter, only very few words in the corpus start with an E.

Diacriticals
The above calculations were done by removing diacriticals (acute, circumflex, diaeresis (umlaut), grave, macron, tilde, dot below, caron, etc.)

When diacriticals are kept, it hardly changes the number of words:

Not surprisingly, the number of different letters increases dramatically, from 54 to 139. However, the only additional character that has a median higher than zero (thus occurring in more than half of the books) is É (E-acute).

Below is a normalized table with and without diacriticals. The differences fall well within the margin of error, so the above statistics can be used as-is.

Future Interest
I would be interested to see if there is a distinction between British and American English. Also, I may do the same analysis for Dutch, French and German.