Letter Distribution

From Exterior Memory
Revision as of 16:12, 23 January 2012 by MacFreek (Talk | contribs) (Future Interest)


The most common letters in English are etaoin shrdlu. Thus the E is the most common letter in English.

Historically, such distributions have been important in printing. (The Linotype keyboards had these letters in this order, and the layout of QWERTY keyboards is often said to keep common letter pairs apart.) However, even today, there are many applications, as can be seen from these other sources:

Motivation

Mostly because I was interested in a word frequency analysis (as opposed to a letter frequency analysis), I created a few scripts to generate word and letter frequencies based on a text corpus. As the corpus I used the 120 most popular books on Project Gutenberg, a free source of books. (My target was 100, but I overshot a bit to compensate for books in other languages; in the end I only deleted a German-English dictionary.) For obvious reasons, this corpus contains only older texts whose copyright has expired, like Pride and Prejudice by Jane Austen (1813), A Christmas Carol by Charles Dickens (1843), The Notebooks of Leonardo Da Vinci (1452-1519), and also The Outline of Science by John Arthur Thomson (1922). Thus, expect this distribution to be slightly different from that of modern-day English (yet another reason showing that copyright hinders scientific analysis!).

To compensate for bias, the letter frequencies are not simply summed over the whole corpus, but computed per book and then averaged. This removes the bias towards long books. For word frequency, there would still be a large random factor due to this particular choice of books in the corpus. For example, A Christmas Carol contains the word Scrooge 362 times. The other 119 books in the corpus have 0 hits for Scrooge. Thus on average, "Scrooge" occurs 3 times per book (or actually "0.0019% of all words", as I use relative frequencies instead of totals). The median number of occurrences is 0 times per book, a much more realistic number. So I use the median instead of the average. (The tables on this page label this median "Mean".)
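The difference between the two summaries can be illustrated with a minimal sketch. The function name and the five-book toy corpus are made up for illustration; this is not one of the scripts used for this page.

```python
from statistics import mean, median

def per_book_stats(counts_per_book, words_per_book):
    """Return (average, median) of one word's relative frequency per book."""
    shares = [count / size for count, size in zip(counts_per_book, words_per_book)]
    return mean(shares), median(shares)

# Hypothetical corpus: five books of 1000 words, only one of which uses the word.
counts = [30, 0, 0, 0, 0]
sizes = [1000] * 5

avg, med = per_book_stats(counts, sizes)
print(f"average {avg:.2%}, median {med:.2%}")
```

The average is pulled up by the single book that uses the word, while the median stays at zero as long as fewer than half of the books contain it.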

Procedure

  • Find the 100 most popular texts in English at a free source; in this case from Project Gutenberg (www.gutenberg.org). The disadvantage of this source is that it favours an old-fashioned writing style. For example, it contains 7667 hits for "&c.", and only 1457 for "etc.", the modern equivalent.
  • Remove all headers and footers, including all transcriber notes. This removes bias for words that occur in the header of every file (such as "transcribed by" or "copyright")
  • Convert all texts to UTF-8 encoding, and unix-style line breaks.
# Convert every file to UTF-8, then to unix-style line breaks (bash syntax)
for f in en-original/*; do
    name=$(basename "$f")
    iconv -t utf-8 < "$f" > "en-plaintext/$name"
    flip -u "en-plaintext/$name"
done
  • Do a first letter frequency distribution and manually check files for irregularities or strange representations of accents. Read the transcoding notes by the transcriber (e.g. change [=a] to ā, [)o] to ŏ, f_te to fête, ɑ-Centauri to α-Centauri, etc.)
Scripts: findbaddiacriticals.py, letterfreq.py
  • Remove all accents and punctuation (don't becomes dont). Collapse double spaces (so we will never find a triplet with two consecutive spaces later, such as an A followed by two spaces, or two spaces followed by a B), and change the text to upper case.
Script: text2words.py
  • Combine the word files into a statistical distribution of all words, and into a statistical distribution of all letters, letter pairs and (optionally) letter triplets.
Scripts: generatewordstats.py, generateletterstats.py
  • Optionally remove letters or words with a median of zero (the "Mean" column in the tables), and normalize the remaining letters or words.
Script: normalizestats.py
  • Summarize the statistics in a table or figure.
Scripts: stats2table.py, or generate a letter frequency graph with gnuplot:
#!/usr/bin/env gnuplot
set title "Most common letters in English texts"
set style fill solid border -1
set grid ytics
set format y "%.0f%%"
set xrange [0.5:26.5]   # Plot only first 26 letters, adding a 0.5 margin on the side
# set xtics rotate by 90  # For words file
plot 'letters.txt' using 0:(100*$1):(0.8):xticlabels(9) with boxes title "Mean"
replot 'letters.txt' using 0:(100*$2):(100*$3) with yerrorbars title "Average" linetype 3
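
The text clean-up step in the procedure above (remove accents and punctuation, collapse spaces, upper case) could look roughly like the sketch below. It is not the text2words.py script itself, which is not shown on this page; the function name is made up, and it simply drops all punctuation and digits, which is a simplification.

```python
import re
import unicodedata

def clean_text(text):
    """Strip diacritics and punctuation, collapse whitespace, upper-case."""
    # Decompose accented characters (e.g. e-circumflex becomes e + combining mark)
    # and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep letters and whitespace only; punctuation and digits are dropped.
    letters = "".join(ch for ch in no_marks if ch.isalpha() or ch.isspace())
    # Collapse runs of spaces and line breaks into a single space.
    return re.sub(r"\s+", " ", letters).strip().upper()

print(clean_text("Don't  go to the f\u00eate,\nplease."))
# DONT GO TO THE FETE PLEASE
```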

Results

Source

The corpus (text sources) contains the 120 most popular books in plain-text format from Project Gutenberg.

See also: List of all text sources in the corpus.

The smallest book in the corpus, The Legend of Sleepy Hollow by Washington Irving, is 65.7 kiByte (11896 words, 65469 letters) (although The Illustrated War News of November 18, 1914 contains only 9856 words, its size is larger, 68.9 kiByte); the largest book, Les Miserables by Victor Hugo, is 3.1 MiByte (570˙485 words, 3˙087˙505 letters). The total size was 77.6 MiByte (13˙777˙933 words and 75˙898˙490 letters), thus an average of 662 kiByte per book (114˙816 words and 632˙487 letters). File sizes are larger than character counts, as they include punctuation and because UTF-8 encoded text may use more than one byte per character. The letter counts mentioned in this paragraph include spaces, and exclude punctuation.

No filtering has been done to remove the occasional non-English word in an otherwise English text. The idea was that those uncommon words would be filtered out by using the median word frequency: fewer than half the files would contain such a word, so its median is zero and it is not reported.

Word Length

The whole corpus contained 13˙777˙933 words and 75˙898˙490 letters, where "letters" include spaces. Since each word is counted together with one separating space, the average word size is 75898490/13777933 - 1 = 4.509 letters per word.
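
As a quick check of this arithmetic, here is a small sketch (the function name is made up for illustration):

```python
def average_word_length(letters_incl_spaces, words):
    # Every word is counted together with one space, so subtract 1.
    return letters_incl_spaces / words - 1

print(round(average_word_length(75_898_490, 13_777_933), 3))  # 4.509
```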

However, word boundaries are not always obvious. Two punctuation characters are particularly ambiguous: the period and the hyphen.

While the period is mostly used as a full stop at the end of sentences, it is also used in abbreviations and as a decimal separator. Consider the abbreviations U. S. A., U.S.A., a.m. and Mr. Since we want to treat a.m. as a single word, we simply remove all full stops. Since a full stop at the end of a sentence is always followed by a space, that gives the required word delimitation. In this case a.m. and U.S.A. are considered one word, while U. S. A. is considered three words.

The hyphen can be used to separate syllables of a single word. In addition, a hyphen is often used in texts where a dash character is meant. Consider the quote from Jane Eyre: An Autobiography by Charlotte Brontë below. In that text, the hyphen is used as a word-separating dash (parish--Mr. Oliver), as a hyphen joining two words (iron-foundry) and as a line-wrapper (needle-<CR>factory). While a dash -- is clearly a word separator, a line-wrapper -<CR> clearly is not. A manual check of this corpus revealed that considering a hyphen as a word separator gave better results. It is likely that this is different in other languages, such as Dutch, given that in Dutch it is more common to merge words.

Miss Oliver; the only daughter
of the sole rich man in my parish--Mr. Oliver, the proprietor of a needle-
factory and iron-foundry in the valley.
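
The choice described above (hyphen as a word separator, full stops removed) can be tried on this quote with a minimal sketch. The function name is made up, and the assumption that all other punctuation is simply removed is mine; for this quote it makes no difference, since the other punctuation is always followed by a space.

```python
import re

def split_words(text):
    """Split text into upper-case words, treating hyphens as separators."""
    text = text.replace("-", " ")        # hyphen and dash (--) separate words
    text = re.sub(r"[^\w\s]", "", text)  # drop other punctuation, incl. full stops
    return text.upper().split()

quote = ("Miss Oliver; the only daughter\n"
         "of the sole rich man in my parish--Mr. Oliver, the proprietor of a needle-\n"
         "factory and iron-foundry in the valley.")

words = split_words(quote)
print(len(words))         # 27
print(words[12:15])       # ['PARISH', 'MR', 'OLIVER']
print(split_words("U.S.A."), split_words("U. S. A."))
```

With these rules the line-wrapped needle-factory is split into two words, which is the price paid for treating the dash as a separator.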

Finally, 26 out of 13777933 words are reported by wc (Unix word count tool) as two words. This includes a word such as "παροιμα", which occurs in a footnote in Travels in the Great Desert of Sahara. No filtering for such irregularities has been done.

By varying the interpretations of the characters, we get the following letter and word counts:

word separators (*) non-word separators (*) word count letter count (incl. spaces) change average word length
space return only quotes (' ") full-stop (.) hyphen (-) other punctuation 13622831 75743388 -155102 4.560
other punctuation quotes (' ") full-stop (.) hyphen (-) 13630796 75751353 -147137 4.557
full-stop (.) other punctuation quotes (' ") hyphen (-) 13654183 75774740 -123750 4.550
hyphen (-) other punctuation quotes (' ") full-stop (.) 13777933 75898490 0 (base) 4.509
full-stop (.) hyphen (-) other punctuation quotes (' ") 13789577 75910134 +11644 4.505
quotes (' ") full-stop (.) hyphen (-) other punctuation others only (caret ^ underscore _) 13889202 76009759 +111269 4.473

(*) only listing those characters that are not obvious. For example, space and return are always word separators, while a caret (^) or underscore (_) are never word separators (but simply removed if they occur in the middle of a word).

Word Frequency

All books together contained 13˙777˙933 words of which there are 157˙278 different words. The average per book was 114˙816 words and 8785 different words. The median per book was 95˙883 words and 7659 different words.

To get an idea of the overlap or diversity between the books in the corpus, let's examine the overlap between the different words in the files. If all books had exactly the same words, the total would be 8785 different words. If no two books shared a single word, the total would be 1˙054˙202 different words (the sum of the per-book counts). The actual figure of 157˙278 is somewhere in between.
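
The three figures in this comparison can be computed as in the following sketch, shown here on a made-up three-book toy corpus (the function name and data are illustrative only):

```python
def vocabulary_overlap(books):
    """books: list of word lists.

    Returns (average distinct words per book,
             distinct words in the whole corpus,
             sum of the per-book distinct counts).
    """
    vocabularies = [set(words) for words in books]
    per_book = [len(vocabulary) for vocabulary in vocabularies]
    actual = len(set().union(*vocabularies))
    return sum(per_book) / len(per_book), actual, sum(per_book)

books = [
    "THE CAT SAT ON THE MAT".split(),
    "THE DOG SAT ON THE LOG".split(),
    "A BIRD SANG".split(),
]
average, actual, total = vocabulary_overlap(books)
print(f"{average:.2f} <= {actual} <= {total}")
```

The closer the actual count is to the average, the more the books share their vocabulary; the closer it is to the sum, the less they overlap.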

Some figures:

Total number of words 13˙777˙933
Number of different words, summed over the books 1˙054˙202
# Words occurring at least twice 91˙793
# Words occurring in at least 2 books 65˙357
# Words occurring in at least half of the books 4010

25 Most frequently occurring words

Word Mean Average Std. deviation Book occurrence
THE 7.47% 7.20% 1.92% 120/120
OF 3.77% 3.81% 1.35% 120/120
AND 3.38% 3.36% 0.72% 120/120
TO 2.43% 2.43% 0.47% 120/120
A 2.07% 2.12% 0.43% 120/120
IN 1.96% 1.98% 0.47% 120/120
THAT 1.00% 0.99% 0.36% 120/120
IT 0.91% 0.98% 0.41% 120/120
WAS 0.90% 0.89% 0.56% 120/120
IS 0.89% 0.99% 0.62% 120/120
WITH 0.75% 0.77% 0.18% 120/120
AS 0.74% 0.75% 0.16% 120/120
FOR 0.70% 0.70% 0.18% 120/120
HE 0.69% 0.78% 0.59% 118/120
HIS 0.67% 0.72% 0.52% 119/120
ON 0.61% 0.60% 0.21% 120/120
BY 0.58% 0.60% 0.24% 120/120
AT 0.54% 0.55% 0.18% 120/120
BE 0.51% 0.56% 0.23% 120/120
BUT 0.49% 0.50% 0.19% 120/120
NOT 0.49% 0.49% 0.20% 120/120
FROM 0.46% 0.46% 0.14% 120/120
WHICH 0.45% 0.49% 0.27% 120/120
THIS 0.42% 0.46% 0.16% 120/120
HAD 0.40% 0.45% 0.35% 120/120

Raw data: EN-commonwords.txt, containing the 4010 words that occurred in at least half of the sources.

Most common words in English texts

Letter Frequency

As expected, the majority of the books contain only the letters A-Z, and the most frequently used letters in English are E, T, A, O and N.

Interestingly, 44 out of the 120 books also contain Æ (Latin Capital Letter AE). Apparently Unicode treats this as a letter instead of a ligature (otherwise the compatibility decomposition would have turned it into an A + E). More importantly, it seems to be an uncommon but familiar character, in use in words such as Cæsar and Mediæval: familiar because about one out of three books uses it, uncommon since its usage is only 1 in every 15000 characters (compare to Z, the least common letter from A-Z in English, which is used 1 in every 1670 characters). Other characters that were used are Œ (OE ligature, which apparently is not converted to OE by Unicode compatibility decomposition, 5 books), Μ (Mu, 4 books), Α (Alpha, 3 books), Σ (Sigma, 3 books), other Greek characters (3 books or fewer), Ð (Eth, 1 book) and Þ (Thorn, 1 book).
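
The behaviour of the compatibility decomposition can be checked with Python's standard unicodedata module. This is only a small illustration, not part of the scripts used here.

```python
import unicodedata

# AE, OE ligature and E-acute, written as escapes to avoid encoding issues.
for ch in "\u00c6\u0152\u00c9":
    decomposed = unicodedata.normalize("NFKD", ch)
    print(ch, [unicodedata.name(c) for c in decomposed])
```

Æ and Œ come back unchanged, while É is split into a plain E followed by a combining acute accent, which is why removing the combining marks strips the accent but leaves Æ and Œ in the text.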

The table below gives the letter distribution in English. For all practical purposes, consult the normalized table (the second table below).

Raw statistics (first table) and Normalized (second table)
Letter Mean Average Std. deviation Book occurrence
Space 18.28846265% 18.31685753% 0.99928817% 120/120
E 10.26665037% 10.21787708% 0.39458158% 120/120
T 7.51699827% 7.50999398% 0.41889737% 120/120
A 6.53216702% 6.55307059% 0.36853691% 120/120
O 6.15957725% 6.20055405% 0.31786759% 120/120
N 5.71201113% 5.70308374% 0.30319732% 120/120
I 5.66844326% 5.73425524% 0.44720691% 120/120
S 5.31700534% 5.32626738% 0.37410794% 120/120
R 4.98790855% 4.97199926% 0.41114075% 120/120
H 4.97856396% 4.86220925% 0.62025270% 120/120
L 3.31754796% 3.35616550% 0.29645987% 120/120
D 3.28292310% 3.35227377% 0.38989991% 120/120
U 2.27579536% 2.29520040% 0.22213489% 120/120
C 2.23367596% 2.26508836% 0.47638953% 120/120
M 2.02656783% 2.01727037% 0.21595281% 120/120
F 1.98306716% 1.97180888% 0.24773033% 120/120
W 1.70389377% 1.68961396% 0.33305120% 120/120
G 1.62490441% 1.63586607% 0.18039463% 120/120
P 1.50432428% 1.50311560% 0.23884118% 120/120
Y 1.42766662% 1.46995463% 0.29196635% 120/120
B 1.25888074% 1.27076566% 0.13130773% 120/120
V 0.79611644% 0.78804815% 0.09796300% 120/120
K 0.56096272% 0.56916712% 0.18896073% 120/120
X 0.14092016% 0.14980832% 0.07221414% 120/120
J 0.09752181% 0.11440544% 0.05892571% 120/120
Q 0.08367550% 0.08809302% 0.03196684% 120/120
Z 0.05128469% 0.05979301% 0.03778145% 120/120
Ae 0.00000000% 0.00653524% 0.02446928% 44/120
Oe 0.00000000% 0.00027979% 0.00177956% 5/120
Alpha 0.00000000% 0.00006771% 0.00051567% 3/120
Omicron 0.00000000% 0.00005821% 0.00048921% 2/120
Iota 0.00000000% 0.00005526% 0.00043422% 2/120
Sigma 0.00000000% 0.00004565% 0.00034906% 3/120
Epsilon 0.00000000% 0.00004536% 0.00036217% 2/120
Nu 0.00000000% 0.00004440% 0.00038359% 2/120
Tau 0.00000000% 0.00003606% 0.00028694% 3/120
Upsilon 0.00000000% 0.00003106% 0.00026460% 2/120
Mu 0.00000000% 0.00002643% 0.00016019% 4/120
Eta 0.00000000% 0.00002349% 0.00020447% 3/120
Omega 0.00000000% 0.00002156% 0.00017057% 2/120
Pi 0.00000000% 0.00002148% 0.00016503% 2/120
Lamda 0.00000000% 0.00001846% 0.00015096% 3/120
Kappa 0.00000000% 0.00001717% 0.00014063% 2/120
Gamma 0.00000000% 0.00001454% 0.00011691% 2/120
Rho 0.00000000% 0.00000998% 0.00008438% 2/120
Beta 0.00000000% 0.00000942% 0.00007870% 2/120
Delta 0.00000000% 0.00000719% 0.00005686% 2/120
Theta 0.00000000% 0.00000655% 0.00006040% 2/120
Chi 0.00000000% 0.00000607% 0.00004688% 2/120
Phi 0.00000000% 0.00000503% 0.00005487% 1/120
Eth 0.00000000% 0.00000330% 0.00003605% 1/120
Thorn 0.00000000% 0.00000142% 0.00001545% 1/120
Zeta 0.00000000% 0.00000112% 0.00001219% 1/120
Psi 0.00000000% 0.00000112% 0.00001219% 1/120
Letter Mean Average Std. deviation Book occurrence
 
E 12.596% 12.510% 0.483% 120/120
T 9.222% 9.195% 0.513% 120/120
A 8.014% 8.023% 0.451% 120/120
O 7.557% 7.592% 0.389% 120/120
N 7.008% 6.983% 0.371% 120/120
I 6.954% 7.021% 0.548% 120/120
S 6.523% 6.521% 0.458% 120/120
R 6.119% 6.087% 0.503% 120/120
H 6.108% 5.953% 0.759% 120/120
L 4.070% 4.109% 0.363% 120/120
D 4.028% 4.104% 0.477% 120/120
U 2.792% 2.810% 0.272% 120/120
C 2.740% 2.773% 0.583% 120/120
M 2.486% 2.470% 0.264% 120/120
F 2.433% 2.414% 0.303% 120/120
W 2.090% 2.069% 0.408% 120/120
G 1.994% 2.003% 0.221% 120/120
P 1.846% 1.840% 0.292% 120/120
Y 1.752% 1.800% 0.357% 120/120
B 1.544% 1.556% 0.161% 120/120
V 0.977% 0.965% 0.120% 120/120
K 0.688% 0.697% 0.231% 120/120
X 0.173% 0.183% 0.088% 120/120
J 0.120% 0.140% 0.072% 120/120
Q 0.103% 0.108% 0.039% 120/120
Z 0.063% 0.073% 0.046% 120/120

Raw data: EN-Letters.txt, the statistical distribution including space.
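
The normalized values appear to follow from the raw ones by dropping the space and all entries with a "Mean" of zero, and rescaling the rest so that they sum to 100%. The sketch below is my reading of that step, not the normalizestats.py script itself; the function name is made up and the numbers are the "Mean" column of the raw table.

```python
# "Mean" column of the raw table, in percent (Ae stands for all zero entries).
raw = {
    "Space": 18.28846265, "E": 10.26665037, "T": 7.51699827, "A": 6.53216702,
    "O": 6.15957725, "N": 5.71201113, "I": 5.66844326, "S": 5.31700534,
    "R": 4.98790855, "H": 4.97856396, "L": 3.31754796, "D": 3.28292310,
    "U": 2.27579536, "C": 2.23367596, "M": 2.02656783, "F": 1.98306716,
    "W": 1.70389377, "G": 1.62490441, "P": 1.50432428, "Y": 1.42766662,
    "B": 1.25888074, "V": 0.79611644, "K": 0.56096272, "X": 0.14092016,
    "J": 0.09752181, "Q": 0.08367550, "Z": 0.05128469, "Ae": 0.0,
}

def renormalize(stats, exclude=("Space",)):
    """Drop excluded and zero entries, rescale the rest to sum to 1."""
    kept = {key: value for key, value in stats.items()
            if key not in exclude and value > 0}
    total = sum(kept.values())
    return {key: value / total for key, value in kept.items()}

normalized = renormalize(raw)
print(f"E {normalized['E']:.3%}, T {normalized['T']:.3%}")
```

For E and T this gives about 12.596% and 9.222%, in line with the normalized table.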

Frequency distribution of letters in English texts

Comparison with Other Sources

The letter order we found, ETAONI SRHLDU is slightly different from other sources:

Source Order Corpus Size
Linotype ETAOIN SHRDLU CMFWYP VBGKQJ XZ Unknown
Our statistics ETAONI SRHLDU CMFWGP YBVKXJ QZ 120 books, 13˙777˙933 words and 75˙898˙490 letters
Data Compression ETAONI SRHDLU CMFWGY PBVKXJ QZ 7 books, 5086936 letters
Robert Edward Lewand ETAOIN SHRDLC UMWFGY PBVKJX QZ Unknown
Tom Linton ETAINO SRLDHC UMFPYG WVBKXJ QZ 3 sources, 2700 words, 15000 letters
Deaf and Blind ETAOIN SRHLDC UMFPGW YBVKXJ QZ English Language
Deaf and Blind ETAONI SRHLDC MUFPGW YBVKJX QZ Press Reporting
Deaf and Blind ETIAON SRHLDC UMFPYW GBVKXJ QZ Religious Writings
Deaf and Blind ETAION SRHLCD UMFPGY BWVKXQ JZ Scientific Writings
Deaf and Blind ETAOHN ISRDLU WMCGFY PVKBJX ZQ General Fiction
Deaf and Blind ETAOIN SRHLDC UMFPGW YBVKXJ QZ Word Averages
Deaf and Blind ETAINO SHRDLC UMFWYG PBVKQJ XZ Morse Code
Deaf and Blind EAIRTO NSLCUP MDHGBY FVWKXZ QJ Non-Plural Word Letter Frequency (18584 words)
Deaf and Blind EISARN TOLCDU GPMHBY FVKWZX JQ Plural Word Letter Frequency (45406 words)
Simon Singh ETAOIN SHRDLC UMWFGY PBVKJX QZ Unknown

Letter Pairs

The same script that determines letter frequency can also determine letter pairs and letter triplets: two or three letters in a row. For example, QU is a much more common letter pair than IH, even though I and H are more common letters than Q and U in English.
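
Counting pairs and triplets amounts to sliding a window of two or three characters over the cleaned text. A minimal sketch follows; the function name and sample sentence are made up, and this is not the generateletterstats.py script.

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Relative frequency of every run of n consecutive characters."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}

pairs = ngram_frequencies("THE QUEEN QUIT THE QUIZ", 2)
print(f"QU {pairs['QU']:.1%}, IH {pairs.get('IH', 0):.1%}")
```

Pairs that contain a space are counted as well; they can be filtered out afterwards, or used to find the first and last letters of words.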

The 26 most common letter pairs (normalized by removing letter pairs with spaces) are listed below.

Letter pair Mean Average Std. deviation Book occurrence
TH 4.21% 4.04% 0.57% 120/120
HE 3.80% 3.66% 0.57% 120/120
IN 2.42% 2.39% 0.26% 120/120
AN 2.19% 2.13% 0.27% 120/120
ER 2.11% 2.06% 0.21% 120/120
RE 1.88% 1.79% 0.18% 120/120
ND 1.63% 1.59% 0.25% 120/120
ON 1.52% 1.49% 0.28% 120/120
EN 1.39% 1.35% 0.19% 120/120
AT 1.39% 1.33% 0.16% 120/120
ES 1.30% 1.23% 0.26% 120/120
ED 1.25% 1.21% 0.29% 120/120
OR 1.21% 1.17% 0.19% 120/120
OF 1.20% 1.16% 0.34% 120/120
AR 1.16% 1.12% 0.19% 120/120
IS 1.16% 1.13% 0.23% 120/120
IT 1.14% 1.12% 0.18% 120/120
OU 1.13% 1.16% 0.39% 120/120
TO 1.13% 1.09% 0.16% 120/120
HA 1.10% 1.08% 0.31% 120/120
NG 1.09% 1.06% 0.17% 120/120
ST 1.08% 1.08% 0.21% 120/120
TE 1.07% 1.04% 0.21% 120/120
AS 1.00% 0.96% 0.17% 120/120
HI 0.99% 0.96% 0.26% 120/120
SE 0.94% 0.93% 0.15% 120/120

Raw data: EN-Letterpairs.txt and EN-Lettertriplets.txt, the statistical distribution including spaces.

Starting and Ending Letters

By examining all letter pairs that either start or end with a space, we can find the most common starting letter of a word, and the most common ending letter of a word. The following table gives the result.
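
The idea can be sketched as follows, assuming the text has already been reduced to upper-case words separated by single spaces. The function name and sample sentence are made up for illustration.

```python
from collections import Counter

def start_end_letters(text):
    """Count first and last letters of words via pairs that contain a space."""
    padded = f" {text} "  # so the first and last word also border a space
    starts, ends = Counter(), Counter()
    for a, b in zip(padded, padded[1:]):
        if a == " " and b != " ":
            starts[b] += 1   # pair "space + letter": the letter starts a word
        elif a != " " and b == " ":
            ends[a] += 1     # pair "letter + space": the letter ends a word
    return starts, ends

starts, ends = start_end_letters("THE CAT SAT ON THE MAT")
print(starts.most_common(1), ends.most_common(1))
```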

Starting Letter of a Word (first table) and Ending Letter of a Word (second table)
Letter Mean Average Std. deviation Book occurrence
T 16.708% 16.228% 2.030% 120/120
A 11.670% 11.429% 1.139% 120/120
O 7.529% 7.173% 1.212% 120/120
S 6.987% 7.002% 1.153% 120/120
I 6.953% 6.978% 1.412% 120/120
W 6.563% 6.369% 1.392% 120/120
H 5.447% 5.495% 2.330% 120/120
B 4.715% 4.749% 0.577% 120/120
C 4.301% 4.487% 0.959% 120/120
F 3.991% 4.074% 0.606% 120/120
M 3.722% 4.027% 0.978% 120/120
P 3.510% 3.453% 0.762% 120/120
D 2.988% 2.997% 0.473% 120/120
L 2.552% 2.575% 0.573% 120/120
R 2.274% 2.276% 0.407% 120/120
E 2.257% 2.230% 0.397% 120/120
N 2.241% 2.242% 0.568% 120/120
G 1.898% 1.911% 0.486% 120/120
U 1.055% 1.074% 0.226% 120/120
V 0.743% 0.776% 0.274% 120/120
Y 0.720% 1.105% 1.050% 120/120
K 0.570% 0.620% 0.308% 120/120
J 0.365% 0.436% 0.274% 120/120
Q 0.208% 0.227% 0.132% 120/120
X 0.020% 0.034% 0.047% 97/120
Z 0.014% 0.033% 0.072% 110/120
Letter Mean Average Std. deviation Book occurrence
E 20.660% 20.402% 1.523% 120/120
S 12.826% 12.395% 1.578% 120/120
D 10.927% 10.918% 2.165% 120/120
T 9.253% 9.367% 1.940% 120/120
N 8.343% 8.324% 0.991% 120/120
Y 6.003% 5.829% 0.830% 120/120
R 5.838% 5.827% 0.908% 120/120
F 4.449% 4.425% 1.033% 120/120
O 4.430% 4.299% 0.930% 120/120
H 2.999% 3.008% 0.559% 120/120
G 2.978% 3.029% 0.649% 120/120
A 2.797% 2.736% 0.679% 120/120
L 2.755% 2.812% 0.592% 120/120
M 1.747% 1.723% 0.371% 120/120
W 0.921% 0.955% 0.378% 120/120
K 0.891% 0.943% 0.423% 120/120
I 0.818% 1.129% 1.046% 119/120
P 0.529% 0.599% 0.281% 120/120
U 0.382% 0.651% 0.734% 119/120
C 0.251% 0.333% 0.451% 120/120
X 0.096% 0.114% 0.097% 120/120
B 0.073% 0.096% 0.088% 120/120
V 0.019% 0.044% 0.105% 108/120
Z 0.011% 0.023% 0.033% 103/120
J 0.003% 0.020% 0.079% 79/120
Q 0.000% 0.000% 0.000% 0/120

Interestingly, there is not a single word in the corpus that ends with a Q. Although the word English starts with an E, and E is clearly the most common letter, only about 2% of the words in the corpus start with an E.

Diacriticals

The above calculations were done after removing diacriticals (acute, circumflex, diaeresis (umlaut), grave, macron, tilde, dot below, caron, etc.).

When diacriticals are kept, it hardly changes the number of words:

Diacriticals removed Diacriticals kept
Total number of words 13˙777˙933 13˙777˙933
Number of different words, summed over the books 1˙054˙202 1˙054˙963
# Words occurring at least twice 91˙793 92˙138
# Words occurring in at least 2 books 65˙357 65˙302
# Words occurring in at least half of the books 4010 4009

Not surprisingly, the number of different letters increases dramatically, from 54 to 139. However, the only additional character with a median (the "Mean" column) higher than zero, meaning that it occurs in more than half of the books, is É (E-acute).

Below is a normalized table with and without diacriticals. It is obvious that the differences fall well within the margin of error, and the above statistics can be used as-is.

Diacriticals removed (first table) and Diacriticals kept (second table)
Letter Mean Average Std. deviation Book occurrence
E 12.596% 12.510% 0.483% 120/120
T 9.222% 9.195% 0.513% 120/120
A 8.014% 8.023% 0.451% 120/120
O 7.557% 7.592% 0.389% 120/120
N 7.008% 6.983% 0.371% 120/120
I 6.954% 7.021% 0.548% 120/120
S 6.523% 6.521% 0.458% 120/120
R 6.119% 6.087% 0.503% 120/120
H 6.108% 5.953% 0.759% 120/120
L 4.070% 4.109% 0.363% 120/120
D 4.028% 4.104% 0.477% 120/120
U 2.792% 2.810% 0.272% 120/120
C 2.740% 2.773% 0.583% 120/120
M 2.486% 2.470% 0.264% 120/120
F 2.433% 2.414% 0.303% 120/120
W 2.090% 2.069% 0.408% 120/120
G 1.994% 2.003% 0.221% 120/120
P 1.846% 1.840% 0.292% 120/120
Y 1.752% 1.800% 0.357% 120/120
B 1.544% 1.556% 0.161% 120/120
V 0.977% 0.965% 0.120% 120/120
K 0.688% 0.697% 0.231% 120/120
X 0.173% 0.183% 0.088% 120/120
J 0.120% 0.140% 0.072% 120/120
Q 0.103% 0.108% 0.039% 120/120
Z 0.063% 0.073% 0.046% 120/120
Letter Mean Average Std. deviation Book occurrence
E 12.5940% 12.5046% 0.4799% 120/120
T 9.2242% 9.1960% 0.5129% 120/120
A 8.0157% 8.0214% 0.4521% 120/120
O 7.5577% 7.5906% 0.3891% 120/120
N 7.0002% 6.9823% 0.3713% 120/120
I 6.9556% 7.0205% 0.5478% 120/120
S 6.5245% 6.5220% 0.4581% 120/120
R 6.1207% 6.0882% 0.5034% 120/120
H 6.1092% 5.9538% 0.7595% 120/120
L 4.0710% 4.1096% 0.3630% 120/120
D 4.0285% 4.1049% 0.4774% 120/120
U 2.7926% 2.8082% 0.2727% 120/120
C 2.7347% 2.7728% 0.5826% 120/120
M 2.4868% 2.4701% 0.2644% 120/120
F 2.4334% 2.4145% 0.3033% 120/120
W 2.0909% 2.0689% 0.4078% 120/120
G 1.9939% 2.0031% 0.2209% 120/120
P 1.8460% 1.8406% 0.2925% 120/120
Y 1.7519% 1.8000% 0.3575% 120/120
B 1.5448% 1.5561% 0.1608% 120/120
V 0.9769% 0.9650% 0.1200% 120/120
K 0.6884% 0.6969% 0.2314% 120/120
X 0.1729% 0.1834% 0.0884% 120/120
J 0.1197% 0.1401% 0.0722% 120/120
Q 0.1027% 0.1079% 0.0391% 120/120
Z 0.0629% 0.0732% 0.0463% 120/120
E Acute 0.0003% 0.0053% 0.0153% 65/120

Future Interest

I would be interested to see if there is a distinction between British and American English. Also, I may do the same analysis for Dutch, French and German.