Code:TextToWords.py

See Letter Distribution for the use of this script.

#!/usr/bin/env python
# encoding: utf-8
"""
text2words.py

Convert a UTF-8 text file to a list of words.

Apostrophes and quote marks are removed (don't becomes dont).
Diacritical marks are removed by default (ä becomes a); set keepdiacriticals to keep them.
Ligatures are expanded (thus ĳ becomes ij).
Punctuation is removed (a.m. becomes am).
Double spaces are removed (so the final text will never contain two or more spaces in a row).
All spaces are converted into newlines, so that the final text contains a list of words.
All text is changed to upper case.
"""

import sys
import os
import codecs
import unicodedata
import tempfile
import shutil
import pprint

# Specify which characters should be kept, by Unicode category
# (see below for an overview of the categories)
# note: we remove letters that are neither Latin nor Greek
# (such as Arabic, Hebrew or Chinese characters)
keepcats   = ['Ll', 'Lu', 'Lt']
removecats = ['Cf', 'Lm', 'Lo', 'Mn', 'Nd', 'No', 'Pc', 'Pe', 'Pf', 'Pi', 'Ps', 'Sk']
spacecats  = ['Cc', 'Pd', 'Po', 'Sc', 'Sm', 'So', 'Zl', 'Zp', 'Zs']

# Override specific characters
keepchars   = u""
removechars = u"\"'′″."  # in Po, but remove instead of turn into space
spacechars  = u"-"

keepdiacriticals  = False   # should diacritical marks (such as umlauts) be kept or removed?
keepligatures     = False   # False removes ligatures (ĳ becomes ij) and formatting (½ becomes 1⁄2, ² becomes 2)
makeuppercase     = True    # convert characters to upper case?
removedoublespace = True    # series of spaces are converted into a single space.

# It is tempting to retain double spaces at the end of a sentence, and remove
# others. That would allow statistics about the first letter of a sentence (as
# opposed to just the first letter of a word). However, in practice it is not
# feasible to find sentence boundaries reliably, as a dot can be used for an
# abbreviation (s.v.p., a.m., Mr. Jones, L.T. Smith, etc.) and some sentences
# do not end in a dot (such as titles). Ideally, there should be a separate
# abbreviation marker. But since that does not exist (and is not used), this
# would be a fruitless attempt.


def getNormalization():
    global keepdiacriticals, keepligatures
    if keepdiacriticals:
        if keepligatures:
            return 'NFC'
        else:   # not keepligatures
            return 'NFKC'
    else:   # not keepdiacriticals
        if keepligatures:
            return 'NFD'
        else:   # not keepligatures
            return 'NFKD'


def reduceChar(char, space=u"\n"):
    """Decides if a char is kept, turned into a space, or removed (return empty string)"""
    global keepcats, removecats, spacecats, keepchars, removechars, spacechars
    if char in keepchars:
        return char
    if char in removechars:
        return u""
    if char in spacechars:
        return space
    cat = unicodedata.category(char)
    if cat in keepcats:
        keepchars += char  # cache
        return char
    if cat in removecats:
        removechars += char  # cache
        return u""
    if cat in spacecats:
        spacechars += char  # cache
        return space
    raise ValueError("Do not know how to handle char %s (u%04X, category %s)" % (repr(char), ord(char), cat))


def process(filename, destdir=None):
    global makeuppercase, removedoublespace
    tmpfile = tempfile.TemporaryFile()
    src = codecs.open(filename, "rb", "utf-8")
    dst = codecs.getwriter("utf-8")(tmpfile)
    normfactor = getNormalization()
    spaceChar = u"\n"
    lastspace = True  # last character is a space, used to suppress double spaces
    for line in src:
        newline = u""
        line = unicodedata.normalize(normfactor, line)
        if makeuppercase:
            line = line.upper()
        for c in line:
            cn = reduceChar(c, space=spaceChar)
            if cn == spaceChar:
                if (not lastspace) or (not removedoublespace):
                    newline += cn
                lastspace = True
            elif cn != u"":  # regular char
                newline += cn
                lastspace = False
        dst.write(newline)
    src.close()
    # now overwrite source with modified destination file
    dst.seek(0)
    if destdir:
        filename = os.path.join(destdir, os.path.basename(filename))
    src = open(filename, "wb")
    shutil.copyfileobj(dst, src)
    src.close()
    dst.close()


def printletterfreq(letters):
    for c, v in letters.items():
        print(repr(c), ord(c), v)


def dirfiles(dir):
    # we do not visit subdirectories, so this is not a loop.
    # print(os.walk(dir))
    root, dirs, files = next(os.walk(dir))
    filenames = []
    for f in files:
        if f.startswith('.'):
            continue
        filenames.append(os.path.join(dir, f))
    return filenames


if __name__ == '__main__':
    for filename in dirfiles('en-plaintext'):
        size = os.stat(filename).st_size
        print(filename, "(", size // 1000, "kB )")
        process(filename, 'en-words')
    # printletterfreq(letters)

# Unicode categories:                   (space/remove/keep)
# Cc - Other, control                   s    tab, return
# Cf - Other, format                    r    pop directional formatting, right-to-left override
# Cn - Other, not assigned              ?
# Co - Other, private use               ?
# Cs - Other, surrogate                 ?
# Ll - Letter, lowercase                k    abcdefřéîæäçȳøαβåάέγἡὡῶºɑ
# Lm - Letter, modifier                 r
# Lo - Letter, other                    r    Arabic, Hebrew letter
# Lt - Letter, titlecase                k
# Lu - Letter, uppercase                k    ABCDEFÀÃÆÇÈÏÑÖÛÝĀĪŒŪƷΑΓΔΕΗΙΚΛΜΝΟΠΡΣΤΥΦΩἸ
# Mc - Mark, spacing combining          ?
# Me - Mark, enclosing                  ?
# Mn - Mark, non-spacing                r    Latin, Arabic, Hebrew diacritical marks (umlaut, etc.)
# Nd - Number, decimal digit            r    0123456789
# Nl - Number, letter                   ?
# No - Number, other                    r    ₂¹²³¼½¾⅓⅔⅙⅛
# Pc - Punctuation, connector           r    _
# Pd - Punctuation, dash                s/r  –-
# Pe - Punctuation, close               r    )]}
# Pf - Punctuation, final quote         r    ’”»
# Pi - Punctuation, initial quote       r    ‘“«
# Po - Punctuation, other               s/r  !"#%&'*,./:;?@\¡·¿′″, Hebrew punctuation
# Ps - Punctuation, open                r    ([{
# Sc - Symbol, currency                 s    $¢£¤
# Sk - Symbol, modifier                 r    ^`¯´
# Sm - Symbol, math                     s    +<=>|~±×÷
# So - Symbol, other                    s    §©°℥♀♂�
# Zl - Separator, line                  s    line separator
# Zp - Separator, paragraph             s
# Zs - Separator, space                 s    space, no-break space (A0)
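
To check which row of the table a given character falls into, the standard `unicodedata.category` function can be queried directly. The following is only a small sketch: the sample characters and the names `samples` and `cats` are chosen here for illustration and are not part of the script.

```python
import unicodedata

# Hypothetical sample characters, picked only to illustrate the table above:
# a, A, 5, underscore, hyphen, open parenthesis, space, combining diaeresis.
samples = [u"a", u"A", u"5", u"_", u"-", u"(", u" ", u"\u0308"]

# Map each sample to its two-letter Unicode general category.
cats = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    print(repr(ch), cats[ch])
```

Categories marked with a question mark in the table are in none of the three lists, so `reduceChar` would raise a `ValueError` for such a character.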

# modname = globals()['__name__']     # __main__ if it is called directly
# module = sys.modules[modname]       # this module
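
The script selects one of the four Unicode normalization forms (NFC, NFKC, NFD, NFKD) depending on whether diacritical marks and ligatures should be kept. The effect of each form can be compared on a few characters with the standard `unicodedata.normalize` function. This is a minimal sketch; the helper name `forms` is hypothetical and introduced here only for illustration.

```python
import unicodedata

def forms(text):
    """Return text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFKC", "NFD", "NFKD")}

# Samples: a with diaeresis, the ij ligature, and the fraction one half.
for sample in (u"\u00e4", u"\u0133", u"\u00bd"):
    for form, result in sorted(forms(sample).items()):
        # Print code points so that combining marks stay visible.
        print(form, [u"U+%04X" % ord(c) for c in result])
```

The compatibility forms (NFKC, NFKD) expand the ligature and the fraction, while the decomposed forms (NFD, NFKD) split ä into a base letter followed by a combining mark of category Mn, which the script then removes.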