Open main menu

CRM114 (full name: "The CRM114 Discriminator") is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.


Origin of the nameEdit

The name comes from the CRM-114 Discriminator in the Stanley Kubrick movie Dr. Strangelove - a piece of radio equipment designed to filter out messages lacking a specific code-prefix.


While others have done statistical Bayesian spam filtering based upon the frequency of single word occurrences in email, CRM114 achieves a higher rate of spam recognition through creating hits based upon phrases up to five words in length. These phrases are used to form a Markov Random Field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis[1] gave a 99.87% accuracy;[2] Holden [3] and TREC 2005 and 2006.[4][5] gave results of better than 99%, with significant variation depending on the particular corpus.

CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant on KNN (K-nearest neighbor algorithm) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, a SVM, by mutual compressibility as calculated by a modified LZ77 algorithm, and other more experimental classifiers. The actual features matched are based on a generalization of skip-grams.

The CRM114 algorithms are multi-lingual (compatible with UTF-8 encodings) and null-safe. A voting set of CRM114 classifiers have been demonstrated to detect confidential versus non-confidential documents written in Japanese at better than 99.9% detection rate and a 5.3% false alarm rate.[6]

CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.

At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl; although it is Turing complete it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of this is because the crm114 language syntax is not positional, but declensional. As a programming language, it may be used for many other applications aside from detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on absolutely identical strings matching to function correctly.

CRM114 has been applied to a number of other applications, including detection of bots on Twitter and Yahoo [7] [8], as well as the first-level filter in the US Dept of Transportation's vehicle defect detection system. [9]. It has also been used as a predictive method for classifying fault-prone software modules [10].

See alsoEdit


External linksEdit