Markovian discrimination

Within the probability theory Markov model, Markovian discrimination in spam filtering is a method used in CRM114 and other spam filters to model the statistical behaviours of spam and non-spam more accurately than can be done by using simple Bayesian methods.[1] It is based on the theory of Markov chains by Andrey Markov. Whilst a simple Bayesian model of written text contains only the dictionary of legal words and their relative probabilities, a Markovian model adds to this the relative transition probabilities, i.e. the probabilities of the next word given the current word. Effectively a Bayesian filter works on single words alone, while a Markovian filter works on phrases or entire sentences.

Comparison with Bayesian Model and Spam Filtering Application edit

In contrast to a simple Bayesian model, which represents a message simply as a bag-of-words and examines a dictionary of legitimate words along with their relative probabilities, a Markovian model enhances this approach by incorporating relative transition probabilities. These transition probabilities predict the next word based on the preceding word. The fundamental distinction between Bayesian and Markovian filters lies in their scope: while a Bayesian filter examines individual words in isolation, a Markovian filter extends its analysis to phrases or complete sentences. Furthermore, Markovian filters used in spam filtering are not limited to the word level; they can also process letters or partial word tokens. Weights can then be assigned to these tokens based on their probability of appearing in spam or legitimate content, further enhancing the accuracy of the filter.[2]

Markov Models edit

There are two variants of Markov models: the visible Markov model and the hidden Markov model (HMM). These differ in reltaion to the concept of the current word; with a visible Markov model, the current word is considered to contain full information about previous states of the language model, whereas a hidden Markov model obscures the relationship, with the current word being only probabilistically linked to the previously read words.[3]

Application of Hidden Markov Models edit

To illustrate, in a visible Markov model, the word "the" should predict the subsequent word with a high degree of accuracy. In contrast, a hidden Markov model implies that the entire preceding text indicates the actual state and foresees the subsequent words, but it does not provide an absolute guarantee of that state or prediction. As spam filtering typically encounters the latter scenario, hidden Markov models are predominantly employed. Moreover, due to storage constraints, a specific type of hidden Markov model, known as a Markov random field, proves particularly suitable, typically with a 'sliding window' or clique size ranging between four and six tokens.[4]

See also edit

References edit

  1. ^ Chhabra, S., Yerazunis, W. S., and Siefkes, C. 2004. Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas. In Proceedings of the Fourth IEEE international Conference on Data Mining (November 1–04, 2004). ICDM. IEEE Computer Society, Washington, DC, Mazharul
  2. ^ Duarte, J. P., Leite, V. C., & Frutuoso e Melo, P. F. F. (January 2013). A comparison between Markovian models and Bayesian networks for treating some dependent events in reliability evaluations. 2013 International Nuclear Atlantic Conference, Recife, Brazil. [1]
  3. ^ Jurafsky, Daniel & Martin, James H. Speech and Language Processing. 2023. Stanford University. [2]
  4. ^ Yerazunis, W. S. The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It. Mitsubishi Electric Research Laboratories, TR2004-091, January 2004. [3]