Bayesian programming

Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.

Edwin T. Jaynes proposed that probability could be considered as an alternative and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book Probability Theory: The Logic of Science^[1] he developed this theory and proposed what he called “the robot,” which was not a physical device, but an inference engine to automate probabilistic reasoning—a kind of Prolog for probability instead of logic. Bayesian programming^[2] is a formal and concrete implementation of this "robot".

Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian programming is more general than Bayesian networks and has an expressive power equivalent to that of probabilistic factor graphs.^[3]

Formalism

A Bayesian program is a means of specifying a family of probability distributions.

The constituent elements of a Bayesian program are presented below:^[4]

{\text{Program}}{\begin{cases}{\text{Description}}{\begin{cases}{\text{Specification}}(\pi ){\begin{cases}{\text{Variables}}\\{\text{Decomposition}}\\{\text{Forms}}\\\end{cases}}\\{\text{Identification (based on }}\delta )\end{cases}}\\{\text{Question}}\end{cases}}

A program is constructed from a description and a question.
A description is constructed using some specification ($\pi$) as given by the programmer and an identification or learning process for the parameters not completely specified by the specification, using a data set ($\delta$).
A specification is constructed from a set of pertinent variables, a decomposition and a set of forms.
Forms are either parametric forms or questions to other Bayesian programs.
A question specifies which probability distribution has to be computed.
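
The nesting described above can be sketched as plain data structures. This is only an illustrative layout: the class and field names below are assumptions chosen to mirror the vocabulary of the formalism, not an API defined by it.

```python
from dataclasses import dataclass, field


@dataclass
class Specification:
    """Preliminary knowledge (pi), written by the programmer."""
    variables: list       # names of the relevant variables
    decomposition: list   # list of (L_k, R_k) pairs of variable-name tuples
    forms: dict           # index k -> parametric form, or a question to another program


@dataclass
class Description:
    """A specification plus the data (delta) used to identify free parameters."""
    specification: Specification
    data: list = field(default_factory=list)


@dataclass
class Question:
    """Which distribution to compute; unlisted variables are the free ones."""
    searched: list
    known: list


@dataclass
class Program:
    description: Description
    question: Question


# Hypothetical two-variable program: P(A) * P(B | A), asking for P(A | B)
spec = Specification(
    variables=["A", "B"],
    decomposition=[(("A",), ()), (("B",), ("A",))],
    forms={},
)
program = Program(Description(spec), Question(searched=["A"], known=["B"]))
```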

Description

The purpose of a description is to specify an effective method of computing a joint probability distribution on a set of variables $\left\{X_{1},X_{2},\cdots ,X_{N}\right\}$ given a set of experimental data $\delta$ and some specification $\pi$ . This joint distribution is denoted as: $P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)$ .^[5]

To specify preliminary knowledge $\pi$ , the programmer must undertake the following:

Define the set of relevant variables $\left\{X_{1},X_{2},\cdots ,X_{N}\right\}$ on which the joint distribution is defined.
Decompose the joint distribution (break it into relevant independent or conditional probabilities).
Define the forms of each of the distributions (e.g., for each variable, one taken from a list of known probability distributions).

Decomposition

Given a partition of $\left\{X_{1},X_{2},\ldots ,X_{N}\right\}$ containing $K$ subsets, $K$ variables $L_{1},\cdots ,L_{K}$ are defined, each corresponding to one of these subsets. Each variable $L_{k}$ is obtained as the conjunction of the variables $\left\{X_{k_{1}},X_{k_{2}},\cdots \right\}$ belonging to the $k^{th}$ subset. Recursive application of Bayes' theorem leads to:

{\begin{aligned}&P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)\\={}&P\left(L_{1}\wedge \cdots \wedge L_{K}\mid \delta \wedge \pi \right)\\={}&P\left(L_{1}\mid \delta \wedge \pi \right)\times P\left(L_{2}\mid L_{1}\wedge \delta \wedge \pi \right)\times \cdots \times P\left(L_{K}\mid L_{K-1}\wedge \cdots \wedge L_{1}\wedge \delta \wedge \pi \right)\end{aligned}}

Conditional independence hypotheses then allow further simplifications. A conditional independence hypothesis for variable $L_{k}$ is defined by choosing some variables $X_{n}$ among those appearing in the conjunction $L_{k-1}\wedge \cdots \wedge L_{2}\wedge L_{1}$, labelling $R_{k}$ as the conjunction of these chosen variables, and setting:

P\left(L_{k}\mid L_{k-1}\wedge \cdots \wedge L_{1}\wedge \delta \wedge \pi \right)=P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)

We then obtain:

{\begin{aligned}&P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)\\={}&P\left(L_{1}\mid \delta \wedge \pi \right)\times P\left(L_{2}\mid R_{2}\wedge \delta \wedge \pi \right)\times \cdots \times P\left(L_{K}\mid R_{K}\wedge \delta \wedge \pi \right)\end{aligned}}

Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the chain rule.

This ensures that each variable appears at most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions.^{[citation needed]}
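
A minimal sketch of how such a decomposition could be checked mechanically. The representation (a list of `(left, right)` pairs of variable-name sets) and the name `is_valid_decomposition` are assumptions made for illustration. The check is slightly stricter than the sentence above: following the construction, it requires every variable to appear exactly once on the left and every $R_{k}$ to use only variables introduced by earlier terms.

```python
def is_valid_decomposition(variables, terms):
    """Check a decomposition given as an ordered list of (L_k, R_k) pairs.

    Hypothetical helper: verifies that the L_k partition the variables and
    that each R_k only contains variables from L_1 ... L_{k-1}.
    """
    seen = set()  # variables already placed on the left of a conditioning bar
    for left, right in terms:
        left, right = set(left), set(right)
        if not left or left & seen:   # empty term, or a variable repeated on the left
            return False
        if not right <= seen:         # conditioning on a variable not yet introduced
            return False
        seen |= left
    return seen == set(variables)     # every variable must appear on the left


# P(A) * P(B | A) * P(C | B): a valid chain decomposition
chain = [({"A"}, set()), ({"B"}, {"A"}), ({"C"}, {"B"})]
# P(A) * P(B | A) * P(B | C): B appears twice on the left, C never does
broken = [({"A"}, set()), ({"B"}, {"A"}), ({"B"}, {"C"})]
```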

Forms

Each distribution $P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)$ appearing in the product is then associated with either a parametric form (i.e., a function $f_{\mu }\left(L_{k}\right)$ ) or a question to another Bayesian program $P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)=P\left(L\mid R\wedge {\widehat {\delta }}\wedge {\widehat {\pi }}\right)$ .

When it is a form $f_{\mu }\left(L_{k}\right)$ , in general, $\mu$ is a vector of parameters that may depend on $R_{k}$ or $\delta$ or both. Learning takes place when some of these parameters are computed using the data set $\delta$ .

An important feature of Bayesian programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. $P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)$ is obtained by some inferences done by another Bayesian program defined by the specifications ${\widehat {\pi }}$ and the data ${\widehat {\delta }}$. This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models.

Question

Given a description (i.e., $P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)$ ), a question is obtained by partitioning $\left\{X_{1},X_{2},\cdots ,X_{N}\right\}$ into three sets: the searched variables, the known variables and the free variables.

The three variables $Searched$, $Known$ and $Free$ are defined as the conjunctions of the variables belonging to these sets.

A question is defined as the set of distributions:

P\left({\text{Searched}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)

made of as many "instantiated questions" as the cardinality of $Known$, each instantiated question being the distribution obtained for one particular value ${\text{known}}$ of $Known$:

P\left({\text{Searched}}\mid {\text{known}}\wedge \delta \wedge \pi \right)
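
To make the count of instantiated questions concrete, the sketch below enumerates every possible value of the known conjunction for a small discrete example. The variable names, their domains and the helper `instantiated_questions` are hypothetical.

```python
from itertools import product


def instantiated_questions(domains, known):
    """Yield one dict {variable: value} per possible value of the Known conjunction."""
    for values in product(*(domains[v] for v in known)):
        yield dict(zip(known, values))


# Hypothetical discrete variables: A and C are binary, B takes three values
domains = {"A": [0, 1], "B": [0, 1, 2], "C": [0, 1]}
questions = list(instantiated_questions(domains, known=["B", "C"]))
```

With $B$ and $C$ known, there are $3\times 2=6$ instantiated questions, each asking for the distribution over the searched variable $A$ for one fixed pair of values.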

Inference

Given the joint distribution $P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)$ , it is always possible to compute any possible question using the following general inference:

{\begin{aligned}&P\left({\text{Searched}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)\\={}&\sum _{\text{Free}}\left[P\left({\text{Searched}}\wedge {\text{Free}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)\right]\\={}&{\frac {\displaystyle \sum _{\text{Free}}\left[P\left({\text{Searched}}\wedge {\text{Free}}\wedge {\text{Known}}\mid \delta \wedge \pi \right)\right]}{\displaystyle P\left({\text{Known}}\mid \delta \wedge \pi \right)}}\\={}&{\frac {\displaystyle \sum _{\text{Free}}\left[P\left({\text{Searched}}\wedge {\text{Free}}\wedge {\text{Known}}\mid \delta \wedge \pi \right)\right]}{\displaystyle \sum _{{\text{Free}}\wedge {\text{Searched}}}\left[P\left({\text{Searched}}\wedge {\text{Free}}\wedge {\text{Known}}\mid \delta \wedge \pi \right)\right]}}\\={}&{\frac {1}{Z}}\times \sum _{\text{Free}}\left[P\left({\text{Searched}}\wedge {\text{Free}}\wedge {\text{Known}}\mid \delta \wedge \pi \right)\right]\end{aligned}}

where the first equality results from the marginalization rule, the second results from Bayes' theorem and the third corresponds to a second application of marginalization. The denominator appears to be a normalization term and can be replaced by a constant $Z$ .

Theoretically, this allows any Bayesian inference problem to be solved. In practice, however, the cost of computing $P\left({\text{Searched}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)$ exhaustively and exactly is too great in almost all cases.
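
For small discrete problems, the general inference rule can be applied literally by enumeration. The sketch below is one possible brute-force version, not a reference implementation: the function `infer`, the toy joint distribution and its probability tables are all assumptions for illustration. Its cost grows exponentially with the number of searched and free variables, which is the practical limitation noted above.

```python
from itertools import product


def infer(joint, domains, searched, known):
    """Compute P(searched | known) by summing the joint over the free variables.

    joint    -- function mapping a dict {variable: value} to a probability
    domains  -- dict {variable: list of possible values}
    searched -- list of searched variable names
    known    -- dict {variable: observed value}, assumed to have non-zero probability
    """
    free = [v for v in domains if v not in searched and v not in known]
    unnormalized = {}
    for s_vals in product(*(domains[v] for v in searched)):
        total = 0.0
        for f_vals in product(*(domains[v] for v in free)):
            assignment = dict(known)
            assignment.update(zip(searched, s_vals))
            assignment.update(zip(free, f_vals))
            total += joint(assignment)   # sum over Free
        unnormalized[s_vals] = total
    z = sum(unnormalized.values())       # normalization constant Z = P(known)
    return {s: p / z for s, p in unnormalized.items()}


# Hypothetical toy joint P(A) * P(B | A) * P(C | B) over three binary variables
P_A = {True: 0.3, False: 0.7}
P_B_GIVEN_A = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
P_C_GIVEN_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.1, False: 0.9}}


def toy_joint(x):
    return P_A[x["A"]] * P_B_GIVEN_A[x["A"]][x["B"]] * P_C_GIVEN_B[x["B"]][x["C"]]


domains = {"A": [True, False], "B": [True, False], "C": [True, False]}
posterior = infer(toy_joint, domains, searched=["A"], known={"C": True})
```

Here $A$ is searched, $C$ is known and $B$ is free; `posterior` maps each value tuple of the searched variables to its probability.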

Replacing the joint distribution by its decomposition we get:

{\begin{aligned}&P\left({\text{Searched}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)\\={}&{\frac {1}{Z}}\sum _{\text{Free}}\left[\prod _{k=1}^{K}\left[P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)\right]\right]\end{aligned}}

which is usually a much simpler expression to compute, as the dimensionality of the problem is considerably reduced by the decomposition into a product of lower dimension distributions.
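
One rough way to see the reduction is to count the table entries needed to store the distributions. The sketch below compares a single table over $N$ binary variables with a chain decomposition $P(X_{1})\times P(X_{2}\mid X_{1})\times \cdots \times P(X_{N}\mid X_{N-1})$. Both the chain structure and the helper `table_size` are assumptions chosen for illustration.

```python
from math import prod


def table_size(left_cards, right_cards):
    """Entries in a table for P(L_k | R_k), given the cardinalities of its variables."""
    return prod(left_cards) * prod(right_cards)


n = 10  # number of binary variables

# One table over the full joint distribution
full_joint_entries = 2 ** n

# Chain decomposition: one prior table plus (n - 1) conditional tables
chain_entries = table_size([2], []) + (n - 1) * table_size([2], [2])
```

For ten binary variables this gives 1024 entries for the full joint table against 38 for the chain, and the gap widens exponentially as $N$ grows.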

Example

Bayesian spam detection

The purpose of Bayesian spam filtering is to eliminate junk e-mails.

The problem is very easy to formulate. E-mails should be classified into one of two categories: non-spam or spam. The only available information to classify the e-mails is their content: a set of words. Using these words without taking the order into account is commonly called a bag of words model.

The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters when the user disagrees with its decision. It will hence adapt to the user's criteria for differentiating between non-spam and spam, and its results will improve as it encounters more and more classified e-mails.

Variables

The variables necessary to write this program are as follows:

$Spam$ : a binary variable, false if the e-mail is not spam and true otherwise.
$W_{0},W_{1},\ldots ,W_{N-1}$ : $N$ binary variables. $W_{n}$ is true if the $n^{th}$ word of the dictionary is present in the text.

These $N+1$ binary variables sum up all the information about an e-mail.
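
As an illustration of this encoding, the sketch below turns the text of an e-mail into the word-presence variables $W_{0},\ldots ,W_{N-1}$. The five-word dictionary, the sample text and the whitespace-based tokenization are assumptions. A real filter would need a more careful tokenizer, for example one that strips punctuation.

```python
def encode_email(text, dictionary):
    """Return the list [W_0, ..., W_{N-1}] of word-presence booleans for one e-mail."""
    words = set(text.lower().split())   # bag of words: order and counts are discarded
    return [word in words for word in dictionary]


# Hypothetical five-word dictionary (N = 5)
dictionary = ["fortune", "next", "programming", "money", "you"]
w = encode_email("You can make money next week", dictionary)
```

Together with the value of $Spam$ supplied by the user, the list `w` gives the $N+1$ binary values describing one classified e-mail.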