User:Thepigdog/Inductive probabilities

Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning, and gives the mathematical basis for learning and the perception of patterns.

It differs from classical and frequentist probability in that it does not rely on capturing the frequency of previous events in estimating probabilities, or on other assumptions of fairness.

However, inductive probability turns out to be strongly related to Bayesian probability, which in turn is related to classical and frequentist probability through Bayes' theorem.

Probability definition

Probability is defined on sets of outcomes in a sample space. Sets and Boolean conditions are dual,

 A \land B \iff A \cap B
 A \lor B \iff A \cup B
 \neg A \iff \overline{A}
 A \to B \iff A \subseteq B

A definition of the probability of a Boolean expression is needed. Given a condition C, its probability is the probability of the set of outcomes for which C is true,

 P(C) = P(\{x \in \Omega : C(x)\})

Conditional probability may be identified with implication. The conditional probability P(A|B) is the proportion of outcomes in B that are also in A.

Axioms in words,

  • The probabilities of mutually exclusive and exhaustive conditions sum to 1.
  • If A and B are independent,
 P(A \land B) = P(A) \cdot P(B)
  • Independent means,
 P(A \mid B) = P(A)

Even more simply, probability is a number associated with a Boolean condition, such that,

 P(\text{true}) = 1
 P(A \lor B) = P(A) + P(B), \text{ if } A \land B = \text{false}

Bayes' theorem,

 P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}

is equivalent to,

 P(A \mid B) \, P(B) = P(B \mid A) \, P(A)
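
As a concrete illustration, the two forms can be checked numerically. The Python sketch below uses a made-up joint distribution purely for illustration,

  # Toy check of Bayes' theorem: P(A|B) * P(B) and P(B|A) * P(A) both equal P(A and B).
  # The joint probabilities below are illustrative values only.
  joint = {
      (True, True): 0.12,    # P(A and B)
      (True, False): 0.28,   # P(A and not B)
      (False, True): 0.18,   # P(not A and B)
      (False, False): 0.42,  # P(not A and not B)
  }

  p_a = sum(p for (a, b), p in joint.items() if a)   # P(A) = 0.40
  p_b = sum(p for (a, b), p in joint.items() if b)   # P(B) = 0.30
  p_a_and_b = joint[(True, True)]                    # P(A and B) = 0.12

  p_a_given_b = p_a_and_b / p_b                      # P(A|B) = 0.40
  p_b_given_a = p_a_and_b / p_a                      # P(B|A) = 0.30

  print(round(p_a_given_b * p_b, 10), round(p_b_given_a * p_a, 10))  # both 0.12
  print(round(p_b_given_a * p_a / p_b, 10))          # 0.4, Bayes' theorem for P(A|B)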

Principle of indifference does not work

Indifference only works if we have physical knowledge about the source of the events. Frequentist and classical probability rely on the principle of indifference, which makes them dependent on information beyond the Boolean conditions known to the intelligent agent.

If x is a member of a set and indifference is used to say that the probability of a variable having each value is the same, then the condition may be re-encoded to give a different set, to which indifference may also be applied. This would give different probabilities, which is a contradiction.

The principle of indifference is therefore inconsistent when applied to sets, unless other information about the environment is known, and so may not be used for inductive inference.
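
A hypothetical illustration of the re-encoding problem, with the colour example made up for this purpose,

  # Hypothetical example: the agent knows only that x is a colour.
  # Encoding 1: x is in {red, not-red}      -> indifference gives P(red) = 1/2.
  # Encoding 2: x is in {red, green, blue}  -> indifference gives P(red) = 1/3.
  encoding_1 = ["red", "not-red"]
  encoding_2 = ["red", "green", "blue"]

  print(1 / len(encoding_1))  # 0.5
  print(1 / len(encoding_2))  # 0.333..., a different probability for the same condition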

Boolean induction

Boolean induction starts with a target condition C, which is a Boolean expression. Induction is of the form,

 T \to C

that is, a theory T implies the statement C. As the theory T is simpler than C, induction says that there is a probability that the theory T is implied by C.

The method used is based on Solomonoff's theory of inductive inference.

Results

The expression obtained for the probability that a theory t is implied by the condition C is,

 P(t \mid C) = \frac{2^{-L(t)}}{\sum_{t_i \in T} 2^{-L(t_i)}}

Here L(t) is the length in bits of the encoding of the theory t, T is the set of theories that imply C, and P(random | C) is the probability that C has no predictive theory.

A second version of the result, derived from the first by eliminating theories without predictive power, is,

 P(t \mid C) = \frac{2^{-L(t)}}{F}
 P(\text{random} \mid C) = \frac{2^{-L(C)}}{F}

where,

 F = 2^{-L(C)} + \sum_{t_i \in T,\, L(t_i) < L(C)} 2^{-L(t_i)}

Implication and condition probability

The following theorem will be required in the derivation of Boolean induction. Implication is related to conditional probability by the following law,

 (A \to B) \iff (P(B \mid A) = 1)

Derivation,

 A \to B
 \iff \{x : A(x)\} \subseteq \{x : B(x)\}
 \iff \{x : A(x)\} \cap \{x : B(x)\} = \{x : A(x)\}
 \iff P(A \land B) = P(A)
 \iff \frac{P(A \land B)}{P(A)} = 1
 \iff P(B \mid A) = 1

Probability of a bit string

The principle of indifference is not valid in the real world. Dice may be loaded, and coins not fair. Knowledge of the physical world must be used to determine if the coin is fair.

However, in the mathematical world indifference may be used. One bit of information has equal probability of being 1 or 0. Then each bit string has a probability of occurring,

 P(a) = 2^{-L(a)}

where L(a) is the length of the bit string a.
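
A small Python sketch of this assignment (the example strings are arbitrary),

  from itertools import product

  # Probability of a bit string under indifference over single bits:
  # each bit is 0 or 1 with probability 1/2, so P(a) = 2 ** -L(a).
  def bit_string_probability(a: str) -> float:
      return 2.0 ** -len(a)

  print(bit_string_probability("1011"))  # 0.0625

  # Sanity check: the probabilities of all strings of a fixed length sum to 1.
  n = 4
  total = sum(bit_string_probability("".join(bits)) for bits in product("01", repeat=n))
  print(total)  # 1.0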

The parser compiler for mathematics

The parser compiler for mathematics, represented by R, is a function that takes a bit string and converts it into a logical condition.

The way the bit string is encoded determines the probabilities assigned to Boolean expressions. The encoding must be close to optimal to give the best estimate. Other encodings could be considered, but for the purposes of Boolean induction the encoding given here is taken as the best, and no other encoding is necessary. Theoretically this is an approximation.

Each expression is the application of a function to its parameters. The number of bits used to describe a function call is,

  • The size of the encoding of the function id.
  • The sum of the sizes of the expressions for each parameter.

A constant is encoded as a function with no parameters.

The function id is encoded with a prefix code, based on the number of times the function id is used in the expression.
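
As an illustration only, the sketch below assumes a hypothetical prefix-code table and a made-up expression; it is not the encoding defined by R,

  # A hypothetical prefix code for function ids (shorter codes for ids used more
  # often); the code table and the expression are made up for illustration.
  prefix_code = {"succ": "0", "zero": "10", "add": "11"}

  def encode(expression):
      # An expression is (function_id, [parameter expressions]);
      # a constant is a function call with no parameters.
      function_id, parameters = expression
      return prefix_code[function_id] + "".join(encode(p) for p in parameters)

  # succ(succ(zero)), i.e. the number 2 in a successor encoding.
  expression = ("succ", [("succ", [("zero", [])])])
  bits = encode(expression)
  print(bits, len(bits), "bits")  # "0" + "0" + "10" -> "0010", 4 bits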

Distribution of natural numbers

No explicit representation of natural numbers is given. However natural numbers may be constructed by applying the successor function to 0, and then applying other arithmetic functions. A distribution of natural numbers is implied by this, based on the complexity of each number.
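
The sketch below illustrates how such a distribution could arise, continuing the toy successor-only encoding above with an assumed fixed cost per symbol (the encoding and costs are illustrative, not taken from the text),

  # Toy illustration (assumed encoding): the number n is written as n applications
  # of succ to zero, with a fixed cost per symbol.
  BITS_PER_SYMBOL = 2  # assumed cost of one function id

  def description_length(n: int) -> int:
      return (n + 1) * BITS_PER_SYMBOL  # one symbol for zero, n for succ

  def prior_weight(n: int) -> float:
      return 2.0 ** -description_length(n)

  for n in range(5):
      print(n, prior_weight(n))
  # Simpler (smaller) numbers get more weight; a richer language with addition and
  # multiplication would give short descriptions, and so more weight, to some large
  # numbers as well.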

Model for constructing worlds

A model of how worlds are constructed is used in determining the probabilities of theories,

  • A random bit string is selected.
  • A condition is constructed from the bit string, using the mathematical parser.
  • A world is constructed that is consistent with the condition.

If w is the bit string, then the world is created such that R(w) is true. An intelligent agent has some facts about the world, represented by the bit string c, which gives the condition,

 C = R(c)

The set of bit strings consistent with any condition x is  .

 

A theory is a simpler condition that explains (or implies) C. The set of all such theories is called T,

 T = \{t : t \to C\}

Applying Bayes' theorem

The next step is to apply the extended form of Bayes' theorem,

 P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_j P(B \mid A_j) \, P(A_j)}

where,

 A_i is the event that the world was constructed from a bit string n with R(n) = t_i, for a theory t_i in T.
 B is the event that the condition C holds for the world.

To apply Bayes' theorem the following must hold,

  • \{A_i\} is a partition of the event space.

For \{A_i\} to be a partition, no bit string n may belong to two theories. To prove this, assume it can and derive a contradiction,

 n \in A_i \land n \in A_j \text{ with } t_i \ne t_j
 \implies R(n) = t_i \land R(n) = t_j
 \implies t_i = t_j, \text{ a contradiction}

Secondly, prove that T includes all outcomes consistent with the condition. As all theories consistent with C are included, R(n) must be in this set for every bit string n consistent with C.

So Bayes' theorem may be applied as specified, giving,

 P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_{t_j \in T} P(B \mid A_j) \, P(A_j)}

Using the implication and conditional probability law, the definition of T implies,

 P(B \mid A_i) = 1

The probability of each theory in T is given by,

 P(A_i) = 2^{-L(t_i)}

so,

 P(A_i \mid B) = \frac{2^{-L(t_i)}}{\sum_{t_j \in T} 2^{-L(t_j)}}

Finally the probabilities of the events may be identified with the probabilities of the conditions which the outcomes in each event satisfy,

 P(A_i \mid B) = P(t_i \mid C)

giving,

 P(t \mid C) = \frac{2^{-L(t)}}{\sum_{t_i \in T} 2^{-L(t_i)}}

This is the probability of the theory t after observing that the condition C holds.
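
As an illustration of this kind of normalisation, the sketch below uses a hypothetical set of theories with made-up encoding lengths,

  # Sketch: weighting a hypothetical set of theories, all assumed to imply the
  # observed condition C, by 2**-L(t) and normalising. The lengths are made up.
  theory_lengths = {"t1": 5, "t2": 7, "t3": 12}  # L(t) in bits

  weights = {t: 2.0 ** -length for t, length in theory_lengths.items()}
  total = sum(weights.values())
  posterior = {t: weight / total for t, weight in weights.items()}

  print(posterior)  # the shortest (simplest) theory receives most of the probability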

Removing theories without predictive power

Theories that are less probable than the condition C have no predictive power. Separate them out, giving,

 P(t \mid C) = \frac{2^{-L(t)}}{\sum_{t_i \in T,\, L(t_i) < L(C)} 2^{-L(t_i)} + \sum_{t_i \in T,\, L(t_i) \ge L(C)} 2^{-L(t_i)}}

The probability of the theories without predictive power on C is the same as the probability of C. So,

 \sum_{t_i \in T,\, L(t_i) \ge L(C)} 2^{-L(t_i)} = P(C)

So the probability of a predictive theory t is,

 P(t \mid C) = \frac{2^{-L(t)}}{P(C) + \sum_{t_i \in T,\, L(t_i) < L(C)} 2^{-L(t_i)}}

and the probability of no prediction for C, written as P(random | C),

 P(\text{random} \mid C) = \frac{P(C)}{P(C) + \sum_{t_i \in T,\, L(t_i) < L(C)} 2^{-L(t_i)}}

The probability of a condition was given as,

 P(C) = 2^{-L(C)}

Bit strings for theories that are more complex than the bit string given to the agent as input have no predictive power. Their probabilities are better included in the random case. To implement this a new definition is given as F in,

 F = 2^{-L(C)} + \sum_{t_i \in T,\, L(t_i) < L(C)} 2^{-L(t_i)}

Using F, an improved version of the inductive probabilities is,

 P(t \mid C) = \frac{2^{-L(t)}}{F}
 P(\text{random} \mid C) = \frac{2^{-L(C)}}{F}

Intelligent agent

An intelligent agent has a set of actions A available. An agent also has a goal condition G. The probability of G should be maximized by choosing actions from A.

  • Calculate the probability of achieving the goal in the next step, or the step after.
 

????

 
 

where,

 


 
 
 
 

Universal prior

Now apply Boolean induction again with C as  ,

 

Introduction

Classical probability starts by considering a fair coin. Inductive probability starts by considering an unknown coin; whether it is fair or not is unknown. The coin is considered as a series of bits arriving from an unknown but identifiable source.

If the coin is fair then the probability of any one series of bits is,

 P(x) = 2^{-n}

where n is the length of the sequence of bits represented by x.

Writing L(x) as the length of the message x gives,

 P(x) = 2^{-L(x)}

This relates the amount of information to the probability of the message. A simpler theory may be described with less data, and so is more probable. Occam's razor states that the simplest theory that fits all the facts is the most likely.

Patterns in the data

If all the bits are 1, then people infer that there is a bias in the coin, and that the next bit is more likely to be 1 as well. This is described as learning from, or detecting a pattern in, the data.

Such a pattern may be represented by a computer program. A short computer program may be written that produces a series of bits which are all 1. If the length of the program K is k bits then its prior probability is,

 P(K) = 2^{-k}

The length of the shortest program that represents the string of bits is called the Kolmogorov complexity.

Kolmogorov complexity is not computable. This is related to the halting problem. When searching for the shortest program, some programs may go into an infinite loop.
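
A common practical workaround, sketched below, is to use a general-purpose compressor as a crude upper-bound style proxy for Kolmogorov complexity; this is an illustration only, not the quantity defined above,

  import random
  import zlib

  # Kolmogorov complexity cannot be computed, but a compressed length gives a
  # rough upper-bound style proxy for it (illustration only).
  def compressed_length_bits(s: str) -> int:
      return 8 * len(zlib.compress(s.encode("ascii")))

  patterned = "1" * 1000                                     # highly regular
  random.seed(0)
  noisy = "".join(random.choice("01") for _ in range(1000))  # no obvious pattern

  print(compressed_length_bits(patterned))  # small: the pattern is easy to describe
  print(compressed_length_bits(noisy))      # much larger: little structure to exploit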

Considering all theories

The Greek philosopher Epicurus is quoted as saying "If more than one theory is consistent with the observations, keep all theories".[1]

As in a crime novel all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits.

Programs that are already longer than n have no predictive power. The raw (or prior) probability that the pattern of bits is random (has no pattern) is 2^{-n}.

Each program that produces the sequence of bits, but is shorter than n, is a theory/pattern about the bits, with a probability of 2^{-k} where k is the length of the program.

The probability of receiving a sequence of bits y after receiving a series of bits x is then the conditional probability of receiving y given x, which is the probability of x with y appended, divided by the probability of x. [2] [3] [4]
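
A minimal sketch of this kind of prediction, assuming a small made-up set of pattern generators with illustrative program lengths,

  # Sketch of prediction by conditioning: P(next | x) = P(x + next) / P(x), where
  # P mixes a few hypothetical pattern generators, each weighted by 2**-k for an
  # assumed program length k, plus a 2**-len(x) term for "no pattern".
  def all_ones(n):    return "1" * n
  def all_zeros(n):   return "0" * n
  def alternating(n): return ("10" * n)[:n]

  programs = [(all_ones, 6), (all_zeros, 6), (alternating, 8)]  # lengths are illustrative

  def p_prefix(x: str) -> float:
      pattern_mass = sum(2.0 ** -k for generate, k in programs if generate(len(x)) == x)
      return pattern_mass + 2.0 ** -len(x)

  def p_next_bit(x: str, bit: str) -> float:
      return p_prefix(x + bit) / p_prefix(x)

  x = "11111111"
  print(p_next_bit(x, "1"))  # 0.9: the all-ones pattern dominates
  print(p_next_bit(x, "0"))  # 0.1: only the "no pattern" term supports a 0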

Universal priors

The programming language affects the predictions of the next bit in the string. The language acts as a prior probability. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be factorized.

Using the Kolmogorov complexity gives an unbiased estimate (a universal prior) of the prior probability of a number. As a thought experiment an intelligent agent may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. Then there appears to be no rational basis for preferring one function over another. A universal prior ensures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant.

So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, whether a natural language or any other, the language has encoded in it our prior expectations. So some reliance on prior probabilities is inevitable.

A problem arises where an intelligent agent's prior expectations interact with the environment to form a self-reinforcing feedback loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem.

Minimum description/message length

The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the Minimum message length[5] and Minimum description length[6] methods.

At first sight Bayes' theorem appears different from the minimum message/description length principle. On closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities: what is the probability that event B happens if firstly event A happens?

 P(A \land B) = P(A) \, P(B \mid A)

Becomes, in terms of message length L,

 L(A \land B) = L(A) + L(B \mid A)

What this means is that in describing an event, if all the information is given describing the event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of A is given, along with the information describing B given A, then all the information describing A and B has been given.[7] [8]
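
A small numerical illustration, with made-up message lengths,

  import math

  # Message-length form of Bayes' theorem with assumed lengths: lengths add where
  # probabilities multiply, because L = -log2(P).
  L_A = 3.0          # bits to state that A occurred (assumed)
  L_B_given_A = 2.0  # bits to state B once A is known (assumed)

  P_A_and_B = (2.0 ** -L_A) * (2.0 ** -L_B_given_A)
  L_A_and_B = -math.log2(P_A_and_B)

  print(L_A_and_B)  # 5.0 bits = L(A) + L(B|A)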

Overfitting

Overfitting occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means more probable.

However the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms that does not fit as well, but needs less information to describe the polynomial.
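
A rough sketch of this trade-off, assuming synthetic data, an arbitrary fixed cost per polynomial coefficient and a Gaussian-style residual code (none of which are taken from the text),

  import numpy as np

  # Rough MDL-style comparison of polynomial fits: total bits = bits for the
  # coefficients plus a Gaussian-style code for the residuals.
  rng = np.random.default_rng(0)
  x = np.linspace(0.0, 1.0, 30)
  y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)  # a line plus noise

  BITS_PER_COEFFICIENT = 32.0  # assumed cost of describing one coefficient

  def total_description_bits(degree: int) -> float:
      coefficients = np.polyfit(x, y, degree)
      residuals = y - np.polyval(coefficients, x)
      variance = max(float(np.mean(residuals ** 2)), 1e-12)
      residual_bits = 0.5 * x.size * np.log2(2.0 * np.pi * np.e * variance)
      model_bits = (degree + 1) * BITS_PER_COEFFICIENT
      return model_bits + residual_bits

  for degree in (1, 3, 7):
      print(degree, round(total_description_bits(degree), 1))
  # The degree-7 fit tracks the noise slightly better, but its extra coefficients
  # cost more bits than the residual code saves, so its total is the largest.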

Universal artificial intelligence

The theory of universal artificial intelligence applies decision theory to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence. [9]

It is a fundamental theory of intelligence, which optimizes the agent's behavior in,

  • Exploring the environment; performing actions to get responses that broaden the agent's knowledge.
  • Competing or co-operating with another agent; games.
  • Balancing short and long term rewards.

In general no agent will always provide the best actions in all situations. A particular choice made by an agent may be wrong, and the environment may provide no way for the agent to recover from an initial bad choice. However the agent is Pareto optimal in the sense that no other agent will do better than this agent in this environment, without doing worse in another environment. No other agent may, in this sense, be said to be better.

At present the theory is limited by incomputability (the halting problem). Approximations may be used to avoid this. Processing speed and combinatorial explosion remain the primary limiting factors for artificial intelligence.

Derivation of inductive probability

Make a list of all the shortest programs K_i that each produce a distinct infinite string of bits, and satisfy the relation,

 T_n(R(K_i)) = x

where,

 R(K_i) is the result of running the program K_i.
 T_n(s) truncates the string s after n bits.

The problem is to calculate the probability that the source s is produced by the program K_i, given that the truncated source after n bits is x. This is represented by the conditional probability,

 P(s = R(K_i) \mid T_n(s) = x)

Using the extended form of Bayes' theorem,

 P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_j P(B \mid A_j) \, P(A_j)}

where,

 A_i is the event that the source is produced by the program K_i, so that s = R(K_i).
 B is the event that the first n bits of the source are x, so that T_n(s) = x.

The extended form relies on the law of total probability. This means that the A_i must be distinct possibilities, which is given by the condition that each K_i produces a different infinite string. Also one of the conditions A_i must be true. This must be true, as in the limit as n tends to infinity, there is always at least one program that produces x.

Then using the extended form and substituting for A_i and B gives,

 P(s = R(K_i) \mid T_n(s) = x) = \frac{P(T_n(s) = x \mid s = R(K_i)) \, P(s = R(K_i))}{\sum_j P(T_n(s) = x \mid s = R(K_j)) \, P(s = R(K_j))}

As the K_i are chosen so that T_n(R(K_i)) = x, then,

 P(T_n(s) = x \mid s = R(K_i)) = 1

The a-priori probability of the string being produced from the program, given no information about the string, is based on the size of the program,

 P(s = R(K_i)) = 2^{-L(K_i)}

giving,

 P(s = R(K_i) \mid T_n(s) = x) = \frac{2^{-L(K_i)}}{\sum_j 2^{-L(K_j)}}

Programs that are the same length as, or longer than, the length of x provide no predictive power. Separate them out giving,

 P(s = R(K_i) \mid T_n(s) = x) = \frac{2^{-L(K_i)}}{\sum_{j : L(K_j) < n} 2^{-L(K_j)} + \sum_{j : L(K_j) \ge n} 2^{-L(K_j)}}

Then identify the two probabilities as,

Probability that x has a pattern = \sum_{j : L(K_j) < n} 2^{-L(K_j)}

The opposite of this,

Probability that x is a random set of bits = \sum_{j : L(K_j) \ge n} 2^{-L(K_j)}

But the prior probability that x is a random set of bits is 2^{-n}. So,

 P(s = R(K_i) \mid T_n(s) = x) = \frac{2^{-L(K_i)}}{2^{-n} + \sum_{j : L(K_j) < n} 2^{-L(K_j)}}

The probability that the source is random, or unpredictable, is,

 P(\text{random} \mid T_n(s) = x) = \frac{2^{-n}}{2^{-n} + \sum_{j : L(K_j) < n} 2^{-L(K_j)}}

Solving for programs that match a prefix

A program may be a functional program, which is a representation of a recursive mathematical equation. Mathematics makes no distinction between the value and the function; this is the use-mention distinction.

As a result, writing the condition for finding functions that produce a prefix x is not fully informative. In mathematics,

 T_n(K) = x

does not quite say what we mean. Another approach is to use the prefix R to mean running the program or function,

 T_n(R(K)) = x

This allows the use-mention distinction to be made. R(K) is the use and K is the mention. This defines R as a kind of function interpreter.
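
As a loose illustration, Python's eval can stand in for such an interpreter (an assumption for illustration only, not the article's parser),

  # Python's eval standing in for the interpreter R, purely to illustrate the
  # use-mention distinction.
  source = "lambda n: '1' * n"   # the mention: a string that describes a function
  f = eval(source)               # R(source), the use: the function the string denotes

  print(source)   # prints the text of the program
  print(f(5))     # prints 11111, the result of running the function it denotes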

Whichever way is used, the equation represents solving an equation to give a function. In general this is not computable; no general program may be written to perform such a task.

However, for specific canonical forms, solving for a function may be computable.

References

  1. ^ Li, M. and Vitanyi, P., An Introduction to Kolmogorov Complexity and Its Applications, 3rd Edition, Springer Science and Business Media, N.Y., 2008, p 347
  2. ^ Solomonoff, R., "A Preliminary Report on a General Theory of Inductive Inference", Report V-131, Zator Co., Cambridge, Ma. Feb 4, 1960, revision, Nov., 1960.
  3. ^ Solomonoff, R., "A Formal Theory of Inductive Inference, Part I" Information and Control, Vol 7, No. 1 pp 1–22, March 1964.
  4. ^ Solomonoff, R., "A Formal Theory of Inductive Inference, Part II" Information and Control, Vol 7, No. 2 pp 224–254, June 1964.
  5. ^ Wallace, Chris (1968). "An information measure for classification". Computer Journal. 11 (2): 185–194. doi:10.1093/comjnl/11.2.185.
  6. ^ Rissanen, J. (1978). "Modeling by shortest data description". Automatica. 14 (5): 465–471. doi:10.1016/0005-1098(78)90005-5.
  7. ^ Allison, Lloyd. "Minimum Message Length (MML) – LA's MML introduction".
  8. ^ Oliver, J. J.; Baxter, Rohan A. "MML and Bayesianism: Similarities and Differences (Introduction to Minimum Encoding Inference – Part II)".
  9. ^ Hutter, Marcus (1998). Sequential Decisions Based on Algorithmic Probability. Springer. ISBN 3-540-22139-5.
