Post date: 05-Mar-2012 00:48:05

Claude Shannon developed a mathematical theory of communication which has been widely used in many fields of electronics ever since (Shannon 1948). In particular, Shannon was concerned with the uncertainty introduced by a noisy communication channel. When a message is sent over a perfect communication channel, the recipient can be confident that they have received a reliable copy of the transmitted message. However, when the channel is noisy, the recipient will be uncertain, and must do their best to guess what message was actually transmitted. Shannon defined a way of measuring this uncertainty and called it entropy.

You may know of the simple game of Hangman, where you must guess an unknown word, one letter at a time. Each wrong guess takes you one step closer to being hung. When a child is first confronted with the game their uncertainty is very high as there are 26 possibilities for each letter position. As their skills improve, they will realize that some letters are more common than others, and their uncertainty will be a little less. Guessing a single letter at random has a 1 in 26 = 3.85% change of being correct. Guessing a single letter known to be from an English text will be less uncertain as the probability of 'e'=12.70%, 't'=9.06%, 'a'=8.17%, and so on down to 'z'=0.07%. This uncertainty can be expressed as entropy = 4.7 in the naïve situation, decreasing to entropy = 4.2 with a knowledge of word frequency in English. Additional knowledge of English words will further reduce uncertainty as letters are guessed correctly in the game. As a child learns English, their uncertainty in the Hangman environment decreases.

Cosma Shalizi built on Shannon's work by developing an optimal way to extract and store information from a string of symbols (Shalizi 2001). Shalizi's CSSR algorithm first examines a large historical sample of symbols generated by some unknown process. The initial analysis looks for small repeated histories (patterns) within the sample and records the probability of the very next character. This is completed for all small histories in the sample up to some maximum size. These probabilities are then condensed into an ɛ-machine, which is the most concise way possible of representing the knowledge gathered about the process.

For example, consider a historical sample involving just two symbols, A and B.


A statistical analysis will show that:

A is followed by another A(33%) or B(66%)

B is always followed by an A(100%)

We can refine this by looking at longer histories:

AA is followed by A(40%) or B(60%).

AB is followed by A(100%).

BA is followed by A(33%) or B(66%).

BB never occurs in the sample.

Finally we condense the results into 3 states:

S1. A and BA both predict A(33%) or B(66%).

S2. AA predicts A(40%) or B(60%).

S3. B predicts A(100%).

This ɛ-machine is a model of the unknown process which generated the original sample. The model may be improved by examining a larger sample or collecting statistics for longer histories. If you were to move around the ɛ-machine diagram, using the calculated probabilities at each state, then you will generate a string that is statistically identical to the original sample. Alternatively, if you begin receiving more data from the original unknown process, you will be able to sync with the ɛ-machine after a couple of symbols and then be in a position to make the best possible prediction for the next symbol.

Might an ɛ-machine be used as the basis for an intelligent device? The ɛ-machine reduces uncertainty at the maximum possible rate by capturing all patterns that have any predictive power. Furthermore it is the smallest possible representation and it continues to operate optimally with noise in the signal. However, the CSSR algorithm used to construct the ɛ-machine comes at a cost.

Modern computers are based on an architecture proposed by Allan Turing (Turing 1936) in response to a problem in Mathematics. This most basic of computers, comprised a simple machine and a tape. The machine was able to move left or right along the tape and was able to read, write or erase a single symbol to each position on the tape, or simply stop. Each action of the machine was completely determined by the symbol most recently read from the tape and the internal state of the machine. Thus, the Turing machine has memory and the ability to act on the memory. The cost of a computation is the amount of memory required and the amount of time required to complete the computation. These costs are referred to as the memory complexity and the time complexity respectively.

The time complexity of constructing an ɛ-machine using Shalizi's CSSR algorithm depends on three things; the total length of the historical sample data (N), the number of symbols in the alphabet (k), and the longest pattern used to generate the statistics (Lmax). Consider a base case with a sample size N=1000, alphabet size k=2, and history size Lmax=5 which for the sake of this example takes 1 second to run on a computer. Doubling each of these variable in turn increases the run time.

To put this into perspective, consider using CSSR to construct an ɛ-machine to learn from data coming from an extremely low resolution video camera. A very simple camera might have only 5x5=25 black/white pixels (no gray-scale). We might gather 1 hour of data at 10 frames per second and gather statistics for just 1 second (10 frames). This translates to N=36,000, k=1025, and Lmax=10 with a run time much much longer than the age of the universe on our base case computer.

An ɛ-machine learns quickly in the sense that it uses all of the available information, but the computational cost of doing so is prohibitive. The computations required cannot keep up with the rate that data is available from the real world. Brains do.

J.F.Traub and A.G.Werschulz developed an elegant approach known as Information-based Complexity (IBC) that studies problems for which the information is partial, contaminated and priced (Traub & Werschulz 1999). Partial information is very common in the real world where measurements provide information about a particular time or locale, but assumptions must be made regarding the unmeasured points. The assumptions are global information about the environment such as smoothness. Contaminated information is also common due to noise, rounding-errors and/or deception. The cost of information is incurred during its collection, and in the computations that follow.

Consider the challenge of weather forecasting, where it is expensive to measure and collate data from even a few points around the globe, and every measurement contains error introduced by the resolution of the instruments. The computations to generate a forecast make assumptions about missing locations and incur further expense.

Computations made with partial or incomplete information, leave some uncertainty in the result. IBC provides a framework to determine the amount of uncertainty remaining, how much information is required to limit the uncertainty, and the optimal amount of information. IBC defines a quantity called the radius of uncertainty which measures the intrinsic uncertainty in a solution due to the available information. The value of information, is it's capacity to reduce the radius of uncertainty.

These notions used by IBC are very helpful. Organisms with very simple nervous systems clearly have very limited capacity to collect and process information. Yet, despite these severe limitations, natural selection has proven that they routinely take actions that are superior to their competition. When resources are severely limited, and the competition is close, it is important to use those resources to their maximum effect. It is important to collect and utilize that information which reduces uncertainty by the maximum amount for the least cost. IBC provides a framework to evaluate the cost and value of information collected in order to reduce uncertainty.

By way of example, the primate visual system is forward facing with each eye sensitive to light arriving within a cone of only about 120° of the full 360°. The visual acuity in the outer peripheral region is very low compared to the very much higher sensitivity and resolution of the central fovea spanning only about 2°. Primates perform many tens of thousands of saccades per day, which are rapid eye movements, that align the fovea with particular targets in the visual field. The primate visual system collects high quality information from only selected regions rather than all of that available.

In sum, identifying historical patterns enables uncertainty in the future to be reduced. The computational cost of identifying all patterns is prohibitive. Limited resources force brains to gather and utilize only that information which gives the most reduction in uncertainty for the least cost.

Shannon,C.E. (1948), A Mathematical Theory of Communication.

Shalizi,C. (2001), Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata.

Turing, A.M. (1936) On Computable Numbers with an application to the Entscheidungsproblem.

Turing, A.M. (1950) Computing Machinery and Intelligence. Mind 49: 433-460.

Traub,J.F. & Werschulz,A.G. (1999), Complexity and Information (Lezioni Lincee). Cambridge University Press.