# How to measure information

If information is to be treated scientifically rather than philosophically, it has to be expressed numerically and quantitatively and must not be interpreted as synonymous with meaning. Much confusion is caused by attempts to identify meaning with the change generated in the receiver. What is actually sent is not a measure of the amount of information transmitted. That amount depends only on what could have been sent, relative to the receiver's prior expectation of the message.

The information content of a message is nevertheless related to the complexity of its structure. An extensive initial lack of knowledge by the receiver gives a high complexity of the message. Structural complexity of the message thus may be used to define the quantity of information contained in the message.

As defined by information theory, the concept of information is merely a measure of the freedom of choice when selecting a message from the available set of possible messages formed by sequences of symbols from a specific repertoire. Information thus refers to certain statistical properties associated with messages that are forwarded in formal communication systems. In reasonably advanced communication systems, this set of possible messages may be formed using aggregates of words, e.g. in the English vocabulary. Furthermore, the system must be designed to convey every possible selection, not only that one selected at the moment of transmission.

The information content can be determined functionally using the difference between the initial uncertainty before receiving a message and the final uncertainty after receiving it. In this way we get a working definition of information as being the amount of uncertainty which has been removed when we get a message. The information content when receiving a message in one phase is:

Initial uncertainty – final uncertainty = Total information

If received in two stages, the information content is defined as the difference between the initial uncertainty and the intermediate uncertainty.

Initial uncertainty – intermediate uncertainty = Initial information

When further information is added, the final information is defined as the difference between the intermediate uncertainty and the final uncertainty.

Intermediate uncertainty – final uncertainty = Final information

Quantitative relationships in such stages may easily be added together. In the above simple equations the terms intermediate uncertainty cancel each other. Total information again equals the difference between the initial and the final uncertainty. In this way the information quantity assigns a value to the content which describes the complexity of the message and which may easily be added. In reality, the quantity of information has no semantic meaning. It is only an index of the degree of unexpectedness of a message carried by a signal.
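Written symbolically, the cancellation is a one-line identity. The labels $U_0$, $U_1$ and $U_2$ for the initial, intermediate and final uncertainty are assumed here for illustration and do not appear in the source:

$$
(U_0 - U_1) + (U_1 - U_2) = U_0 - U_2
$$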

Information may be measured in terms of decisions and its presence can be demonstrated in reply to a question. The question is posed because of lack of data when choosing between certain possibilities; the greater the number of alternatives, the greater the uncertainty. The game of Twenty Questions illustrates how an object is supposed to be identified through answers to questions concerning the object (‘yes’ or ‘no’ questions). The strategy of uncertainty-reduction in the game is easy to recognize in Figure 4.8.

Figure 4.8 Strategy of uncertainty-reduction in the game of Twenty Questions.

Let us start with a situation where we can pose a single question: is the newborn baby a boy? Here the reply is equally likely to be yes or no, and when the reply is given no uncertainty remains. The structural complexity and the information content are therefore the smallest possible. The quantity of information contained in the answer may be defined as one unit of information. In information theory this is more precisely called one ‘bit’ of information. The quantity is derived from the repertoire of the digits 0 and 1 in binary notation, both assigned equal probability and carrying a content of one bit. Each question must divide the field of possibilities equally if one bit of information is to be gained for each reply.

With two questions, one out of four possibilities may be decided; with three questions one out of eight and so on. It is obvious from the examples that the base 2 logarithm of the possible number of answers can be used as a measure of information. Eight possibilities give $\log_2 8 = 3$. Thus we have here three bits of information. With only one possibility there is evidently no uncertainty at all; the amount of information is zero, as the logarithm of 1 is zero. Using binary bits as a basic unit of measure, the degree of informational uncertainty thus can be empirically defined as a function of the number of bits required for its elimination.
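The halving strategy can be sketched in a few lines of Python. This is a minimal illustration, assuming every question splits the remaining field as evenly as possible; the function name `questions_needed` is hypothetical and not from the source.

```python
import math


def questions_needed(alternatives: int) -> int:
    """Count yes/no questions needed when each question halves the field."""
    questions = 0
    remaining = alternatives
    while remaining > 1:
        # Worst case: the answer leaves us with the larger half.
        remaining = math.ceil(remaining / 2)
        questions += 1
    return questions


# Compare the question count with the base 2 logarithm of the alternatives.
for n in (1, 2, 4, 8):
    print(n, questions_needed(n), math.log2(n))
```

For 1, 2, 4 and 8 alternatives the question count equals the base 2 logarithm (0, 1, 2 and 3), which is the relation the paragraph above describes.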

The base 2 is a very natural choice because it comprises the minimum number of alternative messages in the repertoire of even the most primitive communication system. An example of such a system is the old-time acoustic fire alarm given by a special tolling of church bells: ordinary tolling = no fire, special tolling = fire. It is however unrealistic to expect equal probabilities regarding the messages contained in the system’s repertoire: the no fire situation is more probable than the fire. The lack of parity between different messages in a possible repertoire therefore corresponds to the probability of their selection. The proposed measure of information ($\log_2 2$) must therefore be revised to include preferences for a certain kind of choice.

To communicate a message is to transmit a pattern distributed in time, seen mathematically as a time series. A measurement of information is principally related to the regularity or irregularity of that pattern, but the irregular is always more common than the regular. A random sequence of symbols shows no pattern and conveys no information. A fundamental principle of information theory is therefore that information is characterized by symbols with associated probabilities of occurrence. Written language, for example, is a source where the symbols used appear with unequal probabilities and are statistically linked. Its elements are therefore mainly discrete, separate and mutually exclusive.

Between the limits of complete knowledge and complete ignorance, it makes sense to speak of degrees of uncertainty. The larger the set of alternatives possible to choose, the more information we require in order to make the decision. But we must consider not only the range of choices available but also the probabilities associated with each.

The amount of information carried inside a message is determined by its probability in the set of all possible messages. The more probable the type of pattern, the less order it contains, because order has less probability and essentially lacks randomness. It is therefore obvious that the less probable a message is, the more meaning it carries, something intuitively felt. We apprehend the surrounding world not on an equal scale of probability, but on a scale which is heavily biased towards the new and interesting. As we will see, it is the usualness or rareness of the used signs (or frequency) which determines how much information they comprise. Generally, the less probability, the more information. In this context, information is nearly the same as improbability. Thus: the information content of a signal is the measure of the improbability with which this signal appears in a certain communication system.

Probabilities are by definition always less than or equal to 1, because 1 is the probability of absolute certainty and no probability can be greater than absolute certainty. From that it follows that the amount of information is greater than zero whenever the probability of the matching event is less than one. If the selecting probability in a set of messages is 1.0, the message is always chosen; no freedom of choice exists and no information is communicated. To conclude: when probability approaches zero, information approaches infinity; when probability approaches 1, information approaches zero.

While information combines additively, probabilities taken independently combine multiplicatively. Thus, the relation between the amount of information existing in a message and the probability of that message will be similar to the relation between a set of numbers that adds and a set that multiplies. Mathematically, the first set is defined as the logarithm of the second set, taken to an appropriate base. A logarithm is however only determined up to a scale factor, positive or negative, by which it can be multiplied. The ordinary logarithm of a quantity less than one is always negative, while information is naturally taken to be positive. The scale factor must therefore be negative: information is measured as the negative of the logarithm of the probability.
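A short Python sketch makes the link between multiplying probabilities and adding information concrete. It assumes the usual base 2 measure, the negative logarithm of the probability, and independent events; the name `self_information` is chosen here for illustration only.

```python
import math


def self_information(p: float) -> float:
    """Information in bits carried by an event of probability p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    # log2(1/p) equals -log2(p) and is never negative for p <= 1.
    return math.log2(1.0 / p)


p_first, p_second = 0.5, 0.25          # two independent events (assumed values)
joint = p_first * p_second             # probabilities multiply

print(self_information(joint))                                  # 3.0
print(self_information(p_first) + self_information(p_second))   # 3.0
print(self_information(1.0))                                    # 0.0
```

The joint event has probability 0.125 and carries 3 bits, the sum of the 1 bit and 2 bits of its parts. A certain event (probability 1) carries no information.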

The measure of freedom of choice, i.e. the information content, in a repertoire of two messages with different probability may be exemplified by the already mentioned tolling of the bell. If every tenth tolling of the bell on average is a fire alarm, it must be assigned a probability value of 0.1. All other tolling sequences thus comprise the probability of 0.9. Then the information content of a tolling may be calculated as:

$-(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469$ bits

When a particular message becomes more probable than any other, the freedom of choice is restrained and the conforming information measure naturally has to decrease. The information content of the message repertoire does not depend on how we divide the repertoire. The content of each individual message can be computed individually and then added to form the total message.
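The effect of unequal probabilities can be checked with a small Python sketch of the average information per message. The function name `entropy` and the choice of Python are assumptions made for illustration; the formula is the probability-weighted sum of logarithms used for the bell example.

```python
import math


def entropy(probabilities):
    """Average information in bits per message for a repertoire of messages."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Each message contributes its information, weighted by how often it occurs.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


fair = entropy([0.5, 0.5])   # two equally probable messages
bell = entropy([0.9, 0.1])   # ordinary tolling vs. fire alarm

print(fair)
print(bell)
print(bell < fair)
```

Two equally probable messages give exactly one bit. As soon as one message dominates, as with the bell, the average falls below one bit.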

In applications of information theory it is practical to consider the letters of the alphabet as the repertoire of available messages (each letter being a kind of elementary message). The 26 letters of the English alphabet and the space needed to distinguish between words give a total of 27 symbols. Equal probability of all the symbols in this communication system would assign them the same individual information content of $\log_2 27 \approx 4.76$ bits, so that 5 binary digits are needed per symbol in a fixed-length code. (Note that every binary digit represents a choice between two alternatives and that a code cannot contain fractions of choices.) In reality, however, the symbols occur with very different probabilities, so the average information content of an English letter is about 4 bits.

When utilized in a real message the information content of a letter is still lower, as the English language inherently restricts the freedom of choice. Constraints induced by grouping and patterning of letters, words and compulsory redundancy cause the real information content to be a little less than 2 bits per letter. The probabilities of occurrence of certain pairs, triplets, etc. of letters are astonishingly constant, just as are the frequencies of various words. The individual probability of words and letters seems to have a correlation to their costs in time and effort when used; the total average cost in a message is generally minimized. The probability of occurrence for individual letters in the English alphabet is presented in Table 4.3.

A deduction regarding the English language is that no word in fact has to be composed of more than three letters. The number of possible trigrams is $26^3 = 17{,}576$, which is more than enough for the words of the language.

What about ordinary decimal numbers and their information content? The decimal notation with its repertoire of ten equally probable digits has an information content of $\log_2 10 \approx 3.32$ bits per digit. The measure itself need not be a whole number; only a fixed-length binary encoding requires a whole number of binary digits, here 4. In a message any sequence of digits may occur, while sequences of English words occur according to certain rules.

Both ordinary letters and digits have a shorter notation and a more comprehensive information content than the binary system. The binary notation with its equally probable digits of 0 and 1 carries an information content of $\log_2 2 = 1$ bit and may of course be considered 3 times more excessive than the decimal notation and 5 times more than the alphabetic notation. But we may easily express every sign simply by the use of the binary digits 0 and 1 (the working principle of the computer!). Four digit binary combinations give 16 different words ($2^4 = 16$), six digit binary combinations give 64, and so on. Every time we add one binary digit to the word the possible combinations of new words are doubled.
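The relation between repertoire size, information per symbol and binary word length can be sketched in Python. The helper names below are hypothetical, and equal probability of all symbols is assumed throughout.

```python
import math


def information_per_symbol(repertoire_size: int) -> float:
    """Bits per symbol when all symbols are equally probable."""
    return math.log2(repertoire_size)


def binary_digits_needed(repertoire_size: int) -> int:
    """Whole binary digits needed to give each symbol its own fixed-length word."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (repertoire_size - 1).bit_length()


def words_available(digits: int) -> int:
    """Number of distinct binary words of the given length."""
    return 2 ** digits


# Binary digits, decimal digits, and the 27-symbol alphabet (letters plus space).
for size in (2, 10, 27):
    print(size, round(information_per_symbol(size), 2), binary_digits_needed(size))

# Each added binary digit doubles the number of available words.
for digits in (4, 5, 6):
    print(digits, words_available(digits))
```

The second loop prints 16, 32 and 64 words for 4, 5 and 6 digits, so one extra digit always doubles the count.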

How about information content and transmission time when letters and digits are expressed in the old Morse Code? At first sight it is easy to believe that this code is binary, consisting of dots (the equivalent of 0) and dashes (the equivalent of 1). But this is an erroneous thought due to the three spaces included in the code. A short space (one time unit, tu) exists between the various dots and dashes building each symbol. Between symbols a medium space exists (three time units) and between words a long space (six time units) is to be found. The dot itself is one time unit while the dash consists of three time units. The time units have got their length in order to optimize the operation of the human ear as a detection and hearing device.

Now consider the word TEA, which in Morse Code is – (T), · (E), · – (A), and gives the following transmission length:

T (3 tu) + space (3 tu) + E (1 tu) + space (3 tu) + A (5 tu) = 15 tu

If this value is compared with the length of the same message sent in a binary code, we find the same result. A binary code for the 26 letters of the alphabet has to use five bits to define each symbol. With one time unit per bit, the total time to transmit the three-letter word will also be 15 tu. No spaces are needed between the letters, because the 0 and the 1 are of the same duration and every symbol has the same length, so there is no need to indicate when one symbol ends and the next starts. After every byte (of 5 bits) there is always a new symbol also consisting of 5 bits.
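A fixed-length code of this kind can be sketched in Python. The particular assignment of bit patterns to letters (A = 00000, B = 00001, and so on) is a hypothetical choice for illustration; the source does not specify one.

```python
import string

CODE_WIDTH = 5  # 2**5 = 32 patterns, enough for 26 letters

# Hypothetical assignment: letters numbered 0..25 in alphabetical order.
ENCODE = {ch: format(i, "05b") for i, ch in enumerate(string.ascii_uppercase)}
DECODE = {bits: ch for ch, bits in ENCODE.items()}


def encode(word: str) -> str:
    """Concatenate the 5-bit patterns of the letters, with no separators."""
    return "".join(ENCODE[ch] for ch in word.upper())


def decode(bits: str) -> str:
    """Cut the stream every 5 bits; fixed width makes separators unnecessary."""
    chunks = (bits[i:i + CODE_WIDTH] for i in range(0, len(bits), CODE_WIDTH))
    return "".join(DECODE[chunk] for chunk in chunks)


signal = encode("TEA")
print(signal, len(signal), decode(signal))
```

Under this assignment TEA becomes a stream of 15 binary digits, and the receiver recovers the letters simply by counting off groups of five.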

At first sight, the Morse Code seems to be as efficient as the binary code. This is, however, not true, as the chosen word is extremely favourable for the Morse Code. If another three-letter word is selected, say WHY, the same calculation as for TEA gives 35 tu, against 15 tu for the binary code. A comparison between the binary code and the Morse Code shows that the former, on average, is faster by about 50 per cent. With the introduction of the extremely fast modern computer, the binary notation of the alphabet and decimal system has in a short time outdone communication systems such as Morse telegraphy and teletype.
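The timing rules stated above (dot 1 tu, dash 3 tu, 1 tu between the elements of a symbol, 3 tu between symbols) can be applied mechanically. The following Python sketch does so for a partial International Morse table covering only the letters used in this section; the function names are hypothetical.

```python
# Partial International Morse table: only the letters discussed here.
MORSE = {"T": "-", "E": ".", "A": ".-", "W": ".--", "H": "....", "Y": "-.--"}

DOT, DASH = 1, 3     # durations in time units (tu)
ELEMENT_GAP = 1      # space between dots and dashes inside one symbol
SYMBOL_GAP = 3       # space between two symbols


def letter_duration(letter: str) -> int:
    """Time units for one letter, including its internal spaces."""
    pattern = MORSE[letter]
    marks = sum(DOT if mark == "." else DASH for mark in pattern)
    return marks + ELEMENT_GAP * (len(pattern) - 1)


def word_duration(word: str) -> int:
    """Time units for a word: its letters plus the spaces between them."""
    letters = [letter_duration(ch) for ch in word.upper()]
    return sum(letters) + SYMBOL_GAP * (len(letters) - 1)


for word in ("TEA", "WHY"):
    print(word, word_duration(word), "tu")
```

`word_duration("TEA")` reproduces the 15 tu worked out by hand above, and the same call can be made for any word spelled with the letters in the table.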

It is now possible to take a closer look at the system of traffic lights presented on page 235. This reveals that with three different colours, each signal should have an information content of about 1.58 bits ($\log_2 3$). This is, however, a little too high. In the sequences of consecutive colours actually used, red never comes directly after green or green directly after red. The repertoire is limited and no random use of the three colours is possible. By use of the intermediate symbols of green-yellow and yellow-red, the information content is limited. After green, yellow is always presented, and after red always yellow as well, in reality red-yellow. But after yellow, either green or red may be presented. This part in the sequence of signals must be ascribed the information content of 1 bit.
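The effect of the sequence constraints can be sketched in Python. The transition table below encodes the rules just described, and the sketch assumes that whenever more than one successor is allowed, the successors are equally probable; the names are hypothetical.

```python
import math

# Allowed successors of each colour, following the rules in the text.
SUCCESSORS = {
    "green": ["yellow"],
    "red": ["yellow"],            # shown in practice as red-yellow
    "yellow": ["green", "red"],
}


def unconstrained_information() -> float:
    """Bits per signal if all three colours could follow in any order."""
    return math.log2(len(SUCCESSORS))


def step_information(colour: str) -> float:
    """Bits carried by the next signal, given the colour currently shown."""
    return math.log2(len(SUCCESSORS[colour]))


print(unconstrained_information())
for colour in SUCCESSORS:
    print(colour, step_information(colour))
```

Only the step after yellow carries information (1 bit). After green or red the next signal is fully determined and carries none.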

To create the highest possible safety and reduce doubtfulness, redundancy is built into the system by the fixed position order of the coloured lamps. Red is always on top, yellow in the middle and green at the bottom.

Source: Skyttner Lars (2006), General Systems Theory: Problems, Perspectives, Practice, Wspc, 2nd Edition.