StatTools : Bayesian Classification Explained

This page provides explanations and support for the most basic of Bayesian Probability algorithms, the classification of individuals into alternative groups according to observed attributes. The basic development of a model using a set of reference data is carried out in the Bayesian Classification (Analysis of Reference Data) Program Page, and how that model can be modified to suit local circumstances is shown in the Bayesian Classification (Adjust Reference Table) Program Page.

The next panel takes the user through the calculations using the default example data of the two program pages. This is followed by a panel giving an introduction to Bayesian concepts.

Example 1 : To establish the model using reference data

Italian	--
Italian	+-
Italian	-+
Italian	-+
Italian	++
Italian	++
Italian	++
Italian	++
Italian	++
Italian	++
French	--
French	--
French	+-
French	+-
French	+-
French	-+
French	-+
French	-+
French	++
French	++
German	--
German	--
German	--
German	--
German	--
German	--
German	+-
German	+-
German	-+
German	++

Please Note : The example data are computer generated to demonstrate the procedures and are not real. Also, a much larger reference data set is required to establish a stable model. The small sample is used so that it can be displayed on the web page. This example follows the procedures used in the Bayesian Classification (Analysis of Reference Data) Program Page, and uses the default example data of that page.

We would like to establish a method of classifying Europeans into 3 ethnic types, Italian, French, and German, based on two observations: whether they have dark hair, and whether they have brown eyes. Four combinations are therefore possible: dark hair and brown eyes (++), dark hair and not brown eyes (+-), not dark hair and brown eyes (-+), and neither dark hair nor brown eyes (--).

We carefully chose a sample of people truly representing these three ethnic groups, and observed their hair and eye colors, creating the data set as listed above.

Pattern	Italian	French	German
++ (Dark hair, Brown Eyes)	6	2	1
+- (Dark hair, not Brown Eyes)	1	3	2
-+ (not Dark hair, Brown Eyes)	2	3	1
-- (not Dark hair, not Brown Eyes)	1	2	6

Step 1 : A table of counts is established. The rows are the patterns, the columns are the groups, and each cell lists the number of cases with that pattern in that group, as shown in the table above.
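The tally in Step 1 can be sketched in a few lines of Python. This is only an illustration, not the code used by the program page; the variable names (`reference`, `groups`, `patterns`, `counts`) are assumptions introduced here, and the data are the 30 example cases listed above.

```python
from collections import Counter

# Example reference data as (group, pattern) pairs.
# A pattern is hair then eyes: '+' = dark hair / brown eyes, '-' = not.
reference = (
    [("Italian", "--")] * 1 + [("Italian", "+-")] * 1
    + [("Italian", "-+")] * 2 + [("Italian", "++")] * 6
    + [("French", "--")] * 2 + [("French", "+-")] * 3
    + [("French", "-+")] * 3 + [("French", "++")] * 2
    + [("German", "--")] * 6 + [("German", "+-")] * 2
    + [("German", "-+")] * 1 + [("German", "++")] * 1
)

groups = ["Italian", "French", "German"]
patterns = ["++", "+-", "-+", "--"]

# Count how often each (group, pattern) pair occurs.
tally = Counter(reference)

# Rows are patterns, columns are groups, as in the count table.
counts = {p: [tally[(g, p)] for g in groups] for p in patterns}

for p in patterns:
    print(p, counts[p])
```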

Pattern	Italian	French	German
++ (Dark hair, Brown Eyes)	0.6	0.2	0.1
+- (Dark hair, not Brown Eyes)	0.1	0.3	0.2
-+ (not Dark hair, Brown Eyes)	0.2	0.3	0.1
-- (not Dark hair, not Brown Eyes)	0.1	0.2	0.6

Step 2 : The creation of the relative frequency table. This contains the probability of having a particular pattern in each of the groups, also known as the reference P(pattern|group) table, or more generically as the P(x|j) table. It is calculated by dividing the count in each cell by the total count in the group (column total). The results are as shown in the table above.
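As a minimal sketch of Step 2, assuming the count table is held as a dictionary of rows (the names `counts`, `totals`, and `p_x_given_j` are illustrative, not taken from the program page):

```python
groups = ["Italian", "French", "German"]

# Count table from Step 1: one row per pattern, one column per group.
counts = {
    "++": [6, 2, 1],
    "+-": [1, 3, 2],
    "-+": [2, 3, 1],
    "--": [1, 2, 6],
}

# Column totals: the number of reference cases in each group.
totals = [sum(row[i] for row in counts.values()) for i in range(len(groups))]

# P(pattern|group): each count divided by its column total.
p_x_given_j = {
    pattern: [c / t for c, t in zip(row, totals)]
    for pattern, row in counts.items()
}

for pattern, row in p_x_given_j.items():
    print(pattern, [round(v, 2) for v in row])
```

Each column of the resulting table sums to 1, because every case in a group shows exactly one of the four patterns.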

The P(pattern|group) table represents the model we have created, and from which we can create decision tools to use on future independent sets of data. Different software packages provide different formats to represent this model, but StatTools uses the table as calculated by the Bayesian Classification (Analysis of Reference Data) Program Page.

Step 3 : The creation of the first decision making table, the Maximum Likelihood Table. This is discussed as part of example 2.

Example 2 : The adjustment of the reference P(pattern|group) table to create decision tools.

This discussion supports the procedures in the Bayesian Classification (Adjust Reference Table) Program Page and uses its default example data.

Pattern	Italian	French	German
++	0.6	0.2	0.1
+-	0.1	0.3	0.2
-+	0.2	0.3	0.1
--	0.1	0.2	0.6

We begin by using the reference P(pattern|group) table created by the analysis of a set of reference data, as shown in the table above.

Pattern	Italian	French	German
++ (Dark hair, Brown Eyes)	0.67	0.22	0.11
+- (Dark hair, not Brown Eyes)	0.17	0.50	0.33
-+ (not Dark hair, Brown Eyes)	0.33	0.50	0.17
-- (not Dark hair, not Brown Eyes)	0.11	0.22	0.67

Step 1 : The creation of the Maximum Likelihood Table, which shows the probability of belonging to a group based on the observations available. It is also called the P(group|pattern) or the P(j|x) table. The Maximum Likelihood Probability is calculated by dividing each probability in the P(pattern|group) table by the sum across all groups (row total). The results are as shown in the table above.

A person with dark hair and brown eyes (++) has a probability of 67% being an Italian, 22% being French, and 11% being German, and so should be classified as an Italian (in bold)
A person with dark hair but eyes not brown (+-) has a probability of 17% being an Italian, 50% being French, and 33% being German, and so should be classified as French (in bold)
A person with hair not dark but brown eyes (-+) has a probability of 33% being an Italian, 50% being French, and 17% being German, and so should be classified as French (in bold)
A person with hair not dark and eyes not brown (--) has a probability of 11% being an Italian, 22% being French, and 67% being German, and so should be classified as German (in bold)

Once such a table is established, it can be used as a reference to classify all individuals after observing their characteristics.
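The row normalization of Step 1, and the resulting classification, can be sketched as follows. The helper name `normalize` and the dictionary layout are assumptions made for this illustration only.

```python
def normalize(values):
    """Scale a list of non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

groups = ["Italian", "French", "German"]

# Reference P(pattern|group) table.
p_x_given_j = {
    "++": [0.6, 0.2, 0.1],
    "+-": [0.1, 0.3, 0.2],
    "-+": [0.2, 0.3, 0.1],
    "--": [0.1, 0.2, 0.6],
}

# P(group|pattern): each row divided by its row total.
max_likelihood = {p: normalize(row) for p, row in p_x_given_j.items()}

# Classify each pattern into the group with the highest probability.
decision = {p: groups[row.index(max(row))] for p, row in max_likelihood.items()}

for p, row in max_likelihood.items():
    print(p, [round(v, 2) for v in row], decision[p])
```

Rounded to two decimal places, the rows agree with the Maximum Likelihood Table, and the decisions are Italian, French, French, and German for ++, +-, -+, and -- respectively.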

Step 2 : The construction of the Bayesian Probability Table, taking into consideration the apriori probability of belonging to each of the groups. The table is also called P(group|pattern,π) or P(j|x,π).

The Maximum Likelihood is based on the assumption that the probability of being in any of the groups is the same, except for the observed characteristics. This is seldom the case in reality. If we were to take our model to Rome, to Paris, or to Dresden, the probability of someone being Italian, French, or German would be very different even before we observed the characteristics. Such a probability, the apriori probability (π), needs to be taken into account.

The program takes π into consideration by using an array of apriori indicators. This is an array which contains the relative probabilities of belonging to each group. The values entered by the user can be in any unit of measurement (number of cases, probabilities, ratios), and the program normalizes these values into probabilities before calculation.

The default example in the Bayesian Classification (Adjust Reference Table) Program Page is "1 1 1", indicating that the apriori probabilities of the 3 groups are the same (normalized to 0.33 each). The results of the calculations will be the same as those from the Maximum Likelihood table.
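The normalization of the apriori array, and the fact that equal apriori values leave the Maximum Likelihood result unchanged, can be checked with a short sketch. The functions `parse_weights` and `normalize` are hypothetical helpers written for this illustration; how the program itself parses its input is not described on this page.

```python
def parse_weights(text):
    """Turn a space-separated entry such as "1 1 1" into a list of floats."""
    return [float(token) for token in text.split()]

def normalize(values):
    """Scale a list of non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

# The default apriori entry, normalized to probabilities.
equal_prior = normalize(parse_weights("1 1 1"))
print([round(v, 2) for v in equal_prior])

# The ++ row of the reference P(pattern|group) table.
row = [0.6, 0.2, 0.1]

# Multiply by the apriori probabilities, then normalize across groups.
with_prior = normalize([p * w for p, w in zip(row, equal_prior)])

# Maximum Likelihood: normalize the row without any apriori weighting.
without_prior = normalize(row)

print([round(v, 2) for v in with_prior])
print([round(v, 2) for v in without_prior])
```

Because every group is multiplied by the same constant, the constant cancels in the normalization, so the two results coincide.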

Pattern	Italian	French	German
π (apriori probability)	0.14	0.29	0.57
++ (Dark hair, Brown Eyes)	0.43	0.29	0.29
+- (Dark hair, not Brown Eyes)	0.07	0.40	0.53
-+ (not Dark hair, Brown Eyes)	0.17	0.50	0.33
-- (not Dark hair, not Brown Eyes)	0.03	0.14	0.83

If we are to use the reference patterns in, say, Zurich, a predominantly German-speaking part of Switzerland, we may find that, for each Italian in town, there are 2 Frenchmen and 4 Germans, so the apriori probability is "1 2 4", or the probability of being German is twice that of being French and four times that of being Italian.

If we were to add such an apriori array into the calculation, the program will first normalize "1 2 4" to proportions of "0.14 0.29 0.57", meaning the apriori probabilities are 14% Italian, 29% French, and 57% German. The Bayesian Probability table, taking the apriori probabilities into consideration, would be as shown above.

Because the apriori probability of being German is greater, all those with eyes not brown are now classified as German, while the probability (certainty) of classifying into the other groups is reduced.
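The Zurich calculation can be reproduced with the sketch below, which multiplies each P(pattern|group) value by the normalized apriori probability of its group and then normalizes across groups. The names used are illustrative assumptions, not those of the program page.

```python
def normalize(values):
    """Scale a list of non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

groups = ["Italian", "French", "German"]

# Reference P(pattern|group) table.
p_x_given_j = {
    "++": [0.6, 0.2, 0.1],
    "+-": [0.1, 0.3, 0.2],
    "-+": [0.2, 0.3, 0.1],
    "--": [0.1, 0.2, 0.6],
}

# Apriori array "1 2 4", normalized to probabilities.
prior = normalize([1, 2, 4])

# P(group|pattern,π): weight each row by the prior, then normalize the row.
bayes = {}
for pattern, row in p_x_given_j.items():
    weighted = [p * w for p, w in zip(row, prior)]
    bayes[pattern] = normalize(weighted)

decision = {p: groups[row.index(max(row))] for p, row in bayes.items()}

print("prior", [round(v, 2) for v in prior])
for pattern, row in bayes.items():
    print(pattern, [round(v, 2) for v in row], decision[pattern])
```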

Step 3 : The construction of the Bayesian Probability Table, taking into consideration the apriori probability of belonging to each of the groups, and also including a cost function for error. The table is also called P(group|pattern,π,cost) or P(j|x,π,cost). The cost function for a group conceptually represents a measurement of cost or loss if a case erroneously fails to be assigned to that group. An obvious example is the diagnosis of a swelling on the face. It can be a bruise, an infection, or a cancer. To miss a cancer when there is one would be much more serious (greater cost) than to miss an infection, which in turn would be more serious than to miss a bruise.

Common practice is to include the cost function after including the apriori probabilities. If cost is to be considered without apriori, the apriori array can be assigned equal values for all groups. The unit for cost can be any measurement, in money, time, or arbitrary units of judgement. The program normalizes the array into fractions before use.

Pattern	Italian	French	German
π (apriori probability)	0.14	0.29	0.57
Cost	0.67	0.17	0.17
++ (Dark hair, Brown Eyes)	0.75	0.13	0.13
+- (Dark hair, not Brown Eyes)	0.22	0.33	0.44
-+ (not Dark hair, Brown Eyes)	0.44	0.33	0.22
-- (not Dark hair, not Brown Eyes)	0.13	0.13	0.75

The default example for costs in the Bayesian Classification (Adjust Reference Table) Program Page is "1 1 1", indicating that there is no cost difference between the groups. However, if we are looking desperately for an Italian interpreter in Zurich for an important function, missing an Italian may cost 4 times as much as missing a Frenchman or a German, and we may use a cost array such as "4 1 1", which the program will normalize to "0.67 0.17 0.17". The results can be seen in the table above.

We can see that we would now assign anyone with brown eyes as Italian, and the rest German. This is because the probability of being a German is greater, and missing an Italian incurs a greater cost. We would not assign anyone to be French at all.
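A sketch of the cost-adjusted calculation follows, assuming (as the description of the method on this page indicates) that each P(pattern|group) value is multiplied by both the normalized apriori probability and the normalized cost of its group before the row is normalized. Variable names are illustrative.

```python
def normalize(values):
    """Scale a list of non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

groups = ["Italian", "French", "German"]

# Reference P(pattern|group) table.
p_x_given_j = {
    "++": [0.6, 0.2, 0.1],
    "+-": [0.1, 0.3, 0.2],
    "-+": [0.2, 0.3, 0.1],
    "--": [0.1, 0.2, 0.6],
}

prior = normalize([1, 2, 4])  # apriori array "1 2 4"
cost = normalize([4, 1, 1])   # cost array "4 1 1"

# P(group|pattern,π,cost): weight by prior and cost, then normalize the row.
adjusted = {}
for pattern, row in p_x_given_j.items():
    weighted = [p * w * c for p, w, c in zip(row, prior, cost)]
    adjusted[pattern] = normalize(weighted)

decision = {p: groups[row.index(max(row))] for p, row in adjusted.items()}

for pattern, row in adjusted.items():
    print(pattern, [round(v, 3) for v in row], decision[pattern])
```

The values agree with the table above to within rounding, and no pattern is assigned to French.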

Bayesian Probability

Bayesian probability is based on a relatively simple premise: if we know the probability of a set of observations in each situation, then we can calculate the probabilities of the alternative situations when presented with that set of observations. In StatTools the term pattern is used to represent one or more observations, and group is used to represent situations. More formally put :

We can estimate from empirical observations the probability of pattern x when we encounter group j, notated as Probability of x given j, P(x|j) or P(pattern|group)
We can calculate the probabilities of being in each of the alternative groups when presented with the pattern, notated as Probability of j given x, P(j|x) or P(group|pattern)
The formula is P(j_i|x) = P(x|j_i) / Σ_k P(x|j_k), where the sum in the denominator runs over all groups k = 1, 2, 3, ...

The elegance of Bayesian probability is that P(x|j_i) can be modified by multiplying it by any other probability function of the group i, to alter the final probability estimates.

The 3 most common Bayesian functions are therefore :

P(j|x), the Maximum Likelihood Probabilities. This is estimated from observed reference data.
P(j|x,π), the Bayesian Probability. The P(x|j) is multiplied by the apriori probability (π) of the group before the final P(j|x) calculations are carried out. The apriori probability is the probability of belonging to the group before any patterns are known.
P(j|x,π,cost), the Adjusted Bayesian Probability. The P(x|j) is multiplied by the apriori probability (π) and a cost or loss function of the group, before the final P(j|x) calculations are carried out. The cost function for each group represents a cost or loss if a case erroneously fails to be allocated to the group.
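Written out side by side, the three functions differ only in the weights applied to P(x|j) before normalization. The subscripted symbols π_i and c_i (the normalized apriori probability and cost of group i) and the summation index k are notation assumed here for clarity; they do not appear in the original formula.

```latex
\begin{align*}
P(j_i \mid x) &= \frac{P(x \mid j_i)}{\sum_k P(x \mid j_k)} \\[4pt]
P(j_i \mid x, \pi) &= \frac{\pi_i \, P(x \mid j_i)}{\sum_k \pi_k \, P(x \mid j_k)} \\[4pt]
P(j_i \mid x, \pi, \text{cost}) &= \frac{c_i \, \pi_i \, P(x \mid j_i)}{\sum_k c_k \, \pi_k \, P(x \mid j_k)}
\end{align*}
```

Setting all π_k equal reduces the second expression to the first, and setting all c_k equal reduces the third to the second, since a constant factor cancels between numerator and denominator.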

All the complexities of Bayesian Probability reside in establishing the P(x|j) table, the apriori probabilities, and the cost functions. Once these are determined, the final common calculations are quite straightforward.

The Bayesian Classification Algorithm

The algorithm, as described in the example panel of this page, and carried out in the Bayesian Classification (Analysis of Reference Data) Program Page and Bayesian Classification (Adjust Reference Table) Program Page, represents the most basic Bayesian model. This model is often used to introduce students to the Bayesian concept.

The advantages of using this model are :

It is simple, intuitively easy to understand and accept, and not prone to errors and misinterpretations
All the model building and mathematics are carried out up front. The statistician assists in model building and derives the probability tables to be used. Once these are established, the tables are in the form of a manual that can be consulted by the user in decision making, and no further technical expertise is required.

The disadvantages are :

The probability table can be very large. The number of patterns is 2ⁿ for n two-valued characteristics: 2 patterns for one characteristic, 4 for 2, 8 for 3, and so on. In situations such as medical diagnosis, where up to 10 parameters are taken into consideration (1024 patterns), the sheer size of the probability table becomes unmanageable.
Not only is the volume a problem. The reference data set requires sufficient examples for all combinations to build a stable and reproducible model, and this may be difficult to obtain.
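The growth in table size can be illustrated by enumerating the patterns directly. This sketch assumes every characteristic is two-valued (+ or -), as in the examples on this page; the function name `all_patterns` is hypothetical.

```python
from itertools import product

def all_patterns(n_characteristics):
    """List every +/- pattern for the given number of two-valued characteristics."""
    return ["".join(signs) for signs in product("+-", repeat=n_characteristics)]

# The two-characteristic case reproduces the four patterns used in the examples.
print(all_patterns(2))

# Number of rows in the probability table as characteristics are added.
for n in (1, 2, 3, 10):
    print(n, "characteristics:", len(all_patterns(n)), "patterns")
```

With 10 characteristics and 3 groups, the table would hold 1024 rows and 3072 cells, each of which needs enough reference cases to be estimated reliably.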

Wikipedia. History and basic theoretical consideration of Bayes Theorem.

Wikipedia. Modern adaptation and use of Bayesian probability, terminologies, and some formulae.

Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology. McGraw Hill Book Company New York. Library of Congress No. 73-14716407-047935-6 p.329-344. This is where I got the algorithm from.