Example 1 : Development of the reference pattern
Italian | -+--+ |
Italian | +--+- |
Italian | -++-- |
Italian | -++-- |
Italian | +-+-- |
Italian | +-+-- |
Italian | +--+- |
Italian | +-+-- |
Italian | +-+-- |
Italian | +--+- |
French | -+-+- |
French | -+--+ |
French | -++-- |
French | -+-+- |
French | -++-- |
French | +--+- |
French | +---+ |
French | +--+- |
French | +-+-- |
French | +--+- |
German | -+--+ |
German | -+--+ |
German | -+--+ |
German | -+-+- |
German | -+--+ |
German | -+--+ |
German | -++-- |
German | +--+- |
German | +---+ |
German | +-+-- |
The default example data from the Pattern Probability Analysis (Analysis Using Reference Data) Program Page
are used in this example. The data are computer generated to demonstrate the procedures, and are not based on real observations. On this page the three groups are given the ethnic identities of Italian, French, and German.
We wish to develop a system of classification to distinguish Europeans into Italian, French, and German. We hope to develop a model which we can use in the future to classify anyone we see into one of these 3 ethnic groups. The model will use hair and eye colors.
- Hair Color: - (false) or + (true) for dark color hair, and - or + for light color hair. Three patterns are therefore available
- +- for dark color hair
- -+ for light color hair
- -- for no information on hair color
- Eye Color: - or + for brown eyes, - or + for other color eyes, and - or + for blue eyes. Four patterns of eye color are therefore available
- +-- for brown eyes
- --+ for blue eyes
- -+- for eyes of any color other than brown and blue
- --- for no information on eye color
- Putting these together, we have a set of attributes or variables in 5 characters: the first two characters represent the hair color, and the last three the eye color
Step 1 : We collected a reference sample of individuals to build our classification model, consisting of 10 each of Italians, French, and Germans. We noted the color of their hair and eyes and used these observations to build the model.
| Italian | French | German |
---|
Ch 1 (dark hair +) | 7 | 5 | 3 |
Ch 2 (light hair +) | 3 | 5 | 7 |
Ch 3 (Brown eyes +) | 6 | 3 | 2 |
Ch 4 (Other color eyes +) | 3 | 5 | 2 |
Ch 5 (Blue eyes +) | 1 | 2 | 6 |
Step 2 : The program counts the number of trues (in this case +) for every attribute in every group across all the cases of the reference population. The results are as shown in the table above.
Of the 10 Italians, 7 had dark color hair and 3 had light color hair. There were 6 with brown eyes, 1 with blue eyes, and 3 with other color eyes.
Of the 10 French, there were 5 each with dark and light color hair, 3 with brown eyes, 2 blue, and 5 other color eyes.
Of the 10 Germans, 3 had dark and 7 light color hair, 2 with brown eyes, 6 with blue eyes, and 2 with other color eyes.
| Italian | French | German |
---|
Ch 1 (dark hair +) | 0.7 | 0.5 | 0.3 |
Ch 2 (light hair +) | 0.3 | 0.5 | 0.7 |
Ch 3 (Brown eyes +) | 0.6 | 0.3 | 0.2 |
Ch 4 (Other color eyes +) | 0.3 | 0.5 | 0.2 |
Ch 5 (Blue eyes +) | 0.1 | 0.2 | 0.6 |
Step 3 : The program then calculates the relative frequencies of these attributes in each group, formally the probability of each attribute in each group, P(Attribute|Group), commonly denoted P(x|j). The results are shown in the table above. Each probability is calculated by dividing the count by the number of cases in that group (here 10).
The P(x|j) table represents the relationship between the groups and the attributes, and is used to model subsequent classifications on independent sets of individuals. This table, without the first column of labels, is used by the
Pattern Probability (Use of Reference Pattern on New Data) Program Page
to allocate individuals into groups.
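As a minimal sketch of Steps 2 and 3 (the variable names are hypothetical, and this is an illustration rather than the StatTools implementation), the reference pattern amounts to counting and dividing:

```python
# Hypothetical sketch of Steps 2 and 3, not the actual StatTools code.
# The counts come from the Step 2 table; dividing each count by the
# number of cases in its group (10) gives the P(x|j) reference pattern.

counts = {
    "Italian": [7, 3, 6, 3, 1],   # Ch 1..Ch 5 counts of '+'
    "French":  [5, 5, 3, 5, 2],
    "German":  [3, 7, 2, 2, 6],
}
group_size = 10                   # 10 cases per group in the example

p_attr_given_group = {
    group: [count / group_size for count in group_counts]
    for group, group_counts in counts.items()
}

print(p_attr_given_group["Italian"])   # [0.7, 0.3, 0.6, 0.3, 0.1]
```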
Step 4 : The model, in terms of the P(x|j) table, is validated using the same data that created it. As the process is the same as for Maximum Likelihood in Example 2, it is discussed there.
Example 2 : Use of established reference patterns on a set of data
Italian | French | German |
---|
0.7 | 0.5 | 0.3 |
0.3 | 0.5 | 0.7 |
0.6 | 0.3 | 0.2 |
0.3 | 0.5 | 0.2 |
0.1 | 0.2 | 0.6 |
This example demonstrates how the pattern established using a set of reference data, such as in Example 1, is used to classify a new set of data, assigning individuals to a group according to the relevant attributes. The calculations are as carried out in the
Pattern Probability (Use of Reference Pattern on New Data) Program Page.
The table above shows the default example reference pattern, which is the P(Attribute|Group), commonly denoted P(x|j), table obtained from the reference data in Example 1.
+-+-- |
+--+- |
+---+ |
-++-- |
-+-+- |
-+--+ |
----+ |
The default example data to be allocated into groups are as shown in the table above. The 7 individuals are
- +-+-- Dark hair brown eyes
- +--+- Dark hair other color eyes
- +---+ Dark hair blue eyes
- -++-- Light hair brown eyes
- -+-+- Light hair other color eyes
- -+--+ Light hair blue eyes
- ----+ Bald (hair color unknown), blue eyes
Row | Pattern | Groups P(x|j) |
| | Italian | French | German |
---|
1 | +-+-- | 0.42 | 0.15 | 0.06 |
2 | +--+- | 0.21 | 0.25 | 0.06 |
3 | +---+ | 0.07 | 0.1 | 0.18 |
4 | -++-- | 0.18 | 0.15 | 0.14 |
5 | -+-+- | 0.09 | 0.25 | 0.14 |
6 | -+--+ | 0.03 | 0.1 | 0.42 |
7 | ----+ | 0.1 | 0.2 | 0.6 |
Step 1 : The establishment of a table of the probability of each pattern for each group, P(pattern|Group), commonly denoted as the P(x|j) table, where x represents the pattern and j the group. This is calculated from the reference pattern table by multiplying together, within each group, the probabilities of every attribute marked positive (+). The results are as shown in the table above.
This means that Italians have a 42%, the French a 15%, and Germans a 6% probability of containing someone like the person in row 1, with dark hair and brown eyes (+-+--), and Italians have a 10%, the French a 20%, and Germans a 60% probability of containing someone who is bald and has blue eyes (----+, row 7).
This table is important, as it is used for all subsequent calculations.
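The Step 1 multiplication can be sketched as follows (a hypothetical illustration with made-up names, not the StatTools code itself): for each group, the probabilities at the '+' positions of the pattern are multiplied together.

```python
# Hypothetical sketch of Step 1, not the actual StatTools code.
# P(pattern|group) is the product of the group's attribute
# probabilities at every position where the pattern holds a '+'.

p_ref = {                          # the P(x|j) reference pattern from Example 1
    "Italian": [0.7, 0.3, 0.6, 0.3, 0.1],
    "French":  [0.5, 0.5, 0.3, 0.5, 0.2],
    "German":  [0.3, 0.7, 0.2, 0.2, 0.6],
}

def pattern_likelihood(pattern, probs):
    result = 1.0
    for flag, p in zip(pattern, probs):
        if flag == "+":            # only positive attributes contribute
            result *= p
    return result

# Row 1, dark hair and brown eyes, for Italians: 0.7 x 0.6
print(round(pattern_likelihood("+-+--", p_ref["Italian"]), 2))   # 0.42
```

A missing attribute (all '-') simply contributes nothing to the product, which is why the bald man's pattern ----+ can still be evaluated.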
Row | Pattern | Groups P(j|x) |
| | Italian | French | German |
---|
1 | +-+-- | 0.67 | 0.24 | 0.10 |
2 | +--+- | 0.40 | 0.48 | 0.12 |
3 | +---+ | 0.20 | 0.29 | 0.51 |
4 | -++-- | 0.38 | 0.32 | 0.30 |
5 | -+-+- | 0.19 | 0.52 | 0.29 |
6 | -+--+ | 0.05 | 0.18 | 0.76 |
7 | ----+ | 0.11 | 0.22 | 0.67 |
Step 2 : The construction of the Maximum Likelihood table, also called the P(Group|Pattern) or more commonly the P(j|x) table, which represents the probability of belonging to each of the groups for the pattern in each row. It is calculated by dividing each probability in the P(pattern|Group) table by the total across all the groups.
The table is as shown above. It means that someone with dark hair and brown eyes (+-+--, row 1) has a 67% probability of being Italian, a 24% probability of being French, and a 10% probability of being German. He is then classified to the most likely group, Italian at 67%.
Likewise, someone with light color hair and blue eyes (-+--+, row 6) has a 5% probability of being Italian, an 18% probability of being French, and a 76% probability of being German. He is classified to the most likely group, German.
Where data are incomplete, as in the case of the bald man with blue eyes (no information on hair color, ----+, row 7), the algorithm is still able to make a decision based on what is available, assigning him to the German group.
The Maximum Likelihood is the first of the Bayesian Probability calculations, based only on the observed attributes, without taking anything else into consideration.
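Sketched in the same style (hypothetical names, not the StatTools source), the Maximum Likelihood step simply normalizes each row of the P(pattern|Group) table and picks the largest entry:

```python
# Hypothetical sketch of Step 2, not the actual StatTools code.
# Each row of P(pattern|group) is divided by its total across the
# groups, giving P(j|x); the case is assigned to the largest entry.

def maximum_likelihood(row):
    total = sum(row)
    return [value / total for value in row]

groups = ["Italian", "French", "German"]
row1 = [0.42, 0.15, 0.06]                 # P(pattern|group) for +-+--

p_j_given_x = maximum_likelihood(row1)
print([round(p, 2) for p in p_j_given_x])           # [0.67, 0.24, 0.1]
print(groups[p_j_given_x.index(max(p_j_given_x))])  # Italian
```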
Step 3 : The construction of the Bayesian Probability table, taking into consideration the apriori probability of belonging to each of the groups. The table is also called the P(group|pattern,π) or P(j|x,π) table.
The Maximum Likelihood is based on the assumption that, apart from the observed attributes, the probability of being in any of the groups is the same. This is seldom the case in reality. If we were to take our model to Rome, to Paris, or to Dresden, the probability of someone being Italian, French, or German would be very different even before we observed the attributes. Such a probability, the apriori probability (π), needs to be taken into account.
The program takes π into consideration by using an array of apriori indicators, which contains the relative probabilities of belonging to each group. The values entered by the user can be in any measurement (number of cases, probabilities, ratios), and the program normalizes these values into probabilities before calculation.
The default example in the Pattern Probability (Use of Reference Pattern on New Data) Program Page
is "1 1 1", indicating that the apriori probabilities in the 3 groups are the same (normalized to 0.33 each). The results of the calculations will then be the same as those from the Maximum Likelihood table.
If we were to use the reference patterns in, say, Zurich, a predominantly German speaking part of Switzerland, we may find that, for each Italian in town, there are 2 Frenchmen and 4 Germans. The apriori array is then "1 2 4": the probability of being German is twice that of being French and four times that of being Italian.
Row | Pattern | Groups P(j|x,π) |
| | Italian | French | German |
---|
1 | +-+-- | 0.44 | 0.31 | 0.25 |
2 | +--+- | 0.22 | 0.53 | 0.25 |
3 | +---+ | 0.07 | 0.20 | 0.73 |
4 | -++-- | 0.17 | 0.29 | 0.54 |
5 | -+-+- | 0.08 | 0.43 | 0.49 |
6 | -+--+ | 0.02 | 0.10 | 0.88 |
7 | ----+ | 0.03 | 0.14 | 0.83 |
If we add such an apriori array to the calculation, the program will first normalize "1 2 4" to proportions of "0.14 0.29 0.57", meaning the apriori probabilities are 14% Italian, 29% French, and 57% German. The Bayesian Probability table taking the apriori probabilities into consideration is as shown above.
It can be seen that, in Zurich, someone with dark hair and brown eyes (+-+--, row 1) is still most probably Italian at 44%,
someone with dark hair and eyes of any color other than blue or brown (+--+-, row 2) is most probably French at 53%, and all other combinations of attributes are most probably German.
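A sketch of how the apriori array enters the calculation (hypothetical names; a simplification for illustration, not the program's code):

```python
# Hypothetical sketch of Step 3, not the actual StatTools code.
# The apriori indicators are normalized into probabilities, each
# group's likelihood is weighted by its prior, and the row is
# renormalized across the groups.

def bayes_with_prior(likelihood_row, prior_indicators):
    total_prior = sum(prior_indicators)
    priors = [p / total_prior for p in prior_indicators]   # "1 2 4" -> 0.14 0.29 0.57
    weighted = [l * p for l, p in zip(likelihood_row, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

row1 = [0.42, 0.15, 0.06]    # P(pattern|group) for +-+--
print([round(p, 2) for p in bayes_with_prior(row1, [1, 2, 4])])   # [0.44, 0.31, 0.25]
```

With equal indicators such as "1 1 1", the weighting cancels out and the result reduces to the Maximum Likelihood row, as the text above notes.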
Step 4 : The construction of the Bayesian Probability table, taking into consideration the apriori probability of belonging to each of the groups, and also including a cost function for error. The table is also called the P(group|pattern,π,cost) or P(j|x,π,cost) table.
The cost function for a group conceptually represents a measurement of cost or loss if a case erroneously fails to be assigned to that group. An obvious example is the diagnosis of a swelling on the face: it can be a bruise, an infection, or a cancer. To miss a cancer when there is one would be much more serious (a greater cost) than to miss an infection, and that in turn more serious than to miss a bruise.
Common practice is to include the cost function after including the apriori probabilities. If cost is to be considered without apriori, the apriori array can be assigned equal values for all groups. The unit for cost can be any measurement, in money, time, or arbitrary units of judgement. The program normalizes the array into fractions before use.
Row | Pattern | Groups P(j|x,π,cost) |
| | Italian | French | German |
---|
1 | +-+-- | 0.70 | 0.17 | 0.13 |
2 | +--+- | 0.46 | 0.37 | 0.18 |
3 | +---+ | 0.19 | 0.18 | 0.64 |
4 | -++-- | 0.39 | 0.21 | 0.40 |
5 | -+-+- | 0.20 | 0.38 | 0.42 |
6 | -+--+ | 0.05 | 0.10 | 0.85 |
7 | ----+ | 0.10 | 0.13 | 0.77 |
The default example for costs in the Pattern Probability (Use of Reference Pattern on New Data) Program Page
is "1 1 1", indicating that there is no cost difference between the groups. However, if we are desperately looking for an Italian interpreter in Zurich for an important function, missing an Italian may cost 3 times as much as missing a Frenchman or a German, and we may use a cost array such as "3 1 1", which the program will normalize to "0.6 0.2 0.2". The results are as shown in the table above.
We can see now that we would assign anyone with dark hair and eyes that are not blue as Italian (the first 2 rows), and everyone else as German, because most people in Zurich are German, and failing to identify an Italian costs 3 times as much as the same mistake for the other groups. We would not assign anyone as French at all.
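The cost step mirrors the apriori step; a sketch (hypothetical names, for illustration only):

```python
# Hypothetical sketch of Step 4, not the actual StatTools code.
# The cost array is normalized into fractions, the prior-adjusted
# probabilities are weighted by those fractions, and the row is
# renormalized across the groups.

def apply_cost(prob_row, cost_indicators):
    total_cost = sum(cost_indicators)
    costs = [c / total_cost for c in cost_indicators]      # "3 1 1" -> 0.6 0.2 0.2
    weighted = [p * c for p, c in zip(prob_row, costs)]
    total = sum(weighted)
    return [w / total for w in weighted]

row1 = [0.4375, 0.3125, 0.25]   # unrounded P(j|x,π) for +-+-- in Zurich
print([round(p, 2) for p in apply_cost(row1, [3, 1, 1])])   # [0.7, 0.17, 0.13]
```

Note that the unrounded P(j|x,π) values are used here; feeding in the 2-decimal table values would reproduce the small rounding discrepancies discussed in the next section.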
Apparent inconsistencies in tables on this page
The PHP pages on the server use 32 bit arithmetic, so calculations are precise to many decimal places. StatTools by default presents numbers to 4 decimal places of precision. However, on this explanation page, results are presented only to 2 decimal places to conserve space. Rounding may therefore make some results appear inconsistent. For example, 0.1249 may be rounded to 0.12, and 0.1251 to 0.13. Multiplying both by 2 produces expected results of 2 x 0.12 = 0.24 and 2 x 0.13 = 0.26, when in the program they are actually 2 x 0.1249 = 0.2498 and 2 x 0.1251 = 0.2502, both rounding to 0.25.
As most probability results are presented to 2 decimal places of precision, the results produced are adequate for use. The confusion arises only in translating these results to this explanation page, where all the interim results are also presented to 2 decimal places. Users should therefore be aware of this and not be confused or alarmed by the apparent inconsistencies.
Formatting data, both input and results
The Pattern Probability model allows the use of multiple variables or attributes to assign cases to groups. The attributes are commonly binary parameters, but sometimes can also contain multiple mutually exclusive categories. The examples on this page use
2 hair colors (binary) and 3 eye colors (3 mutually exclusive categories). The program will also produce results when confronted with missing information. Although the concept is straightforward, formatting the data input and combining the different types of attributes is a challenging problem.
Warner et al., in 1961 (see reference), first described the use of this model, to diagnose congenital heart disease in children. For a binary variable he used the frequency, or 1 − frequency, depending on whether the attribute was true or false for that variable. For multiple categories, he listed the categories and the frequency for each, and chose the appropriate category. This allowed the creation of a simple two dimensional table of frequencies, but each attribute required its own unique management before it could be entered into the calculations. In 1961, when most computation was manual, such a solution was time consuming but workable.
Overall and Klett, in 1972, used this model to classify psychiatric disorders. They used a large number of diagnostic criteria, each having up to 10 categories, so the data had to be presented as large multi-dimensional tables. This was possible because the data were part of a Fortran program, stored on punch cards, with a format unique to each run of the program. However, such a format cannot be visualized or manipulated easily in the interactive web based environment of StatTools.
The data format used in these pages is therefore unique to StatTools, modifying the formats of these two predecessors to retain the advantages and overcome the disadvantages. The principles are as follows
- The attributes are presented as a continuous sequence of + for present or - for absent.
- A binary attribute is presented in 2 columns: +- for true or yes, -+ for false or no, and -- for missing information
- An attribute with 3 categories is presented in 3 columns: +-- for the first, -+- for the second, --+ for the third, and --- for missing information
- An attribute with 4 categories is presented in 4 columns: +--- for the first, -+-- for the second, --+- for the third, ---+ for the fourth, and ---- for missing information
- And so on for 5, 6, or more categories.
- All the Attributes are then strung together in a single string for use.
Such an approach allows a large number of attributes with heterogeneous categories to be handled flexibly on the web page. It also allows the inclusion of missing information without disrupting data processing.
The disadvantage is that such a string is not easily interpretable, and is error-prone if assembled manually. However, the data can easily be prepared in an Excel spreadsheet or with a small Javascript program by anyone experienced with these tools.
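As an illustration of the encoding rules above (a hypothetical helper written for this page, not part of StatTools), such a string can be assembled programmatically:

```python
# Hypothetical encoder for the '+'/'-' string format described above,
# written for illustration only. A category index becomes a run of
# '-' with a single '+'; None (missing information) becomes all '-'.

def encode(category, n_categories):
    if category is None:
        return "-" * n_categories                  # missing information
    return "-" * category + "+" + "-" * (n_categories - 1 - category)

def encode_case(hair, eyes):
    # 2 hair columns (dark, light) + 3 eye columns (brown, other, blue)
    return encode(hair, 2) + encode(eyes, 3)

print(encode_case(0, 0))      # dark hair, brown eyes: +-+--
print(encode_case(None, 2))   # bald (hair unknown), blue eyes: ----+
```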