**8.5.3. Maximum Likelihood Decision Rule Based on Penalty Functions**

In a classification problem, we are given a data set X = { xi | i = 1, 2, ..., N }, where each xi is a vector considered as a piece of evidence. A piece of evidence may support any of a number of classes (hypotheses) H = { hj | j = 1, 2, ..., M }. To develop the general method for maximum likelihood classification, the penalty function, or loss function, is introduced:

l(j|k),  j, k = 1, ..., M.

This is a measure of the loss or penalty incurred when a piece of evidence supports class hj when in fact it should support class hk. It is reasonable to assume that l(j|j) = 0 for all j, which implies that there is no loss when evidence supports the correct class. For a particular piece of evidence xi, the penalty incurred when xi erroneously supports hj is:

l(j|k) · p(hk|xi)

where p(hk|xi) is, as before, the posterior probability that hk is the correct class for evidence xi. Averaging the penalty over all possible hypotheses, we obtain the average penalty, called the conditional average loss, associated with evidence xi erroneously supporting class hj. That is:

L(hj) = Σk l(j|k) · p(hk|xi),  summed over k = 1, ..., M.

L(hj) is a measure of the accumulated penalty incurred, given that the evidence could have supported any of the available classes, weighted by the penalty functions relating all those classes to class hj.
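The conditional average loss can be computed directly when a loss matrix and the posterior probabilities are available. A minimal sketch, with all numbers hypothetical:

```python
import numpy as np

# Hypothetical loss matrix for M = 3 classes: loss[j, k] = l(j|k),
# the penalty for supporting class hj when hk is the correct class.
# The diagonal is zero, since l(j|j) = 0.
loss = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])

# Hypothetical posterior probabilities p(hk|xi) for one piece of evidence xi.
posterior = np.array([0.2, 0.5, 0.3])

# Conditional average loss L(hj) = sum over k of l(j|k) * p(hk|xi),
# i.e. one matrix-vector product giving all M values at once.
L = loss @ posterior
```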

Thus a useful decision rule for evaluating a piece of evidence for support of a class is to choose that class for which the average loss is the smallest, i.e.,

xi encourages hj, if L(hj) < L(hk) for all k ≠ j.

This is the decision rule that implements Bayes' rule. Because p(hk|xi) is usually not available directly, it is evaluated from p(xi|hk), p(hk), and p(xi) by Bayes' theorem:

p(hk|xi) = p(xi|hk) · p(hk) / p(xi)

Thus

L(hj) = (1/p(xi)) · Σk l(j|k) · p(xi|hk) · p(hk)
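With this substitution the minimum-loss decision can be carried out from the class-conditional likelihoods p(xi|hk) and priors p(hk) alone; the factor 1/p(xi) is the same for every class and can be dropped when comparing. A minimal sketch, with all numbers hypothetical:

```python
import numpy as np

# Hypothetical loss matrix l(j|k) for M = 3 classes (zero diagonal).
loss = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])

likelihood = np.array([0.10, 0.30, 0.05])  # hypothetical p(xi|hk)
prior = np.array([0.5, 0.3, 0.2])          # hypothetical p(hk)

# The common factor 1/p(xi) scales every L(hj) equally, so the
# comparison can use the unnormalized sum directly.
L_unnormalized = loss @ (likelihood * prior)

# Minimum-loss decision: the class with the smallest average loss.
j_star = int(np.argmin(L_unnormalized))
```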

The l(j|k)'s can be defined by domain experts.

A special case for the l(j|k)'s is given as follows:

Suppose l(j|k) = 1 - Fjk, with Fjj = 1 and the remaining Fjk to be defined. Then from the above formula we have

L(hj) = (1/p(xi)) · Σk (1 - Fjk) · p(xi|hk) · p(hk)

= 1 - (1/p(xi)) · Σk Fjk · p(xi|hk) · p(hk),

since Σk p(xi|hk) · p(hk) = p(xi).

The minimum penalty decision rule then becomes a search for the maximum of g(hj), where

g(hj) = Σk Fjk · p(xi|hk) · p(hk)

(the common factor 1/p(xi) is dropped, since it does not affect the comparison).

Thus the decision rule is

xi encourages hj, if g(hj) > g(hk) for all k ≠ j.
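The equivalence between minimizing L(hj) and maximizing g(hj) can be checked numerically. A minimal sketch, where the F matrix and all probabilities are hypothetical illustrations:

```python
import numpy as np

# Hypothetical F matrix with Fjj = 1; off-diagonal entries express
# partial credit for confusing related classes.
F = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.4],
              [0.0, 0.4, 1.0]])

likelihood = np.array([0.10, 0.30, 0.05])  # hypothetical p(xi|hk)
prior = np.array([0.5, 0.3, 0.2])          # hypothetical p(hk)

joint = likelihood * prior   # p(xi|hk) * p(hk)
p_x = joint.sum()            # p(xi) = sum over k of p(xi|hk) * p(hk)

g = F @ joint                # g(hj) = sum over k of Fjk * p(xi|hk) * p(hk)
L = 1.0 - g / p_x            # conditional average loss for l(j|k) = 1 - Fjk

# Maximizing g picks the same class as minimizing L.
j_star = int(np.argmax(g))
```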

If Fjk = δjk, the Kronecker delta function, i.e.,

δjk = 1 if j = k, and 0 otherwise,

g(hj) is further simplified to

g(hj) = p(xi|hj) · p(hj),

and thus the decision rule becomes

xi encourages hj if p(xi|hj) · p(hj) > p(xi|hk) · p(hk) for all k ≠ j.

This is the commonly-used maximum likelihood decision rule.
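As a final sketch (all numbers hypothetical), the rule amounts to comparing the products p(xi|hk) · p(hk) across the classes:

```python
import numpy as np

likelihood = np.array([0.10, 0.30, 0.05])  # hypothetical p(xi|hk)
prior = np.array([0.5, 0.3, 0.2])          # hypothetical p(hk)

# g(hj) = p(xi|hj) * p(hj); dividing by p(xi) would yield the posterior
# but does not change which class is largest.
g = likelihood * prior

# Maximum likelihood decision: the class with the largest product.
j_star = int(np.argmax(g))
```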