8.5 Application of Probability Theory

 

In a logical expression, e implies h, written e → h, can alternatively be read as 'e is sufficient for h' or as 'h is necessary for e'. There is no ambiguity in the relation between e and h; the reliability is 100%. In reality, however, the reliability of e in support of h is lower than that of a logical implication.

 

8.5.1. Necessity and Sufficiency Measures

 

A piece of evidence e is usually in one of two states: absent or present. When P(e) = 0 or P(e) = 1, e is of no practical interest: either way there is nothing to observe. The same holds for h. Therefore, we shall assume 0 < P(e) < 1 and 0 < P(h) < 1.

To study the necessity and sufficiency measures of e for h, we need to explore the influence that a state of e has on h. If the state of e makes h more plausible, we say that the state of e encourages h. If it makes h less plausible, we say that the state of e discourages h. If it neither encourages nor discourages h, then the state of e has no influence on h, or e and h are independent of each other.

For the necessity measure, we first explore how the absence of e influences h. From

O(h|¬e) = N · O(h)

we define

N = O(h|¬e) / O(h) ,    0 ≤ N ≤ ∞ .

Similarly, from O(h|e) = S · O(h) we define the sufficiency measure S = O(h|e) / O(h). The value of S is interpreted as follows:

S = ∞ : P(h|e) = 1 , e → h , ∴ e is sufficient for h

1 < S < ∞ : P(h|e) > P(h) , e encourages h

S = 1 : no influence

0 < S < 1 : P(h|e) < P(h) , e discourages h

S = 0 : P(h|e) = 0 , e → ¬h , h → ¬e , ∴ e is sufficient for ¬h .

From the above analysis, it is clear that N and S are the measures of necessity and sufficiency, respectively. The values of N, S and O(h) needed to evaluate O(h|¬e) and O(h|e) are provided by domain experts. Quite often, instead of directly supplying N and S, domain experts may supply values of P(e|h) and P(e|¬h); that is, they observe the evidential probabilities under a given hypothesis h or ¬h. N and S are then obtained as:

N = P(¬e|h) / P(¬e|¬h) = (1 − P(e|h)) / (1 − P(e|¬h))

S = P(e|h) / P(e|¬h) .
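These conversions can be sketched in a few lines of Python. The probability values below are illustrative assumptions, and the helper names (`odds`, `prob`, `sufficiency`, `necessity`) are hypothetical:

```python
# Sketch: computing the sufficiency and necessity measures S and N from
# expert-supplied conditional probabilities P(e|h) and P(e|~h), then
# updating the odds of h. All numeric values are assumptions.

def odds(p):
    """Convert a probability to odds, O = P / (1 - P)."""
    return p / (1.0 - p)

def prob(o):
    """Convert odds back to a probability, P = O / (1 + O)."""
    return o / (1.0 + o)

def sufficiency(p_e_h, p_e_nh):
    """S = P(e|h) / P(e|~h)."""
    return p_e_h / p_e_nh

def necessity(p_e_h, p_e_nh):
    """N = P(~e|h) / P(~e|~h) = (1 - P(e|h)) / (1 - P(e|~h))."""
    return (1.0 - p_e_h) / (1.0 - p_e_nh)

# Example: e is much more likely to be observed under h than under ~h.
S = sufficiency(0.8, 0.2)   # 4.0  -> presence of e encourages h
N = necessity(0.8, 0.2)     # ~0.25 -> absence of e discourages h

p_h = 0.5                   # prior probability of h
o_h_given_e  = S * odds(p_h)   # posterior odds when e is present
o_h_given_ne = N * odds(p_h)   # posterior odds when e is absent
print(round(prob(o_h_given_e), 3))    # 0.8
print(round(prob(o_h_given_ne), 3))   # 0.2
```

Note that since S > 1 and N < 1 here, observing e raises the probability of h from 0.5 to 0.8, while observing its absence lowers it to 0.2, matching the interpretation table above.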

8.5.2. Posterior Probability Estimation

 

The previous section explained that, in order to determine the necessity and sufficiency measures N and S, the conditional probabilities P(e|h) and P(e|¬h) are provided by domain experts. Sometimes the system engineer may have to participate in the process of determining P(e|h) and P(e|¬h), as will be explained later in this lecture (e.g., classification of land-use/cover types from remotely sensed images).

In spatial data handling, domain experts may provide the spatial data required, or we may be requested to collect further data from sources such as remote sensing images. Domain experts may also provide their knowledge on where a specific hypothesis has been validated. It may be our responsibility to transform this type of knowledge into a computer system. The processes of collecting and encoding expert knowledge are called knowledge acquisition and knowledge representation, respectively. While various complex computer structures for knowledge representation may be used, relatively simple procedures such as parametric statistical models or non-parametric look-up tables are common. For the parametric method, a further reading is Richards (1986); for the non-parametric approach, refer to Duda and Hart (1973). Remote sensing image classification can be considered a process of hypothesis testing in which remotely sensed data are treated as evidence and a number of classes represent a list of hypotheses. In remote sensing image classification, the equivalent of the knowledge acquisition and representation processes is supervised training (Gong and Howarth, 1990; Gong and Dunlop, 1991).
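As a rough sketch of the non-parametric look-up-table idea, relative frequencies of quantized values per class, tallied from training samples, can serve as estimates of p(e|h). The training counts and class names below are invented for illustration:

```python
# Sketch: a look-up table as a simple knowledge representation.
# Relative frequencies from (assumed) training samples estimate p(value|class).
from collections import Counter

# Hypothetical training pixels: (quantized spectral value, class label).
training = [(10, "water"), (10, "water"), (12, "water"),
            (30, "forest"), (31, "forest"), (30, "forest"), (12, "forest")]

table = {}  # table[class][value] = estimated p(value | class)
for value, label in training:
    table.setdefault(label, Counter())[value] += 1
for label, counts in table.items():
    total = sum(counts.values())
    table[label] = {v: c / total for v, c in counts.items()}

print(round(table["water"][10], 3))   # 0.667  (2 of 3 water samples)
print(round(table["forest"][30], 3))  # 0.5    (2 of 4 forest samples)
```

Supervised training in image classification builds exactly this kind of per-class frequency knowledge, whether stored as a table (non-parametric) or summarized by distribution parameters (parametric).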

8.5.3. Maximum Likelihood Decision Rule Based on Penalty Functions

 

In a classification problem, we are given a data set X = { xi | i = 1, 2, ... , N }, where each xi, being a vector, is considered a piece of evidence. It may support a number of classes (hypotheses) H = { hj | j = 1, 2, ... , M }. To develop the general method for maximum likelihood classification, the penalty function (or loss function) is introduced:

l(j|k) ,   j, k = 1, ... , M .

This is a measure of the loss or penalty incurred when a piece of evidence supports class hj when in fact it should support class hk. It is reasonable to assume that l(j|j) = 0 for all j; this implies that there is no loss when evidence supports the correct class. For a particular piece of evidence xi, the penalty incurred when xi erroneously supports hj is:

l(j|k) · p(hk|xi)

where p(hk|xi) is, as before, the posterior probability that hk is the correct class for evidence xi. Averaging the penalty over all possible hypotheses, we obtain the average penalty, called the conditional average loss, associated with evidence xi supporting class hj. That is:

L(hj) = Σ_{k=1}^{M} l(j|k) p(hk|xi)

L is a measure of the accumulated penalty incurred, given that the evidence could have supported any of the available classes, weighted by the penalty functions relating all these classes to class hj.
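A minimal numeric sketch of the conditional average loss, with an assumed 3-class penalty matrix and assumed posterior probabilities:

```python
# Sketch: L(h_j) = sum_k l(j|k) * P(h_k|x_i) for an assumed 3-class problem.
loss = [            # l(j|k): penalty for supporting h_j when h_k is correct
    [0.0, 1.0, 2.0],   # note l(j|j) = 0 on the diagonal
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
posterior = [0.6, 0.3, 0.1]   # P(h_k | x_i), assumed known here

def conditional_average_loss(j, loss, posterior):
    """L(h_j): penalty averaged over all hypotheses h_k."""
    return sum(loss[j][k] * p for k, p in enumerate(posterior))

L = [conditional_average_loss(j, loss, posterior) for j in range(3)]
best = min(range(3), key=lambda j: L[j])
print([round(v, 3) for v in L])   # [0.5, 0.7, 1.5]
print(best)                       # 0 -> the smallest average loss
```

The class with the smallest average loss (index 0 here) is the one the evidence should be taken to support, which is exactly the decision rule developed next.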

Thus a useful decision rule for evaluating a piece of evidence for support of a class is to choose that class for which the average loss is the smallest, i.e.,

xi encourages hj , if L(hj) < L(hk) for all k ≠ j .

This is the algorithm that implements Bayes' decision rule. Because p(hk|xi) is usually not available, it is evaluated from p(xi|hk), p(hk) and p(xi) using Bayes' formula:

p(hk|xi) = p(xi|hk) p(hk) / p(xi)

Thus

L(hj) = (1 / p(xi)) Σ_{k=1}^{M} l(j|k) p(xi|hk) p(hk)

The l(j|k)'s can be defined by domain experts.

A special case for the l(j|k)'s is given as follows:

Suppose l(j|k) = 1 − Fjk with Fjj = 1 and the remaining Fjk to be defined. Then from the above formula we have

L(hj) = (1 / p(xi)) Σ_{k=1}^{M} p(xi|hk) p(hk) − (1 / p(xi)) Σ_{k=1}^{M} Fjk p(xi|hk) p(hk)

= 1 − (1 / p(xi)) Σ_{k=1}^{M} Fjk p(xi|hk) p(hk)

since Σ_{k=1}^{M} p(xi|hk) p(hk) = p(xi).

Since 1/p(xi) is a positive constant common to all classes, the minimum penalty decision rule becomes a search for the maximum of g(hj), which is

g(hj) = Σ_{k=1}^{M} Fjk p(xi|hk) p(hk) .

Thus the decision rule is

xi encourages hj , if g(hj) > g(hk) for all k ≠ j .
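The g(hj) rule can be sketched as follows; the Fjk weights, likelihoods and priors are illustrative assumptions only (Fjj = 1 as required, off-diagonal Fjk expressing how acceptable it is to confuse two classes):

```python
# Sketch: with l(j|k) = 1 - F_jk, minimizing L(h_j) is equivalent to
# maximizing g(h_j) = sum_k F_jk * p(x_i|h_k) * p(h_k).
F = [                 # F_jk: assumed similarity weights, F_jj = 1
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
likelihood = [0.4, 0.3, 0.1]   # p(x_i | h_k), assumed
prior = [0.2, 0.5, 0.3]        # p(h_k), assumed

def g(j):
    """g(h_j) = sum_k F_jk p(x_i|h_k) p(h_k)."""
    return sum(F[j][k] * likelihood[k] * prior[k] for k in range(3))

scores = [g(j) for j in range(3)]
best = max(range(3), key=lambda j: scores[j])
print([round(s, 3) for s in scores])   # [0.155, 0.205, 0.105]
print(best)                            # 1 -> x_i encourages the second class
```

Here the second class wins even though the first has the larger likelihood, because its prior and its partial credit from neighboring classes (Fjk = 0.5) raise its score.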

If Fjk = δjk , the delta function, i.e.,

δjk = 1 if j = k , and 0 if j ≠ k ,

then g(hj) is further simplified to

g(hj) = p(xi|hj) p(hj)

and thus the decision rule becomes

xi encourages hj , if p(xi|hj) p(hj) > p(xi|hk) p(hk) for all k ≠ j .

This is the commonly-used maximum likelihood decision rule.
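A compact sketch of this rule with one-dimensional Gaussian class-conditional densities; the class names, means, variances and priors are invented for illustration:

```python
# Sketch of the maximum likelihood decision rule: choose the class h_j
# maximizing p(x_i|h_j) * p(h_j). Class-conditional densities are 1-D
# Gaussians with assumed parameters (illustrative land-cover classes).
import math

def gaussian(x, mean, var):
    """1-D Gaussian density serving as p(x | h_j)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

classes = {            # hypothetical class parameters and priors p(h_j)
    "water":  {"mean": 10.0, "var": 4.0,  "prior": 0.2},
    "forest": {"mean": 30.0, "var": 9.0,  "prior": 0.5},
    "urban":  {"mean": 50.0, "var": 16.0, "prior": 0.3},
}

def classify(x):
    """Return the class maximizing p(x|h_j) * p(h_j)."""
    return max(classes,
               key=lambda c: gaussian(x, classes[c]["mean"], classes[c]["var"])
                             * classes[c]["prior"])

print(classify(12.0))   # water
print(classify(28.0))   # forest
```

In practice, for multispectral images x is a vector and the Gaussians are multivariate, with means and covariance matrices estimated during supervised training; the decision rule itself is unchanged.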