
     Optimal Data Clusterization by Inductive Sorting-Out Method

           Gregory A. Ivakhnenko, Alexey G. Ivakhnenko
 (252034 Kiev, Volodymirska 51/53, kv.14, e-mail: gai@insight.kiev.ua)


    Keywords:  Pattern recognition, process forecasting, diagnostics,
               fuzzy objects, adequateness law, clusterisation.

                              ABSTRACT

   Summary:  A clusterisation of the input data sample that is
 optimal by the balance-of-clusterisations criterion is found by a
 rationally organized sorting-out procedure. For fuzzy objects,
 interval discretisation of the input data sample and a search for
 analogs of each realization are the essential parts of the criterion
 calculation. The algorithm finds the optimal clusterisations of the
 input data sample among all possible clusterisations. The number of
 interval discretisation levels and the random-noise compensation
 coefficient are chosen so as to regularize the choice of the optimal
 clusterisation.

      COMPLETE AND PARTIAL MUTUAL COMPENSATION OF "BLACK BOXES"
      OF FUZZY OBJECTS BY BALANCE OF CLUSTERISATIONS CALCULATION.

   Almost all objects of recognition and control in economics,
 ecology, biology and medicine are non-deterministic, or fuzzy. They
 can be represented by a deterministic (robust) part and additional
 black boxes acting on each output of the object. The only
 information about these boxes is that their output variables are
 bounded and are similar for similar states of the object.
   According to W.R. Ashby [1], the diversity of a control system
 must be no smaller than the diversity of the object itself. The "Law
 of Adequateness", given by Stafford Beer, establishes that for
 optimal control the black boxes of the object are to be compensated
 by corresponding black boxes of the control system [2]. For optimal
 pattern recognition and clusterisation only partial compensation is
 necessary. Moreover, we are interested in minimizing the degree of
 compensation by all means, in order to get more accurate results.
    The simplest representation of the black-box outputs is a set of
 random numbers with uniform distribution. Random numbers can be
 compensated by other random numbers only after smoothing or
 averaging. This operation is realized by interval discretisation of
 the input data into D levels.
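
   The discretisation step can be illustrated with a minimal sketch.
 It assumes equal-frequency levels (the paper later asks that each
 level hold an almost equal number of points); the function name
 `discretise`, the use of NumPy and the rank-based rule are
 illustrative choices, not the authors' implementation.

```python
import numpy as np

def discretise(x, d):
    """Map each column of x onto integer levels 0..d-1 with roughly equal counts.

    Assumption: levels are assigned by rank, so equal values may fall
    into neighbouring levels when they straddle a level boundary.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Rank of each value within its column (0 = smallest value).
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # A point with rank r is placed on level floor(r * d / n).
    return (ranks * d) // n

sample = np.array([[0.1, 5.0],
                   [0.4, 3.0],
                   [0.9, 1.0],
                   [0.7, 4.0]])
levels = discretise(sample, 2)   # two levels per feature
```

   With D = N every point keeps its own level, while smaller D
 averages more strongly, which is the smoothing effect described
 above.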
   The degree of mutual compensation of the black boxes can easily
 be regulated and optimized by calculating the criterion of balance
 of two clusterisations. It is obtained by constructing the
 clusterisation trees (Fig.1).
                                     
   [Figure 1 diagram not reproduced. Blocks 1-9 are listed below;
   surviving labels: X, X(D), D = var, D = opt, BL.]

      Fig.1.  Algorithm of clusterisation "Pointing Finger".

   The figure shows:
      1   - input data sample;
      2   - interval-discretised sample;
      3,7 - calculation of the distances between points;
      4,8 - first and second hierarchical clusterisation trees;
      6   - interval-discretised tree, calculated taking the analogs
            into account;
      9   - calculation of the clusterisation balance criterion and
            of the number z of clusterisations for which BL = 0,
            for several values of the number of discretisation
            levels D and of the compensation coefficient l, and the
            choice of the point of the D-l plane where z = 3 (two
            trivial clusterisations and the optimal one).

   According to the Widrow theorem, each level should contain an
 almost equal number of input data realizations (points). We
 suppose that in similar states of the object the output variables
 of its black boxes are similar too. To compensate the indefinite
 part of the object's output, the first analog is to be found for
 each point of the data sample. The realizations presented in the
 data sample correspond to points of a multidimensional space. Each
 point has its nearest neighbour, or first analog. The Hamming
 measure of distance is used to find the analogs. Then the sample of
 analogs is calculated according to the weighted summation formula:
       X_{ij}(B,A1) = (1 - l) X_{ij}(B) + l X_{ij}(A1)            (1)
 where:  B  - realization, given in the input data sample;
         A1 - its first analog (nearest neighbour);
         l  - coefficient of black boxes mutual compensation.
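
 The analog search and formula (1) can be sketched as follows. This
 is an illustration under stated assumptions: the second weight in
 (1) is taken to be l, ties between equally near neighbours are
 resolved by taking the lowest index, and the names `first_analogs`
 and `analog_sample` are hypothetical.

```python
import numpy as np

def first_analogs(xd):
    """Index of the nearest neighbour of each row under the Hamming distance."""
    xd = np.asarray(xd)
    # dist[a, b] = number of features on which rows a and b differ.
    dist = (xd[:, None, :] != xd[None, :, :]).sum(axis=2)
    # A point must not be chosen as its own analog.
    np.fill_diagonal(dist, xd.shape[1] + 1)
    return dist.argmin(axis=1)

def analog_sample(x, analog_idx, l):
    """Weighted sum (1 - l) * X(B) + l * X(A1), row by row."""
    x = np.asarray(x, dtype=float)
    return (1.0 - l) * x + l * x[analog_idx]

xd = np.array([[0, 1],
               [0, 0],
               [1, 0],
               [1, 1]])
idx = first_analogs(xd)
mixed = analog_sample(xd, idx, 0.5)
```

 With l = 0 the sample of analogs coincides with the input sample;
 with l = 1 each point is replaced by its first analog.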

 The formula is valid for continuous-valued and interval-discretised
 features. For binary variables voting procedures have been developed
 [3].
    A hierarchical clusterisation tree is constructed for the
 discretised input data sample (B) and for the sample of analogs
 (A1). It has been proved that the construction of a hierarchical
 tree can be considered a procedure that minimizes the sorting-out
 volume without excluding the optimal clusterisation [4],[5].
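
    A minimal sketch of the tree construction is given below. The
 paper does not state the merging rule, so single linkage with a
 city-block distance is assumed here purely for illustration, and
 the name `clusterisation_tree` is hypothetical.

```python
import numpy as np

def clusterisation_tree(x):
    """Return the clusterisations met while merging N points into one cluster.

    Element s-1 of the result is the clusterisation at step s,
    stored as a set of frozensets of point indices.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Pairwise city-block distances between points (an assumption).
    dist = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    clusters = [frozenset([i]) for i in range(n)]
    steps = [set(clusters)]
    while len(clusters) > 1:
        best = None
        # Single linkage: distance between the closest members.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] | clusters[b]
        rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters = rest + [merged]
        steps.append(set(clusters))
    return steps

tree = clusterisation_tree([[0.0], [1.0], [5.0], [6.5]])
```

    Only N clusterisations are visited instead of all possible
 partitions of the sample, which is the reduction of the sorting-out
 volume referred to above.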
    Then the balance of clusterisations is calculated for the two
 hierarchical trees:

              BL = \frac{k - Dk}{k} \to \min                    (2)

 where:  k  - number of clusters;
         Dk - number of similar clusters.
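
 Criterion (2) can be sketched as below, under the assumption that
 "similar clusters" means clusters with identical membership in both
 trees and that the two clusterisations compared at a given step
 have the same number of clusters k. The names `balance`,
 `balance_profile` and `fs` are hypothetical.

```python
def balance(clusters_b, clusters_a):
    """BL = (k - Dk) / k for two clusterisations with k clusters each."""
    k = len(clusters_b)
    if len(clusters_a) != k:
        raise ValueError("both clusterisations must have k clusters")
    # Dk: clusters that appear with identical membership in both (assumption).
    dk = len(set(clusters_b) & set(clusters_a))
    return (k - dk) / k

def balance_profile(tree_b, tree_a):
    """Criterion value at each step of the two trees."""
    return [balance(b, a) for b, a in zip(tree_b, tree_a)]

def fs(*idx):
    return frozenset(idx)

# Hand-made trees for four points: input sample (B) and analogs (A1).
tree_b = [{fs(0), fs(1), fs(2), fs(3)}, {fs(0, 1), fs(2), fs(3)},
          {fs(0, 1), fs(2, 3)}, {fs(0, 1, 2, 3)}]
tree_a = [{fs(0), fs(1), fs(2), fs(3)}, {fs(0), fs(1), fs(2, 3)},
          {fs(0, 1), fs(2, 3)}, {fs(0, 1, 2, 3)}]
profile = balance_profile(tree_b, tree_a)
zeros = sum(1 for v in profile if v == 0.0)
```

 In this toy case the criterion is zero at the first step, at the
 last step and at one intermediate clusterisation.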

   The pointing-out characteristic (Fig.1) shows the change of the
 criterion along the steps of tree construction.
    In addition, the clusterisation balance criterion must equal zero
 at the very beginning and at the very end of tree construction,
 i.e. for the clusterisations s = 1 (every point is a separate
 cluster) and s = N (all points are united into one cluster). This
 can be used to check the reliability of the program. So the minimal
 number of optimal clusterisations pointed out by the finger is
 equal to three. This result can be reached by varying D and l:
                        D = N, N/2, N/3, ..., 2
                        l = 0, 0.05, 0.1, ..., 1.0
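
 The two search grids can be generated as in the sketch below. The
 paper does not say how N/2, N/3, ... are rounded, so integer
 division with repeated values dropped is an assumption, and the
 names `level_grid` and `compensation_grid` are hypothetical.

```python
def level_grid(n):
    """D = N, N/2, N/3, ..., 2, using integer division and dropping repeats."""
    grid = []
    for q in range(1, n + 1):
        d = n // q
        if d < 2:
            break
        if not grid or d != grid[-1]:
            grid.append(d)
    return grid

def compensation_grid(step=0.05):
    """l = 0, step, 2*step, ..., 1.0, rounded to avoid float drift."""
    count = round(1.0 / step)
    return [round(i * step, 10) for i in range(count + 1)]

d_values = level_grid(12)
l_values = compensation_grid()
```

 Each pair (D, l) from these grids gives one run of the balance
 calculation.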

                      .                 
          l         .                 Z   .         .
                 .                          .     .
                .                       ---- .--. ------
              .                               .
             .                                |
                        |
                            D                            D

    The value of the noise compensation coefficient l is always
 chosen so as to get a single zero value of the balance criterion
 somewhere in the middle part of the trees. If the number of optimal
 clusterisations cannot be reduced by increasing the coefficient l,
 experts must be invited to make the final decision.

                              REFERENCES

 1. Ashby, W.R. (1958). An Introduction to Cybernetics, A Wiley
     Company, New York.
 2. Beer, S. (1959). Cybernetics and Management, English Univ. Press,
     London.
 3. Ivakhnenko, A.G. (1991). An Inductive Sorting Method for the
     Forecasting of Multidimensional Random Processes and
     Events with the Help of Analogs Forecast Complexing,
     Pattern Recognition and Image Analysis, Vol.1, No.1, 99-108.
 4. Kovalchuk, P.I. (1983). To problem of inner convergence of
     GMDH algorithms, Sov. Autom. Control, Vol.16, No.2.
 5. [Authors, title and journal in Russian; unreadable in the source
     encoding] (1991). Vol.30, No.11, 165-167.
 6. Farlow, S.J., ed. (1984). Self-organizing Methods in Modeling
     (Statistics: Textbooks and Monographs, Vol.54), Marcel
     Dekker, Inc., New York and Basel.
