User:Julie Dancer/OC backup

Optimal classification is an arrangement of attributes for a set of elements in an Attribute-value system which minimizes the number of queries necessary to identify a particular element within the element set.

The algorithm used for this purpose first sorts the elements in ascending order according to the value of each attribute as determined by its current position in the sequence of attributes. An empirical separatory value is then calculated for each characteristic in the target set and compared to the highest separatory value calculated for all preceding sets. This process repeats until an optimal sequence is found, all permutations have been evaluated or a time limit expires.

The number of elements which are classified as well as the rapidity with which they can be identified is dependent upon both the radix and the exponent of the data. A larger radix will allow faster identification by means of excluding a greater percentage of elements per characteristic. A binary radix for instance excludes only fifty percent of the elements per characteristic whereas a five-valued radix excludes eighty percent of the elements per characteristic.

What follows is a rigorous explanation of the algorithm.

edit
, where:
  • G is the group size or total number of elements in the group,
  • V is the highest value of logic in the group,
  • C is the highest number of characteristics in the group.
, where:
  • R is the set size or total number of elements in the set,
  • V is the highest value of logic in the group,
  • K is the highest number of characteristics in the set.
edit

Maximum Number of Pairs of Elements to Separate

edit

Maximum number of pairs of elements to separate refers to a matrix in which each element is compared with every other element so as to determine the number of pairs that are separable or disjoint. Pairs are separable or disjoint whenever the logic values of the elements that make up a pair are different. In theory therefore the maximum possible numbers of pairs that can be separated is determined by the following equation:

, where:
  • pmax is the maximum number of pairs to separate, and
  • G is the total number of elements in the group.

Order of Elements

edit

The elements are arranged in descending order according to their truth table value, i.e., the value calculated as the sum of each characteristic's logic state value times the highest value of logic raised to the power of the order of the characteristic.

, where:
  • ei is the element truth table value in the group,
  • C is the highest number of characteristics in the group,
  • V is the highest value of logic in the group,
  • v is the value of logic of each characteristic in the group,
  • j is the jth characteristic index, where:
j = 0..K and where:
  • K is the number of characteristics in the set,
  • i is the ith element index, where:
i = 0..G and where:
  • G is the number of elements in the group.
edit

Theoretical Separation

edit
The General Identification Equation
edit
, where:


  • Sj is the theoretical separatory value per jth characteristic,
  • C is the highest number of characteristics in the group,
  • V is the highest value of logic in the group and
  • j is the jth characteristic index in the set, where:
j = 0..K and where:
  • K is the number of characteristics in the target set.
Minimal Number of Characteristics to Result in Theoretical Separation
edit
, where:
  • tmin is the minimal number of characteristics to result in theoretical separation,
  • G is the number of elements in the group and
  • V is the highest value of logic in the group.

Empirical Separation

edit

(Please note that application of these equations can not be expressed entirely using convention mathematical notation or normal MathCAD expressions without including a MathCAD function or MathCAD user program. The equations have been fully implemented using the Zbasic and the Visual Basic programming languages.)


, where:
  • Sj is the empirical separatory value,
  • j is the jth characteristic index,
  • G is the number of elements in the group,
  • R is the number of elements in the set, where:
R = V^K, where:
  • V is the highest value of logic in the group and
  • K is the number of characteristics in the set.
  • i is the truth table index value of the set, where:
i = 0..R and
, where:
  • is the truth table value of the ith element in the set
  • v is the logic value of the jth characteristic.

Flag Recognition - Application Example

edit

(A flag identification example from Neural Network Identification example.)

Although this example lacks an intuitive sense of optimization it still serves as a good example of how optimization can reduce the number of queries which must be made.

Flag Overlay Grid

edit

 

Designated areas (characteristics) for sampling background colors (states) of all flags (elements).


The overlay is used to determine the color of each area for each flag and the color is recorded in the table as the logical state of the area. The table data is then submitted to the optimization program and processed until the optimal empirical separatory value is obtained.

Original Order

edit
FLAGS/LOC A B C D E F G H I
Belgium   BLACK YELLOW ORANGE BLACK YELLOW ORANGE BLACK YELLOW ORANGE
France   BLUE WHITE RED BLUE WHITE RED BLUE WHITE RED
Germany   BLACK BLACK BLACK RED RED RED YELLOW YELLOW YELLOW
Ireland   GREEN WHITE ORANGE GREEN WHITE ORANGE GREEN WHITE ORANGE
Italy....   GREEN WHITE RED GREEN WHITE RED GREEN WHITE RED
Japan..   WHITE WHITE WHITE WHITE RED WHITE WHITE WHITE WHITE
Luxembourg   RED RED RED WHITE WHITE WHITE BABY BABY BABY
Netherlands   RED RED RED WHITE WHITE WHITE BLUE BLUE BLUE
Spain..   RED RED RED YELLOW YELLOW YELLOW RED RED RED

Systematic Query

edit

Starting with area "A" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of the Netherlands. The answer to the first query in regard to area "A" is RED which would remove 2/3 of the flags from further consideration. The next query for the color in area "B" would be RED which would serve to eliminate none of the remaining flags. In fact we would not be able to eliminate any remaining flags until column "G" when the answer would be BLUE. Here all remaining flags except the flag of the Netherlands would be eliminated from further consideration. It would therefore take a minimum of seven queries to establish the identity of the flag in our possession as belonging to the Netherlands.

Optimized Order

edit
FLAGS/LOC G F C B I D E H A
Spain..   RED YELLOW RED RED RED YELLOW YELLOW RED RED
Luxembourg   BABY WHITE RED RED BABY WHITE WHITE BABY RED
Japan..   WHITE WHITE WHITE WHITE WHITE WHITE RED WHITE WHITE
Italy....   GREEN RED RED WHITE RED GREEN WHITE WHITE GREEN
Ireland   GREEN ORANGE ORANGE WHITE ORANGE GREEN WHITE WHITE GREEN
Germany   YELLOW RED BLACK BLACK YELLOW RED RED YELLOW BLACK
Netherlands   BLUE WHITE RED RED BLUE WHITE WHITE BLUE RED
France   BLUE RED RED WHITE RED BLUE WHITE WHITE BLUE
Belgium   BLACK ORANGE ORANGE YELLOW ORANGE BLACK YELLOW YELLOW BLACK
Theoretical 85.71 97.96 99.71 99.96 99.99 100 . . .
Empirical 94.44 100 . . . . . . .
Best Sequence 7 6 3 2 9 4 5 8 1

Minimized Query

edit

The results of optimization are shown above and include a listing of the theoretical and empirical percentages. The original characteristic sequence is indexed in the bottom row. Starting with area "G" the query begins by asking for the color in this area of the flag. Suppose we have in our possession the flag of Ireland. The answer to this first query would be GREEN. The next query is for the color in area "F" to which we would answer ORANGE. Since no other flags have this combination of GREEN and ORANGE in these areas our query can end here. The algorithm has optimized the order of characteristics and minimized the number of queries that are required to identify the flag. (Please note that there may be more than one optimal solution.)

References

edit

Primary Reference

edit

Biological Identification with Computers edited by R.J. Pankhurst, British museum (natural history) London, England proceedings of a meeting held at Kings College, Cambridge 27 and 28 September 1973 of the Systematics Association Special Volume Number 7 and published by the Academic Press 1975 noting the work of Eugene W. Rypka, Dept. of Microbiology, Lovelace Center for Health Sciences, Albuquerque, New Mexico, "Pattern Recognition and Microbial Identification."

  • Eugene Weston Rypka passed away on April 27, 2006. Gene was born on May 6, 1925 in Owatonna, MN to Charles Frederick and Ethel Marie Rypka. He served in World War II as a paramedic in Iwo Jima and received several medals and commendations. In 1958, Gene received a Ph.D. in Medical Microbiology from Stanford University. He had a long and distinguished career, including work with Russian scientists at Lovelace Medical Center and the University of New Mexico. Bicycle racing was a lifetime love and occupation, and in later years, he also studied martial arts.

See also

edit
edit

/* I broke your category links. The "User:Julie Dancer/OC backup" page was showing up on those category pages. Probably not what you meant to do. Just remove the extra space in each one to put it back.*/

[ [Category:Optimization]] [ [Category:Classification algorithms|*]] [ [Category:Knowledge representation]] [ [Category:Knowledge discovery in databases]] [ [Category:Algorithms]] [ [Category:Machine learning]] [ [Category:Taxonomy]] [ [de:Kategorie:Optimierungsalgorithmus]] [ [fr:Catégorie:Algorithme d'optimisation]] [ [ko:분류:최적화 알고리즘]] [ [vi:Phân loại bằng thống kê]]