Open main menu

Multi-label classification

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to.

Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).

Contents

Problem transformation methodsEdit

Several problem transformation methods exist for multi-label classification; the baseline approach, called the binary relevance method,[1][2] amounts to independently training one binary classifier for each label. Given an unseen sample, the combined model then predicts all labels for this sample for which the respective classifiers predict a positive result. Although this method of dividing the task into multiple binary tasks may resemble superficially the one-vs.-all (OvA) and one-vs.-rest (OvR) methods for multiclass classification, it is essentially different from both, because a single classifier under binary relevance deals with a single label, without any regard to other labels whatsoever.

Various other transformations exist. Of these, the label powerset (LP) transformation creates one binary classifier for every label combination attested in the training set. The random k-labelsets (RAKEL) algorithm uses multiple LP classifiers, each trained on a random subset of the actual labels; prediction using this ensemble method proceeds by a voting scheme.[3]

Classifier chains are an alternative ensembling methods [1][2][4] that have been applied, for instance, in HIV drug resistance prediction.[5][6] Bayesian network was also applied to guide ordering of classifiers in the Classifier chains.[7]

Adapted algorithmsEdit

Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations. Examples of these include:

  • boosting: AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for multi-label data.
  • k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.[8]
  • decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations.[9] MMC, MMDT, and SSC refined MMDT, can classify multi-labeled data based on multi-valued attributes without transforming the attributes into single-values. They are also named multi-valued and multi-labeled decision tree classification methods.[10][11][12]
  • kernel methods for vector output
  • neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.[13]

Learning paradigmsEdit

Based on learning paradigms, the existing multi-label classification techniques can be classified into batch learning and online machine learning. Batch learning algorithms require all the data samples to be available beforehand. It trains the model using the entire training data and then predicts the test sample using the found relationship. The online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample, xt and predicts its label(s) ŷt using the current model; the algorithm then receives yt, the true label(s) of xt and updates its model based on the sample-label pair: (xt, yt). Recently, a new learning paradigm called the progressive learning technique has been developed.[14] The progressive learning technique is capable of not only learning from new samples but also capable of learning multiple new labels of data being introduced to the model and yet retain the knowledge learnt thus far.[15]

Multi-label Stream ClassificationEdit

Data streams are possibly infinite sequences of data that continuously and rapidly grow over time.[16] Multi-label stream classification (MLSC) is the version of multi-label classification task that takes place in data streams. It is sometimes also called online multi-label classification. The difficulties of multi-label classification (exponential number of possible label sets, capturing dependencies between labels) are combined with difficulties of data streams (time and memory constraints, addressing infinite stream with finite means, concept drifts).

Many MLSC methods resort to ensemble methods in order to increase their predictive performance and deal with concept drifts. Below are the most widely used ensemble methods in the literature:

  • Online Bagging (OzaBagging [17])-based methods: Observing the probability of having K many of a certain data point in a bootstrap sample is approximately Poisson(1) for big datasets, each incoming data instance in a data stream can be weighted proportional to Poisson(1) distribution to mimic bootstrapping in an online setting. This is called Online Bagging (OzaBagging). Many multi-label methods that use Online Bagging are proposed in the literature, each of which utilizes different problem transformation methods. EBR[1], ECC[1], EPS[18], EBRT [19], EBMT [19], ML-Random Rules [20] are examples of such methods.
  • ADWIN Bagging[21]-based methods: Online Bagging methods for MLSC are sometimes combined with explicit concept drift detection mechanisms such as ADWIN [22] (Adaptive Window). ADWIN keeps a variable-sized window to detect changes in the distribution of the data, and improves the ensemble by resetting the components that perform poorly when there is a drift in the incoming data. Generally, the letter 'a' is used as a subscript in the name of such ensembles to indicate the usage of ADWIN change detector. EaBR [21], EaCC [21], EaHTPS [21] are examples of such multi-label ensembles.
  • GOOWE-ML[23]-based methods: Interpreting the relevance scores of each component of the ensemble as vectors in the label space and solving a least squares problem at the end of each batch, Geometrically-Optimum Online-Weighted Ensemble for Multi-label Classification (GOOWE-ML) is proposed. The ensemble tries to minimize the distance between the weighted prediction of its components and the ground truth vector for each instance over a batch. Unlike Online Bagging and ADWIN Bagging, GOOWE-ML utilizes a weighted voting scheme where better performing components of the ensemble are given more weight. The GOOWE-ML ensemble grows over time, and the lowest weight component is replaced by a new component when it is full at the end of a batch. GOBR [23], GOCC [23], GOPS [23], GORT [23] are the proposed GOOWE-ML-based multi-label ensembles.
  • Multiple Windows [24] : Here, BR models that use a sliding window are replaced with two windows for each label, one for relevant and one for non-relevant examples. Instances are oversampled or undersampled according to a load factor that is kept between these two windows. This allows concept drifts that are independent for each label to be detected, and class-imbalance (skewness in the relevant and non-relevant examples) to be handled.

Statistics and evaluation metricsEdit

The extent to which a dataset is multi-label can be captured in two statistics:

  • Label cardinality is the average number of labels per example in the set:  ;
  • Label density is the number of labels per sample divided by the total number of labels, averaged over the samples:   where  .

Evaluation metrics for multi-label classification performance are inherently different from those used in multi-class (or binary) classification, due to the inherent differences of the classification problem. If T denotes the true set of labels for a given sample, and P the predicted set of labels, then the following metrics can be defined on that sample:

  • Hamming loss: the fraction of the wrong labels to the total number of labels, i.e.  , where   is the target and   is the prediction. This is a loss function, so the optimal value is zero.
  • The closely related Jaccard index, also called Intersection over Union in the multi-label setting, is defined as the number of correctly predicted labels divided by the union of predicted and true labels,  
  • Precision, recall and   score: precision is  , recall is  , and   is their harmonic mean.[25]
  • Exact match (also called Subset accuracy): is the most strict metric, indicating the percentage of samples that have all their labels classified correctly.

Cross-validation in multi-label settings is complicated by the fact that the ordinary (binary/multiclass) way of stratified sampling will not work; alternative ways of approximate stratified sampling have been suggested.[26]

Implementations and datasetsEdit

Java implementations of multi-label algorithms are available in the Mulan and Meka software packages, both based on Weka.

The scikit-learn Python package implements some multi-labels algorithms and metrics.

The binary relevance method, classifier chains and other multilabel algorithms with a lot of different base learners are implemented in the R-package mlr[27]

A list of commonly used multi-label data-sets is available at the Mulan website.

See alsoEdit

ReferencesEdit

  1. ^ a b c d Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. Machine Learning Journal. Springer. Vol. 85(3), (2011).
  2. ^ a b Read, Jesse; Martino, Luca; Luengo, David (2014-03-01). "Efficient monte carlo methods for multi-dimensional learning with classifier chains". Pattern Recognition. Handwriting Recognition and other PR Applications. 47 (3): 1535–1546. arXiv:1211.2190. doi:10.1016/j.patcog.2013.10.006.
  3. ^ Tsoumakas, Grigorios; Vlahavas, Ioannis (2007). Random k-labelsets: An ensemble method for multilabel classification (PDF). ECML.
  4. ^ Read, Jesse; Martino, Luca; Olmos, Pablo M.; Luengo, David (2015-06-01). "Scalable multi-output label prediction: From classifier chains to classifier trellises". Pattern Recognition. 48 (6): 2096–2109. arXiv:1501.04870. doi:10.1016/j.patcog.2015.01.004.
  5. ^ Heider, D; Senge, R; Cheng, W; Hüllermeier, E (2013). "Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction". Bioinformatics. 29 (16): 1946–52. doi:10.1093/bioinformatics/btt331. PMID 23793752.
  6. ^ Riemenschneider, M; Senge, R; Neumann, U; Hüllermeier, E; Heider, D (2016). "Exploiting HIV-1 protease and reverse transcriptase cross-resistance information for improved drug resistance prediction by means of multi-label classification". BioData Mining. 9: 10. doi:10.1186/s13040-016-0089-1. PMC 4772363. PMID 26933450.
  7. ^ Soufan, Othman; Ba-Alawi, Wail; Afeef, Moataz; Essack, Magbubah; Kalnis, Panos; Bajic, Vladimir B. (2016-11-10). "DRABAL: novel method to mine large high-throughput screening assays using Bayesian active learning". Journal of Cheminformatics. 8: 64. doi:10.1186/s13321-016-0177-8. ISSN 1758-2946.
  8. ^ Zhang, M.L.; Zhou, Z.H. (2007). "ML-KNN: A lazy learning approach to multi-label learning". Pattern Recognition. 40 (7): 2038–2048. doi:10.1016/j.patcog.2006.12.019.
  9. ^ Madjarov, Gjorgji; Kocev, Dragi; Gjorgjevikj, Dejan; Džeroski, Sašo (2012). "An extensive experimental comparison of methods for multi-label learning". Pattern Recognition. 45 (9): 3084–3104. doi:10.1016/j.patcog.2012.03.004.
  10. ^ Chen, Yen-Liang; Hsu, Chang-Ling; Chou, Shih-chieh (2003). "Constructing a multi-valued and multi-labeled decision tree". Expert Systems with Applications. 25 (2): 199–209. doi:10.1016/S0957-4174(03)00047-2.
  11. ^ Chou, Shihchieh; Hsu, Chang-Ling (2005-05-01). "MMDT: a multi-valued and multi-labeled decision tree classifier for data mining". Expert Systems with Applications. 28 (4): 799–812. doi:10.1016/j.eswa.2004.12.035.
  12. ^ Li, Hong; Guo, Yue-jian; Wu, Min; Li, Ping; Xiang, Yao (2010-12-01). "Combine multi-valued attribute decomposition with multi-label learning". Expert Systems with Applications. 37 (12): 8721–8728. doi:10.1016/j.eswa.2010.06.044.
  13. ^ Zhang, M.L.; Zhou, Z.H. (2006). Multi-label neural networks with applications to functional genomics and text categorization (PDF). IEEE Transactions on Knowledge and Data Engineering. 18. pp. 1338–1351.
  14. ^ Dave, Mihika; Tapiawala, Sahil; Meng Joo, Er; Venkatesan, Rajasekar (2016). "A Novel Progressive Multi-label Classifier for Classincremental Data". arXiv:1609.07215 [cs.LG].
  15. ^ Venkatesan, Rajasekar. "Progressive Learning Technique for Multi-label Classification".
  16. ^ Aggarwal, Charu C., ed. (2007). "Data Streams". Advances in Database Systems. doi:10.1007/978-0-387-47534-9.
  17. ^ Oza, Nikunj (2005). "Online Bagging and Boosting". IEEE International Conference on Systems, Man and Cybernetics.
  18. ^ Read, Jesse; Pfahringer, Bernhard; Holmes, Geoff (2008-12-15). "Multi-label Classification Using Ensembles of Pruned Sets". IEEE Computer Society: 995–1000. doi:10.1109/ICDM.2008.74. ISBN 9780769535029.
  19. ^ a b Osojnik, Aljaź; Panov, PanăźE; DźEroski, Sašo (2017-06-01). "Multi-label classification via multi-target regression on data streams". Machine Learning. 106 (6): 745–770. doi:10.1007/s10994-016-5613-5. ISSN 0885-6125.
  20. ^ Sousa, Ricardo; Gama, João (2018-01-24). "Multi-label classification from high-speed data streams with adaptive model rules and random rules". Progress in Artificial Intelligence. 7 (3): 177–187. doi:10.1007/s13748-018-0142-z. ISSN 2192-6352.
  21. ^ a b c d Read, Jesse; Bifet, Albert; Holmes, Geoff; Pfahringer, Bernhard (2012-02-21). "Scalable and efficient multi-label classification for evolving data streams". Machine Learning. 88 (1–2): 243–272. doi:10.1007/s10994-012-5279-6. ISSN 0885-6125.
  22. ^ Bifet, Albert; Gavaldà, Ricard (2007-04-26), "Learning from Time-Changing Data with Adaptive Windowing", Proceedings of the 2007 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 443–448, doi:10.1137/1.9781611972771.42, ISBN 9780898716306, retrieved 2018-11-04
  23. ^ a b c d e Büyükçakir, Alican; Bonab, Hamed; Can, Fazli (2018-10-17). "A Novel Online Stacked Ensemble for Multi-Label Stream Classification". ACM: 1063–1072. doi:10.1145/3269206.3271774. ISBN 9781450360142.
  24. ^ Xioufis, Eleftherios Spyromitros; Spiliopoulou, Myra; Tsoumakas, Grigorios; Vlahavas, Ioannis (2011-07-16). "Dealing with concept drift and class imbalance in multi-label stream classification". AAAI Press: 1583–1588. doi:10.5591/978-1-57735-516-8/IJCAI11-266. ISBN 9781577355144.
  25. ^ Godbole, Shantanu; Sarawagi, Sunita (2004). Discriminative methods for multi-labeled classification (PDF). Advances in Knowledge Discovery and Data Mining. pp. 22–30.
  26. ^ Sechidis, Konstantinos; Tsoumakas, Grigorios; Vlahavas, Ioannis (2011). On the stratification of multi-label data (PDF). ECML PKDD. pp. 145–158.
  27. ^ Philipp Probst, Quay Au, Giuseppe Casalicchio, Clemens Stachl, Bernd Bischl. Multilabel Classification with R Package mlr. The R Journal (2017) 9:1, pages 352-369.

Further readingEdit