Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling

3.1. The Latent Dirichlet Allocation

The latent Dirichlet allocation (LDA) model is a widely used generative probabilistic model for identifying the latent topics in text documents [22]. In LDA, each document is represented as a random mixture of latent topics, and each topic is represented as a distribution over words. The mixture distributions are Dirichlet-distributed random variables to be inferred. In this scheme, each document exhibits the topics in different proportions, each word in each document is drawn from one of the topics, and the topic is chosen according to the per-document distribution over topics [61]. LDA attempts to determine the underlying latent topic structure from the observed data, which are the words of each document. For each document in the corpus, the words are generated by a two-stage procedure. First, a distribution over topics is randomly chosen for the document. Then, for each word of the document, a topic is randomly chosen from this distribution and a word is randomly chosen from the corresponding topic [22]. In LDA, a word is a discrete item from a vocabulary indexed by {1, …, V}, a document is a sequence of N words denoted by w = (w₁, w₂, …, w_N), and a corpus is a collection of M documents denoted by D = {w₁, w₂, …, w_M}. The generative process of LDA is summarized in Box 1.

Box 1: The generative process of LDA (Blei et al., 2013; [19, 20]).

For each document w in a corpus D:
(1) Choose N ~ Poisson (ξ).
(2) Choose Θ ~ Dir (α).
(3) For each of the N words w_n:
(a) Choose a topic z_n ~ Multinomial (Θ).
(b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
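
The steps of Box 1 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used in the study: the helper names (`sample_dirichlet`, `sample_poisson`, `generate_document`) and the toy values of α, β, and ξ are assumptions chosen for the example.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(xi, rng):
    """Draw N ~ Poisson(xi) with Knuth's multiplication method (fine for small xi)."""
    threshold = math.exp(-xi)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def generate_document(alpha, beta, xi, rng):
    """Follow Box 1: draw N, draw theta, then a topic and a word per position."""
    n_words = sample_poisson(xi, rng)          # step (1)
    theta = sample_dirichlet(alpha, rng)       # step (2)
    topics, words = [], []
    for _ in range(n_words):                   # step (3)
        z = rng.choices(range(len(theta)), weights=theta)[0]      # (a) topic
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # (b) word
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy setting (assumed values): 2 topics over a 4-word vocabulary.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1, 0.0],   # topic 0 favours word 0
        [0.0, 0.1, 0.2, 0.7]]   # topic 1 favours word 3
theta, topics, words = generate_document(alpha, beta, xi=8, rng=rng)
```

Here each row of `beta` plays the role of one topic, that is, a distribution over the vocabulary, and `theta` is the per-document topic mixture.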

The LDA process can be modelled by a three-level Bayesian graphical model, as given in Figure 1. In this graphical model, nodes represent random variables and edges denote the possible dependencies between the variables. In this notation, α refers to the Dirichlet parameter, Θ refers to the document-level topic variables, z refers to the per-word topic assignment, w refers to the observed word, and β indicates the topics [61].

Figure 1: The graphical representation of LDA [22].

Based on this notation, the generative process of LDA corresponds to a joint distribution of the hidden and observed variables. The probability density function of a k-dimensional Dirichlet random variable is computed as given by (1), the joint distribution of a topic mixture is computed as given by (2), and the probability of a corpus is computed as given by (3) [22]:

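$$p(\Theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \Theta_1^{\alpha_1 - 1} \cdots \Theta_k^{\alpha_k - 1} \tag{1}$$

$$p(\Theta, z, w \mid \alpha, \beta) = p(\Theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \Theta) \, p(w_n \mid z_n, \beta) \tag{2}$$

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\Theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \Theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\Theta_d \tag{3}$$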

In LDA, computing the posterior distribution of the hidden variables is the key inferential task. Exact inference is intractable, since the number of possible configurations of the hidden variables is exponentially large. Hence, approximation algorithms (such as Laplace approximation, variational approximation, and Gibbs sampling) are used for inference in LDA [61].

3.2. Ensemble Learning Methods

Ensemble learning methods aim to combine the predictions of multiple classification algorithms so that a classification model with higher predictive performance can be achieved [62]. Ensemble methods can be broadly divided into dependent and independent methods. In dependent methods, the outputs of earlier classifiers influence the construction of the subsequent classifiers. In independent methods, in contrast, the classifiers are built separately and their outputs are combined to produce the final prediction. Dependent ensemble methods include Boosting (e.g., the AdaBoost algorithm), and independent methods include Bagging, Dagging, and Random Subspace. To examine the predictive performance of the proposed scheme, four well-known ensemble learning methods (namely, AdaBoost [63], Bagging [64], Random Subspace [65], and Stacking [66]) are considered.
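
As an illustration of an independent ensemble method, the sketch below trains several base classifiers on separate bootstrap samples and combines their outputs by majority vote, in the spirit of Bagging. The base learner (a one-dimensional threshold "stump"), the function names, and the toy data are assumptions made for this example and are not the classifiers evaluated in the study.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a 1-D threshold classifier: predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for candidate, _ in points:
        errors = sum((x >= candidate) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def bagging_fit(points, n_models, rng):
    """Train each stump independently on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the independent predictions by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Toy data (assumed): (feature, label) pairs separable around 0.5.
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
models = bagging_fit(data, n_models=11, rng=random.Random(1))
low_prediction = bagging_predict(models, 0.05)
high_prediction = bagging_predict(models, 0.95)
```

Because no model depends on another, the training loop could run in parallel. A dependent method such as AdaBoost would instead reweight the training instances after each round.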

3.3. Ensemble Pruning Methods

Ensemble pruning methods aim to identify an optimal subset of classification algorithms in order to improve the predictive performance and computational efficiency of multiple classifier systems. To examine the predictive performance of the proposed ensemble pruning algorithm, we have compared it with four existing ensemble pruning algorithms: ensemble pruning from libraries of models [49], Bagging ensemble selection [67], the LibD3C algorithm [68], and ensemble pruning based on combined diversity measures [21].

3.4. Swarm-Based Optimization Algorithms

Swarm-based optimization algorithms, including genetic algorithms, particle swarm optimization, the firefly algorithm, the cuckoo search algorithm, and the bat algorithm, have been successfully employed in data science applications such as data clustering and data categorization [68]. In the proposed scheme, swarm-based optimization algorithms are used to optimize the parameters of LDA-based topic modelling. In addition, the proposed ensemble pruning algorithm employs swarm-based optimization algorithms to group classifiers into clusters. In the empirical analysis, genetic algorithms [69], the particle swarm optimization algorithm [70], the firefly algorithm [71], the cuckoo search algorithm [72], and the bat algorithm [73] are utilized.
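
To give a concrete sense of how such an optimizer searches a parameter space, the following is a minimal particle swarm optimization sketch. It is a generic textbook-style variant and not the configuration used in the study: the objective (`sphere`) is a stand-in for a real fitness function such as a cluster validity index evaluated on an LDA configuration, and the function name, inertia, and acceleration coefficients are assumed values.

```python
import random

def sphere(position):
    """Stand-in objective: sum of squares, minimised at the origin."""
    return sum(x * x for x in position)

def particle_swarm_minimize(objective, dim, bounds, n_particles=20,
                            n_iterations=100, inertia=0.7,
                            cognitive=1.5, social=1.5, seed=0):
    rng = random.Random(seed)
    low, high = bounds
    positions = [[rng.uniform(low, high) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_value = [objective(p) for p in positions]
    best_index = min(range(n_particles), key=personal_value.__getitem__)
    global_best = personal_best[best_index][:]
    global_value = personal_value[best_index]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Blend momentum, pull towards the particle's own best
                # position, and pull towards the swarm's best position.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d]))
                # Move the particle and clamp it to the search box.
                moved = positions[i][d] + velocities[i][d]
                positions[i][d] = min(high, max(low, moved))
            value = objective(positions[i])
            if value < personal_value[i]:
                personal_best[i], personal_value[i] = positions[i][:], value
                if value < global_value:
                    global_best, global_value = positions[i][:], value
    return global_best, global_value

best_position, best_value = particle_swarm_minimize(
    sphere, dim=2, bounds=(-5.0, 5.0))
```

In a topic-modelling setting, each particle position would encode candidate LDA parameters, and the objective would score the resulting model.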

3.5. Cluster Validity Indices

This section briefly introduces four cluster validity indices (namely, the Bayesian information criterion, Calinski-Harabasz index, Davies-Bouldin index, and Silhouette index), which are utilized to evaluate the clustering quality of different configurations of LDA.

The Bayesian information criterion (BIC) is computed as given below:

$$\mathrm{BIC} = -2 \ln(L) + v \ln(n)$$

where n denotes the number of topics, L denotes the likelihood of the data under the fitted model, and v denotes the number of free parameters in the Gaussian model [74]. The smaller the Bayesian information criterion, the better the generated model.
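
As a small worked example, the criterion can be evaluated for two candidate models to see how the penalty on free parameters affects the comparison. The function name `bic` and the numbers below are illustrative assumptions, not values from the study. The function takes the log-likelihood ln(L) directly, which is how likelihoods are usually reported.

```python
import math

def bic(log_likelihood, n_free_params, n):
    """Bayesian information criterion: -2 ln(L) + v ln(n). Smaller is better."""
    return -2.0 * log_likelihood + n_free_params * math.log(n)

# Assumed toy values: the second model fits slightly better (higher
# log-likelihood) but uses more than twice as many free parameters.
bic_simple = bic(log_likelihood=-120.0, n_free_params=5, n=100)
bic_complex = bic(log_likelihood=-118.0, n_free_params=12, n=100)
preferred = "simple" if bic_simple < bic_complex else "complex"
```

In this toy comparison the small gain in likelihood does not offset the additional parameters, so the simpler model has the lower criterion.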

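The Calinski-Harabasz index (CH) is the ratio of the trace of the between-cluster scatter matrix to the trace of the internal scatter matrix, each normalized by its degrees of freedom, which is computed as given below [74]:

$$CH(K) = \frac{\operatorname{trace}(B) / (K - 1)}{\operatorname{trace}(W) / (N - K)}$$

$$\operatorname{trace}(B) = \sum_{k=1}^{K} |C_k| \, \lVert \bar{C}_k - \bar{x} \rVert^2$$

$$\operatorname{trace}(W) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \bar{C}_k \rVert^2$$

where K denotes the number of clusters, N denotes the number of data instances, |C_k| denotes the number of elements in cluster C_k, \bar{C}_k denotes the centroid of cluster C_k, \bar{x} denotes the mean of all data instances, x_i denotes a point within cluster C_k, B denotes the between-cluster scatter matrix, which represents the error sum of squares between different clusters, and W denotes the internal scatter matrix, which represents the squared differences of instances within a cluster. Here, the trace of an n-by-n square matrix corresponds to the sum of the elements on its main diagonal [75].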
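
The index described above can be computed directly from a set of points and their cluster labels. The sketch below uses squared Euclidean distances to cluster centroids and to the overall mean for the two traces. The function name and the toy data are assumptions for illustration.

```python
def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster scatter, each normalized."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    n, k = len(points), len(clusters)
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall_mean = mean(points)
    # trace(B): cluster sizes times squared centroid-to-global-mean distance.
    trace_b = sum(len(members) * squared_distance(mean(members), overall_mean)
                  for members in clusters.values())
    # trace(W): squared distance of every point to its own cluster centroid.
    trace_w = sum(squared_distance(p, mean(members))
                  for members in clusters.values() for p in members)
    return (trace_b / (k - 1)) / (trace_w / (n - k))

# Toy data (assumed): two compact, well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
ch_good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])
ch_bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])
```

A labelling that matches the two groups yields a much larger value than one that mixes them, so larger values indicate better-separated clusters. The sketch assumes at least two clusters, fewer clusters than points, and a nonzero within-cluster scatter.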

The Davies-Bouldin index (DB) is a cluster validity index that favors large distances between clusters and small distances between the centroid of each cluster and its data points. It is defined as given by the following equation:

$$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left\{ \frac{d(X_i) + d(X_j)}{d(c_i, c_j)} \right\}$$

where c denotes the number of clusters, i and j correspond to cluster labels, d(c_i, c_j) corresponds to the distance between the centroids of clusters C_i and C_j, and d(X_i) corresponds to the average distance between the data points X_i within cluster C_i and the centroid c_i. The smaller the DB criterion, the better the generated model.
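
A minimal sketch of the Davies-Bouldin computation follows. It assumes the common formulation in which the within-cluster scatter is the average Euclidean distance from a cluster's points to its centroid. The function name and the toy data are illustrative assumptions.

```python
import math

def davies_bouldin(points, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    dim = len(points[0])

    centroids = {
        label: [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for label, members in clusters.items()
    }
    # Within-cluster scatter: mean distance of members to their centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)

# Toy data (assumed): two compact, well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
db_good = davies_bouldin(points, [0, 0, 0, 1, 1, 1])
db_bad = davies_bouldin(points, [0, 1, 0, 1, 0, 1])
```

The labelling that matches the two groups gives the smaller value. The sketch requires at least two clusters with distinct centroids.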

The Silhouette index (SI) is defined as given by (9):

$$SI = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max\{a(x), b(x)\}} \right) \tag{9}$$

where N denotes the number of clusters, n_i denotes the size of cluster C_i, a(x) denotes the average distance between instance x and the other instances in its own cluster C_i, and b(x) denotes the minimum distance from x to the centroids of the clusters that do not contain x.
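
The sketch below computes a Silhouette index by averaging silhouette values within each cluster and then across clusters. As a stated assumption, it uses the widely used definition of b(x) as the smallest average distance from x to the instances of another cluster. A centroid-based variant would replace that inner average with a distance to the other cluster's centroid. The function name and the toy data are illustrative.

```python
import math

def silhouette_index(points, labels):
    """Mean over clusters of the mean silhouette value of their instances."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    def mean_distance(x, others):
        return sum(math.dist(x, y) for y in others) / len(others)

    total = 0.0
    for label, members in clusters.items():
        cluster_sum = 0.0
        for index, x in enumerate(members):
            if len(members) == 1:
                continue  # a singleton contributes 0 by convention
            rest = members[:index] + members[index + 1:]
            a = mean_distance(x, rest)            # cohesion within own cluster
            b = min(mean_distance(x, other)       # separation from nearest cluster
                    for other_label, other in clusters.items()
                    if other_label != label)
            cluster_sum += (b - a) / max(a, b)
        total += cluster_sum / len(members)
    return total / len(clusters)

# Toy data (assumed): two compact, well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
si_good = silhouette_index(points, [0, 0, 0, 1, 1, 1])
si_bad = silhouette_index(points, [0, 1, 0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned instances. The sketch requires at least two clusters and does not handle the degenerate case in which all distances are zero.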

3.6. Pairwise Diversity Measures

This section briefly introduces four diversity measures (namely, disagreement measure, Q-statistics, the correlation coefficient, and the double fault measure) which are utilized in the proposed ensemble classification scheme.

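Q-statistics (Q_{i,k}), the correlation coefficient (ρ_{i,k}), the disagreement measure (Dis), and the double fault measure (DF) between two classifiers D_i and D_k are computed using (12), (13), (14), and (15), respectively [76]:

$$Q_{i,k} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}} \tag{12}$$

$$\rho_{i,k} = \frac{N^{11} N^{00} - N^{01} N^{10}}{\sqrt{(N^{11} + N^{10})(N^{01} + N^{00})(N^{11} + N^{01})(N^{10} + N^{00})}} \tag{13}$$

$$Dis_{i,k} = \frac{N^{01} + N^{10}}{N^{11} + N^{10} + N^{01} + N^{00}} \tag{14}$$

$$DF_{i,k} = \frac{N^{00}}{N^{11} + N^{10} + N^{01} + N^{00}} \tag{15}$$

where N¹¹, N⁰⁰, N¹⁰, and N⁰¹ denote the number of instances correctly classified by both classifiers, the number of instances incorrectly classified by both classifiers, the number of instances correctly classified by D_i and incorrectly classified by D_k, and the number of instances correctly classified by D_k and incorrectly classified by D_i, respectively.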
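
The four pairwise measures can be computed from two per-instance correctness vectors, one for each classifier. The sketch below follows the standard definitions of these measures. The function name, the dictionary keys, and the toy correctness vectors are assumptions for illustration.

```python
import math

def pairwise_diversity(correct_i, correct_k):
    """Diversity measures for two classifiers from 0/1 correctness vectors."""
    pairs = list(zip(correct_i, correct_k))
    n11 = sum(1 for a, b in pairs if a and b)          # both correct
    n00 = sum(1 for a, b in pairs if not a and not b)  # both wrong
    n10 = sum(1 for a, b in pairs if a and not b)      # only D_i correct
    n01 = sum(1 for a, b in pairs if not a and b)      # only D_k correct
    total = n11 + n00 + n10 + n01
    cross = n11 * n00 - n01 * n10
    return {
        "Q": cross / (n11 * n00 + n01 * n10),
        "rho": cross / math.sqrt(
            (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)),
        "Dis": (n01 + n10) / total,
        "DF": n00 / total,
    }

# Toy correctness vectors (assumed): 1 = instance classified correctly.
correct_i = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
correct_k = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
measures = pairwise_diversity(correct_i, correct_k)
```

For these vectors the counts are N¹¹ = 5, N⁰⁰ = 2, N¹⁰ = 2, and N⁰¹ = 1. The sketch does not guard against zero denominators, which occur, for example, when one classifier is correct on every instance.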
