Nguyen, Hien D.; McLachlan, Geoffrey J.
12 Citations
The Gaussian mixture model (GMM) is a popular tool for multivariate analysis, in particular, cluster analysis. The expectation–maximization (EM) algorithm is generally used to perform maximum likelihood (ML) estimation for GMMs due to the M-step existing in closed form and its desirable numerical properties, such as monotonicity. However, the EM algorithm has been criticized as being slow to converge and thus computationally expensive in some situations. In this article, we introduce the linear regression characterization (LRC) of the GMM. We show that the parameters of an LRC of the GMM can be mapped back to the natural parameters, and that a minorization–maximization (MM) algorithm can be constructed, which retains the desirable numerical properties of the EM algorithm, without the use of matrix operations. We prove that the ML estimators of the LRC parameters are consistent and asymptotically normal, like their natural counterparts. Furthermore, we show that the LRC allows for simple handling of singularities in the ML estimation of GMMs. Using numerical simulations in the R programming environment, we then demonstrate that the MM algorithm can be faster than the EM algorithm in various large data situations, where sample sizes range in the tens to hundreds of thousands and for estimating models with up to 16 mixture components on multivariate data with up to 16 variables.
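For readers unfamiliar with the baseline this abstract compares against, the following is a minimal sketch of standard EM for a univariate GMM, showing the closed-form M-step and the monotone log-likelihood. It is not the LRC or MM algorithm of the article; the function name `em_gmm_1d`, the initialization, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50, seed=0):
    """Plain EM for a univariate k-component Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    weights = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)  # assumed initialization
    variances = np.full(k, np.var(x))
    log_liks = []
    for _ in range(n_iter):
        # E-step: weighted component densities and responsibilities
        dens = (weights
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2.0 * np.pi * variances))
        total = dens.sum(axis=1)
        log_liks.append(np.log(total).sum())  # log-likelihood at current parameters
        resp = dens / total[:, None]
        # M-step: closed-form updates of weights, means and variances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, np.array(log_liks)

# Simulated data: two well-separated components (an assumption for the demo)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
weights, means, variances, log_liks = em_gmm_1d(x, k=2)
```

The recorded `log_liks` sequence should never decrease from one iteration to the next, which is the monotonicity property the abstract refers to.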
Giusti, Antonio; Grassini, Laura
6 Citations
The aim of this paper is to investigate the economic specialization of the Italian local labor systems (sets of contiguous municipalities with a high degree of self-containment of daily commuter travel) by using the Symbolic Data approach, on the basis of data derived from the Census of Industrial and Service Activities. Specifically, the economic structure of a local labor system (LLS) is described by an interval-type variable, a special symbolic data type that allows for the fact that all municipalities within the same LLS do not have the same economic structure.
Umbleja, Kadri; Ichino, Manabu; Yaguchi, Hiroyuki
Symbolic data is aggregated from larger traditional datasets in order to hide entry-specific details and to enable the analysis of large amounts of data, such as big data, which would otherwise not be possible. Symbolic data may appear in many different and complex forms, such as intervals and histograms. Identifying patterns and finding similarities between objects is one of the most fundamental tasks of data mining. To cluster these sophisticated data types accurately, the usual methods are not enough. Over the years, different approaches have been proposed, but they mainly concentrate on the “macroscopic” similarities between objects. Distributional data, for example symbolic data, has been aggregated from large sets of data, and thus even the smallest microscopic differences and similarities become extremely important. In this paper, a method is proposed for clustering distributional data based on these microscopic similarities by using quantile values. Having multiple points for comparison makes it possible to identify similarities in small sections of a distribution while producing more adequate hierarchical concepts. The proposed algorithm, called microscopic hierarchical conceptual clustering, has a monotone property and has been found to produce more adequate conceptual clusters during experimentation. Furthermore, thanks to the use of quantiles, this algorithm allows us to compare different types of symbolic data easily without any additional complexity.
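To illustrate how quantiles can put intervals and histograms on a common footing, here is a minimal sketch that converts either type into a vector of quantile values. It is not the authors' clustering algorithm; the helper name `histogram_quantiles`, the assumption of a uniform distribution within each bin, and the chosen quantile levels are all hypothetical.

```python
import numpy as np

def histogram_quantiles(edges, probs, levels):
    """Quantile values of a histogram-valued observation.

    Assumes values are uniformly spread within each bin and that every bin
    has positive probability, so the cumulative distribution is piecewise
    linear and strictly increasing.
    """
    edges = np.asarray(edges, dtype=float)
    probs = np.asarray(probs, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(probs)))
    cdf /= cdf[-1]  # normalize in case the probabilities do not sum to 1
    # Invert the piecewise-linear CDF by interpolation
    return np.interp(levels, cdf, edges)

levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# An interval [0, 10] is treated as a one-bin histogram
q_interval = histogram_quantiles([0.0, 10.0], [1.0], levels)

# A three-bin histogram on [0, 2), [2, 4), [4, 10]
q_hist = histogram_quantiles([0.0, 2.0, 4.0, 10.0], [0.5, 0.25, 0.25], levels)

# Both objects are now vectors of equal length and can be compared directly
pointwise_gap = np.abs(q_interval - q_hist)
```

Because both objects end up as quantile vectors of the same length, the comparison needs no type-specific handling, which is the convenience the abstract attributes to using quantiles.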
Paul, Nicolas; Terre, Michel; Fety, Luc
2 Citations
This paper deals with the unsupervised classification of univariate observations. Given a set of observations originating from a K-component mixture, we focus on the estimation of the component expectations. We propose an algorithm based on the minimization of the “K-product” (KP) criterion we introduced in a previous work. We show that the global minimum of this criterion can be reached by first solving a linear system and then calculating the roots of a polynomial of order K. The KP global minimum provides a first raw estimate of the component expectations, which a nearest-neighbour classification then refines. Our method’s relevance is finally illustrated through simulations of various mixtures. When the mixture components do not strongly overlap, the KP algorithm provides better estimates than the Expectation-Maximization algorithm.
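The two-step minimization described above (linear system, then polynomial roots) can be sketched as follows. The abstract does not state the criterion, so this sketch assumes it has the form J(x) = sum over n of the product over k of (z_n - x_k)^2; under that assumption the product is a monic degree-K polynomial in z_n, J is quadratic in its coefficients, and the minimizer follows from a least-squares solve. The function name `kp_estimate` and the simulated data are hypothetical, and the nearest-neighbour refinement step is not included.

```python
import numpy as np

def kp_estimate(z, k):
    """Raw estimate of k component expectations under the assumed KP criterion.

    Minimizes sum_n p(z_n)^2 over monic degree-k polynomials p, then returns
    the roots of the minimizing polynomial.
    """
    z = np.asarray(z, dtype=float)
    # Columns z^(k-1), ..., z^0: the unknown non-leading coefficients of p
    design = np.vander(z, N=k)
    target = -z ** k
    # Step 1: linear least-squares system for the polynomial coefficients
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    # Step 2: roots of the monic polynomial z^k + coef[0] z^(k-1) + ... + coef[k-1]
    roots = np.roots(np.concatenate(([1.0], coef)))
    return np.sort(roots.real)

# Noise-free check: observations sit exactly on three values
exact = kp_estimate([0.0, 0.0, 5.0, 5.0, 5.0, 10.0], k=3)

# Simulated mixture with well-separated components (assumed for the demo)
rng = np.random.default_rng(0)
true_means = np.array([0.0, 5.0, 10.0])
z = np.concatenate([rng.normal(m, 0.3, 200) for m in true_means])
noisy = kp_estimate(z, k=3)
```

In the noise-free case the polynomial vanishes at every observation, so the roots recover the three values exactly; with mild noise the roots should land close to the true expectations.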
McParland, Damien; Gormley, Isobel Claire
24 Citations
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.
Choi, Hosik; Lee, Seokho
We present a new clustering algorithm for multivariate binary data. The new algorithm is based on the convex relaxation of hierarchical clustering, which is achieved by considering the binomial likelihood as a natural distribution for binary data and by formulating convex clustering using a pairwise penalty on prototypes of clusters. Under convex clustering, we show that the typical $\ell_1$ pairwise fused penalty results in ineffective cluster formation. In an attempt to promote the clustering performance and select the relevant clustering variables, we propose penalized maximum likelihood estimation with an $\ell_2$ fused penalty on the fusion parameters and an $\ell_1$ penalty on the loading matrix. We provide an efficient algorithm to solve the optimization problem by using the majorization-minimization algorithm and the alternating direction method of multipliers. Numerical studies confirm its good performance, and real data analysis demonstrates the practical usefulness of the proposed method.
Arevalillo, Jorge M.; Navarro, Hilario
Multivariate scale mixtures of skew-normal distributions are flexible models that account for the non-normality of data by means of a tail weight parameter and a shape vector representing the asymmetry of the model in a directional fashion. Their stochastic representation involves a skew-normal vector and a non-negative mixing scalar variable, independent of the skew-normal vector, that injects tail weight behavior into the model. In this paper we look into the problem of finding the projection that maximizes skewness for vectors that follow a scale mixture of skew-normal distributions; when a simple condition on the moments of the mixing variable is fulfilled, it can be shown that the direction yielding the maximal skewness is proportional to the shape vector. This finding stresses the directional nature of the shape vector in regulating the asymmetry; it also provides the theoretical foundations motivating the skewness-based projection pursuit problem in this class of distributions. Some examples that illustrate the application of our results are also given; they include a simulation experiment with artificial data, which sheds light on the usefulness and implications of our results, and an application to real data.
Nyman, Henrik; Xiong, Jie; Pensar, Johan; Corander, Jukka
2 Citations
An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have been recently considered and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers acknowledging feature dependencies either through graphical models or sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based solely on conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.
Guinot, Christiane; Malvy, Denis; Schémann, Jean-François; Afonso, Filipe; Haddad, Raja; Diday, Edwin
2 Citations
Trachoma, caused by repeated ocular infections with Chlamydia trachomatis whose vector is a fly, is an important cause of blindness in the world. We present here an application of the Symbolic Data Analysis approach to an interventional study on trachoma conducted in Mali. This study was conducted to choose among three antibiotic strategies those with the best cost-effectiveness ratio and to find the demographic and environmental parameters on which we could try to intervene. The Symbolic Data Analysis approach aims at studying classes of individuals considered as new units. These units are described by variables whose values express for each class the variation of the values taken by each of its individuals. Finally, the results obtained are compared to those previously provided by multiple logistic regression analysis. Symbolic Data Analysis provides a new perspective on this study and suggests that some demographic, economic and environmental parameters are related to the disease and its evolution during the treatment, whatever the strategy. Moreover, it is shown that the efficiency of each strategy depends on environmental parameters.
