By
Li, Zhaoyuan; Yao, Jianfeng
1 Citation
In this paper, we generalize two criteria, the determinant-based and trace-based criteria proposed by Saranadasa (J Multivar Anal 46:154–174, 1993), to general populations for high-dimensional classification. These two criteria compare distances between a new observation and several known groups. The determinant-based criterion performs well for correlated variables by integrating the covariance structure and is competitive with many existing rules. The criterion, however, requires that the measurement dimension be smaller than the sample size. The trace-based criterion, in contrast, is an independence rule and is effective in the “large dimension-small sample size” scenario. An appealing property of these two criteria is that their implementation is straightforward and there is no need for preliminary variable selection or use of tuning parameters. Their asymptotic misclassification probabilities are derived using the theory of large-dimensional random matrices. Their competitive performance is illustrated by intensive Monte Carlo experiments and a real data analysis.
By
Giusti, Antonio; Grassini, Laura
4 Citations
The aim of this paper is to investigate the economic specialization of the Italian local labor systems (sets of contiguous municipalities with a high degree of self-containment of daily commuter travel) by using the Symbolic Data approach, on the basis of data derived from the Census of Industrial and Service Activities. Specifically, the economic structure of a local labor system (LLS) is described by an interval-type variable, a special symbolic data type that allows for the fact that the municipalities within the same LLS do not all have the same economic structure.
By
McParland, Damien; Gormley, Isobel Claire
13 Citations
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.
By
Nyman, Henrik; Xiong, Jie; Pensar, Johan; Corander, Jukka
2 Citations
An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have been recently considered and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers acknowledging feature dependencies either through graphical models or sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based solely on conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.
By
Boullé, Marc
6 Citations
In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of each input variable, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is searched for by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement in accuracy for classification tasks.
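The idea of a bivariate data grid can be illustrated with a minimal sketch. Note that this is not the paper's method: it swaps in fixed equal-frequency binning and empirical mutual information for the Bayesian model selection criterion, and the helper names (`equal_frequency_bins`, `mutual_information`) and the XOR-style toy data are assumptions made for illustration. It only shows why a pair of variables can be informative when each variable alone is not.

```python
import numpy as np

def equal_frequency_bins(x, n_bins):
    """Assign each value of x to one of n_bins equal-frequency intervals."""
    cuts = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, x, side="right")

def mutual_information(cells, y):
    """Empirical mutual information (in nats) between two integer-coded arrays."""
    _, c_idx = np.unique(np.asarray(cells).ravel(), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y).ravel(), return_inverse=True)
    joint = np.zeros((c_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (c_idx, y_idx), 1)      # contingency table of counts
    joint /= joint.sum()                      # joint probabilities
    p_cells = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    ratio = joint[nonzero] / (p_cells @ p_y)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

# Hypothetical XOR-like data: the class depends only on the interaction of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 2000)
x2 = rng.uniform(-1, 1, 2000)
y = (x1 * x2 > 0).astype(int)

b1 = equal_frequency_bins(x1, 2)
b2 = equal_frequency_bins(x2, 2)
grid_cells = b1 * 2 + b2                      # index of the cell in the 2 x 2 grid

mi_x1 = mutual_information(b1, y)             # univariate view of x1
mi_x2 = mutual_information(b2, y)             # univariate view of x2
mi_pair = mutual_information(grid_cells, y)   # joint view through the grid
print(mi_x1, mi_x2, mi_pair)
```

In this toy setting a univariate filter would rank both variables as uninformative, while the grid over the pair recovers most of the class information. The paper's approach additionally chooses the partition itself rather than fixing the number of bins in advance.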
By
Cadre, Benoît; Paris, Quentin
2 Citations
Based on n randomly drawn vectors in a Hilbert space, we study the k-means clustering scheme. Here, clustering is performed by computing the Voronoi partition associated with centers that minimize an empirical criterion, called distortion. The performance of the method is evaluated by comparing the theoretical distortion of empirical optimal centers to the theoretical optimal distortion. Our first result states that, provided that the underlying distribution satisfies an exponential moment condition, an upper bound for the above performance criterion is $O(1/\sqrt{n})$. Then, motivated by a broad range of applications, we focus on the case where the data are real-valued random fields. Assuming that they share a Hölder property in quadratic mean, we construct a numerically simple k-means algorithm based on a discretized version of the data. With a judicious choice of the discretization, we prove that the performance of this algorithm matches the performance of the classical algorithm.
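A minimal sketch of k-means on discretized curves, with the empirical distortion computed as the mean squared distance to the nearest center, may help fix ideas. It assumes plain Lloyd iterations on a regular grid over [0, 1]; the function names, the grid size of 20 points and the simulated curves are hypothetical and do not reproduce the paper's construction or its choice of discretization.

```python
import numpy as np

def squared_distances(X, centers):
    """Grid approximation of the squared L2 distance between curves and centers."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).mean(axis=2)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; rows of X are curves sampled on a common grid."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = squared_distances(X, centers).argmin(axis=1)  # Voronoi cells
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:               # keep the old center if a cell is empty
                centers[j] = members.mean(axis=0)
    labels = squared_distances(X, centers).argmin(axis=1)
    return centers, labels

def empirical_distortion(X, centers):
    """Mean squared distance from each curve to its nearest center."""
    return float(squared_distances(X, centers).min(axis=1).mean())

# Hypothetical data: two groups of noisy sine curves observed at 20 grid points.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 20)
group_a = np.sin(2 * np.pi * grid) + 0.1 * rng.standard_normal((30, 20))
group_b = 3.0 + np.sin(2 * np.pi * grid) + 0.1 * rng.standard_normal((30, 20))
X = np.vstack([group_a, group_b])

centers, labels = kmeans(X, k=2)
distortion_two = empirical_distortion(X, centers)
distortion_one = empirical_distortion(X, X.mean(axis=0, keepdims=True))
print(distortion_two, distortion_one)
```

Averaging over grid points rather than summing makes the distance a Riemann-sum approximation of the integral of the squared difference, so the distortion stays comparable when the grid is refined.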
By
Yamamoto, Michio
12 Citations
To find optimal clusters of functional objects in a lower-dimensional subspace of data, a sequential method called tandem analysis is often used, though such a method is problematic. A new procedure is developed to simultaneously find optimal clusters of functional objects and an optimal subspace for clustering. The method is based on the k-means criterion for functional data and seeks the subspace that is maximally informative about the clustering structure in the data. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized method. Analyses of artificial and real data examples demonstrate that the proposed method gives correct and interpretable results.
By
Carmona, Christian; Nieto-Barajas, Luis; Canale, Antonio
The Ministry of Social Development in Mexico is in charge of creating and assigning social programmes targeting specific needs in the population for the improvement of the quality of life. To better target the social programmes, the Ministry aims to find clusters of households with the same needs based on demographic characteristics as well as poverty conditions of the household. The available data consist of continuous, ordinal, and nominal variables, all of which come from a non-i.i.d. complex design survey sample. We propose a Bayesian nonparametric mixture model that jointly models a set of latent variables, as in an underlying variable response approach, associated with the observed mixed scale data and accommodates the different sampling probabilities. The performance of the model is assessed via simulated data. A full analysis of socioeconomic conditions in households in the Mexican State of Mexico is presented.
By
Gnaldi, Michela; Bacci, Silvia; Bartolucci, Francesco
8 Citations
Within the educational context, a key goal is to assess students’ acquired skills and to cluster students according to their ability level. In this regard, a relevant element to be accounted for is the possible effect of the school that students come from. To this end, we provide a methodological tool which takes into account the multilevel structure of the data (i.e., students in schools) and allows us to cluster both students and schools into homogeneous classes of ability and effectiveness, and to assess the effect of certain student and school characteristics on the probability of belonging to such classes. The proposed approach relies on an extended class of multidimensional latent class IRT models characterised by: (i) latent traits defined at student and school level, (ii) latent traits represented through random vectors with a discrete distribution, (iii) the inclusion of covariates at student and school level, and (iv) a two-parameter logistic parametrisation for the conditional probability of a correct response given the ability. The approach is applied to the analysis of data collected by two national tests administered in Italy to middle school students in June 2009: the INVALSI Language Test and the Mathematics Test.
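The two-parameter logistic parametrisation in (iv), combined with a discrete ability distribution as in (ii), can be sketched as follows. This is a simplified, unidimensional illustration without covariates or the school level; the function names and all numeric values (support points, class weights, item parameters) are hypothetical and are not taken from the paper.

```python
import math

def two_pl_probability(ability, discrimination, difficulty):
    """Probability of a correct response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def marginal_probability(support_points, class_weights, discrimination, difficulty):
    """Average the 2PL probability over a discrete (latent class) ability distribution."""
    return sum(
        weight * two_pl_probability(theta, discrimination, difficulty)
        for theta, weight in zip(support_points, class_weights)
    )

# Hypothetical latent classes: three ability levels with their class probabilities.
support_points = [-1.0, 0.0, 1.0]
class_weights = [0.3, 0.4, 0.3]

# Hypothetical item: discrimination 1.2, difficulty 0.0.
p_low = two_pl_probability(support_points[0], 1.2, 0.0)
p_high = two_pl_probability(support_points[2], 1.2, 0.0)
p_marginal = marginal_probability(support_points, class_weights, 1.2, 0.0)
print(p_low, p_high, p_marginal)
```

With a discrete ability distribution, each support point plays the role of a latent class, so clustering students amounts to assigning each of them to the class with the highest posterior probability given their responses.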
