Showing 1 to 10 of 212 matching Articles
By Li, Zhaoyuan; Yao, Jianfeng
In this paper, we generalize two criteria, the determinant-based and trace-based criteria proposed by Saranadasa (J Multivar Anal 46:154–174, 1993), to general populations for high-dimensional classification. These two criteria compare distances between a new observation and several different known groups. The determinant-based criterion performs well for correlated variables by integrating the covariance structure and is competitive with many other existing rules. The criterion, however, requires that the measurement dimension be smaller than the sample size. The trace-based criterion, in contrast, is an independence rule and is effective in the “large dimension, small sample size” scenario. An appealing property of these two criteria is that their implementation is straightforward and there is no need for preliminary variable selection or tuning parameters. Their asymptotic misclassification probabilities are derived using the theory of large-dimensional random matrices. Their competitive performance is illustrated by intensive Monte Carlo experiments and a real data analysis.
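In sketch form, the two kinds of rules can be contrasted as follows. This is a schematic illustration of distance-to-group classification, not the authors' exact criteria: the function names, the plain Euclidean form of the independence rule, and the Mahalanobis-plus-log-determinant form are our simplifications.

```python
import numpy as np

def trace_rule(x, groups):
    """Assign x to the group with the smallest Euclidean distance to its
    sample mean -- a schematic stand-in for a trace-based (independence)
    rule, which ignores correlations between variables."""
    dists = [np.sum((x - g.mean(axis=0)) ** 2) for g in groups]
    return int(np.argmin(dists))

def det_rule(x, groups):
    """Mahalanobis-type distance using each group's sample covariance,
    illustrating how a determinant-based rule integrates the covariance
    structure (and why it needs dimension < sample size: the sample
    covariance must be invertible)."""
    dists = []
    for g in groups:
        mu, S = g.mean(axis=0), np.cov(g, rowvar=False)
        d = x - mu
        dists.append(d @ np.linalg.solve(S, d) + np.log(np.linalg.det(S)))
    return int(np.argmin(dists))
```

The trace rule stays usable when the dimension exceeds the sample size, since it never inverts a covariance matrix.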
By Giusti, Antonio; Grassini, Laura
The aim of this paper is to investigate the economic specialization of Italian local labor systems (sets of contiguous municipalities with a high degree of self-containment of daily commuter travel) using the Symbolic Data approach, on the basis of data derived from the Census of Industrial and Service Activities. Specifically, the economic structure of a local labor system (LLS) is described by an interval-type variable, a special symbolic data type that accounts for the fact that the municipalities within the same LLS do not all have the same economic structure.
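A minimal sketch of an interval-type variable: per-municipality values within an LLS are summarized by their [min, max] interval. The LLS names and figures below are invented for illustration, not census data.

```python
# Toy interval-type symbolic variable: the share of (say) service
# employment in each municipality of a local labor system (LLS) is
# summarized by the [min, max] interval over its municipalities.
# All figures are hypothetical.
shares = {
    "LLS_A": [0.12, 0.30, 0.22],   # three municipalities
    "LLS_B": [0.05, 0.09],         # two municipalities
}
intervals = {lls: (min(v), max(v)) for lls, v in shares.items()}
```

The interval retains the within-LLS heterogeneity that a single average would discard.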
By Son, Won; Lim, Johan; Wang, Xinlei
We consider a regularized D-classification rule for high-dimensional binary classification, which adopts the linear shrinkage estimator of a covariance matrix as an alternative to the sample covariance matrix in the D-classification rule (D-rule in short). We find an asymptotic expression for the misclassification rate of the regularized D-rule when the sample size n and the dimension p both increase and their ratio p/n approaches a positive constant γ. In addition, we compare its misclassification rate to that of the standard D-rule under various settings via simulation.
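The linear shrinkage idea can be sketched as follows. The fixed shrinkage weight `rho`, the scaled-identity target, and the simplified linear discriminant built on the shrunken pooled covariance are our assumptions for illustration, not the paper's exact estimator or rule.

```python
import numpy as np

def shrinkage_cov(X, rho):
    """Linear shrinkage of the sample covariance toward a scaled
    identity target: (1 - rho) * S + rho * (tr(S)/p) * I.
    rho in [0, 1] is chosen by hand here; the paper studies the
    asymptotics as p/n approaches gamma."""
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - rho) * S + rho * target

def d_rule(x, X0, X1, rho=0.5):
    """Hypothetical simplified discriminant: classify by the sign of a
    linear score built on the shrunken pooled within-class covariance."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S = shrinkage_cov(np.vstack([X0 - mu0, X1 - mu1]), rho)
    w = np.linalg.solve(S, mu0 - mu1)
    return 0 if (x - 0.5 * (mu0 + mu1)) @ w > 0 else 1
```

With `rho=1` the rule reduces to an independence-style classifier; with `rho=0` it uses the raw sample covariance, which degrades as p approaches n.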
By McParland, Damien; Gormley, Isobel Claire
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal, or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation-maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data, a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.
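The latent variable idea can be illustrated with a toy generative sketch: each observation draws a latent Gaussian vector from one of two mixture components; one latent dimension is observed directly (continuous) and another is thresholded at zero to yield a binary variable. The component means, sample size, and thresholding are invented for illustration; clustMD's actual models and covariance structures are richer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-component latent Gaussian mixture (means are hypothetical).
means = np.array([[-2.0, -1.0], [2.0, 1.0]])
z = rng.integers(2, size=200)                    # component labels
latent = means[z] + rng.normal(size=(200, 2))    # latent Gaussian draws

continuous = latent[:, 0]                        # observed directly
binary = (latent[:, 1] > 0).astype(int)          # observed via threshold
```

Ordinal and nominal variables arise the same way, by cutting a latent dimension at several thresholds instead of one.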
By Choi, Hosik; Lee, Seokho
We present a new clustering algorithm for multivariate binary data. The new algorithm is based on a convex relaxation of hierarchical clustering, which is achieved by taking the binomial likelihood as a natural distribution for binary data and by formulating convex clustering with a pairwise penalty on cluster prototypes. Under convex clustering, we show that the typical $$\ell _1$$ pairwise fused penalty results in ineffective cluster formation. To promote clustering performance and select the relevant clustering variables, we propose penalized maximum likelihood estimation with an $$\ell _2$$ fused penalty on the fusion parameters and an $$\ell _1$$ penalty on the loading matrix. We provide an efficient algorithm that solves the optimization using the majorization-minimization algorithm and the alternating direction method of multipliers. Numerical studies confirm its good performance, and a real data analysis demonstrates the practical usefulness of the proposed method.
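The shape of such an objective can be sketched as follows: a Bernoulli negative log-likelihood of the binary data given one prototype (in logit scale) per observation, plus an ℓ2 pairwise fused penalty pulling prototypes together so that fused prototypes define clusters. This is a toy evaluation of the objective only, not the authors' full estimator (which additionally penalizes a loading matrix) and not a solver.

```python
import numpy as np

def convex_clust_objective(X, U, lam):
    """Toy convex-clustering objective for binary data.
    X : (n, p) binary data; U : (n, p) prototype logits, one row per
    observation; lam : weight of the l2 pairwise fused penalty."""
    P = 1.0 / (1.0 + np.exp(-U))                 # prototype probabilities
    eps = 1e-12
    nll = -np.sum(X * np.log(P + eps) + (1 - X) * np.log(1 - P + eps))
    fuse = sum(np.linalg.norm(U[i] - U[j])       # l2 (not l1) fusion
               for i in range(len(U)) for j in range(i + 1, len(U)))
    return nll + lam * fuse
```

As `lam` grows, prototype rows are driven to coincide, and the distinct rows that remain play the role of cluster centers.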
By Nyman, Henrik; Xiong, Jie; Pensar, Johan; Corander, Jukka
An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have recently been considered, and their asymptotic behavior has been characterized under the assumption that the observed features or variables used to build a classifier are conditionally independent given a simultaneous labeling of both the training samples and those of unknown origin. Here we extend the theoretical results to predictive classifiers that acknowledge feature dependencies either through graphical models or through sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and predictive classifiers based solely on conditionally independent features. In most of our experiments, stratified graphical models show an advantage over ordinary graphical models.
By Boullé, Marc
In the domain of data preparation for supervised classification, filter methods for variable ranking are time-efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly, and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of the two input variables, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially a significant improvement in accuracy for classification tasks.
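To convey the idea of measuring the joint information a pair of inputs carries about the class, here is a toy computation: a fixed quantile grid and empirical mutual information stand in for the paper's optimized partition and Bayesian criterion. The function name, the quantile binning, and the bin count are our choices for illustration.

```python
import numpy as np

def grid_mi(x1, x2, y, bins=4):
    """Discretize the pair (x1, x2) onto a bins x bins quantile grid
    and return the empirical mutual information (in nats) between the
    grid cell and the class label y. A univariate filter applied to x1
    or x2 alone would miss information carried only jointly."""
    c1 = np.digitize(x1, np.quantile(x1, np.linspace(0, 1, bins + 1)[1:-1]))
    c2 = np.digitize(x2, np.quantile(x2, np.linspace(0, 1, bins + 1)[1:-1]))
    cell = c1 * bins + c2                     # joint cell index
    mi = 0.0
    for c in np.unique(cell):
        for k in np.unique(y):
            p_ck = np.mean((cell == c) & (y == k))
            if p_ck > 0:
                mi += p_ck * np.log(p_ck / (np.mean(cell == c) * np.mean(y == k)))
    return mi
```

An XOR-style label is the classic case: each variable alone is uninformative, yet the grid over the pair recovers the class almost perfectly.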
By Aoshima, Makoto; Yata, Kazuyoshi
In this paper, we consider high-dimensional quadratic classifiers in non-sparse settings. The quadratic classifiers proposed in this paper draw information about heterogeneity effectively through both the differences of growing mean vectors and of covariance matrices. We show that they hold a consistency property in which misclassification rates tend to zero as the dimension goes to infinity under non-sparse settings. We also propose a quadratic classifier after feature selection, using both the differences of mean vectors and covariance matrices. We discuss the performance of the classifiers in numerical simulations and actual data analyses. Finally, we give concluding remarks about the choice of classifiers for high-dimensional, non-sparse data.
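For intuition only, a plain quadratic discriminant score shows how a classifier can exploit both the class mean and the class covariance. This is ordinary QDA, not the paper's classifiers, which are constructed to remain consistent when the dimension far exceeds the sample size (where this sample-covariance version breaks down).

```python
import numpy as np

def qda_score(x, X_k):
    """Quadratic discriminant score for one class: a Mahalanobis term
    (uses the class mean and covariance) plus a log-determinant term
    (uses the covariance alone). Classify by the largest score."""
    mu, S = X_k.mean(axis=0), np.cov(X_k, rowvar=False)
    d = x - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (d @ np.linalg.solve(S, d) + logdet)
```

Because the score involves the covariance in two places, two classes with identical means but different spreads are still separable, which is exactly the heterogeneity a linear rule cannot see.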
By Cadre, Benoît; Paris, Quentin
Based on n randomly drawn vectors in a Hilbert space, we study the k-means clustering scheme. Here, clustering is performed by computing the Voronoi partition associated with centers that minimize an empirical criterion, called distortion. The performance of the method is evaluated by comparing the theoretical distortion of empirical optimal centers to the theoretical optimal distortion. Our first result states that, provided that the underlying distribution satisfies an exponential moment condition, an upper bound for the above performance criterion is $O(1/\sqrt{n})$. Then, motivated by a broad range of applications, we focus on the case where the data are real-valued random fields. Assuming that they share a Hölder property in quadratic mean, we construct a numerically simple k-means algorithm based on a discretized version of the data. With a judicious choice of the discretization, we prove that the performance of this algorithm matches the performance of the classical algorithm.
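The discretization idea can be sketched as follows: a plain Lloyd k-means is applied to curves through a coarse subgrid of their sample points. The uniform subsampling step is our arbitrary choice, not the paper's tuned discretization, and the Lloyd iteration is the standard algorithm, not the authors' analysis.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd k-means: assign each point to its nearest center
    (the Voronoi partition mentioned above) and recompute centers as
    cluster means; each step can only lower the empirical distortion."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])  # keep empty-cluster center
    return labels, centers

def kmeans_discretized(curves, k, step=10):
    """Cluster functional data through every step-th sample point --
    a numerically cheap surrogate for clustering the full curves."""
    return kmeans(curves[:, ::step], k)
```

For smooth (Hölder-type) curves, neighboring sample points carry nearly the same information, which is why a coarse grid can cluster as well as the full resolution.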
By Yamamoto, Michio
To find optimal clusters of functional objects in a lower-dimensional subspace of the data, a sequential method called tandem analysis is often used, though such a method is problematic. A new procedure is developed to find optimal clusters of functional objects and, simultaneously, an optimal subspace for clustering. The method is based on the k-means criterion for functional data and seeks the subspace that is maximally informative about the clustering structure in the data. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized method. Analyses of artificial and real data examples demonstrate that the proposed method gives correct and interpretable results.
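The alternating least-squares idea can be sketched with a generic reduced k-means: cluster memberships and an orthonormal loading matrix are updated in turn, rather than running PCA first and clustering afterwards (the tandem approach). This is a multivariate sketch under our own simplifications, not Yamamoto's functional, regularized procedure.

```python
import numpy as np

def reduced_kmeans(X, k, q, iters=30, seed=0):
    """Alternate between (a) k-means steps on the projected data and
    (b) a Procrustes update of the q-dimensional loading matrix A, so
    the subspace is chosen for the clustering, not fixed beforehand."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(k, size=n)
    A = np.linalg.qr(rng.normal(size=(p, q)))[0]   # orthonormal loadings
    for _ in range(iters):
        Y = X @ A                                  # project to subspace
        centers = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                            else Y[rng.integers(n)] for j in range(k)])
        labels = np.argmin(((Y[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        M = centers[labels]                        # subspace cluster mean per point
        U, _, Vt = np.linalg.svd(X.T @ M, full_matrices=False)
        A = U @ Vt                                 # Procrustes update of A
    return labels, A
```

Updating A from the current cluster means steers the subspace toward directions that separate the clusters, which is the failure mode tandem analysis cannot correct.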
