Showing 1 to 23 of 23 matching Articles
By
Aroche-Villarruel, Argenis A.; Carrasco-Ochoa, J. A.; Martínez-Trinidad, José Fco.; Olvera-López, J. Arturo; Pérez-Suárez, Airel
1 Citations
In this paper we present a study of the overlapping clustering algorithms OKM, WOKM and OKMED, which extend the well-known K-means algorithm, originally proposed for building partitions, to the overlapping case. Unlike previously reported comparisons, our study compares these algorithms using the external evaluation metric FBcubed, which takes the overlap among clusters into account, and contrasts the results with those obtained with F-measure, a metric that ignores the overlap among clusters and that was used in another reported comparison.
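To make the role of overlap in the evaluation concrete, the following is a minimal sketch of an FBcubed-style score, assuming the extended BCubed precision and recall commonly used for overlapping clusterings. The function names and the dictionary-based input format are illustrative assumptions, not taken from the paper, and every object is assumed to have at least one cluster and one class.

```python
from typing import Dict, Hashable, Set

# Hypothetical input format: object id -> set of group ids it belongs to.
Assignment = Dict[Hashable, Set[Hashable]]


def _bcubed_average(groups: Assignment, reference: Assignment) -> float:
    """Average pairwise agreement of `groups` measured against `reference`.

    For every object e, look at all objects that share at least one group
    with e (including e itself) and score each pair by
    min(shared groups, shared reference groups) / shared groups.
    """
    objects = list(groups)
    total = 0.0
    for e in objects:
        scores = []
        for other in objects:
            shared = len(groups[e] & groups[other])
            if shared:
                shared_ref = len(reference[e] & reference[other])
                scores.append(min(shared, shared_ref) / shared)
        total += sum(scores) / len(scores)
    return total / len(objects)


def fbcubed(clusters: Assignment, classes: Assignment) -> float:
    """Harmonic mean of extended BCubed precision and recall."""
    precision = _bcubed_average(clusters, classes)
    recall = _bcubed_average(classes, clusters)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Object 2 belongs to two clusters and to two classes.
clusters = {1: {"a"}, 2: {"a", "b"}, 3: {"b"}}
classes = {1: {"x"}, 2: {"x", "y"}, 3: {"y"}}
print(fbcubed(clusters, classes))
```

Because the pairwise score counts how many clusters and how many classes two objects share, an object placed in too few or too many clusters is penalised, which a partition-oriented F-measure cannot express.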
By
Olvera-López, J. Arturo; Martínez-Trinidad, J. Francisco; Carrasco-Ochoa, J. Ariel
1 Citations
In supervised classification, object selection (or instance selection) is an important task, mainly for instance-based classifiers, since it can reduce the time spent in the training and classification stages. In this work, we propose a new mixed data object selection method based on clustering and border objects. We carried out an experimental comparison between our method and other object selection methods using some mixed data classifiers.
By
Aroche-Villarruel, Argenis A.; Martínez-Trinidad, José Fco.; Carrasco-Ochoa, Jesús Ariel; Pérez-Suárez, Airel
1 Citations
DenStream is a data stream clustering algorithm that has been widely studied due to its ability to find clusters with arbitrary shapes and to deal with noisy objects. In this paper, we propose a different approach for pruning micro-clusters in DenStream. Unlike previously reported pruning strategies, our proposal introduces a different way of computing the micro-cluster radii and provides new options for the pruning stage of DenStream. From our experiments over public standard datasets we conclude that our approach improves the results obtained by DenStream.
By
Perez-Tellez, Fernando; Pinto, David; Cardiff, John; Rosso, Paolo
1 Citations
In recent years we have seen a vast increase in the volume of information published on weblog sites, as well as the creation of new web technologies where people discuss current events. The need for automatic tools to organize this massive amount of information is clear, but the particular characteristics of weblogs, such as shortness and overlapping vocabulary, make this task difficult. In this work, we present a novel methodology to cluster weblog posts according to the topics discussed therein. This methodology is based on a generative probabilistic model in conjunction with a Self-Term Expansion methodology. We present results that demonstrate a considerable improvement over the baseline.
By
Silva, Sara; Muñoz, Luis; Trujillo, Leonardo; Ingalalli, Vijay; Castelli, Mauro; Vanneschi, Leonardo
3 Citations
Classification is one of the most important machine learning tasks in science and engineering. However, it can be a difficult task, in particular when a high number of classes is involved. Genetic Programming (GP), despite its recognized success in many different domains, is one of the machine learning methods that typically struggle, and often fail, to provide accurate solutions for multiclass classification problems. We present a novel algorithm for tree-based GP that incorporates some ideas on the representation of the solution space in higher dimensions, and can be generalized to other types of GP. We test three variants of this new approach on a large set of benchmark problems from several different sources, and observe their competitiveness against the most successful state-of-the-art classifiers, such as Random Forests, Random Subspaces and Multilayer Perceptron.
By
Paz, Israel Tabarez; Hernández Gress, Neil; González Mendoza, Miguel
1 Citations
This manuscript focuses on some applications of the SpikeProp method for Spiking Neural Networks (SNN), using specific hardware for parallel programming in order to measure efficiency. We are interested in pattern recognition and clustering, which are the main problems that Artificial Neural Networks (ANN) are used to solve. As a result, we identify the considerations, limitations and advantages that have to be taken into account when applying SNN. The main advantage is that the range of applications can be extended to real problems, linear or nonlinear, with more than one attribute and a large volume of data. In contrast, other methods need a lot of memory to process the information, their computational complexity is proportional to the volume of data and the number of attributes, and the algorithm is more difficult to program for multiclass databases. However, the main limitation of SNN is convergence, which tends toward a local minimum. This implies a high dependency on the configuration and the proposed architecture. We programmed the SNN algorithm on an NVIDIA GeForce 9400 M GPU. On this GPU we had to reduce parallelism in order to increase the number of layers and neurons on the same hardware: although it provides 60000 threads, they were not enough. On the other hand, the divergence is reduced when the multiclass database is bigger.
By
Kuri-Morales, Angel
Given the present need for Customer Relationship and the growth in the size of databases, many new approaches to large database clustering and processing have been attempted. In this work we propose a methodology based on the idea that statistically proven search space reduction is possible in practice. Following a previous methodology, two clustering models are generated: one corresponding to the full data set and another pertaining to the sampled data set. The resulting empirical distributions were mathematically tested by applying an algorithmic verification.
By
Martínez, Eduardo D.; Fonseca, Juan P.; González, Víctor M.; Garduño, Guillermo; Huipet, Héctor H.
This study proposes an improvement to the Insight Centre for Data Analytics algorithm, which identifies the most relevant topics in a corpus of tweets and allows the construction of search rules for that topic or topics, in order to build a corpus of tweets for analysis. The proposed version shows an improvement of more than 14% in Purity and other metrics, and an execution time of 10% compared to Latent Dirichlet Allocation (LDA).
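Purity, the metric named in this abstract, has a standard definition: each cluster is credited with its most frequent true class, and the credited objects are divided by the total number of objects. A minimal sketch follows; the function name and the list-based inputs are illustrative assumptions, not the authors' code.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], class_labels: Sequence[Hashable]) -> float:
    """Fraction of objects that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one object is required")

    # Count the true classes observed inside each cluster.
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster[cluster][label] += 1

    # Credit each cluster with the size of its most frequent class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Two clusters of three tweets each, with hypothetical topic labels.
print(purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "c"]))
```

Purity rises trivially as the number of clusters grows, which is one reason it is usually reported alongside other metrics, as the abstract does.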
By
Astengo-Noguez, Carlos; Sánchez-Ante, Gildardo
Flock traffic navigation based on negotiation (FTN) is a new approach for solving traffic congestion problems in big cities. Earlier work assumes a navigation path based on a bone-structure made of the initial, ending and geometrical intersection points of two agents and their rational negotiations. In this paper we present original methods based on clustering analysis that allow other agents to enter or abandon flocks according to their own self-interest.
By
Lezama, Fernando; Rodríguez, Ansel Y.; Cote, Enrique Muñoz; Sucar, Luis Enrique
2 Citations
Electrical Load Pattern Shape (LPS) clustering of customers is an important part of the tariff formulation process. Nevertheless, the patterns describing the energy consumption of a customer have some characteristics (e.g., a high number of features corresponding to time series reflecting the measurements of a typical day) that make their analysis different from other pattern recognition applications. In this paper, we propose a clustering algorithm based on ant colony optimization (ACO) to solve the LPS clustering problem. We use four well-known clustering metrics (i.e., CDI, SI, DEV and CONN), showing that the selection of a clustering quality metric plays an important role in the LPS clustering problem. Also, we compare our LPS-ACO algorithm with traditional algorithms, such as k-means and single-linkage, and a state-of-the-art Electrical Pattern Ant Colony Clustering (EPACC) algorithm designed for this task. Our results show that LPS-ACO performs remarkably well using any of the metrics presented here.
By
Villarreal, Sara Elena Garza; Schaeffer, Satu Elisa
4 Citations
The structure of scientific collaboration networks provides insight into the relationships between people and disciplines. In this paper, we study a bipartite graph connecting authors to publications and extract from it clusters of authors and articles, interpreting the author clusters as research groups and the article clusters as research topics. Visualisations are proposed to ease the interpretation of such clusters in terms of discovering leaders, the activity level, and other semantic aspects. We discuss the process of obtaining and preprocessing the information from scientific publications, the formulation and implementation of the clustering algorithm, and the creation of the visualisations. Experiments on a test data set are presented, using an initial prototype implementation of the proposed modules.
By
Kuri-Morales, Angel; Trejo-Baños, Daniel; Cortes-Berrueco, Luis Enrique
1 Citations
The problem of finding clusters in arbitrary sets of data has been attempted using different approaches. In most cases, the use of metrics to determine the adequateness of the clusters is assumed. That is, the criterion yielding a measure of quality of the clusters depends on the distance between the elements of each cluster. Typically, one considers a cluster to be adequately characterized if the elements within a cluster are close to one another while, simultaneously, they appear to be far from those of different clusters. This intuitive approach fails if the variables of the elements of a cluster are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications where several variables (if not most of them) correspond to categories. The usual tendency is to assign arbitrary numbers to every category, that is, to encode the categories. This, however, may result in spurious patterns: relationships between the variables which are not really there at the outset. It is evident that no assignment can ensure a universally valid numerical value for this kind of variable. But there is a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that this is the best strategy in terms of the statistical behavior of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters. We contrast the characteristics of the clusters obtained from the automated method with those of the experts.
By
Rivera-García, Diego; García-Escudero, Luis Angel; Mayo-Iscar, Agustín; Ortega, Joaquín
In this work a robust clustering algorithm for stationary time series is proposed. The algorithm is based on the use of estimated spectral densities, which are considered as functional data, as the basic characteristic of stationary time series for clustering purposes. A robust algorithm for functional data is then applied to the set of spectral densities. Trimming techniques and restrictions on the scatter within groups reduce the effect of noise in the data and help to prevent the identification of spurious clusters. The procedure is tested in a simulation study, and is also applied to a real data set.
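As a rough illustration of using estimated spectral densities as the clustering feature, the sketch below computes a raw periodogram, normalises it to sum to one, and compares two series with the total variation distance. It assumes equal-length, non-constant series; the raw periodogram and the distance are simplifying choices made here (the abstract does not specify the estimator), and the robust functional clustering step itself is not shown.

```python
import cmath
import math
from typing import List, Sequence


def periodogram(series: Sequence[float]) -> List[float]:
    """Raw periodogram at the Fourier frequencies k = 1 .. n // 2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    values = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        values.append(abs(coeff) ** 2 / n)
    return values


def normalized_spectrum(series: Sequence[float]) -> List[float]:
    """Periodogram rescaled to sum to one (assumes a non-constant series)."""
    values = periodogram(series)
    total = sum(values)
    return [v / total for v in values]


def spectral_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Total variation distance between the normalised spectra of two series."""
    sa, sb = normalized_spectrum(a), normalized_spectrum(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(sa, sb))


n = 32
slow = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
slow_shifted = [math.sin(2 * math.pi * 4 * t / n + 1.0) for t in range(n)]
fast = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
print(spectral_distance(slow, slow_shifted), spectral_distance(slow, fast))
```

Two series with the same oscillation but different phase are close under this distance, while series dominated by different frequencies are far apart, which is the property that makes spectral densities a natural feature for stationary series.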
By
Kuri-Morales, Angel; Lozano, Alexis
The unsupervised learning process of identifying data clusters on large databases, in common use nowadays, requires an extremely costly computational effort. The analysis of a large volume of data makes it impossible to handle it in the computer’s main storage. In this paper we propose a methodology (henceforth referred to as "FDM" for fast data mining) to determine the optimal sample from a database according to the relevant information on the data, based on concepts drawn from the statistical theory of communication and L∞ approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those resulting from the original database. Data reduction is accomplished by determining the adequate number of instances required to preserve the information present in the population. Then, special effort is put into validating the obtained sample distribution through the application of classical nonparametric statistical tests and other tests based on the minimization of the approximation error of polynomial models.
By
Bernábe-Loranca, María Beatríz; Velazquez, Rogelio González; Analco, Martín Estrada; Ruíz-Vanoye, Jorge; Penna, Alejandro Fuentes; Sánchez, Abraham
1 Citations
The analytical observation of nature inspires new computational paradigms for creating algorithms that solve optimization and artificial intelligence problems. The artificial vision allows establishing a problem with intelligent techniques from living systems. Bio-inspired systems are presented as a set of models based on the behavior and the way of acting of some biological systems. These models can be expressed in data mining and operations research, where clustering is a recurrent technique in the P-median problem and territorial design. On this point, we have solved clustering problems using partitioning with bio-inspired aspects and variable neighborhood search to approximate optimal solutions. In this work we have improved the search strategy: we present a bio-inspired partitioning algorithm with optimization by tabu search (TS). This clustering problem under a bio-inspired connotation has been proposed after observing some characteristics in common between clustering and human behavior in conflict situations, where some characteristics have been modeled.
By
Ordoñez, Hugo; Torres-Jimenez, Jose; Ordoñez, Armando; Cobos, Carlos
Due to the large volume of business process repositories, manually finding a particular process or a subset of them based on similarities in functionality or activities may become a difficult task. This paper presents a search method and a cluster method for business process models. The search method considers linguistic and behavior information. The cluster method is based on covering arrays (a combinatorial object used to minimize the set of trials to find a particular structure). The cluster method avoids overlapping and improves the homogeneity of the groups created using a covering array. The results obtained outperform the precision, recall and F-measure of previously reported approaches.
By
Mejia, Ivan Ramirez; Batyrshin, Ildar
Similarity measures for binary variables are used in many problems of machine learning, pattern recognition and classification. Dozens of similarity measures have been introduced, which raises the problem of comparative analysis of these measures. One of the methods used for such analysis is clustering of similarity measures based on the correlation between data similarity values obtained by different measures. The paper proposes a method of comparative analysis of similarity measures based on the set-theoretic representation of these measures and a comparison of the algebraic properties of these representations. The results show an existing relationship between the results of clustering and the classification of measures by their properties. Because the results of clustering depend on the clustering method and on the data used for measuring the correlation between measures, we conclude that the classification based on the proposed properties of similarity measures is more suitable for comparative analysis of similarity measures.
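For readers unfamiliar with the set-theoretic view, the sketch below treats each binary vector as the set of positions where it equals 1 and derives three widely used measures (Jaccard, Sørensen-Dice and simple matching) from the resulting counts. The choice of these three measures and all function names are illustrative assumptions; the abstract does not list the measures the paper analyses.

```python
from typing import Sequence, Tuple


def contingency(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two equal-length binary vectors.

    With X and Y the sets of positions equal to 1:
    a = |X ∩ Y|, b = |X \\ Y|, c = |Y \\ X|, d = positions in neither set.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = len(x) - a - b - c
    return a, b, c, d


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c)  # assumes at least one vector contains a 1


def dice(x: Sequence[int], y: Sequence[int]) -> float:
    a, b, c, _ = contingency(x, y)
    return 2 * a / (2 * a + b + c)  # assumes at least one vector contains a 1


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)  # assumes non-empty vectors


x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(jaccard(x, y), dice(x, y), simple_matching(x, y))
```

Jaccard and Dice ignore joint absences (d) while simple matching counts them, which is the kind of algebraic property a classification of measures can be built on.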
By
Olvera-López, J. Arturo; Carrasco-Ochoa, J. Ariel; Martínez-Trinidad, J. Francisco
51 Citations
In supervised classification, a training set T is given to a classifier for classifying new prototypes. In practice, not all information in T is useful for classifiers; therefore, it is convenient to discard irrelevant prototypes from T. This process is known as prototype selection, which is an important task for classifiers since it can reduce the time needed for classification or training. In this work, we propose a new fast prototype selection method for large datasets, based on clustering, which selects border prototypes and some interior prototypes. Experimental results showing the performance of our method and comparing accuracy and runtimes against other prototype selection methods are reported.
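The idea of keeping border prototypes plus a few interior ones can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: the cluster assignment is taken as given (any clustering algorithm could produce it), a single-class cluster keeps only the object nearest its mean, and in a mixed cluster every object nominates its nearest neighbour of a different class as a border prototype. The function and parameter names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def select_prototypes(
    points: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    cluster_ids: Sequence[Hashable],
) -> List[int]:
    """Return the sorted indices of the retained prototypes."""
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    selected = set()
    for indices in members.values():
        if len({labels[i] for i in indices}) == 1:
            # Homogeneous cluster: keep one interior object, the one nearest the mean.
            dim = len(points[indices[0]])
            mean = [sum(points[i][d] for i in indices) / len(indices) for d in range(dim)]
            selected.add(min(indices, key=lambda i: math.dist(points[i], mean)))
        else:
            # Mixed cluster: keep objects lying next to another class (border objects).
            for i in indices:
                rivals = [j for j in indices if labels[j] != labels[i]]
                selected.add(min(rivals, key=lambda j: math.dist(points[i], points[j])))
    return sorted(selected)


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
labels = ["a", "a", "a", "a", "a", "b", "b"]
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
print(select_prototypes(points, labels, cluster_ids))
```

In this toy input the first cluster contributes only its central object, while the second contributes the two objects that face each other across the class boundary, so seven training objects shrink to three.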
By
Kuri-Morales, Angel; Rodríguez, Fátima
Given the present need for Customer Relationship and the growth in the size of databases, many new approaches to large database clustering and processing have been attempted. In this work we propose a methodology based on the idea that statistically proven search space reduction is possible in practice. Two clustering models are generated: one corresponding to the full data set and another pertaining to the sampled data set. The resulting empirical distributions were mathematically tested to verify a tight nonlinear significant approximation.
By
Pérez-Suárez, Airel; Martínez-Trinidad, José F.; Carrasco-Ochoa, Jesús A.
Clustering is a fundamental technique in data mining and pattern recognition, which has been successfully applied in several contexts. However, most of the clustering algorithms developed so far have focused only on organizing the collection of objects into a set of clusters, leaving the interpretation of those clusters to the user. Conceptual clustering algorithms, in addition to the list of objects belonging to the clusters, provide for each cluster one or several concepts as an explanation of the clusters. In this work, we present an overview of the most influential algorithms reported in the field of conceptual clustering, highlighting their limitations or drawbacks. Additionally, we present a taxonomy of these methods as well as a qualitative comparison of these algorithms with regard to a set of characteristics that are desirable from a practical point of view, which may help in the selection of the most appropriate method for solving a problem at hand. Finally, some research lines that need to be further developed in the context of conceptual clustering are discussed.
By
Quiñones-Grueiro, Marcos; Verde, Cristina; Llanes-Santiago, Orestes
A novel leak location approach for large-scale water distribution networks (WDNs) is discussed in this paper. The location task is formulated as a classification problem, and it is simplified by applying a clustering strategy. Data from each class are formed by measurements associated with leakages that occur within a specific zone of the WDN. A zone is defined as a set of nodes that share similar topological properties. Therefore, clustering is performed for network partitioning. Sensors are then placed within the network to maximize leak detection coverage, and the data of each class are generated by using the EPANET hydraulic simulator. The robustness of the proposal is demonstrated for different kinds of uncertainties and measurement noise. A real-life network is used as a case study with synthetically generated field data. The proposal achieves an improved performance for the different scenarios in comparison with the node location approach.
By
Rubio, José de Jesús; Pacheco, Jaime
14 Citations
In this paper, we propose an online clustering fuzzy neural network. The proposed neural fuzzy network uses online clustering to train the structure, the gradient to train the parameters of the hidden layer, and the Kalman filter algorithm to train the parameters of the output layer. In our algorithm, structure learning and parameter learning are updated at the same time; we make no distinction between structure learning and parameter learning. The center of each rule is updated in each iteration so that it stays near the incoming data. In this way, the algorithm does not need to generate a new rule in each iteration, i.e., it neither generates many rules nor needs to prune the rules. We prove the stability of the algorithm.
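The remark that rule centers follow the incoming data, so that few new rules are created, can be illustrated with a generic sequential clustering update. The sketch below is not the paper's algorithm: the class name, the `radius` threshold and the `learning_rate` are hypothetical parameters chosen for illustration, and the gradient and Kalman filter training steps are omitted.

```python
import math
from typing import List, Sequence


class OnlineClusterer:
    """Sequential clustering: move the nearest center toward each new sample."""

    def __init__(self, radius: float, learning_rate: float = 0.1) -> None:
        self.radius = radius                # hypothetical distance threshold
        self.learning_rate = learning_rate  # hypothetical step size in (0, 1]
        self.centers: List[List[float]] = []

    def update(self, sample: Sequence[float]) -> int:
        """Process one sample and return the index of the center that absorbed it."""
        x = list(sample)
        if self.centers:
            nearest = min(
                range(len(self.centers)),
                key=lambda i: math.dist(self.centers[i], x),
            )
            if math.dist(self.centers[nearest], x) <= self.radius:
                # Shift the existing center toward the sample instead of adding a rule.
                center = self.centers[nearest]
                self.centers[nearest] = [
                    c + self.learning_rate * (v - c) for c, v in zip(center, x)
                ]
                return nearest
        # No center is close enough: create a new one at the sample.
        self.centers.append(x)
        return len(self.centers) - 1


clusterer = OnlineClusterer(radius=1.0, learning_rate=0.5)
for sample in [(0.0, 0.0), (0.4, 0.0), (5.0, 5.0)]:
    clusterer.update(sample)
print(clusterer.centers)
```

Because a nearby sample only nudges an existing center, the number of centers grows with the spread of the data rather than with the number of samples, which removes the need for a separate pruning step in this simplified setting.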
By
Rivera-García, Diego; García-Escudero, Luis A.; Mayo-Iscar, Agustín; Ortega, Joaquín
Many clustering algorithms for data that are curves or functions have recently been proposed. However, the presence of contamination in the sample of curves can influence the performance of most of them. In this work we propose a robust, model-based clustering method that relies on an approximation to the “density function” for functional data. The robustness follows from the joint application of data-driven trimming, for reducing the effect of contaminated observations, and constraints on the variances, for avoiding spurious clusters in the solution. The algorithm is designed to perform clustering and outlier detection simultaneously by maximizing a trimmed “pseudo” likelihood. The proposed method has been evaluated and compared with other existing methods through a simulation study. Better performance for the proposed methodology is shown when a fraction of contaminating curves is added to a non-contaminated sample. Finally, an application to a real data set that has been previously considered in the literature is given.
