Showing 1 to 17 of 17 matching Articles
By
Aroche-Villarruel, Argenis A.; Martínez-Trinidad, José Fco.; Carrasco-Ochoa, Jesús Ariel; Pérez-Suárez, Airel
1 Citation
DenStream is a data stream clustering algorithm that has been widely studied due to its ability to find clusters with arbitrary shapes and to deal with noisy objects. In this paper, we propose a different approach for pruning micro-clusters in DenStream. Unlike previously reported pruning schemes, our proposal introduces a different way of computing the micro-cluster radii and provides new options for the pruning stage of DenStream. From our experiments on public standard datasets, we conclude that our approach improves the results obtained by DenStream.
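The micro-cluster radius that such pruning decisions hinge on is conventionally derived from cluster-feature statistics (point count, linear sum and squared sum). A minimal sketch of that standard computation for one-dimensional points, assuming unweighted one-pass maintenance (names are illustrative, not the paper's modified radius):

```python
import math

class MicroCluster:
    """Maintains the cluster-feature triplet (n, LS, SS) for 1-D points."""
    def __init__(self):
        self.n = 0      # number of absorbed points
        self.ls = 0.0   # linear sum of points
        self.ss = 0.0   # squared sum of points

    def absorb(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def radius(self):
        # Standard CF-based radius: the points' standard deviation
        # around the micro-cluster centre.
        mean = self.ls / self.n
        return math.sqrt(max(self.ss / self.n - mean * mean, 0.0))

mc = MicroCluster()
for x in (1.0, 2.0, 3.0):
    mc.absorb(x)
print(round(mc.radius(), 4))  # std dev of {1, 2, 3} around mean 2
```

Because only the triplet is stored, a point can be absorbed in O(1) without keeping the raw stream.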
By
Febrer-Hernández, José Kadir; Hernández-Palancar, José; Hernández-León, Raudel; Feregrino-Uribe, Claudia
In this paper, we propose a novel algorithm for mining frequent sequences, called SPaMi-FTS (Sequential Pattern Mining based on Frequent Two-Sequences). SPaMi-FTS introduces a new data structure to store the frequent sequences, which, together with a new pruning strategy to reduce the number of candidate sequences and a new heuristic to generate them, increases the efficiency of frequent sequence mining. The experimental results show that the SPaMi-FTS algorithm outperforms the main algorithms reported for discovering frequent sequences.
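The algorithm's starting point, the set of frequent two-sequences, can be obtained with a single counting pass over the sequence database. A toy illustration of that first step (not the SPaMi-FTS data structure itself; support here is the number of sequences containing the ordered pair):

```python
from collections import Counter

def frequent_two_sequences(db, min_sup):
    """Count ordered item pairs <a, b> (a occurring before b) per
    sequence, keeping those that meet the absolute support threshold."""
    counts = Counter()
    for seq in db:
        seen = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                seen.add((a, b))
        counts.update(seen)  # count sequences, not occurrences
    return {p for p, c in counts.items() if c >= min_sup}

db = [list("abc"), list("acb"), list("ab")]
print(sorted(frequent_two_sequences(db, 2)))  # [('a', 'b'), ('a', 'c')]
```

Pairs that fail the threshold here can never extend into longer frequent sequences, which is what makes the two-sequence set a useful pruning seed.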
By
Rodríguez-González, Ansel Yoan; Martínez-Trinidad, José Francisco; Carrasco-Ochoa, Jesús Ariel; Ruiz-Shulcloper, José
10 Citations
Most current algorithms for mining frequent patterns assume that two object sub-descriptions are similar if they are equal, but in many real-world problems other ways of evaluating similarity are used. Recently, three algorithms (ObjectMiner, STreeDC-Miner and STreeNDC-Miner) for mining frequent patterns allowing similarity functions other than equality have been proposed. For searching frequent patterns, ObjectMiner and STreeDC-Miner use a pruning property called the Downward Closure property, which must be met by the similarity function. For similarity functions that do not meet this property, the STreeNDC-Miner algorithm was proposed. However, for searching frequent patterns, this algorithm explores all subsets of features, which can be very expensive. In this work, we propose a frequent similar pattern mining algorithm for similarity functions that do not meet the Downward Closure property; it is faster than STreeNDC-Miner and loses fewer frequent similar patterns than ObjectMiner and STreeDC-Miner. We also evaluate the quality of the set of frequent similar patterns computed by our algorithm, with respect to the quality of the sets computed by the other algorithms, in a supervised classification context.
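The Downward Closure property referred to is the usual anti-monotonicity of support: a pattern can be frequent only if all of its sub-patterns are, so a level-wise search may prune any candidate with an infrequent subset. A minimal Apriori-style illustration with plain equality as the similarity function (names and data are illustrative):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Level-wise search that prunes any candidate having an
    infrequent subset -- valid only when downward closure holds."""
    items = {i for t in transactions for i in t}
    frequent, level = [], [frozenset([i]) for i in sorted(items)]
    while level:
        survivors = [c for c in level
                     if sum(c <= t for t in transactions) >= min_sup]
        frequent += survivors
        keep = set(survivors)
        # Candidate generation plus downward-closure pruning.
        level = sorted({a | b for a in keep for b in keep
                        if len(a | b) == len(a) + 1
                        and all(frozenset(s) in keep
                                for s in combinations(a | b, len(a)))},
                       key=sorted)
    return frequent

tx = [frozenset("ab"), frozenset("abc"), frozenset("ac")]
print([sorted(f) for f in frequent_itemsets(tx, 2)])
```

When the similarity function breaks this property (the case the paper targets), a pruned candidate might still be frequent, which is exactly why equality-style pruning can lose patterns.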
By
Bustio, Lázaro; Cumplido, René; Hernández, Raudel; Bande, José M.; Feregrino, Claudia
Data streams are unbounded, infinite flows of data arriving at high rates that cannot be stored for offline processing. Because of this, classical Data Mining approaches cannot be used straightforwardly in data stream scenarios. This paper introduces a single-pass hardware-based algorithm for frequent itemset mining on data streams that uses the top-k frequent 1-itemsets. Experimental results of the hardware implementation of the proposed algorithm are also presented and discussed.
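In software terms, the top-k frequent 1-itemsets the algorithm relies on can be maintained with one counting pass over the transaction stream; this sketch shows only that idea, not the hardware design:

```python
from collections import Counter

def top_k_single_pass(stream, k):
    """One pass over the transaction stream, counting per-transaction
    item occurrences and keeping the k most frequent 1-itemsets."""
    counts = Counter()
    for transaction in stream:
        counts.update(set(transaction))  # count each item once per transaction
    return [item for item, _ in counts.most_common(k)]

stream = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["b"]]
print(top_k_single_pass(stream, 2))  # "a" and "b" each appear in 3 transactions
```

The single-pass constraint is what makes the problem stream-friendly: no transaction is ever revisited.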
By
Hernández-León, Raudel; Hernández-Palancar, José; Carrasco-Ochoa, Jesús Ariel; Martínez-Trinidad, José Francisco
In this paper, we propose two improvements to the CAR-NF classifier, a classifier based on Class Association Rules (CARs). The first is a theoretical proof that allows selecting, independently of the dataset, the minimum Netconf threshold that avoids ambiguity at the classification stage. The second is a new coverage criterion that aims to reduce the number of unseen transactions left uncovered during the classification stage. Experiments over several datasets show that the improved classifier, called CAR-NF^{+}, beats the best reported CAR-based classifiers, including the original CAR-NF classifier.
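Netconf, the interest measure the classifier thresholds on, is computed from the relative supports of a rule's antecedent, consequent and their union. A minimal sketch, assuming the usual Netconf definition (the paper's threshold proof is not reproduced here):

```python
def netconf(sup_xy, sup_x, sup_y):
    """Netconf(X -> Y) = (sup(XY) - sup(X)*sup(Y)) / (sup(X) * (1 - sup(X))).
    All supports are relative frequencies in (0, 1)."""
    return (sup_xy - sup_x * sup_y) / (sup_x * (1.0 - sup_x))

# Rule X -> Y where X covers half the transactions, Y covers 60%,
# and they co-occur in 40%: a positively correlated rule.
print(round(netconf(0.4, 0.5, 0.6), 2))  # 0.4
```

A value of 0 means X and Y are independent; negative values indicate the rule is misleading, which is why a minimum threshold matters.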
By
López Chau, Asdrúbal; Li, Xiaoou; Yu, Wen; Cervantes, Jair; Mejía-Álvarez, Pedro
2 Citations
Border points are instances located at the outer margin of dense clusters of samples. Detecting them is important in many areas such as data mining, image processing, robotics, geographic information systems and pattern recognition. In this paper we propose a novel method to detect border samples. The proposed method uses a discretization and works on partitions of the set of points; the border samples are then detected by applying an algorithm similar to the one presented in [8] to the sides of convex hulls. We apply the novel algorithm to a classification task in data mining; experimental results show the effectiveness of our method.
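The convex-hull step is easy to picture on its own: within a partition, the hull vertices are natural border candidates, while interior points are not. A self-contained monotone-chain sketch (the discretization and the algorithm from [8] are omitted):

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# The interior point (1, 1) is not a border candidate.
cloud = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull(cloud))
```

Working per partition rather than on the whole cloud is what lets border points of concave regions survive, since a single global hull would miss them.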
By
Canales, Diana; Hernandez-Gress, Neil; Akella, Ram; Perez, Ivan
The prevalence of type 2 Diabetes Mellitus (T2DM) has reached critical proportions globally over the past few years. Diabetes can cause devastating personal suffering, and its treatment represents a major economic burden for every country around the world. To properly guide effective actions and measures, the present study aims to examine the profile of the diabetic population in Mexico. We used the Karhunen-Loève transform, a form of principal component analysis, to identify the factors that contribute to T2DM. The results revealed a unique profile of patients who cannot control this disease. Results also demonstrated that, compared to young patients, old patients tend to have better glycemic control. Statistical analysis reveals patient profiles and their health results and identifies the variables that measure overlapping health issues as reported in the database (i.e., collinearity).
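For two variables, the first principal direction of the Karhunen-Loève transform has a closed form in terms of the covariance entries, which makes the technique easy to sketch with the standard library (toy data, not the study's):

```python
import math

def principal_angle(xs, ys):
    """Angle of the first principal component of 2-D data, via the
    closed form theta = 0.5 * atan2(2*c_xy, c_xx - c_yy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / n
    cyy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return 0.5 * math.atan2(2 * cxy, cxx - cyy)

# Perfectly correlated toy data: the principal axis is the 45-degree line.
xs, ys = [0, 1, 2, 3], [0, 1, 2, 3]
print(round(math.degrees(principal_angle(xs, ys)), 1))  # 45.0
```

Strongly collinear variables load onto the same component, which is how the transform surfaces the overlapping health measures the abstract mentions.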
By
Hernández-León, Raudel; Hernández-Palancar, José; Carrasco-Ochoa, J. A.; Martínez-Trinidad, J. Fco.
Frequent Itemset (FI) Mining is one of the most researched areas of data mining. When new transactions are appended to, deleted from or modified in a dataset, updating the FI is a non-trivial task, since such updates may invalidate existing FI or introduce new ones. In this paper, a novel algorithm suitable for FI mining in dynamic datasets, named Incremental Compressed Arrays, is presented. In the experiments, our algorithm was compared against algorithms such as Eclat, PatriciaMine and FP-growth when new transactions are added or deleted.
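The incremental setting can be pictured as maintaining itemset support counts under transaction insertions and deletions, and re-deriving the frequent itemsets from the updated counts instead of re-mining from scratch. A toy count-maintenance sketch (limited to itemsets of size at most 2; not the Incremental Compressed Arrays structure):

```python
from collections import Counter
from itertools import combinations

class IncrementalCounts:
    """Exact support counts of all itemsets up to size 2,
    updated as transactions are added or deleted."""
    def __init__(self):
        self.sup = Counter()

    def _subsets(self, t):
        t = sorted(set(t))
        for k in (1, 2):
            yield from combinations(t, k)

    def add(self, t):
        for s in self._subsets(t):
            self.sup[s] += 1

    def delete(self, t):
        for s in self._subsets(t):
            self.sup[s] -= 1

    def frequent(self, min_sup):
        return {s for s, c in self.sup.items() if c >= min_sup}

ic = IncrementalCounts()
ic.add(["a", "b"]); ic.add(["a", "b", "c"]); ic.add(["a", "c"])
ic.delete(["a", "b"])  # dataset shrinks; ("a", "b") drops below support 2
print(sorted(ic.frequent(2)))
```

A deletion can silently demote an itemset that was frequent, which is the "non-trivial" part the abstract refers to: the update cost, not the final answer, is what dedicated incremental structures optimize.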
By
Febrer-Hernández, José K.; Hernández-León, Raudel; Hernández-Palancar, José; Feregrino-Uribe, Claudia
In this paper, we propose some improvements to Sequential Pattern-based Classifiers. First, we introduce a new pruning strategy, using the Netconf as the measure of interest, that prunes the rule search space to build specific rules with high Netconf. Additionally, a new way of ordering the set of rules, based on their sizes and Netconf values, is proposed. The ordering strategy, together with the “Best K rules” satisfaction mechanism, obtains better accuracy than the SVM, J48, Naive Bayes and PART classifiers over three document collections.
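The ordering plus "Best K rules" idea can be pictured as sorting the rules that match a transaction by Netconf (with size as tie-breaker) and letting the top K vote. A schematic sketch with made-up rules and labels, not the paper's exact ordering:

```python
from collections import Counter

def classify(rules, transaction, k):
    """rules: (antecedent_set, class_label, netconf, size) tuples.
    Order matching rules by Netconf descending, larger size first,
    then take a majority vote among the best K."""
    matching = [r for r in rules if r[0] <= transaction]
    matching.sort(key=lambda r: (-r[2], -r[3]))
    votes = Counter(label for _, label, _, _ in matching[:k])
    return votes.most_common(1)[0][0] if votes else None

rules = [
    (frozenset({"w1"}), "sports", 0.9, 1),
    (frozenset({"w1", "w2"}), "politics", 0.8, 2),
    (frozenset({"w3"}), "politics", 0.7, 1),
]
print(classify(rules, {"w1", "w2", "w3"}, k=3))  # politics: 2 votes vs 1
```

Voting over K rules rather than firing only the single best rule makes the decision less sensitive to one over-confident rule.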
By
Kuri-Morales, Angel; Trejo-Baños, Daniel; Cortes-Berrueco, Luis Enrique
1 Citation
The problem of finding clusters in arbitrary sets of data has been attempted using different approaches. In most cases, the use of metrics to determine the adequacy of said clusters is assumed. That is, the criteria yielding a measure of cluster quality depend on the distance between the elements of each cluster. Typically, one considers a cluster adequately characterized if the elements within it are close to one another while, simultaneously, they appear to be far from those of different clusters. This intuitive approach fails if the variables of the elements of a cluster are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications where several variables (if not most of them) correspond to categories. The usual tendency is to assign arbitrary numbers to every category: to encode the categories. This, however, may result in spurious patterns: relationships between the variables which are not really there at the outset. It is evident that there is no truly valid assignment which may ensure a universally valid numerical value for this kind of variable. But there is a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that it is the best strategy in terms of the statistical behavior of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters. We contrast the characteristics of the clusters obtained from the automated method with those found by the experts.
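The spurious-pattern problem described above is easy to reproduce: two equally arbitrary integer encodings of the same categorical variable induce different pairwise distances, and hence different apparent cluster structure. A small demonstration (the paper's actual unbiased encoding strategy is not shown here):

```python
def pairwise_dists(values, encoding):
    """Absolute distances between encoded category values."""
    xs = [encoding[v] for v in values]
    return [abs(a - b) for i, a in enumerate(xs) for b in xs[i + 1:]]

cats = ["red", "green", "blue"]
enc1 = {"red": 0, "green": 1, "blue": 2}  # arbitrary assignment #1
enc2 = {"red": 0, "green": 2, "blue": 1}  # arbitrary assignment #2

# Same data, different "geometry": a distance-based clusterer would see
# different nearest neighbours depending on an arbitrary choice.
print(pairwise_dists(cats, enc1), pairwise_dists(cats, enc2))
```

Since neither encoding is more justified than the other, any cluster structure that changes between them is an artifact of the encoding, which is precisely the bias the paper's strategy aims to avoid.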
By
Franco-Arcega, Anilu; Franco-Sánchez, Kristell D.; Castro-Espinoza, Felix A.; García-Islas, Luis H.
Nowadays, Data Mining has been successfully applied to several fields such as business administration, marketing and sales, diagnostics, manufacturing processes and astronomy. One area where Data Mining has not been widely exploited is the solution of social problems, where making effective decisions is essential to offering better social programs. In particular, this paper presents an analysis of Migration, an important social phenomenon that affects cultural, economic, ideological and demographic aspects of society, among others. The paper is based on an experiment with data processing and clustering analysis of demographic factors related to migration in the State of Hidalgo, Mexico. This study reveals the character and description of the clusters obtained with data mining techniques. The knowledge from this characterization is potentially useful to government and social service agencies in the State of Hidalgo for the creation of specific social programs that might be devised to mitigate the migration of the population.
By
Cirett-Galán, Federico; Torres-Peralta, Raquel; Beal, Carole R.
Answering any test represents a challenge for students; however, foreign students whose first language is not English must also deal with the difficulty of understanding a series of questions written in a different language, in addition to the effort required to solve the problem. In this study, we recorded the brain signals of 16 students, 10 whose first language was English and 6 who were English learners, and used two supervised classification algorithms to identify the students' language proficiency. The results showed that, in both approaches, harder problems that required longer response times had a higher accuracy rate; however, more tests are needed to understand the processing of written math text problems and the differences between both groups.
By
Kuri-Morales, Angel; Lozano, Alexis
The unsupervised learning process of identifying data clusters on the large databases in common use nowadays requires an extremely costly computational effort. The analysis of a large volume of data makes it impossible to handle in the computer's main storage. In this paper we propose a methodology (henceforth referred to as "FDM", for fast data mining) to determine the optimal sample from a database according to the relevant information in the data, based on concepts drawn from the statistical theory of communication and L_{∞} approximation theory. The methodology achieves significant data reduction on real databases and yields cluster models equivalent to those resulting from the original database. Data reduction is accomplished by determining the adequate number of instances required to preserve the information present in the population. Special effort is then put into validating the obtained sample distribution through the application of classical non-parametric statistical tests and other tests based on minimizing the approximation error of polynomial models.
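One classical non-parametric check of the kind mentioned, the two-sample Kolmogorov-Smirnov statistic, compares the empirical distribution of the sample against that of the population. A stdlib-only sketch of the statistic itself (a full test would also compare it against a critical value):

```python
def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    grid = sorted(set(a) | set(b))

    def ecdf(xs, v):
        # Fraction of xs less than or equal to v.
        return sum(x <= v for x in xs) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in grid)

population = [1, 2, 3, 4, 5, 6, 7, 8]
good_sample = [2, 4, 6, 8]  # evenly thinned, so the CDFs stay close
print(round(ks_statistic(population, good_sample), 3))  # 0.125
```

A small statistic means the reduced sample preserves the population's distribution, which is the property the validation step is after.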
By
Fuentes-Cabrera, José; Pérez-Vicente, Hugo
1 Citation
In this paper we present the development of a credit score model for payroll issuers based on a credit scoring methodology. In the Mexican banking system, it is common to provide and administer payroll services for companies via third parties (outsourcing). This service allows employees to get payroll loans whose periodic payments are retained automatically by the creditor. However, if the employee's relationship with the company is lost, the payment is omitted, increasing the risk of default. To address this problem, a statistical model was built to predict whether a payroll issuer will churn in the next six months; this allows the decision maker to determine appropriate business retention actions in order to avoid future loan payment losses. Results showed that the developed model facilitates a practical interpretation based on a scoring system and remained stable when implemented.
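Credit-scoring methodologies typically map a model's predicted odds onto an interpretable point scale via "points to double the odds" scaling. A generic sketch of that standard convention (the anchor values below are illustrative, not the paper's):

```python
import math

def score_from_odds(odds, base_score=600, base_odds=50, pdo=20):
    """Scorecard scaling: base_score points at base_odds (good:bad),
    plus pdo points every time the odds double."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return offset + factor * math.log(odds)

print(round(score_from_odds(50)))   # 600: the anchor point
print(round(score_from_odds(100)))  # 620: odds doubled -> +PDO points
```

This linear-in-log-odds form is what gives scorecards their practical interpretation: each point increment corresponds to a fixed multiplicative change in risk.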
By
Salinas, José Gerardo Moreno; Stephens, Christopher R.
3 Citations
Distance learning is now a key component of higher education. Given the high dropout rates and the important investments in distance learning, it is of utmost concern to determine the data most critical to the success and failure of students. In this article we mine enrollment profiles, educational background and student data from the Open University System and Distance Learning of the National Autonomous University of Mexico to determine the key factors that drive success and failure, creating a predictive model using a Naive Bayes classifier. We found that the number of subjects passed and the average grade in the first semester are among the most interesting predictors of student success.
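A categorical Naive Bayes classifier of the kind used can be sketched in a few lines: class priors multiplied by per-feature likelihoods, with Laplace smoothing for unseen values. The feature values and labels below are hypothetical, not the study's data:

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels):
    """rows: lists of categorical feature values; returns a predictor
    scoring classes by log-prior plus smoothed log-likelihoods."""
    prior = Counter(labels)
    like = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            like[(y, i)][v] += 1
    classes, n = sorted(prior), len(labels)

    def predict(row):
        def logp(y):
            s = math.log(prior[y] / n)
            for i, v in enumerate(row):
                c = like[(y, i)]
                # Laplace smoothing so unseen values never zero out a class.
                s += math.log((c[v] + 1) / (sum(c.values()) + len(c) + 1))
            return s
        return max(classes, key=logp)
    return predict

# Hypothetical records: (passed most first-semester subjects?, grade band).
rows = [("yes", "high"), ("yes", "mid"), ("no", "low"), ("no", "mid")]
labels = ["graduates", "graduates", "drops_out", "drops_out"]
predict = train_nb(rows, labels)
print(predict(("yes", "high")))
```

The conditional-independence assumption keeps training to simple counting, which suits tabular enrollment data well.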
By
Vargas, Víctor Manuel Corza; Stephens, Christopher R.; Martínez, Gerardo Eugenio Sierra; Rendón, Azucena Montes
Data Mining represents the cutting edge of information extraction; however, it always implies a considerable expense, given that it needs “structured data”. Following this idea, text mining appears on the horizon as a low-cost, reliable alternative. It is able to provide meaningful expert information without requiring plenty of resources; all we need is a fairly big corpus of text in order to conduct research on almost any topic. By themselves, both approaches provide valuable information in the end; nevertheless, what would happen if both processes were linked in such a way that one approach’s results could be verified by the results of the second? With this idea in mind, we rely on one hypothesis: it is possible to create a bond between both mining processes and to use them back and forth to verify one another. Hence, we describe both methodologies thoroughly, with special emphasis on the phases that tend to establish a strong bond between them. We found that bond in the fact that, once Natural Language Processing has been performed on the chosen corpora, the output is a list of meaningful nouns which can be used as features that reliably guide a data mining process.
