Showing 1 to 10 of 1840 matching Articles
By
Turunen, Esko
A natural interpretation of GUHA-style data mining logic in a paraconsistent fuzzy logic framework is introduced. The significance of this interpretation is discussed.
By
Kurban, Hasan; Jenne, Mark; Dalkilic, Mehmet M.
2 Citations
Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the expectation maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a nonlinear hierarchical data structure (heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T, EM*. We show that our EM* algorithm outperforms the EM-T algorithm on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
By
Chatzigeorgakidis, Georgios; Karagiorgou, Sophia; Athanasiou, Spiros; Skiadopoulos, Spiros
Efficient management and analysis of large volumes of data is a demanding task of increasing scientific and industrial importance, as the ubiquitous generation of information governs more and more aspects of human life. In this article, we introduce FML-kNN, a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink. The framework’s core consists of a k-nearest neighbor joins algorithm which, contrary to similar approaches, is executed in a single distributed session and is able to operate on very large volumes of data of variable granularity and dimensionality. We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented in the Apache Hadoop, Spark, and Flink distributed processing engines. The results indicate an overall superiority of our framework in all the performed comparisons. Further, we apply FML-kNN in two motivating use cases for water demand management, against real-world domestic water consumption data. In particular, we focus on forecasting water consumption using 1-h smart meter data, and on extracting consumer characteristics from water use data in the shower. We further discuss the obtained results, demonstrating the framework’s potential for useful knowledge extraction.
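To make "probabilistic classification" with k-nearest neighbors concrete, here is a minimal single-machine sketch. It is not the FML-kNN implementation and does not use Apache Flink; the function name `knn_class_probabilities`, the Euclidean distance, and the toy water-use labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_class_probabilities(train, query, k):
    """Estimate class probabilities as label fractions among the k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance is
    an assumption here; the abstract does not state which metric is used.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}


# Hypothetical toy data: two features per sample, labels "low"/"high" consumption.
train = [
    ((0.0, 0.0), "low"),
    ((0.1, 0.2), "low"),
    ((0.2, 0.1), "low"),
    ((1.0, 1.0), "high"),
    ((0.9, 1.1), "high"),
]

probs_k3 = knn_class_probabilities(train, (0.15, 0.15), k=3)
probs_k5 = knn_class_probabilities(train, (0.15, 0.15), k=5)
```

With `k=3` only the three nearby "low" samples vote; with `k=5` every sample votes, so the estimate becomes a 3/5 versus 2/5 split.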
By
Chi, Yang; Zhu, Jinchao; Huag, Lan; Xu, Hao
Scientific retrieval systems need domain search terms to find publications. However, because they are expressed in natural language, the search terms users provide are often fuzzy and limited, and some relevant terms are overlooked. Meanwhile, users want to be given domain-related keywords that suggest what other terms they could search for. This paper presents a concept recommendation model for scientific paper retrieval, in which concepts are extracted from the keywords of scientific papers, and data mining algorithms are used to calculate the similarity between search terms and concepts and to make recommendations to users. The model is simple and can be used with a small dataset, since all training data comes from paper metadata, which is easy to acquire. Experimental results show good precision, which indicates that this research not only simplifies the search step and improves search quality for users, but also lays the foundation for semantic search.
By
Kholod, Ivan; Petukhov, Ilya
5 Citations
The article describes an extension of λ-calculus for the creation of parallel data mining algorithms. The proposed approach represents the algorithm as a sequence of pure functions with unified interfaces. For parallel execution, we use a special function that makes it possible to change the structure of the algorithm and to implement various strategies for processing the data set and the model.
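The idea of an algorithm built from pure functions with one shared interface can be sketched roughly as follows. This is an illustrative analogue, not the authors' formalism: the interface `step(model, data) -> model`, the names `run_sequential` and `run_parallel`, and the data-partitioning strategy are all assumptions. The starting model is also assumed to be a neutral element for `merge`, which is why an empty dict is used.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_items(model, data):
    """A pure step: returns a new model with item counts added, mutating nothing."""
    new_model = dict(model)
    for item in data:
        new_model[item] = new_model.get(item, 0) + 1
    return new_model


def merge_counts(a, b):
    """Combine two count models into a new one."""
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def run_sequential(steps, model, data):
    """Apply each step in order, threading the model through."""
    return reduce(lambda m, step: step(m, data), steps, model)


def run_parallel(steps, model, partitions, merge):
    """Hypothetical 'special function': run the same steps on each data
    partition concurrently, then merge the partial models."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda part: run_sequential(steps, model, part), partitions))
    return reduce(merge, partials)


data = ["a", "b", "a", "c"]
sequential_model = run_sequential([count_items], {}, data)
parallel_model = run_parallel([count_items], {}, [data[:2], data[2:]], merge_counts)
```

Because the steps have no side effects, switching between the sequential and the partitioned strategy changes only the driver function, not the steps themselves.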
By
Kikas, Riivo; Dumas, Marlon; Karsai, Márton
3 Citations
In this study we analyze the dynamics of the contact list evolution of millions of users of the Skype communication network. We find that egocentric networks evolve heterogeneously in time as events of edge additions and deletions of individuals are grouped in long bursty clusters, which are separated by long inactive periods. We classify users by their link creation dynamics and show that bursty peaks of contact additions are likely to appear shortly after user account creation. We also study possible relations between bursty contact addition activity and other user-initiated actions like free and paid service adoption events. We show that bursts of contact additions are associated with increases in activity and adoption—an observation that can inform the design of targeted marketing tactics.
By
Nguyen, Thanh-Tung; Zhao, He; Huang, Joshua Zhexue; Nguyen, Thuy Thi; Li, Mark Junjie
2 Citations
Random forest (RF) models have been proven to perform well in both classification and regression. However, with the randomizing mechanism in both bagging samples and feature selection, the performance of RF can deteriorate when applied to high-dimensional data. In this paper, we propose a new approach to feature sampling for RF to deal with high-dimensional data. We first apply the $p$-value to assess feature importance and find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using some statistical measures. When sampling the feature subspace for learning RFs, features from the three groups are taken into account. The new subspace sampling method maintains the diversity and the randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem for robustness towards outliers. The experimental results demonstrated that the proposed approach for learning random forests significantly reduced prediction errors and outperformed most existing random forests when dealing with high-dimensional data.
By
Yun, Ching-Huang; Chuang, Kun-Ta; Chen, Ming-Syan
1 Citation
In this paper, we devise an efficient algorithm for clustering market-basket data items. Market-basket data analysis has been well addressed in mining association rules for discovering the set of large items, which are the frequently purchased items among all transactions. In essence, clustering is meant to divide a set of data items into some proper groups in such a way that items in the same group are as similar to one another as possible. In view of the nature of clustering market-basket data, we present a measurement, called the small-large (SL) ratio, which is in essence the ratio of the number of small items to that of large items. Clearly, the smaller the SL ratio of a cluster, the more similar to one another the items in the cluster are. Then, by utilizing a self-tuning technique for adaptively tuning the input and output SL ratio thresholds, we develop an efficient clustering algorithm, algorithm STC (standing for Self-Tuning Clustering), for clustering market-basket data. The objective of algorithm STC is “Given a database of transactions, determine a clustering such that the average SL ratio is minimized.” We conduct several experiments on real data and a synthetic workload for performance studies. Our experimental results show that by utilizing the self-tuning technique to adaptively tune the input and output SL ratio thresholds, algorithm STC performs very well. Specifically, algorithm STC not only incurs an execution time that is significantly smaller than that of prior works but also leads to clustering results of very good quality.
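As a rough illustration of the SL ratio, the sketch below counts "small" and "large" items in a cluster by their support (the fraction of the cluster's transactions containing the item). The abstract does not say how small and large items are delimited, so the two thresholds `small_max` and `large_min`, the function name `sl_ratio`, and the toy baskets are assumptions; this is not algorithm STC itself.

```python
from collections import Counter


def sl_ratio(cluster, small_max, large_min):
    """Ratio of the number of small items to the number of large items.

    An item is treated as large if its support in the cluster is at least
    `large_min`, and as small if its support is at most `small_max`.
    Both thresholds are assumptions made for this sketch.
    """
    n = len(cluster)
    support = Counter(item for transaction in cluster for item in set(transaction))
    small = sum(1 for count in support.values() if count / n <= small_max)
    large = sum(1 for count in support.values() if count / n >= large_min)
    return small / large if large else float("inf")


# Hypothetical baskets: the first cluster is more homogeneous than the second.
tight = [{"milk", "bread"}, {"milk", "bread"}, {"milk", "bread"}, {"milk", "bread", "eggs"}]
loose = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk", "bread"}, {"milk", "jam"}]

tight_ratio = sl_ratio(tight, small_max=0.25, large_min=0.75)
loose_ratio = sl_ratio(loose, small_max=0.25, large_min=0.75)
```

In the tight cluster only "eggs" is small against two large items, so its ratio is lower than that of the loose cluster, matching the abstract's claim that a smaller SL ratio indicates more similar members.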
By
Nasrollahzadeh, Kourosh; Afzali, Solmaz
The application of fiber-reinforced polymer (FRP) strips or rods in the form of near-surface-mounted (NSM) reinforcement has become an attractive solution for strengthening existing buildings and bridges. It is of interest to engineers to have an accurate estimate of the bond capacity of this technique. In this paper, a fuzzy logic approach is utilized to propose an alternative method of determining the pullout strength of NSM FRP strips/rods bonded to a concrete block. Two types of fuzzy logic models, namely Mamdani and Takagi–Sugeno, are developed. With the aim of enhancing the interpretability of the fuzzy model, the rule base of the Mamdani model is extracted from a classification decision tree, and the membership functions corresponding to the linguistic concepts are built by uniformly partitioning the range of the variables. On the other hand, in order to arrive at closed-form equations for pullout capacity, the subtractive clustering algorithm is employed to deduce the rule base and membership functions of the Takagi–Sugeno model (first order), and its consequent part is tuned by least-squares optimization using the training dataset. Several fuzzy logic models of both types with different numbers of rules are developed and compared in terms of different error measures. To train and validate the fuzzy models, a large database of 384 direct pullout tests on NSM FRP bonded to concrete is assembled from the literature. The results reveal that both the proposed Mamdani and Takagi–Sugeno models demonstrate good accuracy against the experimental data and outperform the published models. A parametric study indicates that the proposed fuzzy models can predict the maximum effective bond length, and thus, they are able to capture the underlying mechanics of the problem.