Showing 1 to 10 of 2441 matching Articles
By
Tallón-Ballesteros, Antonio J.; Correia, Luís; Cho, Sung-Bae
Feature selection has long been applied in many areas of science and engineering. This kind of pre-processing is almost mandatory in problems with a huge number of features, which incur a very high computational cost and are often further complicated by having more than two classes and a large number of instances. The general taxonomy divides the approaches into two groups: filters and wrappers. This paper introduces a methodology that refines the feature subset with an additional feature selection approach. It reviews the possibilities and explores in depth a new class of algorithms based on refining an initial search with another method. We sequentially apply an approximate procedure and an exact procedure. The research is supported by empirical results, and some guidelines are drawn as conclusions.
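The idea of applying an approximate procedure and then an exact one can be sketched as follows. This is only an illustration under stated assumptions, not the method of the paper: the approximate stage is assumed to be a correlation filter, the exact stage is assumed to be an exhaustive search over the surviving features scored by leave-one-out 1-nearest-neighbour accuracy, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def filter_stage(X, y, keep):
    """Approximate stage (assumed): keep the `keep` features most correlated with y."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: -scores[j])
    return sorted(ranked[:keep])


def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule on the given features."""
    correct = 0
    for i, row in enumerate(X):
        best_dist, best_label = None, None
        for k, other in enumerate(X):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def exact_stage(X, y, candidates):
    """Exact stage (assumed): try every non-empty subset of the candidates.

    Smaller subsets are tried first and only a strictly better score replaces
    the incumbent, so ties are resolved in favour of fewer features.
    """
    best_subset, best_score = None, -1.0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            score = loo_1nn_accuracy(X, y, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return list(best_subset), best_score


# Toy data: feature 0 separates the classes, feature 2 is constant.
X = [
    [0.0, 5.0, 1.0, 0.1],
    [0.1, 1.0, 1.0, 0.9],
    [0.2, 4.0, 1.0, 0.2],
    [1.0, 2.0, 1.0, 0.8],
    [1.1, 5.0, 1.0, 0.3],
    [0.9, 1.0, 1.0, 0.7],
]
y = [0, 0, 0, 1, 1, 1]

kept = filter_stage(X, y, keep=2)
refined, accuracy = exact_stage(X, y, kept)
```

The filter stage is cheap and shrinks the search space, which is what makes the exhaustive second stage affordable; with many surviving features the exhaustive search would have to be replaced by a bounded one.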
By
Nguyen, Hoai Bach; Xue, Bing; Liu, Ivy; Andreae, Peter; Zhang, Mengjie
8 Citations
In classification, feature selection is an important but challenging task, which requires a powerful search technique. Particle swarm optimisation (PSO) has recently gained much attention for solving feature selection problems, but the current representation typically forms a high-dimensional search space. A new representation based on feature clusters was recently proposed to reduce the dimensionality and improve the performance, but it does not form a smooth fitness landscape, which may limit the performance of PSO. This paper proposes a new Gaussian-based transformation rule for interpreting a particle as a feature subset, which is combined with the feature-cluster-based representation to develop a new PSO-based feature selection algorithm. The proposed algorithm is examined and compared with two recent PSO-based algorithms: the first uses a Gaussian-based updating mechanism and the conventional representation, and the second uses the feature cluster representation without a Gaussian distribution. Experiments on commonly used datasets of varying difficulty show that the proposed algorithm achieves better performance than the other two algorithms in terms of classification performance and number of features on both the training sets and the test sets. Further analyses show that the Gaussian transformation rule improves stability, i.e. it selects similar features in different independent runs, and it almost always selects the most important features.
By
Yu, Qiang; Wang, Longbiao; Dang, Jianwu
1 Citation
Spikes play an essential role in information transmission and neural computation, but how neurons learn them remains unclear. Most learning rules depend on either a rate-based or a timing-based code, but few are suitable for both. In this paper, we present an efficient multi-spike learning rule that is suitable for training neurons to classify both rate- and timing-based spike patterns. With our learning rule, neurons can be trained to fire different numbers of output spikes in response to their input patterns, and therefore single neurons are capable of multi-category classification.
By
Xylogiannopoulos, Konstantinos F.; Karampelas, Panagiotis; Alhajj, Reda
7 Citations
The suffix array is a powerful data structure, used mainly for pattern detection in strings. The main disadvantage of a full suffix array is its quadratic O(n²) space requirement when the actual suffixes are needed. In our previous work [39], we introduced the All Repeated Patterns Detection (ARPaD) algorithm and the Moving Longest Expected Repeated Pattern (MLERP) process. The former detects all repeated patterns in a string using a partition of the full suffix array, and the latter is capable of analyzing large strings regardless of their size. Furthermore, the notion of the Longest Expected Repeated Pattern (LERP), also introduced by the authors in a previous work, reduces the space needed for the full suffix array to linear O(n). However, so far the LERP value has had to be specified in an ad hoc manner based on experimental or empirical values. To overcome this problem, the Probabilistic Existence of LERP theorem is proven in this paper, and a formula for an accurate upper bound estimate of the LERP value is introduced that uses only the length of the string and the size of the alphabet used in constructing the string. The importance of this method is that it bounds the LERP value from above without any preprocessing or prior knowledge of the string's characteristics. Moreover, a new data structure, the LERP Reduced Suffix Array, is defined; it is a variation of the suffix array and has the advantage of permitting classification and parallelism to be implemented directly on the data structure. All other alternative methodologies deal with the very common problem of fitting some kind of data structure in computer memory or on disk in order to apply different time-efficient methods for pattern detection.
The proposed methodology allows us to recast the above-mentioned problem so that smaller classes of the problem can be distributed across different systems, to which current state-of-the-art techniques such as parallelism and cloud computing can then be applied using advanced DBMSs capable of handling the storage and analysis of big data. The methodology described above can be implemented by invoking our ARPaD algorithm. Extensive experiments have been conducted on small, comparable strings of the Champernowne constant and DNA, as well as on extremely large strings of π with lengths of up to 68 billion digits. Furthermore, the novelty and superiority of our methodology have also been tested on a real-life application, a Distributed Denial of Service (DDoS) attack early warning system.
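To make the role of a length cap concrete, here is a minimal sketch of repeated-pattern detection over a suffix array whose suffixes are truncated to a fixed length. It is not the ARPaD algorithm and does not compute the LERP bound from the paper: the cap `max_len` is simply passed in by the caller, the construction is a naive sort, and all names are hypothetical.

```python
def truncated_suffix_array(text, max_len):
    """Start positions of all suffixes, sorted by their first `max_len` characters."""
    return sorted(range(len(text)), key=lambda i: text[i:i + max_len])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def repeated_patterns(text, max_len):
    """Map every pattern of length <= max_len that occurs at least twice
    to the sorted list of its start positions.

    In sorted order, all suffixes sharing a prefix are contiguous, so it is
    enough to compare each suffix with its immediate neighbour.
    """
    order = truncated_suffix_array(text, max_len)
    occurrences = {}
    for prev, cur in zip(order, order[1:]):
        a = text[prev:prev + max_len]
        b = text[cur:cur + max_len]
        shared = common_prefix_len(a, b)
        for k in range(1, shared + 1):
            occurrences.setdefault(a[:k], set()).update((prev, cur))
    return {pattern: sorted(pos) for pattern, pos in occurrences.items()}


patterns = repeated_patterns("abcabcab", max_len=3)
```

Because each stored suffix holds at most `max_len` characters, the stored text grows linearly with the string length for a fixed cap, at the price of never reporting a repeated pattern longer than the cap.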
By
Brahmi, Hanen; Brahmi, Imen; Ben Yahia, Sadok
4 Citations
Due to the growing threat of network attacks, efficient detection and network abuse assessment are of paramount importance. In this respect, Intrusion Detection Systems (IDS) are intended to protect information systems against intrusions. However, IDS are plagued by several problems that slow down their development, such as low detection accuracy and a high false alarm rate. In this paper, we introduce a new IDS, called OMC-IDS, which integrates data mining techniques and On Line Analytical Processing (OLAP) tools. The association of the two fields can be a powerful solution to the defects of IDS. Our experimental results show the effectiveness of our approach in comparison with other approaches following the same trend.
By
Sá, Alex G. C.; Pinto, Walter José G. S.; Oliveira, Luiz Otavio V. B.; Pappa, Gisele L.
2 Citations
Automatic Machine Learning is a growing area of machine learning that has a similar objective to the area of hyper-heuristics: to automatically recommend optimized pipelines, algorithms or appropriate parameters for specific tasks without much dependency on user knowledge. The background knowledge required to solve the task at hand is embedded into a search mechanism that builds personalized solutions to the task. Following this idea, this paper proposes RECIPE (REsilient ClassifIcation Pipeline Evolution), a framework based on grammar-based genetic programming that builds customized classification pipelines. The framework is flexible enough to receive different grammars and can be easily extended to other machine learning tasks. RECIPE overcomes the drawbacks of previous evolutionary-based frameworks, such as generating invalid individuals, and organizes a large number of suitable data pre-processing and classification methods into a grammar. The f-measure results obtained by RECIPE are compared to those of two state-of-the-art methods and shown to be as good as or better than those previously reported in the literature. RECIPE represents a first step towards a complete framework for dealing with different machine learning tasks with the minimum required human intervention.
By
Lin, Hong; Li, Yuezhe
1 Citation
This paper presents our study on detecting the “meditation” brain state by analyzing electroencephalographic (EEG) data. We first discuss what the “meditation” state is and review some prior studies on meditation. We then discuss how the meditation state can be reflected in the subject’s brain waves, and which features of the brain wave data can be used in machine learning algorithms to distinguish the meditation state from other states. We studied the suitability of three types of entropy (Shannon entropy, approximate entropy, and sample entropy) in different circumstances. We found that, overall, sample entropy is a good tool for extracting information from EEG data. Discretization of EEG data enhances the classification rates obtained with both approximate entropy and Shannon entropy.
By
Grossi, Valerio; Turini, Franco
10 Citations
Mining data streams has become an important and challenging task for a wide range of applications. In these scenarios, data tend to arrive in multiple, rapid and time-varying streams, thus constraining data mining algorithms to look at the data only once. Maintaining an accurate model, e.g. a classifier, while the stream goes by requires a smart way of keeping track of the data that have already passed. Such a synthetic structure has to serve two purposes: distilling as much information as possible out of past data, and allowing a fast reaction to concept drift, i.e. to a change in the data trend that necessarily affects the model. The paper outlines novel data structures and algorithms to tackle the above problem when the model mined out of the data is a classifier. The introduced model and the overall ensemble architecture are presented in detail, including how the approach can be extended to treat numerical attributes. A large part of the paper discusses the experiments and the comparisons with several existing systems. The comparisons show that the performance of our system in general, and in particular with respect to the reaction to concept drift, is at the top level.
By
Prasath, R. Rajendra
2 Citations
This work reports the stylistic differences in blogging across gender and age groups using slang word co-occurrences. We have mainly focused on the co-occurrence of non-dictionary words across bloggers of different genders and age groups. For this analysis, we have used slang words as a feature to study the stylistic variations of bloggers across various age groups and genders. We have modeled the co-occurrences of slang words used by bloggers as a graph-based model in which nodes are slang words and edges represent the number of co-occurrences, and studied the variations in predicting age group and gender. We have used a demographically tagged blog corpus from the ICWSM Spinner dataset for these experiments, together with a Naive Bayes classifier and 10-fold cross-validation. Preliminary results show that the co-occurrence of slang words could be a better choice for predicting age and gender.
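The graph model described above, with slang words as nodes and co-occurrence counts as edge weights, can be sketched in a few lines. This is an illustrative sketch only: treating any word absent from a supplied dictionary as slang, counting co-occurrence at the level of a whole post, and splitting on whitespace are all assumptions, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(posts, dictionary):
    """Build a weighted edge list over non-dictionary words.

    Each key is an alphabetically ordered pair of words; each value is the
    number of posts in which both words appear.
    """
    edges = Counter()
    for post in posts:
        # Deduplicate within a post so a repeated word counts once per post.
        slang = sorted({word.lower() for word in post.split()
                        if word.lower() not in dictionary})
        for first, second in combinations(slang, 2):
            edges[(first, second)] += 1
    return edges


dictionary = {"that", "was", "game"}
posts = ["lol that was gr8", "gr8 game lol", "omg lol"]
edges = cooccurrence_graph(posts, dictionary)
```

Edge weights built this way could then be turned into per-blogger features for a classifier such as Naive Bayes; how the original study derives its features from the graph is not stated in the abstract.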
By
Chatzigeorgakidis, Georgios; Karagiorgou, Sophia; Athanasiou, Spiros; Skiadopoulos, Spiros
Efficient management and analysis of large volumes of data is a demanding task of increasing scientific and industrial importance, as the ubiquitous generation of information governs more and more aspects of human life. In this article, we introduce FML-kNN, a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink. The framework’s core consists of a k-nearest neighbor joins algorithm which, contrary to similar approaches, is executed in a single distributed session and is able to operate on very large volumes of data of variable granularity and dimensionality. We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented in the Apache Hadoop, Spark, and Flink distributed processing engines. The results indicate an overall superiority of our framework in all the performed comparisons. Further, we apply FML-kNN in two motivating use cases for water demand management, using real-world domestic water consumption data. In particular, we focus on forecasting water consumption using 1-h smart meter data, and on extracting consumer characteristics from water use data in the shower. We further discuss the obtained results, demonstrating the framework’s potential for useful knowledge extraction.
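For readers unfamiliar with how a k-nearest-neighbor method yields probabilistic classification and regression, a single-machine sketch follows. It is not FML-kNN and involves no Apache Flink or distributed join: it uses a brute-force neighbor search with Euclidean distance, takes class probabilities as neighbor vote fractions and regression output as the neighbor mean, and all names are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training items closest to `query` (Euclidean distance).

    `train` is a list of (point, target) pairs, where point is a tuple.
    """
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_class_probabilities(train, query, k):
    """Estimate class probabilities as the fraction of neighbour votes."""
    neighbours = k_nearest(train, query, k)
    votes = Counter(label for _, label in neighbours)
    return {label: count / len(neighbours) for label, count in votes.items()}


def knn_regress(train, query, k):
    """Predict a numeric target as the mean target of the k neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)


labelled = [
    ((0, 0), "low"), ((0, 1), "low"), ((1, 0), "low"),
    ((5, 5), "high"), ((5, 6), "high"),
]
numeric = [((0,), 1.0), ((1,), 2.0), ((2,), 3.0), ((10,), 100.0)]

probabilities = knn_class_probabilities(labelled, (4, 4), k=3)
forecast = knn_regress(numeric, (1,), k=3)
```

The brute-force search compares the query against every training point, which is exactly the cost that a distributed kNN join is designed to spread across machines.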