Dynamic data mining for highly intercorrelated feature selection with Graphics Processing Unit (GPU) computing
Date: 2013
Authors:
Silvanie, Astried
Djatna, Taufik
Sukoco, Heru
Abstract
Feature selection over correlated features must, in practice, be performed on dynamic data, because insertion, deletion, and update transactions occur continuously in the database. These transactions make the data very large in both the number of records and the number of features. The question is how to perform feature selection on data that is very large and constantly changing. One common approach to this problem is to sample the database. In this research, random sampling is used to extract a synopsis of the data in the database. The reservoir algorithm is used to initialize the sample. The sample is a backing sample composed of identities, priorities, and timestamps. When the database changes, the record in the sample with the lowest priority is replaced with a new one. The sample, as a representation of the database, maintains the same class distribution as the database, verified using the Kullback-Leibler divergence. All techniques and algorithms are implemented in SQL on MySQL 5.5. As a result, the sampling process extracts a small amount of new data that represents the database with the same class distribution. Sampling reduces the dimension in the number of records; another method is needed to speed up feature selection over a very large number of features. In this research, parallel GPU computing is used to accelerate feature selection on high-dimensional data. The feature selection algorithm is divided into two subproblems: discretization and x-monotone computational geometry. A parallel algorithm is applied to each subproblem wherever there is no recursion or dependency. The algorithms are written in CUDA C as two kernel functions: a discretization kernel and an x-monotone kernel.
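The backing-sample scheme and the Kullback-Leibler check described above can be sketched as follows. This is a minimal illustration, not the thesis' implementation (which is written in SQL on MySQL 5.5): the uniform random priorities, the min-heap layout, and the class names are assumptions, since the abstract does not give these details.

```python
import heapq
import math
import random

def kl_divergence(p, q):
    """Kullback-Leibler divergence between two discrete class distributions
    (lists of probabilities in the same class order)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

class BackingSample:
    """Fixed-size random sample of (priority, identity, timestamp) records.

    The sample is initialized by filling the reservoir; on later insertions
    the record with the lowest priority is evicted, mirroring the
    backing-sample replacement rule. Priorities are uniform random draws
    (an assumption), which keeps the retained records a uniform random
    sample of the stream seen so far.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []  # min-heap keyed on priority: lowest priority on top

    def insert(self, identity, timestamp):
        priority = random.random()
        if len(self.heap) < self.capacity:
            # reservoir not yet full: always accept
            heapq.heappush(self.heap, (priority, identity, timestamp))
        elif priority > self.heap[0][0]:
            # replace the lowest-priority record with the new one
            heapq.heapreplace(self.heap, (priority, identity, timestamp))
```

A caller could then compare the class distribution of the records held in the sample against that of the full database with `kl_divergence`, refreshing the sample when the divergence exceeds a chosen threshold.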
Tests were performed on data sets with three distributions: balanced, negatively skewed, and positively skewed. For a single feature pair, the parallel discretization and x-monotone computations are on average 1.5 to 1.87 times faster than the sequential versions. For 45, 190, 435, 780, 1225, 1770, 2415, 3160, 4005, and 4950 feature pairs, the parallel version is on average 8.2 times faster than the sequential one. Accuracy was measured for each kernel function in the CUDA program: the discretization kernel achieves an accuracy of 81.76% and the x-monotone kernel an accuracy of 85%. Feature selection for 4950 pairs with a balanced distribution requires only 0.85 seconds; the same calculation on the negatively and positively skewed distributions takes 1.66 and 1.84 seconds, respectively. These results show that calculation on skewed data requires a longer duration than on balanced data.