A Study of Clustering Incomplete Data with Special Algorithms without Imputation
Abstract
Cluster analysis is a statistical method that classifies n objects into k groups so that objects within a group are more homogeneous than objects in different groups (Mattjik & Sumertajaya 2002). The main purpose of this technique is to cluster objects based on specific criteria so that the variation within each cluster is relatively small compared with the variation among clusters. Common clustering methods can only be used on complete data sets. However, problems arise when the data are incomplete because some values are not available. Clustering of incomplete data can be handled with two approaches: preprocessing and the application of a special algorithm. Preprocessing resolves the incomplete data first so that the analysis can then be applied to complete data (Grzymała & Hu 2001). Two techniques can be used in preprocessing: marginalization (deletion) of incomplete data and imputation. Wagstaff and Laidler (2005) explain that the preprocessing approach is often used, with either marginalization or imputation, because both are easy and simple. Marginalization is the simplest solution for incomplete data clustering. It can be done in two ways: remove the incomplete objects from the data set, or delete the incomplete variables. It should be noted, however, that marginalization can lead to loss of information. Imputation methods estimate the incomplete values before clustering using various techniques, such as imputation with constant values, zeros, random values, the median, the mean, and others. Troyanskaya et al. (2001) concluded that estimating incomplete data by imputation can be unreliable and yield inaccurate information. Special algorithms are used to overcome the shortcomings of the marginalization and imputation methods.
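The two preprocessing techniques described above can be illustrated with a minimal sketch. The toy data, the function names, and the use of `None` to mark a missing value are assumptions made for this illustration only; they are not taken from the study.

```python
from statistics import mean

# Hypothetical toy data: 4 objects x 3 variables; None marks a missing value.
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 4.0],
    [3.0, 6.0, None],
    [4.0, 8.0, 5.0],
]


def drop_incomplete_objects(rows):
    """Marginalization by object: keep only rows with no missing value."""
    return [row for row in rows if None not in row]


def drop_incomplete_variables(rows):
    """Marginalization by variable: keep only columns with no missing value."""
    keep = [j for j in range(len(rows[0]))
            if all(row[j] is not None for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def impute_column_mean(rows):
    """Imputation: fill each missing value with the mean of the observed
    values in the same column (assumes every column has at least one
    observed value)."""
    n_cols = len(rows[0])
    col_means = [mean(row[j] for row in rows if row[j] is not None)
                 for j in range(n_cols)]
    return [[col_means[j] if row[j] is None else row[j]
             for j in range(n_cols)]
            for row in rows]
```

In this toy case, deleting incomplete objects discards half of the rows and deleting incomplete variables discards two of the three columns, which is the information loss the text warns about.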
There are several special algorithms for missing data without imputation, such as the Partial Distance Strategy (PDS) and K-Means Soft Constraints (KSC). PDS and KSC adopt the stages of the K-means algorithm for complete data. Wagstaff (2004) studied clustering of incomplete data without imputation using the K-Means Soft Constraints (KSC) approach. Matyja and Simiński (2014) also studied clustering of incomplete data without imputation by comparing the Partial Distance Strategy (PDS) with the Optimal Completion Strategy (OCS), Nearest Prototype Strategy (NPS), Fuzzy C-Means (FCM) and Nearest Cluster Strategy (NCS). These earlier studies motivated the present review of incomplete data clustering without imputation. The study is carried out on simulated data and on applied data. It aims to assess the Partial Distance Strategy (PDS) and K-Means Soft Constraints (KSC) methods for incomplete data clustering. The simulated data are generated numeric data with a total population of N = 1200, varied over several simulation aspects: the sample size (n), the cluster centers (μ) and the correlation between variables (ρ). The simulation generates three multivariate normal populations, N(μ, Σ), each consisting of 7 variables. The three populations are simulated under three population models: populations that are not separated (design I), separated populations (design II), and perfectly separated populations (design III). The applied (secondary) data are BPS (Statistics Indonesia) data on 10 people's welfare indicator variables for Aceh Province in 2006.
The selection of these indicator variables draws on various sources, such as the RPJP (Long Term Development Plan), the MDGs (Millennium Development Goals) and the indicators of well-being published by BPS in cooperation with other government agencies. Broadly speaking, all of these sources publish the same indicators of public welfare (Bappenas, 2010). For the simulated data, tested over combinations of n, μ, ρ and the percentage of incomplete data, the results show that the average clustering accuracy of PDS and KSC is higher when n is larger, the populations are well separated from each other, the correlation between variables is small, and the percentage of incomplete data is small. For the applied data, clustering without imputation shows that the members of each cluster are homogeneous within the cluster, whereas the variation among clusters is more heterogeneous. This result shows that the districts and cities within one cluster have a high degree of similarity, so they share common features in terms of the indicators of people's welfare.
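To make the idea of clustering without imputation concrete, the following is a minimal sketch of a K-means loop that uses a partial distance in the spirit of PDS: the squared distance is computed over the observed variables only and rescaled by p divided by the number of observed variables, and each center is updated from the observed values of its members. The function names, the use of `None` for missing values, the fixed starting centers and the toy data are all assumptions for illustration. This is not the implementation used in the study, and KSC is not sketched here.

```python
def partial_distance(x, center):
    """Squared Euclidean distance over the observed coordinates of x,
    rescaled by p / (number of observed coordinates)."""
    observed = [j for j, value in enumerate(x) if value is not None]
    if not observed:
        raise ValueError("object has no observed values")
    squared = sum((x[j] - center[j]) ** 2 for j in observed)
    return len(x) / len(observed) * squared


def pds_kmeans(rows, initial_centers, n_iter=20):
    """K-means-style loop using the partial distance (illustrative sketch).

    rows            : list of objects; None marks a missing value
    initial_centers : hypothetical starting centers, one per cluster
    """
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center under the partial distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: partial_distance(x, centers[k]))
            for x in rows
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: per-variable mean of the observed values only.
        for k in range(len(centers)):
            for j in range(len(centers[k])):
                values = [x[j] for x, lab in zip(rows, labels)
                          if lab == k and x[j] is not None]
                if values:  # keep the old coordinate if nothing is observed
                    centers[k][j] = sum(values) / len(values)
    return labels, centers


# Hypothetical toy data: two well separated groups with missing values.
toy = [
    [0.0, 0.0], [0.2, None], [None, 0.1],
    [5.0, 5.0], [5.2, None], [None, 4.9],
]
labels, centers = pds_kmeans(toy, [[0.0, 0.0], [5.0, 5.0]])
```

No value is ever filled in: an object with a missing variable simply contributes fewer terms to the distance and to the center update.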