不确定性的出现使传统算法无法直接用于聚类不确定数据流。该文提出一种不确定数据流环境下基于密度的聚类算法,其中提出不确定度的概念以衡量不确定数据的分布信息,并在改进面向确定数据的聚类算法DENCLUE的基础上,提出一种可处理数据不确定度的UDENCLUE算法,以降低数据的不确定性对聚类结果产生的影响;提出滑动窗口下基于密度的不确定数据流聚类算法USDENCLUE,通过聚类特征指数直方图技术实现快速剪枝,可以高效处理噪音数据、演化数据流并生成任意形状的簇;采用真实数据集及人工合成数据集对USDENCLUE与CluStream聚类算法进行比较,实验结果表明了所提出算法的高效性和有效性。
Uncertainties make it impossible to cluster uncertain data streams using traditional clustering algorithms. This paper presents a density-based clustering algorithms for uncertain data stream environments. An uncertainty metric is used to measure the distribution information in the uncertain data. The uncertain data streams DENCLUE algorithm (USDENCLUE) is then modified to deal with uncertainty data to minimize the impact of the data uncertainty on the clustering results. A density-based clustering algorithm is then given for uncertain data streams with a sliding window to rapidly prune the clusters using an exponential histogram of the cluster features. This algorithm can efficiently handle noisy data in evolving data streams to generate arbitrary clusters to improve the clustering quality. Comparisons of this algorithm with the CluStream clustering algorithm on real and synthetic data sets show the efficiency and effectiveness of this algorithm.
[1] Deshpande A, Guestrin C, Madden S, et al. Model-driven data acquisition in sensor networks[C]//Proceeding of the 30th International Conference on Very Large Data Bases. New York, USA:ACM Press, 2004:588-599. [2] GU Yu, YU Ge, ZHANG Tiancheng. RFID complex event processing techniques[J]. Journal of Frontiers of Computer Science and Technology, 2007, 1(3):255-267. [3] Jeffery S R, Garofalakis M N, Frwanklin M J. Adaptive cleaning for RFID data streams[C]//Proceeding of the 32nd International Conference on Very Large Data Bases. New York, USA:ACM Press, 2006:163-174. [4] ZHOU Aoying, JIN Cheqing, WANG Guoren, et al. A survey on the management of uncertain data[J]. Chinese Journal of Computers, 2009, 32(1):1-16. [5] Aggarwal C C, Han J, Wang J, et al. A framework for clustering evolving data streams[C]//Proceeding of the 29th International Conference on Very Large Data Bases. New York, USA:ACM Press, 2003:81-92. [6] Aggarwal C C, Yu P S. A framework for clustering uncertain data streams[C]//Proceeding of the 24th International Conference on Data Engineering. Cancún, México, 2008:150-159. [7] Aggarwal C C. On high dimensional projected clustering of uncertain data streams[C]//Proceeding of the 25th International Conference on Data Engineering. Shanghai, 2009:1152-1154. [8] Huang G Y, Liang D P, Ren J D, et al. An algorithm for clustering uncertain data streams over sliding windows[C]//Proceeding of the 6th International Conference on Digital Content, Multimedia Technology and Its Applications. Seoul, Korea, 2009:173-177. [9] ZHANG Chen, JIN Cheqing, ZHOU Aoying. Clustering algorithm over uncertain data streams[J]. Journal of Software, 2010, 21(9):2173-2182. [10] CAO Keyan, WANG Guoren, HAN Donghong, et al. A framework for high-quality clustering uncertain data stream over sliding windows[C]//Proceeding of the 13th International Conference on Web-Age Information Management. Harbin, 2012:308-313. [11] YANG Yue, LIU Zhuo, ZHANG Jianpei, et al. Dynamic density-based clustering algorithm over uncertain data streams[C]//Proceeding of the 20129th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD). Chongqing, 2012:2664-2670. [12] JIN Cheqing, Yu J X, ZHOU Aoying, et al. Efficient clustering of uncertain data streams[J]. Knowledge and Information Systems, 2014, 40(3):509-539. [13] Dallachiesa M, Jacques-Silva G, Gedik B, et al. Sliding windows over uncertain data streams[J]. Knowledge and Information Systems, 2015, 45(1):159-190. [14] Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise[C]//Proceeding of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA, 2010:58-65.