不确定性数据上聚合查询的近似算法

doi:10.16511/j.cnki.qhdxxb.2018.26.015

摘要
图/表
参考文献
相关文章
Metrics

全文: PDF(1167 KB)
输出: BibTeX | EndNote (RIS)

摘要随着大数据时代的到来，不确定性数据上的聚合查询面临形式多样、计算复杂等挑战。该文将不确定性数据上聚合查询的结果定义为所有可能的值以及对应的概率。基于动态规划思想的求解"和"的分布（distribution sum，DSUM）精确算法，提出贪心的"和"的分布（greedy distribution sum，GDSUM）和折半合并的"和"的分布（binary merge distribution sum，BMDSUM）的近似算法，这2种算法都能应用于元组级不确定性模型和属性级不确定性模型；并通过理论分析，给出算法的时间和空间复杂度以及最终结果的误差范围。实验结果表明：误差设定为1%时，2种近似算法分别能缩短执行时间15%~21%和22%~32%。

	服务

	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	RSS

	作者相关文章
	陈东辉
	陈岭
	王俊凯
	吴勇
	王敬昌

关键词 ：聚合查询, 近似算法, 不确定性数据, 动态规划, 误差估计

Abstract：Analyses of big data sets often require aggregate queries on uncertain data with various types of data that are computationally complex. In this paper, the results of aggregate queries on uncertain data are defined to include all possible values and their corresponding probabilities. Dynamic programming is then used to solve the Distribution Sum (DSUM) algorithm using a Greedy-based Distribution Sum and a Binary Merge based Distribution Sum approximation algorithms, which both can be applied to tuple-level and attribute-level uncertainty models. The time and space complexities of the algorithms are determined theoretically as well as the error range of the results. Tests demonstrates that these two approximation algorithms with a 1% allowable error shorten the execution times by 15%-21% and 22%-32%, respectively.

Key words： aggregate queries approximation algorithms uncertain data dynamic programming error estimation

收稿日期: 2017-10-09 出版日期: 2018-03-15

ZTFLH:

TP301.6

基金资助:中国工程科技知识中心资助项目（CKCEST-2014-1-5）；国家自然科学基金资助项目（61332017）；浙江省科技计划项目（2015C33002）

通讯作者: 陈岭,副教授,E-mail:lingchen@zju.edu.cn E-mail: lingchen@zju.edu.cn

作者简介: 陈东辉(1995-),男,博士研究生。

引用本文:

陈东辉, 陈岭, 王俊凯, 吴勇, 王敬昌. 不确定性数据上聚合查询的近似算法[J]. 清华大学学报（自然科学版）, 2018, 58(3): 231-236.
CHEN Donghui, CHEN Ling, WANG Junkai, WU Yong, WANG Jingchang. Approximation algorithms for aggregate queries on uncertain data. Journal of Tsinghua University(Science and Technology), 2018, 58(3): 231-236.

链接本文:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.26.015 或 http://jst.tsinghuajournals.com/CN/Y2018/V58/I3/231

表１不确定性示例

表２元组级不确定性

表３属性级不确定性

表４计数查询结果示例

图１状态转移过程

图２　Greedy Distribution SUM 算法

图３　近似合并过程示例

图４　数据集大小对执行时间的影响和数据量对查询结果误差的影响

[1] DESHPANDE A, GUESTRIN C, MADDEN S. Using probabilistic models for data management in acquisitional environments[C]//Proceedings of the International Conference on Innovation Database Research. Asilomar, USA:Online Proceeding, 2005:317-328.
[2] PENG L P, DIAO Y L. Supporting data uncertainty in array databases[C]//Proceedings of 2015 International Conference on Management of Data. Melbourne, Australia:ACM Press, 2015:545-560.
[3] RÉ C, SUCIU D. Approximate lineage for probabilistic databases[J]. Proceedings of the VLDB Endowment, 2008, 1(1):797-808.
[4] SINGH S, MAYFIELD C, MITTAL S, et al. Orion 2.0:Native support for uncertain data[C]//Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada:ACM Press, 2008:1239-1242.
[5] HUANG J W, ANTOVA L, KOCH C, et al. MayBMS:A probabilistic database management system[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. Providence, USA:ACM Press, 2009:1071-1074.
[6] WANG Y J, LI X Y, LI X L, et al. A survey of queries over uncertain data[J]. Knowledge and Information Systems, 2013, 37(3):485-530.
[7] ZHOU X, LI K L, ZHOU Y T, et al. Adaptive processing for distributed skyline queries over uncertain data[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(2):371-384.
[8] CICERI E, FRATERNALI P, MARTINENGHI D, et al. Crowdsourcing for top-k query processing over uncertain data[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(1):41-53.
[9] CHEN L, GAO Y J, LI X H, et al. Indexing metric uncertain data for range queries[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia:ACM Press, 2015:951-965.
[10] DALLACHIESA M, PALPANAS T, ILYAS I F. Top-k nearest neighbor search in uncertain data series[J]. Proceedings of the VLDB Endowment, 2014, 8(1):13-24.
[11] JAYRAM T S, KALE S, VEE E. Efficient aggregation algorithms for probabilistic data[C]//Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. New Orleans:Society for Industrial and Applied Mathematics, USA:2007:346-355.
[12] MURTHY R, IKEDA R, WIDOM J. Making aggregation work in uncertain and probabilistic databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(8):1261-1273.
[13] DALVI N, SUCIU D. Management of probabilistic data:Foundations and challenges[C]//Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Beijing, China:ACM Press, 2007:1-12.
[14] AGARWAL P K, KUMAR N, SINTOS S, et al. Range-max queries on uncertain data[C]//Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. San Francisco, USA:ACM Press, 2016:465-476.
[15] NATH S, GIBBONS P B, SESHAN S, et al. Synopsis diffusion for robust aggregation in sensor networks[J]. ACM Transactions on Sensor Networks, 2008, 4(2):250-262.
[16] ZENG K, GAO S, MOZAFARI B, et al. The analytical bootstrap:A new method for fast error estimation in approximate query processing[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA:ACM Press, 2014:277-288.
[17] HUA M, PEI J, ZHANG W J, et al. Ranking queries on uncertain data:A probabilistic threshold approach[C]//Proceedings of the 27th International Conference on Management of Data. Vancouver, Canada:ACM Press, 2008:673-686.

[1]	赵俊, 包丛笑, 李星. 软件定义网络中低成本流量数据采集算法[J]. 清华大学学报（自然科学版）, 2019, 59(2): 148-153.
[2]	秦兆博, 罗禹贡, 张东好, 陈龙, 李克强. 液压混合动力履带推土机行星齿轮传动系统的设计[J]. 清华大学学报（自然科学版）, 2017, 57(5): 497-503.
[3]	郭孟武, 钟宏志. 两种严格界面向目标误差估计方法的等价性[J]. 清华大学学报（自然科学版）, 2017, 57(4): 362-368.
[4]	张灿荣, 钟明, 缪立新. 集装箱场地箱位分配问题[J]. 清华大学学报（自然科学版）, 2015, 55(10): 1150-1156.
[5]	朱锐,李云洲,王京. 基于能量收割的认知无线电预编码优化[J]. 清华大学学报（自然科学版）, 2014, 54(4): 407-412.

Viewed

Full text

Abstract

Cited

Shared

Discussed