Please wait a minute...
 首页  期刊介绍 期刊订阅 联系我们 横山亮次奖 百年刊庆
 
最新录用  |  预出版  |  当期目录  |  过刊浏览  |  阅读排行  |  下载排行  |  引用排行  |  横山亮次奖  |  百年刊庆
清华大学学报(自然科学版)  2016, Vol. 56 Issue (9): 963-968    DOI: 10.16511/j.cnki.qhdxxb.2016.21.060
  计算机科学与技术 本期目录 | 过刊浏览 | 高级检索 |
基于矩阵三角化分解的Cholesky分解及FPGA并行结构设计
刘书勇, 林俊宇, 吴艳霞, 张博为
哈尔滨工程大学 计算机科学与技术学院, 哈尔滨 150001
Cholesky decomposition and parallel structure design based on matrix triangularization decomposition
LIU Shuyong, LIN Junyu, WU Yanxia, ZHANG Bowei
Collage of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
全文: PDF(1331 KB)  
输出: BibTeX | EndNote (RIS)      
摘要 矩阵运算是高性能计算中核心问题之一,矩阵分解是提高矩阵运算并行性的重要途径,飞速发展的FPGA为并行运算结构提供了有力的环境支持。该文基于子矩阵更新同一化算法实现了Cholesky分解,基于FPGA设计了相应的并行结构。实验结果表明:与通用处理器的软件实现相比,本文实现的Cholesky分解的FPGA并行结果在核心计算性能上可以取得10倍以上的加速比,该算法针对矩阵三角化计算过程具有更高的数据和流水并行性。
服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
刘书勇
林俊宇
吴艳霞
张博为
关键词 矩阵三角化分解Cholesky分解并行结构现场可编程门阵列    
Abstract:Matrix computing is one of the core problems in high performance computing with matrix decomposition being an important way to improve the parallelism of matrix computations. FPGA gives a powerful environment for parallel computing. This study uses Cholesky decomposition based on a hardware-adaptive parallel sub-matrix identity updating algorithm. Its parallel structure is based on FPGA. Tests show that this structure achieves more than 10 fold speedup compared to general-purpose processors in terms of the kernel computational speed because the algorithm has better data-parallelism and pipeline-parallelism during matrix triangularization.
Key wordsmatrix triangularization decomposition    Cholesky decomposition    parallel structure    field programmable gate array
收稿日期: 2016-04-11      出版日期: 2016-09-15
ZTFLH:  TP302.1  
通讯作者: 林俊宇,讲师,E-mail:linjunyu@hrbeu.edu.cn     E-mail: linjunyu@hrbeu.edu.cn
引用本文:   
刘书勇, 林俊宇, 吴艳霞, 张博为. 基于矩阵三角化分解的Cholesky分解及FPGA并行结构设计[J]. 清华大学学报(自然科学版), 2016, 56(9): 963-968.
LIU Shuyong, LIN Junyu, WU Yanxia, ZHANG Bowei. Cholesky decomposition and parallel structure design based on matrix triangularization decomposition. Journal of Tsinghua University(Science and Technology), 2016, 56(9): 963-968.
链接本文:  
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2016.21.060  或          http://jst.tsinghuajournals.com/CN/Y2016/V56/I9/963
  图1 LT-SC的计算过程
  图2 LT-SC的计算依赖图
  图3 LT-SC的阵列结构与计算时空图
  图4 LT-SC并行结构PE的配置
  图5 LT-SC总体并行结构
  表1 LT-SC并行结构中PE单元的综合结果
  图6 LT-SC阵列并行结构与通用处理器软件实现的加速比和最大频率随矩阵规模增加的变化
[1] Satchidanand G H, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture [J]. Parallel Algorithms and Applications, 2004, 19(4): 211-226.
[2] Kung H T, Leiserson C E. Systolic arrays technical report [R]. Pitsburg, PA, USA: Carneige Mellon University, 1978.
[3] YANG Depeng, Gregory D P, LI Husheng. Compressed sensing and Cholesky decomposition on FPGAs and GPUs [J]. Parallel Computing, 2012, 38(8): 421-437.
[4] Alfredo B, Julien L, Jakub K, et al. A class of parallel tiled linear algebra algorithms for multicore architectures [J]. Parallel Computing, 2009, 35(1): 38-53.
[5] DOU Yong, Vassiliadis S, Kuzmanov G, et al. 64-bit floating-point FPGA matrix multiplication [C]//Proceedings of the 13th ACM/SIGDA International Symposium on Field Programmable Gate Arrays(FPGA’05), New York, NY, USA: ACM, 2005: 86-95.
[6] Vasily V, James W D. LU, QR and Cholesky factorizations using vector capabilities of GPUs [R]. Berkeley, CA, USA: EECS Department, University of California, Berkeley, 2008.
[7] Oleg M, Volodymyr L, Anatoli S, et al. Parallel implementation of Cholesky LLT-algorithm in FPGA-based processor [J]. Parallel Processing and Applied Mathematics, 2008, 49(67): 137-147.
[8] Eunice E S, CHU Peiyun. Efficient and optimal parallel algorithms for Cholesky decomposition [J]. Journal of Mathematical Modeling and Algorithms, 2003, 2(3): 217-234.
[9] Aatonio R, George A. A high throughput FPGA-based floating point conjugate gradient implementation for dense matrices [J]. ACM Transactions on Reconfigurable Technology and Systems, 2010, 3(1): 1-19.
[10] Badia J M, Vidal A M. Parallel algorithms to computer the eigenvalues and eigenvectors of symmetric Toeplitz matrices [J]. Parallel Algorithms and Applications, 1998, 13(1): 75-93.
[11] Anderson E, BAI Z, Bischof C, et al. LAPACK Users Guide [M]. 3rd ED. Philadelphia, PA, USA: The society of industrial and applied mathematics, 1999.
[12] Blackford L S, Choi J, Cleary A, et al. ScaLAPACK Users Guide [M]. Philadelphia, PA, USA: The Society of Industrial and Applied Mathematics, 1997.
[13] WU Guiming, DOU Yong, SUN Junqing, et al. A high performance and memory efficient LU decomposer on FPGAs [J]. Computers, IEEE Transactions, 2012, 61(3): 366-378.
[14] WU Guiming, DOU Yong, Gregory D P. Blocking LU decomposition for FPGAs [C]//Proceedings of the 18th IEEE Symposium on Field-Programmable Custom Computing Machines(FCCM’10). Charlotte, NC,USA: IEEE, 2010, 109-112.
[15] Haridas S G, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared memory multiprocessor architecture [J]. International Journal of Parallel, Emergent and Distributed Systems, 2004, 19(4): 211-226.
[16] 郭磊, 唐玉华, 周杰, 等. 基于FPGA的Cholesky分解细粒度并行结构与实现 [J]. 计算机研究与发展, 48 (Suppl), 2011: 258-265.GUO Lei, TANG Yuhua, ZHOU Jie, et al. Fine-grained architecture and implementation for Cholesky decomposition on FPGA [J]. Journal of Computer Research and Development, 2011, 48(suppl): 258-265. (in Chinese)
[17] WANG Xiaojun, Leeser M. A truly two-dimensional systolic array FPGA implementation of QR decomposition [J]. ACM Transactions on Embedded Computing Systems (TECS), 2009, 9(1): 17-18.
[18] 邬贵明, 窦勇, 王淼. Cholesky分解细粒度并行算法 [J]. 计算机工程与科学, 2010, 32(9): 102-106.WU Guiming, DOU Yong, WANG Miao. A fine-grained parallel algorithm for the Cholesky decomposition [J]. Computer Engineering & Science, 2010, 32(9): 102-106. (in Chinese)
[19] 刘书勇,吴艳霞,张博文,等. 基于可重构计算系统的矩阵三角化分解硬件并行结构研究 [J]. 电子学报, 43(8), 2015: 1642-1650. LIU Shuyong, WU Yanxia, ZHANG Bowen, et al. Research of parallel hardware architecture for matrix triangularization decomposition based on reconfigurable computing system [J]. ACTA Electronica Sinica, 2015, 43 (8): 1642-1650. (in Chinese)
[1] 于金鹏, 周燕, 莫逆, 刘兴男, 赵雷. 基于FPGA的数字化电涡流位移传感器设计[J]. 清华大学学报(自然科学版), 2018, 58(3): 330-336.
[2] 周斌, 邱志昌. 用于MEMS陀螺的PCIe实时测控平台设计[J]. 清华大学学报(自然科学版), 2017, 57(12): 1310-1316.
[3] 彭卓, 邓焱, 马骋, 熊剑平, 尹永利. 基于FPGA的高精度正弦信号发生器设计与实现[J]. 清华大学学报(自然科学版), 2014, 54(2): 197-201.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 《清华大学学报(自然科学版)》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn