Cholesky decomposition and parallel structure design based on matrix triangularization decomposition

LIU Shuyong, LIN Junyu, WU Yanxia, ZHANG Bowei

Journal of Tsinghua University(Science and Technology) ›› 2016, Vol. 56 ›› Issue (9) : 963-968.

PDF(1331 KB)
PDF(1331 KB)
Journal of Tsinghua University(Science and Technology) ›› 2016, Vol. 56 ›› Issue (9) : 963-968. DOI: 10.16511/j.cnki.qhdxxb.2016.21.060
COMPUTER SCIENCE AND TECHNOLOGY

Cholesky decomposition and parallel structure design based on matrix triangularization decomposition

  • {{article.zuoZhe_EN}}
Author information +
History +

Abstract

Matrix computing is one of the core problems in high performance computing with matrix decomposition being an important way to improve the parallelism of matrix computations. FPGA gives a powerful environment for parallel computing. This study uses Cholesky decomposition based on a hardware-adaptive parallel sub-matrix identity updating algorithm. Its parallel structure is based on FPGA. Tests show that this structure achieves more than 10 fold speedup compared to general-purpose processors in terms of the kernel computational speed because the algorithm has better data-parallelism and pipeline-parallelism during matrix triangularization.

Key words

matrix triangularization decomposition / Cholesky decomposition / parallel structure / field programmable gate array

Cite this article

Download Citations
LIU Shuyong, LIN Junyu, WU Yanxia, ZHANG Bowei. Cholesky decomposition and parallel structure design based on matrix triangularization decomposition[J]. Journal of Tsinghua University(Science and Technology). 2016, 56(9): 963-968 https://doi.org/10.16511/j.cnki.qhdxxb.2016.21.060

References

[1] Satchidanand G H, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture [J]. Parallel Algorithms and Applications, 2004, 19(4): 211-226.
[2] Kung H T, Leiserson C E. Systolic arrays technical report [R]. Pitsburg, PA, USA: Carneige Mellon University, 1978.
[3] YANG Depeng, Gregory D P, LI Husheng. Compressed sensing and Cholesky decomposition on FPGAs and GPUs [J]. Parallel Computing, 2012, 38(8): 421-437.
[4] Alfredo B, Julien L, Jakub K, et al. A class of parallel tiled linear algebra algorithms for multicore architectures [J]. Parallel Computing, 2009, 35(1): 38-53.
[5] DOU Yong, Vassiliadis S, Kuzmanov G, et al. 64-bit floating-point FPGA matrix multiplication [C]//Proceedings of the 13th ACM/SIGDA International Symposium on Field Programmable Gate Arrays(FPGA’05), New York, NY, USA: ACM, 2005: 86-95.
[6] Vasily V, James W D. LU, QR and Cholesky factorizations using vector capabilities of GPUs [R]. Berkeley, CA, USA: EECS Department, University of California, Berkeley, 2008.
[7] Oleg M, Volodymyr L, Anatoli S, et al. Parallel implementation of Cholesky LLT-algorithm in FPGA-based processor [J]. Parallel Processing and Applied Mathematics, 2008, 49(67): 137-147.
[8] Eunice E S, CHU Peiyun. Efficient and optimal parallel algorithms for Cholesky decomposition [J]. Journal of Mathematical Modeling and Algorithms, 2003, 2(3): 217-234.
[9] Aatonio R, George A. A high throughput FPGA-based floating point conjugate gradient implementation for dense matrices [J]. ACM Transactions on Reconfigurable Technology and Systems, 2010, 3(1): 1-19.
[10] Badia J M, Vidal A M. Parallel algorithms to computer the eigenvalues and eigenvectors of symmetric Toeplitz matrices [J]. Parallel Algorithms and Applications, 1998, 13(1): 75-93.
[11] Anderson E, BAI Z, Bischof C, et al. LAPACK Users Guide [M]. 3rd ED. Philadelphia, PA, USA: The society of industrial and applied mathematics, 1999.
[12] Blackford L S, Choi J, Cleary A, et al. ScaLAPACK Users Guide [M]. Philadelphia, PA, USA: The Society of Industrial and Applied Mathematics, 1997.
[13] WU Guiming, DOU Yong, SUN Junqing, et al. A high performance and memory efficient LU decomposer on FPGAs [J]. Computers, IEEE Transactions, 2012, 61(3): 366-378.
[14] WU Guiming, DOU Yong, Gregory D P. Blocking LU decomposition for FPGAs [C]//Proceedings of the 18th IEEE Symposium on Field-Programmable Custom Computing Machines(FCCM’10). Charlotte, NC,USA: IEEE, 2010, 109-112.
[15] Haridas S G, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared memory multiprocessor architecture [J]. International Journal of Parallel, Emergent and Distributed Systems, 2004, 19(4): 211-226.
[16] 郭磊, 唐玉华, 周杰, 等. 基于FPGA的Cholesky分解细粒度并行结构与实现 [J]. 计算机研究与发展, 48 (Suppl), 2011: 258-265.GUO Lei, TANG Yuhua, ZHOU Jie, et al. Fine-grained architecture and implementation for Cholesky decomposition on FPGA [J]. Journal of Computer Research and Development, 2011, 48(suppl): 258-265. (in Chinese)
[17] WANG Xiaojun, Leeser M. A truly two-dimensional systolic array FPGA implementation of QR decomposition [J]. ACM Transactions on Embedded Computing Systems (TECS), 2009, 9(1): 17-18.
[18] 邬贵明, 窦勇, 王淼. Cholesky分解细粒度并行算法 [J]. 计算机工程与科学, 2010, 32(9): 102-106.WU Guiming, DOU Yong, WANG Miao. A fine-grained parallel algorithm for the Cholesky decomposition [J]. Computer Engineering & Science, 2010, 32(9): 102-106. (in Chinese)
[19] 刘书勇,吴艳霞,张博文,等. 基于可重构计算系统的矩阵三角化分解硬件并行结构研究 [J]. 电子学报, 43(8), 2015: 1642-1650. LIU Shuyong, WU Yanxia, ZHANG Bowen, et al. Research of parallel hardware architecture for matrix triangularization decomposition based on reconfigurable computing system [J]. ACTA Electronica Sinica, 2015, 43 (8): 1642-1650. (in Chinese)
PDF(1331 KB)

Accesses

Citation

Detail

Sections
Recommended

/