Biological macromolecular structure databases in the artificial intelligence era: Coevolution, transformation, and future architectures
Ruiyun YANG, Jianhua HUANG, Qiangfeng ZHANG
Journal of Tsinghua University(Science and Technology) ›› 2025, Vol. 65 ›› Issue (12) : 2449-2463.
Significance: Determining the three-dimensional structures of biological macromolecules is fundamental to understanding the molecular basis of life and to discovering novel therapeutics. As the biological sciences enter the era of artificial intelligence (AI), structural data have become increasingly essential, while AI technologies simultaneously impose higher demands on data organization and management. This review traces the five-decade evolution of biological macromolecular structure databases, with a particular focus on the pivotal role of the Protein Data Bank (PDB). The PDB was established as a small archive of experimentally determined atomic coordinates and has gradually developed into a global infrastructure that underpins structural biology.
Progress: We first chart the progression of structural data resources from early structure archives, which largely functioned as static catalogs of experimentally determined structures, to the emergence of highly curated structural classification systems, such as SCOP and CATH. These resources enable researchers to analyze structural relationships, investigate evolutionary patterns, and derive mechanistic insights. In parallel, sequence-centric databases, such as Pfam, InterPro, and later comprehensive domain-family resources, expanded by annotating conserved elements across the protein universe. Together, these efforts created a rich, multi-layered ecosystem in which the sequence, structure, and function of proteins became increasingly integrated, turning structure databases into indispensable platforms for comparative analysis and mechanistic discovery. A new phase of structural data expansion began with AI-driven structure prediction. The release of the AlphaFold Protein Structure Database (AFDB), followed by complementary resources including the ESM Atlas, drove an unprecedented expansion in structural coverage, spanning entire proteomes and previously challenging protein families.
Conclusions and Prospects: We propose that structural databases and AI models form a mutually reinforcing "double-helix" data model. High-quality experimental structures provide essential references for training and benchmarking predictive models, while large-scale AI-generated structures dramatically increase the amount of available data, revealing new sequence-structure-function relationships and enriching the databases themselves. This synergy could catalyze a paradigm shift in structural biology, transitioning the field from an experiment-led discipline to an integrated ecosystem in which computation and experimentation coevolve. Despite rapid progress in this field, major challenges persist. Structural databases remain affected by experimental sampling biases, uneven representation across organisms and protein families, and persistent inconsistencies in annotation quality. Moreover, the scarcity of dynamic and condition-dependent structural information limits biological interpretability, particularly for intrinsically disordered regions, conformational ensembles, and transient complexes. Furthermore, AI-driven predictions introduce new concerns regarding model interpretability, calibration of confidence metrics, and the governance of large-scale predictive datasets. We anticipate that biological macromolecular structure databases will evolve from merely "AI-enhanced" to "AI-integrated" systems and will ultimately adopt "AI-native" architectures. Such systems will incorporate continuous feedback mechanisms, automated annotation pipelines, and multi-modal data fusion, enabling them to function as reliable knowledge instruments capable of hosting biologically meaningful "digital twins." Collectively, these developments promise to enhance our understanding of structure-function relationships and accelerate rational design in protein engineering, drug discovery, and synthetic biology. As a result, structural databases will continue to underpin scientific innovation while defining a new research standard for the biological sciences.
Keywords: biological macromolecular structure databases / artificial intelligence (AI) / protein structure prediction / database ecosystem / AI-native
We thank Professor Hongwei Wang of Tsinghua University for constructive comments on this paper.