Paper - DeepMSA2: Improving deep learning protein monomer and complex structure prediction


Follow me on CSDN: https://spike.blog.csdn.net/
Original article: https://spike.blog.csdn.net/article/details/135520805

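DeepMSA2 is a pipeline for building high-quality multiple sequence alignments (MSAs) for protein monomers and complexes. It uses iterative sequence search and hidden Markov model algorithms to collect large numbers of homologous sequences from multiple genomic and metagenomic databases. Its main strengths are balanced sequence searching, effective model selection, and the ability to exploit huge metagenomic databases. The results indicate that improving MSA construction offers a new route to better deep learning protein structure prediction, and that optimizing the input information of deep learning methods is as important as designing the predictor itself.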

Paper:

  • DeepMSA2: Improving deep learning protein monomer and complex structure prediction using DeepMSA2 with huge metagenomics data - Nature Methods 2024.1.3
  • DMFold: A deep learning platform for protein complex structure and function predictions based on DeepMSA2

Source code:

  • DMFold: https://zhanggroup.org/DMFold
  • DeepMSA: https://zhanggroup.org/DeepMSA/download/

DMFold

DMFold (also known as DMFold-Multimer) is a deep learning-based approach to protein complex structure and function prediction built on deep multiple sequence alignments (MSAs).

The core of the pipeline is the integration of DeepMSA2 with the modified structure module of AlphaFold2.

Starting from a set of query sequences, DMFold first creates deep monomeric MSAs using an iterative search procedure through multiple whole-genome (Uniclust30 and UniRef90) and metagenome (Metaclust, BFD, Mgnify, TaraDB, MetaSourceDB and JGIclust) databases. Multimeric MSAs are then constructed by pairing the monomeric MSAs based on species annotations.

Next, complex structure models are predicted by feeding the multimeric MSAs into the structural modules of AlphaFold2-Multimer. Functional annotations, including Gene Ontology, Enzyme Commission and Ligand Binding Sites, are then generated by COFACTOR2 and US-align based on the top DMFold structure models.

DMFold participated (as “Zheng”) in CASP15 and ranked as the No. 1 method for protein-protein complex structure prediction, with accuracy significantly higher than the state-of-the-art AlphaFold2 program (i.e., “NBIS-AF2-multimer” in CASP15).

Although DMFold focuses on multi-chain protein complexes, it also accepts single-chain monomer sequences (DMFold-Monomer pipeline). The server is freely accessible to all users, including commercial ones.

DeepMSA

The DeepMSA2 standalone package is a program for generating deep multiple sequence alignments for both monomeric and multimeric proteins.

Please report bugs and questions at the Zhang Lab Service System Discussion Board. The DeepMSA2 package is free for academic and non-profit researchers.

Project downloads:

  • https://zhanggroup.org/DeepMSA/download/download_DeepMSA2.cgi?ID=112358=DeepMSA2-2.0.zip (source package, about 3.8 GB)
  • https://zhanggroup.org/DMFold/download/download_DMFold.cgi?ID=235813=DMFold-1.0.zip (source package, about 6.5 GB)

Supplementary data: the JGI (JGIclust30) file list used by Download_lib.py in the source package, namely:

DB.fasta.aa
DB.fasta.ab
DB.fasta.ac
DB.fasta.ad

1. DeepMSA2 Monomer

[Figure: DeepMSA]
dMSA: The dMSA algorithm used in DeepMSA2 is modified from the authors' previous MSA generation tool, DeepMSA.

qMSA (quadruple MSA): The qMSA is composed of four stages that perform HHblits (v.2), Jackhmmer, HHblits (v.3) and HMMsearch searches against the UniRef30, UniRef90, BFD and Mgnify databases, respectively.
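
The four qMSA stages can be written down as a small lookup table. The sketch below only restates the tool and database pairing from the text; the names `Stage`, `QMSA_STAGES` and `database_for_stage` are illustrative and are not taken from the DeepMSA2 source.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    number: int
    tool: str
    database: str


# Tool/database pairing per qMSA stage, as listed in the text above.
QMSA_STAGES = [
    Stage(1, "HHblits (v.2)", "UniRef30"),
    Stage(2, "Jackhmmer", "UniRef90"),
    Stage(3, "HHblits (v.3)", "BFD"),
    Stage(4, "HMMsearch", "Mgnify"),
]


def database_for_stage(number: int) -> str:
    """Return the database searched at the given qMSA stage."""
    for stage in QMSA_STAGES:
        if stage.number == number:
            return stage.database
    raise ValueError(f"qMSA has no stage {number}")
```

For example, `database_for_stage(3)` identifies the database behind the stage 3 alignment that mMSA later reuses as its probe.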

mMSA: The qMSA stage 3 alignment is used as a probe by HMMsearch, with parameters ‘-E 10 --incE 1e-3’, to search through a metagenomics database combining JGIclust, TaraDB and MetaSourceDB. The resulting sequence hits are converted to a raw sequence database.
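
As a rough illustration of the mMSA search step, the sketch below assembles an `hmmsearch` argument list with the two thresholds quoted above. It assumes the stage 3 alignment has already been converted to a profile HMM file (for instance with `hmmbuild`); the function name and all file paths are hypothetical, and this is not how the DeepMSA2 scripts are necessarily organized.

```python
from typing import List


def build_hmmsearch_command(hmm_path: str, seq_db_path: str, out_sto: str) -> List[str]:
    """Build an hmmsearch argument list using the thresholds quoted in the text."""
    return [
        "hmmsearch",
        "-E", "10",        # reporting E-value threshold
        "--incE", "1e-3",  # inclusion E-value threshold
        "-A", out_sto,     # save an alignment of the included hits
        hmm_path,          # profile HMM (assumed to come from the stage 3 MSA)
        seq_db_path,       # one metagenome sequence file
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` once HMMER is installed; building the command separately keeps it easy to inspect and log before running a long search.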

The JGI clustered data is provided separately; see DeepMSA2/Download_lib.py.

  • JGIclust collected from IMG/M, MetaSourceDB, TaraDB
######### download JGI (JGIclust, MetaSourceDB and TaraDB are included in here)
import os  # required by this excerpt

if not os.path.exists("JGIclust"):
    os.system("mkdir -p JGIclust")
os.chdir("JGIclust")
os.system("wget -c https://zhanggroup.org/ftp/data/JGIclust30/list")
jgiclust_db_list=[]
jgifile=open('list','r')
lines=jgifile.readlines()
jgifile.close()
# one database file name per line of the list file
for line in lines:
    jgiclust_db_list.append(line.strip('\n'))
# download and decompress each database file and its .ssi index
for jgidb in jgiclust_db_list:
    print("download jgi db %s"%jgidb)
    os.system("wget -c https://zhanggroup.org/ftp/data/JGIclust30/%s.xz"%jgidb)
    print("decompress jgi db %s"%jgidb)
    os.system("xz -dvf %s.xz"%jgidb)
    print("download jgi ssi db %s"%jgidb)
    os.system("wget -c https://zhanggroup.org/ftp/data/JGIclust30/%s.ssi.xz"%jgidb)
    print("decompress jgi ssi db %s"%jgidb)
    os.system("xz -dvf %s.ssi.xz"%jgidb)
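
To preview what the download loop will fetch before starting a large transfer, the URLs can be derived from the list file alone. This is a minimal sketch, assuming the list file holds one database file name per line (such as `DB.fasta.aa`); the function name `jgiclust_urls` is illustrative and does not exist in Download_lib.py.

```python
from typing import Iterable, List

BASE_URL = "https://zhanggroup.org/ftp/data/JGIclust30"


def jgiclust_urls(list_lines: Iterable[str], base_url: str = BASE_URL) -> List[str]:
    """Return the .xz and .ssi.xz URLs for every non-empty entry of the list file."""
    urls = []
    for line in list_lines:
        name = line.strip()
        if not name:
            continue  # skip blank lines so no malformed URL is produced
        urls.append(f"{base_url}/{name}.xz")      # compressed sequence file
        urls.append(f"{base_url}/{name}.ssi.xz")  # compressed index file
    return urls
```

Calling `jgiclust_urls(open("list"))` on the downloaded list file yields two URLs per database part, in the same order the original loop requests them.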

Sizes of the related MSA databases:
[Figure: MSA database sizes]
JGIclust30 download URL: https://zhanggroup.org/ftp/data/JGIclust30/

[Figure: JGIclust30]

2. DeepMSA2 Multimer

[Figure: Multimer]
MSA Pairing

MSA pairing. Two types of complexes are considered in DeepMSA2.

For homomeric complexes, in which all component chains are identical, all of the monomer MSAs are used, and the multimeric MSAs are created by concatenating each monomer MSA n times side-by-side, where n is the number of monomer chains.
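
The homomeric case reduces to repeating every aligned row n times. A minimal sketch, assuming an MSA is held as a list of equal-length aligned strings (the function name is illustrative):

```python
from typing import List


def concat_homomer_msa(monomer_msa: List[str], n_chains: int) -> List[str]:
    """Concatenate each aligned row of a monomer MSA n_chains times side-by-side."""
    if n_chains < 1:
        raise ValueError("n_chains must be at least 1")
    return [row * n_chains for row in monomer_msa]
```

For a homodimer, `concat_homomer_msa(["ACD", "A-D"], 2)` doubles the width of every row while keeping the depth of the MSA unchanged.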

For heteromeric complexes, the top M MSAs are selected for each monomer chain so that M^N distinct paired MSAs can be created for the complex, where N is the number of distinct chains in the complex.

To avoid an impractically long MSA construction time, M is set to the largest value that satisfies M^N ≤ 100.

For example, for a complex containing three different protein chains (A2B2C1, N = 3), M is set to 4, since 4^3 = 64 ≤ 100 while 5^3 = 125 > 100.

In other words, for each component chain in this complex, four top-ranked monomer MSAs are selected, and paired MSAs for the complex are built from the 64 different combinations of those monomer MSAs.
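
The choice of M and the enumeration of the M^N combinations can be sketched as follows. The function names are illustrative, and each chain's MSAs are assumed to be already ranked best-first; how DeepMSA2 ranks them is not covered here.

```python
from itertools import product
from typing import List, Sequence, Tuple


def max_msas_per_chain(n_distinct_chains: int, limit: int = 100) -> int:
    """Return the largest integer M such that M ** N <= limit."""
    if n_distinct_chains < 1:
        raise ValueError("need at least one chain")
    m = 1
    while (m + 1) ** n_distinct_chains <= limit:
        m += 1
    return m


def paired_combinations(
    ranked_msas_per_chain: Sequence[Sequence[str]], limit: int = 100
) -> List[Tuple[str, ...]]:
    """Enumerate every combination of the top-M MSAs, one MSA per distinct chain."""
    m = max_msas_per_chain(len(ranked_msas_per_chain), limit)
    top = [list(msas[:m]) for msas in ranked_msas_per_chain]
    return list(product(*top))
```

With three distinct chains the first function returns 4, and the second yields the 64 combinations described above, provided every chain has at least four candidate MSAs.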

Normally, M^N ranges from 50 to 100 for different kinds of heteromeric complexes.

Sequence linking

For a given set of M^N paired monomeric MSAs, the sequences from the monomeric MSAs are concatenated into a multimeric MSA as follows (Supplementary Fig. 12b).

First, the sequences in each monomeric MSA are grouped based on the UniProt annotated species.

The sequences in each group are then ordered based on their sequence identity to the query sequence.

To properly capture orthologous signals of interchain coevolution, the top sequences of different monomeric MSAs belonging to the same species group are linked together side-by-side to form a composite sequence in the multimeric MSA.

In cases where a species that appears in more than one of the other chains has no sequence in one of the monomeric MSAs, that component chain is padded with gaps in the composite sequence, alongside the linked chains that do have that species.

Finally, the unlinked sequences in the monomeric MSAs are padded below the linked sequences.

This composite linking step is applied only to heteromeric complexes, as the MSAs for homomeric complexes are constructed by simply concatenating the same monomer MSA multiple times.
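
The linking procedure above can be sketched in a few lines. This is a simplified illustration, not the DeepMSA2 implementation: the `Hit` record, the function name, and the rule that rank-k sequences are linked up to the smallest group size among chains sharing a species are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    species: str     # species label of the sequence
    identity: float  # sequence identity to the query of its chain
    aligned: str     # aligned row, same length as the chain's query


def link_msas(chains: List[List[Hit]], lengths: List[int]) -> List[str]:
    """Link monomer MSAs by species; return linked rows followed by unlinked rows."""
    # Group each chain's hits by species, best identity first.
    grouped: List[Dict[str, List[Hit]]] = []
    for hits in chains:
        groups: Dict[str, List[Hit]] = defaultdict(list)
        for hit in hits:
            groups[hit.species].append(hit)
        for members in groups.values():
            members.sort(key=lambda h: h.identity, reverse=True)
        grouped.append(groups)

    linked: List[str] = []
    unlinked: List[str] = []
    n = len(chains)
    for species in sorted({s for g in grouped for s in g}):
        present = [i for i in range(n) if species in grouped[i]]
        # Link only species found in at least two chains.
        depth = min(len(grouped[i][species]) for i in present) if len(present) > 1 else 0
        for rank in range(depth):
            linked.append("".join(
                grouped[i][species][rank].aligned if i in present else "-" * lengths[i]
                for i in range(n)
            ))
        # Leftover sequences keep their own columns; all other chains are gaps.
        for i in present:
            for hit in grouped[i][species][depth:]:
                unlinked.append("".join(
                    hit.aligned if j == i else "-" * lengths[j] for j in range(n)
                ))
    return linked + unlinked
```

Each output row has the combined width of all chains, so the result can be stacked directly under the concatenated query row.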

Source code for the extended UniRef species parsing:

import re  # required by this excerpt
# UNIREF_PATTERN, UNIPROT_PATTERN and extract_sequence_ids are defined elsewhere in the source.

def parse_sequence_specie(msa_sequence_identifier: str, dbtype='uniprot'):
    """Gets species from an msa sequence identifier.

    The sequence identifier has the format specified by
    _UNIPROT_TREMBL_ENTRY_NAME_PATTERN or _UNIPROT_SWISSPROT_ENTRY_NAME_PATTERN.
    An example of a sequence identifier: `tr|A0A146SKV9|A0A146SKV9_FUNHE`

    Args:
      msa_sequence_identifier: a sequence identifier.
      dbtype: 'uniref' to use UNIREF_PATTERN, anything else for UNIPROT_PATTERN.

    Returns:
      The species_id string. It is empty in the case where no identifier was found.
    """
    if dbtype=='uniref':
        matches = re.search(UNIREF_PATTERN, msa_sequence_identifier.strip())
    else:
        matches = re.search(UNIPROT_PATTERN, msa_sequence_identifier.strip())
    species_id=''
    if matches:
        species_id=matches.group('SpeciesIdentifier')
    return species_id

def get_species(id: str):
    """get species from id"""
    sequence_identifier, dbtype = extract_sequence_ids(id)
    #print(sequence_identifier,dbtype)
    if sequence_identifier is None:
        return ''
    else:
        #print(parse_sequence_specie(sequence_identifier,dbtype=dbtype))
        return parse_sequence_specie(sequence_identifier,dbtype=dbtype)
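
The excerpt relies on patterns that are not shown. To make the named-group lookup concrete, here is a self-contained sketch with a simplified, hypothetical pattern for UniProt-style identifiers; `EXAMPLE_UNIPROT_PATTERN` is written for this example and may differ from the `UNIPROT_PATTERN` used in the source.

```python
import re

# Hypothetical, simplified pattern: "tr|<accession>|<entry>_<species>" or "sp|...".
EXAMPLE_UNIPROT_PATTERN = re.compile(
    r"^(?:tr|sp)\|[A-Za-z0-9]{6,10}\|[A-Za-z0-9]+_(?P<SpeciesIdentifier>[A-Za-z0-9]{1,5})$"
)


def species_from_uniprot_id(identifier: str) -> str:
    """Return the species mnemonic of a UniProt-style identifier, or '' if none."""
    match = EXAMPLE_UNIPROT_PATTERN.search(identifier.strip())
    return match.group("SpeciesIdentifier") if match else ""
```

Applied to the identifier from the docstring, `tr|A0A146SKV9|A0A146SKV9_FUNHE`, the function extracts the trailing species mnemonic, which is the key used to group sequences during linking.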

MSA selection

Of the M^N concatenated MSAs formed by the MSA pairing procedure, the top 25 MSAs, ranked by M-score, are returned from the DeepMSA2-Multimer pipeline.
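
Once every paired MSA has a score, the selection itself is a top-k pick. The sketch below assumes that a higher M-score is better and that scores are already computed; it does not reproduce the M-score formula, and the function name is illustrative.

```python
import heapq
from typing import Dict, List


def select_top_msas(m_scores: Dict[str, float], keep: int = 25) -> List[str]:
    """Return the names of the `keep` highest-scoring MSAs, best first."""
    # Assumption: larger M-score means a better MSA.
    return heapq.nlargest(keep, m_scores, key=m_scores.__getitem__)
```

With 64 candidate paired MSAs, as in the three-chain example, this keeps 25 and discards the remaining 39.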

[Figure: M-score]
