python运行hhsearch二进制命令的包装器类

2023-11-21 14:04

本文主要是介绍python运行hhsearch二进制命令的包装器类,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

hhsearch 是 HMM-HMM(Hidden Markov Model to Hidden Markov Model)比对方法的一部分,属于 HMMER 软件套件。它用于进行蛋白质序列的高效比对,特别适用于检测远缘同源性。

以下是 hhsearch 的一些主要特点和用途:

  1. HMM-HMM比对: hhsearch 使用隐藏马尔可夫模型(HMM)来表示蛋白质家族的模型。与传统的序列-序列比对方法不同,HMM-HMM比对考虑了氨基酸残基的多序列信息,使得在比对中能够更好地捕捉蛋白质家族的模式和结构。

  2. 检测远缘同源性: hhsearch 的一个主要优势是其能够检测到相对远离的同源关系。它在比对中引入了更多的信息,从而提高了对远缘同源蛋白的发现能力。

  3. 灵敏度和特异性: hhsearch 的设计旨在在维持高灵敏度的同时,减少假阳性的比对。这使得它在寻找结构和功能相似性时更为可靠。

  4. 数据库搜索: 用户可以使用 hhsearch 在大型蛋白质数据库中搜索与给定蛋白质序列相似的蛋白质。

"""Library to run HHsearch from Python."""import glob
import os
import subprocess
from typing import Sequence, Optional, List, Iterable
from absl import logging
import contextlib
import tempfile
import dataclasses
import contextlib
import time
import shutil
import re@contextlib.contextmanager
def timing(msg: str):logging.info('Started %s', msg)tic = time.time()yieldtoc = time.time()logging.info('Finished %s in %.3f seconds', msg, toc - tic)@dataclasses.dataclass(frozen=True)
class TemplateHit:"""Class representing a template hit."""index: intname: straligned_cols: intsum_probs: Optional[float]query: strhit_sequence: strindices_query: List[int]indices_hit: List[int]@contextlib.contextmanager
def tmpdir_manager(base_dir: Optional[str] = None):"""Context manager that deletes a temporary directory on exit."""tmpdir = tempfile.mkdtemp(dir=base_dir)try:yield tmpdirfinally:shutil.rmtree(tmpdir, ignore_errors=True)def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:"""Parses the content of an entire HHR file."""lines = hhr_string.splitlines()# Each .hhr file starts with a results table, then has a sequence of hit# "paragraphs", each paragraph starting with a line 'No <hit number>'. We# iterate through each paragraph to parse each hit.block_starts = [i for i, line in enumerate(lines) if line.startswith('No ')]hits = []if block_starts:block_starts.append(len(lines))  # Add the end of the final block.for i in range(len(block_starts) - 1):hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))return hitsdef _parse_hhr_hit(detailed_lines: Sequence[str]) -> TemplateHit:"""Parses the detailed HMM HMM comparison section for a single Hit.This works on .hhr files generated from both HHBlits and HHSearch.Args:detailed_lines: A list of lines from a single comparison section between 2sequences (which each have their own HMM's)Returns:A dictionary with the information from that detailed comparison sectionRaises:RuntimeError: If a certain line cannot be processed"""# Parse first 2 lines.number_of_hit = int(detailed_lines[0].split()[-1])name_hit = detailed_lines[1][1:]# Parse the summary line.pattern = ('Probab=(.*)[\t ]*E-value=(.*)[\t ]*Score=(.*)[\t ]*Aligned_cols=(.*)[\t'' ]*Identities=(.*)%[\t ]*Similarity=(.*)[\t ]*Sum_probs=(.*)[\t '']*Template_Neff=(.*)')match = re.match(pattern, detailed_lines[2])if match is None:raise RuntimeError('Could not parse section: %s. Expected this: \n%s to contain summary.' %(detailed_lines, detailed_lines[2]))(_, _, _, aligned_cols, _, _, sum_probs, _) = [float(x)for x in match.groups()]# The next section reads the detailed comparisons. These are in a 'human# readable' format which has a fixed length. The strategy employed is to# assume that each block starts with the query sequence line, and to parse# that with a regexp in order to deduce the fixed length used for that block.query = ''hit_sequence = ''indices_query = []indices_hit = []length_block = Nonefor line in detailed_lines[3:]:# Parse the query sequence lineif (line.startswith('Q ') and not line.startswith('Q ss_dssp') andnot line.startswith('Q ss_pred') andnot line.startswith('Q Consensus')):# Thus the first 17 characters must be 'Q <query_name> ', and we can parse# everything after that.#              start    sequence       end       total_sequence_lengthpatt = r'[\t ]*([0-9]*) ([A-Z-]*)[\t ]*([0-9]*) \([0-9]*\)'groups = _get_hhr_line_regex_groups(patt, line[17:])# Get the length of the parsed block using the start and finish indices,# and ensure it is the same as the actual block length.start = int(groups[0]) - 1  # Make index zero based.delta_query = groups[1]end = int(groups[2])num_insertions = len([x for x in delta_query if x == '-'])length_block = end - start + num_insertionsassert length_block == len(delta_query)# Update the query sequence and indices list.query += delta_query_update_hhr_residue_indices_list(delta_query, start, indices_query)elif line.startswith('T '):# Parse the hit sequence.if (not line.startswith('T ss_dssp') andnot line.startswith('T ss_pred') andnot line.startswith('T Consensus')):# Thus the first 17 characters must be 'T <hit_name> ', and we can# parse everything after that.#              start    sequence       end     total_sequence_lengthpatt = r'[\t ]*([0-9]*) ([A-Z-]*)[\t ]*[0-9]* \([0-9]*\)'groups = _get_hhr_line_regex_groups(patt, line[17:])start = int(groups[0]) - 1  # Make index zero based.delta_hit_sequence = groups[1]assert length_block == len(delta_hit_sequence)# Update the hit sequence and indices list.hit_sequence += delta_hit_sequence_update_hhr_residue_indices_list(delta_hit_sequence, start, indices_hit)return TemplateHit(index=number_of_hit,name=name_hit,aligned_cols=int(aligned_cols),sum_probs=sum_probs,query=query,hit_sequence=hit_sequence,indices_query=indices_query,indices_hit=indices_hit,)def _get_hhr_line_regex_groups(regex_pattern: str, line: str) -> Sequence[Optional[str]]:match = re.match(regex_pattern, line)if match is None:raise RuntimeError(f'Could not parse query line {line}')return match.groups()def _update_hhr_residue_indices_list(sequence: str, start_index: int, indices_list: List[int]):"""Computes the relative indices for each residue with respect to the original sequence."""counter = start_indexfor symbol in sequence:if symbol == '-':indices_list.append(-1)else:indices_list.append(counter)counter += 1class HHSearch:"""Python wrapper of the HHsearch binary."""def __init__(self,*,binary_path: str,databases: Sequence[str],maxseq: int = 1_000_000):"""Initializes the Python HHsearch wrapper.Args:binary_path: The path to the HHsearch executable.databases: A sequence of HHsearch database paths. This should be thecommon prefix for the database files (i.e. up to but not including_hhm.ffindex etc.)maxseq: The maximum number of rows in an input alignment. Note that thisparameter is only supported in HHBlits version 3.1 and higher.Raises:RuntimeError: If HHsearch binary not found within the path."""self.binary_path = binary_pathself.databases = databasesself.maxseq = maxseq#for database_path in self.databases:#  if not glob.glob(database_path + '_*'):#    logging.error('Could not find HHsearch database %s', database_path)#    raise ValueError(f'Could not find HHsearch database {database_path}')@propertydef output_format(self) -> str:return 'hhr'@propertydef input_format(self) -> str:return 'a3m'def query(self, a3m: str) -> str:"""Queries the database using HHsearch using a given a3m."""with tmpdir_manager() as query_tmp_dir:input_path = os.path.join(query_tmp_dir, 'query.a3m')hhr_path = os.path.join(query_tmp_dir, 'output.hhr')with open(input_path, 'w') as f:f.write(a3m)db_cmd = []for db_path in self.databases:db_cmd.append('-d')db_cmd.append(db_path)cmd = [self.binary_path,'-i', input_path,'-o', hhr_path,'-maxseq', str(self.maxseq)] + db_cmdprint("cmd:",cmd)logging.info('Launching subprocess "%s"', ' '.join(cmd))process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)with timing('HHsearch query'):stdout, stderr = process.communicate()retcode = process.wait()if retcode:# Stderr is truncated to prevent proto size errors in Beam.raise RuntimeError('HHSearch failed:\nstdout:\n%s\n\nstderr:\n%s\n' % (stdout.decode('utf-8'), stderr[:100_000].decode('utf-8')))with open(hhr_path) as f:hhr = f.read()return hhrdef get_template_hits(self,output_string: str,input_sequence: str) -> Sequence[TemplateHit]:"""Gets parsed template hits from the raw string output by the tool."""del input_sequence  # Used by hmmseach but not needed for hhsearch.return parse_hhr(output_string)def convert_stockholm_to_a3m (stockholm_format: str,max_sequences: Optional[int] = None,remove_first_row_gaps: bool = True) -> str:"""Converts MSA in Stockholm format to the A3M format."""descriptions = {}sequences = {}reached_max_sequences = Falsefor line in stockholm_format.splitlines():reached_max_sequences = max_sequences and len(sequences) >= max_sequencesif line.strip() and not line.startswith(('#', '//')):# Ignore blank lines, markup and end symbols - remainder are alignment# sequence parts.seqname, aligned_seq = line.split(maxsplit=1)if seqname not in sequences:if reached_max_sequences:continuesequences[seqname] = ''sequences[seqname] += aligned_seqfor line in stockholm_format.splitlines():if line[:4] == '#=GS':# Description row - example format is:# #=GS UniRef90_Q9H5Z4/4-78            DE [subseq from] cDNA: FLJ22755 ...columns = line.split(maxsplit=3)seqname, feature = columns[1:3]value = columns[3] if len(columns) == 4 else ''if feature != 'DE':continueif reached_max_sequences and seqname not in sequences:continuedescriptions[seqname] = valueif len(descriptions) == len(sequences):break# Convert sto format to a3m line by linea3m_sequences = {}if remove_first_row_gaps:# query_sequence is assumed to be the first sequencequery_sequence = next(iter(sequences.values()))query_non_gaps = [res != '-' for res in query_sequence]for seqname, sto_sequence in sequences.items():# Dots are optional in a3m format and are commonly removed.out_sequence = sto_sequence.replace('.', '')if remove_first_row_gaps:out_sequence = ''.join(_convert_sto_seq_to_a3m(query_non_gaps, out_sequence))a3m_sequences[seqname] = out_sequencefasta_chunks = (f">{k} {descriptions.get(k, '')}\n{a3m_sequences[k]}"for k in a3m_sequences)return '\n'.join(fasta_chunks) + '\n'  # Include terminating newlinedef _convert_sto_seq_to_a3m(query_non_gaps: Sequence[bool], sto_seq: str) -> Iterable[str]:for is_query_res_non_gap, sequence_res in zip(query_non_gaps, sto_seq):if is_query_res_non_gap:yield sequence_reselif sequence_res != '-':yield sequence_res.lower()if __name__ == "__main__":### 1. 准备输入数据## 输入序列先通过Jackhmmer多次迭代从uniref90,MGnify数据库搜索同源序列,输出的多序列比对文件(如globins4.sto),转化为a3m格式后,再通过hhsearch从pdb数据库中找到同源序列input_fasta_file = '/home/zheng/test/Q94K49.fasta'## input_sequencewith open(input_fasta_file) as f:input_sequence = f.read()test_templates_sto_file = "/home/zheng/test/Q94K49_aln.sto"with open(test_templates_sto_file) as f:test_templates_sto = f.read()## sto格式转a3m格式()test_templates_a3m = convert_stockholm_to_a3m(test_templates_sto)hhsearch_binary_path = "/home/zheng/software/hhsuite-3.3.0-SSE2-Linux/bin/hhsearch"### 2.类实例化# scop70_1.75文件名前缀scop70_database_path = "/home/zheng/database/scop70_1.75_hhsuite3/scop70_1.75"pdb70_database_path = "/home/zheng/database/pdb70_from_mmcif_latest/pdb70"#hhsuite数据库下载地址:https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/  ## 单一数据库#template_searcher = HHSearch(binary_path = hhsearch_binary_path,#                             databases = [scop70_database_path])## 多个数据库database_lst = [scop70_database_path, pdb70_database_path]template_searcher = HHSearch(binary_path = hhsearch_binary_path,databases = database_lst) ### 3. 同源序列搜索## 搜索结果返回.hhr文件字符串templates_result = template_searcher.query(test_templates_a3m)print(templates_result)## pdb序列信息列表template_hits = template_searcher.get_template_hits(output_string=templates_result, input_sequence=input_sequence)print(template_hits)

这篇关于python运行hhsearch二进制命令的包装器类的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/402793

相关文章

Redis的Zset类型及相关命令详细讲解

《Redis的Zset类型及相关命令详细讲解》:本文主要介绍Redis的Zset类型及相关命令的相关资料,有序集合Zset是一种Redis数据结构,它类似于集合Set,但每个元素都有一个关联的分数... 目录Zset简介ZADDZCARDZCOUNTZRANGEZREVRANGEZRANGEBYSCOREZ

Python判断for循环最后一次的6种方法

《Python判断for循环最后一次的6种方法》在Python中,通常我们不会直接判断for循环是否正在执行最后一次迭代,因为Python的for循环是基于可迭代对象的,它不知道也不关心迭代的内部状态... 目录1.使用enuhttp://www.chinasem.cnmerate()和len()来判断for

使用Python实现高效的端口扫描器

《使用Python实现高效的端口扫描器》在网络安全领域,端口扫描是一项基本而重要的技能,通过端口扫描,可以发现目标主机上开放的服务和端口,这对于安全评估、渗透测试等有着不可忽视的作用,本文将介绍如何使... 目录1. 端口扫描的基本原理2. 使用python实现端口扫描2.1 安装必要的库2.2 编写端口扫

使用Python实现操作mongodb详解

《使用Python实现操作mongodb详解》这篇文章主要为大家详细介绍了使用Python实现操作mongodb的相关知识,文中的示例代码讲解详细,感兴趣的小伙伴可以跟随小编一起学习一下... 目录一、示例二、常用指令三、遇到的问题一、示例from pymongo import MongoClientf

使用Python合并 Excel单元格指定行列或单元格范围

《使用Python合并Excel单元格指定行列或单元格范围》合并Excel单元格是Excel数据处理和表格设计中的一项常用操作,本文将介绍如何通过Python合并Excel中的指定行列或单... 目录python Excel库安装Python合并Excel 中的指定行Python合并Excel 中的指定列P

一文详解Python中数据清洗与处理的常用方法

《一文详解Python中数据清洗与处理的常用方法》在数据处理与分析过程中,缺失值、重复值、异常值等问题是常见的挑战,本文总结了多种数据清洗与处理方法,文中的示例代码简洁易懂,有需要的小伙伴可以参考下... 目录缺失值处理重复值处理异常值处理数据类型转换文本清洗数据分组统计数据分箱数据标准化在数据处理与分析过

Python调用另一个py文件并传递参数常见的方法及其应用场景

《Python调用另一个py文件并传递参数常见的方法及其应用场景》:本文主要介绍在Python中调用另一个py文件并传递参数的几种常见方法,包括使用import语句、exec函数、subproce... 目录前言1. 使用import语句1.1 基本用法1.2 导入特定函数1.3 处理文件路径2. 使用ex

Python脚本实现自动删除C盘临时文件夹

《Python脚本实现自动删除C盘临时文件夹》在日常使用电脑的过程中,临时文件夹往往会积累大量的无用数据,占用宝贵的磁盘空间,下面我们就来看看Python如何通过脚本实现自动删除C盘临时文件夹吧... 目录一、准备工作二、python脚本编写三、脚本解析四、运行脚本五、案例演示六、注意事项七、总结在日常使用

Python将大量遥感数据的值缩放指定倍数的方法(推荐)

《Python将大量遥感数据的值缩放指定倍数的方法(推荐)》本文介绍基于Python中的gdal模块,批量读取大量多波段遥感影像文件,分别对各波段数据加以数值处理,并将所得处理后数据保存为新的遥感影像... 本文介绍基于python中的gdal模块,批量读取大量多波段遥感影像文件,分别对各波段数据加以数值处

python管理工具之conda安装部署及使用详解

《python管理工具之conda安装部署及使用详解》这篇文章详细介绍了如何安装和使用conda来管理Python环境,它涵盖了从安装部署、镜像源配置到具体的conda使用方法,包括创建、激活、安装包... 目录pytpshheraerUhon管理工具:conda部署+使用一、安装部署1、 下载2、 安装3