真核微生物基因组质量评估工具EukCC的安装和详细使用方法

本文主要是介绍真核微生物基因组质量评估工具EukCC的安装和详细使用方法,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

 介绍:

GitHub - EBI-Metagenomics/EukCC: Tool to estimate genome quality of microbial eukaryotes

 

安装:

docker:

docker pull  microbiomeinformatics/eukcc

推荐conda 环境:

conda install -c conda-forge -c bioconda "eukcc>=2"
# mamba更快
mamba install -c conda-forge -c bioconda "eukcc>=2"pip install eukcc

数据库配置,docker记得映射目录

mkdir eukccdb
cd eukccdb
wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.1.tar.gz
tar -xzvf eukcc2_db_ver_1.1.tar.gz

数据库下载地址:Index of /pub/databases/metagenomics/eukcc

 

下载数据库注意版本,一般选版本2吧, 

链接:https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz 

 https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc_db_v1.1.tar.gz

还有个副产品,diamond的数据库,不过好像看不出是diamond的哪个版本生成的,用的时候不好用的话就用下载的数据库再生成一遍吧。 

https://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/uniref50_20200213_tax.dmnd

 如果不知道数据库位置,或者软件找不到位置,那就简单吧,设置DB目录

export EUKCC2_DB=/path/to/.../eukcc2_db_ver_1.1

 

快速开始

#EukCC on a single MAG
#We assume that you did set you $EUKCC2_DB to the correct location. If not please use the --db flag to pass the database to EukCC.eukcc single --out outfolder --threads 8 bin.fa
#EukCC will then run on 8 threads. You can pass nucleotide fastas or proteomes to EukCC. It will automatically try to detect if it has to predict proteins or not.#By default it will never use more than a single threads for placing the genomes in the reference tree, to save memory.#EukCC on a folder of bins
eukcc folder --out outfolder --threads 8 bins
#EukCC will assume that the folder contains files with the suffix .fa. If that is not the case please adjust the parameter.

序列拼接流程

双端序列需要先构建bam索引

cat binfolder/*.fa > pseudo_contigs.fasta
bwa index pseudo_contigs.fasta
bwa mem -t 8 pseudo_contigs.fasta reads_1.fastq.gz reads_2.fastq.gz  |samtools view -q 20 -Sb - |samtools sort -@ 8 -O bam - -o alignment.bam
samtools index alignment.bam

利用py脚本生成关联表 

binlinks.py  --ANI 99 --within 1500 \--out linktable.csv binfolder alignment.bam

If you have multiple bam files, pass all of them to the script (e.g. *.bam).

You will obtain a three column file (bin_1,bin_2,links).

拼接bins

eukcc folder \--out outfolder \--threads 8  \--links linktable.csv \binfolder

EukCC 首先将分别对所有bins进行运行。随后,它会识别那些至少达到50%完整度但尚未超过100-improve_percent的中等质量bins。接下来,它会找出那些通过至少100对端读配对与这些中等质量bins相连接的bins。若经过合并后bin的质量评分有所提高,则该bin将会被合并。

已合并的bins可以在输出文件夹中找到。

警示

blinks.py

#!/usr/bin/env python3
import pysam
from Bio import SeqIO
from collections import defaultdict
import os
import argparse
import logging
import csvdef is_in(read, contig_map, within=1000):if read.reference_name not in contig_map.keys():return Falseif read.reference_start <= within or read.reference_end <= within:return Trueelif read.reference_start > (contig_map[read.reference_name] - within) or read.reference_end > (contig_map[read.reference_name] - within):return Trueelse:return Falsedef keep_read(read, contig_map, within=1000, min_ANI=98, min_cov=0):ani = ((read.query_alignment_length - read.get_tag("NM"))/ float(read.query_alignment_length)* 100)cov = read.query_alignment_length / float(read.query_length) * 100if ani >= min_ANI and cov >= min_cov and is_in(read, contig_map, within) is True:return Trueelse:return Falsedef contig_map(bindir, suffix=".fa"):m = {}for f in os.listdir(bindir):if f.endswith(suffix) is False:continuepath = os.path.join(bindir, f)with open(path, "r") as handle:for record in SeqIO.parse(handle, "fasta"):m[record.name] = len(record.seq)return mdef bin_map(bindir, suffix=".fa"):contigs = defaultdict(str)contigs_per_bin = defaultdict(int)for f in os.listdir(bindir):if f.endswith(suffix) is False:continuepath = os.path.join(bindir, f)binname = os.path.basename(f)with open(path, "r") as handle:for record in SeqIO.parse(handle, "fasta"):contigs[record.name] = binnamecontigs_per_bin[binname] += 1return contigs, contigs_per_bindef read_pair_generator(bam):"""Generate read pairs in a BAM file or within a region string.Reads are added to read_dict until a pair is found.From: https://www.biostars.org/p/306041/"""read_dict = defaultdict(lambda: [None, None])for read in bam.fetch():if not read.is_paired or read.is_secondary or read.is_supplementary:continueqname = read.query_nameif qname not in read_dict:if read.is_read1:read_dict[qname][0] = readelse:read_dict[qname][1] = readelse:if read.is_read1:yield read, read_dict[qname][1]else:yield read_dict[qname][0], readdel read_dict[qname]def read_bam_file(bamf, link_table, cm, within, ANI):samfile = pysam.AlignmentFile(bamf, "rb")# generate link tablelogging.info("Parsing Bam file. This can take a few moments")for read, mate in read_pair_generator(samfile):if keep_read(read, cm, within, min_ANI=ANI) and keep_read(mate, cm, within, min_ANI=ANI):# fill in the tablelink_table[read.reference_name][mate.reference_name] += 1if read.reference_name != mate.reference_name:link_table[mate.reference_name][read.reference_name] += 1return link_tabledef main():# set arguments# arguments are passed to classesparser = argparse.ArgumentParser(description="Evaluate completeness and contamination of a MAG.")parser.add_argument("bindir", type=str, help="Run script on these bins")parser.add_argument("bam",type=str,help="Bam file(s) with reads aligned against all contigs making up the bins",nargs="+",)parser.add_argument("--out","-o",type=str,required=False,help="Path to output table (Default: links.csv)",default="links.csv",)parser.add_argument("--ANI", type=float, required=False, help="ANI of matching read", default=99)parser.add_argument("--within",type=int,required=False,help="Within this many bp we need the read to map",default=1000,)parser.add_argument("--contigs","-c",action="store_true",default=False,help="Instead of bins print contigs",)parser.add_argument("--quiet","-q",dest="quiet",action="store_true",default=False,help="Silcence most output",)parser.add_argument("--debug","-d",action="store_true",default=False,help="Debug and thus ignore safety",)args = parser.parse_args()# define logginglogLevel = logging.INFOif args.quiet:logLevel = logging.WARNINGelif args.debug:logLevel = logging.DEBUGlogging.basicConfig(format="%(asctime)s %(message)s",datefmt="%d-%m-%Y %H:%M:%S: ",level=logLevel,)bindir = args.bindircm = contig_map(bindir)bm, contigs_per_bin = bin_map(bindir)logging.debug("Found {} contigs".format(len(cm)))link_table = defaultdict(lambda: defaultdict(int))bin_table = defaultdict(lambda: defaultdict(int))# iterate all bam filesfor bamf in args.bam:link_table = read_bam_file(bamf, link_table, cm, args.within, args.ANI)logging.debug("Created link table with {} entries".format(len(link_table)))# generate bin tablefor contig_1, dic in link_table.items():for contig_2, links in dic.items():bin_table[bm[contig_1]][bm[contig_2]] += linkslogging.debug("Created bin table with {} entries".format(len(bin_table)))out_data = []logging.debug("Constructing output dict")if args.contigs:for contig_1, linked in link_table.items():for contig_2, links in linked.items():out_data.append({"bin_1": bm[contig_1],"bin_2": bm[contig_2],"contig_1": contig_1,"contig_2": contig_2,"links": links,"bin_1_contigs": contigs_per_bin[bm[contig_1]],"bin_2_contigs": contigs_per_bin[bm[contig_2]],})else:for bin_1, dic in bin_table.items():for bin_2, links in dic.items():out_data.append({"bin_1": bin_1, "bin_2": bin_2, "links": links})logging.debug("Out data has {} rows".format(len(out_data)))# resultslogging.info("Writing output")with open(args.out, "w") as fout:if len(out_data) > 0:cout = csv.DictWriter(fout, fieldnames=list(out_data[0].keys()))cout.writeheader()for row in out_data:cout.writerow(row)else:logging.warning("No rows to write")if __name__ == "__main__":main()

scripts/filter_euk_bins.py

#!/usr/bin/env python3
#
# This file is part of the EukCC (https://github.com/openpaul/eukcc).
# Copyright (c) 2019 Paul Saary
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# provides all file operation functions
# used inthis package
import os
import argparse
import subprocess
import logging
import tempfile
import gzip
from multiprocessing import Pool# backup fasta handler, so we can use readonly directories
class fa_class:def __init__(self, seq, name, long_name):self.seq = seqself.name = nameself.long_name = long_namedef __str__(self):return self.seqdef __len__(self):return len(self.seq)def Fasta(path):"""Iterator for fasta files"""entry = Falsewith open(path) as fin:for line in fin:if line.startswith(">"):if entry is not False:entry.seq = "".join(entry.seq)yield entry# define new entrylong_name = line.strip()[1:]name = long_name.split()[0]entry = fa_class([], name, long_name)else:entry.seq.append(line.strip())# yield last oneentry.seq = "".join(entry.seq)yield entrydef gunzip(path, tmp_dir):"""Gunzip a file for EukRep"""if path.endswith(".gz"):fna_path = os.path.join(tmp_dir, "contigs.fna")logging.debug("Going to unzip fasta into {}".format(fna_path))with gzip.open(path, "r") as fin, open(fna_path, "w") as fout:for line in fin:fout.write(line.decode())path = fna_pathlogging.debug("Done unzipping {}".format(fna_path))return pathclass EukRep:"""Class to call and handle EukRep data"""def __init__(self, fasta, eukout, bacout=None, minl=1500, tie="euk"):self.fasta = fastaself.eukout = eukoutself.bacout = bacoutself.minl = minlself.tie = tiedef run(self):# command list will be calledcmd = ["EukRep","--min",str(self.minl),"-i",self.fasta,"--seq_names","-ff","--tie",self.tie,"-o",self.eukout,]if self.bacout is not None:cmd.extend(["--prokarya", self.bacout])subprocess.run(cmd, check=True, shell=False)self.read_result()def read_result(self):self.euks = self.read_eukfile(self.eukout)self.bacs = set()if self.bacout is not None:self.bacs = self.read_eukfile(self.bacout)def read_eukfile(self, path):lst = set()with open(path) as infile:for line in infile:lst.add(line.strip())return lstclass bin:def __init__(self, path, eukrep):self.e = eukrepself.bname = os.path.basename(path)self.path = os.path.abspath(path)def __str__(self):return "{} euks: {} bacs: {}".format(self.bname, self.table["euks"], self.table["bacs"])def stats(self):"""read bin content and figure genomic composition"""logging.debug("Loading bin")fa_file = Fasta(self.path)stats = {"euks": 0, "bacs": 0, "NA": 0, "sum": 0}# loop and compute statslogging.debug("Make per bin stats")for seq in fa_file:if seq.name in self.e.euks:stats["euks"] += len(seq)elif seq.name in self.e.bacs:stats["bacs"] += len(seq)else:stats["NA"] += len(seq)stats["sum"] = sum([v for k, v in stats.items()])self.table = statsdef decide(self, eukratio=0.2, bacratio=0.1, minbp=100000, minbpeuks=1000000):"""rule to handle decision logic"""keep = Trueallb = self.table["sum"]if self.table["euks"] < minbpeuks:keep = Falselogging.info(f"Eukaryotic DNA amount only {self.table['euks']} instead of target {minbpeuks}")elif self.table["euks"] / allb <= eukratio:keep = Falselogging.info(f"Eukaryotic DNA ratio not higher than {eukratio}")elif self.table["bacs"] / allb >= bacratio:keep = Falselogging.info(f"Bacterial DNA content higher than {bacratio}")elif self.table["sum"] < minbp:keep = Falselogging.info("We did not find at least %d bp of DNA", minbp)self.keep = keepif __name__ == "__main__":parser = argparse.ArgumentParser()parser.add_argument("--output", help="path for the output table", default="assignment.csv", type=str)parser.add_argument("bins", nargs="+", help="all bins to classify", type=str)parser.add_argument("--threads","-t",type=int,help="How many bins should be run in parallel (Default: 1)",default=1,)parser.add_argument("--minl",type=int,help="define minimal length of contig for EukRep \to classify (default: 1500)",default=1500,)parser.add_argument("--eukratio",type=float,help="This ratio of eukaryotic DNA to all DNA has to be found\at least (default: 0, ignore)",default=0,)parser.add_argument("--bacratio",type=float,help="discard bins with bacterial ratio of higher than\(default: 1, ignore)",default=1,)parser.add_argument("--minbp",type=float,help="Only keep bins with at least n bp of dna\(default: 8000000)",default=8000000,)parser.add_argument("--minbpeuks",type=float,help="Only keep bins with at least n bp of Eukaryotic dna\(default: 5000000)",default=5000000,)parser.add_argument("--rerun", action="store_true", help="rerun even if output exists", default=False)parser.add_argument("--quiet", action="store_true", help="supress information", default=False)parser.add_argument("--debug", action="store_true", help="Make it more verbose", default=False)args = parser.parse_args()# define logginglogLevel = logging.INFOif args.quiet:logLevel = logging.WARNINGelif args.debug:logLevel = logging.DEBUGlogging.basicConfig(format="%(asctime)s %(message)s", datefmt="%m/%d/%Y %H:%M:%S: ", level=logLevel)def evaluate_bin(path):if not os.path.exists(path):logging.error("Can not find file {}".format(path))exit(1)logging.info("Launch on {}".format(path))with tempfile.TemporaryDirectory(prefix="filter_EukRep_") as tmp_dir:logging.debug("Using tmp folder: {}".format(tmp_dir))eukfile = os.path.join(tmp_dir, "euks.fna")bacfile = os.path.join(tmp_dir, "bacs.fna")# EukRep can not deal with Gzipped Fasta files, so we unzip it in case it is a Gzip filepath = gunzip(path, tmp_dir)# Launching EukReplogging.debug(f"Starting EukRep on {path}")eukrep_result = EukRep(path, eukfile, bacfile, minl=args.minl)eukrep_result.run()b = bin(path, eukrep_result)b.stats()b.decide(eukratio=args.eukratio, bacratio=args.bacratio, minbp=args.minbp, minbpeuks=args.minbpeuks)return b# multithreading poolpool = Pool(processes=args.threads)results = pool.map(evaluate_bin, args.bins)pool.close()pool.join()with open(args.output, "w") as outfile:outfile.write("path,binname,passed,bp_eukaryote,bp_prokaryote,bp_unassigned,bp_sum\n")for b in results:outfile.write(f"{b.path},{b.bname},{b.keep},{b.table['euks']},{b.table['bacs']},{b.table['NA']},{b.table['sum']}\n")

这篇关于真核微生物基因组质量评估工具EukCC的安装和详细使用方法的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/579419

相关文章

C语言中联合体union的使用

本文编辑整理自: http://bbs.chinaunix.net/forum.php?mod=viewthread&tid=179471 一、前言 “联合体”(union)与“结构体”(struct)有一些相似之处。但两者有本质上的不同。在结构体中,各成员有各自的内存空间, 一个结构变量的总长度是各成员长度之和。而在“联合”中,各成员共享一段内存空间, 一个联合变量

ESP32 esp-idf esp-adf环境安装及.a库创建与编译

简介 ESP32 功能丰富的 Wi-Fi & 蓝牙 MCU, 适用于多样的物联网应用。使用freertos操作系统。 ESP-IDF 官方物联网开发框架。 ESP-ADF 官方音频开发框架。 文档参照 https://espressif-docs.readthedocs-hosted.com/projects/esp-adf/zh-cn/latest/get-started/index

问题:第一次世界大战的起止时间是 #其他#学习方法#微信

问题:第一次世界大战的起止时间是 A.1913 ~1918 年 B.1913 ~1918 年 C.1914 ~1918 年 D.1914 ~1919 年 参考答案如图所示

揭秘未来艺术:AI绘画工具全面介绍

📑前言 随着科技的飞速发展,人工智能(AI)已经逐渐渗透到我们生活的方方面面。在艺术创作领域,AI技术同样展现出了其独特的魅力。今天,我们就来一起探索这个神秘而引人入胜的领域,深入了解AI绘画工具的奥秘及其为艺术创作带来的革命性变革。 一、AI绘画工具的崛起 1.1 颠覆传统绘画模式 在过去,绘画是艺术家们通过手中的画笔,蘸取颜料,在画布上自由挥洒的创造性过程。然而,随着AI绘画工

墨刀原型工具-小白入门篇

墨刀原型工具-小白入门篇 简介 随着互联网的发展和用户体验的重要性越来越受到重视,原型设计逐渐成为了产品设计中的重要环节。墨刀作为一款原型设计工具,以其简洁、易用的特点,受到了很多设计师的喜爱。本文将介绍墨刀原型工具的基本使用方法,以帮助小白快速上手。 第一章:认识墨刀原型工具 1.1 什么是墨刀原型工具 墨刀是一款基于Web的原型设计工具,可以帮助设计师快速创建交互原型,并且可以与团队

[word] word设置上标快捷键 #学习方法#其他#媒体

word设置上标快捷键 办公中,少不了使用word,这个是大家必备的软件,今天给大家分享word设置上标快捷键,希望在办公中能帮到您! 1、添加上标 在录入一些公式,或者是化学产品时,需要添加上标内容,按下快捷键Ctrl+shift++就能将需要的内容设置为上标符号。 word设置上标快捷键的方法就是以上内容了,需要的小伙伴都可以试一试呢!

Linux 安装、配置Tomcat 的HTTPS

Linux 安装 、配置Tomcat的HTTPS 安装Tomcat 这里选择的是 tomcat 10.X ,需要Java 11及更高版本 Binary Distributions ->Core->选择 tar.gz包 下载、上传到内网服务器 /opt 目录tar -xzf 解压将解压的根目录改名为 tomat-10 并移动到 /opt 下, 形成个人习惯的路径 /opt/tomcat-10

Tolua使用笔记(上)

目录   1.准备工作 2.运行例子 01.HelloWorld:在C#中,创建和销毁Lua虚拟机 和 简单调用。 02.ScriptsFromFile:在C#中,对一个lua文件的执行调用 03.CallLuaFunction:在C#中,对lua函数的操作 04.AccessingLuaVariables:在C#中,对lua变量的操作 05.LuaCoroutine:在Lua中,

大学湖北中医药大学法医学试题及答案,分享几个实用搜题和学习工具 #微信#学习方法#职场发展

今天分享拥有拍照搜题、文字搜题、语音搜题、多重搜题等搜题模式,可以快速查找问题解析,加深对题目答案的理解。 1.快练题 这是一个网站 找题的网站海量题库,在线搜题,快速刷题~为您提供百万优质题库,直接搜索题库名称,支持多种刷题模式:顺序练习、语音听题、本地搜题、顺序阅读、模拟考试、组卷考试、赶快下载吧! 2.彩虹搜题 这是个老公众号了 支持手写输入,截图搜题,详细步骤,解题必备

Vim使用基础篇

本文内容大部分来自 vimtutor,自带的教程的总结。在终端输入vimtutor 即可进入教程。 先总结一下,然后再分别介绍正常模式,插入模式,和可视模式三种模式下的命令。 目录 看完以后的汇总 1.正常模式(Normal模式) 1.移动光标 2.删除 3.【:】输入符 4.撤销 5.替换 6.重复命令【. ; ,】 7.复制粘贴 8.缩进 2.插入模式 INSERT