PhaseFinder: Introduction, Installation, and Usage of a Tool for Detecting Phase-Variable (Invertible) DNA Regions in Bacterial Genomes


PhaseFinder

## Overview
The PhaseFinder algorithm is designed to detect DNA inversion-mediated phase variation in bacterial genomes using genomic or metagenomic sequencing data. It works by identifying regions flanked by inverted repeats, mimicking their inversion in silico, and identifying regions where sequencing reads support both orientations. Here, we define phase variation as "a process employed by bacteria to generate frequent and reversible changes within specific hypermutable loci, introducing phenotypic diversity into clonal populations". Not every region detected by PhaseFinder will directly result in phase variation, but the results should be highly enriched for regions that do.
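
To make "mimicking their inversion in silico" concrete, here is a minimal Python sketch of the idea on a toy sequence. It is an illustration only, not PhaseFinder's own code: the function names and the toy genome are made up for this example. Inverting a span means replacing it with its reverse complement; when the span includes both inverted repeats, the repeats themselves are unchanged and only the sequence between them flips.

```python
# Illustrative sketch only; names and the toy sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_region(genome: str, start: int, end: int) -> str:
    """Return a copy of `genome` with genome[start:end] inverted.

    `start` is 0-based inclusive and `end` is exclusive (Python slicing).
    """
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


# Toy layout: flank + left repeat + middle + right repeat + flank.
# The right repeat is the reverse complement of the left repeat.
left_ir = "ACGGA"
right_ir = reverse_complement(left_ir)  # "TCCGT"
forward = "TTTT" + left_ir + "CCATAG" + right_ir + "GGGG"

# Invert the span covering both repeats (positions 4..20).
inverted = invert_region(forward, 4, 20)

print(forward)   # F orientation
print(inverted)  # R orientation: only the middle segment differs
```

Reads that match the middle segment of `forward` support the F orientation, and reads that match the middle segment of `inverted` support the R orientation.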

github: https://github.com/XiaofangJ/PhaseFinder

## Prerequisites
+ [Biopython](https://biopython.org/)
+ [pandas](https://pandas.pydata.org)
+ [samtools](http://samtools.sourceforge.net/) (>=1.4)
+ [bowtie](https://github.com/BenLangmead/bowtie) (>=1.2.0)
+ [einverted](http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/einverted.html)
+ [bedops](https://bedops.readthedocs.io/en/latest/)
+ [bedtools](https://bedtools.readthedocs.io/en/latest/)

## Installation
To install PhaseFinder:

git clone git@github.com:nlm-irp-jianglab/PhaseFinder.git
cd PhaseFinder
conda env create --file environment.yml
conda activate PhaseFinder

## Quick Start
All you need to get started is a genome (in fasta format) that you would like to search for invertible DNA regions, plus either genomic sequencing data from the same organism or metagenomic sequencing data from a sample containing the organism (in both cases preferably Illumina reads in fastq format).

To test PhaseFinder, you can use the example files (genome: test.fa; genomic data: p1.fq, p2.fq):

    # Identify regions flanked by inverted repeats
    python PhaseFinder.py locate -f ./data/test.fa -t ./data/test.einverted.tab -g 15 85 -p

    # Mimic inversion
    python PhaseFinder.py create -f ./data/test.fa -t ./data/test.einverted.tab -s 1000 -i ./data/test.ID.fasta

    # Identify regions where sequencing reads support both orientations
    python PhaseFinder.py ratio -i ./data/test.ID.fasta -1 ./data/p1.fq -2 ./data/p2.fq -p 16 -o ./data/out

If successful, the output will be in data/out.ratio.txt.

In this example, there is one genuine invertible DNA region, "am_0171_0068_d5_0006:81079-81105-81368-81394", because only this region has reads supporting both the F and R orientations.
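
Region identifiers like the one above follow the `Scaffold:posA-posB-posC-posD` format listed in the step 3 output table. A small Python sketch for splitting such an identifier into its parts might look like this. The class and function names are hypothetical, and the length calculation assumes the identifier uses the same coordinate convention as the position table (0-based start, 1-based end).

```python
from typing import NamedTuple


class InvertibleRegion(NamedTuple):
    """Parsed form of a 'Scaffold:posA-posB-posC-posD' identifier."""
    scaffold: str
    pos_a: int
    pos_b: int
    pos_c: int
    pos_d: int


def parse_region_id(region_id: str) -> InvertibleRegion:
    # Split on the last ':' so scaffold names that contain ':' still parse.
    scaffold, _, coords = region_id.rpartition(":")
    parts = coords.split("-")
    if not scaffold or len(parts) != 4:
        raise ValueError(f"unexpected region identifier: {region_id!r}")
    a, b, c, d = (int(p) for p in parts)
    return InvertibleRegion(scaffold, a, b, c, d)


region = parse_region_id("am_0171_0068_d5_0006:81079-81105-81368-81394")
print(region.scaffold)
# Repeat lengths, assuming 0-based starts and 1-based ends.
print(region.pos_b - region.pos_a, region.pos_d - region.pos_c)
# Length of the sequence between the two repeats.
print(region.pos_c - region.pos_b)
```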

---

## Tutorial
1. Generate a position table of regions flanked by inverted repeats 
Users can identify inverted repeats using the "PhaseFinder.py locate" command, or generate their own table.

1.1. Generate the position table with the PhaseFinder script

    Usage: PhaseFinder.py locate [OPTIONS]

      Locate putative inverted regions

    Options:
      -f, --fasta PATH        Input genome sequence file in fasta format
                              [required]
      -t, --tab PATH          Output table with inverted repeats coordinates
                              [required]
      -e, --einv TEXT         Einverted parameters, if unspecified run with
                              PhaseFinder default pipeline
      -m, --mismatch INTEGER  Max number of mismatches allowed between IR pairs,
                              used with -einv (default:3)
      -r, --IRsize INTEGER    Max size of the inverted repeats, used with -einv
                              (default:50)
      -g, --gcRatio MIN MAX   The minimum and maximum value of GC ratio
      -p, --polymer           Remove homopolymer inverted repeats
      --help                  Show this message and exit.

Input: A fasta file containing the genome sequence
Output: A table file containing the position information of inverted repeats in the genome

Examples:
* Run the default PhaseFinder locate parameters

python PhaseFinder.py locate -f ./data/test.fa -t ./data/test.einverted.tab 

* Run the default PhaseFinder locate parameters and remove inverted repeats with GC content lower than 15% or higher than 85%, or with homopolymers

python PhaseFinder.py locate -f ./data/test.fa -t ./data/test.einverted.tab -g 15 85 -p 

* Run with the specified einverted parameters "-maxrepeat 750 -gap 100 -threshold 51 -match 5 -mismatch -9" 

python PhaseFinder.py locate -f ./data/test.fa -t ./data/test.einverted.tab -e "-maxrepeat 750 -gap 100 -threshold 51 -match 5 -mismatch -9" 
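
To show what the `-g 15 85` and `-p` filters in the examples above screen for, here is a rough Python sketch. It rests on assumptions and is not PhaseFinder's implementation: it assumes the GC bounds are applied to the repeat sequence itself, and it treats a "homopolymer" as a repeat made of a single base. All function names are hypothetical.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in `seq` (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def is_homopolymer(seq: str) -> bool:
    """True if `seq` consists of one repeated base (assumed definition)."""
    return len(set(seq.upper())) == 1


def keep_repeat(seq: str, gc_min: float = 15.0, gc_max: float = 85.0) -> bool:
    """Keep a repeat only if its GC content is within bounds and it is not a homopolymer."""
    return gc_min <= gc_percent(seq) <= gc_max and not is_homopolymer(seq)


# Toy repeats: one balanced, one poly-A, one all-GC, one all-AT.
for ir in ["ACGGATTC", "AAAAAAAA", "GCGCGCGC", "ATATATAT"]:
    print(ir, gc_percent(ir), keep_repeat(ir))
```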


1.2. Generate the position table with other tools
You can identify regions flanked by inverted repeats directly with tools such as [einverted](http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/einverted.html) and [palindrome](http://emboss.sourceforge.net/apps/cvs/emboss/apps/palindrome.html). 

Prepare the output into the following format:

A table file with five columns (tab delimited):

 Column name | Explanation                                                       |
-------------|-------------------------------------------------------------------|
Scaffold     | The scaffold or contig name where the inverted repeat is detected
pos A        | The start coordinate of the first inverted repeat (0-based)
pos B        | The end coordinate of the first inverted repeat (1-based)
pos C        | The start coordinate of the second inverted repeat (0-based)
pos D        | The end coordinate of the second inverted repeat (1-based)

---

2. Mimic inversion in silico to create a database of inverted sequences

    Usage: PhaseFinder.py create [OPTIONS]

      Create inverted fasta file

    Options:
      -f, --fasta PATH         Input genome sequence file in fasta format
                               [required]
      -t, --tab PATH           Table with inverted repeat coordinates  [required]
      -s, --flanksize INTEGER  Base pairs of flanking DNA on both sides of the
                               identified inverted repeats  [required]
      -i, --inv PATH           Output path of the inverted fasta file  [required]
      --help                   Show this message and exit.

Input
* The position table from step 1

Output
* A fasta file containing inverted (R) and non-inverted (F) putative invertible DNA regions flanked by sequences of specified length (bowtie indexed)
* A table file (with suffix ".info.tab") describing the location of inverted repeats in the above fasta file

---

3. Align sequence reads to the inverted sequence database and calculate the ratio of reads aligning to the F or R orientation

    Usage: PhaseFinder.py ratio [OPTIONS]

      Align reads to inverted fasta file

    Options:
      -i, --inv PATH         Input path of the inverted fasta file  [required]
      -1, --fastq1 PATH      First pair in fastq  [required]
      -2, --fastq2 PATH      Second pair in fastq  [required]
      -p, --threads INTEGER  Number of threads
      -o, --output TEXT      Output prefix  [required]
      --help                 Show this message and exit.

Input
* Output from step 2
* fastq files of genomic or metagenomic sequencing reads used to verify DNA inversion
* Number of threads used for bowtie alignment and samtools processing

Output
* A table file (with suffix ".ratio.txt") containing the number of reads supporting either the R or F orientation of each invertible DNA region

 Column name | Explanation                                                                 |
-------------|-----------------------------------------------------------------------------|
Sequence     | Putative invertible regions (format: Scaffold:posA-posB-posC-posD)
Pe_F         | The number of reads supporting the F orientation with paired-end information
Pe_R         | The number of reads supporting the R orientation with paired-end information
Pe_ratio     | Pe_R/(Pe_F + Pe_R). The proportion of reads supporting the R orientation with the paired-end method
Span_F       | The number of reads supporting the F orientation spanning the inverted repeat by at least 10 bp on either side
Span_R       | The number of reads supporting the R orientation spanning the inverted repeat by at least 10 bp on either side
Span_ratio   | Span_R/(Span_F + Span_R). The proportion of reads supporting the R orientation with the spanning method

True invertible regions have reads supporting both the F and R orientation. We recommend combining the information from both the paired-end (Pe) and spanning (Span) methods to find valid invertible DNA regions. Our default is to classify a region as invertible if at least 1% of reads support the R orientation with a minimum Pe_R > 5 and Span_R > 3. 
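
The default classification rule above can be sketched in Python as follows. This is one reading of the rule, not the authors' script: it assumes the 1% threshold applies to both the paired-end and the spanning ratio, that the file is tab-delimited with a header row using the column names from the table, and it recomputes the ratios from the counts. The example rows are made up.

```python
import csv
import io

# Made-up rows in the column layout of the ".ratio.txt" table (tab-delimited).
EXAMPLE = (
    "Sequence\tPe_F\tPe_R\tPe_ratio\tSpan_F\tSpan_R\tSpan_ratio\n"
    "contigA:100-120-400-420\t900\t30\t0.0323\t450\t12\t0.0260\n"
    "contigB:500-515-900-915\t1000\t4\t0.0040\t500\t2\t0.0040\n"
    "contigC:70-95-300-325\t5000\t10\t0.0020\t2000\t5\t0.0025\n"
)


def is_invertible(row, min_ratio=0.01, min_pe_r=5, min_span_r=3):
    """Apply the thresholds: ratio >= 1%, Pe_R > 5, Span_R > 3."""
    pe_f, pe_r = int(row["Pe_F"]), int(row["Pe_R"])
    span_f, span_r = int(row["Span_F"]), int(row["Span_R"])
    # Checking the minimum counts first also guarantees non-zero denominators.
    if pe_r <= min_pe_r or span_r <= min_span_r:
        return False
    pe_ratio = pe_r / (pe_f + pe_r)
    span_ratio = span_r / (span_f + span_r)
    return pe_ratio >= min_ratio and span_ratio >= min_ratio


reader = csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t")
invertible = [row["Sequence"] for row in reader if is_invertible(row)]
print(invertible)
```

For a real run, replacing `io.StringIO(EXAMPLE)` with an open file handle on the `.ratio.txt` output should work, provided the header matches the column names assumed here.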

4. (Optional) Subset for intergenic invertible DNA regions 

If you are especially interested in intergenic regulatory regions, such as promoters, you can remove predicted invertible regions overlapping with coding sequences (CDS). First, obtain an annotation for the genome of interest in GFF3 format, either from NCBI or one that you generate yourself. Second, subset the annotation for CDS regions only. Third, use the following command to process the output of PhaseFinder step 3 to obtain a list of intergenic putative invertible DNA regions.

sed '1d' output_from_phasefinder.ratio.txt| awk '{print $1"\t"$0}'|sed 's/:/\t/;s/-[^\t]*-/\t/'|sortBed |closestBed  -a - -b annotation.gff  -d |awk '$20!=0{print $3}' > intergenic_IDR.txt
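
No command is given for the second step (subsetting the annotation to CDS features). One possible way to do it in Python is sketched below. The function name and the example annotation are hypothetical; the only assumption about the input is the standard GFF3 layout of nine tab-separated columns with the feature type in column 3.

```python
import io


def cds_lines(gff_lines):
    """Yield GFF3 feature lines whose type column (column 3) is 'CDS'."""
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        fields = line.rstrip("\n").split("\t")
        # Feature lines have exactly nine columns; anything else
        # (blank lines, embedded FASTA sequence) is ignored.
        if len(fields) == 9 and fields[2] == "CDS":
            yield line


# Made-up annotation with two CDS features, one gene, and one tRNA.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "contigA\tRefSeq\tgene\t1\t300\t.\t+\t.\tID=gene1\n"
    "contigA\tRefSeq\tCDS\t1\t300\t.\t+\t0\tID=cds1;Parent=gene1\n"
    "contigA\tRefSeq\ttRNA\t500\t575\t.\t-\t.\tID=trna1\n"
    "contigA\tRefSeq\tCDS\t800\t1400\t.\t-\t0\tID=cds2\n"
)

cds_only = list(cds_lines(io.StringIO(EXAMPLE_GFF)))
print("".join(cds_only), end="")
```

Passing an open file handle instead of `io.StringIO(EXAMPLE_GFF)` and writing the yielded lines to a new file would produce a CDS-only annotation to use as `annotation.gff` in the command above.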

## Citation
Jiang X, Hall AB, et al. Invertible promoters mediate bacterial phase variation, antibiotic resistance, and host adaptation in the gut, *Science* (2019) [DOI: 10.1126/science.aau5238](http://science.sciencemag.org/content/363/6423/181)
