bedtools指南

本文主要是介绍bedtools指南，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

文章目录

官方文档
下载安装
演示版的bed文件 (demo.bed)
我们的基因组文件（genome.txt）
两侧的运算
填充运算
下载测试数据
提取与genes.gff的间隔相对应的序列
获取测试数据
用这个间隔文件去分割bam文件
实战案例
- 获取数据
- bedtools intersect
- 从注释文件中，选取启动子
- 找到跟每个exon最近的启动子
- 以5Kb一个窗口把人类基因组以覆盖

官方文档

https://bedtools.readthedocs.io/en/latest/index.html

下载安装

cd ~/local/app/
curl -OL  https://github.com/arq5x/bedtools2/releases/download/v2.22.0/bedtools-2.22.0.tar.gz
tar zxvf bedtools-2.22.0.tar.gz
cd bedtools2
make
ln -sf ~/local/app/bedtools2/bin/bedtools ~/bin/bedtools

演示版的bed文件 (demo.bed)

vim demo.bedKM034562    100    200    one    0    +
KM034562    400    500    two    0    -

我们的基因组文件（genome.txt）

vim genome.txt
KM034562    18959

两侧的运算

http://bedtools.readthedocs.io/en/latest/content/tools/flank.html


bedtools flank -i demo.gff -g genome.txt -b 10
KM034562    .    .    91    100    0    +    .    .
KM034562    .    .    201    210    0    +    .    .
KM034562    .    .    391    400    0    -    .    .
KM034562    .    .    501    510    0    -    .    .

bedtools flank -i demo.gff -g genome.txt -l 10 -r 0
KM034562    .    .    91    100    0    +    .    .
KM034562    .    .    391    400    0    -    .    .

bedtools flank -i demo.gff -g genome.txt -l 10 -r 0 -s > flank.gff
less flank.gff
KM034562        .       .       91      100     0       +       .       .
KM034562        .       .       501     510     0       -       .       .

示意图：

xxx

填充运算

http://bedtools.readthedocs.io/en/latest/content/tools/complement.html


bedtools complement -i demo.gff -g genome.txt > complement.gff
less complement.gff
KM034562        0       100
KM034562        200     400
KM034562        500     18959

示意图：

xxx

下载测试数据

通过Entrez Direct访问NCBI
efetch -id KM034562 -format gb -db nucleotide > KM034562.gb
数据格式转化
readseq -format=FASTA -o ~/data/852/KM034562.fa KM034562.gb
readseq -format=GFF -o KM034562.gff KM034562.gb

安装readseq


创建  ~/bin 目录
mkdir -p ~/bin
将~/bin 文件夹加到PATH
echo 'export PATH=~/bin:$PATH' >> ~/.bashrc
source ~/.bashrc-------------------------------------
mkdir  ~/local/app/readseq
cd ~/local/app/readseq
curl -OL http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq.jar-----------------------------------------
echo '#!/bin/bash' > ~/bin/readseq
echo 'java -jar ~/local/app/readseq/readseq.jar $@' >> ~/bin/readseq
chmod +x ~/bin/readseq

从整个文件中提取基因


cat KM034562.gff | bioawk -c gff ' $feature=="gene" { print $0 } ' > genes.gff
less genes.gff
KM034562        -       gene    56      3026    .       +       .       gene "NP"
KM034562        -       gene    3032    4407    .       +       .       gene "VP35"
KM034562        -       gene    4390    5894    .       +       .       gene "VP40"
KM034562        -       gene    5900    8305    .       +       .       gene "GP"
KM034562        -       gene    8288    9740    .       +       .       gene "VP30"
KM034562        -       gene    9885    11518   .       +       .       gene "VP24" ; note "putative"
KM034562        -       gene    11501   18282   .       +       .       gene "L"

提取与genes.gff的间隔相对应的序列

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html


bedtools getfasta -fi ~/data/852/KM034562.fa -bed genes.gff -fo genes.fa
head genes.fa

示意图：
xxx

获取测试数据


mkdir -p ~/local/app
cd ~/local/app
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zipunzip bowtie2-2.3.4.3-linux-x86_64.zip
ln -s ~/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2 ~/bincd ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads/bowtie2 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq 2>alignment.log |samtools view -bS  >tmp.bamsamtools sort tmp.bam > tmp.sorted.bamsamtools index tmp.sorted.bam

用这个间隔文件去分割bam文件

http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

创建一个间隔文件

vim region.bed
gi|9626243|ref|NC_001416.1|    1000    2000

[sunchengquan 11:48:40 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ bedtools intersect -a tmp.sorted.bam  -b region.bed > region.bam
[sunchengquan 11:50:34 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ wc -l tmp.sorted.bam
10124 tmp.sorted.bam
[sunchengquan 11:50:47 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ wc -l region.bam 
220 region.bam
[sunchengquan 11:52:59 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -h region.bam |awk '{print $4}'|tail -1
2000

示意图：

xxx

实战案例

http://quinlanlab.org/tutorials/bedtools/bedtools.html

获取数据

mkdir ~/edu/lec28
cd ~/edu/lec28
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/maurano.dnaseI.tgz
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/cpg.bed
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/exons.bed
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/gwas.bed
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/genome.txt
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/hesc.chromHmm.bed
tar -zxvf maurano.dnaseI.tgz
rm maurano.dnaseI.tgz

这些文件内容:

胎儿的组织样品，包括脑、心脏、肠道、肾脏、肺、肌肉、皮肤以及胃

cpg.bed
人类基因组中的CpG岛

exons.bed
RefSeq exons from human genes

gwas.bed
human disease-associated SNPs that were identified in genome-wide association studies (GWAS)

bedtools intersect


#找到A和B文件中重叠的部分
bedtools intersect -a cpg.bed -b exons.bed | head -5
chr1    29320    29370    CpG:_116
chr1    135124    135563    CpG:_30
chr1    327790    328229    CpG:_29
chr1    327790    328229    CpG:_29
chr1    327790    328229    CpG:_29#-wa：A和B重叠的区间再加上a的剩余部分
#-wb：A和B重叠的区间再加上b的剩余部分
bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5
chr1    28735    29810    CpG:_116    chr1    29320    29370    NR_024540_exon_10_0_chr1_29321_r    0    -
chr1    135124    135563    CpG:_30    chr1    134772    139696    NR_039983_exon_0_0_chr1_134773_r    0    -
chr1    327790    328229    CpG:_29    chr1    324438    328581    NR_028322_exon_2_0_chr1_324439_f    0    +
chr1    327790    328229    CpG:_29    chr1    324438    328581    NR_028325_exon_2_0_chr1_324439_f    0    +
chr1    327790    328229    CpG:_29    chr1    327035    328581    NR_028327_exon_3_0_chr1_327036_f    0    +#-wo Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported
bedtools intersect -a cpg.bed -b exons.bed -wo | head -5
chr1    28735    29810    CpG:_116    chr1    29320    29370    NR_024540_exon_10_0_chr1_29321_r    0    -    50
chr1    135124    135563    CpG:_30    chr1    134772    139696    NR_039983_exon_0_0_chr1_134773_r    0    -    439
chr1    327790    328229    CpG:_29    chr1    324438    328581    NR_028322_exon_2_0_chr1_324439_f    0    +    439
chr1    327790    328229    CpG:_29    chr1    324438    328581    NR_028325_exon_2_0_chr1_324439_f    0    +    439
chr1    327790    328229    CpG:_29    chr1    327035    328581    NR_028327_exon_3_0_chr1_327036_f    0    +    439 #-c For each entry in A, report the number of hits in B while restricting to -f. Reports 0 for A entries that have no overlap with B
bedtools intersect  -a cpg.bed -b exons.bed -c | head 
chr1    28735    29810    CpG:_116    1
chr1    135124    135563    CpG:_30    1
chr1    327790    328229    CpG:_29    3
chr1    437151    438164    CpG:_84    0
chr1    449273    450544    CpG:_99    0
chr1    533219    534114    CpG:_94    0
chr1    544738    546649    CpG:_171    0
chr1    713984    714547    CpG:_60    1
chr1    762416    763445    CpG:_115    10
chr1    788863    789211    CpG:_28    9# 找到覆盖了最多外显子的CPG岛
bedtools intersect -a cpg.bed -b exons.bed -c | sort -k5,5nr | head -2
chrY    15591259    15591720    CpG:_33    77
chrUn_gl000228    70214    114054    CpG:_3259    72bedtools intersect -a cpg.bed -b exons.bed -c | sort -k1,1 -k2,2nr | head -2
chr1    249200252    249200721    CpG:_58    2
chr1    249167408    249168010    CpG:_48    0#找到A文件中没有重叠B的部分 Only report those entries in A that have no overlap in B
bedtools intersect -a cpg.bed -b exons.bed -v | head
chr1    437151    438164    CpG:_84
chr1    449273    450544    CpG:_99
chr1    533219    534114    CpG:_94 
chr1    544738    546649    CpG:_171
chr1    801975    802338    CpG:_24
chr1    805198    805628    CpG:_50
chr1    839694    840619    CpG:_83
chr1    844299    845883    CpG:_153
chr1    912869    913153    CpG:_28
chr1    919726    919927    CpG:_15

从注释文件中，选取启动子

cat hesc.chromHmm.bed | grep Promoter > promoters.bed
cat promoters.bed |head -3
chr1    27737    28537    2_Weak_Promoter
chr1    28537    30137    1_Active_Promoter
chr1    30137    30337    2_Weak_Promoter

找到跟每个exon最近的启动子

多的一列数值是-a 和 -b 两者最近的距离

bedtools closest -a exons.bed -b promoters.bed  -d | head -2
chr1    11873    12227    NR_046018_exon_0_0_chr1_11874_f    0    +    chr1    27737    28537    2_Weak_Promoter    15511
chr1    12612    12721    NR_046018_exon_1_0_chr1_12613_f    0    +    chr1    27737    28537    2_Weak_Promoter    15017

-d In addition to the closest feature in B, report its distance to A as an extra column. The reported distance for overlapping features will be 0

示意图：

在这里插入图片描述

以5Kb一个窗口把人类基因组以覆盖

bedtools makewindows -g genome.txt -w 50000 > windows.bed
cat windows.bed |head -3
chr1    0    50000
chr1    50000    100000
chr1    100000    150000
bedtools makewindows -g genome.txt -w 100000 > windows0.bed
cat windows0.bed |head -3
chr1    0    100000
chr1    100000    200000
chr1    200000    300000

这篇关于bedtools指南的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！