RNA-seq analysis (FastQC + Trimmomatic + STAR + htseq-count + DESeq2)
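I have been doing RNA-seq analysis recently, so I took the chance to write up my workflow, both to share it and to learn from others.
The pipeline is FastQC + Trimmomatic + STAR + htseq-count + DESeq2.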
Checking data integrity
# Loop over sample directories only; the subshell restores the working directory
for dir in */; do (cd "$dir" && md5sum -c MD5_*txt); done
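The one-liner above prints every file's status, which is hard to scan across many samples. As a rough sketch, the helper below reports one line per directory and returns a non-zero status if any directory fails. The name check_md5_dirs is made up for illustration, and the code assumes GNU md5sum and manifests named MD5_*txt as in the command above.

```shell
# check_md5_dirs ROOT: verify the MD5_*txt manifest in every subdirectory of ROOT.
# Hypothetical helper; assumes GNU md5sum and manifests named MD5_*txt.
check_md5_dirs() {
  root=${1:-.}
  status=0
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # Run in a subshell so the working directory is restored automatically.
    if (cd "$dir" && md5sum --quiet -c MD5_*txt) >/dev/null 2>&1; then
      echo "OK      $dir"
    else
      echo "FAILED  $dir"
      status=1
    fi
  done
  return "$status"
}
```

Calling check_md5_dirs . from the download directory would list each sample directory as OK or FAILED; a directory with no manifest is reported as FAILED.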
Preprocessing
FastQC + Trimmomatic
fastqc -t 5 sample_R1.fq.gz
fastqc -t 5 sample_R2.fq.gz
# Steps run in the order listed, so MINLEN goes last; $HOME is used because ~ is not expanded after "ILLUMINACLIP:"
java -jar ~/tools/Trimmomatic/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 20 sample_R1.fq.gz sample_R2.fq.gz -baseout sample_filtered.fq.gz ILLUMINACLIP:$HOME/tools/Trimmomatic/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 HEADCROP:15 MINLEN:36
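FastQC showed that some samples failed the Per tile sequence quality [1], Per base sequence content, Adapter Content, and Kmer Content modules. Trimming therefore has three goals: remove low-quality reads; cut the first 15 or so bases, whose base composition is uneven, with HEADCROP; and clip the adapters with ILLUMINACLIP, passing the TruSeq adapter file because the libraries were prepared with TruSeq adapters. Trimmomatic handles all three.
For more on Trimmomatic, see [2, 3, 4].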
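With -baseout sample_filtered.fq.gz, Trimmomatic in PE mode writes four files: paired and unpaired reads for each mate, tagged 1P, 1U, 2P and 2U. The small sketch below only reproduces that naming for common FASTQ suffixes, to make clear which files feed the aligner. The function name trimmomatic_outputs is an illustrative assumption, not part of Trimmomatic.

```shell
# trimmomatic_outputs BASEOUT: print the four output names Trimmomatic PE
# derives from a -baseout value such as sample_filtered.fq.gz.
# Hypothetical helper; only mirrors the naming for the suffixes listed below.
trimmomatic_outputs() {
  base=$1
  case $base in
    *.fq.gz)    ext=.fq.gz ;;
    *.fastq.gz) ext=.fastq.gz ;;
    *.fq)       ext=.fq ;;
    *.fastq)    ext=.fastq ;;
    *) echo "unsupported suffix: $base" >&2; return 1 ;;
  esac
  stem=${base%"$ext"}
  # 1P/2P: both mates survived trimming; 1U/2U: the other mate was dropped.
  for tag in 1P 1U 2P 2U; do
    echo "${stem}_${tag}${ext}"
  done
}
```

Only the two paired files (1P and 2P) are passed to STAR in the alignment step.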
STAR
Build the index
The human and mouse genomes and reference annotations were downloaded from TopHat's iGenomes:
STAR --runThreadN 30 --runMode genomeGenerate --genomeDir STARINDEX_20180118/ --genomeFastaFiles WholeGenomeFasta/genome.fa --sjdbGTFfile ../Annotation/Genes/genes.gtf --sjdbOverhang 134
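The STAR manual recommends setting --sjdbOverhang to the maximum read length minus 1. The value 134 used above is consistent with 150 bp reads that lose 15 bases to HEADCROP:15, but the raw read length is not stated in this post, so 150 is an assumption. A minimal sketch of the arithmetic, with a made-up function name:

```shell
# sjdb_overhang RAW_LENGTH [HEADCROP]: print (longest trimmed read length - 1),
# the value suggested for STAR's --sjdbOverhang.
sjdb_overhang() {
  raw_len=$1
  headcrop=${2:-0}
  echo $(( raw_len - headcrop - 1 ))
}

# Assumed example: 150 bp raw reads (an assumption) with HEADCROP:15.
sjdb_overhang 150 15   # prints 134
```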
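Run the alignment
Optionally, the SJ.out.tab file from a first alignment pass can be used to regenerate the genome index before aligning a second time (this is especially recommended when calling SNPs and indels) [7].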
STAR --runThreadN 30 --genomeDir ~/Ref/UCSC_hg19/Homo_sapiens/UCSC/hg19/Sequence/STARIndex_20180118 --readFilesIn sample_filtered_1P.fq.gz sample_filtered_2P.fq.gz --outFileNamePrefix ./Hs_treat3/Hs_treat3 --readFilesCommand zcat
References: [5, 6].
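Instead of regenerating the index by hand from SJ.out.tab, STAR also offers a built-in per-sample two-pass mode via --twopassMode Basic. This is a different route to the same idea, not the exact procedure described above. The sketch below only prints the command so it can be inspected before running. The function name star_two_pass_cmd and the default of 8 threads are illustrative assumptions, and paths are assumed to contain no spaces.

```shell
# star_two_pass_cmd INDEX R1 R2 PREFIX [THREADS]: print a STAR alignment
# command that enables the built-in two-pass mode (--twopassMode Basic).
# Hypothetical helper; it prints rather than runs the command.
star_two_pass_cmd() {
  index=$1 r1=$2 r2=$3 prefix=$4 threads=${5:-8}
  printf 'STAR --runThreadN %s --genomeDir %s --readFilesIn %s %s' \
    "$threads" "$index" "$r1" "$r2"
  printf ' --readFilesCommand zcat --outFileNamePrefix %s --twopassMode Basic\n' \
    "$prefix"
}
```

For example, star_two_pass_cmd STARIndex_20180118 sample_filtered_1P.fq.gz sample_filtered_2P.fq.gz ./Hs_treat3/Hs_treat3 30 prints the command; piping the output to sh would execute it.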
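Filter the alignments and sort by read name
The -F 4 flag below removes unmapped reads only; reads that map to multiple regions are still present after this step. Sorting by name (-n) matches the read order htseq-count expects for paired-end data by default.
samtools view -bS -F 4 Hs_treat3Aligned.out.sam > Hs_treat3_mapped.bam
# samtools 1.x syntax; 0.1.x releases take an output prefix instead of -o
samtools sort -n -o Hs_treat3_sort.bam Hs_treat3_mapped.bam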
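To actually drop reads that map to multiple regions, one option is to filter on the NH tag, which STAR writes by default and which records how many loci a read was reported at. The awk filter below is a sketch of that idea operating on SAM text. The name keep_unique and the output file name in the usage comment are illustrative assumptions.

```shell
# keep_unique: read SAM on stdin; print header lines and alignments whose
# NH tag equals 1 (reads reported at a single locus).
# Hypothetical helper; assumes the aligner wrote NH tags (STAR does by default).
keep_unique() {
  awk -F '\t' '
    /^@/ { print; next }                       # pass header lines through
    {
      for (i = 12; i <= NF; i++)               # optional tags start at field 12
        if ($i == "NH:i:1") { print; break }
    }'
}
# Usage (assumes samtools 1.x is installed; output name is a placeholder):
#   samtools view -h Hs_treat3_mapped.bam | keep_unique | samtools view -b - > Hs_treat3_unique.bam
```

Because both mates of a pair carry the same NH value, filtering line by line keeps pairs together.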
HTSeq
Use htseq-count to compute read counts [8, 9].
htseq-count -f bam -s no Hs_treat3_sort.bam ~/Ref/UCSC_hg19/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > sample.count
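htseq-count writes one two-column file per sample (gene ID, count), ending with summary rows whose names start with a double underscore, such as __no_feature. DESeq2 needs a single matrix with genes in rows and samples in columns. The sketch below is one way to build it. The function name merge_counts and the file names in the usage note are illustrative assumptions.

```shell
# merge_counts FILE...: combine per-sample htseq-count outputs into one
# tab-separated matrix (genes in rows, samples in columns) on stdout.
# Hypothetical helper; column names are the file names without ".count".
merge_counts() {
  printf 'gene'
  for f in "$@"; do printf '\t%s' "$(basename "$f" .count)"; done
  printf '\n'
  awk -F '\t' -v OFS='\t' '
    FNR == 1 { col++ }                           # first line of a new file
    $1 ~ /^__/ { next }                          # skip summary rows (__no_feature, ...)
    !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
    { count[$1, col] = $2 }
    END {
      for (i = 1; i <= n; i++) {
        row = order[i]
        for (c = 1; c <= col; c++)
          row = row OFS (((order[i], c) in count) ? count[order[i], c] : 0)
        print row
      }
    }' "$@"
}
```

Running merge_counts A1.count A2.count B1.count B2.count > counts.tsv would produce a table that R can read with read.delim. The column order follows the argument order, so it should match the order of the condition factor.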
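Differential expression analysis with DESeq2
library(DESeq2)
# hs: integer count matrix, genes in rows and the four samples in columns.
# "counts.tsv" is a placeholder name for the merged htseq-count output.
hs <- as.matrix(read.delim("counts.tsv", row.names = 1))
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(hs, DataFrame(condition), ~ condition)
dds <- dds[ rowSums(counts(dds)) > 1, ]    # drop genes with almost no counts
nrow(dds)
dds <- DESeq(dds)                          # run the differential expression analysis
res <- results(dds)                        # extract the results table with results()
summary(res)                               # summarize the results with summary()
count_r <- counts(dds, normalized = TRUE)  # extract the normalized count matrix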
Reference: [10].