perl novel可变剪接识别(2)

2024-01-05 15:32
文章标签 识别 可变 perl novel 剪接

本文主要是介绍perl novel可变剪接识别(2),希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

博主其实对未知的可变剪接分类有些困惑,但想了很久才使用了一项比较复杂的算法,接下来并不是分类,而是先转换数据库,因为ensembl|gencode数据库的格式并不能满足博主的需求,转换后更能方便地处理接下来的工作:

#!/usr/bin/env perl
use warnings;
use strict;my (%gene);
open GTF, $ARGV[0] or die $!;
while(<GTF>)
{chomp;next if(/^#/);my @tmp = split;my ($gid) = $_ =~ /gene_id "([^;]+)";/; #匹配基因名称if($_ =~ /gene_type "protein_coding";/) #分coding和noncoding分别处理,因为noncoding没有UTR结构,不存在UTR的可变剪接{if($tmp[2] =~ /exon/){push @{$gene{$tmp[0]}{$gid}{$tmp[6]}}, "$tmp[3],$tmp[4],1";}}else{if($tmp[2] =~ /exon/){push @{$gene{$tmp[0]}{$gid}{$tmp[6]}}, "$tmp[3],$tmp[4],0";}}
}
close GTF;open OUT, ">$ARGV[0].myformat" or die $!;
foreach my $c(keys %gene)
{foreach my $g(keys %{$gene{$c}}){foreach my $ps(keys %{$gene{$c}{$g}}){if($ps eq '+'){my ($s, $e, $type);if(@{$gene{$c}{$g}{$ps}} eq 1){($s,$e, $type) = (split /,/, $gene{$c}{$g}{$ps}->[0])[0,1,2];}else{$s = (split /,/, $gene{$c}{$g}{$ps}->[0])[0];$e = (split /,/, $gene{$c}{$g}{$ps}->[-1])[1];$type = (split /,/, $gene{$c}{$g}{$ps}->[0])[2];}my (@start, @end);foreach my $loci(@{$gene{$c}{$g}{$ps}}){push @start, (split /,/, $loci)[0];push @end, (split /,/, $loci)[1];}my $ss = join ",", @start;my $ee = join ",", @end;print OUT join "\t", $c, $g, "$s-$e", $ps, $type, $ss, $ee, "\n";}else{my @ps_a = reverse @{$gene{$c}{$g}{$ps}};my ($s, $e, $type);if(@ps_a eq 1){($s,$e,$type) = (split /,/, $ps_a[0])[0,1,2];}else{$s = (split /,/, $ps_a[0])[0];$e = (split /,/, $ps_a[-1])[1];$type = (split /,/, $ps_a[0])[2];}my (@start, @end);foreach my $loci(@ps_a){push @start, (split /,/, $loci)[0];push @end, (split /,/, $loci)[1];}my $ss = join ",", @start;my $ee = join ",", @end;print OUT join "\t", $c, $g, "$s-$e", $ps, $type, $ss, $ee, "\n";}}}
}

这个格式类似ucsc的refseq格式,当然只是类似而已,而博主所需求的是gene类型和exon起始与终止而已。

说到数据库的转换,博主还想起来一件事情,就是refseq转gff格式,其实两者差的有些多。

有个妹子写了一个转换的,还不错,展示的内容非常的详细想要啥结果都可以,当然在这之中博主也贡献了一点功劳,嘿嘿!就拿来在这展示一下吧:

#!perl -w
use strict; 
die "Usage : perl $0 <in.refGene.lst> <out.gff>" unless (@ARGV == 2);
my ($in, $out) = @ARGV;my($pre, $insert, @inserts, $utr, $utr_o, $i, $j, $cds, $nm, $chr, $direction, $start_exon, $end_exon, $start_cds, $end_cds, $cds_num, $start, $end, $tmp, $gene, @starts, @ends, $up, $down);if ($in =~ /\.gz/){open IN, " gzip -dc $in | " || die $!;}
else{open IN, $in || die $!;}
if ($out =~ /\.gz/){open OUT, "| gzip > $out" || die $!;}
else {open OUT , "> $out" || die $!;}while (<IN>){chomp;($nm, $chr, $direction, $start_exon, $end_exon, $start_cds, $end_cds, $cds_num, $start, $end, $tmp, $gene, $insert) = (split)[1..12,15];if ($nm =~ /NM/){$pre = 'mRNA';}else{$pre = 'ncRNA';}$start =~ s/,$//;$end =~ s/,$//;$insert =~ s/,$//;$insert =~ s/\-1/\./g;if ($cds_num > 1){@starts = split /,/, $start;@ends = split /,/, $end;@inserts = split /,/, $insert;}else{@starts = ($start);@ends = ($end);@inserts = ($insert);}print OUT "$chr\trefGene\t$pre\t$start_exon\t$end_exon\t.\t$direction\t.\tID=$nm; name=$gene;\n";if ($direction eq '+'){$utr = 5;$utr_o = 3;#print OUT "$chr\trefGene\t5-UTR\t$start_exon\t",$start_cds-1,"\t.\t$direction\t.\tParent=$nm;\n";} else {$utr = 3;$utr_o = 5;#print OUT "$chr\trefGene\t3-UTR\t$start_exon\t",$start_cds-1,"\t.\t$direction\t.\tParent=$nm;\n";}for ($i = 0; $i < @starts; $i ++){print OUT "$chr\trefGene\tintron\t",$ends[$i-1]+1,"\t",$starts[$i]-1,"\t.\t$direction\t.\tParent=$nm;\n" if ($i > 0);if ($pre eq 'ncRNA'){print OUT "$chr\trefGene\tCDS\t$starts[$i]\t$ends[$i]\t.\t$direction\t$inserts[$i]\tParent=$nm;\n";next;}if ($ends[$i] < $start_cds){print OUT "$chr\trefGene\t$utr-UTR\t$starts[$i]\t$ends[$i]\t.\t$direction\t$inserts[$i]\tParent=$nm;\n";}elsif($starts[$i] < $start_cds and $ends[$i] > $start_cds){print OUT "$chr\trefGene\t$utr-UTR\t$starts[$i]\t", $start_cds - 1, "\t.\t$direction\t.\tParent=$nm;\n";print OUT "$chr\trefGene\tCDS\t$start_cds\t$ends[$i]\t.\t$direction\t$inserts[$i]\tParent=$nm;\n";}elsif($starts[$i] >= $start_cds and $ends[$i] <= $end_cds){print OUT "$chr\trefGene\tCDS\t$starts[$i]\t$ends[$i]\t.\t$direction\t$inserts[$i]\tParent=$nm;\n";}elsif($starts[$i] < $end_cds and $ends[$i] > $end_cds){print OUT "$chr\trefGene\tCDS\t$starts[$i]\t$end_cds\t.\t$direction\t$inserts[$i]\tParent=$nm;\n";print OUT "$chr\trefGene\t$utr_o-UTR\t", $end_cds + 1, "\t$ends[$i]\t.\t$direction\t.\tParent=$nm;\n";}else{print OUT "$chr\trefGene\t$utr_o-UTR\t$starts[$i]\t$ends[$i]\t.\t$direction\t$inserts[$i]\tParent=$nm;\n";}}
}
close IN;
close OUT;

上面是正常的gff格式,下面将添加一些别的东西:

#!perl -w
use strict;
die "Usage : perl $0 <in.file> <out.file> " if (@ARGV > 2);
my ($in, $out) = @ARGV;$in ||= 'refGene.sort.2.gz';
$out ||= 'refGene.ann.gz';if($in =~ /\.gz/){open IN, "gzip -dc $in |" || die $!;
}else{open IN, $in || die $!;
}if ($out =~ /\.gz/){open OUT, "| gzip > $out " || die $!;
}else{open OUT, "> $out" || die $!;
}my (@tmps, $chr, $ref, $region, $start, $end, $num, $dot, $id, $count_cds, $count_intron, $gene, $nm);
my (%infos, %genes);
my @chrs = (1..22, 'X', 'Y');$/ = "\n>";
chomp (my $line = <IN>);
@tmps = split /\n/, $line;my ($chr_b, $end_b, $orientation, $id_b) = $tmps[0] =~ /(chr\w+)\t\w+\t\w+RNA\t\d+\t(\d+)\t\.\t([\+\-])\t\.\t(ID=\S+; name=\S+)$/;
$tmps[0] =~ s/>//;
print OUT "$tmps[0]\n";
@tmps[1..$#tmps] = &add_num ($orientation, @tmps[1..$#tmps]);
print OUT join "\n", @tmps[1..$#tmps];
print OUT "\n";while (<IN>){chomp;@tmps = split /\n/, $_;($chr, $start, $end, $orientation, $id) = $tmps[0] =~ /(chr\w+)\t\w+\t\w+RNA\t(\d+)\t(\d+)\t\.\t([\+\-])\t\.\t(ID=\S+; name=\S+)$/;if ($chr eq $chr_b and $end_b < $start){print OUT "$chr\trefGene\tintergenic\t",$end_b + 1, "\t", $start - 1, "\t.\t.\t.\t$id_b|$id\n";}($chr_b, $end_b ,$id_b) = ($chr, $end, $id);$tmps[0] =~ s/>//;print OUT "$tmps[0]\n";@tmps[1..$#tmps] = &add_num ($orientation, @tmps[1..$#tmps]);print OUT join "\n", @tmps[1..$#tmps];print OUT "\n";
}
close IN;
close OUT;
#$/ = '\n';sub add_num {my @tmps = @_;my ($count_cds, $count_intron) = (0, 0);if ($tmps[0] eq '+'){for my $tmp (@tmps[1..$#tmps]){if ($tmp =~ /CDS/){$count_cds ++;$tmp =~ s/\.\t\+/$count_cds\t\+/;#$count_cds ++;}elsif($tmp =~ /intron/){$count_intron ++;$tmp =~ s/\.\t\+/$count_intron\t\+/;#$count_intron ++;}}return @tmps[1..$#tmps];}#return @tmps[1..$#tmps];for my $tmp (@tmps[1..$#tmps]){if ($tmp =~ /CDS/){$count_cds ++;}elsif($tmp =~ /intron/){$count_intron ++;}}for my $tmp (@tmps[1..$#tmps]){if ($tmp =~ /CDS/){$tmp =~ s/\.\t\-/$count_cds\t\-/;$count_cds --;}elsif($tmp =~ /intron/){$tmp =~ s/\.\t\-/$count_intron\t\-/;$count_intron --;}}return @tmps[1..$#tmps];
}

脚本如下:

less refGene.txt.gz | sort -k3,3 -k5,5n |gzip > refGene.txt.sort.gz
perl format.pl refGene.txt.sort.gz refGene.gff.gz
less refGene.gff.gz  |perl -lane 'if($F[2] =~ /RNA/){$F[0] = ">$F[0]";}print join "\t",@F[0..7], "$F[8] $F[9]";'  |gzip > refGene.gff.2.gz
perl ann.sort.pl refGene.gff.2.gz refGene.ann.gz

最终的转换格式如下:

chr1	refGene	intron	763230	764381	1	+	.	Parent=NR_047525; 
chr1	refGene	CDS	764382	764484	2	+	.	Parent=NR_047525; 
chr1	refGene	intron	764485	787305	2	+	.	Parent=NR_047525; 
chr1	refGene	CDS	787306	787490	3	+	.	Parent=NR_047525; 
chr1	refGene	intron	787491	788049	3	+	.	Parent=NR_047525; 
chr1	refGene	CDS	788050	788146	4	+	.	Parent=NR_047525; 
chr1	refGene	intron	788147	788769	4	+	.	Parent=NR_047525; 
chr1	refGene	CDS	788770	794826	5	+	.	Parent=NR_047525; 
chr1	refGene	intergenic	794827	803449	.	.	.	ID=NR_047525; name=LOC643837;|ID=NR_027055; name=FAM41C;
chr1	refGene	ncRNA	803450	812182	.	-	.	ID=NR_027055; name=FAM41C;
chr1	refGene	CDS	803450	804055	3	-	.	Parent=NR_027055; 
chr1	refGene	intron	804056	809490	2	-	.	Parent=NR_027055; 
chr1	refGene	CDS	809491	810535	2	-	.	Parent=NR_027055; 
chr1	refGene	intron	810536	812124	1	-	.	Parent=NR_027055; 
chr1	refGene	CDS	812125	812182	1	-	.	Parent=NR_027055; 
chr1	refGene	intergenic	812183	852951	.	.	.	ID=NR_027055; name=FAM41C;|ID=NR_026874; name=LOC100130417;
chr1	refGene	ncRNA	852952	854817	.	-	.	ID=NR_026874; name=LOC100130417;
chr1	refGene	CDS	852952	853100	4	-	.	Parent=NR_026874; 
chr1	refGene	intron	853101	853400	3	-	.	Parent=NR_026874; 
chr1	refGene	CDS	853401	853555	3	-	.	Parent=NR_026874; 
chr1	refGene	intron	853556	854203	2	-	.	Parent=NR_026874; 
chr1	refGene	CDS	854204	854295	2	-	.	Parent=NR_026874; 
chr1	refGene	intron	854296	854713	1	-	.	Parent=NR_026874; 
chr1	refGene	CDS	854714	854817	1	-	.	Parent=NR_026874; 
chr1	refGene	intergenic	854818	861119	.	.	.	ID=NR_026874; name=LOC100130417;|ID=NM_152486; name=SAMD11;
chr1	refGene	mRNA	861120	879961	.	+	.	ID=NM_152486; name=SAMD11;
chr1	refGene	5-UTR	861120	861180	.	+	.	Parent=NM_152486; 
chr1	refGene	intron	861181	861300	1	+	.	Parent=NM_152486; 
chr1	refGene	5-UTR	861301	861320	.	+	.	Parent=NM_152486; 
chr1	refGene	CDS	861321	861393	1	+	0	Parent=NM_152486;

第6列为exon和intron的排列顺序,第8列为翻译偏移量,intergenic为添加两个gene或transcript之间的距离等,加强版的gff格式~~~


 

这篇关于perl novel可变剪接识别(2)的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/573333

相关文章

如何通过海康威视设备网络SDK进行Java二次开发摄像头车牌识别详解

《如何通过海康威视设备网络SDK进行Java二次开发摄像头车牌识别详解》:本文主要介绍如何通过海康威视设备网络SDK进行Java二次开发摄像头车牌识别的相关资料,描述了如何使用海康威视设备网络SD... 目录前言开发流程问题和解决方案dll库加载不到的问题老旧版本sdk不兼容的问题关键实现流程总结前言作为

Perl 特殊变量详解

《Perl特殊变量详解》Perl语言中包含了许多特殊变量,这些变量在Perl程序的执行过程中扮演着重要的角色,:本文主要介绍Perl特殊变量,需要的朋友可以参考下... perl 特殊变量Perl 语言中包含了许多特殊变量,这些变量在 Perl 程序的执行过程中扮演着重要的角色。特殊变量通常用于存储程序的

C++11第三弹:lambda表达式 | 新的类功能 | 模板的可变参数

🌈个人主页: 南桥几晴秋 🌈C++专栏: 南桥谈C++ 🌈C语言专栏: C语言学习系列 🌈Linux学习专栏: 南桥谈Linux 🌈数据结构学习专栏: 数据结构杂谈 🌈数据库学习专栏: 南桥谈MySQL 🌈Qt学习专栏: 南桥谈Qt 🌈菜鸡代码练习: 练习随想记录 🌈git学习: 南桥谈Git 🌈🌈🌈🌈🌈🌈🌈🌈🌈🌈🌈🌈🌈�

阿里开源语音识别SenseVoiceWindows环境部署

SenseVoice介绍 SenseVoice 专注于高精度多语言语音识别、情感辨识和音频事件检测多语言识别: 采用超过 40 万小时数据训练,支持超过 50 种语言,识别效果上优于 Whisper 模型。富文本识别:具备优秀的情感识别,能够在测试数据上达到和超过目前最佳情感识别模型的效果。支持声音事件检测能力,支持音乐、掌声、笑声、哭声、咳嗽、喷嚏等多种常见人机交互事件进行检测。高效推

perl的学习记录——仿真regression

1 记录的背景 之前只知道有这个强大语言的存在,但一直侥幸自己应该不会用到它,所以一直没有开始学习。然而人生这么长,怎就确定自己不会用到呢? 这次要搭建一个可以自动跑完所有case并且打印每个case的pass信息到指定的文件中。从而减轻手动跑仿真,手动查看log信息的重复无效低质量的操作。下面简单记录下自己的思路并贴出自己的代码,方便自己以后使用和修正。 2 思路整理 作为一个IC d

Clion不识别C代码或者无法跳转C语言项目怎么办?

如果是中文会显示: 此时只需要右击项目,或者你的源代码目录,将这个项目或者源码目录标记为项目源和头文件即可。 英文如下:

BERN2(生物医学领域)命名实体识别与命名规范化工具

BERN2: an advanced neural biomedical named entity recognition and normalization tool 《Bioinformatics》2022 1 摘要 NER和NEN:在生物医学自然语言处理中,NER和NEN是关键任务,它们使得从生物医学文献中自动提取实体(如疾病和药物)成为可能。 BERN2:BERN2是一个工具,

行为智能识别摄像机

行为智能识别摄像机 是一种结合了人工智能技术和监控摄像技术的先进设备,它能够通过深度学习算法对监控画面进行实时分析,自动识别和分析监控画面中的各种行为动作。这种摄像机在安防领域有着广泛的应用,可以帮助监控人员及时发现异常行为,并采取相应的措施。 行为智能识别摄像机可以有效预防盗窃事件。在商场、超市等公共场所安装这种摄像机,可以通过识别异常行为等情况,及时报警并阻止不安全行为的发生

flutter开发实战-flutter build web微信无法识别二维码及小程序码问题

flutter开发实战-flutter build web微信无法识别二维码及小程序码问题 GitHub Pages是一个直接从GitHub存储库托管的静态站点服务,‌它允许用户通过简单的配置,‌将个人的代码项目转化为一个可以在线访问的网站。‌这里使用flutter build web来构建web发布到GitHub Pages。 最近通过flutter build web,通过发布到GitHu

T1打卡——mnist手写数字识别

🍨 本文为🔗365天深度学习训练营中的学习记录博客🍖 原作者:K同学啊 1.定义GPU import tensorflow as tfgpus=tf.config.list_physical_devices("GPU")if gpus:gpu0=gpus[0]tf.config.experimental.set_memort_groth(gpu0,True) #设置GPU现存用量按需