ANNOVAR | Annotation


Introduction

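Reference: http://annovar.openbioinformatics.org/en/latest/
ANNOVAR performs functional annotation of SNVs and CNVs; the web server, wANNOVAR, currently handles SNV annotation only. It offers three main annotation modes:

  • gene-based annotation: determines whether an SNV or CNV changes protein coding or amino acids. Supported gene definition systems include RefSeq, UCSC, ENSEMBL, GENCODE, AceView, etc.
  • region-based annotation: identifies which genomic region a variant falls in, such as predicted transcription factor binding sites, segmental duplication (SD) regions, GWAS hits…
  • filter-based annotation: identifies variants recorded in specific databases, e.g. whether a variant is reported in dbSNP, its frequency in 1KG (1000 Genomes) or ExAC, and its SIFT/PolyPhen/LRT/MutationTaster/MutationAssessor scores…
  • other functions: batch retrieval of nucleotide sequences for specified regions, and retrieval of candidate genes for Mendelian diseases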

Download and installation

tar xvfz annovar.latest.tar.gz
cd annovar

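The ANNOVAR package ships with a few commonly used databases in the humandb/ directory. For any other annotation, download the required database into humandb/ with the -downdb option.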

Main program structure

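ANNOVAR program structure
│ annotate_variation.pl #main program; downloads databases and runs the three annotation types
│ coding_change.pl #infers the resulting protein sequence
│ convert2annovar.pl #converts various formats to .avinput
│ retrieve_seq_from_fasta.pl #used to build transcript sequences for other species
│ table_annovar.pl #annotation wrapper; runs all three annotation types in one pass
│ variants_reduction.pl #for more flexible, customized filter-annotation workflows

├─example #example files

└─humandb #human annotation databases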

Database download

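Annotation depends on databases: without the corresponding annotation file, no annotation can be done (obviously!).
It is best to download the latest annotation database that matches your genome build.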

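perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
#-buildver: genome build
#-webfrom annovar: download from the ANNOVAR repository; if the database is not hosted there, omit this option and it will be downloaded from UCSC
#refGene: database name
#humandb/: destination directory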
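
Since several databases are usually needed, the download command above can be generated in a loop. The following is a minimal sketch, not part of the original post: the helper name build_downdb_cmd is hypothetical, and it assumes that cytoBand is fetched from UCSC (so -webfrom annovar is omitted for it) while the other listed databases are hosted in the ANNOVAR repository. It only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print (do not run) one download command per database.
# Assumption: cytoBand comes from UCSC, so "-webfrom annovar" is omitted for it;
# every other database here is assumed to be hosted in the ANNOVAR repository.
build_downdb_cmd() {
    db="$1"
    case "$db" in
        cytoBand) src="" ;;
        *)        src="-webfrom annovar " ;;
    esac
    printf 'perl annotate_variation.pl -buildver hg19 -downdb %s%s humandb/\n' "$src" "$db"
}

for db in refGene cytoBand exac03 avsnp147 dbnsfp30a; do
    build_downdb_cmd "$db"
done
```

Once the printed commands look right, piping the output into `sh` from the annovar directory would execute them one after another.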

Databases downloaded so far: refGene, ensGene, cytoBand, avsnp138, exac03, 1000g2015aug, clinvar_20170905, dbnsfp30a, avsnp147

  • avsnp138: provides rs IDs
  • dbnsfp30a:whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, DANN, fitCons, PhyloP and SiPhy scores from dbNSFP version 3.0a

Input files

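Two input formats are supported:

  • VCF file: specified with -vcfinput
  • avinput
    Each line represents one variant.
    The first 5 columns are, in order: chromosome, start position, end position, the reference nucleotides, the observed nucleotides
    reference nucleotides: may be set to 0 when unknown
    observed nucleotides: for insertions, deletions and block substitutions, - denotes a null (absent) nucleotide
    Remaining columns: optional; if present, they are copied unchanged to the output file.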
[root@localhost example]# more ex1.avinput
1       948921  948921  T       C       comments: rs15842, a SNP in 5' UTR of ISG15
1       13211293        13211294        TC      -       comments: rs59770105, a 2-bp deletion
1       11403596        11403596        -       AT      comments: rs35561142, a 2-bp insertion
1       105492231       105492231       A       ATAAA   comments: rs10552169, a block substitution
1       67705958        67705958        G       A       comments: rs11209026 (R381Q), a SNP in IL23R associated 
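
The column rules above can be checked before annotation. Below is a small sketch of such a check, assuming a tab-delimited avinput file; the function name check_avinput and the three sample records (two adapted from ex1.avinput, one deliberately broken) are illustrative only and are not part of ANNOVAR.

```shell
#!/bin/sh
# Report lines that do not look like valid avinput records.
# Checks: at least 5 tab-separated columns, numeric start/end, start <= end.
check_avinput() {
    awk -F '\t' '
        NF < 5 { printf "line %d: fewer than 5 columns\n", NR; bad++; next }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { printf "line %d: non-numeric position\n", NR; bad++; next }
        $2 + 0 > $3 + 0 { printf "line %d: start > end\n", NR; bad++; next }
        END { printf "%d problem line(s)\n", bad + 0 }
    ' "$1"
}

# Hypothetical sample: two well-formed records plus one with start > end.
sample=$(mktemp)
printf '1\t948921\t948921\tT\tC\tcomments: rs15842\n'          >  "$sample"
printf '1\t13211293\t13211294\tTC\t-\tcomments: rs59770105\n'  >> "$sample"
printf '1\t200\t100\tA\tG\tbroken: start > end\n'              >> "$sample"
check_avinput "$sample"
rm -f "$sample"
```

For the sample this flags only the third line. Reference and observed alleles are not validated here, because 0 and - are both legal placeholders.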

Annotation

  • All in one step: table_annovar.pl can run all 3 annotation types at once
perl table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt
#-remove: remove all temporary files
#-operation: g, gene-based; gx, gene-based with cross-reference annotation (from -xref argument); r, region-based; f, filter-based.
#-nastring: string to output when no annotation is available, here `.`
#-csvout: write comma-separated output; without it, the default tab-delimited output is used
#-xref: whether a known genetic disease is caused by defects in this gene (this information was supplied in the example/gene_xref.txt file in the command line); this option can be omitted

Here (see the official website for the type each database belongs to):
g, gene-based: databases such as refGene, ensGene
r, region-based: databases such as cytoBand
f, filter-based: databases such as exac03, avsnp147, dbnsfp30a
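
Each entry in -protocol is paired by position with one entry in -operation, so the two lists need the same length. The sketch below checks this before a long run is launched. The helper names count_items and check_pairing are hypothetical, and it only compares counts; it does not verify that each database is given the right operation type.

```shell
#!/bin/sh
# Count comma-separated items in a string.
count_items() {
    printf '%s\n' "$1" | awk -F ',' '{ print NF }'
}

# Succeed only when the two lists contain the same number of items.
check_pairing() {
    [ "$(count_items "$1")" -eq "$(count_items "$2")" ]
}

protocol="refGene,cytoBand,exac03,avsnp147,dbnsfp30a"
operation="gx,r,f,f,f"

if check_pairing "$protocol" "$operation"; then
    echo "OK: $(count_items "$protocol") protocols, $(count_items "$operation") operations"
else
    echo "Mismatch between -protocol and -operation" >&2
fi
```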

  • Running the 3 annotation types separately: annotate_variation.pl
    gene-based
perl annotate_variation.pl -geneanno -dbtype refGene -buildver hg19 example/ex1.avinput humandb/  

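The result files are written to example/: ex1.avinput.variant_function and ex1.avinput.exonic_variant_function
ex1.avinput.variant_function
Column 1: variant effect, i.e. the class of region hit, such as exonic, intronic, intergenic, UTR5 or UTR3 (coding consequences such as nonsynonymous or frameshift are reported in the exonic_variant_function file)
Column 2: gene symbol; the identifier in parentheses (of the form NM_xxxxxx) is the RefSeq transcript accession from refGene
Remaining columns: the contents of the input file ex1.avinput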

[root@localhost example]# head ex1.avinput.variant_function
UTR5    ISG15(NM_005101:c.-33T>C)       1       948921  948921  T       C       comments: rs15842, a SNP in 5' UTR of ISG15
UTR3    ATAD3C(NM_001039211:c.*91G>T)   1       1404001 1404001 G       T       comments: rs149123833, a SNP in 3' UTR of ATAD3C

ex1.avinput.exonic_variant_function
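Column 1: line number of the variant in the input file
Column 2: effect on the coding sequence, e.g. frameshift, nonsynonymous
Column 3: affected gene and transcript(s); NM_xxxxxx is the RefSeq transcript accession from refGene
Remaining columns: same as the input file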

[root@localhost example]# head  ex1.avinput.exonic_variant_function
line9   nonsynonymous SNV       IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1       67705958        67705958        G       A       comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
line10  nonsynonymous SNV       ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A,  2       234183368       234183368       A       G       comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
line11  nonsynonymous SNV       NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W,16       50745926        50745926        C       T       comments: rs2066844 (R702W), a non-synonymous SNP in NOD2

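When processing these files with awk, set the field separator to \t; otherwise spaces are also treated as separators and the columns become misaligned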

[root@localhost example]# head  ex1.avinput.exonic_variant_function|awk -F '\t' '{print $2}'
nonsynonymous SNV
nonsynonymous SNV
nonsynonymous SNV
nonsynonymous SNV
frameshift insertion
frameshift deletion
frameshift deletion
stoploss
stopgain
frameshift substitution

[root@localhost example]# head  ex1.avinput.exonic_variant_function|awk '{print $2}'
nonsynonymous
nonsynonymous
nonsynonymous
nonsynonymous
frameshift
frameshift
frameshift
stoploss
stopgain
frameshift
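
The same tab-separator rule applies when pulling the gene symbol out of column 3. A minimal sketch follows, assuming the gene symbol is the text before the first colon of the transcript list; gene_per_variant is a hypothetical helper name, and the sample record is adapted from the output shown above.

```shell
#!/bin/sh
# Print "line<TAB>effect<TAB>gene" for each record.
# Column 3 is a comma-separated transcript list; the gene symbol is taken
# as the text before the first colon.
gene_per_variant() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { split($3, parts, ":"); print $1, $2, parts[1] }' "$@"
}

printf 'line9\tnonsynonymous SNV\tIL23R:NM_144701:exon9:c.G1142A:p.R381Q,\t1\t67705958\t67705958\tG\tA\n' | gene_per_variant
```

With a real file, `gene_per_variant example/ex1.avinput.exonic_variant_function` reads the file directly instead of standard input.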

region-based

perl annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/ 

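Identifies the cytogenetic band of each variant, e.g. 1p36.33
The result file is written to example/: ex1.avinput.hg19_cytoBand
Column 1: the annotation type, cytoBand
Column 2: the cytogenetic band, e.g. 1p21.1
Remaining columns: same as the input file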

[root@localhost example]# more ex1.avinput.hg19_cytoBand
cytoBand        1p36.33 1       948921  948921  T       C       comments: rs15842, a SNP in 5' UTR of ISG15
cytoBand        1p36.33 1       1404001 1404001 G       T       comments: rs149123833, a SNP in 3' UTR of ATAD3C
cytoBand        1p36.31 1       5935162 5935162 A       T       comments: rs1287637, a splice site variant in NPHP4
cytoBand        1q23.3  1       162736463       162736463       C       T       comments: rs1000050, a SNP in Illumina SNP arrays
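
Because the band sits in column 2, a per-band tally is a one-line awk job. The sketch below assumes tab-delimited output as described above; count_by_band is a hypothetical helper name and the three sample records are shortened versions of the lines shown.

```shell
#!/bin/sh
# Count variants per cytogenetic band (column 2), sorted by band name.
count_by_band() {
    awk -F '\t' '{ n[$2]++ } END { for (b in n) print b "\t" n[b] }' "$@" | LC_ALL=C sort
}

# Sample records adapted from ex1.avinput.hg19_cytoBand.
printf 'cytoBand\t1p36.33\t1\t948921\t948921\tT\tC\ncytoBand\t1p36.33\t1\t1404001\t1404001\tG\tT\ncytoBand\t1q23.3\t1\t162736463\t162736463\tC\tT\n' | count_by_band
```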

filter-based

perl annotate_variation.pl -filter -dbtype exac03 -buildver hg19 example/ex1.avinput humandb/

The result files are written to example/: ex1.avinput.hg19_exac03_filtered (variants not reported in exac03) and ex1.avinput.hg19_exac03_dropped (variants reported in exac03, with their allele frequencies)
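
A quick way to see how a filter run split the input is to count the records in the two output files. This is a rough sketch under the naming shown above (a shared prefix followed by _dropped and _filtered); the helper names count_records and summarize_filter are hypothetical, and a missing file is counted as 0.

```shell
#!/bin/sh
# Count records in a file (0 if the file does not exist).
count_records() {
    if [ -f "$1" ]; then awk 'END { print NR }' "$1"; else echo 0; fi
}

# Summarize one filter run, given the output prefix shared by both files.
summarize_filter() {
    prefix="$1"
    echo "reported in database: $(count_records "${prefix}_dropped")"
    echo "not reported: $(count_records "${prefix}_filtered")"
}

summarize_filter example/ex1.avinput.hg19_exac03
```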

    Original author: fatlady
    Original URL: https://www.jianshu.com/p/7607c894eaae