SnpEff
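SnpEff is written in Java. Here it is used to process a VCF file that carries no annotations yet (test.chr22.vcf): the ID, QUAL, FILTER, and INFO columns are all empty (".").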
#CHROM POS ID REF ALT QUAL FILTER INFO
22 17071756 . T C . . .
22 17072035 . C T . . .
22 17072258 . C A . . .
22 17072674 . G A . . .
22 17072747 . T C . . .
22 17072781 . C T . . .
22 17073043 . C T . . .
22 17073066 . A G . . .
22 17073119 . C T . . .
Annotating it with the following command:
java -Xmx4g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
produces the following output. An ANN entry is added to the INFO column of every record:
##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"
##SnpEffCmd="SnpEff GRCh37.75 examples/test.chr22.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
#CHROM POS ID REF ALT QUAL FILTER INFO
22 17071756 . T C . . ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*4223A>G|||||4223|
22 17072035 . C T . . ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*3944G>A|||||3944|
22 17072258 . C A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*3721G>T|||||3721|
22 17072674 . G A . . ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*3305C>T|||||3305|
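The sub-field names of ANN are listed in the ##INFO=<ID=ANN,...> header line of the output. As a minimal sketch (the function name ann_field_names is made up for this note, and it assumes the field list is the first single-quoted span in the header, as it is in the output above), the names can be pulled out like this in Python:

```python
import re

# Header line copied from the SnpEff output above (the ANN description).
ANN_HEADER = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: '
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | "
    "Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | "
    "cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | "
    "Distance | ERRORS / WARNINGS / INFO' \">"
)


def ann_field_names(header_line):
    """Return the ANN sub-field names listed between the single quotes.

    Hypothetical helper: assumes the field list is the first
    single-quoted span in the header line.
    """
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted field list found in header line")
    return [name.strip() for name in match.group(1).split("|")]


names = ann_field_names(ANN_HEADER)
print(len(names))
for name in names:
    print(name)
```

Reading the names from the header rather than hard-coding them keeps a parser in step with whatever the SnpEff version in use actually writes.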
Annotation 1
Allele | C |
Annotation | 3_prime_UTR_variant |
Annotation_Impact | MODIFIER |
Gene_Name | CCT8L2 |
Gene_ID | ENSG00000198445 |
Feature_Type | transcript |
Feature_ID | ENST00000359963 |
Transcript_BioType | protein_coding |
Rank | 1/1 |
HGVS.c | c.*11A>G |
HGVS.p | |
cDNA.pos/cDNA.length | |
CDS.pos/CDS.length | |
AA.pos/AA.length | |
Distance | 11 |
ERRORS/WARNINGS/INFO |
Annotation 2
Allele | C |
Annotation | downstream_gene_variant |
Annotation_Impact | MODIFIER |
Gene_Name | FABP5P11 |
Gene_ID | ENSG00000240122 |
Feature_Type | transcript |
Feature_ID | ENST00000430910 |
Transcript_BioType | processed_pseudogene |
Rank | |
HGVS.c | n.*4223A>G |
HGVS.p | |
cDNA.pos/cDNA.length | |
CDS.pos/CDS.length | |
AA.pos/AA.length | |
Distance | 4223 |
ERRORS/WARNINGS/INFO |
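The two tables above can be reproduced by splitting the ANN value on "," (one entry per annotation) and then on "|" (one value per sub-field). The Python below is a rough sketch, not SnpEff's own parser: parse_ann and ANN_FIELDS are names invented here, and it assumes that neither "," nor "|" appears inside a sub-field value.

```python
# Sub-field names in the order given by the ANN header line.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def parse_ann(info):
    """Split the ANN entry of a VCF INFO column into one dict per annotation.

    Hypothetical helper; assumes ',' and '|' never occur inside a value.
    """
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            ann_value = entry[len("ANN="):]
            break
    else:
        return []  # no ANN entry in this INFO column
    annotations = []
    for record in ann_value.split(","):
        values = record.split("|")
        annotations.append(dict(zip(ANN_FIELDS, values)))
    return annotations


# INFO column of the first record (22:17071756) from the output above.
INFO = ("ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|"
        "ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,"
        "C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|"
        "ENST00000430910|processed_pseudogene||n.*4223A>G|||||4223|")

anns = parse_ann(INFO)
for i, ann in enumerate(anns, start=1):
    print(f"Annotation {i}")
    for name in ANN_FIELDS:
        print(f"{name} | {ann.get(name, '')} |")
```

Empty sub-fields come back as empty strings, which matches the blank cells in the tables.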
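- Allele (or ALT): In case of multiple ALT fields, this helps to identify which ALT we are referring to. In the output above every record has a single ALT, so this field simply repeats it (e.g. C for the first record).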
- Annotation (a.k.a. effect): Annotated using Sequence Ontology terms. Multiple effects can be concatenated using ‘&’.
- Putative_impact: A simple estimation of putative impact / deleteriousness : {HIGH, MODERATE, LOW, MODIFIER}
- Gene Name: Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”.
- Gene ID: Gene ID
- Feature type: Which type of feature is in the next field (e.g. transcript, motif, miRNA, etc.). It is preferred to use Sequence Ontology (SO) terms, but ‘custom’ (user defined) terms are allowed, e.g.: ANN=A|stop_gained|HIGH|||transcript|… Tissue specific features may include cell type / tissue information separated by semicolon, e.g.: ANN=A|histone_binding_site|LOW|||H3K4me3:HeLa-S3|…
- Feature ID: Depending on the annotation, this may be: Transcript ID (preferably using version number), Motif ID, miRNA, ChipSeq peak, Histone mark, etc. Note: Some features may not have ID (e.g. histone marks from custom Chip-Seq experiments may not have a unique ID).
- Transcript biotype: The bare minimum is at least a description on whether the transcript is {“Coding”, “Noncoding”}. Whenever possible, use ENSEMBL biotypes.
- Rank / total: Exon or Intron rank / total number of exons or introns.
- HGVS.c: Variant using HGVS notation (DNA level)
- HGVS.p: If variant is coding, this field describes the variant using HGVS notation (Protein level). Since transcript ID is already mentioned in ‘feature ID’, it may be omitted here.
- cDNA_position / cDNA_len: Position in cDNA and transcript’s cDNA length (one based).
- CDS_position / CDS_len: Position and number of coding bases (one based, includes START and STOP codons).
- Protein_position / Protein_len: Position and number of AA (one based, including START, but not STOP).
- Distance to feature: All items in this field are optional, so the field could be empty.
  - Up/Downstream: Distance to first / last codon
  - Intergenic: Distance to closest gene
  - Distance to closest intron boundary in exon (+/- up/downstream). If same, use positive number.
  - Distance to closest exon boundary in intron (+/- up/downstream)
  - Distance to first base in MOTIF
  - Distance to first base in miRNA
  - Distance to exon-intron boundary in splice_site or splice_region
  - ChipSeq peak: Distance to summit (or peak center)
  - Histone mark / Histone state: Distance to summit (or peak center)
- Errors, Warnings or Information messages: Add errors, warnings or informative messages that can affect annotation accuracy. They can be added using either ‘codes’ (as shown in column 1, e.g. W1) or ‘message types’ (as shown in column 2, e.g. WARNING_REF_DOES_NOT_MATCH_GENOME). All these errors, warnings or information messages are optional.
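A few of the fields described above need a second split before they are useful: Annotation may hold several effects joined by "&", and the Rank and position fields are "position/total" pairs. The sketch below also picks the most severe annotation of a record, assuming the four impact levels are ranked in the order listed (HIGH, MODERATE, LOW, MODIFIER). All function names are made up for this note, and the "&" example string is illustrative rather than taken from the output.

```python
# Assumed severity ranking: smaller number = more severe.
IMPACT_ORDER = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def split_effects(annotation):
    """Split an Annotation value such as 'a&b' into its individual effects."""
    return annotation.split("&")


def split_pos_len(value):
    """Turn 'position/total' (e.g. '1406/1674') into a pair of ints.

    Returns None for an empty field.
    """
    if not value:
        return None
    pos, total = value.split("/")
    return int(pos), int(total)


def most_severe(annotations):
    """Return the annotation with the most severe impact (unknown levels last)."""
    return min(
        annotations,
        key=lambda a: IMPACT_ORDER.get(a["Annotation_Impact"], len(IMPACT_ORDER)),
    )


# The two annotations of record 22:17072035 from the output above,
# reduced to the sub-fields used here.
anns = [
    {"Annotation": "missense_variant", "Annotation_Impact": "MODERATE",
     "Gene_Name": "CCT8L2", "Rank": "1/1", "CDS.pos/CDS.length": "1406/1674"},
    {"Annotation": "downstream_gene_variant", "Annotation_Impact": "MODIFIER",
     "Gene_Name": "FABP5P11", "Rank": "", "CDS.pos/CDS.length": ""},
]

top = most_severe(anns)
print(top["Gene_Name"], top["Annotation_Impact"])
print(split_pos_len(top["CDS.pos/CDS.length"]))
# Illustrative combined effect, not taken from the output above.
print(split_effects("splice_region_variant&intron_variant"))
```

Keeping the ranking in a dictionary makes it easy to change if a different ordering is wanted, for example to treat MODIFIER as "unknown" rather than "least severe".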