SnpEff

SnpEff is written in Java. Here it is run on a VCF file whose INFO column is still empty (test.chr22.vcf).

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO
22  17071756  .  T  C  .  .  .
22  17072035  .  C  T  .  .  .
22  17072258  .  C  A  .  .  .
22  17072674  .  G  A  .  .  .
22  17072747  .  T  C  .  .  .
22  17072781  .  C  T  .  .  .
22  17073043  .  C  T  .  .  .
22  17073066  .  A  G  .  .  .
22  17073119  .  C  T  .  .  .

The file is annotated with the following command:

java -Xmx4g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf
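
If the annotation step has to be driven from a script, the same call can be launched from Python. This is only a minimal sketch, assuming `java` is on the PATH and `snpEff.jar` is in the current directory; `build_snpeff_cmd` and `run_snpeff` are made-up helper names, not part of SnpEff. Because the command above relies on shell redirection, the sketch opens the output file itself and passes it as standard output.

```python
import subprocess

def build_snpeff_cmd(genome, in_vcf, jar="snpEff.jar", max_heap="4g"):
    """Return the argument list for the snpEff call shown above (hypothetical helper)."""
    return ["java", f"-Xmx{max_heap}", "-jar", jar, genome, in_vcf]

def run_snpeff(genome, in_vcf, out_vcf, **kwargs):
    """Run snpEff and write its standard output (the annotated VCF) to out_vcf."""
    cmd = build_snpeff_cmd(genome, in_vcf, **kwargs)
    with open(out_vcf, "w") as out:
        # check=True raises CalledProcessError if java exits with an error.
        subprocess.run(cmd, stdout=out, check=True)

# Only print the command here; nothing is executed.
print(" ".join(build_snpeff_cmd("GRCh37.75", "examples/test.chr22.vcf")))
# To actually run it:
# run_snpeff("GRCh37.75", "examples/test.chr22.vcf", "test.chr22.ann.vcf")
```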

The annotated output (test.chr22.ann.vcf) looks like this:

##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"
##SnpEffCmd="SnpEff  GRCh37.75 examples/test.chr22.vcf "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO
22  17071756  .  T  C  .  .  ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*4223A>G|||||4223|
22  17072035  .  C  T  .  .  ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*3944G>A|||||3944|
22  17072258  .  C  A  .  .  ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*3721G>T|||||3721|
22  17072674  .  G  A  .  .  ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*3305C>T|||||3305|
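
The `##INFO=<ID=ANN,...>` header line lists the ANN sub-field names in order, so a script can read them from the file instead of hard-coding them. A minimal sketch, assuming the header keeps the single-quoted, pipe-separated layout shown above; `ann_field_names` is a hypothetical helper name.

```python
import re

# The ANN header line, copied from the output above.
ANN_HEADER = (
    "##INFO=<ID=ANN,Number=.,Type=String,Description=\"Functional annotations: "
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | "
    "Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | "
    "CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' \">"
)

def ann_field_names(header_line):
    """Extract the pipe-separated field names found between the single quotes."""
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted field list found in header line")
    return [name.strip() for name in match.group(1).split("|")]

names = ann_field_names(ANN_HEADER)
print(len(names))
for name in names:
    print(name)
```

The names contain " / " but never "|", so splitting on the pipe character is enough for this header.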

Annotation 1

Allele C
Annotation 3_prime_UTR_variant
Annotation_Impact MODIFIER
Gene_Name CCT8L2
Gene_ID ENSG00000198445
Feature_Type transcript
Feature_ID ENST00000359963
Transcript_BioType protein_coding
Rank 1/1
HGVS.c c.*11A>G
HGVS.p
cDNA.pos/cDNA.length
CDS.pos/CDS.length
AA.pos/AA.length
Distance 11
ERRORS/WARNINGS/INFO

Annotation 2

Allele C
Annotation downstream_gene_variant
Annotation_Impact MODIFIER
Gene_Name FABP5P11
Gene_ID ENSG00000240122
Feature_Type transcript
Feature_ID ENST00000430910
Transcript_BioType processed_pseudogene
Rank
HGVS.c n.*4223A>G
HGVS.p
cDNA.pos/cDNA.length
CDS.pos/CDS.length
AA.pos/AA.length
Distance 4223
ERRORS/WARNINGS/INFO
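
The manual breakdown above can be reproduced in a few lines of code: annotations inside ANN are separated by commas, and the fields inside each annotation by `|`. A minimal sketch under those assumptions, with the field names hard-coded from the header line; `parse_ann` is an illustrative name, not a SnpEff function.

```python
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos / cDNA.length", "CDS.pos / CDS.length",
    "AA.pos / AA.length", "Distance", "ERRORS / WARNINGS / INFO",
]

def parse_ann(info):
    """Return one dict per annotation found in a VCF INFO string."""
    for entry in info.split(";"):          # INFO entries are ';'-separated
        if entry.startswith("ANN="):
            ann_value = entry[len("ANN="):]
            break
    else:
        return []                          # no ANN entry (e.g. INFO is '.')
    return [dict(zip(ANN_FIELDS, ann.split("|"))) for ann in ann_value.split(",")]

# INFO column of the first variant (22:17071756 T>C) in the output above.
info = ("ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|"
        "ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,"
        "C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|"
        "ENST00000430910|processed_pseudogene||n.*4223A>G|||||4223|")

for i, ann in enumerate(parse_ann(info), start=1):
    print(f"Annotation {i}")
    for name, value in ann.items():
        print(f"  {name}\t{value}")
```

Empty fields come back as empty strings, which matches the blank rows in the breakdown.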

The ANN fields are defined as follows:

  1. Allele (or ALT): In case of multiple ALT fields, this helps to identify which ALT we are referring to.
  2. Annotation (a.k.a. effect): Annotated using Sequence Ontology terms. Multiple effects can be concatenated using ‘&’.
  3. Putative_impact: A simple estimation of putative impact / deleteriousness : {HIGH, MODERATE, LOW, MODIFIER}
  4. Gene Name: Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”.
  5. Gene ID: Gene ID
  6. Feature type: Which type of feature is in the next field (e.g. transcript, motif, miRNA, etc.). It is preferred to use Sequence Ontology (SO) terms, but ‘custom’ (user defined) terms are allowed, e.g.: ANN=A|stop_gained|HIGH|||transcript|… Tissue-specific features may include cell type / tissue information separated by semicolon, e.g.: ANN=A|histone_binding_site|LOW|||H3K4me3:HeLa-S3|…
  7. Feature ID: Depending on the annotation, this may be: Transcript ID (preferably using version number), Motif ID, miRNA, ChipSeq peak, Histone mark, etc. Note: Some features may not have ID (e.g. histone marks from custom Chip-Seq experiments may not have a unique ID).
  8. Transcript biotype: The bare minimum is at least a description on whether the transcript is {“Coding”, “Noncoding”}. Whenever possible, use ENSEMBL biotypes.
  9. Rank / total: Exon or Intron rank / total number of exons or introns.
  10. HGVS.c: Variant using HGVS notation (DNA level)
  11. HGVS.p: If variant is coding, this field describes the variant using HGVS notation (Protein level). Since transcript ID is already mentioned in ‘feature ID’, it may be omitted here.
  12. cDNA_position / cDNA_len: Position in cDNA and transcript’s cDNA length (one based).
  13. CDS_position / CDS_len: Position and number of coding bases (one based, includes START and STOP codons).
  14. Protein_position / Protein_len: Position and number of AA (one based, including START, but not STOP).
  15. Distance to feature: All items in this field are optional, so the field could be empty. Up/Downstream: distance to first / last codon; Intergenic: distance to closest gene; distance to closest intron boundary in exon (+/- up/downstream, positive if same); distance to closest exon boundary in intron (+/- up/downstream); distance to first base in MOTIF; distance to first base in miRNA; distance to exon-intron boundary in splice_site or splice_region; ChipSeq peak: distance to summit (or peak center); Histone mark / Histone state: distance to summit (or peak center).
  16. Errors, Warnings or Information messages: Add errors, warnings or informative messages that can affect annotation accuracy. They can be added using either ‘codes’ (as shown in column 1, e.g. W1) or ‘message types’ (as shown in column 2, e.g. WARNING_REF_DOES_NOT_MATCH_GENOME). All these errors, warnings or information messages are optional.
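
Some of the definitions above can be checked with a few lines of code: effects joined by ‘&’ can be split, the impact categories can be ranked, and the `position / length` fields follow simple codon arithmetic. The sketch below is illustrative only. The helper names are hypothetical, the severity order HIGH > MODERATE > LOW > MODIFIER is an assumption read off the order in item 3, and the `&` input is a made-up example that does not occur in the output. The position values come from the second variant in the output (c.1406G>A, p.Gly469Glu).

```python
import math

# Assumed severity order, most severe first.
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]

def split_effects(annotation):
    """Split an Annotation field whose effects are concatenated with '&'."""
    return annotation.split("&")

def most_severe(annotations):
    """Pick the (effect, impact) pair whose impact ranks highest."""
    return min(annotations, key=lambda a: IMPACT_ORDER.index(a[1]))

def parse_pos_len(field):
    """Turn 'pos/length' into a pair of ints, or None for an empty field."""
    if not field:
        return None
    pos, length = field.split("/")
    return int(pos), int(length)

# Hypothetical combined effect, for illustration only.
print(split_effects("missense_variant&splice_region_variant"))

# The two annotations of variant 22:17072035 C>T.
anns = [("missense_variant", "MODERATE"), ("downstream_gene_variant", "MODIFIER")]
print(most_severe(anns))

cds_pos, cds_len = parse_pos_len("1406/1674")
aa_pos, aa_len = parse_pos_len("469/557")
# CDS positions are one-based, so base 1406 falls in codon ceil(1406 / 3).
print(math.ceil(cds_pos / 3) == aa_pos)
# The CDS length includes the STOP codon, the protein length does not.
print(cds_len // 3 - 1 == aa_len)
```

Both consistency checks print `True` for this variant: base 1406 lies in codon 469, and 1674 coding bases give 558 codons, of which 557 are amino acids.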
