
Outputs of PrecisionProDB

Xiaolong Cao edited this page Feb 26, 2021 · 4 revisions

PrecisionProDB generates different output files depending on the input settings.

Ensembl, GENCODE, RefSeq, GTF gene annotations

Variant file in text format

cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m gnomAD.variant.txt.gz -g ./GENCODE/GENCODE.genome.fa.gz -p ./GENCODE/GENCODE.protein.fa.gz -f ./GENCODE/GENCODE.gtf.gz  -o text_variant

Three files will be generated in the examples folder.

  • text_variant.pergeno.aa_mutations.csv: amino acid change annotations. See PREFIX.pergeno.aa_mutations.csv for more details.
  • text_variant.pergeno.protein_all.fa: all proteins after incorporating the variants. All unchanged proteins will be preserved.
  • text_variant.pergeno.protein_changed.fa: all proteins which are different from the input protein sequences after incorporating the variants.

Note:

  • Protein names and descriptions in the FASTA files are the same as in the input protein file, with a Tab character (\t) followed by changed or unchanged appended to indicate whether the protein sequence was altered.
  • e.g., ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255 unchanged, ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144 changed.
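
The changed/unchanged flag can be read back from the headers with a few lines of Python. This is a minimal sketch and not part of PrecisionProDB: the function name `count_status` is hypothetical, the sequences in the sample are placeholders, and it assumes the flag is the text after the last Tab in each header line, as described in the note above.

```python
from collections import Counter


def count_status(fasta_lines):
    """Count the status flags found in FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to count
        # Split on the last Tab: description on the left, flag on the right.
        _description, tab, status = line[1:].rstrip("\r\n").rpartition("\t")
        if tab:  # ignore headers that carry no Tab-separated flag
            counts[status] += 1
    return counts


# Headers taken from the example above; the sequences are placeholders.
sample = [
    ">ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|"
    "OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255"
    "\tunchanged\n",
    "MPLACEHOLDER\n",
    ">ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|"
    "OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144"
    "\tchanged\n",
    "MPLACEHOLDER\n",
]
status_counts = count_status(sample)
print(dict(status_counts))
```

For a real output file, an open text file handle can be passed in place of `sample`, since iterating over a handle yields lines.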

Variant file in VCF format

cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m celline.vcf.gz -g ./GENCODE/GENCODE.genome.fa.gz -p ./GENCODE/GENCODE.protein.fa.gz -f ./GENCODE/GENCODE.gtf.gz -o vcf_variant

Five files will be generated in the examples folder.

  • vcf_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
  • vcf_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
  • vcf_variant.pergeno.protein_changed.fa: all proteins which are different from the input protein sequences after incorporating the variants.
  • vcf_variant.vcf2mutation_1.tsv: variant file extracted from the VCF file in text format, the first alternative alleles.
  • vcf_variant.vcf2mutation_2.tsv: variant file extracted from the VCF file in text format, the second alternative alleles.
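
The extracted variant files can be inspected with the standard `csv` module. This is a minimal sketch, assuming the Tab-separated chr/pos/ref/alt layout shown in the note further down. The function names and the SNV/insertion/deletion labels are illustrative choices made here, not terms used by PrecisionProDB.

```python
import csv
import io


def classify(ref, alt):
    """Label a variant by comparing the lengths of its ref and alt alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) > len(alt):
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "substitution"  # equal lengths greater than one


def read_variants(handle):
    """Yield (chr, pos, ref, alt, label) tuples from a Tab-separated table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield (row["chr"], int(row["pos"]), row["ref"], row["alt"],
               classify(row["ref"], row["alt"]))


# First rows of the example table on this page, with Tab separators.
sample = (
    "chr\tpos\tref\talt\n"
    "chr1\t52238\tT\tG\n"
    "chr1\t53138\tTAA\tT\n"
    "chr1\t55249\tC\tCTATGG\n"
)
variants = list(read_variants(io.StringIO(sample)))
for variant in variants:
    print(variant)
```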

Note:

  • For altered proteins, __1, __2, __12 will be added to the ID of the protein.
    • __1 and __2 mean that the alleles of the protein are from the first and second variant file, respectively.
    • __12 means that the first and second alleles altering the protein sequence are the same.
    • e.g., >ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12 changed, ENSP00000263934.6|ENST00000263934.10|ENSG00000054523.18|OTTHUMG00000001817|OTTHUMT00000005103.1|KIF1B-201|KIF1B|1770__2 changed, ENSP00000332771.4|ENST00000331433.5|ENSG00000186510.12|OTTHUMG00000009529|OTTHUMT00000026326.1|CLCNKA-201|CLCNKA|687__1 changed, ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326 unchanged.
  • The variant file looks like:
    chr     pos     ref     alt
    chr1    52238   T       G
    chr1    53138   TAA     T
    chr1    55249   C       CTATGG
    chr1    55299   C       T
    chr1    61442   A       G
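
The allele suffixes described in the note above can be split off a protein ID with a regular expression. This is a minimal sketch: `split_allele_suffix` is a hypothetical helper, and it assumes the suffix sits at the very end of the description, directly before the Tab-separated changed/unchanged flag, as in the examples.

```python
import re

# Matches __1, __2 or __12 at the end of the protein description.
ALLELE_SUFFIX = re.compile(r"__(12|1|2)$")


def split_allele_suffix(header):
    """Return (protein_id, allele), where allele is '1', '2', '12' or None."""
    # Drop a leading '>' and anything after the first Tab (the status flag).
    name = header.lstrip(">").split("\t", 1)[0].strip()
    match = ALLELE_SUFFIX.search(name)
    if match is None:
        return name, None  # unchanged proteins carry no suffix
    return name[: match.start()], match.group(1)


# Headers taken from the examples in the note above.
headers = [
    ">ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|"
    "OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12\tchanged",
    "ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|"
    "OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326\tunchanged",
]
for header in headers:
    print(split_allele_suffix(header))
```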
    

UniProt models

The outputs are the same as above, and three additional files will be generated.

When running PrecisionProDB for UniProt proteins, Ensembl models are used first to generate the changed proteins, and UniProt proteins are then linked to Ensembl proteins. UniProt proteins without an identical Ensembl model will not be changed.

  • PREFIX.uniprot_all.fa: all UniProt proteins after incorporating the variants.
  • PREFIX.uniprot_changed.fa: all UniProt proteins which are different from the input protein sequences after incorporating the variants.
  • PREFIX.uniprot_changed.tsv: links between each UniProt ID (uniprot_id) and the corresponding Ensembl protein ID (ref_id). It looks like:
    uniprot_id      ref_id
    tr|A0A075B6H5|A0A075B6H5_HUMAN  ENSP00000368747.3
    sp|A0A075B6H8|KVD42_HUMAN       ENSP00000374813.3
    sp|A0A075B6I1|LV460_HUMAN       ENSP00000374819.2
    sp|A0A075B6I3|LVK55_HUMAN       ENSP00000374821.3
    sp|A0A075B6I4|LVX54_HUMAN       ENSP00000374822.2
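
The link table can be loaded into a dictionary for lookups. This is a minimal sketch, assuming the file is Tab-separated with the `uniprot_id` and `ref_id` columns shown above. The helper names are hypothetical, and the accession is taken as the second `|`-separated field of the UniProt identifier.

```python
import csv
import io


def read_links(handle):
    """Map each uniprot_id to its ref_id from a Tab-separated link table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["uniprot_id"]: row["ref_id"] for row in reader}


def accession(uniprot_id):
    """Extract the accession from an ID such as sp|A0A075B6H8|KVD42_HUMAN."""
    return uniprot_id.split("|")[1]


# Two rows copied from the example table on this page.
sample = (
    "uniprot_id\tref_id\n"
    "tr|A0A075B6H5|A0A075B6H5_HUMAN\tENSP00000368747.3\n"
    "sp|A0A075B6H8|KVD42_HUMAN\tENSP00000374813.3\n"
)
links = read_links(io.StringIO(sample))
by_accession = {accession(uid): ref for uid, ref in links.items()}
print(by_accession)
```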
    

--PEFF enabled

If --PEFF is enabled when running PrecisionProDB, a PEFF file will be generated, with only \VariantSimple annotations.

The \VariantSimple annotations are extracted from column variant_AA of the 'PREFIX.pergeno.aa_mutations.csv' file.

non-UniProt datatype

  • PREFIX.pergeno.protein_PEFF.fa: resulting sequences in PEFF format.

UniProt datatype

  • PREFIX.pergeno.protein_PEFF.fa: resulting sequences in PEFF format from Ensembl models.
  • PREFIX.uniprot_PEFF.fa: resulting sequences in PEFF format from UniProt models. Please note that, for the UniProt datatype, the gene models used for running PrecisionProDB are Ensembl models. Links between Ensembl and UniProt sequences are then used to extract the changed sequences.
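
The file lists on this page can be summarised as a small lookup, which may help when checking that a run produced everything expected. This is a sketch only: `expected_outputs` and its keyword arguments are hypothetical names, not PrecisionProDB options. Only the file name patterns come from this page, and the sketch assumes the additional files simply add up when several settings are combined.

```python
def expected_outputs(prefix, vcf=False, uniprot=False, peff=False):
    """List the output file names described on this page for a given run."""
    # Always produced: mutation annotations and the two protein FASTA files.
    files = [
        f"{prefix}.pergeno.aa_mutations.csv",
        f"{prefix}.pergeno.protein_all.fa",
        f"{prefix}.pergeno.protein_changed.fa",
    ]
    if vcf:  # variant input in VCF format
        files += [
            f"{prefix}.vcf2mutation_1.tsv",
            f"{prefix}.vcf2mutation_2.tsv",
        ]
    if uniprot:  # UniProt datatype
        files += [
            f"{prefix}.uniprot_all.fa",
            f"{prefix}.uniprot_changed.fa",
            f"{prefix}.uniprot_changed.tsv",
        ]
    if peff:  # --PEFF enabled
        files.append(f"{prefix}.pergeno.protein_PEFF.fa")
        if uniprot:
            files.append(f"{prefix}.uniprot_PEFF.fa")
    return files


for name in expected_outputs("vcf_variant", vcf=True):
    print(name)
```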
