Outputs of PrecisionProDB
PrecisionProDB generates different output files depending on the input settings.
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m gnomAD.variant.txt.gz -g ./GENCODE/GENCODE.genome.fa.gz -p ./GENCODE/GENCODE.protein.fa.gz -f ./GENCODE/GENCODE.gtf.gz -o text_variant

Three files will be generated in the examples folder.
- text_variant.pergeno.aa_mutations.csv: amino acid change annotations. See PREFIX.pergeno.aa_mutations.csv for more details.
- text_variant.pergeno.protein_all.fa: all proteins after incorporating the variants. Unchanged proteins are preserved.
- text_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
Note:
- Protein names and descriptions in the FASTA files are the same as in the input protein file, with a tab character (\t) followed by changed or unchanged appended to indicate whether the protein sequence is altered.
  - e.g.,
    ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255 unchanged
    ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144 changed
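The header convention in the note above can be read back with a few lines of Python. This is a minimal sketch, assuming the status is the last tab-separated field of each header line; the function names `split_status` and `count_status` are hypothetical and not part of PrecisionProDB.

```python
def split_status(header):
    """Split a FASTA header into (description, status).

    Assumes the status ("changed" or "unchanged") follows the last tab.
    Returns (header, None) when no such status is present.
    """
    header = header.rstrip("\n").lstrip(">")
    description, sep, status = header.rpartition("\t")
    if not sep or status not in ("changed", "unchanged"):
        return header, None
    return description, status


def count_status(lines):
    """Count changed and unchanged proteins among FASTA header lines."""
    counts = {"changed": 0, "unchanged": 0}
    for line in lines:
        if line.startswith(">"):
            _, status = split_status(line)
            if status in counts:
                counts[status] += 1
    return counts
```

Passing an open file handle for a protein_all.fa file to `count_status` would give a quick tally of how many proteins were altered.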
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m celline.vcf.gz -g ./GENCODE/GENCODE.genome.fa.gz -p ./GENCODE/GENCODE.protein.fa.gz -f ./GENCODE/GENCODE.gtf.gz -o vcf_variant

Five files will be generated in the examples folder.
- vcf_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- vcf_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- vcf_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
- vcf_variant.vcf2mutation_1.tsv: variants extracted from the VCF file in text format, first alternative alleles.
- vcf_variant.vcf2mutation_2.tsv: variants extracted from the VCF file in text format, second alternative alleles.
Note:
- For altered proteins, __1, __2, or __12 will be added to the ID of the protein.
  - __1 and __2 mean that the allele of the protein comes from the first or second variant file, respectively.
  - __12 means that the first and second alleles alter the protein sequence in the same way.
  - e.g.,
    >ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12 changed
    ENSP00000263934.6|ENST00000263934.10|ENSG00000054523.18|OTTHUMG00000001817|OTTHUMT00000005103.1|KIF1B-201|KIF1B|1770__2 changed
    ENSP00000332771.4|ENST00000331433.5|ENSG00000186510.12|OTTHUMG00000009529|OTTHUMT00000026326.1|CLCNKA-201|CLCNKA|687__1 changed
    ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326 unchanged
- The variant file looks like:

    chr     pos     ref     alt
    chr1    52238   T       G
    chr1    53138   TAA     T
    chr1    55249   C       CTATGG
    chr1    55299   C       T
    chr1    61442   A       G
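The allele suffixes described in the note above can be separated from the protein ID with a small helper. This is a minimal sketch, assuming the suffix sits at the very end of the description part of the header (before the tab and the changed/unchanged status); `split_allele_suffix` and `ALLELES` are hypothetical names, not part of PrecisionProDB.

```python
import re

# "__12" is listed first so it is not mistaken for "__1".
_SUFFIX = re.compile(r"__(12|1|2)$")

# Which variant files (1 = first allele, 2 = second allele) a suffix refers to.
ALLELES = {"1": (1,), "2": (2,), "12": (1, 2)}


def split_allele_suffix(description):
    """Return (original_id, alleles) for a header description.

    `description` is the header text without the leading ">" and without
    the trailing tab-separated status. Unsuffixed IDs yield an empty tuple.
    """
    match = _SUFFIX.search(description)
    if match is None:
        return description, ()
    return description[:match.start()], ALLELES[match.group(1)]
```

For example, a description ending in `CLCN6|847__12` yields the original ID ending in `CLCN6|847` together with `(1, 2)`, while an unchanged protein such as the OR4F5 entry yields an empty tuple.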
The output is the same as above, and three additional files will be generated.
When running PrecisionProDB for UniProt proteins, Ensembl gene models are used first to generate the changed proteins, and the UniProt proteins are then linked to the Ensembl proteins. UniProt proteins without an identical Ensembl model will not be changed.
- PREFIX.uniprot_all.fa: all UniProt proteins after incorporating the variants.
- PREFIX.uniprot_changed.fa: all UniProt proteins that differ from the input protein sequences after incorporating the variants.
- PREFIX.uniprot_changed.tsv: links between UniProt_ID and other protein_id. It looks like:

    uniprot_id                      ref_id
    tr|A0A075B6H5|A0A075B6H5_HUMAN  ENSP00000368747.3
    sp|A0A075B6H8|KVD42_HUMAN       ENSP00000374813.3
    sp|A0A075B6I1|LV460_HUMAN       ENSP00000374819.2
    sp|A0A075B6I3|LVK55_HUMAN       ENSP00000374821.3
    sp|A0A075B6I4|LVX54_HUMAN       ENSP00000374822.2
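The link table can be loaded into a dictionary for lookups. This is a minimal sketch, assuming the file is tab-separated with the header row uniprot_id and ref_id as shown above, and that each UniProt ID appears once; `load_uniprot_links` is a hypothetical helper name.

```python
import csv


def load_uniprot_links(handle):
    """Map each uniprot_id to its ref_id from an open TSV file handle.

    Assumes a tab-delimited file whose first row is the column header.
    If a uniprot_id were repeated, the last row would win.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["uniprot_id"]: row["ref_id"] for row in reader}
```

A typical call would open PREFIX.uniprot_changed.tsv with `open(path, newline="")` and pass the handle to the function.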
If --PEFF is enabled when running PrecisionProDB, a PEFF file will be generated, containing only \VariantSimple annotations.
The \VariantSimple annotations are extracted from column variant_AA of the 'PREFIX.pergeno.aa_mutations.csv' file.
- PREFIX.pergeno.protein_PEFF.fa: resulting sequences in PEFF format.

For the UniProt datatype:
- PREFIX.pergeno.protein_PEFF.fa: resulting sequences in PEFF format from Ensembl models.
- PREFIX.uniprot_PEFF.fa: resulting sequences in PEFF format from UniProt models.

Please note that for the UniProt datatype, the gene models used for running PrecisionProDB are from Ensembl. Links between Ensembl and UniProt sequences are then made to extract the changed sequences.
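The \VariantSimple annotations in a PEFF header can be pulled out with a regular expression. This is a rough sketch, assuming the annotations follow the PEFF 1.0 form `\VariantSimple=(position|residue)(position|residue)...`, optionally with a third tag field; the exact headers written by PrecisionProDB may differ, and `parse_variant_simple` is a hypothetical helper name.

```python
import re

# Captures the run of parenthesised entries after "\VariantSimple=".
_VARIANT_SIMPLE = re.compile(r"\\VariantSimple=((?:\([^)]*\))+)")
_ENTRY = re.compile(r"\(([^)]*)\)")


def parse_variant_simple(header):
    """Return a list of (position, residue) tuples from a PEFF header line.

    Assumes each entry looks like "(position|residue)" or
    "(position|residue|tag)"; any tag field is ignored.
    Returns an empty list when the header has no \\VariantSimple key.
    """
    match = _VARIANT_SIMPLE.search(header)
    if match is None:
        return []
    variants = []
    for entry in _ENTRY.findall(match.group(1)):
        fields = entry.split("|")
        variants.append((int(fields[0]), fields[1]))
    return variants
```

The positions are returned exactly as written in the header, so whether they are 1-based should be checked against the PEFF file itself.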