- 
                Notifications
    You must be signed in to change notification settings 
- Fork 31
Add gnomad_gks() and get_gks() for extracting gks information for a specified variant #539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Added variant_report() code to produce a JSON with VRS information for the variant in question. NOTE THAT 1) I have not been able to get with open() to work for gs:// paths and there are still some left-behinds from when I was running this locally. There are some pretty obvious changes I still need to make AND we need to discuss what we actually want this to return. Do we want it to write out the json or return it as a variable or? Much to discuss, but I just wanted to get the code I do have up in the correct place !
opening with Hadoop Open , adding Interval to dict path to make sure everything actually works ,
        
          
                gnomad/utils/annotations.py
              
                Outdated
          
        
      |  | ||
|  | ||
| def variant_report( | ||
| table_to_parse: hl.Table, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should discuss in group: it's nice to pass in a Table because it's more generalizable, although am not sure if the desired ask was to make it gnomad-specific (in which case we would want to pass in a version number for gnomad and then just load the ht for that version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see KC's comment in slack about adding an example
        
          
                gnomad/utils/annotations.py
              
                Outdated
          
        
      | ) | ||
|  | ||
| with hl.utils.hadoop_open( | ||
| "gs://gnomad-vrs-io-finals/chromosome_seqID_dictionary.json", "r" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gs://gnomad-vrs-io-finals is a private bucket, so won't work for many users, you can put this dictionary into a constant named VRS_CHROMS at the top of the script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also what is the original source of the seqID dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the GA4GH Team didn't supply it anywhere I could find it, so I constructed it myself with the Seqeucne_IDs for each of the chromosomes from a .PKL file in a truth set that had sent a while back. I can send the code if you would like
Some changes to docstring, some clarity and formatting fixes - more substantial changes on the way Co-authored-by: klaricch <[email protected]>
Changed fxn name to 'get_variant_json' , json import added to start of script , VRS_Chrom_IDs added to start of file as well, now constructing the JSOn inside of the function , am now returning the JSON/dictionary , and changed some errors when I was writing to the dictionary that was resulting in some structure being overwritten
changed some variable and the function name, fully incorporated the annotations into the VRS dictionary itself, adding the get_reference_genome function fully, and am now returning a string via json.dumps() instead of a dictionary
| 1) A reference table to parse containing the variant of interest and its VRS Annotations | ||
| 2) A string of the Chr, Pos, Ref, and Alt of a Variant | ||
| 3) A GA4GH-VRS JSON Template | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still needs to be addressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still need param descriptions
Changed import order, updated Doc String, added VRS attribute parsing to before the dictionary is first defined, and set up the control structure for (but not implementation of) GRCh37 SeqIDs and the Population information as well
add fxn generate_pop_json() and incorporate it into get_gks()
| I like leaving get_vrs as a more generic function, and then building up a new function for the va schema and allowing users to specify the pops they want and if they want it split by sex (and that way if they only want one pop they don't end up with a giant json). Here's some pseudocode to show that suggestion: def get_gnomad_gks(version: str, variant:str, ancestry_groups: list=None, by_sex: bool=True):  | 
        
          
                gnomad/utils/annotations.py
              
                Outdated
          
        
      | } | ||
|  | ||
| VRS_CHROM_IDS = { | ||
| VRS_CHROM_GRCH38_IDS = { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a) do you have the ids for GRCH37?
b)you can change "GRCH38" -> "GRCh38"
and then create a nested dictionary for each build, ex:
VRS_CHROM_IDS = {'GRCh38': {
"chr1": "ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
"chr2": "ga4gh:SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g".....}}
so that downstream you can just do:
VRS_CHROM_IDS[build][chr_in]
commit docstring suggestions and some fixes, more on the way ! Co-authored-by: klaricch <[email protected]>
updates: mostly completed main JSON with information from HT, nested XX and XY inside of ancestry, and got the requested reformatting of how args are added in corrected. TODO: incorporate meanDepth , add more documentation , and I believe incorporate one additional line from the sent pseudocode I didn't understand in
include coverage stats
        
          
                gnomad/utils/annotations.py
              
                Outdated
          
        
      |  | ||
|  | ||
| def get_gks( | ||
| def get_gnomad_gks( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can change this back to get_gks
def get_gks(
ht: hl.Table,
variant: str,
vrs_only: bool = False,
ancestry_groups: list = None,
ancestry_group_names: dict = None,
by_sex: bool = False,
coverage_ht: hl.Table = None,
) -> dict:
"""
:param vrs_only: Whether or not to return only the VRS schema for the specified variant. Default is False.
:param ancestry_group_names: Dictionary containing mapping for ancestry group names, where keys are the shortened names for the ancestry groups, and values are the full label names for that group.
:param coverage_ht: Table containing coverage annotations, with mean coverage for a locus stored in 'mean' annotation.
"""
then create a gnomad-specific function (this will eventually be put in a different file but need to discuss that)
def get_gnomad_gks(version: str, variant:str, by_pops: bool=True, by_sex: bool=True):
# Based on version, load in proper gnomad ht, coverage ht
# Warn if by_sex but not ancestry groups (TODO: ask if they ever want by_sex but not by_pops)
# Call get_gks(fill in params here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this scheme is followed now, with a gnomad_gks() function calling get_gks(). Any thoughts on putting gnomad_gks() in a different file?
commit suggestions suggestions from code review - more to come tonight ! Co-authored-by: klaricch <[email protected]>
split out separate get_gnomad_gks and added the GRCh37 IDs , more to come by Monday morning <3
function to remove items from freq and freq_meta
Co-authored-by: Qin He <[email protected]>
…ore_than_2 Fix `merge_freq_arrays` for cases with more than two arrays
…pe struct{r: float64} is not a subset of struct{s: str, global_idx: int64, pop_idx: int64}
    fix negative values issue with 'diff'
fix ValueError for count_arrays in merge_freq_arrays function
…_fix Fix `annotate_downsamplings` when used with a Table instead of a MatrixTable
Co-authored-by: jkgoodrich <[email protected]>
…mple_too Add optional filter to freq meta sample counts
…te/gnomad_methods into variant_report_code
| Hey all. Just checking on the status of this PR. What is still needed? | 
| Hey Alex, a couple notes on some very recent development: Admin/Git : When rebasing this PR and the PRs branched off of it, I encountered a weird problem with GitHub Desktop and VSCode's version control that wound up confusing/mangling the Git History for the PR. The Variant Report Code functions weren't touched, but all of the changes from  Copy of: Add gnomad_gks() and get_gks() for extracting gks information for a specified variant Progress : My own PR off of this branch, linked here which refactors the code and makes it smoother/more comprehensible on our end, is only maybe one review away from merging. Let me know if you have any comments or questions or concerns about this update! | 
| Closing because copied to new PR: #596 | 
Added variant_report() code to produce a JSON with VRS information for the variant in question. NOTE THAT 1) I have not been able to get with open() to work for gs:// paths and there are still some left-behinds from when I was running this locally.
There are some pretty obvious changes I still need to make AND we need to discuss what we actually want this to return. Do we want it to write out the json or return it as a variable or?
Much to discuss, but I just wanted to get the code I do have up in the correct place !