Add gnomad_gks() and get_gks() for extracting gks information for a specified variant #539

matren395 · 2023-05-08T16:52:38Z

Added variant_report() code to produce a JSON with VRS information for the variant in question. NOTE THAT 1) I have not been able to get with open() to work for gs:// paths and there are still some left-behinds from when I was running this locally.

There are some pretty obvious changes I still need to make AND we need to discuss what we actually want this to return. Do we want it to write out the json or return it as a variable or?

Much to discuss, but I just wanted to get the code I do have up in the correct place !

Added variant_report() code to produce a JSON with VRS information for the variant in question. NOTE THAT 1) I have not been able to get with open() to work for gs:// paths and there are still some left-behinds from when I was running this locally. There are some pretty obvious changes I still need to make AND we need to discuss what we actually want this to return. Do we want it to write out the json or return it as a variable or? Much to discuss, but I just wanted to get the code I do have up in the correct place !

opening with Hadoop Open , adding Interval to dict path to make sure everything actually works ,

gnomad/utils/annotations.py

klaricch · 2023-05-10T17:06:19Z

gnomad/utils/annotations.py

+
+
+def variant_report(
+    table_to_parse: hl.Table,


should discuss in group: it's nice to pass in a Table because it's more generalizable, although am not sure if the desired ask was to make it gnomad-specific (in which case we would want to pass in a version number for gnomad and then just load the ht for that version)

see KC's comment in slack about adding an example

gnomad/utils/annotations.py

klaricch · 2023-05-10T17:54:41Z

gnomad/utils/annotations.py

+        )
+
+    with hl.utils.hadoop_open(
+        "gs://gnomad-vrs-io-finals/chromosome_seqID_dictionary.json", "r"


gs://gnomad-vrs-io-finals is a private bucket, so won't work for many users, you can put this dictionary into a constant named VRS_CHROMS at the top of the script

also what is the original source of the seqID dictionary?

the GA4GH Team didn't supply it anywhere I could find it, so I constructed it myself with the Seqeucne_IDs for each of the chromosomes from a .PKL file in a truth set that had sent a while back. I can send the code if you would like

gnomad/utils/annotations.py

Some changes to docstring, some clarity and formatting fixes - more substantial changes on the way Co-authored-by: klaricch <[email protected]>

Changed fxn name to 'get_variant_json' , json import added to start of script , VRS_Chrom_IDs added to start of file as well, now constructing the JSOn inside of the function , am now returning the JSON/dictionary , and changed some errors when I was writing to the dictionary that was resulting in some structure being overwritten

gnomad/utils/annotations.py

changed some variable and the function name, fully incorporated the annotations into the VRS dictionary itself, adding the get_reference_genome function fully, and am now returning a string via json.dumps() instead of a dictionary

gnomad/utils/annotations.py

klaricch · 2023-05-12T13:04:48Z

gnomad/utils/annotations.py

+    1) A reference table to parse containing the variant of interest and its VRS Annotations
+    2) A string of the Chr, Pos, Ref, and Alt of a Variant
+    3) A GA4GH-VRS JSON Template
+


still needs to be addressed

klaricch

still need param descriptions

Changed import order, updated Doc String, added VRS attribute parsing to before the dictionary is first defined, and set up the control structure for (but not implementation of) GRCh37 SeqIDs and the Population information as well

add fxn generate_pop_json() and incorporate it into get_gks()

klaricch · 2023-05-15T21:47:49Z

I like leaving get_vrs as a more generic function, and then building up a new function for the va schema and allowing users to specify the pops they want and if they want it split by sex (and that way if they only want one pop they don't end up with a giant json). Here's some pseudocode to show that suggestion:

def get_gnomad_gks(version: str, variant:str, ancestry_groups: list=None, by_sex: bool=True):

# Based on version, load in proper gnomad ht and run get_vrs function on the specified version
vrs_id_for_pop = "XXX"

if not ancestry_groups:
    ancestry_groups = POPS["v" + str(version[0])]

list_of_ancestry_jsons = []


def _fill_in(ht, index_int, pop_id, label_str, version, vrs_id_for_pop):
    variant_freq = ht.freq[index_int]
    subpop_record = {
        "subpopulationFrequency": [
            {
                "id": pop_id,
                "type": "PopulationAlleleFrequency",
                "label": f"{label_str} Population Allele Frequency for ID",
                "focusAllele": vrs_id_for_pop,
                "focusAlleleCount": variant_freq["AC"].collect()[0],
                "locusAlleleCount": variant_freq["AN"].collect()[0],
                "alleleFrequency": variant_freq["AF"].collect()[0],
                "population": f"gnomad3:{pop_id}", 
                "ancillaryResults": {
                    "homozygotes": variant_freq["homozygote_count"].collect()[0]
                    },
                }
            ]
        }
    return subpop_record

 # Filter ht to variant
for ancestry_group in ancestry_groups:
    key = f'{ancestry_group}-adj'
    index_value = ht.freq_index_dict.get(key)
    result = _fill_in(ht =ht, index_int=index_value, pop_id=ancestry_group, label_str=POP_NAMES[ancestry_group], version=version)
    list_of_ancestry_jsons.append(result)
    
    if by_sex:
        for sex in ["XX", "YY"]:
            key = f'{ancestry_group}-{sex}-adj'
            # similar as lines above per ancestry group

klaricch · 2023-05-15T17:46:45Z

gnomad/utils/annotations.py

 }

-VRS_CHROM_IDS = {
+VRS_CHROM_GRCH38_IDS = {


a) do you have the ids for GRCH37?

b)you can change "GRCH38" -> "GRCh38"
and then create a nested dictionary for each build, ex:

VRS_CHROM_IDS = {'GRCh38': {
"chr1": "ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
"chr2": "ga4gh:SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g".....}}

so that downstream you can just do:
VRS_CHROM_IDS[build][chr_in]

gnomad/utils/annotations.py

commit docstring suggestions and some fixes, more on the way ! Co-authored-by: klaricch <[email protected]>

updates: mostly completed main JSON with information from HT, nested XX and XY inside of ancestry, and got the requested reformatting of how args are added in corrected. TODO: incorporate meanDepth , add more documentation , and I believe incorporate one additional line from the sent pseudocode I didn't understand in

include coverage stats

gnomad/utils/annotations.py

klaricch · 2023-05-18T21:49:35Z

gnomad/utils/annotations.py



-def get_gks(
+def get_gnomad_gks(


can change this back to get_gks

def get_gks(
ht: hl.Table,
variant: str,
vrs_only: bool = False,
ancestry_groups: list = None,
ancestry_group_names: dict = None,
by_sex: bool = False,
coverage_ht: hl.Table = None,
) -> dict:
"""
:param vrs_only: Whether or not to return only the VRS schema for the specified variant. Default is False.
:param ancestry_group_names: Dictionary containing mapping for ancestry group names, where keys are the shortened names for the ancestry groups, and values are the full label names for that group.
:param coverage_ht: Table containing coverage annotations, with mean coverage for a locus stored in 'mean' annotation.
"""

then create a gnomad-specific function (this will eventually be put in a different file but need to discuss that)
def get_gnomad_gks(version: str, variant:str, by_pops: bool=True, by_sex: bool=True):
# Based on version, load in proper gnomad ht, coverage ht
# Warn if by_sex but not ancestry groups (TODO: ask if they ever want by_sex but not by_pops)
# Call get_gks(fill in params here)

this scheme is followed now, with a gnomad_gks() function calling get_gks(). Any thoughts on putting gnomad_gks() in a different file?

commit suggestions suggestions from code review - more to come tonight ! Co-authored-by: klaricch <[email protected]>

split out separate get_gnomad_gks and added the GRCh37 IDs , more to come by Monday morning <3

function to remove items from freq and freq_meta

…'t overlap

Co-authored-by: Qin He <[email protected]>

…ore_than_2 Fix `merge_freq_arrays` for cases with more than two arrays

…pe struct{r: float64} is not a subset of struct{s: str, global_idx: int64, pop_idx: int64}

fix negative values issue with 'diff'

fix ValueError for count_arrays in merge_freq_arrays function

…_fix Fix `annotate_downsamplings` when used with a Table instead of a MatrixTable

Co-authored-by: jkgoodrich <[email protected]>

…mple_too Add optional filter to freq meta sample counts

…te/gnomad_methods into variant_report_code

ahwagner · 2023-09-12T12:58:44Z

Hey all. Just checking on the status of this PR. What is still needed?

matren395 · 2023-09-12T14:35:28Z

Hey Alex, a couple notes on some very recent development:

Admin/Git : When rebasing this PR and the PRs branched off of it, I encountered a weird problem with GitHub Desktop and VSCode's version control that wound up confusing/mangling the Git History for the PR. The Variant Report Code functions weren't touched, but all of the changes from main since we created the branch are showing up as commits and in the 'Files Changed' tab - which is absolutely not what should be happening , and confuses the PR a great deal. As much as I'd like to just blame GitHub Desktop for this , this is partially on me . In response to this, at the suggestion of Steve Jahl, the quickest and clearest solution was just to make new copies of PR with only the relevant code changes. I've only gone and created these PRs yesterday - I'll link the copy to this one below:

Copy of: Add gnomad_gks() and get_gks() for extracting gks information for a specified variant

Progress : My own PR off of this branch, linked here which refactors the code and makes it smoother/more comprehensible on our end, is only maybe one review away from merging.

Let me know if you have any comments or questions or concerns about this update!

klaricch · 2023-09-18T16:12:24Z

Closing because copied to new PR: #596

jkgoodrich and others added 3 commits May 4, 2023 10:22

Add annotate_allele_info function to annotations.py

20a4e7e

Rearrange docstring based on annotation order

6d30766

matren395 requested review from ch-kr and klaricch May 8, 2023 16:52

matren395 self-assigned this May 8, 2023

matren395 and others added 2 commits May 9, 2023 11:32

Update annotations.py

f9d2a7b

opening with Hadoop Open , adding Interval to dict path to make sure everything actually works ,

Add ability to use allele specific annotations in the gvcf_info field

fa6bfd4

klaricch requested changes May 10, 2023

View reviewed changes

matren395 and others added 2 commits May 11, 2023 10:11

Apply suggestions from code review

9985815

Some changes to docstring, some clarity and formatting fixes - more substantial changes on the way Co-authored-by: klaricch <[email protected]>

matren395 requested a review from klaricch May 11, 2023 14:53

klaricch requested changes May 11, 2023

View reviewed changes

Update annotations.py

c0ac756

changed some variable and the function name, fully incorporated the annotations into the VRS dictionary itself, adding the get_reference_genome function fully, and am now returning a string via json.dumps() instead of a dictionary

matren395 requested a review from klaricch May 11, 2023 19:28

jkgoodrich added the ga4gh VRS label May 11, 2023

klaricch reviewed May 12, 2023

View reviewed changes

klaricch requested changes May 12, 2023

View reviewed changes

matren395 added 2 commits May 12, 2023 12:14

Update annotations.py

b22197d

Changed import order, updated Doc String, added VRS attribute parsing to before the dictionary is first defined, and set up the control structure for (but not implementation of) GRCh37 SeqIDs and the Population information as well

Update annotations.py

26c99f8

add fxn generate_pop_json() and incorporate it into get_gks()

matren395 requested a review from klaricch May 15, 2023 18:41

klaricch requested changes May 15, 2023

View reviewed changes

matren395 and others added 3 commits May 17, 2023 13:17

Apply suggestions from code review

e5dbbb1

commit docstring suggestions and some fixes, more on the way ! Co-authored-by: klaricch <[email protected]>

Update annotations.py

377bd69

include coverage stats

matren395 requested a review from klaricch May 18, 2023 13:31

klaricch requested changes May 18, 2023

View reviewed changes

matren395 and others added 2 commits May 19, 2023 16:23

Apply suggestions from code review

36b64b0

commit suggestions suggestions from code review - more to come tonight ! Co-authored-by: klaricch <[email protected]>

Update annotations.py

d811c4b

split out separate get_gnomad_gks and added the GRCh37 IDs , more to come by Monday morning <3

KoalaQin and others added 26 commits August 28, 2023 14:45

fix unexpected indentation

a4032b7

fix unexpected indentation again

4383dc7

fix unexpected indentation again again

b1d749f

commit omitted suggestions

9888ce5

Merge pull request #582 from broadinstitute/qh/remove_from_freq

7095e94

function to remove items from freq and freq_meta

Add optional freq_meta_sample_count to filtering

5e290b2

Change order of filtering so can use meta for freq sample counts

4b20c83

Change variable name to be less specific

95dccd5

Switch to dict of expressions, allows multiple arrays to be filtered

b4997ae

Flip to dict comprehension

cab3034

Fix the case where there are more than 2 freqs/fmetas and fmeta doesn…

2aedfe5

…'t overlap

Apply suggestions from code review

0db45d9

Co-authored-by: Qin He <[email protected]>

Update gnomad/utils/filtering.py

3753558

Merge pull request #587 from broadinstitute/jg/fix_merge_freqs_with_m…

e6f042a

…ore_than_2 Fix `merge_freq_arrays` for cases with more than two arrays

Fix for error Error summary: AssertionError: assertion failed: Key ty…

0ce44f1

…pe struct{r: float64} is not a subset of struct{s: str, global_idx: int64, pop_idx: int64}

fix negative values issue with 'diff'

01713a7

Merge pull request #590 from broadinstitute/qh/merge_freq

0a7aad4

fix negative values issue with 'diff'

fix ValueError for count_arrays in merge_freq_arrays function

82cc9b3

add more info for raise ValueError in merge_freq_arrays

1926316

Merge pull request #591 from broadinstitute/qh/merge_freq

0ca8b3b

fix ValueError for count_arrays in merge_freq_arrays function

Merge pull request #589 from broadinstitute/jg/annotate_downsamplings…

08ce557

…_fix Fix `annotate_downsamplings` when used with a Table instead of a MatrixTable

Add flexiility around passed indexed expression type

24188af

Update gnomad/utils/filtering.py

e48c7b5

Co-authored-by: jkgoodrich <[email protected]>

Black on github suggestion

71d25bb

Merge pull request #585 from broadinstitute/mw/filter_to_freq_meta_sa…

8855a36

…mple_too Add optional filter to freq meta sample counts

Merge branch 'variant_report_code' of https://github.com/broadinstitu…

97f6ab5

…te/gnomad_methods into variant_report_code

klaricch closed this Sep 18, 2023

Add gnomad_gks() and get_gks() for extracting gks information for a specified variant #539

Add gnomad_gks() and get_gks() for extracting gks information for a specified variant #539

Uh oh!

Conversation

matren395 commented May 8, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

klaricch left a comment

Choose a reason for hiding this comment

Uh oh!

klaricch commented May 15, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahwagner commented Sep 12, 2023

Uh oh!

matren395 commented Sep 12, 2023

Uh oh!

klaricch commented Sep 18, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

11 participants