Skip to content

Conversation

@matren395
Copy link
Contributor

Added variant_report() code to produce a JSON with VRS information for the variant in question. NOTE THAT 1) I have not been able to get with open() to work for gs:// paths and there are still some left-behinds from when I was running this locally.

There are some pretty obvious changes I still need to make AND we need to discuss what we actually want this to return. Do we want it to write out the json or return it as a variable or?

Much to discuss, but I just wanted to get the code I do have up in the correct place !

jkgoodrich and others added 3 commits May 4, 2023 10:22
Added variant_report() code to produce a JSON with VRS information for the variant in question. NOTE THAT 1) I have not been able to get with open() to work for gs:// paths and there are still some left-behinds from when I was running this locally.

There are some pretty obvious changes I still need to make AND we need to discuss what we actually want this to return. Do we want it to write out the json or return it as a variable or?

Much to discuss, but I just wanted to get the code I do have up in the correct place !
@matren395 matren395 requested review from ch-kr and klaricch May 8, 2023 16:52
@matren395 matren395 self-assigned this May 8, 2023
matren395 and others added 2 commits May 9, 2023 11:32
opening with Hadoop Open , adding Interval to dict path to make sure everything actually works ,


def variant_report(
table_to_parse: hl.Table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should discuss in group: it's nice to pass in a Table because it's more generalizable, although am not sure if the desired ask was to make it gnomad-specific (in which case we would want to pass in a version number for gnomad and then just load the ht for that version)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see KC's comment in slack about adding an example

)

with hl.utils.hadoop_open(
"gs://gnomad-vrs-io-finals/chromosome_seqID_dictionary.json", "r"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gs://gnomad-vrs-io-finals is a private bucket, so won't work for many users, you can put this dictionary into a constant named VRS_CHROMS at the top of the script

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also what is the original source of the seqID dictionary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the GA4GH Team didn't supply it anywhere I could find it, so I constructed it myself with the Seqeucne_IDs for each of the chromosomes from a .PKL file in a truth set that had sent a while back. I can send the code if you would like

matren395 and others added 2 commits May 11, 2023 10:11
Some changes to docstring, some clarity and formatting fixes - more substantial changes on the way

Co-authored-by: klaricch <[email protected]>
Changed fxn name to 'get_variant_json' ,  json import added to start of script , VRS_Chrom_IDs added to start of file as well, now constructing the JSOn inside of the function , am now returning the JSON/dictionary , and changed some errors when I was writing to the dictionary that was resulting in some structure being overwritten
@matren395 matren395 requested a review from klaricch May 11, 2023 14:53
changed some variable and the function name, fully incorporated the annotations into the VRS dictionary itself, adding the get_reference_genome function fully, and am now returning a string via json.dumps() instead of a dictionary
1) A reference table to parse containing the variant of interest and its VRS Annotations
2) A string of the Chr, Pos, Ref, and Alt of a Variant
3) A GA4GH-VRS JSON Template
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still needs to be addressed

Copy link
Contributor

@klaricch klaricch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still need param descriptions

matren395 added 2 commits May 12, 2023 12:14
Changed import order, updated Doc String, added VRS attribute parsing to before the dictionary is first defined, and set up the control structure for (but not implementation of) GRCh37 SeqIDs and the Population information as well
add fxn generate_pop_json() and incorporate it into get_gks()
@matren395 matren395 requested a review from klaricch May 15, 2023 18:41
@klaricch
Copy link
Contributor

I like leaving get_vrs as a more generic function, and then building up a new function for the va schema and allowing users to specify the pops they want and if they want it split by sex (and that way if they only want one pop they don't end up with a giant json). Here's some pseudocode to show that suggestion:

def get_gnomad_gks(version: str, variant:str, ancestry_groups: list=None, by_sex: bool=True):

# Based on version, load in proper gnomad ht and run get_vrs function on the specified version
vrs_id_for_pop = "XXX"

if not ancestry_groups:
    ancestry_groups = POPS["v" + str(version[0])]

list_of_ancestry_jsons = []


def _fill_in(ht, index_int, pop_id, label_str, version, vrs_id_for_pop):
    variant_freq = ht.freq[index_int]
    subpop_record = {
        "subpopulationFrequency": [
            {
                "id": pop_id,
                "type": "PopulationAlleleFrequency",
                "label": f"{label_str} Population Allele Frequency for ID",
                "focusAllele": vrs_id_for_pop,
                "focusAlleleCount": variant_freq["AC"].collect()[0],
                "locusAlleleCount": variant_freq["AN"].collect()[0],
                "alleleFrequency": variant_freq["AF"].collect()[0],
                "population": f"gnomad3:{pop_id}", 
                "ancillaryResults": {
                    "homozygotes": variant_freq["homozygote_count"].collect()[0]
                    },
                }
            ]
        }
    return subpop_record

 # Filter ht to variant
for ancestry_group in ancestry_groups:
    key = f'{ancestry_group}-adj'
    index_value = ht.freq_index_dict.get(key)
    result = _fill_in(ht =ht, index_int=index_value, pop_id=ancestry_group, label_str=POP_NAMES[ancestry_group], version=version)
    list_of_ancestry_jsons.append(result)
    
    if by_sex:
        for sex in ["XX", "YY"]:
            key = f'{ancestry_group}-{sex}-adj'
            # similar as lines above per ancestry group

}

VRS_CHROM_IDS = {
VRS_CHROM_GRCH38_IDS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) do you have the ids for GRCH37?

b)you can change "GRCH38" -> "GRCh38"
and then create a nested dictionary for each build, ex:

VRS_CHROM_IDS = {'GRCh38': {
"chr1": "ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
"chr2": "ga4gh:SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g".....}}

so that downstream you can just do:
VRS_CHROM_IDS[build][chr_in]

matren395 and others added 3 commits May 17, 2023 13:17
commit docstring suggestions and some fixes, more on the way !

Co-authored-by: klaricch <[email protected]>
updates: mostly completed main JSON with information from HT, nested XX and XY inside of ancestry, and got the requested reformatting of how args are added in corrected.

TODO: incorporate meanDepth , add more documentation , and I believe incorporate one additional line from the sent pseudocode I didn't understand in
include coverage stats
@matren395 matren395 requested a review from klaricch May 18, 2023 13:31


def get_gks(
def get_gnomad_gks(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can change this back to get_gks

def get_gks(
ht: hl.Table,
variant: str,
vrs_only: bool = False,
ancestry_groups: list = None,
ancestry_group_names: dict = None,
by_sex: bool = False,
coverage_ht: hl.Table = None,
) -> dict:
"""
:param vrs_only: Whether or not to return only the VRS schema for the specified variant. Default is False.
:param ancestry_group_names: Dictionary containing mapping for ancestry group names, where keys are the shortened names for the ancestry groups, and values are the full label names for that group.
:param coverage_ht: Table containing coverage annotations, with mean coverage for a locus stored in 'mean' annotation.
"""

then create a gnomad-specific function (this will eventually be put in a different file but need to discuss that)
def get_gnomad_gks(version: str, variant:str, by_pops: bool=True, by_sex: bool=True):
# Based on version, load in proper gnomad ht, coverage ht
# Warn if by_sex but not ancestry groups (TODO: ask if they ever want by_sex but not by_pops)
# Call get_gks(fill in params here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this scheme is followed now, with a gnomad_gks() function calling get_gks(). Any thoughts on putting gnomad_gks() in a different file?

matren395 and others added 2 commits May 19, 2023 16:23
commit suggestions suggestions from code review - more to come tonight !

Co-authored-by: klaricch <[email protected]>
split out separate get_gnomad_gks and added the GRCh37 IDs , more to come by Monday morning <3
KoalaQin and others added 26 commits August 28, 2023 14:45
function to remove items from freq and freq_meta
…ore_than_2

Fix `merge_freq_arrays` for cases with more than two arrays
…pe struct{r: float64} is not a subset of struct{s: str, global_idx: int64, pop_idx: int64}
fix negative values issue with 'diff'
fix ValueError for count_arrays in merge_freq_arrays function
…_fix

Fix `annotate_downsamplings` when used with a Table instead of a MatrixTable
…mple_too

Add optional filter to freq meta sample counts
@ahwagner
Copy link

Hey all. Just checking on the status of this PR. What is still needed?

@matren395
Copy link
Contributor Author

Hey Alex, a couple notes on some very recent development:

Admin/Git : When rebasing this PR and the PRs branched off of it, I encountered a weird problem with GitHub Desktop and VSCode's version control that wound up confusing/mangling the Git History for the PR. The Variant Report Code functions weren't touched, but all of the changes from main since we created the branch are showing up as commits and in the 'Files Changed' tab - which is absolutely not what should be happening , and confuses the PR a great deal. As much as I'd like to just blame GitHub Desktop for this , this is partially on me . In response to this, at the suggestion of Steve Jahl, the quickest and clearest solution was just to make new copies of PR with only the relevant code changes. I've only gone and created these PRs yesterday - I'll link the copy to this one below:

Copy of: Add gnomad_gks() and get_gks() for extracting gks information for a specified variant

Progress : My own PR off of this branch, linked here which refactors the code and makes it smoother/more comprehensible on our end, is only maybe one review away from merging.

Let me know if you have any comments or questions or concerns about this update!

@klaricch
Copy link
Contributor

Closing because copied to new PR: #596

@klaricch klaricch closed this Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.