Skip to content

Conversation

jkgoodrich
Copy link
Contributor

Also adds some functions to help transform the methylation data for annotation onto the context HT:

  • transform_methylation_level
  • transform_grch37_methylation
  • transform_grch38_methylation

Tested with:

from gnomad.resources.grch38.gnomad import coverage, all_sites_an, public_release
from gnomad.resources.grch38.reference_data import methylation_sites, vep_context

context_ht = vep_context.ht()
coverage_hts = {
    "exomes": coverage("exomes").ht(),
    "genomes": coverage("genomes").ht(),
}
an_hts = {
    "exomes": all_sites_an("exomes").ht(),
    "genomes": all_sites_an("genomes").ht(),
}
exome_ht = public_release("exomes").ht()
genomes_ht = public_release("genomes").ht()
freq_hts = {
    "exomes": exome_ht.select("freq"),
    "genomes": genomes_ht.select("freq"),
}
filter_hts = {
    "exomes": exome_ht.select("filters"),
    "genomes": genomes_ht.select("filters"),
}
methylation_ht = methylation_sites.ht()
gerp_ht = hl.experimental.load_dataset(name="gerp_scores", version="hg19", reference_genome="GRCh38")

annotated_context_ht = assemble_constraint_context_ht(
    context_ht,
    coverage_hts=coverage_hts,
    an_hts=an_hts,
    freq_hts=freq_hts,
    filter_hts=filter_hts,
    methylation_ht=methylation_ht,
    gerp_ht=gerp_ht,
    transformation_funcs=None,
)

annotated_context_ht.describe()
annotated_context_ht.show()

and

from gnomad.resources.grch37.gnomad import coverage, public_release
from gnomad.resources.grch37.reference_data import methylation_sites, vep_context

context_ht = vep_context.ht()
coverage_hts = {
    "exomes": coverage("exomes").ht(),
    "genomes": coverage("genomes").ht(),
}
an_hts = None
exome_ht = public_release("exomes").ht()
genomes_ht = public_release("genomes").ht()
freq_hts = {
    "exomes": exome_ht.select("freq"),
    "genomes": genomes_ht.select("freq"),
}
filter_hts = {
    "exomes": exome_ht.select("filters"),
    "genomes": genomes_ht.select("filters"),
}
methylation_ht = methylation_sites.ht()
gerp_ht = hl.experimental.load_dataset(name="gerp_scores", version="hg19", reference_genome="GRCh37")

annotated_context_ht = assemble_constraint_context_ht(
    context_ht,
    coverage_hts=coverage_hts,
    an_hts=an_hts,
    freq_hts=freq_hts,
    filter_hts=filter_hts,
    methylation_ht=methylation_ht,
    gerp_ht=gerp_ht,
    transformation_funcs=None,
)

annotated_context_ht.describe()
annotated_context_ht.show()

Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sending the first round of my review back, I think it's very clear to combine these tables to context HT with dictionaries, I have a few questions to help me imagine the abstract part.

vep_csq_fields = [x for x in vep_csq_fields if x in csqs.dtype.element_type]
ht = ht.annotate(
vep=ht.vep.select(
"most_severe_consequence",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need 'most_severe_consequence' under both 'vep' and 'transcript_consequences'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transcript_consequences is a list, so the most_severe_consequence in each element of that list is the most severe consequence for the specific transcript. The higher level most_severe_consequence is the most severe consequence across all transcripts at the variant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I didn't know that, I thought one variant would only have one consequence in one transcript, it would also choose between "missense" and "splice region variant" to be most severe for one transcript.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's whatever the most significant consequence is (based on csq_order) in the consequence_terms annotation of the element in transcript_consequences.

hl.is_missing(x[t.locus].S), 0, x[t.locus].S
)

# If necessary, pull out first element of coverage statistics (which includes all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see coverage_stats in the coverage_hts in the ones you tested, which coverage_hts are you referring to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment specifies that this is relevant to v4...

It's added here:

"coverage_stats": (

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant when I tried to get coverage('exomes').ht() for v4, I don't see coverage_stats as you mentioned in the comment, but the nested rows are there. Are there other versions of coverage table?

Copy link
Contributor

@KoalaQin KoalaQin Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there's a difference between GRCh37 and GRCh38 coverage table, the columns changed from:

Row fields:
    'row_id': int64 
    'locus': locus<GRCh37> 
    'mean': float64 
    'median': int32 

to:

Row fields:
    'locus': locus<GRCh38> 
    'mean': float64 
    'median_approx': int32 
    'total_DP': int64 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in this one: "gs://gnomad/release/4.0/ht/exomes/gnomad.exomes.v4.0.coverage.ht"

from gnomad_qc.v4.resources.release import release_coverage
ht = release_coverage("exomes", public=False).ht()
ht.describe()
----------------------------------------
Global fields:
    'coverage_stats_meta': array<dict<str, str>> 
    'coverage_stats_meta_sample_count': array<int32> 
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'coverage_stats': array<struct {
        mean: float64, 
        median_approx: int32, 
        total_DP: int64, 
        over_1: float64, 
        over_5: float64, 
        over_10: float64, 
        over_15: float64, 
        over_20: float64, 
        over_25: float64, 
        over_30: float64, 
        over_50: float64, 
        over_100: float64
    }> 
----------------------------------------
Key: ['locus']
----------------------------------------

The point is that it can work with either possible schema

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's what I want to know, thanks.

@jkgoodrich jkgoodrich requested a review from KoalaQin December 3, 2024 18:46
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, we're close.
Sorry, it's been a while and I forgot.

vep_csq_fields = [x for x in vep_csq_fields if x in csqs.dtype.element_type]
ht = ht.annotate(
vep=ht.vep.select(
"most_severe_consequence",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I didn't know that, I thought one variant would only have one consequence in one transcript, it would also choose between "missense" and "splice region variant" to be most severe for one transcript.

hl.is_missing(x[t.locus].S), 0, x[t.locus].S
)

# If necessary, pull out first element of coverage statistics (which includes all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant when I tried to get coverage('exomes').ht() for v4, I don't see coverage_stats as you mentioned in the comment, but the nested rows are there. Are there other versions of coverage table?

@jkgoodrich jkgoodrich requested a review from KoalaQin December 4, 2024 17:01
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jkgoodrich jkgoodrich merged commit 75fb930 into main Dec 4, 2024
5 checks passed
@jkgoodrich jkgoodrich deleted the jg/annotate_context_ht_for_constraint branch December 4, 2024 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants