Add `assemble_constraint_context_ht` function to create a fully annotated context HT for computing constraint on #733

jkgoodrich · 2024-10-03T00:07:11Z

Also adds some functions to help transform the methylation data for annotation onto the context HT:

transform_methylation_level
transform_grch37_methylation
transform_grch38_methylation

Tested with:

import hail as hl

from gnomad.resources.grch38.gnomad import all_sites_an, coverage, public_release
from gnomad.resources.grch38.reference_data import methylation_sites, vep_context
from gnomad.utils.constraint import assemble_constraint_context_ht

context_ht = vep_context.ht()
coverage_hts = {
    "exomes": coverage("exomes").ht(),
    "genomes": coverage("genomes").ht(),
}
an_hts = {
    "exomes": all_sites_an("exomes").ht(),
    "genomes": all_sites_an("genomes").ht(),
}
exome_ht = public_release("exomes").ht()
genomes_ht = public_release("genomes").ht()
freq_hts = {
    "exomes": exome_ht.select("freq"),
    "genomes": genomes_ht.select("freq"),
}
filter_hts = {
    "exomes": exome_ht.select("filters"),
    "genomes": genomes_ht.select("filters"),
}
methylation_ht = methylation_sites.ht()
gerp_ht = hl.experimental.load_dataset(name="gerp_scores", version="hg19", reference_genome="GRCh38")

annotated_context_ht = assemble_constraint_context_ht(
    context_ht,
    coverage_hts=coverage_hts,
    an_hts=an_hts,
    freq_hts=freq_hts,
    filter_hts=filter_hts,
    methylation_ht=methylation_ht,
    gerp_ht=gerp_ht,
    transformation_funcs=None,
)

annotated_context_ht.describe()
annotated_context_ht.show()

and

import hail as hl

from gnomad.resources.grch37.gnomad import coverage, public_release
from gnomad.resources.grch37.reference_data import methylation_sites, vep_context
from gnomad.utils.constraint import assemble_constraint_context_ht

context_ht = vep_context.ht()
coverage_hts = {
    "exomes": coverage("exomes").ht(),
    "genomes": coverage("genomes").ht(),
}
an_hts = None
exome_ht = public_release("exomes").ht()
genomes_ht = public_release("genomes").ht()
freq_hts = {
    "exomes": exome_ht.select("freq"),
    "genomes": genomes_ht.select("freq"),
}
filter_hts = {
    "exomes": exome_ht.select("filters"),
    "genomes": genomes_ht.select("filters"),
}
methylation_ht = methylation_sites.ht()
gerp_ht = hl.experimental.load_dataset(name="gerp_scores", version="hg19", reference_genome="GRCh37")

annotated_context_ht = assemble_constraint_context_ht(
    context_ht,
    coverage_hts=coverage_hts,
    an_hts=an_hts,
    freq_hts=freq_hts,
    filter_hts=filter_hts,
    methylation_ht=methylation_ht,
    gerp_ht=gerp_ht,
    transformation_funcs=None,
)

annotated_context_ht.describe()
annotated_context_ht.show()

KoalaQin

I'm sending back the first round of my review. Combining these tables into the context HT with dictionaries is very clear. I have a few questions to help me understand the more abstract parts.

gnomad/resources/grch37/reference_data.py

gnomad/resources/grch38/reference_data.py

KoalaQin · 2024-10-18T15:27:46Z

gnomad/utils/constraint.py

+    vep_csq_fields = [x for x in vep_csq_fields if x in csqs.dtype.element_type]
+    ht = ht.annotate(
+        vep=ht.vep.select(
+            "most_severe_consequence",


Why do you need 'most_severe_consequence' under both 'vep' and 'transcript_consequences'?

transcript_consequences is a list, so the most_severe_consequence in each element of that list is the most severe consequence for that specific transcript. The higher-level most_severe_consequence is the most severe consequence across all transcripts at the variant.

Oh, I didn't know that. I thought one variant would have only one consequence in one transcript, so it also chooses between "missense" and "splice region variant" as the most severe for one transcript.

Yeah, it's whatever the most significant consequence is (based on csq_order) in the consequence_terms annotation of the element in transcript_consequences.
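
To make the distinction in this thread concrete, here is a minimal pure-Python sketch of the two levels of most_severe_consequence. It is not the gnomAD or Hail implementation: `CSQ_ORDER` is a hypothetical four-term ordering invented for illustration, plain dicts stand in for VEP structs, and `most_severe` and `annotate_most_severe` are made-up helper names.

```python
# Hypothetical severity ordering, most severe first. This is an illustrative
# subset and NOT the csq_order used by gnomad_methods.
CSQ_ORDER = [
    "stop_gained",
    "missense_variant",
    "splice_region_variant",
    "synonymous_variant",
]


def most_severe(terms, csq_order=CSQ_ORDER):
    """Return the term in `terms` that appears earliest in `csq_order`, or None."""
    ranked = [term for term in csq_order if term in terms]
    return ranked[0] if ranked else None


def annotate_most_severe(transcript_consequences):
    """Add a per-transcript and a variant-level most severe consequence."""
    # Per transcript: pick the worst of that transcript's consequence_terms.
    per_transcript = [
        {**tc, "most_severe_consequence": most_severe(tc["consequence_terms"])}
        for tc in transcript_consequences
    ]
    # Variant level: pick the worst term seen across all transcripts.
    all_terms = {
        term for tc in transcript_consequences for term in tc["consequence_terms"]
    }
    return {
        "most_severe_consequence": most_severe(all_terms),
        "transcript_consequences": per_transcript,
    }


vep = annotate_most_severe(
    [
        {"transcript_id": "T1", "consequence_terms": ["missense_variant", "splice_region_variant"]},
        {"transcript_id": "T2", "consequence_terms": ["synonymous_variant"]},
        {"transcript_id": "T3", "consequence_terms": ["stop_gained"]},
    ]
)
```

In this toy input, transcript T1 carries two terms and resolves to its own most severe term, while the variant-level value is taken across T1, T2, and T3 together.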

KoalaQin · 2024-10-18T20:10:56Z

gnomad/utils/constraint.py

+            hl.is_missing(x[t.locus].S), 0, x[t.locus].S
+        )
+
+    # If necessary, pull out first element of coverage statistics (which includes all


I don't see coverage_stats in the coverage_hts you tested with. Which coverage_hts are you referring to?

The comment specifies that this is relevant to v4...

It's added here:

gnomad_methods/gnomad/utils/sparse_mt.py

Line 1334 in d29acb3

"coverage_stats": (

I meant that when I tried coverage('exomes').ht() for v4, I didn't see coverage_stats as mentioned in the comment, but the nested rows are there. Are there other versions of the coverage table?

I see there's a difference between the GRCh37 and GRCh38 coverage tables. The columns changed from:

Row fields:
    'row_id': int64
    'locus': locus<GRCh37>
    'mean': float64
    'median': int32

to:

Row fields:
    'locus': locus<GRCh38>
    'mean': float64
    'median_approx': int32
    'total_DP': int64

It's in this one: "gs://gnomad/release/4.0/ht/exomes/gnomad.exomes.v4.0.coverage.ht"

from gnomad_qc.v4.resources.release import release_coverage
ht = release_coverage("exomes", public=False).ht()
ht.describe()

----------------------------------------
Global fields:
    'coverage_stats_meta': array<dict<str, str>>
    'coverage_stats_meta_sample_count': array<int32>
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'coverage_stats': array<struct {
        mean: float64,
        median_approx: int32,
        total_DP: int64,
        over_1: float64,
        over_5: float64,
        over_10: float64,
        over_15: float64,
        over_20: float64,
        over_25: float64,
        over_30: float64,
        over_50: float64,
        over_100: float64
    }>
----------------------------------------
Key: ['locus']
----------------------------------------

The point is that it can work with either possible schema.

Okay, that's what I want to know, thanks.
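
As a rough illustration of "work with either possible schema", the sketch below normalizes a coverage row whether its statistics are flat or nested under coverage_stats. It is an assumption-laden stand-in, not the code from this PR: plain dicts replace Hail row structs, `normalize_coverage_row` is a hypothetical helper name, and taking element 0 simply follows the code comment quoted in this thread ("pull out first element of coverage statistics").

```python
def normalize_coverage_row(row):
    """Return a flat dict of coverage stats for a row in either schema.

    `row` is a plain dict standing in for a Hail row struct (an assumption
    made so the example runs without Hail).
    """
    if "coverage_stats" in row:
        # Nested schema: pull out the first element of the coverage_stats
        # array and lift its fields to the top level.
        first = row["coverage_stats"][0]
        return {"locus": row["locus"], **first}
    # Flat schema: the stats are already top-level fields.
    return dict(row)


# Illustrative values only; they are not taken from any gnomAD table.
flat_row = {"locus": "chr1:100", "mean": 30.5, "median_approx": 31, "total_DP": 3050}
nested_row = {
    "locus": "chr1:100",
    "coverage_stats": [
        {"mean": 30.5, "median_approx": 31, "total_DP": 3050},
        {"mean": 12.0, "median_approx": 12, "total_DP": 600},
    ],
}

normalized_flat = normalize_coverage_row(flat_row)
normalized_nested = normalize_coverage_row(nested_row)
```

Both calls yield the same flat record here, so downstream code can read fields such as mean without checking which schema the input table used.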

gnomad/utils/constraint.py

Co-authored-by: Qin He <[email protected]>

KoalaQin

A few comments, we're close.
Sorry, it's been a while and I forgot.

KoalaQin

LGTM!

Add function to create a fully annotated context HT for computing con…

47bc19a

…straint on

jkgoodrich added Changelog: new feature Constraint labels Oct 3, 2024

jkgoodrich self-assigned this Oct 3, 2024

jkgoodrich added 3 commits October 2, 2024 18:18

format

95ab909

Small change to use add_most_severe_consequence_to_consequence and ru…

a912b01

…n isort

Format comment

5b1d2e8

jkgoodrich requested a review from KoalaQin October 4, 2024 16:35

jkgoodrich assigned KoalaQin Oct 4, 2024

KoalaQin suggested changes Oct 18, 2024

jkgoodrich and others added 2 commits December 3, 2024 10:31

Apply suggestions from code review

8c53f41

Co-authored-by: Qin He <[email protected]>

Address reviewer comments

eb38fa2

jkgoodrich requested a review from KoalaQin December 3, 2024 18:46

KoalaQin suggested changes Dec 4, 2024

Update gnomad/utils/constraint.py

cc73837

Co-authored-by: Qin He <[email protected]>

jkgoodrich requested a review from KoalaQin December 4, 2024 17:01

KoalaQin approved these changes Dec 4, 2024

jkgoodrich merged commit 75fb930 into main Dec 4, 2024
5 checks passed

jkgoodrich deleted the jg/annotate_context_ht_for_constraint branch December 4, 2024 19:12
