-
Notifications
You must be signed in to change notification settings - Fork 31
Add assemble_constraint_context_ht
function to create a fully annotated context HT for computing constraint on
#733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sending the first round of my review back, I think it's very clear to combine these tables to context HT with dictionaries, I have a few questions to help me imagine the abstract part.
vep_csq_fields = [x for x in vep_csq_fields if x in csqs.dtype.element_type] | ||
ht = ht.annotate( | ||
vep=ht.vep.select( | ||
"most_severe_consequence", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need 'most_severe_consequence' under both 'vep' and 'transcript_consequences'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transcript_consequences
is a list, so the most_severe_consequence
in each element of that list is the most severe consequence for the specific transcript. The higher level most_severe_consequence
is the most severe consequence across all transcripts at the variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I didn't know that, I thought one variant would only have one consequence in one transcript, it would also choose between "missense" and "splice region variant" to be most severe for one transcript.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's whatever the most significant consequence is (based on csq_order) in the consequence_terms annotation of the element in transcript_consequences.
hl.is_missing(x[t.locus].S), 0, x[t.locus].S | ||
) | ||
|
||
# If necessary, pull out first element of coverage statistics (which includes all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see coverage_stats
in the coverage_hts in the ones you tested, which coverage_hts are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment specifies that this is relevant to v4...
It's added here:
gnomad_methods/gnomad/utils/sparse_mt.py
Line 1334 in d29acb3
"coverage_stats": ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant when I tried to get coverage('exomes').ht()
for v4, I don't see coverage_stats
as you mentioned in the comment, but the nested rows are there. Are there other versions of coverage table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there's a difference between GRCh37 and GRCh38 coverage table, the columns changed from:
Row fields:
'row_id': int64
'locus': locus<GRCh37>
'mean': float64
'median': int32
to:
Row fields:
'locus': locus<GRCh38>
'mean': float64
'median_approx': int32
'total_DP': int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in this one: "gs://gnomad/release/4.0/ht/exomes/gnomad.exomes.v4.0.coverage.ht"
from gnomad_qc.v4.resources.release import release_coverage
ht = release_coverage("exomes", public=False).ht()
ht.describe()
----------------------------------------
Global fields:
'coverage_stats_meta': array<dict<str, str>>
'coverage_stats_meta_sample_count': array<int32>
----------------------------------------
Row fields:
'locus': locus<GRCh38>
'coverage_stats': array<struct {
mean: float64,
median_approx: int32,
total_DP: int64,
over_1: float64,
over_5: float64,
over_10: float64,
over_15: float64,
over_20: float64,
over_25: float64,
over_30: float64,
over_50: float64,
over_100: float64
}>
----------------------------------------
Key: ['locus']
----------------------------------------
The point is that it can work with either possible schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that's what I want to know, thanks.
Co-authored-by: Qin He <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, we're close.
Sorry, it's been a while and I forgot.
vep_csq_fields = [x for x in vep_csq_fields if x in csqs.dtype.element_type] | ||
ht = ht.annotate( | ||
vep=ht.vep.select( | ||
"most_severe_consequence", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I didn't know that, I thought one variant would only have one consequence in one transcript, it would also choose between "missense" and "splice region variant" to be most severe for one transcript.
hl.is_missing(x[t.locus].S), 0, x[t.locus].S | ||
) | ||
|
||
# If necessary, pull out first element of coverage statistics (which includes all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant when I tried to get coverage('exomes').ht()
for v4, I don't see coverage_stats
as you mentioned in the comment, but the nested rows are there. Are there other versions of coverage table?
Co-authored-by: Qin He <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Also adds some functions to help transform the methylation data for annotation onto the context HT:
transform_methylation_level
transform_grch37_methylation
transform_grch38_methylation
Tested with:
and