diff --git a/browser/help/faq/general/what-features-are-not-yet-in-v4-and-where-can-i-find-them.md b/browser/help/faq/general/what-features-are-not-yet-in-v4-and-where-can-i-find-them.md index 97bee2474..161da81f6 100644 --- a/browser/help/faq/general/what-features-are-not-yet-in-v4-and-where-can-i-find-them.md +++ b/browser/help/faq/general/what-features-are-not-yet-in-v4-and-where-can-i-find-them.md @@ -13,6 +13,5 @@ Below is a list of all features not included in the v4 MVP and where to find the | Genetic ancestry subgroups (prevously subpops) | v2 variant page | | Multi Nucleotide (MNV) calls | v2 variant table and variant page | | Variant co-occurrence | v2 gene page | -| Manual LoF curation | v2 variant table and variant page | | Regional Missense Constraint | Now available on v2 gene page | | Linkage disequilibrium scores | [v2](/downloads/#v2-linkage-disequilibrium) downloads | diff --git a/browser/help/topics/lof-curation.md b/browser/help/topics/lof-curation.md index 10d2e14e9..b5c3afb10 100644 --- a/browser/help/topics/lof-curation.md +++ b/browser/help/topics/lof-curation.md @@ -3,71 +3,73 @@ id: lof-curation title: Loss-of-Function Curation --- -The Loss-of-Function (LoF) classification is a result of a specialized and manual curation of predicted loss of function (pLoF) variants that have passed all LOFTEE filters and other QC flags in gnomAD and determines how likely these variants are to result in loss of function. For each curated variant, two curators performed an independent curation, and this is a process that yields a **prediction** for the likelihood of loss of function. Note that these predictions are based on _in silico_ metrics only, and do not incorporate experimental evidence, so should be regarded as more confident than automated curations but still uncertain. 
+The Loss-of-Function (LoF) classification is a result of a specialized and **manual** curation of predicted loss of function (pLoF) variants that have passed all LOFTEE filters and other QC flags in gnomAD, and determines how likely these variants are to result in loss of function. For each curated variant, two curators performed an independent curation, and this is a process that yields a **prediction** for the likelihood of loss of function. Note that these predictions are based on _in silico_ metrics only, and do not incorporate experimental evidence. This work is now published in the American Journal of Human Genetics ([Singer-Berk et al. 2023](https://pubmed.ncbi.nlm.nih.gov/37633279/)). ### Classification Categories -pLoF curated variants are assigned one of five classifications based on the presence or absence of certain error modes (described below). These classifications include: LoF, likely LoF, uncertain LoF, likely not LoF, and not LoF. LoF classified variants have no error modes that indicate they may not cause LoF, while not LoF classified variants have some error modes that indicate these variants are predicted to not result in LoF. Similar to ACMG/AMP criteria for likely pathogenic and likely benign classification of variants, likely LoF and likely not LoF classified variants are slightly less confidently predicted to result in LoF or not LoF, respectively. Variants with an uncertain LoF classification are similar to the ACMG/AMP variants of uncertain significance (VUS’s), and do not have sufficient evidence to point towards a classification of LoF or not LoF. +pLoF curated variants are assigned one of five classifications based on the presence or absence of certain flags (described below). These classifications include: _LoF, likely LoF, uncertain LoF, likely not LoF, and not LoF_. 
Variants classified as _LoF_ have no error modes that indicate they may escape NMD (nonsense mediated decay), while variants classified as _not LoF_ have some indication that they may escape NMD. Similar to ACMG/AMP criteria for likely pathogenic and likely benign classification of variants, _likely LoF_ and _likely not LoF_ classified variants are less confidently predicted to result in NMD or to escape NMD, respectively. Variants with an _uncertain LoF_ classification are similar to the ACMG/AMP variants of uncertain significance (VUS’s), and do not have sufficient evidence to predict if the result would be NMD or escaping NMD. ### How is this useful? -For likely LoF and LoF variants, this manual curation provides increased confidence that these variants are truly LoF variants for a gene and should result in nonsense-mediated decay, by systematically excluding a number of common annotation and sequencing error modes. Ultimately, however, functional studies would be needed to fully validate the potential LoF impact of a variant. +For _likely LoF_ and _LoF_ variants, this manual curation provides increased confidence that these variants are truly LoF variants for a gene and should result in NMD, by systematically ruling out a number of common annotation and sequencing error modes. Ultimately, however, functional studies would be needed to fully validate the potential LoF impact of a variant. -For likely not LoF and not LoF variants, this curation supports the conclusion that either these variants are likely the result of a technical sequencing error, or their predicted effect based on our extensive manual curation is not LoF. For those variants with technical errors that are also deemed not LoF, the allele frequency of these variants in gnomAD should not be considered the true frequency of these variants in gnomAD. 
For other variants that have been curated as likely not LoF and not LoF, these may still be pathogenic variants, but the predicted mechanism for these variants in causing disease is not predicted to be due to LoF. For example, a nonsense variant in the last exon would not be expected to result in nonsense mediated decay but it could still disrupt the function of the protein if the terminal end of the protein was essential for function. In addition, a variant that is located in a homopolymer repeat could be an artifact of sequencing in gnomAD, but if it is identified in an individual with disease (where it was likely Sanger confirmed), then the mechanism is actually likely to be LoF. - -Uncertain LoF variants represent cases where we were unable to reach a more conclusive classification and therefore should not be interpreted as falling into either of the above two categories. +For _likely not LoF_ and _not LoF_ variants, this curation supports the conclusion that either these variants are likely the result of a technical sequencing error, or their predicted effect based on our extensive manual curation is to escape NMD. For those variants with technical errors that are also classified as _not LoF_, the allele frequency of these variants in gnomAD should not be considered the true frequency of these variants. For other variants that have been curated as _likely not LoF_ and _not LoF_, these may still be pathogenic variants, but they are expected to escape NMD. ### LoF Curation Flags #### Mapping Issue Flag -There is a potential mapping issue for variants in this region as identified by repeat tracks (Mappability, RepeatMasker, Segmental Dups, Self Chain, and Simple Repeats) in the UCSC genome browser ([Kent et al. 2002](https://pubmed.ncbi.nlm.nih.gov/12045153/)). 
+There is a potential mapping issue for variants in this region as identified by repeat tracks (Mappability, RepeatMasker, Segmental Dups, Self Chain, and Simple Repeats) in the UCSC genome browser ([Kent et al. 2002](https://pubmed.ncbi.nlm.nih.gov/12045153/)). Mapping Error may also be flagged in the case of visual inspection of mapping issues in the IGV reads in gnomAD. #### Genotyping Issue Flag -Quality of variant calls is at the lower end of the allowed “passing” threshold based on allele balance, depth, and genotype quality scores so we have lower confidence that this variant is real (i.e. not a sequencing artifact) and conservatively do not want to assert that it would result in loss of function in the individual in gnomAD. This flag may also indicate a region that is of lower sequence complexity, GC rich, or has evidence of polymerase stuttering. +Quality of variant calls is at the lower end of the allowed “passing” threshold based on allele balance, depth, and genotype quality scores so there is reduced confidence that this variant is real (i.e. not a sequencing artifact) or that it would result in loss of function in the individual in gnomAD. This flag may also indicate a region that has low complexity, is GC rich, or has evidence of polymerase stuttering. #### Homopolymer Flag -This flag was used when an indel variant falls within a homopolymer repeat. Homopolymers are defined as a single nucleotide, dinucleotide, or trinucleotide motifs that are repeated in the reference sequence at least 5-7 times (see the homozygous versus heterozygous variant curation section below). +This flag was used when an indel variant falls within a homopolymer repeat. Homopolymers are defined as single nucleotide, dinucleotide, or trinucleotide motifs that are repeated in the reference sequence at least 5-7 times. #### No Read Flag -No read data was available for visualization of individuals with this variant in gnomAD. 
In some cases we could still determine a variant was not LoF, due to likely annotation errors. As this is an important part of the LoF curation process, we flagged these variants. Indels without any read data from homozygotes or heterozygotes were curated as uncertain_LoF if there was no indication of other major errors. In some cases, this flag was only used when there was an absence of reads for individuals with homozygous variants in gnomAD. +No read data was available for visualization of individuals with this variant in gnomAD (in the most up-to-date version as well as previous versions). In some cases we could still determine a variant was _not LoF_, due to other errors. Variants without any read data were curated as _uncertain_LoF_ if there was no indication of other major errors. #### Reference Error Flag -The variant was called due to an error or rare variant in the human genome reference sequence GRCh37/hg19, and therefore is present at a far higher than expected frequency (>99% MAF) across all gnomAD populations. Additionally, these variants may appear as gaps in the UCSC genome browser or as falling within small artificial intronic sequences of 1-2 bp in length created around reference errors. Most of these have been corrected in GRCh38. +The variant was called due to an error or rare variant in the human genome reference sequence GRCh37/hg19, and therefore is present at a far higher than expected frequency (>99% MAF) across all gnomAD populations. Additionally, these variants may appear as gaps in the UCSC genome browser or as falling within small artificial intronic sequences of <5 bp in length. Most of these have been corrected in GRCh38. #### Strand Bias Flag -The variant shows evidence of strand imbalance where it is unevenly distributed among forward and reverse strands in each individual's read data in the gnomAD browser. Strand bias in exome data in variants (i.e.
not sequencing artifacts) at intron-exon boundaries has been previously described, so this error mode was not flagged in canonical splice site variants ([Guo et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3532123/)). +The variant shows evidence of strand imbalance where it is unevenly distributed among forward or reverse strands. Strand bias in exome/genome data in variants at intron-exon boundaries has been previously described, so this error mode was not flagged in canonical splice site variants ([Guo et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3532123/)). #### MNV (Multi-nucleotide variants) / Frame-restoring Indel Flag ##### MNV -The variant is observed in cis with another variant in the same codon. When combining both variants, the predicted change is a missense or synonymous variant rather than a termination event, which would have been predicted from one of the pair of variants alone. MNVs are already flagged in gnomAD, but this flag will specifically indicate when the presence of a MNV does not result in LoF ([Wang et al., 2020](https://www.nature.com/articles/s41467-019-12438-5)). +The variant is observed in _cis_ with at least one other variant in the same codon. When combining both variants, the predicted change is a missense or synonymous variant rather than a termination event, which would have been predicted from one of the pair of variants alone. ##### Frame-restoring indel -This flag indicated there are two indels that were observed in cis. While at least one of these indels would have resulted in a frameshift and is therefore a pLoF variant, the second indel restores the reading frame, and is thus predicted to result overall in an in-frame indel (potentially with an intervening stretch of out-of-frame sequence) rather than a variant that triggers NMD. This flag may also be used for tandem repeats occurring near splice sites that are predicted to undergo normal splicing and maintain the reading frame. 
+This flag indicated there are at least two indels that were observed in _cis_. While at least one of these indels would have resulted in a frameshift and is therefore a pLoF variant, the other indel restores the reading frame, and is thus predicted to result in an in-frame indel (potentially with an intervening stretch of out-of-frame sequence), rather than a variant that results in NMD. This flag may also be used for tandem repeats occurring near splice sites that are predicted to undergo normal splicing and maintain the reading frame. #### Splice Rescue Flag -_In silico_ splice predictors predict a cryptic in-frame splice site to rescue a canonical splice site abolished by the pLoF variant in question. Alternatively, the presence of a cryptic splice site within 6bp of the abolished canonical splice site can be considered rescued in some cases. These cryptic sites may be either up or downstream of the canonical splice site (see homozygous versus heterozygous curations). +_In silico_ splice predictors predict a cryptic in-frame splice site that rescues a canonical splice site abolished by the pLoF variant in question without including stop codons. These cryptic sites may be either up or downstream of the canonical splice site. These essential splice site rescues are detected using SpliceAI and/or Pangolin. To be considered a splice rescue, SpliceAI and Pangolin must agree on the prediction of an in-frame rescue. Additionally, if there is an additional prediction of an out-of-frame cryptic rescue, this must be at a significantly lower score (defined by score >0.2 from the in-frame rescue), in order for the pLoF variant to be classified as _likely not LoF_ or _not LoF_. #### Splice Variant at In-frame Exon Flag -The variant is a splice disrupting variant that is adjacent to an in-frame exon and therefore not expected to result in NMD. +_In silico_ splice predictors predict skipping of the adjacent exon (which is in-frame), usually using SpliceAI and Pangolin.
The predictions must agree across predictors. Skipping of this exon will maintain the reading frame and is not predicted to result in NMD. The skipping of this exon will also remove <25% of the coding sequence for the gene. Additionally, if there is an additional prediction of an out-of-frame cryptic rescue, this must be at a significantly lower score (defined by score >0.2 from the in-frame exon predictions), in order for the pLoF variant to be classified as _likely not LoF_ or _not LoF_. + +#### Conflicting Splice Prediction Flag + +_In silico_ splice predictors disagree on the prediction for the variant. There are multiple predictions that have both in-frame and out-of-frame consequences. It is not possible to distinguish the likelihood that this would escape NMD. These variants are classified as uncertain_LoF. #### Weak/Unrecognized Splice Rescue Flag -_In silico_ splice predictors predict a weak rescue of a canonical splice site variant by a cryptic in-frame splice site up or downstream of the canonical splice site, or the cryptic splice sites are unrecognized by splicing predictors. “Weak” rescue by _in silico_ splice predictors is defined as a prediction that differs by <50% from the prediction scores at the canonical splice site. +_In silico_ splice predictors predict a weak rescue of a canonical splice site variant by a cryptic in-frame splice site upstream or downstream of the canonical splice site. #### Minority of Transcripts Flag -Variants are annotated as LoF in less than or equal to 50% of protein-coding GENCODE transcripts for the gene ([Harrow et al. 2012](https://pubmed.ncbi.nlm.nih.gov/22955987/)). GENCODE transcripts (rather than Ensembl or RefSeq) transcripts were selected for these curations because this is a refined transcript set built on various types of evidence. 
Another transcript set could have been used and in cases where there is additional transcript information available, there may be benefit from reinterpreting variants in terms of the disease or biologically relevant transcript(s). +Variants are annotated as LoF in less than or equal to 50% of protein-coding **GENCODE** transcripts for the gene ([Harrow et al. 2012](https://pubmed.ncbi.nlm.nih.gov/22955987/)). GENCODE transcripts (rather than Ensembl or RefSeq) transcripts were selected for these curations because they are a refined transcript set built on various types of evidence. #### Weak Exon Conservation Flag @@ -83,22 +85,29 @@ The variant falls within a gene where the entire gene is weakly conserved as vis #### Last Exon Flag -The variant results in a termination event that falls within the last coding exon of the gene or within the last 50bp of the penultimate exon, and occurs at a position of the gene that does not remove more than 25% of the gene’s coding sequence. +The variant results in a termination event that falls within the last coding exon or within the last 50bp of the penultimate exon of the gene, and occurs at a position of the gene that retains more than 25% of the gene’s coding sequence. #### Other Transcript Error Flag -This flag could refer to several possible error modes, all related to situations where we believe that the affected transcripts are either artifacts, or are unlikely to be biologically critical: +This flag could refer to several possible error modes, all related to situations in which we believe that the affected transcripts are either sequencing artifacts, or are unlikely to be biologically critical: - The variant falls within an “extension” of the exon in a transcript in which the majority of transcripts for the gene do not have the same extended portion. This region is also poorly conserved across vertebrates in the UCSC genome browser and has a lower mean pext score relative to the mean pext for the gene. 
-- The variant is in a gene where there are multiple (at least 2) transcripts that exist in different frames (overprinting), and the variant is annotated as LoF in at least one of those transcripts. -- The variant is nonsense and falls within the first coding exon for the gene, for which there is a downstream methionine, also in the first exon, that is well conserved across vertebrates and could rescue initiation of translation for this transcript. -- The variant falls in a codon that is split between 2 exons, and one of these exons is in a minority of transcripts. In this case, this variant is a nonsense variant in the exon that is in a minority of transcripts, but is synonymous or missense in the other transcripts. +- The variant is in a gene where there are at least 2 transcripts that exist in different frames (overprinting), and the variant is annotated as LoF in at least one of those transcripts. +- A nonsense variant that falls within the first coding exon for the gene, for which there is a downstream methionine, that is well conserved across vertebrates and could rescue initiation of translation for this transcript. #### Low Relative Mean Pext / Pext Does not Support Splicing Flag -The pext score is a metric that captures the expression of all transcripts overlapping the affected base across a variety of adult human tissues, calculated from the GTEx dataset ([Cummings et al. 2020](https://www.nature.com/articles/s41586-020-2329-2)). This flag means that the variant in question has a lower mean pext relative to the mean pext of the entire gene, typically meaning that the exon affected by the variant is weakly expressed compared to other exons in the gene. Alternatively, there is a splicing variant in an exon where pext does not support splicing of that exon to occur at that position. 
For example, a gene may have two transcripts with overlapping exons of different lengths, such that a splice variant in the shorter version of the exon would fall in the coding region of the longer version of the exon. A high pext (indicating expression) for the longer exon would lend more support to the longer exon being the more biologically relevant version. Thus, the splice variant in the shorter transcript would not be as strongly supported. +The pext score is a metric that captures the expression of all transcripts overlapping the affected base across a variety of adult human tissues, calculated from the GTEx dataset ([Cummings et al. 2020](https://www.nature.com/articles/s41586-020-2329-2)). This flag means that the exon/region affected by the variant is weakly expressed compared to other exons in the gene. Low pext is typically defined as a region of the gene with a mean pext score across tissues being <20% of the maximum mean pext across the gene. As mean pext across all tissues was used for curation, alternative interpretations may be supported for variants in genes that are known to exhibit highly tissue-specific expression. + +Alternatively, there is a splicing variant in an exon where pext does not support splicing of that exon to occur at that position. For example, a gene may have two transcripts with overlapping exons of different lengths, such that a splice variant in the shorter version of the exon would fall in the coding region of the longer version of the exon. A high pext (indicating expression) for the longer exon would lend more support to the longer exon being the more biologically relevant version. Thus, the splice variant in the shorter transcript would not be as strongly supported. + +#### Partially Decreased Relative Pext Flag + +The pext score is a metric that captures the expression of all transcripts overlapping the affected base across a variety of adult human tissues, calculated from the GTEx dataset ([Cummings et al. 
2020](https://www.nature.com/articles/s41586-020-2329-2)). This flag means that the variant is located in an exon/region of an exon that has a slightly reduced mean pext score compared to the rest of the gene. A slightly reduced pext score was defined as a mean pext score of <50% of the max pext for the gene. If this flag is in combination with other flags (e.g. minority transcripts/exon conservation), then the classification of the variant will likely be _uncertain LoF_ or _likely not LoF/not LoF_. As mean pext across all tissues was used for curation, alternative interpretations may be supported for variants in genes that are known to exhibit predominately tissue-specific expression. + +#### LoF in Untranslated Transcript Flag -As mean pext across all tissues was used for curation, alternative interpretations may be supported for variants in genes that are known to exhibit highly tissue-specific expression. +The variant falls in a transcript that is not a translated transcript. Therefore, the variant is not in a biologically relevant transcript. These transcripts were identified by having stop codons scattered throughout the transcript by visualization in the UCSC genome browser. #### Coverage Issue Flag @@ -106,12 +115,12 @@ The variant falls within a region of the gene where the per-base mean depth of c #### Uninformative Pext Flag -The variant falls within a gene that has a low mean pext across the entirety of the gene and has weak gene conservation or the variant falls within a minority of transcripts with no other flags that indicate it is not LoF. Alternatively, the variant falls within a gene that is not expressed in GTEx, the gene is too large to assess with pext scores, pext was too difficult to visualize across the gene, or there is a small transcript for the gene that is more highly expressed compared to the canonical transcript for the gene and therefore distorts the pext scores across the gene. 
+The variant falls within a gene that has a low mean pext across the entirety of the gene and has weak gene conservation, or the variant falls within a minority of transcripts with no other flags that indicate it is _not LoF_. Alternatively, the variant falls within a gene that is not expressed in GTEx, the gene is too large to assess with pext scores, pext was too difficult to visualize across the gene, or there is a small transcript for the gene that is more highly expressed compared to the canonical/MANE transcript for the gene and therefore distorts the pext scores. ### Homozygous versus Heterozygous Curations -All homozygous pLoF variants in gnomAD were curated as part of the efforts taken in [Karczewski et al. 2020 Nature](https://www.nature.com/articles/s41576-020-0255-7) paper to identify genes that were tolerant of complete knockout in humans, and therefore these curations tended to be more stringent. Therefore, certain error modes are applied differently depending on whether the variant appears in the homozygous versus heterozygous state in gnomAD. For example, in the homozygous state, a homopolymer error is only applied when a motif is repeated seven or more times in the reference sequence. In the heterozygous state, a homopolymer error is applied if a repeat motif is observed five or more times. This difference in interpretation is based on the more unlikely chance that an artifact would be seen in the homozygous state versus a heterozygous state. In both cases, the error would result in a likely not LoF verdict. Besides technical error differences, essential splice rescues were evaluated more strictly for homozygous individuals compared to variants found only in the heterozygous state. The presence of any cryptic splice sites occurring within 6bp of a canonical splice site automatically lead to a not LoF classification when evaluated in the heterozygous state. 
However, cryptic splice sites required a strong rescue prediction by in silico splice predictors in order to achieve the same verdict for a variant in the homozygous state. +All homozygous pLoF variants in gnomAD v2 were curated as part of the efforts taken in [Karczewski et al. 2020 Nature](https://www.nature.com/articles/s41576-020-0255-7) paper to identify genes that were tolerant of complete knockout in humans, and therefore these curations tended to be more stringent. Therefore, certain error modes are applied differently depending on whether the variant appears in the homozygous versus heterozygous state in gnomAD. This difference in interpretation is based on the more unlikely chance that an artifact would be seen in the homozygous state versus a heterozygous state. ### Caveats -These curations are based on the presence of these variants in the gnomAD database and do not take into account the presence of these variants in individuals with documented disease. We view LoF curation as an important step to be taken to evaluate whether a variant is expected to result in LoF and would follow this with curation for the clinical impact (pathogenic, benign, etc.) of a variant following ACMG’s Standards and guidelines for the interpretation of sequence variants ([Richards et al. 2015](https://www.nature.com/articles/gim201530)). Therefore, even if a variant is classified as LoF or likely LoF, it may not meet criteria for pathogenicity with respect to human disease. Additionally, the curation of a variant as LoF, likely LoF, uncertain, likely not LoF, and not LoF is not in any way determined based on functional evidence. Rather, predictions are entirely based on manual interpretation of splicing, conservation, annotations, and sequence data quality in gnomAD and the UCSC genome browser. +These curations are based on the presence of these variants in the gnomAD database and do not take into account the presence of these variants in individuals with documented disease. 
We view LoF curation as an important step to be taken to evaluate whether a variant is expected to result in NMD and would follow this with curation for the clinical impact (pathogenic, benign, etc.) of a variant following ACMG/AMP’s Standards and guidelines for the interpretation of sequence variants ([Richards et al. 2015](https://www.nature.com/articles/gim201530)). Therefore, even if a variant is classified as _LoF_ or _likely LoF_, it may not meet criteria for pathogenicity with respect to human disease. Additionally, the curation of a variant as _LoF_, _likely LoF_, _uncertain_, _likely not LoF_, and _not LoF_ is not in any way determined based on functional evidence. Rather, predictions are entirely based on manual interpretation of splicing, conservation, annotations, and sequence data quality in gnomAD and the UCSC genome browser. diff --git a/browser/src/DataPage/GnomadV4Downloads.tsx b/browser/src/DataPage/GnomadV4Downloads.tsx index 5a4c698f7..30d67a1c6 100644 --- a/browser/src/DataPage/GnomadV4Downloads.tsx +++ b/browser/src/DataPage/GnomadV4Downloads.tsx @@ -600,6 +600,52 @@ const GnomadV4Downloads = () => { + + + Secondary Analyses + + + Additional research analyses created using the core gnomAD releases in collaboration with + members of the gnomAD steering committee. + + + + Loss-of-function curation results +

+ For information on v4 loss-of-function curation results, see{' '} + {/* @ts-expect-error TS(2769) FIXME: No overload matches this call. */} + + The mutational constraint spectrum quantified from variation in 141,456 humans.{' '} + Nature 581, 434–443 (2020) + {' '} + (all homozygous LoF curation results),{' '} + {/* @ts-expect-error TS(2769) FIXME: No overload matches this call. */} + + Transcript expression-aware annotation improves rare variant interpretation.{' '} + Nature 581, 452–458 (2020) + {' '} + (haploinsufficient genes LoF curation results), and{' '} + {/* @ts-expect-error TS(2769) FIXME: No overload matches this call. */} + + + Advanced variant classification framework reduces the false positive rate of predicted + loss-of-function variants in population sequencing data. + {' '} + Am J Hum Genet 110, 1496-1508 (2023) + + . +

+ + + {/* @ts-expect-error TS(2745) FIXME: This JSX tag's 'children' prop expects type 'never... Remove this comment to see the full error message */} + + + + +
) } diff --git a/browser/src/DataPage/__snapshots__/DataPage.spec.tsx.snap b/browser/src/DataPage/__snapshots__/DataPage.spec.tsx.snap index e91f4f36f..cf6844b7b 100644 --- a/browser/src/DataPage/__snapshots__/DataPage.spec.tsx.snap +++ b/browser/src/DataPage/__snapshots__/DataPage.spec.tsx.snap @@ -7300,6 +7300,145 @@ exports[`Data Page has no unexpected changes 1`] = ` + +

+ + Secondary Analyses +

+
+

+ Additional research analyses created using the core gnomAD releases in collaboration with members of the gnomAD steering committee. +

+
+ +

+ + Loss-of-function curation results +

+
+

+ For information on v4 loss-of-function curation results, see + + + + The mutational constraint spectrum quantified from variation in 141,456 humans. + + + Nature 581, 434–443 (2020) + + + (all homozygous LoF curation results), + + + + Transcript expression-aware annotation improves rare variant interpretation. + + + Nature 581, 452–458 (2020) + + + (haploinsufficient genes LoF curation results), and + + + + Advanced variant classification framework reduces the false positive rate of predicted loss-of-function variants in population sequencing data. + + + Am J Hum Genet 110, 1496-1508 (2023) + + . +

+ +
diff --git a/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap b/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap index 8df41804f..da3164234 100644 --- a/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap +++ b/browser/src/help/__snapshots__/HelpPage.spec.tsx.snap @@ -1248,7 +1248,6 @@ Below is a list of all features not included in the v4 MVP and where to find the | Genetic ancestry subgroups (prevously subpops) | v2 variant page | | Multi Nucleotide (MNV) calls | v2 variant table and variant page | | Variant co-occurrence | v2 gene page | -| Manual LoF curation | v2 variant table and variant page | | Regional Missense Constraint | Now available on v2 gene page | | Linkage disequilibrium scores | [v2](/downloads/#v2-linkage-disequilibrium) downloads | ", diff --git a/data-pipeline/src/data_pipeline/datasets/gnomad_v2/gnomad_v2_lof_curation.py b/data-pipeline/src/data_pipeline/datasets/gnomad_v2/gnomad_v2_lof_curation.py index c7411d10e..d5c7b6ee9 100644 --- a/data-pipeline/src/data_pipeline/datasets/gnomad_v2/gnomad_v2_lof_curation.py +++ b/data-pipeline/src/data_pipeline/datasets/gnomad_v2/gnomad_v2_lof_curation.py @@ -30,8 +30,10 @@ "not_lof": "Not LoF", } +VERDICT_MAPPINGS_CLEAN = VERDICT_MAPPING.values() -def import_gnomad_v2_lof_curation_results(curation_result_paths, genes_path): + +def import_gnomad_lof_curation_results(curation_result_paths, genes_path, reference_genome="GRCh37"): all_flags = set() with hl.hadoop_open("/tmp/import_temp.tsv", "w") as temp_output_file: @@ -52,8 +54,13 @@ def import_gnomad_v2_lof_curation_results(curation_result_paths, genes_path): for row in reader: [chrom, pos, ref, alt] = row["Variant ID"].split("-") + chrom = f"chr{chrom}" if reference_genome == "GRCh38" else chrom - variant_flags = [FLAG_MAPPING.get(f, f) for f in raw_dataset_flags if row[f"Flag {f}"] == "TRUE"] + variant_flags = [ + FLAG_MAPPING.get(f, f) + for f in raw_dataset_flags + if row.get(f"Flag {f}") == "TRUE" or row.get(f"FLAG {f}") == "1" + ] 
genes = [gene_id for (gene_id, gene_symbol) in (gene.split(":") for gene in row["Gene"].split(";"))] @@ -62,7 +69,8 @@ def import_gnomad_v2_lof_curation_results(curation_result_paths, genes_path): if verdict == "inufficient_evidence": verdict = "insufficient_evidence" - verdict = VERDICT_MAPPING[verdict] + if verdict not in VERDICT_MAPPINGS_CLEAN: + verdict = VERDICT_MAPPING[verdict] output_row = [ chrom, @@ -81,7 +89,7 @@ def import_gnomad_v2_lof_curation_results(curation_result_paths, genes_path): ds = hl.import_table("/tmp/import_temp.tsv") ds = ds.transmute( - locus=hl.locus(ds.chrom, hl.int(ds.position)), + locus=hl.locus(ds.chrom, hl.int(ds.position), reference_genome), alleles=[ds.ref, ds.alt], ) diff --git a/data-pipeline/src/data_pipeline/pipelines/export_to_elasticsearch.py b/data-pipeline/src/data_pipeline/pipelines/export_to_elasticsearch.py index 789238f9b..bf074c45b 100644 --- a/data-pipeline/src/data_pipeline/pipelines/export_to_elasticsearch.py +++ b/data-pipeline/src/data_pipeline/pipelines/export_to_elasticsearch.py @@ -38,8 +38,8 @@ from data_pipeline.pipelines.gnomad_v3_short_tandem_repeats import pipeline as gnomad_v3_short_tandem_repeats_pipeline from data_pipeline.pipelines.gnomad_v4_variants import pipeline as gnomad_v4_variants_pipeline from data_pipeline.pipelines.gnomad_v4_coverage import pipeline as gnomad_v4_coverage_pipeline - from data_pipeline.pipelines.gnomad_v4_cnvs import pipeline as gnomad_v4_cnvs_pipeline +from data_pipeline.pipelines.gnomad_v4_lof_curation_results import pipeline as gnomad_v4_lof_curation_results_pipeline logger = logging.getLogger("gnomad_data_pipeline") @@ -145,6 +145,18 @@ def add_liftover_document_id(ds): # ), # "args": {"index": "gnomad_v4_genome_coverage", "id_field": "xpos", "num_shards": 2, "block_size": 10_000}, # }, + "gnomad_v4_lof_curation_results": { + "get_table": lambda: add_variant_document_id( + 
hl.read_table(gnomad_v4_lof_curation_results_pipeline.get_output("lof_curation_results").get_output_path()) + ), + "args": { + "index": "gnomad_v4_lof_curation_results", + "index_fields": ["document_id", "variant_id", "locus", "lof_curations.gene_id"], + "id_field": "document_id", + "num_shards": 1, + "block_size": 1_000, + }, + }, ############################################################################################################## # gnomAD v4 CNVs ############################################################################################################## diff --git a/data-pipeline/src/data_pipeline/pipelines/gnomad_v2_lof_curation_results.py b/data-pipeline/src/data_pipeline/pipelines/gnomad_v2_lof_curation_results.py index e001ae4cb..93219f6ba 100644 --- a/data-pipeline/src/data_pipeline/pipelines/gnomad_v2_lof_curation_results.py +++ b/data-pipeline/src/data_pipeline/pipelines/gnomad_v2_lof_curation_results.py @@ -1,6 +1,6 @@ from data_pipeline.pipeline import Pipeline, run_pipeline -from data_pipeline.datasets.gnomad_v2.gnomad_v2_lof_curation import import_gnomad_v2_lof_curation_results +from data_pipeline.datasets.gnomad_v2.gnomad_v2_lof_curation import import_gnomad_lof_curation_results from data_pipeline.pipelines.genes import pipeline as genes_pipeline @@ -9,7 +9,7 @@ pipeline.add_task( "prepare_gnomad_v2_lof_curation_results", - import_gnomad_v2_lof_curation_results, + import_gnomad_lof_curation_results, "/gnomad_v2/gnomad_v2_lof_curation_results.ht", {"genes_path": genes_pipeline.get_output("genes_grch37")}, { diff --git a/data-pipeline/src/data_pipeline/pipelines/gnomad_v4_lof_curation_results.py b/data-pipeline/src/data_pipeline/pipelines/gnomad_v4_lof_curation_results.py new file mode 100644 index 000000000..62b5bc996 --- /dev/null +++ b/data-pipeline/src/data_pipeline/pipelines/gnomad_v4_lof_curation_results.py @@ -0,0 +1,36 @@ +from data_pipeline.pipeline import Pipeline, run_pipeline + +from 
data_pipeline.datasets.gnomad_v2.gnomad_v2_lof_curation import import_gnomad_lof_curation_results + +from data_pipeline.pipelines.genes import pipeline as genes_pipeline + + +pipeline = Pipeline() + +pipeline.add_task( + "prepare_gnomad_v4_lof_curation_results", + import_gnomad_lof_curation_results, + "/gnomad_v4/gnomad_v4_lof_curation_results.ht", + {"genes_path": genes_pipeline.get_output("genes_grch38")}, + { + # If a result for a variant/gene pair is present in more than one file, + # the result in the first file in this list takes precedence. + "curation_result_paths": [ + "gs://gnomad-v4-data-pipeline/inputs/lof_curation/gnomAD_v4/gnomAD_incomplete_penetrance_final_results.csv", + ], + "reference_genome": "GRCh38", + }, +) + +############################################### +# Outputs +############################################### + +pipeline.set_outputs({"lof_curation_results": "prepare_gnomad_v4_lof_curation_results"}) + +############################################### +# Run +############################################### + +if __name__ == "__main__": + run_pipeline(pipeline) diff --git a/graphql-api/src/queries/lof-curation-result-queries.ts b/graphql-api/src/queries/lof-curation-result-queries.ts index 1e8981d01..e15a09fed 100644 --- a/graphql-api/src/queries/lof-curation-result-queries.ts +++ b/graphql-api/src/queries/lof-curation-result-queries.ts @@ -1,12 +1,36 @@ -const GNOMAD_V2_LOF_CURATION_RESULTS_INDEX = 'gnomad_v2_lof_curation_results' +type GnomadVersion = 'ExAC' | 'v2' | 'v4' + +const GNOMAD_LOF_CURATION_RESULTS_INDICES: Record<GnomadVersion, string> = { + ExAC: 'gnomad_v2_lof_curation_results', + v2: 'gnomad_v2_lof_curation_results', + v4: 'gnomad_v4_lof_curation_results', +} + +type LoFCuration = { + gene_id: string + gene_version: string + gene_symbol: string | null + verdict: string + flags: string[] | null + project: string +} + +type LoFCurationForVariant = { + variant_id: string + lof_curations: LoFCuration[] +} // 
================================================================================================ // Variant query // ================================================================================================ -export const fetchLofCurationResultsByVariant = async (esClient: any, variantId: any) => { +export const fetchLofCurationResultsByVariant = async ( + esClient: any, + gnomadVersion: GnomadVersion, + variantId: string +) => { const response = await esClient.search({ - index: GNOMAD_V2_LOF_CURATION_RESULTS_INDEX, + index: GNOMAD_LOF_CURATION_RESULTS_INDICES[gnomadVersion], type: '_doc', body: { query: { @@ -28,10 +52,17 @@ export const fetchLofCurationResultsByVariant = async (esClient: any, variantId: // ================================================================================================ // Gene query // ================================================================================================ +type Gene = { + gene_id: string +} -export const fetchLofCurationResultsByGene = async (esClient: any, gene: any) => { +export const fetchLofCurationResultsByGene = async ( + esClient: any, + gnomadVersion: GnomadVersion, + gene: Gene +) => { const response = await esClient.search({ - index: GNOMAD_V2_LOF_CURATION_RESULTS_INDEX, + index: GNOMAD_LOF_CURATION_RESULTS_INDICES[gnomadVersion], type: '_doc', size: 1000, body: { @@ -47,16 +78,29 @@ export const fetchLofCurationResultsByGene = async (esClient: any, gene: any) => }, }) - return response.body.hits.hits.map((doc: any) => doc._source.value) + const lofCurations: LoFCurationForVariant[] = response.body.hits.hits.map( + (doc: any) => doc._source.value + ) + + return lofCurations } // ================================================================================================ // Region query // ================================================================================================ +type Region = { + chrom: string + start: number + stop: number +} -export const 
fetchLofCurationResultsByRegion = async (esClient: any, region: any) => { +export const fetchLofCurationResultsByRegion = async ( + esClient: any, + gnomadVersion: GnomadVersion, + region: Region +) => { const response = await esClient.search({ - index: GNOMAD_V2_LOF_CURATION_RESULTS_INDEX, + index: GNOMAD_LOF_CURATION_RESULTS_INDICES[gnomadVersion], type: '_doc', size: 1000, body: { @@ -78,5 +122,9 @@ export const fetchLofCurationResultsByRegion = async (esClient: any, region: any }, }) - return response.body.hits.hits.map((doc: any) => doc._source.value) + const lofCurations: LoFCurationForVariant[] = response.body.hits.hits.map( + (doc: any) => doc._source.value + ) + + return lofCurations } diff --git a/graphql-api/src/queries/variant-datasets/exac-variant-queries.ts b/graphql-api/src/queries/variant-datasets/exac-variant-queries.ts index c1ff6b8fe..5cf54889c 100644 --- a/graphql-api/src/queries/variant-datasets/exac-variant-queries.ts +++ b/graphql-api/src/queries/variant-datasets/exac-variant-queries.ts @@ -78,7 +78,11 @@ export const fetchVariantById = async (esClient: any, variantIdOrRsid: any) => { const { variantFlags, exomeFlags } = getFlagsForContext({ type: 'region' }, variant) - const lofCurationResults = await fetchLofCurationResultsByVariant(esClient, variant.variant_id) + const lofCurationResults = await fetchLofCurationResultsByVariant( + esClient, + 'ExAC', + variant.variant_id + ) return { ...variant, @@ -184,7 +188,7 @@ export const fetchVariantsByGene = async (esClient: any, gene: any) => { .map((hit: any) => hit._source.value) .map(shapeVariantSummary({ type: 'gene', geneId: gene.gene_id })) - const lofCurationResults = await fetchLofCurationResultsByGene(esClient, gene) + const lofCurationResults = await fetchLofCurationResultsByGene(esClient, 'ExAC', gene) const lofCurationResultsByVariant = {} lofCurationResults.forEach((result: any) => { // @ts-expect-error TS(7053) FIXME: Element implicitly has an 'any' type because expre... 
Remove this comment to see the full error message @@ -246,7 +250,7 @@ export const fetchVariantsByRegion = async (esClient: any, region: any) => { .map((hit: any) => hit._source.value) .map(shapeVariantSummary({ type: 'region' })) - const lofCurationResults = await fetchLofCurationResultsByRegion(esClient, region) + const lofCurationResults = await fetchLofCurationResultsByRegion(esClient, 'ExAC', region) const lofCurationResultsByVariant = {} lofCurationResults.forEach((result: any) => { diff --git a/graphql-api/src/queries/variant-datasets/gnomad-v2-variant-queries.ts b/graphql-api/src/queries/variant-datasets/gnomad-v2-variant-queries.ts index ed3ee8cd1..e586f2203 100644 --- a/graphql-api/src/queries/variant-datasets/gnomad-v2-variant-queries.ts +++ b/graphql-api/src/queries/variant-datasets/gnomad-v2-variant-queries.ts @@ -119,7 +119,11 @@ const fetchVariantById = async (esClient: any, variantIdOrRsid: any, subset: any const { variantFlags, exomeFlags, genomeFlags } = getFlagsForContext({ type: 'region' }, variant) - const lofCurationResults = await fetchLofCurationResultsByVariant(esClient, variant.variant_id) + const lofCurationResults = await fetchLofCurationResultsByVariant( + esClient, + 'v2', + variant.variant_id + ) return { ...variant, @@ -272,7 +276,7 @@ const fetchVariantsByGene = async (esClient: any, gene: any, subset: any) => { ) .map(shapeVariantSummary(exomeSubset, genomeSubset, { type: 'gene', geneId: gene.gene_id })) - const lofCurationResults = await fetchLofCurationResultsByGene(esClient, gene) + const lofCurationResults = await fetchLofCurationResultsByGene(esClient, 'v2', gene) const lofCurationResultsByVariant = {} lofCurationResults.forEach((result: any) => { // @ts-expect-error TS(7053) FIXME: Element implicitly has an 'any' type because expre... 
Remove this comment to see the full error message @@ -344,7 +348,7 @@ const fetchVariantsByRegion = async (esClient: any, region: any, subset: any) => ) .map(shapeVariantSummary(exomeSubset, genomeSubset, { type: 'region' })) - const lofCurationResults = await fetchLofCurationResultsByRegion(esClient, region) + const lofCurationResults = await fetchLofCurationResultsByRegion(esClient, 'v2', region) const lofCurationResultsByVariant = {} lofCurationResults.forEach((result: any) => { diff --git a/graphql-api/src/queries/variant-datasets/gnomad-v4-variant-queries.ts b/graphql-api/src/queries/variant-datasets/gnomad-v4-variant-queries.ts index 62f6610ab..46955fb37 100644 --- a/graphql-api/src/queries/variant-datasets/gnomad-v4-variant-queries.ts +++ b/graphql-api/src/queries/variant-datasets/gnomad-v4-variant-queries.ts @@ -7,6 +7,11 @@ import { UserVisibleError } from '../../errors' import { fetchLocalAncestryPopulationsByVariant } from '../local-ancestry-queries' import { fetchAllSearchResults } from '../helpers/elasticsearch-helpers' import { mergeOverlappingRegions } from '../helpers/region-helpers' +import { + fetchLofCurationResultsByVariant, + fetchLofCurationResultsByGene, + fetchLofCurationResultsByRegion, +} from '../lof-curation-result-queries' import { getFlagsForContext } from './shared/flags' import { getConsequenceForContext } from './shared/transcriptConsequence' @@ -113,6 +118,12 @@ const fetchVariantById = async (esClient: any, variantId: any, subset: Subset) = const { variantFlags, exomeFlags, genomeFlags } = getFlagsForContext({ type: 'region' }, variant) + const lofCurationResults = await fetchLofCurationResultsByVariant( + esClient, + 'v4', + variant.variant_id + ) + let genome_ancestry_groups = subsetGenomeFreq.ancestry_groups || [] // Include HGDP and 1KG populations with gnomAD subsets if (variant.genome.freq.hgdp.ac_raw > 0) { @@ -232,6 +243,7 @@ const fetchVariantById = async (esClient: any, variantId: any, subset: Subset) = : null, flags: 
variantFlags, // TODO: Include RefSeq transcripts once the browser supports them. + lof_curations: lofCurationResults, transcript_consequences: (variant.transcript_consequences || []).filter((csq: any) => csq.gene_id.startsWith('ENSG') ), @@ -466,7 +478,21 @@ const fetchVariantsByGene = async (esClient: any, gene: any, subset: Subset) => ) .map(shapeVariantSummary(subset, { type: 'gene', geneId: gene.gene_id })) - return shapedHits + const lofCurationResults = await fetchLofCurationResultsByGene(esClient, 'v4', gene) + + const lofCurationByVariantId = new Map( + lofCurationResults.map((result) => [ + result.variant_id, + result.lof_curations.find((c) => c.gene_id === gene.gene_id), + ]) + ) + + const shapedHitsWithLof = shapedHits.map((variant: any) => ({ + ...variant, + lof_curation: lofCurationByVariantId.get(variant.variant_id), + })) + + return shapedHitsWithLof } catch (error) { throw new Error(`'Error fetching variants by gene:', ${error}`) } @@ -506,7 +532,7 @@ const fetchVariantsByRegion = async (esClient: any, region: any, subset: Subset) }, }) - return hits + const variants = hits .map((hit: any) => hit._source.value) .filter( (variant: any) => @@ -514,6 +540,26 @@ const fetchVariantsByRegion = async (esClient: any, region: any, subset: Subset) variant.exome.freq[subset].ac_raw > 0 ) .map(shapeVariantSummary(subset, { type: 'region' })) + + const lofCurationResults = await fetchLofCurationResultsByRegion(esClient, 'v4', region) + + const lofCurationsByVariantAndGene = new Map( + lofCurationResults.map((result) => [ + result.variant_id, + new Map(result.lof_curations.map((c) => [c.gene_id, c])), + ]) + ) + + const variantsWithLofCurations = variants.map((variant: any) => ({ + ...variant, + lof_curation: variant.transcript_consequence + ? 
lofCurationsByVariantAndGene + .get(variant.variant_id) + ?.get(variant.transcript_consequence.gene_id) + : undefined, + })) + + return variantsWithLofCurations } // ================================================================================================
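The region query above joins curation results to variant summaries through a two-level lookup, first by variant ID and then by gene ID. The sketch below isolates that join so it can be read and run on its own. It is a minimal illustration, not the code from this change: the type shapes are simplified, and `indexCurations`, `attachLofCurations`, `VariantSummary`, and the sample IDs are hypothetical names and data introduced only for this example.

```typescript
// Simplified, hypothetical shapes; the types in the diff carry more fields.
type LoFCuration = { gene_id: string; verdict: string }
type LoFCurationForVariant = { variant_id: string; lof_curations: LoFCuration[] }
type VariantSummary = {
  variant_id: string
  transcript_consequence?: { gene_id: string }
}

// Build a nested index: variant ID -> (gene ID -> curation).
// Each later lookup is then a pair of constant-time Map reads.
const indexCurations = (results: LoFCurationForVariant[]) =>
  new Map(
    results.map((result): [string, Map<string, LoFCuration>] => [
      result.variant_id,
      new Map(result.lof_curations.map((c): [string, LoFCuration] => [c.gene_id, c])),
    ])
  )

// Attach the curation matching each variant's selected transcript consequence.
// Variants without a consequence, or without a matching curation, get undefined.
const attachLofCurations = (variants: VariantSummary[], results: LoFCurationForVariant[]) => {
  const byVariantAndGene = indexCurations(results)
  return variants.map((variant) => ({
    ...variant,
    lof_curation: variant.transcript_consequence
      ? byVariantAndGene.get(variant.variant_id)?.get(variant.transcript_consequence.gene_id)
      : undefined,
  }))
}

// Made-up sample data, for illustration only.
const annotated = attachLofCurations(
  [
    { variant_id: '1-100-A-T', transcript_consequence: { gene_id: 'ENSG_A' } },
    { variant_id: '1-200-G-C', transcript_consequence: { gene_id: 'ENSG_B' } },
    { variant_id: '1-300-C-G' },
  ],
  [{ variant_id: '1-100-A-T', lof_curations: [{ gene_id: 'ENSG_A', verdict: 'Likely LoF' }] }]
)
```

Keying the inner map by gene ID matters in the region context because a variant can carry curations for more than one gene, and only the one matching the displayed transcript consequence should be attached. The gene query needs no inner map, since the gene is fixed and a single `find` per variant is enough.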