Function to get expression proportion #649

KoalaQin · 2023-11-17T20:28:10Z

This function is similar to this, by taking the proportions of transcript expression to gene expression per tissue, and a mean proportion. It's extended to use both array-format expression data and tissue-as-row expression data as input.

jkgoodrich

Some changes based on my PR

gnomad/utils/transcript_annotation.py

jkgoodrich · 2023-12-13T17:37:00Z

gnomad/utils/transcript_annotation.py

+    # Calculate the proportion of expression of transcript to gene per tissue
+    transcript_ht = transcript_ht.annotate(
+        exp_prop=hl.or_else(
+            transcript_ht.transcript_expression / transcript_ht.gene_expression,


I'm still thinking about this one, I need to follow the original code in more detail, will add comment after

I spot checked two genes on the pext values on v2 exomes, it's the same.

Ah, I see now, I had in my mind that summarize_transcript_expression was running summary_agg_func on the grouped gene_id rather than hl.agg.sum. Now that I look at this I think we probably want to change the other function because that is confusing.

There are two ways we can do this.

The first is to change the other function so it returns a single table with structs for transcript_expression and expression_proportion. Then modify the other functions accordingly. Something like this:

def summarize_transcript_expression( mt: hl.MatrixTable, transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm", tissue_expr: Union[hl.expr.StringExpression, str] = "tissue", summary_agg_func: Optional[Callable] = None, ) -> hl.Table: if summary_agg_func is None: summary_agg_func = lambda x: hl.median(hl.agg.collect(x)) if isinstance(transcript_expression_expr, str): transcript_expression_expr = mt[transcript_expression_expr] if isinstance(tissue_expr, str): tissue_expr = mt[tissue_expr] mt = mt.group_cols_by(tissue=tissue_expr).aggregate( tx=summary_agg_func(transcript_expression_expr) ) transcript_ht = mt.rename({"tx": ""}).make_table() transcript_ht = transcript_ht.key_by("transcript_id", "gene_id") tissues = list(transcript_ht.row_value) # Calculate the sum of transcript expression by gene per tissue. gene_ht = transcript_ht.group_by("gene_id").aggregate( **{ tissue: hl.agg.sum(transcript_ht[tissue]) for tissue in tissues } ) # Calculate the proportion of expression of transcript to gene per tissue. gene_prop = gene_ht[transcript_ht.gene_id] transcript_ht = transcript_ht.select( transcript_expression=transcript_ht.row_value, expression_proportion=hl.struct( **{ tissue: transcript_ht[tissue]/gene_prop[tissue] for tissue in tissues } ) ) return transcript_ht def tissue_expression_struct_to_array( tissue_struct_expr: hl.expr.StringExpression, tissues: Optional[List[str]] = None, tissues_to_filter: Optional[List[str]] = None, ) -> Tuple[hl.expr.ArrayExpression, str]: if tissues is None: tissues = list(tissue_struct_expr.keys()) if tissues_to_filter is not None: logger.info("Filtering tissues: %s", tissues_to_filter) tissues = [t for t in tissues if t not in tissues_to_filter] tissue_expression = hl.array([tissue_struct_expr[t] for t in tissues]) return tissue_expression, tissues def get_expression_proportion( tissue_struct_expr: hl.expr.StringExpression, tissues_to_filter: Optional[List[str]] = None, ) -> hl.expr.Float64Expression: tissue_expr, _ = tissue_expression_struct_to_array( tissue_struct_expr=tissue_struct_expr, tissues_to_filter=tissues_to_filter ) # Calculate the mean expression proportion across tissues. return hl.mean(hl.filter(lambda e: ~hl.is_nan(e), tissue_expr))

The other option is to change the summarize function back to only returning the transcript HT and then get the transcript proportion within this function instead of passing in 2 HTs. Something like this:

def summarize_transcript_expression( mt: hl.MatrixTable, transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm", tissue_expr: Union[hl.expr.StringExpression, str] = "tissue", summary_agg_func: Optional[Callable] = None, ) -> hl.Table: if summary_agg_func is None: summary_agg_func = lambda x: hl.median(hl.agg.collect(x)) if isinstance(transcript_expression_expr, str): transcript_expression_expr = mt[transcript_expression_expr] if isinstance(tissue_expr, str): tissue_expr = mt[tissue_expr] mt = mt.group_cols_by(tissue=tissue_expr).aggregate( tx=summary_agg_func(transcript_expression_expr) ) transcript_ht = mt.rename({"tx": ""}).make_table() transcript_ht = transcript_ht.key_by("transcript_id", "gene_id") return transcript_ht def get_expression_proportion( transcript_ht: hl.Table, tissues_to_filter: Optional[List[str]] = None, ) -> hl.Table: transcript_ht = tissue_expression_ht_to_array( transcript_ht, tissues_to_filter=tissues_to_filter ) tissues = hl.eval(transcript_ht.tissues) # Calculate the sum of transcript expression by gene per tissue. gene_ht = transcript_ht.group_by("gene_id").aggregate( expression_sum=hl.agg.array_sum(transcript_ht.tissue_expression) ) # Calculate the mean expression proportion across tissues. transcript_ht = transcript_ht.annotate( exp_prop_mean=hl.mean( hl.filter( lambda e: ~hl.is_nan(e), hl.map( lambda x, y: x/y, transcript_ht.tissue_expression, gene_ht[transcript_ht.gene_id].expression_sum ) ) ) ) return transcript_ht

It sort of depends on whether it's worth storing the proportion as well (or if the HT with that info needs to be released), or if it can just be computed each time it is needed

Im my original code, I put transcript_expression, gene_expression, and expression_proportion in one HT, it was clearer to me that way. And pext scores on v2 exomes are actually showing the expression_proportion, unless we will change that.

…to qh/get_expression_proportion # Conflicts: # gnomad/utils/transcript_annotation.py

jkgoodrich

OK, I finally got around to looking at this again and figured out why I was confused

jkgoodrich · 2023-12-18T20:47:08Z

gnomad/utils/transcript_annotation.py

+    # Calculate the proportion of expression of transcript to gene per tissue
+    transcript_ht = transcript_ht.annotate(
+        exp_prop=hl.or_else(
+            transcript_ht.transcript_expression / transcript_ht.gene_expression,


Ah, I see now, I had in my mind that summarize_transcript_expression was running summary_agg_func on the grouped gene_id rather than hl.agg.sum. Now that I look at this I think we probably want to change the other function because that is confusing.

There are two ways we can do this.

The first is to change the other function so it returns a single table with structs for transcript_expression and expression_proportion. Then modify the other functions accordingly. Something like this:

def summarize_transcript_expression( mt: hl.MatrixTable, transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm", tissue_expr: Union[hl.expr.StringExpression, str] = "tissue", summary_agg_func: Optional[Callable] = None, ) -> hl.Table: if summary_agg_func is None: summary_agg_func = lambda x: hl.median(hl.agg.collect(x)) if isinstance(transcript_expression_expr, str): transcript_expression_expr = mt[transcript_expression_expr] if isinstance(tissue_expr, str): tissue_expr = mt[tissue_expr] mt = mt.group_cols_by(tissue=tissue_expr).aggregate( tx=summary_agg_func(transcript_expression_expr) ) transcript_ht = mt.rename({"tx": ""}).make_table() transcript_ht = transcript_ht.key_by("transcript_id", "gene_id") tissues = list(transcript_ht.row_value) # Calculate the sum of transcript expression by gene per tissue. gene_ht = transcript_ht.group_by("gene_id").aggregate( **{ tissue: hl.agg.sum(transcript_ht[tissue]) for tissue in tissues } ) # Calculate the proportion of expression of transcript to gene per tissue. gene_prop = gene_ht[transcript_ht.gene_id] transcript_ht = transcript_ht.select( transcript_expression=transcript_ht.row_value, expression_proportion=hl.struct( **{ tissue: transcript_ht[tissue]/gene_prop[tissue] for tissue in tissues } ) ) return transcript_ht def tissue_expression_struct_to_array( tissue_struct_expr: hl.expr.StringExpression, tissues: Optional[List[str]] = None, tissues_to_filter: Optional[List[str]] = None, ) -> Tuple[hl.expr.ArrayExpression, str]: if tissues is None: tissues = list(tissue_struct_expr.keys()) if tissues_to_filter is not None: logger.info("Filtering tissues: %s", tissues_to_filter) tissues = [t for t in tissues if t not in tissues_to_filter] tissue_expression = hl.array([tissue_struct_expr[t] for t in tissues]) return tissue_expression, tissues def get_expression_proportion( tissue_struct_expr: hl.expr.StringExpression, tissues_to_filter: Optional[List[str]] = None, ) -> hl.expr.Float64Expression: tissue_expr, _ = tissue_expression_struct_to_array( tissue_struct_expr=tissue_struct_expr, tissues_to_filter=tissues_to_filter ) # Calculate the mean expression proportion across tissues. return hl.mean(hl.filter(lambda e: ~hl.is_nan(e), tissue_expr))

The other option is to change the summarize function back to only returning the transcript HT and then get the transcript proportion within this function instead of passing in 2 HTs. Something like this:

def summarize_transcript_expression( mt: hl.MatrixTable, transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm", tissue_expr: Union[hl.expr.StringExpression, str] = "tissue", summary_agg_func: Optional[Callable] = None, ) -> hl.Table: if summary_agg_func is None: summary_agg_func = lambda x: hl.median(hl.agg.collect(x)) if isinstance(transcript_expression_expr, str): transcript_expression_expr = mt[transcript_expression_expr] if isinstance(tissue_expr, str): tissue_expr = mt[tissue_expr] mt = mt.group_cols_by(tissue=tissue_expr).aggregate( tx=summary_agg_func(transcript_expression_expr) ) transcript_ht = mt.rename({"tx": ""}).make_table() transcript_ht = transcript_ht.key_by("transcript_id", "gene_id") return transcript_ht def get_expression_proportion( transcript_ht: hl.Table, tissues_to_filter: Optional[List[str]] = None, ) -> hl.Table: transcript_ht = tissue_expression_ht_to_array( transcript_ht, tissues_to_filter=tissues_to_filter ) tissues = hl.eval(transcript_ht.tissues) # Calculate the sum of transcript expression by gene per tissue. gene_ht = transcript_ht.group_by("gene_id").aggregate( expression_sum=hl.agg.array_sum(transcript_ht.tissue_expression) ) # Calculate the mean expression proportion across tissues. transcript_ht = transcript_ht.annotate( exp_prop_mean=hl.mean( hl.filter( lambda e: ~hl.is_nan(e), hl.map( lambda x, y: x/y, transcript_ht.tissue_expression, gene_ht[transcript_ht.gene_id].expression_sum ) ) ) ) return transcript_ht

It sort of depends on whether it's worth storing the proportion as well (or if the HT with that info needs to be released), or if it can just be computed each time it is needed

…ion_suggestions Suggestions to get_expression_proportion PR

jkgoodrich

Just a few small requests to docstring changes

jkgoodrich · 2023-12-22T21:06:07Z

gnomad/utils/transcript_annotation.py

    :param tissue_expr: Column expression indicating tissue type. Default is 'tissue'.
    :param summary_agg_func: Optional aggregation function to use to summarize the
-        transcript expression quantification by tissue. Example: `hl.agg.mean`. Default
+        transcript expression quantification by tissue. Example: `hl.mean`. Default


It needs to use an aggregation function. There is no hl.agg.median, so that is why the hl.median(hl.agg.collect(x)) is used below. We could also switch to use hl.agg.approx_median as the default, but it's a small dataset which is why the collect here wasn't that big of an issue.

Suggested change

transcript expression quantification by tissue. Example: `hl.mean`. Default

transcript expression quantification by tissue. Example: `hl.agg.mean`. Default

gnomad/utils/transcript_annotation.py

jkgoodrich · 2023-12-22T21:10:27Z

gnomad/utils/transcript_annotation.py

+    :return: Table with expression proportion of transcript to gene per tissue
+        and mean expression proportion across tissues.


This needs to change since it no longer returns a table

jkgoodrich · 2023-12-22T21:22:55Z

gnomad/utils/transcript_annotation.py

+    The output is a Table with fields in `annotations_to_extract`,
+    each containing an array of summarized expression values or proportion
+    by tissue, where the order of tissues in the array is indicated by
+    the "tissues" global annotation.


Not exactly, it either has an annotation of tissue_expression that is an array of information per tissue, where each element of the array is the Table's row value for a given tissue.

Or it can be a Table with one array annotation for each field defined in annotations_to_extract, where each array is an array of the given field value per Tissue.

Maybe an example of the two options would make it clear?

jkgoodrich · 2023-12-22T21:23:54Z

gnomad/utils/transcript_annotation.py

    :param ht: Table with a row annotation for each tissue.
-    :param tissues: Optional list of tissues to keep in the 'tissue_expression' array.
-        Default is all non-key rows in the Table.
+    :param tissues_to_keep: Optional list of tissues to keep in the 'tissue_expression'


Te wording of this should change a little to since it isn't always 'tissue_expression'

jkgoodrich

Very close, just some small doc changes

gnomad/utils/transcript_annotation.py

jkgoodrich

LGTM!

KoalaQin added 2 commits November 17, 2023 12:20

get gene expression HT

a2a1122

Function to get expression proportion of transcript

418731b

KoalaQin self-assigned this Nov 17, 2023

KoalaQin requested a review from jkgoodrich November 17, 2023 20:28

KoalaQin assigned jkgoodrich Nov 17, 2023

jkgoodrich requested changes Dec 13, 2023

View reviewed changes

KoalaQin changed the base branch from qh/add_rsem_summary_function_suggestions to jg/add_rsem_summary_function December 13, 2023 20:06

apply review suggestions

eee4b7f

KoalaQin force-pushed the qh/get_expression_proportion branch from 415cce7 to eee4b7f Compare December 13, 2023 20:22

Merge remote-tracking branch 'origin/jg/add_rsem_summary_function' in…

092f226

…to qh/get_expression_proportion # Conflicts: # gnomad/utils/transcript_annotation.py

KoalaQin added the Changelog: new feature label Dec 13, 2023

re-apply review changes

45449bb

Base automatically changed from jg/add_rsem_summary_function to main December 13, 2023 20:52

KoalaQin requested a review from jkgoodrich December 13, 2023 21:32

jkgoodrich added the Pext label Dec 14, 2023

jkgoodrich reviewed Dec 18, 2023

View reviewed changes

apply review comments

6abda14

KoalaQin requested a review from jkgoodrich December 19, 2023 01:13

KoalaQin and others added 6 commits December 18, 2023 20:19

merge from main

c7d7de3

small edit in import

aab1654

Suggestions to get_expression_proportion

eca4299

Update gnomad/utils/transcript_annotation.py

eaeaa0c

Merge pull request #653 from broadinstitute/jg/get_expression_proport…

601085b

…ion_suggestions Suggestions to get_expression_proportion PR

re-organize the code

536b929

jkgoodrich reviewed Dec 22, 2023

View reviewed changes

address revies comments on docstrings

256964b

KoalaQin requested a review from jkgoodrich December 26, 2023 17:13

change identation with comment

972c1b8

jkgoodrich reviewed Jan 8, 2024

View reviewed changes

apply review suggestions

129ac6a

jkgoodrich approved these changes Jan 8, 2024

View reviewed changes

KoalaQin merged commit 08b1a53 into main Jan 8, 2024

jkgoodrich deleted the qh/get_expression_proportion branch January 22, 2024 14:57

	transcript expression quantification by tissue. Example: `hl.mean`. Default
	transcript expression quantification by tissue. Example: `hl.agg.mean`. Default

		:return: Table with expression proportion of transcript to gene per tissue
		and mean expression proportion across tissues.

Function to get expression proportion #649

Function to get expression proportion #649

Uh oh!

Conversation

KoalaQin commented Nov 17, 2023

Uh oh!

jkgoodrich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkgoodrich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkgoodrich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkgoodrich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jkgoodrich left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants