Skip to content

Conversation

KoalaQin
Copy link
Contributor

This function is similar to this, by taking the proportions of transcript expression to gene expression per tissue, and a mean proportion. It's extended to use both array-format expression data and tissue-as-row expression data as input.

@KoalaQin KoalaQin self-assigned this Nov 17, 2023
@KoalaQin KoalaQin requested a review from jkgoodrich November 17, 2023 20:28
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes based on my PR

# Calculate the proportion of expression of transcript to gene per tissue
transcript_ht = transcript_ht.annotate(
exp_prop=hl.or_else(
transcript_ht.transcript_expression / transcript_ht.gene_expression,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still thinking about this one, I need to follow the original code in more detail, will add comment after

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spot checked two genes on the pext values on v2 exomes, it's the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now, I had in my mind that summarize_transcript_expression was running summary_agg_func on the grouped gene_id rather than hl.agg.sum. Now that I look at this I think we probably want to change the other function because that is confusing.

There are two ways we can do this.

The first is to change the other function so it returns a single table with structs for transcript_expression and expression_proportion. Then modify the other functions accordingly. Something like this:

def summarize_transcript_expression(
    mt: hl.MatrixTable,
    transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
    tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
    summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
    if summary_agg_func is None:
        summary_agg_func = lambda x: hl.median(hl.agg.collect(x))

    if isinstance(transcript_expression_expr, str):
        transcript_expression_expr = mt[transcript_expression_expr]

    if isinstance(tissue_expr, str):
        tissue_expr = mt[tissue_expr]

    mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
        tx=summary_agg_func(transcript_expression_expr)
    )

    transcript_ht = mt.rename({"tx": ""}).make_table()
    transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")

    tissues = list(transcript_ht.row_value)
    
    # Calculate the sum of transcript expression by gene per tissue.
    gene_ht = transcript_ht.group_by("gene_id").aggregate(
        **{
            tissue: hl.agg.sum(transcript_ht[tissue])
            for tissue in tissues
        }
    )

    # Calculate the proportion of expression of transcript to gene per tissue.
    gene_prop = gene_ht[transcript_ht.gene_id]
    transcript_ht = transcript_ht.select(
        transcript_expression=transcript_ht.row_value,
        expression_proportion=hl.struct(
            **{
                tissue: transcript_ht[tissue]/gene_prop[tissue]
                for tissue in tissues
            }
        )
    )
        
    return transcript_ht


def tissue_expression_struct_to_array(
    tissue_struct_expr: hl.expr.StringExpression,
    tissues: Optional[List[str]] = None,
    tissues_to_filter: Optional[List[str]] = None,
) -> Tuple[hl.expr.ArrayExpression, str]:
    if tissues is None:
        tissues = list(tissue_struct_expr.keys())

    if tissues_to_filter is not None:
        logger.info("Filtering tissues: %s", tissues_to_filter)
        tissues = [t for t in tissues if t not in tissues_to_filter]

    tissue_expression = hl.array([tissue_struct_expr[t] for t in tissues])

    return tissue_expression, tissues


def get_expression_proportion(
    tissue_struct_expr: hl.expr.StringExpression,
    tissues_to_filter: Optional[List[str]] = None,
) -> hl.expr.Float64Expression:
    tissue_expr, _ = tissue_expression_struct_to_array(
        tissue_struct_expr=tissue_struct_expr, 
        tissues_to_filter=tissues_to_filter
    )
    
    # Calculate the mean expression proportion across tissues.
    return hl.mean(hl.filter(lambda e: ~hl.is_nan(e), tissue_expr))

The other option is to change the summarize function back to only returning the transcript HT and then get the transcript proportion within this function instead of passing in 2 HTs. Something like this:

def summarize_transcript_expression(
    mt: hl.MatrixTable,
    transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
    tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
    summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
    if summary_agg_func is None:
        summary_agg_func = lambda x: hl.median(hl.agg.collect(x))

    if isinstance(transcript_expression_expr, str):
        transcript_expression_expr = mt[transcript_expression_expr]

    if isinstance(tissue_expr, str):
        tissue_expr = mt[tissue_expr]

    mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
        tx=summary_agg_func(transcript_expression_expr)
    )

    transcript_ht = mt.rename({"tx": ""}).make_table()
    transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")
        
    return transcript_ht


def get_expression_proportion(
    transcript_ht: hl.Table,
    tissues_to_filter: Optional[List[str]] = None,
) -> hl.Table:
    transcript_ht = tissue_expression_ht_to_array(
        transcript_ht, tissues_to_filter=tissues_to_filter
    )
    
    tissues = hl.eval(transcript_ht.tissues)
    
    # Calculate the sum of transcript expression by gene per tissue.
    gene_ht = transcript_ht.group_by("gene_id").aggregate(
        expression_sum=hl.agg.array_sum(transcript_ht.tissue_expression)
    )

    # Calculate the mean expression proportion across tissues.
    transcript_ht = transcript_ht.annotate(
        exp_prop_mean=hl.mean(
            hl.filter(
                lambda e: ~hl.is_nan(e), 
                hl.map(
                    lambda x, y: x/y,
                    transcript_ht.tissue_expression,
                    gene_ht[transcript_ht.gene_id].expression_sum
                )
            )
        )
    )
    
    return transcript_ht

It sort of depends on whether it's worth storing the proportion as well (or if the HT with that info needs to be released), or if it can just be computed each time it is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im my original code, I put transcript_expression, gene_expression, and expression_proportion in one HT, it was clearer to me that way. And pext scores on v2 exomes are actually showing the expression_proportion, unless we will change that.

@KoalaQin KoalaQin changed the base branch from qh/add_rsem_summary_function_suggestions to jg/add_rsem_summary_function December 13, 2023 20:06
@KoalaQin KoalaQin force-pushed the qh/get_expression_proportion branch from 415cce7 to eee4b7f Compare December 13, 2023 20:22
…to qh/get_expression_proportion

# Conflicts:
#	gnomad/utils/transcript_annotation.py
Base automatically changed from jg/add_rsem_summary_function to main December 13, 2023 20:52
@KoalaQin KoalaQin requested a review from jkgoodrich December 13, 2023 21:32
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I finally got around to looking at this again and figured out why I was confused

# Calculate the proportion of expression of transcript to gene per tissue
transcript_ht = transcript_ht.annotate(
exp_prop=hl.or_else(
transcript_ht.transcript_expression / transcript_ht.gene_expression,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now, I had in my mind that summarize_transcript_expression was running summary_agg_func on the grouped gene_id rather than hl.agg.sum. Now that I look at this I think we probably want to change the other function because that is confusing.

There are two ways we can do this.

The first is to change the other function so it returns a single table with structs for transcript_expression and expression_proportion. Then modify the other functions accordingly. Something like this:

def summarize_transcript_expression(
    mt: hl.MatrixTable,
    transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
    tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
    summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
    if summary_agg_func is None:
        summary_agg_func = lambda x: hl.median(hl.agg.collect(x))

    if isinstance(transcript_expression_expr, str):
        transcript_expression_expr = mt[transcript_expression_expr]

    if isinstance(tissue_expr, str):
        tissue_expr = mt[tissue_expr]

    mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
        tx=summary_agg_func(transcript_expression_expr)
    )

    transcript_ht = mt.rename({"tx": ""}).make_table()
    transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")

    tissues = list(transcript_ht.row_value)
    
    # Calculate the sum of transcript expression by gene per tissue.
    gene_ht = transcript_ht.group_by("gene_id").aggregate(
        **{
            tissue: hl.agg.sum(transcript_ht[tissue])
            for tissue in tissues
        }
    )

    # Calculate the proportion of expression of transcript to gene per tissue.
    gene_prop = gene_ht[transcript_ht.gene_id]
    transcript_ht = transcript_ht.select(
        transcript_expression=transcript_ht.row_value,
        expression_proportion=hl.struct(
            **{
                tissue: transcript_ht[tissue]/gene_prop[tissue]
                for tissue in tissues
            }
        )
    )
        
    return transcript_ht


def tissue_expression_struct_to_array(
    tissue_struct_expr: hl.expr.StringExpression,
    tissues: Optional[List[str]] = None,
    tissues_to_filter: Optional[List[str]] = None,
) -> Tuple[hl.expr.ArrayExpression, str]:
    if tissues is None:
        tissues = list(tissue_struct_expr.keys())

    if tissues_to_filter is not None:
        logger.info("Filtering tissues: %s", tissues_to_filter)
        tissues = [t for t in tissues if t not in tissues_to_filter]

    tissue_expression = hl.array([tissue_struct_expr[t] for t in tissues])

    return tissue_expression, tissues


def get_expression_proportion(
    tissue_struct_expr: hl.expr.StringExpression,
    tissues_to_filter: Optional[List[str]] = None,
) -> hl.expr.Float64Expression:
    tissue_expr, _ = tissue_expression_struct_to_array(
        tissue_struct_expr=tissue_struct_expr, 
        tissues_to_filter=tissues_to_filter
    )
    
    # Calculate the mean expression proportion across tissues.
    return hl.mean(hl.filter(lambda e: ~hl.is_nan(e), tissue_expr))

The other option is to change the summarize function back to only returning the transcript HT and then get the transcript proportion within this function instead of passing in 2 HTs. Something like this:

def summarize_transcript_expression(
    mt: hl.MatrixTable,
    transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
    tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
    summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
    if summary_agg_func is None:
        summary_agg_func = lambda x: hl.median(hl.agg.collect(x))

    if isinstance(transcript_expression_expr, str):
        transcript_expression_expr = mt[transcript_expression_expr]

    if isinstance(tissue_expr, str):
        tissue_expr = mt[tissue_expr]

    mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
        tx=summary_agg_func(transcript_expression_expr)
    )

    transcript_ht = mt.rename({"tx": ""}).make_table()
    transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")
        
    return transcript_ht


def get_expression_proportion(
    transcript_ht: hl.Table,
    tissues_to_filter: Optional[List[str]] = None,
) -> hl.Table:
    transcript_ht = tissue_expression_ht_to_array(
        transcript_ht, tissues_to_filter=tissues_to_filter
    )
    
    tissues = hl.eval(transcript_ht.tissues)
    
    # Calculate the sum of transcript expression by gene per tissue.
    gene_ht = transcript_ht.group_by("gene_id").aggregate(
        expression_sum=hl.agg.array_sum(transcript_ht.tissue_expression)
    )

    # Calculate the mean expression proportion across tissues.
    transcript_ht = transcript_ht.annotate(
        exp_prop_mean=hl.mean(
            hl.filter(
                lambda e: ~hl.is_nan(e), 
                hl.map(
                    lambda x, y: x/y,
                    transcript_ht.tissue_expression,
                    gene_ht[transcript_ht.gene_id].expression_sum
                )
            )
        )
    )
    
    return transcript_ht

It sort of depends on whether it's worth storing the proportion as well (or if the HT with that info needs to be released), or if it can just be computed each time it is needed

@KoalaQin KoalaQin requested a review from jkgoodrich December 19, 2023 01:13
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small requests to docstring changes

:param tissue_expr: Column expression indicating tissue type. Default is 'tissue'.
:param summary_agg_func: Optional aggregation function to use to summarize the
transcript expression quantification by tissue. Example: `hl.agg.mean`. Default
transcript expression quantification by tissue. Example: `hl.mean`. Default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to use an aggregation function. There is no hl.agg.median, so that is why the hl.median(hl.agg.collect(x)) is used below. We could also switch to use hl.agg.approx_median as the default, but it's a small dataset which is why the collect here wasn't that big of an issue.

Suggested change
transcript expression quantification by tissue. Example: `hl.mean`. Default
transcript expression quantification by tissue. Example: `hl.agg.mean`. Default

Comment on lines 77 to 78
:return: Table with expression proportion of transcript to gene per tissue
and mean expression proportion across tissues.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to change since it no longer returns a table

Comment on lines 141 to 144
The output is a Table with fields in `annotations_to_extract`,
each containing an array of summarized expression values or proportion
by tissue, where the order of tissues in the array is indicated by
the "tissues" global annotation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly, it either has an annotation of tissue_expression that is an array of information per tissue, where each element of the array is the Table's row value for a given tissue.

Or it can be a Table with one array annotation for each field defined in annotations_to_extract, where each array is an array of the given field value per Tissue.

Maybe an example of the two options would make it clear?

:param ht: Table with a row annotation for each tissue.
:param tissues: Optional list of tissues to keep in the 'tissue_expression' array.
Default is all non-key rows in the Table.
:param tissues_to_keep: Optional list of tissues to keep in the 'tissue_expression'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Te wording of this should change a little to since it isn't always 'tissue_expression'

@KoalaQin KoalaQin requested a review from jkgoodrich December 26, 2023 17:13
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very close, just some small doc changes

Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@KoalaQin KoalaQin merged commit 08b1a53 into main Jan 8, 2024
@jkgoodrich jkgoodrich deleted the qh/get_expression_proportion branch January 22, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants