-
Notifications
You must be signed in to change notification settings - Fork 31
Function to get expression proportion #649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some changes based on my PR
# Calculate the proportion of expression of transcript to gene per tissue | ||
transcript_ht = transcript_ht.annotate( | ||
exp_prop=hl.or_else( | ||
transcript_ht.transcript_expression / transcript_ht.gene_expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still thinking about this one, I need to follow the original code in more detail, will add comment after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spot checked two genes on the pext values on v2 exomes, it's the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see now, I had in my mind that summarize_transcript_expression
was running summary_agg_func
on the grouped gene_id
rather than hl.agg.sum
. Now that I look at this I think we probably want to change the other function because that is confusing.
There are two ways we can do this.
The first is to change the other function so it returns a single table with structs for transcript_expression
and expression_proportion
. Then modify the other functions accordingly. Something like this:
def summarize_transcript_expression(
mt: hl.MatrixTable,
transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
if summary_agg_func is None:
summary_agg_func = lambda x: hl.median(hl.agg.collect(x))
if isinstance(transcript_expression_expr, str):
transcript_expression_expr = mt[transcript_expression_expr]
if isinstance(tissue_expr, str):
tissue_expr = mt[tissue_expr]
mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
tx=summary_agg_func(transcript_expression_expr)
)
transcript_ht = mt.rename({"tx": ""}).make_table()
transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")
tissues = list(transcript_ht.row_value)
# Calculate the sum of transcript expression by gene per tissue.
gene_ht = transcript_ht.group_by("gene_id").aggregate(
**{
tissue: hl.agg.sum(transcript_ht[tissue])
for tissue in tissues
}
)
# Calculate the proportion of expression of transcript to gene per tissue.
gene_prop = gene_ht[transcript_ht.gene_id]
transcript_ht = transcript_ht.select(
transcript_expression=transcript_ht.row_value,
expression_proportion=hl.struct(
**{
tissue: transcript_ht[tissue]/gene_prop[tissue]
for tissue in tissues
}
)
)
return transcript_ht
def tissue_expression_struct_to_array(
tissue_struct_expr: hl.expr.StringExpression,
tissues: Optional[List[str]] = None,
tissues_to_filter: Optional[List[str]] = None,
) -> Tuple[hl.expr.ArrayExpression, str]:
if tissues is None:
tissues = list(tissue_struct_expr.keys())
if tissues_to_filter is not None:
logger.info("Filtering tissues: %s", tissues_to_filter)
tissues = [t for t in tissues if t not in tissues_to_filter]
tissue_expression = hl.array([tissue_struct_expr[t] for t in tissues])
return tissue_expression, tissues
def get_expression_proportion(
tissue_struct_expr: hl.expr.StringExpression,
tissues_to_filter: Optional[List[str]] = None,
) -> hl.expr.Float64Expression:
tissue_expr, _ = tissue_expression_struct_to_array(
tissue_struct_expr=tissue_struct_expr,
tissues_to_filter=tissues_to_filter
)
# Calculate the mean expression proportion across tissues.
return hl.mean(hl.filter(lambda e: ~hl.is_nan(e), tissue_expr))
The other option is to change the summarize function back to only returning the transcript HT and then get the transcript proportion within this function instead of passing in 2 HTs. Something like this:
def summarize_transcript_expression(
mt: hl.MatrixTable,
transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
if summary_agg_func is None:
summary_agg_func = lambda x: hl.median(hl.agg.collect(x))
if isinstance(transcript_expression_expr, str):
transcript_expression_expr = mt[transcript_expression_expr]
if isinstance(tissue_expr, str):
tissue_expr = mt[tissue_expr]
mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
tx=summary_agg_func(transcript_expression_expr)
)
transcript_ht = mt.rename({"tx": ""}).make_table()
transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")
return transcript_ht
def get_expression_proportion(
transcript_ht: hl.Table,
tissues_to_filter: Optional[List[str]] = None,
) -> hl.Table:
transcript_ht = tissue_expression_ht_to_array(
transcript_ht, tissues_to_filter=tissues_to_filter
)
tissues = hl.eval(transcript_ht.tissues)
# Calculate the sum of transcript expression by gene per tissue.
gene_ht = transcript_ht.group_by("gene_id").aggregate(
expression_sum=hl.agg.array_sum(transcript_ht.tissue_expression)
)
# Calculate the mean expression proportion across tissues.
transcript_ht = transcript_ht.annotate(
exp_prop_mean=hl.mean(
hl.filter(
lambda e: ~hl.is_nan(e),
hl.map(
lambda x, y: x/y,
transcript_ht.tissue_expression,
gene_ht[transcript_ht.gene_id].expression_sum
)
)
)
)
return transcript_ht
It sort of depends on whether it's worth storing the proportion as well (or if the HT with that info needs to be released), or if it can just be computed each time it is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im my original code, I put transcript_expression, gene_expression, and expression_proportion in one HT, it was clearer to me that way. And pext scores on v2 exomes are actually showing the expression_proportion, unless we will change that.
415cce7
to
eee4b7f
Compare
…to qh/get_expression_proportion # Conflicts: # gnomad/utils/transcript_annotation.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I finally got around to looking at this again and figured out why I was confused
# Calculate the proportion of expression of transcript to gene per tissue | ||
transcript_ht = transcript_ht.annotate( | ||
exp_prop=hl.or_else( | ||
transcript_ht.transcript_expression / transcript_ht.gene_expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see now, I had in my mind that summarize_transcript_expression
was running summary_agg_func
on the grouped gene_id
rather than hl.agg.sum
. Now that I look at this I think we probably want to change the other function because that is confusing.
There are two ways we can do this.
The first is to change the other function so it returns a single table with structs for transcript_expression
and expression_proportion
. Then modify the other functions accordingly. Something like this:
def summarize_transcript_expression(
mt: hl.MatrixTable,
transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
if summary_agg_func is None:
summary_agg_func = lambda x: hl.median(hl.agg.collect(x))
if isinstance(transcript_expression_expr, str):
transcript_expression_expr = mt[transcript_expression_expr]
if isinstance(tissue_expr, str):
tissue_expr = mt[tissue_expr]
mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
tx=summary_agg_func(transcript_expression_expr)
)
transcript_ht = mt.rename({"tx": ""}).make_table()
transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")
tissues = list(transcript_ht.row_value)
# Calculate the sum of transcript expression by gene per tissue.
gene_ht = transcript_ht.group_by("gene_id").aggregate(
**{
tissue: hl.agg.sum(transcript_ht[tissue])
for tissue in tissues
}
)
# Calculate the proportion of expression of transcript to gene per tissue.
gene_prop = gene_ht[transcript_ht.gene_id]
transcript_ht = transcript_ht.select(
transcript_expression=transcript_ht.row_value,
expression_proportion=hl.struct(
**{
tissue: transcript_ht[tissue]/gene_prop[tissue]
for tissue in tissues
}
)
)
return transcript_ht
def tissue_expression_struct_to_array(
tissue_struct_expr: hl.expr.StringExpression,
tissues: Optional[List[str]] = None,
tissues_to_filter: Optional[List[str]] = None,
) -> Tuple[hl.expr.ArrayExpression, str]:
if tissues is None:
tissues = list(tissue_struct_expr.keys())
if tissues_to_filter is not None:
logger.info("Filtering tissues: %s", tissues_to_filter)
tissues = [t for t in tissues if t not in tissues_to_filter]
tissue_expression = hl.array([tissue_struct_expr[t] for t in tissues])
return tissue_expression, tissues
def get_expression_proportion(
tissue_struct_expr: hl.expr.StringExpression,
tissues_to_filter: Optional[List[str]] = None,
) -> hl.expr.Float64Expression:
tissue_expr, _ = tissue_expression_struct_to_array(
tissue_struct_expr=tissue_struct_expr,
tissues_to_filter=tissues_to_filter
)
# Calculate the mean expression proportion across tissues.
return hl.mean(hl.filter(lambda e: ~hl.is_nan(e), tissue_expr))
The other option is to change the summarize function back to only returning the transcript HT and then get the transcript proportion within this function instead of passing in 2 HTs. Something like this:
def summarize_transcript_expression(
mt: hl.MatrixTable,
transcript_expression_expr: Union[hl.expr.NumericExpression, str] = "transcript_tpm",
tissue_expr: Union[hl.expr.StringExpression, str] = "tissue",
summary_agg_func: Optional[Callable] = None,
) -> hl.Table:
if summary_agg_func is None:
summary_agg_func = lambda x: hl.median(hl.agg.collect(x))
if isinstance(transcript_expression_expr, str):
transcript_expression_expr = mt[transcript_expression_expr]
if isinstance(tissue_expr, str):
tissue_expr = mt[tissue_expr]
mt = mt.group_cols_by(tissue=tissue_expr).aggregate(
tx=summary_agg_func(transcript_expression_expr)
)
transcript_ht = mt.rename({"tx": ""}).make_table()
transcript_ht = transcript_ht.key_by("transcript_id", "gene_id")
return transcript_ht
def get_expression_proportion(
transcript_ht: hl.Table,
tissues_to_filter: Optional[List[str]] = None,
) -> hl.Table:
transcript_ht = tissue_expression_ht_to_array(
transcript_ht, tissues_to_filter=tissues_to_filter
)
tissues = hl.eval(transcript_ht.tissues)
# Calculate the sum of transcript expression by gene per tissue.
gene_ht = transcript_ht.group_by("gene_id").aggregate(
expression_sum=hl.agg.array_sum(transcript_ht.tissue_expression)
)
# Calculate the mean expression proportion across tissues.
transcript_ht = transcript_ht.annotate(
exp_prop_mean=hl.mean(
hl.filter(
lambda e: ~hl.is_nan(e),
hl.map(
lambda x, y: x/y,
transcript_ht.tissue_expression,
gene_ht[transcript_ht.gene_id].expression_sum
)
)
)
)
return transcript_ht
It sort of depends on whether it's worth storing the proportion as well (or if the HT with that info needs to be released), or if it can just be computed each time it is needed
…ion_suggestions Suggestions to get_expression_proportion PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small requests to docstring changes
:param tissue_expr: Column expression indicating tissue type. Default is 'tissue'. | ||
:param summary_agg_func: Optional aggregation function to use to summarize the | ||
transcript expression quantification by tissue. Example: `hl.agg.mean`. Default | ||
transcript expression quantification by tissue. Example: `hl.mean`. Default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to use an aggregation function. There is no hl.agg.median, so that is why the hl.median(hl.agg.collect(x))
is used below. We could also switch to use hl.agg.approx_median
as the default, but it's a small dataset which is why the collect here wasn't that big of an issue.
transcript expression quantification by tissue. Example: `hl.mean`. Default | |
transcript expression quantification by tissue. Example: `hl.agg.mean`. Default |
:return: Table with expression proportion of transcript to gene per tissue | ||
and mean expression proportion across tissues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to change since it no longer returns a table
The output is a Table with fields in `annotations_to_extract`, | ||
each containing an array of summarized expression values or proportion | ||
by tissue, where the order of tissues in the array is indicated by | ||
the "tissues" global annotation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly, it either has an annotation of tissue_expression
that is an array of information per tissue, where each element of the array is the Table's row value for a given tissue.
Or it can be a Table with one array annotation for each field defined in annotations_to_extract
, where each array is an array of the given field value per Tissue.
Maybe an example of the two options would make it clear?
:param ht: Table with a row annotation for each tissue. | ||
:param tissues: Optional list of tissues to keep in the 'tissue_expression' array. | ||
Default is all non-key rows in the Table. | ||
:param tissues_to_keep: Optional list of tissues to keep in the 'tissue_expression' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Te wording of this should change a little to since it isn't always 'tissue_expression'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very close, just some small doc changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This function is similar to this, by taking the proportions of transcript expression to gene expression per tissue, and a mean proportion. It's extended to use both array-format expression data and tissue-as-row expression data as input.