Skip to content

Conversation

stemangiola
Copy link
Collaborator

No description provided.

…d refactor update_SE_from_tibble to utilize these new functions. Include tests for the new extraction functions to ensure correct functionality.
…ate_features and mutate_samples to modify_features and modify_samples, respectively. Introduce a new helper function, mutate_assay, for handling assay-only mutations. Add query scope analysis functions to determine the target data types for mutate operations. Remove deprecated documentation for the old mutate functions.
Enhance the analyze_query_scope_mutate function to include a new dependency analysis step that detects mixed operations in mutate expressions. Introduce analyze_expression_dependencies function to evaluate the scope of variables referenced in expressions, determining if they span multiple data types (assays, colData, rowData). This addition improves the accuracy of query scope analysis by allowing for more nuanced handling of mixed data types.
Enhance the mutate.SummarizedExperiment function to support mixed scope operations by introducing a new helper function, modify_se_plyxp, which converts expressions to plyxp form for efficient execution. Add a utility function, convert_to_plyxp_form, to handle the conversion of colData and rowData references. This update improves the flexibility and performance of mutate operations involving multiple data types.
Revise the mutate.SummarizedExperiment function to improve scope handling by consolidating logic into a single case_when structure. Introduce a new internal helper function, mutate_via_tibble, for efficient tibble conversion during mutations. This update enhances code clarity and maintains functionality across various data types.
…ures

Introduce a new function, decompose_tidy_operation, to handle the decomposition of dplyr operations into individual steps for better analysis. Update the mutate.SummarizedExperiment function to utilize this new decomposition approach, improving the handling of mixed scope operations. Additionally, add a new utility function, map_to_plyxp, to convert mixed scope expressions into plyxp form for efficient execution. This update enhances the flexibility and performance of mutate operations across various data types.
Introduce a new test file for the mutate function, covering a wide range of scenarios including simple and complex operations on colData, rowData, and assays. The tests validate functionality, scope detection, error handling, and integration with other dplyr operations, ensuring robust performance and accuracy in various use cases.
Update the mutate.SummarizedExperiment function to store the latest mutate scope report in the metadata of the result. This addition facilitates testing and introspection of the mutate operations, improving the overall functionality and usability of the mutate feature.
@stemangiola stemangiola linked an issue Aug 31, 2025 that may be closed by this pull request
Introduce a new vignette titled "Speed-up `tidySummarizedExperiment` with `plyxp`" that benchmarks various `mutate()` scenarios across two branches of the repository. The vignette outlines the motivation, design principles, and setup for running benchmarks, including helper functions and visualization of results. This addition supports ongoing efforts to enhance the performance of `mutate()` operations in the `tidySummarizedExperiment` package.
Refactor the filter.SummarizedExperiment function to support decomposition of filter queries and analyze the scope of filtering operations. Introduce a new internal function, analyze_query_scope_filter, to determine whether filters target colData, rowData, or assays. Update the modify_se_plyxp function to handle additional arguments, improving flexibility. Additionally, add comprehensive tests to validate the new features and ensure correct scope reporting in metadata.
…marizedExperiment

Introduce the analyze_query_scope_filter function to assess the scope of filter queries, determining whether they target colData, rowData, or assays. Integrate this analysis into the filter.SummarizedExperiment function to enhance its capability in handling mixed operations. Additionally, remove the previous implementation of analyze_query_scope_filter from query_scope_analysis.R to streamline the codebase. Update tests and vignettes to reflect these changes and ensure comprehensive coverage of new filtering functionalities.
Remove deprecated filter and join methods from dplyr_methods.R, streamlining the codebase. Introduce a new rename.SummarizedExperiment function in rename.R, enhancing the ability to rename columns in colData and rowData while ensuring proper handling of special/view-only columns. This update improves the overall functionality and usability of the package.
…ty functions

Remove deprecated join methods from dplyr_methods.R to streamline the codebase. Introduce a new join utility function, join_efficient_for_SE, in join.R to handle various join operations efficiently based on the scope of join keys. Update tests to validate the functionality of left_join, inner_join, right_join, and full_join for SummarizedExperiment, ensuring comprehensive coverage and improved performance.
Update the join_efficient_for_SE function in join.R to improve clarity and efficiency in join operations. Modify comments for better understanding and remove unnecessary lines. Enhance test cases for left_join, inner_join, right_join, and full_join to ensure accurate metadata reporting of join scopes. Clean up unused code in test files to streamline the testing process.
Introduce a new select.SummarizedExperiment function that allows users to select columns from colData, rowData, and assays while analyzing the scope of the selection. Implement an internal function, analyze_query_scope_select, to determine the domains affected by the selection. Add comprehensive tests to validate the new functionality and ensure correct metadata reporting. Update the benchmark vignette to include select scenarios, enhancing the overall usability and performance of the package.
… comments from dplyr_methods.R to streamline the codebase. This change enhances clarity and prepares for future improvements in column selection functionality.
Introduce a parameter to enable caching of benchmark results, improving efficiency by loading existing results from a file if available. This change enhances the performance of the benchmarking process and reduces redundant computations.
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces "Query to slot routines" which refactors tidySummarizedExperiment's dplyr operations to use scope analysis for improved performance. The changes decompose multi-expression operations, analyze query scope to determine target data types (colData, rowData, assays), and implement optimized execution paths including plyxp-backed mixed operations.

  • Query scope analysis system to classify operations by target data type
  • Decomposition of multi-expression operations for better routing
  • Optimized execution paths with plyxp support for mixed-scope operations

Reviewed Changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
vignettes/introduction.Rmd Add dplyr dependency
vignettes/benchmark-scoping.Rmd Comprehensive benchmarking suite for new implementation
tests/testthat/test-*.R New test suites for select, mutate, join, filter, and helper functions
R/*.R Major refactoring: separate files for each operation with scope analysis
man/*.Rd Updated documentation reflecting new file organization
DESCRIPTION Add plyxp dependency

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 357 to +362
# Comply to CRAN notes
. <- NULL

# Get the colnames of samples and feature datasets
colnames_col <-
colnames(colData(se)) %>%
c(s_(se)$name) %>%

# Forcefully add the column I know the source. This is useful in nesting
# where a unique value cannot be linked to sample or feature
c(names(column_belonging[column_belonging == s_(se)$name]))

colnames_row <- se %>%
when(
.hasSlot(., "rowData") | .hasSlot(., "elementMetadata") ~ colnames(rowData(.)),
TRUE ~ c()
) %>%
c(f_(se)$name) %>%

# Forcefully add the column I know the source. This is useful in nesting
# where a unique value cannot be linked to sample or feature
c(names(column_belonging[column_belonging == f_(se)$name]))

special_columns <- get_special_columns(
# Decrease the size of the dataset
se[1:min(100, nrow(se)), min(1, ncol(se)):min(20, ncol(se))]
)

# This is used if I have one column with one value that can be mapped to rows and columns
new_colnames_col = c()

# This condiction is because if I don't have any samples, the new column
# could be mapped to samples and return NA in the final SE
if(ncol(se) > 0)
new_colnames_col <-
.data_mutated %>%
select_if(!colnames(.) %in% setdiff(colnames_col, s_(se)$name)) %>%

# Eliminate special columns that are read only. Assays
select_if(!colnames(.) %in% special_columns) %>%
select_if(!colnames(.) %in% colnames_row) %>%
# Replace for subset
select(!!s_(se)$symbol, get_subset_columns(., !!s_(se)$symbol)) %>%
colnames()

col_data <-
.data_mutated %>%

select(c(colnames_col, new_colnames_col)) %>%


# Filter Sample is NA from SE that have 0 samples
filter(!is.na(!!s_(se)$symbol))

# This works even if I have 0 samples
duplicated_samples = col_data |> pull(!!s_(se)$symbol) %>% duplicated()
if(duplicated_samples |> which() |> length() > 0)
# Make fast distinct()
col_data = col_data |> filter(!duplicated_samples)

col_data =
col_data |>

# In case unitary SE subset does not work
data.frame(row.names = pull(col_data, !!s_(se)$symbol), check.names = FALSE) %>%
select(-!!s_(se)$symbol) %>%
DataFrame(check.names = FALSE)
# Extract column data
col_data <- extract_col_data(.data_mutated, se, column_belonging)
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment 'Comply to CRAN notes' followed by '. <- NULL' is repeated throughout the codebase. Consider creating a utility function or documenting this pattern more clearly, as the purpose isn't immediately obvious.

Copilot uses AI. Check for mistakes.

Comment on lines +170 to +188
mutate_assay <- function(.data, ...) {
# Get the mutation expressions
dots <- enquos(...)

# For each new assay column, add it directly to assays
for (i in seq_along(dots)) {
new_assay_name <- names(dots)[i]
expr <- dots[[i]]

# Evaluate the expression to create the new assay
# This assumes the expression references existing assay names
new_assay_data <- rlang::eval_tidy(expr, data = as.list(assays(.data)))

# Add the new assay
assays(.data)[[new_assay_name]] <- new_assay_data
}

return(.data)
}
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function assumes expression references existing assay names but doesn't validate this assumption. If the expression references non-existent assays, eval_tidy will fail with unclear error messages.

Copilot uses AI. Check for mistakes.

Comment on lines +15 to +17
extract_col_data <- function(.data_mutated, se, column_belonging = NULL,
colnames_col = NULL, special_columns = NULL,
colnames_row = NULL) {
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The function has many optional parameters with defaults computed inside the function. Consider using a more explicit parameter design or providing clearer documentation about when each parameter should be provided.

Copilot uses AI. Check for mistakes.

Comment on lines +7 to +9
test_that("extract_col_data works correctly", {
# Use testthat edition 3
local_edition(3)
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Setting testthat edition should be done at the package level (e.g., in testthat.R) rather than in individual test files to ensure consistency across all tests.

Copilot uses AI. Check for mistakes.

Comment on lines +62 to +71
#' @examples
#' \dontrun{
#' library(dplyr)
#' library(airway)
#' data(airway)
#'
#' # Basic mutate decomposition
#' mutate_fn <- decompose_tidy_operation("mutate", new_dex = dex, new_cell = cell)
#' attr(mutate_fn, "pipeline_text")
#' # "mutate(new_dex = dex) |> mutate(new_cell = cell)"
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The examples use \dontrun{} but these appear to be valid executable examples that would help users understand the functionality. Consider using \donttest{} instead if the concern is execution time.

Copilot uses AI. Check for mistakes.

Comment on lines +212 to +214
### Visualise with ggplot boxplots

We cap at 1 second to avoid
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence 'We cap at 1 second to avoid' is incomplete and should be finished or removed.

Suggested change
### Visualise with ggplot boxplots
We cap at 1 second to avoid
We cap at 1 second to avoid skewing the plot with extreme outliers and to improve readability.

Copilot uses AI. Check for mistakes.

…in introduction vignette

Replace devtools with remotes for GitHub installation and specify dplyr::slice for clarity in the examples. This enhances the documentation for better user guidance.
…ng dplyr::rename for improved clarity in examples.
…eamline codebase and improve maintainability.
@stemangiola
Copy link
Collaborator Author

@jtlandis @mikelove if you have throughput to test the new tidySE before I merge this branch will be amazing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plyxp backend
1 participant