-
Notifications
You must be signed in to change notification settings - Fork 7
Query to slot routines #108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…d refactor update_SE_from_tibble to utilize these new functions. Include tests for the new extraction functions to ensure correct functionality.
…ate_features and mutate_samples to modify_features and modify_samples, respectively. Introduce a new helper function, mutate_assay, for handling assay-only mutations. Add query scope analysis functions to determine the target data types for mutate operations. Remove deprecated documentation for the old mutate functions.
Enhance the analyze_query_scope_mutate function to include a new dependency analysis step that detects mixed operations in mutate expressions. Introduce analyze_expression_dependencies function to evaluate the scope of variables referenced in expressions, determining if they span multiple data types (assays, colData, rowData). This addition improves the accuracy of query scope analysis by allowing for more nuanced handling of mixed data types.
Enhance the mutate.SummarizedExperiment function to support mixed scope operations by introducing a new helper function, modify_se_plyxp, which converts expressions to plyxp form for efficient execution. Add a utility function, convert_to_plyxp_form, to handle the conversion of colData and rowData references. This update improves the flexibility and performance of mutate operations involving multiple data types.
Revise the mutate.SummarizedExperiment function to improve scope handling by consolidating logic into a single case_when structure. Introduce a new internal helper function, mutate_via_tibble, for efficient tibble conversion during mutations. This update enhances code clarity and maintains functionality across various data types.
…ures Introduce a new function, decompose_tidy_operation, to handle the decomposition of dplyr operations into individual steps for better analysis. Update the mutate.SummarizedExperiment function to utilize this new decomposition approach, improving the handling of mixed scope operations. Additionally, add a new utility function, map_to_plyxp, to convert mixed scope expressions into plyxp form for efficient execution. This update enhances the flexibility and performance of mutate operations across various data types.
Introduce a new test file for the mutate function, covering a wide range of scenarios including simple and complex operations on colData, rowData, and assays. The tests validate functionality, scope detection, error handling, and integration with other dplyr operations, ensuring robust performance and accuracy in various use cases.
Update the mutate.SummarizedExperiment function to store the latest mutate scope report in the metadata of the result. This addition facilitates testing and introspection of the mutate operations, improving the overall functionality and usability of the mutate feature.
Introduce a new vignette titled "Speed-up `tidySummarizedExperiment` with `plyxp`" that benchmarks various `mutate()` scenarios across two branches of the repository. The vignette outlines the motivation, design principles, and setup for running benchmarks, including helper functions and visualization of results. This addition supports ongoing efforts to enhance the performance of `mutate()` operations in the `tidySummarizedExperiment` package.
Refactor the filter.SummarizedExperiment function to support decomposition of filter queries and analyze the scope of filtering operations. Introduce a new internal function, analyze_query_scope_filter, to determine whether filters target colData, rowData, or assays. Update the modify_se_plyxp function to handle additional arguments, improving flexibility. Additionally, add comprehensive tests to validate the new features and ensure correct scope reporting in metadata.
…marizedExperiment Introduce the analyze_query_scope_filter function to assess the scope of filter queries, determining whether they target colData, rowData, or assays. Integrate this analysis into the filter.SummarizedExperiment function to enhance its capability in handling mixed operations. Additionally, remove the previous implementation of analyze_query_scope_filter from query_scope_analysis.R to streamline the codebase. Update tests and vignettes to reflect these changes and ensure comprehensive coverage of new filtering functionalities.
Remove deprecated filter and join methods from dplyr_methods.R, streamlining the codebase. Introduce a new rename.SummarizedExperiment function in rename.R, enhancing the ability to rename columns in colData and rowData while ensuring proper handling of special/view-only columns. This update improves the overall functionality and usability of the package.
…ty functions Remove deprecated join methods from dplyr_methods.R to streamline the codebase. Introduce a new join utility function, join_efficient_for_SE, in join.R to handle various join operations efficiently based on the scope of join keys. Update tests to validate the functionality of left_join, inner_join, right_join, and full_join for SummarizedExperiment, ensuring comprehensive coverage and improved performance.
Update the join_efficient_for_SE function in join.R to improve clarity and efficiency in join operations. Modify comments for better understanding and remove unnecessary lines. Enhance test cases for left_join, inner_join, right_join, and full_join to ensure accurate metadata reporting of join scopes. Clean up unused code in test files to streamline the testing process.
Introduce a new select.SummarizedExperiment function that allows users to select columns from colData, rowData, and assays while analyzing the scope of the selection. Implement an internal function, analyze_query_scope_select, to determine the domains affected by the selection. Add comprehensive tests to validate the new functionality and ensure correct metadata reporting. Update the benchmark vignette to include select scenarios, enhancing the overall usability and performance of the package.
… comments from dplyr_methods.R to streamline the codebase. This change enhances clarity and prepares for future improvements in column selection functionality.
Introduce a parameter to enable caching of benchmark results, improving efficiency by loading existing results from a file if available. This change enhances the performance of the benchmarking process and reduces redundant computations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces "Query to slot routines" which refactors tidySummarizedExperiment's dplyr operations to use scope analysis for improved performance. The changes decompose multi-expression operations, analyze query scope to determine target data types (colData, rowData, assays), and implement optimized execution paths including plyxp-backed mixed operations.
- Query scope analysis system to classify operations by target data type
- Decomposition of multi-expression operations for better routing
- Optimized execution paths with plyxp support for mixed-scope operations
Reviewed Changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 6 comments.
Show a summary per file
File | Description |
---|---|
vignettes/introduction.Rmd | Add dplyr dependency |
vignettes/benchmark-scoping.Rmd | Comprehensive benchmarking suite for new implementation |
tests/testthat/test-*.R | New test suites for select, mutate, join, filter, and helper functions |
R/*.R | Major refactoring: separate files for each operation with scope analysis |
man/*.Rd | Updated documentation reflecting new file organization |
DESCRIPTION | Add plyxp dependency |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
# Comply to CRAN notes | ||
. <- NULL | ||
|
||
# Get the colnames of samples and feature datasets | ||
colnames_col <- | ||
colnames(colData(se)) %>% | ||
c(s_(se)$name) %>% | ||
|
||
# Forcefully add the column I know the source. This is useful in nesting | ||
# where a unique value cannot be linked to sample or feature | ||
c(names(column_belonging[column_belonging == s_(se)$name])) | ||
|
||
colnames_row <- se %>% | ||
when( | ||
.hasSlot(., "rowData") | .hasSlot(., "elementMetadata") ~ colnames(rowData(.)), | ||
TRUE ~ c() | ||
) %>% | ||
c(f_(se)$name) %>% | ||
|
||
# Forcefully add the column I know the source. This is useful in nesting | ||
# where a unique value cannot be linked to sample or feature | ||
c(names(column_belonging[column_belonging == f_(se)$name])) | ||
|
||
special_columns <- get_special_columns( | ||
# Decrease the size of the dataset | ||
se[1:min(100, nrow(se)), min(1, ncol(se)):min(20, ncol(se))] | ||
) | ||
|
||
# This is used if I have one column with one value that can be mapped to rows and columns | ||
new_colnames_col = c() | ||
|
||
# This condiction is because if I don't have any samples, the new column | ||
# could be mapped to samples and return NA in the final SE | ||
if(ncol(se) > 0) | ||
new_colnames_col <- | ||
.data_mutated %>% | ||
select_if(!colnames(.) %in% setdiff(colnames_col, s_(se)$name)) %>% | ||
|
||
# Eliminate special columns that are read only. Assays | ||
select_if(!colnames(.) %in% special_columns) %>% | ||
select_if(!colnames(.) %in% colnames_row) %>% | ||
# Replace for subset | ||
select(!!s_(se)$symbol, get_subset_columns(., !!s_(se)$symbol)) %>% | ||
colnames() | ||
|
||
col_data <- | ||
.data_mutated %>% | ||
|
||
select(c(colnames_col, new_colnames_col)) %>% | ||
|
||
|
||
# Filter Sample is NA from SE that have 0 samples | ||
filter(!is.na(!!s_(se)$symbol)) | ||
|
||
# This works even if I have 0 samples | ||
duplicated_samples = col_data |> pull(!!s_(se)$symbol) %>% duplicated() | ||
if(duplicated_samples |> which() |> length() > 0) | ||
# Make fast distinct() | ||
col_data = col_data |> filter(!duplicated_samples) | ||
|
||
col_data = | ||
col_data |> | ||
|
||
# In case unitary SE subset does not work | ||
data.frame(row.names = pull(col_data, !!s_(se)$symbol), check.names = FALSE) %>% | ||
select(-!!s_(se)$symbol) %>% | ||
DataFrame(check.names = FALSE) | ||
# Extract column data | ||
col_data <- extract_col_data(.data_mutated, se, column_belonging) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment 'Comply to CRAN notes' followed by '. <- NULL' is repeated throughout the codebase. Consider creating a utility function or documenting this pattern more clearly, as the purpose isn't immediately obvious.
Copilot uses AI. Check for mistakes.
mutate_assay <- function(.data, ...) { | ||
# Get the mutation expressions | ||
dots <- enquos(...) | ||
|
||
# For each new assay column, add it directly to assays | ||
for (i in seq_along(dots)) { | ||
new_assay_name <- names(dots)[i] | ||
expr <- dots[[i]] | ||
|
||
# Evaluate the expression to create the new assay | ||
# This assumes the expression references existing assay names | ||
new_assay_data <- rlang::eval_tidy(expr, data = as.list(assays(.data))) | ||
|
||
# Add the new assay | ||
assays(.data)[[new_assay_name]] <- new_assay_data | ||
} | ||
|
||
return(.data) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function assumes expression references existing assay names but doesn't validate this assumption. If the expression references non-existent assays, eval_tidy will fail with unclear error messages.
Copilot uses AI. Check for mistakes.
extract_col_data <- function(.data_mutated, se, column_belonging = NULL, | ||
colnames_col = NULL, special_columns = NULL, | ||
colnames_row = NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The function has many optional parameters with defaults computed inside the function. Consider using a more explicit parameter design or providing clearer documentation about when each parameter should be provided.
Copilot uses AI. Check for mistakes.
test_that("extract_col_data works correctly", { | ||
# Use testthat edition 3 | ||
local_edition(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Setting testthat edition should be done at the package level (e.g., in testthat.R) rather than in individual test files to ensure consistency across all tests.
Copilot uses AI. Check for mistakes.
#' @examples | ||
#' \dontrun{ | ||
#' library(dplyr) | ||
#' library(airway) | ||
#' data(airway) | ||
#' | ||
#' # Basic mutate decomposition | ||
#' mutate_fn <- decompose_tidy_operation("mutate", new_dex = dex, new_cell = cell) | ||
#' attr(mutate_fn, "pipeline_text") | ||
#' # "mutate(new_dex = dex) |> mutate(new_cell = cell)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The examples use \dontrun{} but these appear to be valid executable examples that would help users understand the functionality. Consider using \donttest{} instead if the concern is execution time.
Copilot uses AI. Check for mistakes.
### Visualise with ggplot boxplots | ||
|
||
We cap at 1 second to avoid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence 'We cap at 1 second to avoid' is incomplete and should be finished or removed.
### Visualise with ggplot boxplots | |
We cap at 1 second to avoid | |
We cap at 1 second to avoid skewing the plot with extreme outliers and to improve readability. |
Copilot uses AI. Check for mistakes.
…in introduction vignette Replace devtools with remotes for GitHub installation and specify dplyr::slice for clarity in the examples. This enhances the documentation for better user guidance.
…ng dplyr::rename for improved clarity in examples.
…eamline codebase and improve maintainability.
No description provided.