Skip to content

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Apr 14, 2024

This is a draft PR that contains multiple assorted tweaks and refactorings that I have implemented for a SEM.jl package over a last few weeks.
The goal of these changes was to

  • improve the performance and memory footprint for my use cases (few thousands of observed vars)
  • cleanup the public API (also to simplify building SEM models for my usescases)
  • cleanup the internal API to simplify maintenance

I realize that there's a lot of changes, so it is hard to review all that is going on in this PR, but I thought it still would be nice to submit it as a preview of all the potential changes that I have implemented.
In particular, these are the bigger pieces implemented:

  • introduce ParamArray type that incapsulates the logic of substituting parameter values into RAM matrices (what is currently done by A_inds, S_inds etc).
    It simplifies the code, reduces code duplication and allows easy switching between dense/sparse matrices of substituted model parameters.
  • unify objective/gradient/hessian calculation of loss functions and implied objects into a single evaluate!() call (instead of upto 7 methods in the current version).
    That greatly improves the maintenance and tweaking this code, and also maps better onto optimization packages (Optim.jl).
  • predict_latent_vars() for predicting factor scores per observation. At the moment regression and Bartlett methods are implemented.
  • tweaks to the external API: identifier() is renamed into param() for clarity, param_type into relation, consistently added support for vars, observed_vars, latent_vars methods to ParameterTable, RAMMatrices and other types
  • converting SEM.jl models into lavaan models; extracting parameters from lavaan models (helper function for the unit tests was moved into the main package code)
  • missing data handling: introduce SemObsMissingPattern object to simplify missing data-related code (including FIML)
  • multiple computation optimizations across the package (EM, ML, FIML, WLS, generic and symbolic implied objects): mostly to linear algebra, e.g. using 5-arg mul!() to avoid unnecessary allocations
  • some cleanups to the unit tests to improve reporting of the failed tests
  • use Julia 1.9 package extensions for supporting different optimizations backend (NLOpt.jl, ProximalOperators.jl, BlackBoxOptim.jl). This is both to reduce the number of the package dependencies and to streamline switching between the optimizers for the users.

Let me know what you think about this PR and how you would like these changes to be incorporated into the main tree (of course, if you are interested in that :) )

I have tried to make commits as granular as possible to allow selecting subset of commits around a particular feature into a smaller PRs.

Alexey Stukalov and others added 30 commits March 17, 2024 17:43
for resolving the scope
* both columns and vars
* remove _vars suffix from variable categories ("variables" is already a field name)
* destructure lambda function args
* don't use zip(): mapreduce support multi-arg functions
to match julia naming convention
avoids full evalulation when the end result is known
* simplify parameters() function to return just a vector of params
* RAMMatrices ctor: warn on duplicate params
* introduce ArrayParamsMap type for clarity
* annotate types of the corresp. RAMMatrices fields
* rename get_parameter_indices() into array_parameters_map[_linear()]
(two distinct functions without kwarg for type stability) and
remove cartesian2linear() (not needed)
* use the single pass over the array for performance
RAMMatrices depend on RAMConstant type
* declare RAMConstant field types
* refactor constants collection to avoid code duplication
* this method is not RAMMatrices ctor, it is
  Dict{K, RAMMatrices} convert
* use comprehension to construct dict
convert() is a proper method to call to avoid unnecessary construction,
ctor semantics requires that a new object is constructed
* use named tuples
* reduce code duplication
* use graph as a main parameter
* simplify rows processing
* don't reallocate table.columns
Copy link

codecov bot commented Apr 14, 2024

Codecov Report

Attention: Patch coverage is 68.64932% with 506 lines in your changes are missing coverage. Please review.

Project coverage is 62.57%. Comparing base (68fba92) to head (090ef5e).
Report is 26 commits behind head on devel.

❗ Current head 090ef5e differs from pull request most recent head 52c1a8b. Consider uploading reports for the commit 52c1a8b to get more accurate results

Files Patch % Lines
src/frontend/specification/ParameterTable.jl 56.42% 78 Missing ⚠️
src/frontend/predict.jl 0.00% 57 Missing ⚠️
ext/SEMBlackBoxOptimExt/DiffEvoFactory.jl 0.00% 51 Missing ⚠️
src/frontend/specification/RAMMatrices.jl 81.87% 29 Missing ⚠️
ext/SEMBlackBoxOptimExt/BlackBoxOptim.jl 0.00% 22 Missing ⚠️
...t/SEMBlackBoxOptimExt/SemOptimizerBlackBoxOptim.jl 0.00% 21 Missing ⚠️
src/optimizer/documentation.jl 34.37% 21 Missing ⚠️
ext/SEMBlackBoxOptimExt/AdamMutation.jl 0.00% 17 Missing ⚠️
ext/optimizer/ProximalAlgorithms.jl 0.00% 17 Missing ⚠️
src/types.jl 46.87% 17 Missing ⚠️
... and 28 more
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #181      +/-   ##
==========================================
- Coverage   67.22%   62.57%   -4.66%     
==========================================
  Files          51       59       +8     
  Lines        2490     2322     -168     
==========================================
- Hits         1674     1453     -221     
- Misses        816      869      +53     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alyst
Copy link
Contributor Author

alyst commented Apr 14, 2024

I have benchmarked one of my models (~120 observed variables, generic imply, ML) 2 weeks ago.
With the main branch:

47.710711 seconds (1.55 M allocations: 35.937 GiB, 11.03% gc time)
47.635976 seconds (1.55 M allocations: 35.937 GiB, 10.97% gc time)

 * Candidate solution
    Final objective value:     2.002460e+02

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 8.45e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 6.51e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 2.88e-07 ≰ 1.0e-08

 * Work counters
    Seconds run:   48  (vs limit Inf)
    Iterations:    4279
    f(x) calls:    12529
    ∇f(x) calls:   12529

With this PR (earlier version):

20.348503 seconds (665.00 k allocations: 12.666 GiB, 9.55% gc time)
20.356397 seconds (665.00 k allocations: 12.666 GiB, 9.42% gc time)

Candidate solution
    Final objective value:     1.999197e+02

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 3.89e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.99e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 3.68e-08 ≰ 1.0e-08

 * Work counters
    Seconds run:   20  (vs limit Inf)
    Iterations:    2774
    f(x) calls:    8085
    ∇f(x) calls:   8085

The number of iterations is different, because the input data is slightly different due to imputation.
But for the main branch it's 89.7 iterations/sec, 8.39 Mb/iteration
For the PR: 136.3 iterations/sec, 4.54 Mb/iteration

@Maximilian-Stefan-Ernst
Copy link
Collaborator

These changes look amazing, thanks a lot for feeding them back. I will take a look at them and try to figure out what the smartest way of merging is.

@Maximilian-Stefan-Ernst Maximilian-Stefan-Ernst changed the base branch from main to feature/formatter April 17, 2024 09:18
@Maximilian-Stefan-Ernst Maximilian-Stefan-Ernst changed the base branch from feature/formatter to devel April 17, 2024 11:17
@Maximilian-Stefan-Ernst
Copy link
Collaborator

Maximilian-Stefan-Ernst commented Apr 17, 2024

I tried to wrap my head around how to format first and then merge, but everything I tried lead to quite large merge conflicts... So I guess it's easier to merge this first and format after...

@alyst
Copy link
Contributor Author

alyst commented Apr 17, 2024

I tried to wrap my head around how to format first and then merge, but everything I tried lead to quite large merge conflicts...

This would require a bit of scripting, but I think it's doable.
The idea is to create a parallel branch, which would recreate the history of this branch, but with the formatting applied after each commit.
Then at each iteration of the script, it would check out the specific commit from the unformatted branch (this branch), format the whole codebase and commit the resulting changes (relative to the previous commit of the new branch) to the new branch.
I can do that, it would be a fun exercise :) We just have to agree on the formatting style, apply it to the master branch, and then I can reformat the commits in this PR.

@Maximilian-Stefan-Ernst
Copy link
Collaborator

Cool, if you would like to do that of course we can^^ I thought about using the default formatter style with some minor tweaks: https://github.com/StructuralEquationModels/StructuralEquationModels.jl/blob/feature/formatter/.JuliaFormatter.toml
Would you be happy with that?

@alyst
Copy link
Contributor Author

alyst commented Apr 17, 2024

Would you be happy with that?

I like always_for_in. :)
In principle, I would be happy with anything that you, as the author&maintainer, would decide, but if you can create a draft PR/branch that would show the reformatted code, I can give a specific feedback.

@Maximilian-Stefan-Ernst
Copy link
Collaborator

This branch is formatted accordingly: https://github.com/StructuralEquationModels/StructuralEquationModels.jl/tree/feature/formatter

Yes, always_for_in made me realize I used =, in and , haha. The other options are just some minor changes that let you have multiple assignments / matrix definitions / ... aligned.

One minor thing is that formatter changes spaces between - and *, and sometimes in a weird way... But I guess that's okay.

@alyst
Copy link
Contributor Author

alyst commented Apr 17, 2024

This branch is formatted accordingly

Thank you! I like this style, even the condensing of formulas within the indices. :)
I personally also like to remove superfluous newlines, so that more code can fit on one screen (which remove_extra_newlines = true should implement), but I don't know if that would be too aggressive and remove any reasonable separation of the code blocks.

@Maximilian-Stefan-Ernst
Copy link
Collaborator

Maximilian-Stefan-Ernst commented Apr 18, 2024

I added remove_extra_newlines = true (I also like it better), and also added whitespace_typedefs = true which changes for example struct SemEnsemble{N,T<:Tuple,V<:AbstractVector,D,I} <: AbstractSemCollection to struct SemEnsemble{N, T <: Tuple, V <: AbstractVector, D, I} <: AbstractSemCollection. If you want you can check again on the feature/formatter branch; and if you're okay with that I would go ahead and update the main branch accordingly.

@alyst
Copy link
Contributor Author

alyst commented Apr 18, 2024

I added remove_extra_newlines = true (I also like it better), and also added whitespace_typedefs = true.

Thank you, I like these updates. It's good to go from my side.
One thing I noticed -- the formatter adds explicit returns (which is a good thing), which exposes a few cases like return print(...), where returning the result of print was not intended.
This has to be replaced with

    print(...)
    return nothing

but it's not critical to do that immediately, it could be gradually fixed over time.

@Maximilian-Stefan-Ernst
Copy link
Collaborator

Okay, devel and main are both reformatted now, and since these are the only two branches the complete repo is reformatted.

(W.r.t. the return print: I think this was because I messed up; I played around with different styles, and the option always_use_return was applied altough it's not in the default style. I now re-applied the reformatting with the correct specifications in one go, so the return print thing is fixed.)

@alyst
Copy link
Contributor Author

alyst commented Apr 18, 2024

Okay, devel and main are both reformatted now, and since these are the only two branches the complete repo is reformatted.

Thanks! Later today I will try to implement per-commit formatting I suggested.
If that works, I'll probably force-push to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants