Multiple Refactorings #181

alyst · 2024-04-14T22:36:36Z

This is a draft PR that contains multiple assorted tweaks and refactorings that I have implemented for a SEM.jl package over a last few weeks.
The goal of these changes was to

improve the performance and memory footprint for my use cases (few thousands of observed vars)
cleanup the public API (also to simplify building SEM models for my usescases)
cleanup the internal API to simplify maintenance

I realize that there's a lot of changes, so it is hard to review all that is going on in this PR, but I thought it still would be nice to submit it as a preview of all the potential changes that I have implemented.
In particular, these are the bigger pieces implemented:

introduce ParamArray type that incapsulates the logic of substituting parameter values into RAM matrices (what is currently done by A_inds, S_inds etc).
It simplifies the code, reduces code duplication and allows easy switching between dense/sparse matrices of substituted model parameters.
unify objective/gradient/hessian calculation of loss functions and implied objects into a single evaluate!() call (instead of upto 7 methods in the current version).
That greatly improves the maintenance and tweaking this code, and also maps better onto optimization packages (Optim.jl).
predict_latent_vars() for predicting factor scores per observation. At the moment regression and Bartlett methods are implemented.
tweaks to the external API: identifier() is renamed into param() for clarity, param_type into relation, consistently added support for vars, observed_vars, latent_vars methods to ParameterTable, RAMMatrices and other types
converting SEM.jl models into lavaan models; extracting parameters from lavaan models (helper function for the unit tests was moved into the main package code)
missing data handling: introduce SemObsMissingPattern object to simplify missing data-related code (including FIML)
multiple computation optimizations across the package (EM, ML, FIML, WLS, generic and symbolic implied objects): mostly to linear algebra, e.g. using 5-arg mul!() to avoid unnecessary allocations
some cleanups to the unit tests to improve reporting of the failed tests
use Julia 1.9 package extensions for supporting different optimizations backend (NLOpt.jl, ProximalOperators.jl, BlackBoxOptim.jl). This is both to reduce the number of the package dependencies and to streamline switching between the optimizers for the users.

Let me know what you think about this PR and how you would like these changes to be incorporated into the main tree (of course, if you are interested in that :) )

I have tried to make commits as granular as possible to allow selecting subset of commits around a particular feature into a smaller PRs.

for resolving the scope

* both columns and vars * remove _vars suffix from variable categories ("variables" is already a field name)

* destructure lambda function args * don't use zip(): mapreduce support multi-arg functions

to match julia naming convention

avoids full evalulation when the end result is known

* simplify parameters() function to return just a vector of params * RAMMatrices ctor: warn on duplicate params

* introduce ArrayParamsMap type for clarity * annotate types of the corresp. RAMMatrices fields * rename get_parameter_indices() into array_parameters_map[_linear()] (two distinct functions without kwarg for type stability) and remove cartesian2linear() (not needed) * use the single pass over the array for performance

RAMMatrices depend on RAMConstant type

* declare RAMConstant field types * refactor constants collection to avoid code duplication

* this method is not RAMMatrices ctor, it is Dict{K, RAMMatrices} convert * use comprehension to construct dict

convert() is a proper method to call to avoid unnecessary construction, ctor semantics requires that a new object is constructed

* use named tuples * reduce code duplication

* use graph as a main parameter * simplify rows processing * don't reallocate table.columns

so EM MVN could be done when SemObsMissing is constructed

codecov · 2024-04-14T23:03:29Z

Codecov Report

Attention: Patch coverage is 68.64932% with 506 lines in your changes are missing coverage. Please review.

Project coverage is 62.57%. Comparing base (68fba92) to head (090ef5e).
Report is 26 commits behind head on devel.

❗ Current head 090ef5e differs from pull request most recent head 52c1a8b. Consider uploading reports for the commit 52c1a8b to get more accurate results

Files	Patch %	Lines
src/frontend/specification/ParameterTable.jl	56.42%	78 Missing ⚠️
src/frontend/predict.jl	0.00%	57 Missing ⚠️
ext/SEMBlackBoxOptimExt/DiffEvoFactory.jl	0.00%	51 Missing ⚠️
src/frontend/specification/RAMMatrices.jl	81.87%	29 Missing ⚠️
ext/SEMBlackBoxOptimExt/BlackBoxOptim.jl	0.00%	22 Missing ⚠️
...t/SEMBlackBoxOptimExt/SemOptimizerBlackBoxOptim.jl	0.00%	21 Missing ⚠️
src/optimizer/documentation.jl	34.37%	21 Missing ⚠️
ext/SEMBlackBoxOptimExt/AdamMutation.jl	0.00%	17 Missing ⚠️
ext/optimizer/ProximalAlgorithms.jl	0.00%	17 Missing ⚠️
src/types.jl	46.87%	17 Missing ⚠️
... and 28 more

Additional details and impacted files

@@            Coverage Diff             @@
##            devel     #181      +/-   ##
==========================================
- Coverage   67.22%   62.57%   -4.66%     
==========================================
  Files          51       59       +8     
  Lines        2490     2322     -168     
==========================================
- Hits         1674     1453     -221     
- Misses        816      869      +53

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

to make tests pass

alyst · 2024-04-14T23:44:08Z

I have benchmarked one of my models (~120 observed variables, generic imply, ML) 2 weeks ago.
With the main branch:

47.710711 seconds (1.55 M allocations: 35.937 GiB, 11.03% gc time)
47.635976 seconds (1.55 M allocations: 35.937 GiB, 10.97% gc time)

 * Candidate solution
    Final objective value:     2.002460e+02

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 8.45e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 6.51e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 2.88e-07 ≰ 1.0e-08

 * Work counters
    Seconds run:   48  (vs limit Inf)
    Iterations:    4279
    f(x) calls:    12529
    ∇f(x) calls:   12529

With this PR (earlier version):

20.348503 seconds (665.00 k allocations: 12.666 GiB, 9.55% gc time)
20.356397 seconds (665.00 k allocations: 12.666 GiB, 9.42% gc time)

Candidate solution
    Final objective value:     1.999197e+02

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 3.89e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.99e-08 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 3.68e-08 ≰ 1.0e-08

 * Work counters
    Seconds run:   20  (vs limit Inf)
    Iterations:    2774
    f(x) calls:    8085
    ∇f(x) calls:   8085

The number of iterations is different, because the input data is slightly different due to imputation.
But for the main branch it's 89.7 iterations/sec, 8.39 Mb/iteration
For the PR: 136.3 iterations/sec, 4.54 Mb/iteration

Maximilian-Stefan-Ernst · 2024-04-15T07:39:02Z

These changes look amazing, thanks a lot for feeding them back. I will take a look at them and try to figure out what the smartest way of merging is.

for faster EM MVN

Maximilian-Stefan-Ernst · 2024-04-17T12:20:30Z

I tried to wrap my head around how to format first and then merge, but everything I tried lead to quite large merge conflicts... So I guess it's easier to merge this first and format after...

alyst · 2024-04-17T19:56:17Z

I tried to wrap my head around how to format first and then merge, but everything I tried lead to quite large merge conflicts...

This would require a bit of scripting, but I think it's doable.
The idea is to create a parallel branch, which would recreate the history of this branch, but with the formatting applied after each commit.
Then at each iteration of the script, it would check out the specific commit from the unformatted branch (this branch), format the whole codebase and commit the resulting changes (relative to the previous commit of the new branch) to the new branch.
I can do that, it would be a fun exercise :) We just have to agree on the formatting style, apply it to the master branch, and then I can reformat the commits in this PR.

Maximilian-Stefan-Ernst · 2024-04-17T20:47:21Z

Cool, if you would like to do that of course we can^^ I thought about using the default formatter style with some minor tweaks: https://github.com/StructuralEquationModels/StructuralEquationModels.jl/blob/feature/formatter/.JuliaFormatter.toml
Would you be happy with that?

alyst · 2024-04-17T20:57:27Z

Would you be happy with that?

I like always_for_in. :)
In principle, I would be happy with anything that you, as the author&maintainer, would decide, but if you can create a draft PR/branch that would show the reformatted code, I can give a specific feedback.

Maximilian-Stefan-Ernst · 2024-04-17T21:20:34Z

This branch is formatted accordingly: https://github.com/StructuralEquationModels/StructuralEquationModels.jl/tree/feature/formatter

Yes, always_for_in made me realize I used =, in and ∈, haha. The other options are just some minor changes that let you have multiple assignments / matrix definitions / ... aligned.

One minor thing is that formatter changes spaces between - and *, and sometimes in a weird way... But I guess that's okay.

alyst · 2024-04-17T21:45:38Z

This branch is formatted accordingly

Thank you! I like this style, even the condensing of formulas within the indices. :)
I personally also like to remove superfluous newlines, so that more code can fit on one screen (which remove_extra_newlines = true should implement), but I don't know if that would be too aggressive and remove any reasonable separation of the code blocks.

Maximilian-Stefan-Ernst · 2024-04-18T10:15:11Z

I added remove_extra_newlines = true (I also like it better), and also added whitespace_typedefs = true which changes for example struct SemEnsemble{N,T<:Tuple,V<:AbstractVector,D,I} <: AbstractSemCollection to struct SemEnsemble{N, T <: Tuple, V <: AbstractVector, D, I} <: AbstractSemCollection. If you want you can check again on the feature/formatter branch; and if you're okay with that I would go ahead and update the main branch accordingly.

alyst · 2024-04-18T15:47:38Z

I added remove_extra_newlines = true (I also like it better), and also added whitespace_typedefs = true.

Thank you, I like these updates. It's good to go from my side.
One thing I noticed -- the formatter adds explicit returns (which is a good thing), which exposes a few cases like return print(...), where returning the result of print was not intended.
This has to be replaced with

    print(...)
    return nothing

but it's not critical to do that immediately, it could be gradually fixed over time.

Maximilian-Stefan-Ernst · 2024-04-18T22:00:48Z

Okay, devel and main are both reformatted now, and since these are the only two branches the complete repo is reformatted.

(W.r.t. the return print: I think this was because I messed up; I played around with different styles, and the option always_use_return was applied altough it's not in the default style. I now re-applied the reformatting with the correct specifications in one go, so the return print thing is fixed.)

alyst · 2024-04-18T22:05:17Z

Okay, devel and main are both reformatted now, and since these are the only two branches the complete repo is reformatted.

Thanks! Later today I will try to implement per-commit formatting I suggested.
If that works, I'll probably force-push to this PR.

Alexey Stukalov and others added 30 commits March 17, 2024 17:43

const SEM=StructuralEquationModels

30083a6

for resolving the scope

ParamTable: convert from Dict to NamedTuple

a505da0

* both columns and vars * remove _vars suffix from variable categories ("variables" is already a field name)

obj_grad_hess: simplify mapreduce

0d0d389

* destructure lambda function args * don't use zip(): mapreduce support multi-arg functions

cov_and_mean(): use StatsBase.mean_and_cov()

080a8ad

fill_A_S_M(): use @inbounds

5e9db0a

fill_A_S_M!(): add !

74735f3

to match julia naming convention

==: use &&

60c61f3

avoids full evalulation when the end result is known

RAMMatrices: cleanup params index

ae89781

* simplify parameters() function to return just a vector of params * RAMMatrices ctor: warn on duplicate params

RAMMatrices: tiny rename for clarity

4609c00

RAMConstant: move before RAMMatrices

613dba7

RAMMatrices depend on RAMConstant type

RAMConstant: simplify

436d31c

* declare RAMConstant field types * refactor constants collection to avoid code duplication

RAMMatrices: optimize F_indices init

6ce98c3

RAMMatrices ctor: dim checks

80b1927

RAMMatrices: declare types for all fields

67e01ce

include RAMMatrices before EnsParTable

b3e12a0

fix EnsParTable to Dict{RAMMatrices} convert

b3b8a8a

* this method is not RAMMatrices ctor, it is Dict{K, RAMMatrices} convert * use comprehension to construct dict

replace no-op ctors with convert(T, obj)

05040e9

convert() is a proper method to call to avoid unnecessary construction, ctor semantics requires that a new object is constructed

ParTable ctor: simplify rows code

6187f24

* use named tuples * reduce code duplication

ParTable: full support for Iterator iface

07ed9a9

use Fix1 instead of anonymous function

c6eb013

fix typo

165c640

rename vars for type stability

b48a0c5

obj!()/grad!(): avoid tmp array creation

0391a5e

grad!(): avoid extra array copying

b18e658

check_acyclic(): use istril/u()

764d4a0

grad!(SemML): reduce * ops

0bbf5c9

obj/grad/hess!(SemML): avoid extra arr copying

74e23e9

use .+= to reduce allocs

67a988e

ParTable: update StenGraph-based ctor

28950f3

* use graph as a main parameter * simplify rows processing * don't reallocate table.columns

Alexey Stukalov and others added 7 commits April 14, 2024 13:51

simplify duplication_matrix()

3a6f596

FIML: optimize Jmu

5c4ac3c

fix typo

09c93c8

SemWLS: dim checks

f281de1

EM: move code refs to docstring

7cc2f58

EM MVN: decouple from SemObsMissing

1b30da4

so EM MVN could be done when SemObsMissing is constructed

fixup semoptimizer in ext tests

0eb2690

test/fiml: set EM MVN rtol=1e-10

090ef5e

to make tests pass

alyst mentioned this pull request Apr 15, 2024

CI: add julia-format action #182

Merged

Alexey Stukalov added 4 commits April 17, 2024 01:05

MissingPattern: transpose data

1eb987a

for faster EM MVN

EM MVN: report rel_error if not converged

fd71ebd

EM: max_nobs_em opt to limit obs used

b93f128

EM: optimize mean handling

52c1a8b

Maximilian-Stefan-Ernst changed the base branch from main to feature/formatter April 17, 2024 09:18

Maximilian-Stefan-Ernst changed the base branch from feature/formatter to devel April 17, 2024 11:17

alyst mentioned this pull request Apr 19, 2024

Multiple refactorings reformatted #193

Draft

alyst closed this Apr 20, 2024

Multiple Refactorings #181

Multiple Refactorings #181

Uh oh!

Conversation

alyst commented Apr 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

alyst commented Apr 14, 2024

Uh oh!

Maximilian-Stefan-Ernst commented Apr 15, 2024

Uh oh!

Maximilian-Stefan-Ernst commented Apr 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alyst commented Apr 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Maximilian-Stefan-Ernst commented Apr 17, 2024

Uh oh!

alyst commented Apr 17, 2024

Uh oh!

Maximilian-Stefan-Ernst commented Apr 17, 2024

Uh oh!

alyst commented Apr 17, 2024

Uh oh!

Maximilian-Stefan-Ernst commented Apr 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alyst commented Apr 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Maximilian-Stefan-Ernst commented Apr 18, 2024

Uh oh!

alyst commented Apr 18, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alyst commented Apr 14, 2024 •

edited

Loading

codecov bot commented Apr 14, 2024 •

edited

Loading

Maximilian-Stefan-Ernst commented Apr 17, 2024 •

edited

Loading

alyst commented Apr 17, 2024 •

edited

Loading

Maximilian-Stefan-Ernst commented Apr 18, 2024 •

edited

Loading

alyst commented Apr 18, 2024 •

edited

Loading