@alyst alyst commented Dec 26, 2024

Another set of changes from #193 related to SemObserved:

  • each SemObserved implementation now stores observed_vars and provides the observed_vars(observed) method.
    That makes things simpler internally and, I think, also for SEM users -- in particular, it is now easy to check SemSpecification, SemImply, and SemObserved compatibility.
  • the rules for data import are relaxed and streamlined (to simplify things for the user):
    • the obs_colnames kwarg is renamed to observed_vars for consistency with the other functions (but I would like to hear feedback on a better name).
      The idea is that it should now be clearer what this argument is for.
    • as before, SemObserved ctors accept both specification and observed_vars. But with this PR, having SEM.observed_vars(spec) != observed_vars will not raise an error.
      Instead, the observed_vars kwarg takes precedence in mapping the data matrix columns to the observed variable names, whereas the spec takes precedence in defining the order of the variables in the imported data
      (also, observed variables not referenced in the spec will not be imported). So if the correct SEM specification is already defined, the user does not have to do an extra manual step to match the order of the columns
      in the data matrix.
    • automatically generated observed variable IDs have an "obs" prefix (:obs1, :obs2, etc. instead of :1, :2), which can be tweaked with the observed_var_prefix kwarg.
    • this logic is implemented by the prepare_data() function that is used by all SemObserved ctors
  • SemObservedCovariance is made a type alias of SemObservedData{Nothing} (i.e. nothing as the data matrix) instead of being its own type.
    That reduces some code duplication.
  • SemObservedMissing is refactored to encapsulate all the properties of a missing pattern into a single SemObservedMissingPattern object, so SemObservedMissing just stores a vector of these objects.
    In the context of a missing pattern, the observed variables that have values are called measured (so there are measured_cov, measured_mask, etc. fields); the vars without values are termed missing (miss_mask).
    That refactoring simplifies some code for EM, FIML, and minLL, and is also required for further enhancements of EM & FIML.
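The column-matching rule described above (observed_vars labels the data columns, the spec fixes their order, unreferenced columns are dropped) can be sketched in plain Julia. Note that prepare_data_sketch below is a toy re-implementation for illustration, not the package's actual prepare_data() function:

```julia
# Toy sketch of the column-matching rule (not the actual SEM.jl code):
# `observed_vars` labels the columns of `data`, while `spec_vars` (the
# observed variables from the SemSpecification) dictates the column order
# of the imported matrix; columns not referenced in the spec are dropped.
function prepare_data_sketch(data::AbstractMatrix, observed_vars::Vector{Symbol},
                             spec_vars::Vector{Symbol})
    col_of = Dict(v => i for (i, v) in enumerate(observed_vars))
    perm = [col_of[v] for v in spec_vars]  # KeyError if a spec var has no data column
    return data[:, perm]
end

data = [1 2 3; 4 5 6]                                  # columns are :b, :a, :c
prepared = prepare_data_sketch(data, [:b, :a, :c], [:a, :b])
# columns reordered to :a, :b; the :c column is dropped
```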
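The missing-pattern encapsulation can likewise be sketched with a toy struct; the field names follow the PR's description (measured_mask, miss_mask), but this is not the actual SemObservedMissingPattern definition:

```julia
# Toy sketch of grouping rows that share one missingness pattern
# (not the actual SEM.jl type): masks identify the measured vs. missing
# variables, and only the non-missing sub-matrix of the data is stored.
struct MissingPatternSketch
    measured_mask::BitVector   # true for variables observed in this pattern
    miss_mask::BitVector       # true for variables missing in this pattern
    data::Matrix{Float64}      # sub-matrix of the measured columns only
end

function pattern_sketch(rows::Matrix{Union{Float64, Missing}})
    measured = .!ismissing.(rows[1, :])   # all rows share the same pattern
    MissingPatternSketch(BitVector(measured), BitVector(.!measured),
                         Matrix{Float64}(rows[:, measured]))
end

rows = Union{Float64, Missing}[1.0 missing 3.0; 4.0 missing 6.0]
pat = pattern_sketch(rows)
# pat.measured_mask marks variables 1 and 3; pat.data keeps only those columns
```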


codecov bot commented Dec 26, 2024

Codecov Report

Attention: Patch coverage is 95.83333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 72.88%. Comparing base (c5b48c7) to head (673aa2b).
Report is 1 commit behind head on devel.

Files with missing lines Patch % Lines
src/observed/abstract.jl 92.45% 4 Missing ⚠️
src/observed/missing.jl 85.71% 2 Missing ⚠️
src/frontend/specification/Sem.jl 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #230      +/-   ##
==========================================
- Coverage   73.28%   72.88%   -0.40%     
==========================================
  Files          49       50       +1     
  Lines        2231     2213      -18     
==========================================
- Hits         1635     1613      -22     
- Misses        596      600       +4     


@alyst alyst force-pushed the enhance_data_import branch 2 times, most recently from 557c49d to a3b8bc8 Compare December 26, 2024 02:11
############################################################################################

function Sem(;
specification = RAMMatrices,
Collaborator:

I think we should make ParameterTable the default here - I know you are relying on the RAM notation to specify your models, but I believe that most users of the package (~90%) would rather use the ParameterTable specification.

I'm also thinking about always storing the specification object in the Sem model, maybe as its own field in Sem - what do you think about that?

Contributor Author:

I can change it, but actually specifying the class is a no-op, as internally most SEM terms convert it into RAMMatrices anyway.
As for a separate specification object -- I am just not sure what additional features it would enable beyond the existing mechanisms for converting a DataFrame into a SemSpecification and a SemFit into a DataFrame?

Collaborator:

I think it would be nice to change that to ParameterTable - I'll write something below on the specification object.

Contributor Author:

Fixed (also for SemFiniteDiff below)

Collaborator:

Thanks, I also added a keyword-only constructor for ParameterTable - so this works:

model = Sem(
    observed_vars = observed_vars,
    latent_vars = latent_vars,
    graph = graph,
    data = data,
    loss = SemML,
    meanstructure = false,
)

@Maximilian-Stefan-Ernst
Collaborator

Maximilian-Stefan-Ernst commented Dec 28, 2024

Thanks again for this PR, I really like the streamlining of FIML and how the missing patterns are now computed/stored/used.

I also like the changes to observed - I'm still thinking about whether we should have a field specification in Sem - ATM, the "default" workflow includes having a separate specification object (mostly a ParameterTable), then creating a model from it, and then updating the partable with fitted estimates - it seems like this could be streamlined.

@alyst
Contributor Author

alyst commented Dec 28, 2024

Thank you for the review -- thorough and helpful as usual!

I'm still thinking about whether we should have a field specification in Sem - ATM, the "default" workflow includes having a separate specification object (mostly a ParameterTable), then creating a model from it, and then updating the partable with fitted estimates - it seems like this could be streamlined.

To me the efficiency of the model inference and the cleanliness of the user API are the priorities in the design of Sem types.
In that context ParameterTable is not a primary candidate for SemSpecification implementation.
I also export the results of the SEM inference, but ParameterTable is just a transient intermediate before the conversion of a SemFit into a DataFrame,
because ParameterTable lacks the data manipulation capabilities of a DataFrame, and it does not support model-specific columns (in my case observed_vars are experiments, which have different metadata associated with them).
Theoretically, the conversion between a DataFrame (or a generic Tables.jl object) and a SemSpecification could be further streamlined.

Also, I am mostly working with SEM ensemble/multigroup models. Does your suggestion imply having specification objects both at the SemEnsemble level and Sem levels?
(Actually, the next big PR I plan merges SemEnsemble into Sem, which would then contain multiple SEM loss terms as well as multiple regularization terms -- so that problem goes away)

@Maximilian-Stefan-Ernst
Collaborator

My thought about having this specification field was that it would simplify things for the user because they don't have to manage a separate specification object. I expect that the overwhelming majority would have a partable to convert to a DataFrame, and they always have to create a partable -> create the model, and then at the end, update the partable -> convert it to a data frame. Assuming the partable lives in the model, we could have a function export_parameters that just returns the DataFrame and does all the updating of the partable internally.

But this is not high priority right now, so after the last open comment is resolved, I would merge this 🥳

alyst and others added 10 commits January 1, 2025 12:46
reduces code duplication;
also annotate types of ctor args
now samples(SemObsCov) returns nothing
add observed_vars(data::SemObserved)
* use SemObsMissingPattern struct to
  simplify code
* replace O(Nvars^2) common pattern detection
  with Dict{}
* don't store row-wise, store sub-matrices
  of non-missing data instead
* use StatsBase.mean_and_cov()
StatsBase.mean_and_cov() is used instead
- SemObservedData: parameterize by cov/mean eltype
  instead of the whole container types

Co-authored-by: Maximilian Ernst <[email protected]>
to match the updated data preparation behaviour
to specify the prefix of the generated observed_vars
if none provided could be inferred, defaults to :obs
@alyst alyst force-pushed the enhance_data_import branch from e16d68f to 673aa2b Compare January 1, 2025 21:15
@alyst
Contributor Author

alyst commented Jan 1, 2025

I've changed the Sem default specification type to ParameterTable. I think that was the last remaining item from the review, so I have cleaned up the commit history a bit.

My thoughts about having this specification field were that it would just simplify things for the user bcs they don't have to manage a separate specification object.

I think we actually agree on how the workflow should look from the user perspective: most of the time the user works with the parameter table representation - both for specifying the model and processing the results.
And, most likely, it is most convenient for the user to work with the parameter table as a DataFrame object, since it has all the data manipulation tools that the ParameterTable type lacks.
My point was just that since the conversion to/from the ParameterTable representation is trivial and cheap, we don't actually have to store it -- we can just generate it on the fly.
For the user it is fairly easy to integrate that minimal ParameterTable with just the parameter estimates into the existing SEM data frame using innerjoin() -- it also automatically takes care of data integrity.
With that we also avoid the information flow loop: we don't store the results of model fitting in the same object that was used as an input.
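The innerjoin() workflow sketched in the comment above could look roughly like this with DataFrames.jl; the column names (:from, :to, :estimate, :metadata) are purely illustrative, not the package's actual parameter-table schema:

```julia
using DataFrames

# Existing user-side SEM table with model-specific metadata (illustrative columns).
sem_df = DataFrame(from = [:f1, :f1], to = [:x1, :x2], metadata = ["a", "b"])

# Minimal table of parameter estimates, e.g. generated on the fly from a SemFit.
estimates = DataFrame(from = [:f1, :f1], to = [:x1, :x2], estimate = [0.8, 0.6])

# innerjoin matches rows by the parameter identity columns, so the estimates
# attach to the right rows regardless of row order, and unmatched rows are
# dropped -- which is what guards integrity here.
result = innerjoin(sem_df, estimates, on = [:from, :to])
```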

@Maximilian-Stefan-Ernst
Collaborator

Thanks! I added one last commit, if this one is fine with you, I would merge.

Also, the workflow you suggest seems like a good idea - so maybe we write a function like export_params(sem_fit) that generates a DataFrame on the fly, and have functions like sem_summary(sem_fit) create the partable on the fly.

@alyst
Contributor Author

alyst commented Jan 2, 2025

Thanks! I added one last commit, if this one is fine with you, I would merge.

Thank you!

@alyst alyst force-pushed the enhance_data_import branch from 0d255ea to 8c7cd14 Compare January 6, 2025 19:20
v0.7 changed the diff interface (v0.6 was skipped)
@alyst alyst force-pushed the enhance_data_import branch from 3fc0a0e to ff67cf7 Compare January 6, 2025 19:46
@alyst
Contributor Author

alyst commented Jan 6, 2025

I've realized that the tests were failing with the commit that switches Sem to ParameterTable by default. I've fixed it, but I think it is another indication that SEM.jl should move away from kwarg-based constructors.
Julia does not dispatch on the names or types of keyword arguments, so if there is a kw-only ctor for a type, it has to be the single kw-only ctor.
But that limits how extensible the package is.
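The dispatch limitation mentioned above is plain Julia semantics and easy to demonstrate: two keyword-only methods with the same positional signature silently overwrite each other, because keyword names and types never participate in method selection (kwctor here is just a stand-in name):

```julia
# Keyword arguments do not take part in Julia's method dispatch: both
# definitions below have the same (empty) positional signature, so the
# second silently replaces the first instead of adding a new method.
kwctor(; spec::Symbol) = :symbol_version
kwctor(; spec::String) = :string_version   # overwrites the method above

kwctor(spec = "hi")    # works, returns :string_version
# kwctor(spec = :hi)   # TypeError: only the latest definition survives
```

This is why a package that wants extension points cannot offer several kw-only constructors for one type; they would clobber each other.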

@alyst
Contributor Author

alyst commented Jan 6, 2025

ProximalAlgorithms.jl was updated to 0.7 yesterday (from 0.5.4).
It looks like the new version changed how user-defined functions should provide their gradients, so I have restricted ProximalAlgorithms to 0.5 for now.

@Maximilian-Stefan-Ernst
Collaborator

Thanks a lot for the fix!

@Maximilian-Stefan-Ernst Maximilian-Stefan-Ernst merged commit d3915c1 into StructuralEquationModels:devel Jan 9, 2025
2 of 3 checks passed
@alyst alyst deleted the enhance_data_import branch January 9, 2025 16:39