@alyst alyst commented Dec 26, 2024

Another set of changes from #193 related to SemObserved:

  • each SemObserved implementation now stores observed_vars and provides the observed_vars(observed) method.
    That makes things simpler internally and, I think, also for SEM users -- in particular, it is now easy to check SemSpecification, SemImply, and SemObserved compatibility.
  • the rules for data import are relaxed and streamlined (to simplify things for the user):
    • the obs_colnames kwarg is renamed to observed_vars for consistency with the other functions (but I would like to hear feedback on a better name).
      The idea is that it should now be clearer what this argument is for.
    • as before, SemObserved ctors accept both specification and observed_vars. But with this PR, having SEM.observed_vars(spec) != observed_vars will not raise an error.
      Instead, the observed_vars kwarg takes precedence in mapping the data matrix columns to the observed variable names, whereas the spec takes precedence in defining the order of the variables in the imported data
      (also, observed variables not referenced in the spec will not be imported). So if the correct SEM specification is already defined, the user does not have to do an extra manual step to match the order of the columns
      in the data matrix.
    • automatically generated observed variable IDs have an "obs" prefix (:obs1, :obs2, etc. instead of :1, :2), which can be tweaked with the observed_var_prefix kwarg.
    • this logic is implemented by the prepare_data() function that is used by all SemObserved ctors
  • SemObservedCovariance is made a type alias of SemObservedData{Nothing} (i.e. nothing as the data matrix) instead of being its own type.
    That reduces some code duplication.
  • SemObservedMissing is refactored to encapsulate all the properties of a missing pattern into a single SemObservedMissingPattern object, so SemObservedMissing just stores a vector of these objects.
    In the context of a missing pattern, the observed variables that have values are called measured (so there are measured_cov, measured_mask, etc. fields); the vars without values are termed missing (miss_mask).
    That refactoring simplifies some code for EM, FIML, and minLL, and is also required for further enhancements of EM & FIML.
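The column-matching rule described above (observed_vars labels the data columns, the spec fixes their order, unreferenced columns are dropped) can be sketched in plain Julia. Note that prepare_data_sketch below is a toy re-implementation for illustration, not the package's actual prepare_data() function:

```julia
# Toy sketch of the column-matching rule (not the actual SEM.jl code):
# `observed_vars` labels the columns of `data`, while `spec_vars` (the
# observed variables from the SemSpecification) dictates the column order
# of the imported matrix; columns not referenced in the spec are dropped.
function prepare_data_sketch(data::AbstractMatrix, observed_vars::Vector{Symbol},
                             spec_vars::Vector{Symbol})
    col_of = Dict(v => i for (i, v) in enumerate(observed_vars))
    perm = [col_of[v] for v in spec_vars]  # KeyError if a spec var has no data column
    return data[:, perm]
end

data = [1 2 3; 4 5 6]                                  # columns are :b, :a, :c
prepared = prepare_data_sketch(data, [:b, :a, :c], [:a, :b])
# columns reordered to :a, :b; the :c column is dropped
```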
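The missing-pattern encapsulation can likewise be sketched with a toy struct; the field names follow the PR's description (measured_mask, miss_mask), but this is not the actual SemObservedMissingPattern definition:

```julia
# Toy sketch of grouping rows that share one missingness pattern
# (not the actual SEM.jl type): masks identify the measured vs. missing
# variables, and only the non-missing sub-matrix of the data is stored.
struct MissingPatternSketch
    measured_mask::BitVector   # true for variables observed in this pattern
    miss_mask::BitVector       # true for variables missing in this pattern
    data::Matrix{Float64}      # sub-matrix of the measured columns only
end

function pattern_sketch(rows::Matrix{Union{Float64, Missing}})
    measured = .!ismissing.(rows[1, :])   # all rows share the same pattern
    MissingPatternSketch(BitVector(measured), BitVector(.!measured),
                         Matrix{Float64}(rows[:, measured]))
end

rows = Union{Float64, Missing}[1.0 missing 3.0; 4.0 missing 6.0]
pat = pattern_sketch(rows)
# pat.measured_mask marks variables 1 and 3; pat.data keeps only those columns
```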


codecov bot commented Dec 26, 2024

Codecov Report

Attention: Patch coverage is 95.83333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 72.88%. Comparing base (c5b48c7) to head (673aa2b).
Report is 1 commit behind head on devel.

Files with missing lines Patch % Lines
src/observed/abstract.jl 92.45% 4 Missing ⚠️
src/observed/missing.jl 85.71% 2 Missing ⚠️
src/frontend/specification/Sem.jl 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #230      +/-   ##
==========================================
- Coverage   73.28%   72.88%   -0.40%     
==========================================
  Files          49       50       +1     
  Lines        2231     2213      -18     
==========================================
- Hits         1635     1613      -22     
- Misses        596      600       +4     


@alyst alyst force-pushed the enhance_data_import branch 2 times, most recently from 557c49d to a3b8bc8 Compare December 26, 2024 02:11
############################################################################################

function Sem(;
specification = RAMMatrices,
Collaborator:

I think we should make ParameterTable the default here - I know you are relying on the RAM notation to specify your models, but I believe that most users of the package (~90%) would rather use the ParameterTable specification.

I'm also thinking about always storing the specification object in the Sem model, maybe as its own field in Sem - what do you think about that?

Contributor Author:

I can change it, but actually specifying the class is a no-op, as internally most SEM terms convert it into RAMMatrices anyway.
As for a separate specification object -- I am just not sure what additional features it would enable beyond the existing mechanisms for converting a DataFrame into a SemSpecification and a SemFit into a DataFrame?

Collaborator:

I think it would be nice to change that to ParameterTable - I'll write something below on the specification object.

Contributor Author:

Fixed (also for SemFiniteDiff below)

Collaborator:

Thanks, I also added a keyword-only constructor for ParameterTable - so this works:

model = Sem(
    observed_vars = observed_vars,
    latent_vars = latent_vars,
    graph = graph,
    data = data,
    loss = SemML,
    meanstructure = false,
)

@Maximilian-Stefan-Ernst
Collaborator

Maximilian-Stefan-Ernst commented Dec 28, 2024

Thanks again for this PR, I really like the streamlining of FIML and how the missing patterns are now computed/stored/used.

I also like the changes to observed - I'm still thinking about whether we should have a field specification in Sem - ATM, the "default" workflow includes having a separate specification object (mostly a ParameterTable), then creating a model from it, and then updating the partable with fitted estimates - it seems like this could be streamlined.

@alyst
Contributor Author

alyst commented Dec 28, 2024

Thank you for the review -- thorough and helpful as usual!

I'm still thinking about whether we should have a field specification in Sem - ATM, the "default" workflow includes having a separate specification object (mostly a ParameterTable), then creating a model from it, and then updating the partable with fitted estimates - it seems like this could be streamlined.

To me the efficiency of the model inference and the cleanliness of the user API are the priorities in the design of Sem types.
In that context ParameterTable is not a primary candidate for SemSpecification implementation.
I also export the results of the SEM inference, but ParameterTable is just a transient intermediate before the conversion of a SemFit into a DataFrame,
because ParameterTable lacks the data manipulation capabilities of a DataFrame, and it does not support model-specific columns (in my case observed_vars are experiments, which have different metadata associated with them).
Theoretically, the conversion between a DataFrame (or a generic Tables.jl object) and a SemSpecification could be further streamlined.

Also, I am mostly working with SEM ensemble/multigroup models. Does your suggestion imply having specification objects both at the SemEnsemble level and Sem levels?
(Actually, the next big PR I plan merges SemEnsemble into Sem, which would then contain multiple SEM loss terms as well as multiple regularization terms -- so that problem goes away)

@Maximilian-Stefan-Ernst
Collaborator

My thought about having this specification field was that it would simplify things for the user because they don't have to manage a separate specification object. I expect that the overwhelming majority would have a partable to convert to a DataFrame, and they always have to create a partable -> create the model, and then at the end, update the partable -> convert it to a data frame. Assuming the partable lives in the model, we could have a function export_parameters that just returns the DataFrame and does all the updating of the partable internally.

But this is not high priority right now, so after the last open comment is resolved, I would merge this 🥳

alyst and others added 10 commits January 1, 2025 12:46
reduces code duplication;
also annotate types of ctor args
now samples(SemObsCov) returns nothing
add observed_vars(data::SemObserved)
* use SemObsMissingPattern struct to
  simplify code
* replace O(Nvars^2) common pattern detection
  with Dict{}
* don't store row-wise, store sub-matrices
  of non-missing data instead
* use StatsBase.mean_and_cov()
StatsBase.mean_and_cov() is used instead
- SemObservedData: parameterize by cov/mean eltype
  instead of the whole container types

Co-authored-by: Maximilian Ernst <[email protected]>
to match the updated data preparation behaviour
to specify the prefix of the generated observed_vars
if none provided could be inferred, defaults to :obs
@alyst alyst force-pushed the enhance_data_import branch from e16d68f to 673aa2b Compare January 1, 2025 21:15
@alyst
Contributor Author

alyst commented Jan 1, 2025

I've changed the Sem default specification type to ParameterTable. I think that was the last remaining item from the review, so I have cleaned up the commit history a bit.

My thoughts about having this specification field were that it would just simplify things for the user bcs they don't have to manage a separate specification object.

I think we actually agree on how the workflow should look from the user perspective: most of the time the user works with the parameter table representation - both for specifying the model and processing the results.
And, most likely, it is most convenient for the user to work with the parameter table as a DataFrame object, since it has all the data manipulation tools that the ParameterTable type lacks.
My point was just that since the conversion to/from the ParameterTable representation is trivial and cheap, we don't actually have to store it -- we can just generate it on the fly.
For the user it is fairly easy to integrate that minimal ParameterTable with just the parameter estimates into the existing SEM data frame using innerjoin() -- it also automatically takes care of data integrity.
With that we also avoid the information flow loop: we don't store the results of model fitting in the same object that was used as an input.
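The innerjoin() workflow sketched in the comment above could look roughly like this with DataFrames.jl; the column names (:from, :to, :estimate, :metadata) are purely illustrative, not the package's actual parameter-table schema:

```julia
using DataFrames

# Existing user-side SEM table with model-specific metadata (illustrative columns).
sem_df = DataFrame(from = [:f1, :f1], to = [:x1, :x2], metadata = ["a", "b"])

# Minimal table of parameter estimates, e.g. generated on the fly from a SemFit.
estimates = DataFrame(from = [:f1, :f1], to = [:x1, :x2], estimate = [0.8, 0.6])

# innerjoin matches rows by the parameter identity columns, so the estimates
# attach to the right rows regardless of row order, and unmatched rows are
# dropped -- which is what guards integrity here.
result = innerjoin(sem_df, estimates, on = [:from, :to])
```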

@Maximilian-Stefan-Ernst
Collaborator

Thanks! I added one last commit, if this one is fine with you, I would merge.

Also, the workflow you suggest seems like a good idea - so maybe we write a function like export_params(sem_fit) that generates a DataFrame on the fly, and have functions like sem_summary(sem_fit) create the partable on the fly.

@alyst
Contributor Author

alyst commented Jan 2, 2025

Thanks! I added one last commit, if this one is fine with you, I would merge.

Thank you!

@alyst alyst force-pushed the enhance_data_import branch from 0d255ea to 8c7cd14 Compare January 6, 2025 19:20
v0.7 changed the diff interface (v0.6 was skipped)
@alyst alyst force-pushed the enhance_data_import branch from 3fc0a0e to ff67cf7 Compare January 6, 2025 19:46
@alyst
Contributor Author

alyst commented Jan 6, 2025

I've realized that the tests were failing with the commit that switches Sem to ParameterTable by default. I've fixed it, but I think it is another indication that SEM.jl should move away from kwarg-based constructors.
Julia does not dispatch on the names or types of keyword arguments, so if there is a kw-only ctor for a type, it has to be the single kw-only ctor.
But that limits how extensible the package is.
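The dispatch limitation mentioned above is plain Julia semantics and easy to demonstrate: two keyword-only methods with the same positional signature silently overwrite each other, because keyword names and types never participate in method selection (kwctor here is just a stand-in name):

```julia
# Keyword arguments do not take part in Julia's method dispatch: both
# definitions below have the same (empty) positional signature, so the
# second silently replaces the first instead of adding a new method.
kwctor(; spec::Symbol) = :symbol_version
kwctor(; spec::String) = :string_version   # overwrites the method above

kwctor(spec = "hi")    # works, returns :string_version
# kwctor(spec = :hi)   # TypeError: only the latest definition survives
```

This is why a package that wants extension points cannot offer several kw-only constructors for one type; they would clobber each other.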

@alyst
Contributor Author

alyst commented Jan 6, 2025

ProximalAlgorithms.jl was updated to 0.7 yesterday (from 0.5.4).
It looks like the new version changed how user-defined functions should provide their gradients, so I have restricted ProximalAlgorithms to 0.5 for now.

@Maximilian-Stefan-Ernst
Collaborator

Thanks a lot for the fix!

@Maximilian-Stefan-Ernst Maximilian-Stefan-Ernst merged commit d3915c1 into StructuralEquationModels:devel Jan 9, 2025
2 of 3 checks passed
@alyst alyst deleted the enhance_data_import branch January 9, 2025 16:39