Skip to content

Conversation

@nevans
Copy link
Contributor

@nevans nevans commented Oct 30, 2024

This adds basic encoding/decoding for ruby 3.2 Data objects, using !ruby/data:ClassName (a mapping of member names to member values) and !ruby/data-with-ivars:ClassName tags (a mapping with members and ivars mappings). Like Marshal, this uses rb_struct_initialize to assign member values to new Data objects.


Originally (in the first commit), I updated Visitor::ToRuby to simply call data.send(:initialize, **members). But #initialize freezes the object, so we can't set instance variables afterwards.

The second commit sets instance variables before calling #initialize, which I think will lead to unhappy surprises for some users. Unfortunately, Data doesn't give us any other mechanism for setting the members... except for the C API.

The third commit works around the above issue by copying how Marshal loads Data objects, delegating ToRuby#init_struct(obj, members) to rb_struct_initialize. I think this is the least surprising behavior for users. This also bypasses any user-defined #initialize method (which may be seen as a feature or a bug).

@nevans
Copy link
Contributor Author

nevans commented Nov 1, 2024

FYI: I haven't put any guards in yet to make this work for older versions of ruby (before Data was defined).

@nevans
Copy link
Contributor Author

nevans commented Nov 5, 2024

I'll update this to pass for earlier versions of ruby.

But I'm also looking for feedback on the approach. The more I think about it, the more I'm thinking this should use rb_struct_initialize from the C API, then copy over any ivars, then freeze. Unless I'm misreading the C code (very possible), that will skip any user-defined #initialize override, and just use the underlying Struct implementation... and, although that's not identical to rb_data_initialize_m, its close enough and it's exported and in the headers. That's what Marshal does, since Marshal just treats Data like Struct... which also means that Marshal does not freeze Data objects!

@nevans nevans force-pushed the data-object-encoding branch from 3110d41 to 3a845fd Compare November 8, 2024 14:35
@nevans
Copy link
Contributor Author

nevans commented Nov 8, 2024

For what it's worth, I've updated the PR to:

  1. pass with ruby versions down to 2.5 (at least on my machine)
  2. use rb_struct_initialize from the C API

@nevans nevans force-pushed the data-object-encoding branch 2 times, most recently from 2292f0b to 325b073 Compare December 2, 2024 19:00
@nevans
Copy link
Contributor Author

nevans commented Dec 2, 2024

Sorry. I thought I had it passing before, but I must have made a mistake in my earlier tests with ruby 2.5 (perhaps I had somehow disabled warnings locally)? I've changed the guards from defined?(::Data.define) to RUBY_VERSION >= "3.1.0". Unfortunately, it's impossible to check if a method is defined without triggering deprecated constant warnings.

@nevans
Copy link
Contributor Author

nevans commented Dec 5, 2024

Any thoughts about this approach? (Is it too close to the 3.4.0 release to hope for getting support for Data in time for 3.4?)

@tenderlove
Copy link
Member

@nevans I'm not thrilled about adding more C code, but I understand why we need it. I think if we introduce some C code we'll have to figure out the JRuby side of things though. cc @headius

@headius
Copy link
Contributor

headius commented Jan 17, 2025

It would be possible to call JRuby's internal API from Ruby, but since we already have an extension for this library we can just add it there.

This feature and the tests provided are guarded to only happen on Ruby 3.2+, so JRuby 9.4 would not be affected (it is 3.1 compatible) and JRuby 10 (3.4 compatible) is unreleased so we have some time to add JRuby support.

@nevans
Copy link
Contributor Author

nevans commented Jan 18, 2025

@tenderlove Thanks. I did briefly consider abusing Marshal.dump/load with some evil string manipulation. If psych didn't already have a C extension, I probably would've experimented with that approach. 😜

@headius I think that I may have screwed up the ruby 3.2 guard, the last time I updated my branch. I need to rebase anyway, so I'll fix it and double check that the tests pass under older ruby versions.

This sets the ivars _before_ calling initialize, which feels wrong.  But
Data doesn't give us any mechanism for setting the members other than 1)
initialize, or 2) drop down into the C API.  Since initialize freezes
the object, we need to set the ivars before that.  I think this is a
reasonable compromise—if users need better handling, they can implement
their own `encode_with` and `init_with`.  But it will lead to unhappy
surprises for some users.

Alternatively, we could use the C API, similarly to Marshal.  Psych _is_
already using the C API for path2class and build_exception.  This would
be the least surprising behavior for users, I think.
@nevans nevans force-pushed the data-object-encoding branch from 325b073 to 3573fb3 Compare January 19, 2025 14:56
@nevans
Copy link
Contributor Author

nevans commented Jan 19, 2025

@tenderlove Okay, I rebased on the latest master and fixed the tests under ruby 3.1.

I also changed the tests to use omit "Data requires ruby >= 3.2" if RUBY_VERSION < "3.2", rather than conditionally defining the test methods. IMO, it's better to have the more explicit test output when skipping tests. But I'll change that back if you'd prefer clean test output for older ruby versions.

@tenderlove tenderlove merged commit 992297d into ruby:master May 1, 2025
74 checks passed
Comment on lines +29 to +31
VALUE args = rb_ary_new2(1);
rb_ary_push(args, attrs);
rb_struct_initialize(data, args);
Copy link
Member

@eregon eregon Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very brittle, and probably only works on CRuby due to this bug or similar: https://bugs.ruby-lang.org/issues/18632
i.e. this is passing [{ ... }] to rb_struct_initialize(), i.e. a positional Hash, not keyword arguments.
Normal behavior would be to treat that Hash as the value of the first member, and don't assign other members. But it seems to accidentally work.

It would be a lot safer to pass the values as a simple Array instead, and query the members from the class by calling struct_or_data_class.members.
From truffleruby/truffleruby#4012 (comment)

Copy link
Member

@eregon eregon Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to repro this in new C API specs but I actually can't, they fail on CRuby too:

      it "accepts a single Hash in an Array" do
        @klass = @s.rb_data_define(nil, "a", "b", "c")
        data = @klass.allocate
        @s.rb_struct_initialize(data, [{ a: 1, b: 2, c: 3 }]).should == nil
        data.a.should == 1
        data.b.should == 2
        data.c.should == 3
      end

gives

ruby 3.3.5 (2024-09-03 revision ef084cc8f4) [x86_64-linux]
2)
C-API Data function rb_struct_initialize accepts a single Hash in an Array FAILED
Expected {:a=>1, :b=>2, :c=>3} == 1
to be truthy but was false

From experimenting by tweaking those specs it only works when using keyword_init: true explicitly (like Struct.new(:a, :b, :c, keyword_init: true)) , and just never works for Data.

Maybe it "works" due to init_struct getting kwargs and accidentally propagating that flag to other C calls or so (just a guess).
Looks like it in https://github.com/ruby/ruby/blob/48c7f349f68846e10d60ae77ad299a38ee014479/struct.c#L766

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In marshal.c https://github.com/ruby/ruby/blob/48c7f349f68846e10d60ae77ad299a38ee014479/marshal.c#L2150-L2176
CRuby uses a Hash only if rb_struct_s_keyword_init(), in all other cases it uses an Array of values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since for this use case in Psych we only care about Data, not Struct, we should pass an Array of values (same as what Marshal does for Data, as far as I can tell rb_struct_s_keyword_init() is always false for Data classes).
@nevans Would you have time to fix this?

Copy link
Member

@eregon eregon Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thought is it would really be good to ask for rb_data_initialize() upstream at https://bugs.ruby-lang.org/, because reusing rb_struct_initialize() for Data feels like a pretty huge hack, given it doesn't run any function related to Data, only rb_struct_initialize_m() which is not meant to be used with Data.
Of course that can't be used now, but would enable a cleaner way in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix in #749

eregon added a commit to eregon/psych that referenced this pull request Nov 5, 2025
…s and not a Hash

* rb_struct_initialize() does not accept a Hash, and it's very brittle
  to pass `[{...}]` and to rely on that C function using rb_keyword_given_p().
  It basically worked accidentally, by having **members in the caller of the caller.
  Such logic when Struct#initialize is defined in Ruby (as in TruffleRuby) is basically impossible to implement,
  because it's incorrectly treating positional arguments as keyword arguments.
* rb_struct_initialize() is used in CRuby to set members of Data instances in marshal.c (there is no rb_data_initialize() yet).
  There, the code passes an Array of members values for Data (and for Struct which are not `keyword_init: true`):
  https://github.com/ruby/ruby/blob/48c7f349f68846e10d60ae77ad299a38ee014479/marshal.c#L2150-L2176
  So we should do the same in psych.
* Rename to init_data since it's only used for Data.
* See ruby#692 (comment).
@eregon
Copy link
Member

eregon commented Nov 5, 2025

Originally (in the first commit), I updated Visitor::ToRuby to simply call data.send(:initialize, **members). But #initialize freezes the object, so we can't set instance variables afterwards.

I thought maybe we could just not serialize ivars of Data instances and not try to set them, because the user-defined initialize is the only thing that can set ivars on Data instances (after it's frozen), so initialize will just set the ivars again with data.send(:initialize, **members).
However, that doesn't work because the values of these ivars may be mutable (like an Array), and such changes then would be lost.
Also if a value of these ivars is also referenced by other objects in the same Psych.dump then upon unsafe_load they would be separate objects.

IOW, yeah I think we need rb_struct_initialize() (until we have a better API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants