Skip to content

Conversation

@ariostas
Copy link
Contributor

@ariostas ariostas commented Apr 17, 2025

This is still a work in progress, but I'm opening this PR so I can ask questions about my changes, since I'm not familiar with how Coffea works.

This PR depends on scikit-hep/uproot5#1419

@lgray
Copy link
Collaborator

lgray commented Apr 17, 2025

Oh, cool! Thanks for starting this!

@lgray
Copy link
Collaborator

lgray commented Jul 29, 2025

Are we in a spot where we can start to wrap up this functionality @ariostas?

@ariostas
Copy link
Contributor Author

ariostas commented Jul 29, 2025

I think so. I was told that there was some major-ish refactoring underway and that I should wait. If that's done, I'm happy to get back to looking into this.

Although I'll probably wait a couple weeks to see what happens on the CMSSW side.

@ariostas
Copy link
Contributor Author

ariostas commented Aug 7, 2025

Quick update on this. After cms-sw/cmssw#48071 we can now produce RNTuple NanoAODs with CMSSW. The format is not final, but we can probably start playing around with it.

@lgray
Copy link
Collaborator

lgray commented Aug 28, 2025

@ariostas do you have anything further you'd like to do with this PR, shall we mark it ready for review?

It needs a rebase as well, there are some conflicts.

@ariostas
Copy link
Contributor Author

I'll spend some time on this next week. It probably mostly works, but I might need to make some changes on the Uproot side.

@lgray
Copy link
Collaborator

lgray commented Aug 29, 2025

@ariostas OK - great, thanks! Please keep me informed!

@ariostas ariostas force-pushed the ariostas/rntuple_support branch from 6ba606c to ffcea34 Compare September 5, 2025 19:24
@ariostas
Copy link
Contributor Author

ariostas commented Sep 5, 2025

Quick update on this. Now that coffea switched to virtual arrays by default, it seems like we don't need any changes on the Uproot side.

I got it to read the form correctly, but it is failing to materialize the arrays. I'm guessing that it's because it's assuming that all fields are at the top level, but I'll have to look into it.

@lgray
Copy link
Collaborator

lgray commented Sep 5, 2025

Awesome - thanks for coming back to this.

@ariostas
Copy link
Contributor Author

Another quick update on this. The ROOT team agreed to add the RNTuple extension that I proposed, so we will be able to have neatly organized NanoAOD RNTuples. It will take a while until we see them, but we will eventually.

As for this PR, next week I'll focus on finishing it so that at least the NanoAODs produced with the RNTuple converter work.

@lgray
Copy link
Collaborator

lgray commented Oct 10, 2025

Hi @ariostas - it has been two weeks, any word? Thanks!

@ariostas
Copy link
Contributor Author

Not yet, sorry. I find the Coffea code a little confusing, so I'll try to have a chat with Iason to figure out what's needed. In the meantime, I've been working on adding the option of reading virtual arrays directly in Uproot, so hopefully that will simplify things.

@ikrommyd
Copy link
Collaborator

@lgray We found with @ariostas that hopefully the issue is that extract_base_form now has a continue statement in the case of RNTuple so that the rest doesn't run. The rest however does extract the form from the branch interpretation and runs lazify_form which brings the form to something that the DSL of coffea expects. Hopefully once we have a 1-1 mapping of extract_base_form between TTree and RNTuple, this will work.

@ariostas
Copy link
Contributor Author

Thanks so much, @ikrommyd! Surprisingly, even NanoAODSchema seems to work now. I added a couple of basic tests. I'll check the status of PHYSLITE and report back.

@ariostas
Copy link
Contributor Author

Now it works for PHYSLITE too. So I think now it's just a matter of doing a bit of cleanup and maybe adding some more tests?

@ikrommyd
Copy link
Collaborator

ikrommyd commented Oct 14, 2025

@ariostas Not quite..

events = NanoEventsFactory.from_root({"tests/samples/nano_dimuon_rntuple.root": "Events"}, schemaclass=BaseSchema, mode="eager").events()

is giving me

IndexError: cannot slice NumpyArray (of length 289) with 'CorrT1METJet_area': not an array of records

This error occurred while calling

    ak.from_buffers(
        {'class': 'RecordArray', 'fields': ['run', 'luminosityBlock', 'event'...
        40
        <coffea.nanoevents.mapping.uproot.UprootSourceMapping object at 0x710...
        buffer_key = partial-instance
        behavior = {'Systematic': <class 'coffea.nanoevents.methods.base.Syst...
        attrs = {'@events_factory': <coffea.nanoevents.factory.NanoEventsFact...
        allow_noncanonical_form = True
    )

@ikrommyd
Copy link
Collaborator

mode="eager" in from_root enables eager reading btw.

@ikrommyd
Copy link
Collaborator

ikrommyd commented Oct 14, 2025

@ariostas I added this print and I'm getting this out in the rntuple case

diff --git a/src/coffea/nanoevents/mapping/uproot.py b/src/coffea/nanoevents/mapping/uproot.py
index 58963d6e..f833bb9f 100644
--- a/src/coffea/nanoevents/mapping/uproot.py
+++ b/src/coffea/nanoevents/mapping/uproot.py
@@ -258,6 +258,7 @@ class UprootSourceMapping(BaseSourceMapping):
                     decompression_executor=self.decompression_executor,
                     interpretation_executor=self.interpretation_executor,
                 )
+            print(the_array)
         if isinstance(the_array.layout, awkward.contents.ListOffsetArray):
             the_array = awkward.Array(the_array.layout.to_ListOffsetArray64(True))
In [4]: events = NanoEventsFactory.from_root({"/home/iason/work/coffea_dev/coffea/tests/samples/nano_dimuon_rntuple.root": "Events"}, mode="eager").events()
/home/iason/Dropbox/work/coffea_dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_electronIdx => Electron
 warnings.warn(
/home/iason/Dropbox/work/coffea_dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_photonIdx => Photon
 warnings.warn(
[[{CorrT1METJet_area: 0.389, CorrT1METJet_eta: -0.23, ...}, ..., {...}], ...]
[[{CorrT1METJet_area: 0.389, CorrT1METJet_eta: -0.23, ...}, ..., {...}], ...]
[[{CorrT1METJet_area: 0.389, CorrT1METJet_eta: -0.23, ...}, ..., {...}], ...]
[[{CorrT1METJet_area: 0.389, CorrT1METJet_eta: -0.23, ...}, ..., {...}], ...]
[[{CorrT1METJet_area: 0.389, CorrT1METJet_eta: -0.23, ...}, ..., {...}], ...]
[[{CorrT1METJet_area: 0.389, CorrT1METJet_eta: -0.23, ...}, ..., {...}], ...]
[[0.389, 0.489, 0.499, 0.409, 0.519, ..., 0.509, 0.519, 0.509, 0.479], ...]

How is .array() returning record arrays?

@ariostas
Copy link
Contributor Author

Ah okay, yeah. I'll have to look into the other modes. Dask probably doesn't work either.

@ikrommyd
Copy link
Collaborator

ikrommyd commented Oct 15, 2025

Thanks @ariostas. Seems good at least for base schema and nanoaod schema

In [22]: ttree = NanoEventsFactory.from_root({"tests/samples/nano_dimuon.root": "Events"}, mode="eager", schemaclass=BaseSchema).events()

In [23]: rntuple = NanoEventsFactory.from_root({"tests/samples/nano_dimuon_rntuple.root": "Events"}, mode="eager", schemaclass=BaseSchema).events()

In [24]: ak.array_equal(rntuple, ttree, dtype_exact=False, check_parameters=False)
Out[24]: True

In [25]: rntuple = NanoEventsFactory.from_root({"tests/samples/nano_dimuon_rntuple.root": "Events"}, mode="eager", schemaclass=NanoAODSchema).events()
/Users/iason/Dropbox/work/coffea_dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_electronIdx => Electron
  warnings.warn(
/Users/iason/Dropbox/work/coffea_dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_photonIdx => Photon
  warnings.warn(

In [27]: ttree = NanoEventsFactory.from_root({"tests/samples/nano_dimuon.root": "Events"}, mode="eager", schemaclass=NanoAODSchema).events()
/Users/iason/Dropbox/work/coffea_dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_electronIdx => Electron
  warnings.warn(
/Users/iason/Dropbox/work/coffea_dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:264: RuntimeWarning: Missing cross-reference index for LowPtElectron_photonIdx => Photon
  warnings.warn(

In [28]: ak.array_equal(rntuple, ttree, dtype_exact=False, check_parameters=False)
Out[28]: True

Do you happen to know where the dtype mismatch comes from? Is it from the file conversion?

Also, is it possible to make the parameters match?

In [32]: ttree.layout.parameters
Out[32]:
{'__doc__': 'Events',
 '__record__': 'NanoEvents',
 'metadata': {'version': 'latest'}}

In [33]: rntuple.layout.parameters
Out[33]: {'__doc__': '', '__record__': 'NanoEvents', 'metadata': {'version': 'latest'}}

@ariostas
Copy link
Contributor Author

Both the dtype and parameter mismatches are due to the RNTuple Importer.

Seems like it's always converting the offsets (e.g. nElectrons) into int64 cardinalities (projections of offset column).

Also, for some reason it's not keeping the TTree title as the RNTuple description.

But anyway, since these are just artifacts of the files being converted, they won't be an issue once we have "real" generated RNTuples.

@ikrommyd
Copy link
Collaborator

Both the dtype and parameter mismatches are due to the RNTuple Importer.

Seems like it's always converting the offsets (e.g. nElectrons) into int64 cardinalities (projections of offset column).

Also, for some reason it's not keeping the TTree title as the RNTuple description.

But anyway, since these are just artifacts of the files being converted, they won't be an issue once we have "real" generated RNTuples.

Yeah that's totally fine. I was just wondering if it's from the conversion or the reading. If it's from the conversion, I'm happy.

@ikrommyd
Copy link
Collaborator

@ariostas Since you have the setup ready, is there any chance you could push rntuple versions of the other root files we use in tests? I would like to the test the different schemas.

@ariostas
Copy link
Contributor Author

I couldn't add samples for other schemas because the RNTupleImporter couldn't convert them since some classes didn't have dictionaries. This can be solved by including the right headers. But for example for PHYSLITE we should just use real samples instead of converted ones. I'll look into trimming the one mentioned in #1436.

@ikrommyd
Copy link
Collaborator

Sure. What makes our lives easier if we can use the existing files (but converted) is that we have an equality comparison right then and there and also some hard-coded values in the tests I believe. If you add new files, It's generally good to have a TTree version of the new files too to compare them to if that's possible.

@ikrommyd
Copy link
Collaborator

@ariostas I have added some testing which works fine for the nano samples but fails for treemaker. The files appear different

In [5]: len(uproot.open("tests/samples/treemaker.root")["PreSelection"].arrays().fields)
Out[5]: 391

In [6]: len(uproot.open("tests/samples/treemaker_rntuple.root")["PreSelection"].arrays().fields)
Out[6]: 338

Do you know why that is? Should I just test for the common fields one by one instead of asserting that the whole arrays are equal?

@ariostas
Copy link
Contributor Author

ariostas commented Oct 20, 2025

Seems like it's just that subbranches/subfields are handled a bit different. For TTrees, you can access directly arrays["Electrons.fCoordinates.fEta"], whereas for RNTuples you would do arrays.Electrons.fCoordinates.fEta.

@ikrommyd
Copy link
Collaborator

ikrommyd commented Oct 20, 2025

arrays.Electrons.fCoordinates.fEta

Is it an issue in how the arrays are being structured in uproot or is it inherent to the file? Just so I can know what to test for.

@ariostas
Copy link
Contributor Author

It's on the Uproot side. For RNTuples, I'm mirroring how Awkward does things since the file format itself is pretty much equivalent to Awkward layouts. So it makes a lot more sense to have arrays.Electrons.fCoordinates.fEta than a field named "Electrons.fCoordinates.fEta"

@ikrommyd
Copy link
Collaborator

It's on the Uproot side. For RNTuples, I'm mirroring how Awkward does things since the file format itself is pretty much equivalent to Awkward layouts. So it makes a lot more sense to have arrays.Electrons.fCoordinates.fEta than a field named "Electrons.fCoordinates.fEta"

Yeah I agree. So the answer is it will stay like that for RNTuple and not change for TTree. Okay in that case I can assert equality between a few fields that are common.

@ikrommyd
Copy link
Collaborator

ikrommyd commented Oct 20, 2025

I think I got the treemaker with baseschema test passing but because of this different structure of the treemaker rntuple, the treemaker schema cannot be tested. It just gives incorrect structure. That's out of scope of this PR though. It should be updated separately to comply to the RNTuple structure.

@ikrommyd
Copy link
Collaborator

ikrommyd commented Oct 20, 2025

I wouldn't like to wait longer on this PR. I think it currently shows that RNTuple reading worsks fine for baseschema and when the structure of the fields does not change when converting from TTree->RNTuple. If it does, schema updates are needed which is out of the scope of this PR.
I guess we can just add physlite testing and merge it. @ariostas any progress on getting a skimmed proper physlite sample?
Other than that, we add this as preliminary reading and then we just need delphes, edm4hep, etc.. samples to test with.

@lgray @nsmith- what do y'all think?

@ikrommyd ikrommyd marked this pull request as ready for review October 30, 2025 05:19
Copy link
Collaborator

@ikrommyd ikrommyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see anything harmful here. We can merge this to have some preliminary support and we can test more as more rntuple test files come in. LGTM!

@ikrommyd ikrommyd enabled auto-merge (squash) October 30, 2025 05:39
@ikrommyd ikrommyd merged commit 42fd67a into scikit-hep:master Oct 30, 2025
23 checks passed
@ariostas ariostas deleted the ariostas/rntuple_support branch October 30, 2025 13:43
@nsmith-
Copy link
Member

nsmith- commented Oct 30, 2025

Looks ok, sorry for being late!

@nsmith- nsmith- mentioned this pull request Oct 31, 2025
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants