Skip to content

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Jul 31, 2025

We have long expected users to be explicit about the library name for ccall, and the @ccall macro has even always enforced that. That means users should have already been using explicit syntax, even though it wasn't strictly enforced. And indeed, the other syntax forms weren't handled reliably anyways (since doing so would require linearizing IR if and only if the runtime values required it, which is not something that is computable, and thus was often done wrong). This now intends to align the runtime and compiler to expect only those syntax forms that we could reliably handle in the past without errors, and adds explicit errors for other cases, most of which we previously knew would be unreliable due to reliance upon inference in making particular decisions for the semantics. The ccall function is already very special since it is more like a actual macro (it does not exist as a binding or value), so we can make unusual syntax decisions like this, mirroring @ccall also.

This fixes #57931, mostly by restricting the set of things that are allowed to the set of things that have an obvious and pre-existing behavior to be guaranteed to do that behavior. The hope is to PkgEval this to check if any packages are doing something unusual and see if we even need to allow anything else.

This drops support for #37123, since we were going to use that for LazyLibraries, be we decided that approach was quite buggy and that PR would make static compilation quite impossible to support, so we instead actually implemented LazyLibraries with a different approach. It could be re-enabled, but we never had correct lowering or inference support for it, so it is presumably still unused.

The goal is to cause breakage only where the package authors really failed to express intent with syntax, and otherwise to explicitly maintain support by adding cases to normalize the given syntax into one of the supported cases. All existing functionality (and more) can be accessed by explicit management of a pointer or by a LazyLibrary-like type, so this shouldn't cause any reduction in possible functionality, just possibly altered syntax.

In followup work, we may want to apply some similar improvements to the cglobal semantics, though not nearly as urgent, since cglobal on a pointer is permitted as a complicated bitcast right now, which doesn't make sense and probably should be deprecated, so that there is no ambiguity with that function's meaning.

Mostly written by Claude

@vtjnash vtjnash added minor change Marginal behavior change acceptable for a minor release ffi foreign function interfaces, ccall, etc. backport 1.12 Change should be backported to release-1.12 labels Jul 31, 2025
@Keno
Copy link
Member

Keno commented Jul 31, 2025

Can we specify that the variable must be a constant binding (in the evaluation world) and have it participate in the invalidation? I don't know that I like the time traveling binding behavior. Seems like a nice change otherwise though.

@vtjnash vtjnash added the needs pkgeval Tests for all registered packages should be run with this change label Jul 31, 2025
@vtjnash
Copy link
Member Author

vtjnash commented Jul 31, 2025

No, far too much code (all JLLs before they are recompiled to be lazy) relies upon the dynamic behavior of allowing a non-constant binding to evaluate at runtime and cache. Basically, I don't consider this to be time traveling anymore either, but rather that it is intended to express equivalent behavior to there existing a separate const cache variable, which can be out of date (ahead or behind) the variable that is supposed to compute it, and thus enables explicitly rejecting the notion of time travel here (data never time travels, only bindings and functions):

const call_target = Ref(C_NULL)
global lib = "runtime value"
global name = "runtime value"
f() = @atomiconce call_target[] = dlsym(name, dlopen(lib))

@Keno
Copy link
Member

Keno commented Jul 31, 2025

Ok. Perhaps we could still invalidate on binding replacement though? I don't think that'd be much harder to implement, and gets us 90% of the way there (and non-const could be disallowed in the future or in strict mode).

@vtjnash
Copy link
Member Author

vtjnash commented Jul 31, 2025

I'm still not sure I quite follow the request. Are you suggesting we add a semantic dependence on data, or simply best-effort? I think we already will do a best effort attempt which should happen in most cases (particularly after Jeff's PR, and other related corrections to trimming)

@Keno
Copy link
Member

Keno commented Jul 31, 2025

My request is to make that invalidation a semantic guarantee by making the completely generic fallback path record the world age it was created in and then checking if the latest binding partition covers that world age.

@topolarity topolarity changed the title ccall: make distinction of pointer vs name a syntatic distinction ccall: make distinction of pointer vs name a syntactic distinction Jul 31, 2025
@topolarity
Copy link
Member

topolarity commented Jul 31, 2025

Thanks for taking the initiative!

This feels like a much more pragmatic (and complete) solution to #57931 than anything else we've implemented so far.

I think I prefer this to the more dynamic options proposed in #57931, esp. since those only covered the case of a dynamically-chosen library ccall'd with statically-evaluated ABI / types. That seems like a narrow use case (and is anyway already possible with a function pointer and some amount of manual caching, as you mention) so preserving our existing static-by-default semantics, as determined by syntax, feels like a much less radical change to me.

This drops support for #37123, since we were going to use that for LazyLibraries, be we decided that approach was quite buggy and that PR would make static compilation quite impossible to support

Why does this make static compilation impossible?

It doesn't seem like, e.g., the behavior proposed in solution (5) in #57931 is impossible to statically compile, assuming it is type-stable (but not necessarily value-stable) - but maybe you were referring just to the particular implementation in that PR?

@vtjnash
Copy link
Member Author

vtjnash commented Jul 31, 2025

Why does this make static compilation impossible?

It seemed mainly that we couldn't quite define whether it was a static name expression or a dynamic pointer so it couldn't be quite linearized so it was sort of an embedded eval call in some unknown world

if (nargs_tuple == 0)
jl_error("ccall function name cannot be empty tuple");
if (nargs_tuple > 2)
jl_error("ccall function name tuple can have at most 2 elements");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these kinds of syntax validity checks should be done by lowering.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are value checks (it is preparing to call jl_toplevel_eval), so they are not straightforward to have duplicated in both of our lowering implementations now

@JeffBezanson
Copy link
Member

I'm not sure we should do this. I'm willing to hold out for a version that eliminates static_eval completely and has simpler semantics.

@vtjnash
Copy link
Member Author

vtjnash commented Aug 1, 2025

This version does eliminate it semantically, though I didn't quite remove it, it only gets trivial cases now and only as an optimization (to peek at whether the expression is constant before emitting a branch to contain it). It is still used by cglobal, which I left as followup work.

@vtjnash vtjnash force-pushed the jn/syntax-verified-ccall branch from 8aed4d7 to cce193b Compare August 1, 2025 16:56
@topolarity
Copy link
Member

Unsurprisingly, looks like JuliaInterpreter and friends need some adjusting: https://buildkite.com/julialang/julia-master/builds/49600#01986719-957f-4f8a-9636-9470d50ae64b/836-1055

@vtjnash
Copy link
Member Author

vtjnash commented Aug 2, 2025

Yes, it looks like given the choices among possible buggy reimplementations of the ill-defined static_eval previously implemented here, it chose one of the buggiest versions (it just calls eval on the first argument and embeds the result into the new IR)

@vtjnash
Copy link
Member Author

vtjnash commented Aug 2, 2025

@nanosoldier runtests()

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

5 packages crashed on the previous version too.

✖ Packages that failed

988 packages failed only on the current version.

  • Package fails to precompile: 908 packages
  • Illegal method overwrites during precompilation: 13 packages
  • Package has test failures: 5 packages
  • Package tests unexpectedly errored: 21 packages
  • Networking-related issues were detected: 9 packages
  • There were unidentified errors: 2 packages
  • Tests became inactive: 1 packages
  • Test duration exceeded the time limit: 28 packages
  • Test log exceeded the size limit: 1 packages

1192 packages failed on the previous version too.

✔ Packages that passed tests

12 packages passed tests only on the current version.

  • Other: 12 packages

4832 packages passed tests on the previous version too.

~ Packages that at least loaded

3 packages successfully loaded only on the current version.

  • Other: 3 packages

2857 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

6 packages were skipped only on the current version.

  • Package could not be installed: 6 packages

904 packages were skipped on the previous version too.

@vtjnash
Copy link
Member Author

vtjnash commented Aug 2, 2025

It looks like a majority of those were due to a recent breaking change in FFTW (JuliaMath/FFTW.jl#207) which could just be partly reverted it seems (it had been in preparation for LazyLibraries, but then we went a different way with implementation of those, so now this should be reverted to actually support them)

@KristofferC KristofferC mentioned this pull request Aug 6, 2025
38 tasks
@KristofferC KristofferC mentioned this pull request Aug 19, 2025
27 tasks
vtjnash added a commit to JuliaMath/FFTW.jl that referenced this pull request Aug 25, 2025
…rary improvements (#318)

We had initially planned a breaking change design for LazyLibrary, which
this package was updated for, but later we realized a better design
which allows going back to the original version of this code. The
previous form is still very poorly supported in the compiler and was
still rather buggy internally.

Revert "Replace `__init__` with `OncePerProcess` on supported Julia versions (#316)"
This reverts commit dbcb5d2.

Revert "Make FFTW relocatable for PackageCompiler"
This reverts commit e96270a.

Refs: [ccall: make distinction of pointer vs name a syntactic distinction](JuliaLang/julia#59165)
@vtjnash
Copy link
Member Author

vtjnash commented Oct 3, 2025

@nanosoldier runtests(["CUDA_Driver", "ThreadPinning", "SHTns", "Walsh", "CGcoefficient", "Spinnaker", "Pardiso", "ArnoldiMethodTransformations", "BandwidthBenchmark", "GeoIP", "RegressionDynamicCausalModeling", "SmallCombinatorics", "Hamburg", "InMemoryDatasets", "LongDatasetSort", "PositiveIntegrators", "CxxWrap", "CartesianJoin", "MDEStudy", "Pingouin", "PowerNetworkMatrices", "SurveyDataWeighting", "CountingChambers", "ModeCouplingTheory", "ParticleInCell", "TruncatedMVN", "SkeelBerzins", "DLMReader", "AlgebraicMultigrid", "PowerFlows", "DiffEqGPU", "OrdinaryDiffEqExtrapolation", "AsapOptim", "StatisticalGraphics", "GLM", "BytePairEncoding", "AMGCLWrap", "GAP", "OrdinaryDiffEqExponentialRK", "OrdinaryDiffEqNonlinearSolve", "SteadyWaves", "Bcube", "SimpleImplicitDiscreteSolve", "BoundaryValueDiffEqAscher", "BoundaryValueDiffEqCore", "VortexStepMethod", "MusicProcessing", "OrdinaryDiffEqDefault", "OrdinaryDiffEqPDIRK", "DiffEqPhysics", "BcubeVTK", "OrdinaryDiffEqIMEXMultistep", "OrdinaryDiffEqSDIRK", "OrdinaryDiffEqFIRK", "OrdinaryDiffEqStabilizedIRK", "OrdinaryDiffEqBDF", "Omniscape", "PowerSimulationsDynamics", "BoundaryValueDiffEqMIRKN", "BoundaryValueDiffEqShooting", "FerriteMultigrid", "BcubeCGNS", "BoundaryValueDiffEqMIRK", "MAiNGO", "GModelFit", "DecomposingPolynomialSystems", "ProbNumDiffEq", "VoronoiFVM", "GroebnerWalk", "GModelFitViewer", "BcubeGmsh", "PowerSimulations", "PeriodicOrbits", "Einstein", "Circuitscape", "ONSAS", "Fortuna", "LiquidElectrolytes", "MGVI", "GenericCharacterTables", "MatterPower", "Trebuchet", "DiffEqFinancial", "HydroPowerSimulations", "NBodySimulator", "StorageSystemsSimulations", "PowerAnalytics", "Consensus", "Tapestree", "Globtim", "Yunir", "QSFit", "IonSim", "SphericalClusterMass", "QuantumCumulants", "Biofilm", "QuantumDynamics", "MixedComplementarityProblems", "KinematicDriver", "GalacticPotentials", "Octofitter", "OctofitterRadialVelocity", "SpectralDistances"])

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

✖ Packages that failed

24 packages failed only on the current version.

  • Package fails to precompile: 15 packages
  • Illegal method overwrites during precompilation: 1 packages
  • Package tests unexpectedly errored: 7 packages
  • Test duration exceeded the time limit: 1 packages

74 packages failed on the previous version too.

✔ Packages that passed tests

1 packages passed tests only on the current version.

  • Other: 1 packages

➖ Packages that were skipped altogether

4 packages were skipped on the previous version too.

@vtjnash vtjnash force-pushed the jn/syntax-verified-ccall branch from b251dac to 159a24f Compare October 7, 2025 15:48
@oscardssmith
Copy link
Member

Is there a reason this needs backporting?

@vtjnash
Copy link
Member Author

vtjnash commented Oct 7, 2025

@nanosoldier runtests(["CGcoefficient", "Walsh", "Spinnaker", "Pardiso", "ArnoldiMethodTransformations", "GeoIP", "SmallCombinatorics", "Hamburg", "LongDatasetSort", "CxxWrap", "CartesianJoin", "Pingouin", "DLMReader", "StatisticalGraphics", "BytePairEncoding", "GLM", "AMGCLWrap", "GAP", "SteadyWaves", "DecomposingPolynomialSystems", "MAiNGO", "GenericCharacterTables", "ONSAS", "Globtim"])

@nanosoldier
Copy link
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

✖ Packages that failed

15 packages failed only on the current version.

  • Package tests unexpectedly errored: 14 packages
  • There were unidentified errors: 1 packages

9 packages failed on the previous version too.

@vtjnash
Copy link
Member Author

vtjnash commented Oct 8, 2025

Good. Now only the desired (ambiguous or unexpected method call) failures. And about half are just from having a dependency on a failing package (e.g. because they pinned to an old version of LinearSolve), so fixing those half will fix the rest.

Package tests unexpectedly errored: 14 packages
Package	Version	Primary	Against	History (9-7 to 10-6)
CxxWrap	v0.17.3	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
AMGCLWrap	v2.1.0	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
Pardiso	v1.1.0	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
GAP	v0.15.3	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
Spinnaker	v1.2.0	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
ArnoldiMethodTransformations	v0.1.3	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
CGcoefficient	v0.3.3	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
Walsh	v0.1.0	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
SmallCombinatorics	v0.1.1	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
DecomposingPolynomialSystems	v1.0.1	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
SteadyWaves	v1.1.1	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
Globtim	v0.1.3	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
MAiNGO	v0.2.2	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▇
GenericCharacterTables	v0.6.0	fail	test	▇▇▇▇▇▇▇▇▇▇▇▇▅

We have long expected users to be explicit about the library name for
`ccall`, and the `@ccall` macro has even always enforced that. That
means users should have already been using explicit syntax. And indeed,
other syntax forms weren't handled reliably (since doing so would
require linearizing IR if and only if the runtime values required it,
which is not something that is computable, and thus was often done
wrong). This now aligns the runtime and compiler to expect only syntax
forms that we could reliably handle before without errors, and adds
explicit errors for cases that we previously knew would be unreliable
because they relied upon inference making particular decisions for the
semantics. The `ccall` function is already very special since it is more
like a actual macro (it does not exist as a binding or value), so we can
make unusual syntax decisions like this, mirroring `@ccall`.

This drops support for #37123,
since we were going to use that for LazyLibraries, be we decided that
approach was quite buggy and that PR would make static compilation quite
impossible to support, so we instead actually implemented LazyLibraries
with a different approach. It could be re-enabled, but we never had
correct lowering or inference support for it, so it is presumably
still unused.

The goal is to cause breakage only where the package authors really
failed to express intent with syntax, and otherwise to explicitly
maintain support by adding cases to normalize the given syntax into one
of the supported cases. All existing functionality (and more) can be
accessed by explicit management of a pointer or by a LazyLibrary-like
type, so this shouldn't cause any reduction in possible functionality,
just possibly altered syntax.
@vtjnash vtjnash force-pushed the jn/syntax-verified-ccall branch from 159a24f to 8441383 Compare October 8, 2025 21:30
@vtjnash vtjnash merged commit 829666d into master Oct 9, 2025
4 of 7 checks passed
@vtjnash vtjnash deleted the jn/syntax-verified-ccall branch October 9, 2025 20:00
@lgoettgens
Copy link
Contributor

lgoettgens commented Oct 9, 2025

@vtjnash Could you give any pointer how to fix the nanosoldier failures? I am currently looking into GAP.jl, which, according to the log, fails around https://github.com/oscar-system/GAP.jl/blob/f50b8d0473cf13f196a9a90507614215692dc772/src/ccalls.jl#L21
However, there is no use of ccall in that function. All places use @ccall which (if I understand the initial description correctly) shouldn't be affected by this change at all.

cc @fingolfin

@vtjnash
Copy link
Member Author

vtjnash commented Oct 10, 2025

Function calls (JuliaInterface_path()) was an experimental feature we introduced a couple years ago, but it never really worked correctly, and then we replaced it with LazyLibraries (overloads of Libdl.dlopen). So switch to LazyLibraries (or roll your own with dlopen overloads on your own lazy library like type). I can send along a PR if you remind me next week.

@lgoettgens
Copy link
Contributor

Since this is marked as a "minor change" and it seems to break some packages in practice (like CxxWrap and a few other ones mentioned in #59165 (comment)), I don't think this should be backported to 1.12.
Backporting this would require the above mentioned packages to be adapted and released before julia 1.12.1 gets released.

cc @KristofferC

@oscardssmith oscardssmith removed the backport 1.12 Change should be backported to release-1.12 label Oct 10, 2025
@fingolfin
Copy link
Member

So, wait... are you saying this PR now broke CxxWrap, and the 50+ packages depending on it, without any warning there (e.g. at least an issue that something has to be changed, with some hint as to what)?

That seems... suboptimal?

@vtjnash
Copy link
Member Author

vtjnash commented Oct 10, 2025

PkgEval only identified a bug in the CxxWrap tests (https://github.com/JuliaInterop/CxxWrap.jl/blob/ece5014a34f2c9434ad714ebf7e7fa2238c523ed/test/functions.jl#L246) and not in the package itself (so no dependencies were noted in PkgEval runs)

@fingolfin
Copy link
Member

Function calls (JuliaInterface_path()) was an experimental feature we introduced a couple years ago, but it never really worked correctly, and then we replaced it with LazyLibraries (overloads of Libdl.dlopen). So switch to LazyLibraries (or roll your own with dlopen overloads on your own lazy library like type). I can send along a PR if you remind me next week.

Huh, what? So looking at the Julia manual, it says:

In some cases, the exact name or path of the needed library is not known in advance and must be computed at run time. To handle such cases, the library component specification can be a function call, e.g. find_blas().dgemm. The call expression will be executed when the ccall itself is executed. However, it is assumed that the library location does not change once it is determined, so the result of the call can be cached and reused. Therefore, the number of times the expression executes is unspecified, and returning different values for multiple calls results in unspecified behavior.

Is this the same feature we are talking about? Because the one I am quoting describes pretty much what we use in GAP.jl as far as I can tell, and I don't see any references about it being "experimental". What am I missing?

@vtjnash
Copy link
Member Author

vtjnash commented Oct 10, 2025

Yes, it wasn't intended to be experimental, but it never worked correctly, and only a couple packages started attempting to use it (GAP being the main one). We replaced it almost immediately after attempting to introduce it with LazyLibraries, but hadn't fixed up any of the old code to error reliably, so it would just crash sometimes instead. This just makes the error path reliable.

@vtjnash
Copy link
Member Author

vtjnash commented Oct 10, 2025

You didn't need this whole PR, and shouldn't have ever really needed it, you just needed to remove the const from that line to be both forward and backwards compatible: oscar-system/GAP.jl#743

@lgoettgens
Copy link
Contributor

So switch to LazyLibraries (or roll your own with dlopen overloads on your own lazy library like type).

Libdl.LazyLibrary seems to be very sparingly documented, it does have a short docstring, but it is not mentioned anywhere else in the manual. I would have expected it to be mentioned at https://docs.julialang.org/en/v1.13-dev/stdlib/Libdl/

I can send along a PR if you remind me next week.

For GAP.jl, I already have a draft PR at oscar-system/GAP.jl#1267 where it would be great to hear your feedback.
But I am certain that @barche would be very happy if you send them a fix for CxxWrap.

@fingolfin
Copy link
Member

Yes, it wasn't intended to be experimental, but it never worked correctly, and only a couple packages started attempting to use it (GAP being the main one). We replaced it almost immediately after attempting to introduce it with LazyLibraries, but hadn't fixed up any of the old code to error reliably, so it would just crash sometimes instead. This just makes the error path reliable.

Well, it did work well for GAP, at least. And while you are right, working code could have been had easier, this was based on measuring code performance.

Anyway, we'll deal with it somehow.

BTW at the top you wrote

@ccall macro has even always enforced that

which I don't understand, because we have been and are using @ccall and got no warning from that either? Am I misunderstanding this somehow?


But what riles me up about this is the way it was handled: a feature was added years ago to Julia, and documented, with no hint of it being experimental. Then apparently issues were found and it now is being yanked. This is a breaking change. While I understand it may be necessary, I also think such a thing should then always be communicated very clearly, in particular:

  • a prominent note in the NEWS that the feature was removed due to unresolvable issues (and because hardly anyone uses it)
  • a hint what to do instead, and possibly an offer to help people migrate away from it
  • if things like LazyLibraries are mentioned as a possible solution, some more info as to what that even is / where to learn more about it should be included. As it is, I just searched the Julia docs, grepped the Julia git repo, and did a web search for "julia LazyLibraries"; the only hit I found was to https://docs.julialang.org/en/v1/NEWS/ which did not mention it, but I then figured out I had to manually switch to the 1.11 version; that then made me realize I need to search for LazyLibrary. That then produced a single hit in the Julia documentation, to https://docs.julialang.org/en/v1.12/devdocs/locks/ (??). But at least it made me realize it is something in Libdl, so I went to https://docs.julialang.org/en/v1.12/stdlib/Libdl/ but it is not mentioned there either. So back to looking at the Julia git repo, and ahh, there is a docstring Libdl.LazyLibrary. But since it is not in the Julia manual, I have to assume this now is an experimental feature...

So, again: I can live with this change, but I really think this kind of communication breakdown is a good example for why Julia sometimes gets a bad rep on HackerNews :-(.

Foremost, I'd suggest to avoid any hints or implications that it is somehow the user's fault for using this supposed "experimental" feature, which was actually documented as a regular feature, or blame for using a barely documented supposed "replacement" feature.

@vtjnash
Copy link
Member Author

vtjnash commented Oct 15, 2025

I hesitated to mention it too much in NEWS, since it was just a couple packages that had started to use it, so it wasn't broadly applicable to all audiences. I apologize for not being more proactive to let you guys know what changes to expect.

I didn't realize LazyLibraries hadn't gotten more than a passing mention in the documentation. Hopefully this should help address that deficiency: #59841

barche added a commit to JuliaInterop/CxxWrap.jl that referenced this pull request Oct 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ffi foreign function interfaces, ccall, etc. minor change Marginal behavior change acceptable for a minor release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify ccall evaluation semantics

8 participants