Skip to content

Conversation

sylwia-budzynska
Copy link
Contributor

👋 Several query examples in the CodeQL docs for analyzing global dataflow seem to be wrong, and don't work as expected. E.g. in docs for Python the query:

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
import semmle.python.dataflow.new.RemoteFlowSources
import semmle.python.Concepts

module RemoteToFileConfiguration implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source instanceof RemoteFlowSource
  }

  predicate isSink(DataFlow::Node sink) {
    sink = any(FileSystemAccess fa).getAPathArgument()
  }
}

module RemoteToFileFlow = TaintTracking::Global<RemoteToFileConfiguration>;

from DataFlow::Node input, DataFlow::Node fileAccess
where RemoteToFileFlow::flow(input, fileAccess)
select fileAccess, "This file access uses data from $@.",
  input, "user-controllable input."

Results of running the query:
image
This won’t show a path query, but a #select because:

  • @kind path-problem is missing
  • there is no import RemoteToFileFlow:PathGraph
  • The query should be using RemoteToFileFlow:PathNode types, not DataFlow::Node
from DataFlow::Node input, DataFlow::Node fileAccess
where RemoteToFileFlow::flow(input, fileAccess)
select fileAccess, "This file access uses data from $@.",
  input, "user-controllable input."
  • should be where RemoteToFileFlow::flowPath(input, fileAccess) instead of RemoteToFileFlow::flow(input, fileAccess)
  • the query selects the wrong number of columns for a path query, and should be select fileAccess.getNode(), input, fileAccess, "This file access uses data from $@.", input, "user-controllable input."
    Making the changes fixes the query to display like in screenshot:
image

This PR fixes this specific query. However, I noticed that very similar issues, like the ones I listed above, seem to exist in docs for other languages too, in the Analyzing data flow in [language] pages.

@sylwia-budzynska sylwia-budzynska requested a review from a team as a code owner October 9, 2025 17:42
@Copilot Copilot AI review requested due to automatic review settings October 9, 2025 17:42
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

michaelnebel
michaelnebel previously approved these changes Oct 10, 2025
Copy link
Contributor

@michaelnebel michaelnebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing!

Comment on lines +327 to 329
import RemoteToFileFlow::PathGraph
module RemoteToFileFlow = TaintTracking::Global<RemoteToFileConfiguration>;
Copy link
Contributor

@owen-mc owen-mc Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I normally put these two lines the other way round, so you define the module above where you import its submodule.

Suggested change
import RemoteToFileFlow::PathGraph
module RemoteToFileFlow = TaintTracking::Global<RemoteToFileConfiguration>;
module RemoteToFileFlow = TaintTracking::Global<RemoteToFileConfiguration>;
import RemoteToFileFlow::PathGraph

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick search for Go and python and the import comes after the module in all the cases I looked at (where they're both in the same file).

@owen-mc
Copy link
Contributor

owen-mc commented Oct 10, 2025

Thanks for fixing this @sylwia-budzynska . I will look into the docs for the other languages.

@owen-mc
Copy link
Contributor

owen-mc commented Oct 10, 2025

@sylwia-budzynska Actually, now I read the whole file, I see that it doesn't make sense to just fix this example. It's all been explained earlier on in terms of data flow nodes rather than path nodes. I think it would be better to keep most of it as it is, then at the end have an extra section on path problems which explains how to convert what you've already written into a path problem query, which shows you the whole path. What do you think? If you agree, I'm happy to take this over and do that consistently for all the languages.

Copy link
Contributor

@owen-mc owen-mc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I just found that we have good docs on making path queries, and that page links to all the language-specific "analyzing data flow" pages. I think the idea is to have the language-specific parts in separate pages and then explain the path query part once. So I don't think this example should be updated. However, I do think it would be good to have a short section at the end of each explaining that you can get information about the whole path and linking to the docs for that. I will do this in a new PR.

@owen-mc
Copy link
Contributor

owen-mc commented Oct 10, 2025

Here is my suggested approach: #20622

@sylwia-budzynska
Copy link
Contributor Author

sylwia-budzynska commented Oct 10, 2025

Actually, now I read the whole file, I see that it doesn't make sense to just fix this example. It's all been explained earlier on in terms of data flow nodes rather than path nodes.

Currently, the results of this specific query show as #select (see screenshot) which doesn't seem to be what's expected. Is there a reason the examples of global data flow use dataflow nodes rather than path nodes and a path query? Happy to be corrected here, I was a bit puzzled by this.

I do think it would be good to have a short section at the end of each explaining that you can get information about the whole path and linking to the docs for that. I will do this in a new PR.

I was a bit worried that we don't have a taint tracking path query example for each language in the docs. That sounds like a good solution 👍

@owen-mc
Copy link
Contributor

owen-mc commented Oct 10, 2025

Well, if you are using global flow as part of something else (like filtering out results if a safe value flows there) then you would just use DataFlow::Node and Flow::flow. Flow::PathNodes and Flow::flowPath are for when you are using that thing as the query and want to see the whole path. Most of the time you are doing that. I guess the main thing is that there is a lot of boilerplate involved (importing the PathGraph module, using PathNodes) that can be explained in one place for all languages, and the language-specific pages doesn't need to go into that.

@sylwia-budzynska
Copy link
Contributor Author

Well, if you are using global flow as part of something else (like filtering out results if a safe value flows there) then you would just use DataFlow::Node and Flow::flow.

Makes sense.

I'll close this PR then. I think adding path query examples to the global taint tracking in [language] docs would be very beneficial, especially for the times when teaching CodeQL to new users who might not know the differences between using a path query or global data flow without path. Thank you for taking the time to fix it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants