Skip to content

Queries

Andrey Starodubtsev edited this page May 5, 2022 · 8 revisions

This page provides a list of domain-specific queries, which should be supported by swh API.

The queries serve practical use, should be expressed in Gremlin, and assessed in terms of performance.

Gremlin-Java language variant is used in the queries.

Performance measurements

The python3k dataset is used to run the queries and assess the performance. The graph contains 46 million vertices, and 1.2 billion edges. The queries are run on Granet. N random starting points are generated, and an average taken over the execution time. To account for cold/hot starts an average is taken over M iterations.


The current API supports a number of queries, implemented natively.

Leaves

Performs a graph traversal and returns the leaves of the subgraph rooted at the specified source node.

g.V().not(__.in())         // find roots
 .repeat(__.out().dedup()) // recursively branch out, filtering out visited vertices
 .until(__.not(__.out()))  // stop at leaf

Performance

Type Time Notes
Native with boolean array 1m 21s --
Native with Set 1m 57s Needs 4-6gb heap space
Gremlin 7m 41s Needs >6gb heap space

Earliest containing revision / origin

  1. Given a dir or a file find the oldest revision which contains it.
g.V(v)                                               // start from CNT/DIR vertex
 .repeat(__.in().dedup())                            // recursively go to parent, ignoring visited vertices
 .emit(__.hasLabel("REV"))                           // collect all encountered REV vertices
 .dedup()                                            // remove duplicates
 .order().by("author_timestamp", Order.asc).limit(1) // sort by 'author_timestamp' and take first
Samples Iters Average Max
1000 3 814ms 9s

ecr-plot

  1. Extension of (1) returning one (random) origin containing the revision returned by (1).
 ...                        // same as (1) + additional visited set
 .repeat(__.in().dedup())   // recursively go to parent, ignoring visited vertices
 .until(__.hasLabel("ORI")) // stop when origin is found

Variants:

  1. same as (1) but not considering revisions older than a given threshold
g.V(v)
 .repeat(__.in().dedup())
 .emit(__.hasLabel("REV")
         .has("author_timestamp", P.lt(max))) // emit if 'author_timestamp' is before threshold
 .dedup()
  1. same as (1) but returning the N oldest revisions
 ...       // same as (1), but without limit(1) step
 .limit(n) // return first n results instead of 1

Transitive closure (vault)

  1. Given a revision node, return the entire directory tree rooted at it, in a format isomorphic to ls -lR (i.e., list all complete paths together with all known associated properties: file names and permissions)
g.V(root)                                              // start with revision or directory
 .choose(__.hasLabel("REV"), __.out().hasLabel("DIR")) // if the start point is a revision, make one step to associated dir
 .repeat(__.outE()                                     // recursively branch out (include edges)
           .inV().choose(__.hasLabel("REV"), __.out().hasLabel("DIR"))) // for submodules make one step to associated dir
 .emit()                                               // emit every walked vertex
 .path()                                               // get all entities walked until reaching this vertex
 .map(__.unfold()
        .<SwhProperties.DirEntryString[]>values("dir_entry_str") // map path elements to 'dir_entry_str' property (with filename converted from id to string)
        .fold()
 .flatMap(path -> {
     List<SwhProperties.DirEntryString[]> pathDirs = path.get();
     StringBuilder dir = new StringBuilder();
     for (int i = 0; i < pathDirs.size() - 1; i++) {
         dir.append(pathDirs.get(i)[0].filename) // parent path should not have duplicate edges
            .append("/");
     }
     SwhProperties.DirEntryString[] last = pathDirs.get(pathDirs.size() - 1);
     if (last.length == 1) {
         var entry = last[0];
         return List.of(String.format("%s%s [perms: %s]", dir, entry.filename, entry.permission))
                    .iterator();
     }
     List<String> res = new ArrayList<>();
     for (SwhProperties.DirEntryString entry : last) {
         res.add(String.format("%s%s [perms: %s]", dir, entry.filename, entry.permission));
     }
     return res.iterator();
 })
Samples Iters Average Max
1000 3 851ms (0.05 ms/el) 12s (0.07 ms/el)

rec-paths-plot

  1. Same as (1), but starting from a snapshot node and stopping listing at revision nodes. Format: src -> dst; for snp -> * edges also include the branch name. Note: this is the moral equivalent of a git log.
g.withSideEffect("e", new HashSet<>())
 .V(snapshot)
 .repeat(__.outE().where(P.without("e"))                // recursively branch out on new edges, skipping visited ones
           .where(__.inV().hasLabel("REV", "REL"))      // leave only revision and release vertices
           .aggregate("e")                              // save matching edges
           .inV().dedup()                               // branch out, avoiding duplicates
 .<Edge>cap("e")                                        // return the set of matching edges
 .unfold()                                              // convert from collection to individual per edge traversers
 .elementMap("filenames")                               // map edges to filenames array
 .flatMap(edgeElementMapTraverser -> {                  // convert edge to all listed files
     Map<Object, Object> edgeElementMap = edgeElementMapTraverser.get();
     long outId = (long) ((Map<Object, Object>) edgeElementMap.get(Direction.OUT)).get(T.id);
     long inId = (long) ((Map<Object, Object>) edgeElementMap.get(Direction.IN)).get(T.id);
     String outLabel = (String) ((Map<Object, Object>) edgeElementMap.get(Direction.OUT)).get(T.label);

     String edgeStr = String.format("(%s -> %s)", outId, inId);
     if (outLabel.equals("SNP")) {
         String[] branches = (String[]) edgeElementMap.get("filenames");
         List<String> res = new ArrayList<>(branches.length);
         for (String branch : branches) {
             res.add(edgeStr + " " + branch);
         }
         return res.iterator();
     }
     return List.of(edgeStr).iterator();
 })
Samples Iters Average Max
1000 3 29ms (0.02 ms/el) 2s (0.01 ms/el)

snp-plot


Takedown notice support (unique contributions by a given origin)

Given an origin, list all nodes that are not found in any other origin.

slow

g.withSideEffect("candidates", new HashSet<Vertex>())                             // will store vertices, reachable from starting origin
 .withSideEffect("others", new HashSet<Vertex>())                                 // will store vertices, reachable from other origins
 .withSideEffect("v", new HashSet<Vertex>())
 .V(origin)                                                                       // start from the origin
 .repeat(__.out().dedup().where(P.without("candidates")).aggregate("candidates")) // recursively branch out and collect the subtree in 'candidates'
 .until(__.not(__.out())).dedup()                                                 // emit all leafs
 .repeat(__.in().dedup().where(P.without("v")).aggregate("v"))                    // recursively branch up from leafs
 .until(__.hasLabel("ORI")).dedup()                                               // emit all origins reachable from leafs
 .filter(__.not(__.id().is(origin)))                                              // remove the starting origin
 .repeat(__.out().dedup().where(P.without("others")).aggregate("others"))         // collect all vertices, reachable from other origins
 .cap("candidates").<Vertex>unfold().where(P.without("others"))                   // return original subtree without vertices reachable from other origins

git blame

Given an origin and a path like a/b/c list all pairs (commit_id, blob_id) that existed at that path, sorted by reverse chronological order. (Ignore other object types if they ever incarnated that path.)


git bisect (WIP)

to be specified...

iteratively narrow down the commit where a given property holds

Find forks of a given repository

Based on "shared-commit" definition of https://dl.acm.org/doi/abs/10.1145/3379597.3387450

  1. Get all the "root commits" of a given origin
  2. Get all the origins reachable from these root commits in the backward graph