[Relay] Dead code elimination pass blows up call stack #4534

@SWu

Description

This was not obvious to me at first: code that I had been running successfully suddenly started segfaulting after I upgraded my TVM build to the 0.6.0 release. What I found:

  • Even for relatively simple feed-forward graphs (a few convolutional layers + fully connected + output), the gradient function produced by Relay becomes fairly complex, especially in higher order mode.
  • In higher order mode, when running a PartialEvaluate() + DeadCodeElimination() pass, the dead code elimination segfaults under the default Ubuntu ulimit, but passes once the ulimit is raised to something very large.
  • Looking at the coredump in gdb, the stack at the time of the segfault is several thousand frames deep inside the recursive node traversal here: https://github.com/apache/incubator-tvm/blob/master/src/relay/pass/dead_code.cc#L131
  • For first order mode gradients, the VM compilation step also runs several DeadCodeElimination passes. For gradients of larger models (especially TensorFlow models, which insert a lot of transpose operations to make conv2d layers NCHW), the same stack overflow happens.
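
As a stopgap that does not require changing the shell ulimit, the pass can be run on a thread with a larger stack. This is a minimal sketch, assuming the deep recursion happens on the thread that calls the pass; `run_with_big_stack` is a hypothetical helper (not part of TVM) and the 256 MiB default is an arbitrary choice.

```python
import threading


def run_with_big_stack(fn, *args, stack_bytes=256 * 1024 * 1024):
    """Run fn(*args) on a new thread with an enlarged stack; return its result.

    Hypothetical helper, not a TVM API. stack_bytes is an assumed default.
    """
    result = {}

    def target():
        try:
            result["value"] = fn(*args)
        except BaseException as exc:  # hand the exception back to the caller
            result["error"] = exc

    # stack_size() only affects threads created after the call; it returns
    # the previous setting so it can be restored afterwards.
    old_size = threading.stack_size(stack_bytes)
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(old_size)

    if "error" in result:
        raise result["error"]
    return result["value"]
```

The idea would be to wrap whatever call runs the PartialEvaluate() + DeadCodeElimination() passes in `run_with_big_stack`. This only raises the ceiling; a large enough graph would still overflow.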

I think this is a regression from earlier versions. Either way:

  • It would be preferable to remove the recursion inside DCE, either by making it tail recursive so the compiler can optimize away the previous stack frames, or by converting it to use an explicit stack, so that users don't get this cryptic segfault.
  • If that's not possible, the documentation should include a note telling users to increase their ulimit.
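
To illustrate the explicit-stack option, here is a minimal sketch of a post-order traversal written both recursively and with a heap-allocated stack. The `Node` class and both function names are hypothetical stand-ins for illustration only; this is not the code in dead_code.cc, which is C++ and has its own visitor structure.

```python
class Node:
    """Hypothetical expression node: a name plus child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def visit_recursive(node, seen=None, order=None):
    """Post-order traversal; call depth grows with the depth of the graph."""
    seen = set() if seen is None else seen
    order = [] if order is None else order
    if id(node) in seen:
        return order
    seen.add(id(node))
    for child in node.children:
        visit_recursive(child, seen, order)
    order.append(node.name)
    return order


def visit_iterative(root):
    """Same post-order traversal, with the pending work kept in a list."""
    order = []
    seen = set()
    stack = [(root, False)]  # (node, children_already_pushed)
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All children were handled; emit the node itself.
            order.append(node.name)
            continue
        if id(node) in seen:
            continue  # shared subexpression, already visited
        seen.add(id(node))
        stack.append((node, True))
        # Push children in reverse so they are popped left to right.
        for child in reversed(node.children):
            stack.append((child, False))
    return order
```

In the iterative version the depth of the graph only affects the length of `stack`, which lives on the heap, so a chain of tens of thousands of nodes no longer depends on the thread's stack limit.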
