For each tool, the benchmark scripts should report:
- full execution time, i.e. including tool startup time
- execution time of the transaction simulation alone (this may not be possible for every tool)
- gas used
- a simple state verification, e.g. checking logs or traces (or perhaps we check this manually for now?)
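
The list above could be sketched as a small Python harness. This is only a rough illustration under stated assumptions, not a settled design: the names `BenchResult` and `run_benchmark` are hypothetical, and so is the idea that a tool prints a JSON object with `sim_time_s`, `gas_used`, and `logs` keys as its last line of output. Real tools will likely need a per-tool parser, and any field a tool cannot provide is left as `None`.

```python
import json
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BenchResult:
    tool: str
    total_s: float             # full wall-clock time, including tool startup
    sim_s: Optional[float]     # simulation-only time; None if the tool does not report it
    gas_used: Optional[int]    # None if the tool does not report it
    state_ok: Optional[bool]   # None means "not checked, verify manually"


def run_benchmark(
    tool: str,
    cmd: Sequence[str],
    expected_logs: Optional[int] = None,
) -> BenchResult:
    """Run one tool invocation and collect the metrics listed above."""
    # Full execution time: measured around the whole process, so startup is included.
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    total_s = time.perf_counter() - start

    # ASSUMPTION (hypothetical output format): the last stdout line is a JSON
    # object such as {"sim_time_s": 0.001, "gas_used": 21000, "logs": [...]}.
    report = {}
    lines = proc.stdout.strip().splitlines()
    if lines:
        try:
            parsed = json.loads(lines[-1])
            if isinstance(parsed, dict):
                report = parsed
        except json.JSONDecodeError:
            pass  # tool does not emit the assumed format; leave fields as None

    # Simple state verification: compare the number of emitted logs, if known.
    state_ok = None
    if expected_logs is not None and "logs" in report:
        state_ok = len(report["logs"]) == expected_logs

    return BenchResult(
        tool=tool,
        total_s=total_s,
        sim_s=report.get("sim_time_s"),
        gas_used=report.get("gas_used"),
        state_ok=state_ok,
    )


if __name__ == "__main__":
    # Stand-in "tool": a Python one-liner that prints the assumed JSON report.
    fake_tool = [
        sys.executable,
        "-c",
        "import json; print(json.dumps({'sim_time_s': 0.001, 'gas_used': 21000, 'logs': ['Transfer']}))",
    ]
    print(run_benchmark("fake-tool", fake_tool, expected_logs=1))
```

Measuring the full time from outside the process keeps that number comparable across tools. The simulation-only time has to come from the tool itself, which is why it is optional here.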