[Performance]: Diagnose slowdown on the spec decode branch

### Report of performance regression

@litone01 reported a performance regression on the spec decode branch (https://github.com/vllm-project/vllm/pull/24322): models run slower than on main even when spec decode is not enabled.

His branch: https://github.com/litone01/vllm/tree/origin/feature/spec-decode-draft-model-debug
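
To pin down the size of the slowdown, one option is to time the same workload on both checkouts and compare medians. The snippet below is only a minimal sketch using the Python standard library; `time_runs` and `slowdown` are hypothetical helper names, and the workload passed in (for example, a fixed-prompt generation call with spec decode disabled) is an assumption to be filled in per checkout.

```python
import statistics
import time
from typing import Callable, List


def time_runs(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> List[float]:
    """Return wall-clock durations (seconds) of `repeats` calls to `fn`.

    `fn` is a placeholder for the workload under test; warmup calls are
    executed first and their timings are discarded.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def slowdown(baseline: List[float], candidate: List[float]) -> float:
    """Ratio of median candidate time to median baseline time (>1 means slower)."""
    return statistics.median(candidate) / statistics.median(baseline)
```

Running `time_runs` with an identical workload on main and on the branch, then passing both sample lists to `slowdown`, gives a single ratio that can be posted here and re-checked after each candidate fix.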

This issue tracks progress on diagnosing the regression.