Slow tokenizer decode #26335

@peregilk

Description

System Info

transformers 4.34.0.dev0. Running this on a TPU v4-8. It might happen on other platforms as well.

Who can help?

@ArthurZucker

Reproduction

Decoding is extremely slow using Transformers 4.34.0.dev0.

A small script to reproduce:

import argparse, time
from transformers import AutoTokenizer

def measure_tokenization_speed(tokenizer, sentences):
    start_time = time.time()
    outputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    end_time = time.time()
    print(f"Time taken for encoding: {end_time - start_time} seconds")
    return outputs["input_ids"]

def measure_detokenization_speed(tokenizer, input_ids):
    start_time = time.time()
    decoded_sentences = tokenizer.batch_decode(input_ids)
    end_time = time.time()
    print(f"Time taken for decoding: {end_time - start_time} seconds")

def main(args):
    tokenizer = AutoTokenizer.from_pretrained("openai/whisper-medium", use_fast=True)

    # Create an array of 1000 sentences
    sentences = ["This is a sample sentence."] * 1000

    input_ids = measure_tokenization_speed(tokenizer, sentences)
    measure_detokenization_speed(tokenizer, input_ids)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Measure the speed of HuggingFace tokenizer.")
    args = parser.parse_args()
    main(args)

tpu v4-8 (transformers 4.34.0.dev0)
Time taken for encoding: 1.1659502983093262 seconds
Time taken for decoding: 39.807389974594116 seconds

tpu v4-8 (transformers 4.30.1)
Time taken for encoding: 1.2527313232421875 seconds
Time taken for decoding: 1.8215229511260986 seconds

Expected behavior

Decoding should take approximately as long as encoding.
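
To help narrow this down (a sketch, not part of the original report), one can time the Rust backend's decode_batch directly and compare it with tokenizer.batch_decode. If the backend call stays fast, the regression is likely in the Python wrapper rather than in the tokenizers library. This assumes the same tokenizer and input_ids as in the script above.

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-medium", use_fast=True)
sentences = ["This is a sample sentence."] * 1000
input_ids = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")["input_ids"]

# Python wrapper path (what the reproduction script measures)
start = time.time()
tokenizer.batch_decode(input_ids)
print(f"batch_decode: {time.time() - start:.2f} s")

# Rust backend path, bypassing the per-token Python logic
start = time.time()
tokenizer.backend_tokenizer.decode_batch(input_ids.tolist(), skip_special_tokens=False)
print(f"backend decode_batch: {time.time() - start:.2f} s")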
