
[BUG] SpeechToTextTool returns incoherent output and prints unexpected informational messages #1478

@Martin-Labenne

Describe the bug
The default SpeechToTextTool does not behave as expected.
Since it is provided as a default tool, I assumed it would work out of the box with little to no configuration.

While it doesn’t raise any exceptions or crash, it consistently returns incoherent or meaningless transcriptions.
Moreover, three setup-related messages are printed during execution, which suggest that the tool may not be properly configured by default.

Code to reproduce the error

from smolagents.default_tools import SpeechToTextTool
stt_tool = SpeechToTextTool(model='openai/whisper-small')
print(stt_tool('tmp/audio.wav'))

Error logs (if any)
There is no exception or stack trace, but three messages are printed during execution, and the returned transcription is unusable.

  • It is strongly recommended to pass the sampling_rate argument to WhisperFeatureExtractor(). Failing to do so can result in silent errors that might be hard to debug.
  • Due to a bug fix in [Whisper] Refactor forced_decoder_ids & prompt ids transformers#28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass language='en'.
  • The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Returned string (unusable):
 A fwiyayu dhe, a yoyo t'ho arhantu, fyt b'n lab'ch'h'n z'nab'ch'h'n wakhe. Asa r'h'f t'ho h'nj, a shyt b'h'h'h'n fwiyayu waphe t'ho fwant'h'n fyt b'h'h'n shy'h'n shy'h'n shy'h'n b'h'n t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu ... [TRUNCATED BY THE AUTHOR]
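The first message above suggests the sampling rate may be a factor, since Whisper checkpoints expect 16 kHz audio. The following is a minimal diagnostic sketch, not a fix and not part of smolagents: it uses only the standard-library `wave` module to read the header of a PCM WAV file. The helper name `wav_info` and the constant `WHISPER_SAMPLE_RATE` are hypothetical, and the helper will raise `wave.Error` on files that are not plain PCM WAV.

```python
import wave

# Whisper models are trained on 16 kHz audio.
WHISPER_SAMPLE_RATE = 16_000


def wav_info(path):
    """Return basic header information for a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / rate,
            # True only when no resampling would be needed for Whisper.
            "matches_whisper_rate": rate == WHISPER_SAMPLE_RATE,
        }
```

Calling `wav_info('tmp/audio.wav')` before running the tool would show whether the file is already at 16 kHz. If `wave.open` fails, the file may not be a PCM WAV at all.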

Expected behavior
The tool returns the correct transcription of the audio:

 Before you all go, I want to remind you that the midterm is next week. Here's a little hint. You should be familiar with the differential equations on page 245. Problems that are very similar to problems 32, 33 and 44 from that page might be on the test. And also some of you might want to brush up on the last page in the integration section, page 197. I know some of you struggled on last week's quiz. I foresee problem 22 from page 197 being on your midterm. Oh and don't Don't forget to brush up on the section on related rates on pages 132, 133 and 134.

Packages version:

smolagents==1.19.0

Additional context

  • I'm currently working on the Agents certification and discovered this while trying to answer a question that comes with an audio file. The ID of the question is 1f975693-876d-457b-a649-393859e79bf3, and you can retrieve the question and the file directly via this API: https://agents-course-unit4-scoring.hf.space/docs.
  • I did no processing on the file: I fetched it with a requests.get call, then stored it with f.write(file_content.read())
  • I'm using openai/whisper-small instead of the current default checkpoint because my laptop cannot handle openai/whisper-large-v3-turbo, but I believe the result would be the same
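Since the downloaded bytes were written to disk unchanged under a `.wav` name, it may be worth confirming that the content really is WAV. The sketch below is a rough check under stated assumptions: the function name `sniff_audio_format` is hypothetical, it looks only at a handful of well-known file signatures, and an "unknown" result does not mean the file is invalid.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the container format from the first bytes of a file (hypothetical helper)."""
    # WAV: "RIFF" chunk id, 4-byte size, then the "WAVE" form type.
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP3 files often start with an ID3v2 tag.
    if data[:3] == b"ID3":
        return "mp3"
    # MPEG audio frame sync: 11 set bits (this can also match other MPEG audio streams).
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"
```

For example, reading the first few bytes with `open('tmp/audio.wav', 'rb').read(12)` and passing them to this function would show whether the file extension matches the content.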
