[BUG] SpeechToTextTool returns incoherent output and prints unexpected informational messages

**Describe the bug**
The default SpeechToTextTool does not behave as expected.
Since it is provided as a default tool, I assumed it would work out of the box with little to no configuration.

While it doesn’t raise any exceptions or crash, it consistently returns incoherent or meaningless transcriptions.
Moreover, three setup-related messages are printed during execution, which suggests that the tool may not be properly configured by default.

**Code to reproduce the error**
```py
from smolagents.default_tools import SpeechToTextTool
stt_tool = SpeechToTextTool(model='openai/whisper-small')
print(stt_tool('tmp/audio.wav'))
```

**Error logs (if any)**
There is no exception or stack trace, but three messages are printed during execution, and the returned transcription is unusable.
- It is strongly recommended to pass the `sampling_rate` argument to `WhisperFeatureExtractor()`. Failing to do so can result in silent errors that might be hard to debug.
- Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
- The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

The returned string is unusable:
```text
 A fwiyayu dhe, a yoyo t'ho arhantu, fyt b'n lab'ch'h'n z'nab'ch'h'n wakhe. Asa r'h'f t'ho h'nj, a shyt b'h'h'h'n fwiyayu waphe t'ho fwant'h'n fyt b'h'h'n shy'h'n shy'h'n shy'h'n b'h'n t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu t'ho fwiyayu ... [TRUNCATED BY THE AUTHOR]
```
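
Since the first message is about `sampling_rate`, one thing worth ruling out is the sample rate of the input file, because Whisper checkpoints expect 16 kHz audio. Below is a minimal sketch using only the standard-library `wave` module. The helper names `wav_sample_rate` and `matches_whisper_rate` are illustrative and not part of smolagents, and the check only works if the file really is a PCM WAV.

```py
import wave

# Whisper feature extractors expect 16 kHz input audio.
WHISPER_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header.

    Raises wave.Error if the file is not a readable WAV file.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_whisper_rate(path):
    """Return True if the file is already at the rate Whisper expects."""
    return wav_sample_rate(path) == WHISPER_SAMPLE_RATE
```

If `wave.open` raises `wave.Error` on the file, that by itself would suggest the file is not a plain WAV, which may be relevant to the garbled output.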

**Expected behavior**
The tool should return the correct transcription of the audio:
```text
 Before you all go, I want to remind you that the midterm is next week. Here's a little hint. You should be familiar with the differential equations on page 245. Problems that are very similar to problems 32, 33 and 44 from that page might be on the test. And also some of you might want to brush up on the last page in the integration section, page 197. I know some of you struggled on last week's quiz. I foresee problem 22 from page 197 being on your midterm. Oh and don't Don't forget to brush up on the section on related rates on pages 132, 133 and 134.
```

**Packages version:**
```sh
smolagents==1.19.0
```

**Additional context**
- I'm currently working on the Agents certification and discovered this while trying to answer a question that comes with an audio file. The question ID is `1f975693-876d-457b-a649-393859e79bf3`; you can retrieve the question and the file directly via this API: https://agents-course-unit4-scoring.hf.space/docs.
- I did no processing on the file: I fetched it with a `requests.get` call, then stored it with `f.write(file_content.read())`.
- I'm using `openai/whisper-small` instead of the current default checkpoint because my laptop cannot handle `openai/whisper-large-v3-turbo`, but I believe the result would be the same.
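
Because the file was written to disk without any inspection, a quick way to rule out a container mismatch (for example, non-WAV bytes saved under a `.wav` name) is to look at the first bytes of the file. This is only a rough sketch: `sniff_audio_format` is a hypothetical helper, not something smolagents provides, and it recognizes just a few common signatures heuristically.

```py
def sniff_audio_format(data: bytes) -> str:
    """Guess an audio container from its leading bytes (heuristic only)."""
    # WAV: RIFF container whose form type at offset 8 is "WAVE".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # An ID3v2 tag at the start of the file usually indicates MP3.
    if data[:3] == b"ID3":
        return "mp3"
    # MPEG audio frame sync: 11 set bits. This is a loose check and can
    # also match other MPEG-style streams such as ADTS AAC.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"
```

Reading the first 12 bytes of `tmp/audio.wav` in binary mode and passing them to this function would show whether the stored file matches its extension.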

[BUG] SpeechToTextTool returns incoherent output and prints unexpected informational messages #1478