We conducted an experiment to determine whether whitespace modifications in arithmetic expressions influence how LLMs generate answers.
We used the arithmetic_1dc subset from EleutherAI/arithmetic, which contains pairs like:

```
context: Question: What is (9 + 8) * 2? Answer:
completion: 34
```
The complexity level of these problems is appropriate for modern ~8B parameter LLMs. Short prompts also avoid context window limitations, allowing us to isolate the effect of minor prompt modifications. This dataset is also included in the lm-evaluation-harness benchmark.
The dataset is loaded via the `datasets` library from the Hugging Face Hub.
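A minimal loading sketch, assuming the split name matches the Hub card for EleutherAI/arithmetic:

```python
from datasets import load_dataset

# Load the arithmetic_1dc subset; the dataset ships a validation split
# (adjust the split name if the Hub card differs).
ds = load_dataset("EleutherAI/arithmetic", "arithmetic_1dc", split="validation")
print(ds[0]["context"])     # e.g. "Question: What is (9 + 8) * 2? Answer:"
print(ds[0]["completion"])  # e.g. "34"
```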
To run the analysis:

- Set the `model_name` variable to your target model
- Execute the script
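For example (the script name here is hypothetical; use the repository's actual entry point):

```python
# At the top of the analysis script:
model_name = "Qwen/Qwen3-8B"  # any Hugging Face model id from the results table
# Then run it, e.g.: python run_analysis.py
```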
We added or removed spaces (ASCII 0x20) at positions that do not alter the mathematical meaning (a sketch of the procedure follows the example below):
- Spaces were never inserted inside words or numbers
- Up to 3 consecutive spaces could be added at a token boundary
- Spaces before brackets and operators could be removed
Example modification:

```
Original: What is (9 + 8) * 2?
Modified: What is ( 9+8) * 2 ?
```
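A minimal sketch of such a perturbation; the function name, tokenization, and sampling choices are illustrative, not the exact experiment code:

```python
import random
import re

def distort_whitespace(prompt: str, rng: random.Random) -> str:
    """Perturb spaces only between tokens, never inside words or numbers."""
    # Split into words/numbers and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", prompt)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i == len(tokens) - 1:
            break
        if tok.isalnum() and tokens[i + 1].isalnum():
            # Between two words/numbers: keep at least one space.
            out.append(" " * rng.randint(1, 3))
        else:
            # Adjacent to a bracket or operator: the space may be dropped.
            out.append(" " * rng.randint(0, 3))
    return "".join(out)

rng = random.Random(0)
print(distort_whitespace("What is (9 + 8) * 2?", rng))
```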
- Prompt Format: Fixed across all tested models (no model-specific optimization)
- Generation Paradigm: Greedy decoding with answer extraction via the `<start>number</start>` pattern
- Chat Template: Standard template applied via the tokenizer's `apply_chat_template` with `enable_thinking=False`
- Statistical Test: McNemar's test (p < 0.05 significance threshold)
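A minimal sketch of the generation step for one prompt, assuming a recent `transformers` version; the exact instruction wording that asks for the `<start>number</start>` wrapper is an assumption:

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any chat model from the results table
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Hypothetical instruction wording; the experiment required the answer
# to be wrapped in the <start>number</start> pattern.
prompt = "What is (9 + 8) * 2? Respond with the answer wrapped as <start>number</start>."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,  # used by Qwen3 templates; other templates ignore it
).to(model.device)

with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=32, do_sample=False)  # greedy decoding
response = tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)

match = re.search(r"<start>\s*(-?\d+)\s*</start>", response)
answer = int(match.group(1)) if match else None  # None counts against instruction-following
```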
Note: McNemar's test becomes unreliable when failure rates are very low, since its statistic depends only on the small number of discordant pairs (e.g., for Qwen3-32B's near-perfect accuracy).
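For reference, the test can be run with `statsmodels`; the data below is a toy illustration, and the variable names are ours:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Toy paired outcomes; in the experiment these are per-prompt correctness
# flags for the original and whitespace-distorted versions.
correct_orig = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_dist = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)

# 2x2 contingency table of paired results.
table = [
    [np.sum( correct_orig &  correct_dist), np.sum( correct_orig & ~correct_dist)],
    [np.sum(~correct_orig &  correct_dist), np.sum(~correct_orig & ~correct_dist)],
]
# exact=True uses the binomial distribution over the discordant pairs,
# which is safer when those counts are small.
result = mcnemar(table, exact=True)
print(result.pvalue, result.pvalue < 0.05)
```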
| Model | Instruction-following Acc. (↑) | Population | Original Acc. (↑) | Distorted Acc. (↑) | Significant Difference |
|---|---|---|---|---|---|
| Qwen/Qwen3-14B | 1.0 | 2000 | 0.946 | 0.964 | True |
| Qwen/Qwen3-8B | 1.0 | 2000 | 0.897 | 0.939 | True |
| t-tech/T-lite-it-1.0 | 1.0 | 2000 | 0.762 | 0.676 | True |
| ai-sage/GigaChat-20B-A3B-instruct | 0.99625 | 1985 | 0.498 | 0.507 | False |
| mistralai/Mistral-7B-Instruct-v0.3 | 0.93275 | 1749 | 0.207 | 0.189 | True |
| yandex/YandexGPT-5-Lite-8B-instruct | 0.981 | 1926 | 0.201 | 0.088 | True |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 0.972 | 1890 | 0.0 | 0.0 | True |
Column Descriptions:
- Instruction-following Acc.: Ratio of responses containing the `<start>number</start>` pattern; higher is better
- Population: Number of valid response pairs (both the original and the modified prompt yielded extractable answers)
- Original Acc.: Accuracy with unmodified prompts; higher is better
- Distorted Acc.: Accuracy with whitespace-modified prompts; higher is better
- Significant Difference: True if McNemar's test yields p < 0.05
- Whitespace modifications significantly affect LLM accuracy on arithmetic calculations
- Qwen3 models show improved accuracy on whitespace-modified prompts, while most other models degrade
- GigaChat-20B-A3B-instruct is effectively unaffected by whitespace noise (the accuracy difference is not statistically significant)
- Instruction-following capability:
  - Qwen3 achieves perfect compliance
  - Other models occasionally deviate from the response formatting requirements
- The Russian-focused LLMs likely underperform due to a mismatch with the English prompts (we deliberately kept all prompts in English for consistency rather than switching to Russian)