Mistake in word tokenization for text containing digits related to time and finance #652

## Description
I was told by email that AttaCut (and possibly other tokenizers) does not handle text like the examples below well: numbers written with `.` or `:` in times and prices are split into separate tokens.

```
- 'เจอกันตอน 17.00น.'
   - actual: ['เจอ', 'กัน', 'ตอน', ' ', '17', '.', '00น', '.']
   - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17.00น', '.']
- 'เจอกันตอน 17:00'
   - actual: ['เจอ', 'กัน', 'ตอน', ' ', '17', ':', '00']
   - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17:00']
- 'ของชิ้นนี้ราคา 3.50 บาท'
   - actual: ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3', '.', '50', ' ', 'บาท']
   - expected: ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3.50', ' ', 'บาท']
```

IMHO, this problem seems quite general and probably affects any tokenizer that treats `.` and `:` as standalone tokens. What would be a good strategy for solving it?
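
One possible strategy, offered only as a rough sketch and not as how AttaCut works or should work, is a tokenizer-agnostic post-processing step that re-joins adjacent tokens whenever their concatenation looks like a single number, time, or price. The function name `merge_numeric_tokens` and both regular expressions are hypothetical, written just for this illustration. The optional trailing `น` is an assumption taken from the `17.00น.` example above.

```python
import re

# Hypothetical pattern for a "complete" numeric token: digit groups joined by
# '.', ':' or ',', with an optional trailing 'น' as in the '17.00น.' example.
NUMERIC = re.compile(r"\d+(?:[.:,]\d+)*น?")
# Characters allowed while a merge candidate is still being extended.
NUMERIC_CHARS = re.compile(r"[\d.:,น]+")


def merge_numeric_tokens(tokens):
    """Merge runs of adjacent tokens whose concatenation is one numeric token.

    This works on the output of any tokenizer; it does not call AttaCut.
    """
    merged = []
    i = 0
    while i < len(tokens):
        end = i + 1  # by default, keep the current token unchanged
        if tokens[i][:1].isdigit():
            candidate = ""
            for j in range(i, len(tokens)):
                candidate += tokens[j]
                # Stop as soon as a non-numeric character appears.
                if not NUMERIC_CHARS.fullmatch(candidate):
                    break
                # Remember the longest prefix that forms a complete number.
                if NUMERIC.fullmatch(candidate):
                    end = j + 1
        merged.append("".join(tokens[i:end]))
        i = end
    return merged


# Applied to the "actual" output from the first example above:
print(merge_numeric_tokens(['เจอ', 'กัน', 'ตอน', ' ', '17', '.', '00น', '.']))
# ['เจอ', 'กัน', 'ตอน', ' ', '17.00น', '.']
```

A post-processing step like this leaves the model untouched, but it only covers the patterns listed in the regular expression. Dates, phone numbers, or other abbreviations would need their own rules, and handling them inside the tokenizer itself may turn out to be the better long-term fix.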
