DamonFool
Contributor

While debugging our Hunyuan models, I found that non-ASCII characters were displayed incorrectly in the log output.

For example,

tokenize the prompt
prompt: ""
tokens: [ '<?hy_begin?of?sentence?>':120000 ]     <--- the BOS token of the Hunyuan model, which contains non-ASCII chars, is displayed incorrectly

...

我eval: [ '?':564 ]
n_past = 29
n_remain: -30
应该eval: [ '?该':3165 ]          <--- the Chinese characters are also dumped incorrectly
n_past = 30
n_remain: -31
保持eval: [ '??':3674 ]
n_past = 31
n_remain: -32
友好eval: [ '?好':28753 ]
n_past = 32
n_remain: -33
。eval: [ '?':292 ]
n_past = 33
n_remain: -34
接下来eval: [ '???':7764 ]

The cause is that the logging code erases every char in the detokenized string for which std::isprint returns false, which breaks non-ASCII characters.
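To illustrate the effect, here is a minimal sketch of such a byte-wise filter. The helper name `strip_non_printable` is hypothetical and is not taken from the patch; it only mirrors the erase/remove_if idiom applied to a UTF-8 string. Which non-ASCII bytes survive depends on the active locale and platform, so the exact output may differ from the logs above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (an assumption for illustration, not the code from
// the patch): drop every byte for which std::isprint returns false.
std::string strip_non_printable(std::string s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           // Take the byte as unsigned char: passing a
                           // negative char to std::isprint is undefined.
                           [](unsigned char c) { return !std::isprint(c); }),
            s.end());
    return s;
}
```

In the default "C" locale, plain ASCII text such as "abc" passes through unchanged, while the three UTF-8 bytes of 我 (E6 88 91) are classified one at a time, so the character cannot survive as a whole.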

After this patch, non-ASCII characters are printed correctly.

tokenize the prompt
prompt: ""
tokens: [ '<|hy_begin▁of▁sentence|>':120000 ]

...

我eval: [ '我':564 ]
n_past = 34
n_remain: -35
应该eval: [ '应该':3165 ]
n_past = 35
n_remain: -36
保持eval: [ '保持':3674 ]
n_past = 36
n_remain: -37
友好eval: [ '友好':28753 ]
n_past = 37
n_remain: -38
,eval: [ ',':270 ]
n_past = 38
n_remain: -39
回应eval: [ '回应':25417 ]

@ggerganov
Member

These logs are used just for debugging. I have forgotten what was the reason to filter out non-ascii characters.

@DamonFool
Contributor Author

> I have forgotten what was the reason to filter out non-ascii characters.

Thanks @ggerganov for taking a look at this.

Not all non-ASCII chars are filtered out by !std::isprint(c).
A few Chinese chars are still printed correctly.
However, quite a lot of Chinese chars are dumped incorrectly.
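One way to picture why only some characters survive is the sketch below. It rests on an assumption that is not confirmed in this thread: that the active locale classifies single bytes roughly like Latin-1, treating 0xA0..0xFF as printable and 0x80..0x9F as not. The names `assumed_printable` and `filter_bytes` are hypothetical and exist only for this illustration.

```cpp
#include <string>

// Assumed model of the byte classification (NOT a real library call):
// printable ASCII plus the range 0xA0..0xFF count as printable.
bool assumed_printable(unsigned char c) {
    return (c >= 0x20 && c <= 0x7E) || c >= 0xA0;
}

// Keep only the bytes the assumed model considers printable.
std::string filter_bytes(const std::string & s) {
    std::string out;
    for (unsigned char c : s) {
        if (assumed_printable(c)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Under this model, 该 (E8 AF A5) and 好 (E5 A5 BD) pass through intact because all of their bytes are at or above 0xA0, whereas 我 (E6 88 91) loses its two continuation bytes and leaves a lone E6, which a terminal would render as a replacement character. That is consistent with the '?该' and '?好' entries in the log, but it remains an inference.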

@ggerganov ggerganov merged commit 2f3dbff into ggml-org:master Aug 21, 2025
47 checks passed
@DamonFool
Contributor Author

Thanks @ggerganov for your help.

@DamonFool DamonFool deleted the non-ascii-print branch August 21, 2025 08:58
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 22, 2025