If I try to use LLamaSharp with Korean text, the output gets garbled. This does not happen in llama.cpp, so I suspect the problem is in LLamaSharp.
- LLamaSharp 0.5.1
- Model: llama-2-7b-guanaco-qlora.Q4_K_M.gguf
- Prompt (Translated version of chat-with-bob):
사용자가 철수라는 어시스턴트와 상호작용하는 대화 내용입니다. 철수는 도움이 되고 친절하며 정직하고 글쓰기에 능숙하며 사용자의 요청에 즉각적이고 정확하게 답변하는 데 실패한 적이 없습니다.\r\n\r\n사용자: 안녕하세요, 철수.\r\n철수: 안녕하세요. 오늘은 무엇을 도와드릴까요?\r\n사용자: 유럽에서 가장 큰 도시를 알려주세요.\r\n철수: 네. 유럽에서 가장 큰 도시는 러시아의 수도인 모스크바입니다.\r\n사용자:
- No anti-prompt is used, so all of the text is generated by the model.
My gut feeling is that llama.cpp produces token pieces as raw UTF-8 bytes, that the multi-byte UTF-8 sequences for Korean characters sometimes get split across token boundaries, and that LLamaSharp calls Encoding.UTF8.GetString(byte[]) somewhere on those partial sequences, turning them into U+FFFD. A Korean character takes 3 bytes in UTF-8, and the lengths of all the U+FFFD runs in my output are divisible by 3...

It's just a guess; I have no evidence. I looked at the code but couldn't understand it well enough to tell whether my theory is correct.
Can this be fixed? Am I doing something wrong?