If I try to use LLamaSharp with Korean text, the output gets garbled. This does not happen in llama.cpp, so I suspect the problem is in LLamaSharp.
- LLamaSharp 0.5.1
- Model: llama-2-7b-guanaco-qlora.Q4_K_M.gguf
- Prompt (Translated version of chat-with-bob):
사용자가 철수라는 어시스턴트와 상호작용하는 대화 내용입니다. 철수는 도움이 되고 친절하며 정직하고 글쓰기에 능숙하며 사용자의 요청에 즉각적이고 정확하게 답변하는 데 실패한 적이 없습니다.\r\n\r\n사용자: 안녕하세요, 철수.\r\n철수: 안녕하세요. 오늘은 무엇을 도와드릴까요?\r\n사용자: 유럽에서 가장 큰 도시를 알려주세요.\r\n철수: 네. 유럽에서 가장 큰 도시는 러시아의 수도인 모스크바입니다.\r\n사용자:
- No anti-prompt is used, so all of the text is generated by the model.
My gut feeling is that llama.cpp produces token pieces as raw UTF-8 bytes, that the multi-byte UTF-8 sequences for Korean characters sometimes get split across token boundaries, and that LLamaSharp calls Encoding.UTF8.GetString(byte[]) somewhere on those partial sequences, turning them into U+FFFD. A Korean character takes 3 bytes in UTF-8, and the lengths of all the U+FFFD runs in my output are divisible by 3...

It's just a guess; I have no evidence. I looked at the code but couldn't understand it well enough to tell whether my theory is correct.
Can this be fixed? Am I doing something wrong?