
Using LLamaSharp with Korean language #203

@sinusinu

Description


If I try to use LLamaSharp with Korean language, the output gets garbled. This does not happen in llama.cpp, so I suspect this is a problem with LLamaSharp.

  • LLamaSharp 0.5.1
  • Model: llama-2-7b-guanaco-qlora.Q4_K_M.gguf
  • Prompt (Translated version of chat-with-bob): 사용자가 철수라는 어시스턴트와 상호작용하는 대화 내용입니다. 철수는 도움이 되고 친절하며 정직하고 글쓰기에 능숙하며 사용자의 요청에 즉각적이고 정확하게 답변하는 데 실패한 적이 없습니다.\r\n\r\n사용자: 안녕하세요, 철수.\r\n철수: 안녕하세요. 오늘은 무엇을 도와드릴까요?\r\n사용자: 유럽에서 가장 큰 도시를 알려주세요.\r\n철수: 네. 유럽에서 가장 큰 도시는 러시아의 수도인 모스크바입니다.\r\n사용자:
  • No anti-prompt is used, so all of the output text is generated.

llama.cpp (correct output):
[screenshot: 20231022-172856-WindowsTerminal]

LLamaSharp (garbled output):
[screenshot: 20231022-172909-WindowsTerminal]

My gut feeling is that llama.cpp emits tokens as raw UTF-8 bytes, a single Korean character's multi-byte sequence can be split across two adjacent tokens, and somewhere LLamaSharp calls Encoding.UTF8.GetString(byte[]) on each token individually, so the truncated sequences get converted into U+FFFD. A Korean character takes 3 bytes in UTF-8, and the lengths of all the U+FFFD runs are divisible by 3...

This is just a guess and I have no evidence; I looked at the code, but couldn't understand it well enough to tell whether my theory is correct.
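For what it's worth, the hypothesized failure mode is easy to reproduce outside LLamaSharp. This is not LLamaSharp's actual code, just a Python sketch: decoding each byte slice independently produces U+FFFD, while a stateful incremental decoder (the same idea as .NET's Encoding.UTF8.GetDecoder()) buffers the partial sequence until the rest of the bytes arrive.

```python
import codecs

text = "모스크바"                 # "Moscow"; each syllable is 3 bytes in UTF-8
data = text.encode("utf-8")      # 12 bytes total

# Simulate two tokens whose boundary falls inside a character:
# byte 4 is the first byte of the second syllable.
tok1, tok2 = data[:4], data[4:]

# Decoding each token independently (analogous to calling
# Encoding.UTF8.GetString once per token) replaces the
# truncated sequences with U+FFFD.
garbled = (tok1.decode("utf-8", errors="replace")
           + tok2.decode("utf-8", errors="replace"))
print(garbled)                   # 모, then U+FFFD chars, then 크바

# A stateful incremental decoder buffers the incomplete
# sequence across calls instead of replacing it.
dec = codecs.getincrementaldecoder("utf-8")()
intact = dec.decode(tok1) + dec.decode(tok2, final=True)
print(intact)                    # 모스크바
```

If something like this is happening per-token inside LLamaSharp, keeping a persistent decoder (or accumulating bytes until they form complete characters) would fix it.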

Can this be fixed? Am I doing something wrong?
