Description
I'm interested in using the LLamaSharp ChatCompletion interface for Semantic Kernel in an application I'm working on. Looking over the implementation, I have some concerns regarding performance and chat state. I already had this conversation with @martindevans a month or so ago on Discord; you can view the convo here.
I haven't had any time to work on it until now. I'm going to have a look at fixing it, but the primary issue is the use of the StatelessExecutor for ChatCompletion. This leads to extremely long inference times as the chat history builds up, since the model has to reprocess the entire context on every inference call. I remember that the implementation used to just wrap the existing LLamaSharp ChatSession class, but that made it difficult to manage conversation state, e.g. deleting the last message and regenerating a different response.
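
To make the cost concrete, here's a minimal sketch of the difference between the two executors. The model path and prompt strings are placeholders, and exact signatures may differ between LLamaSharp versions:

```csharp
using System;
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf") { ContextSize = 4096 };
using var weights = LLamaWeights.LoadFromFile(parameters);

// Stateless: the whole transcript is re-sent, so the model re-evaluates
// every token of the accumulated history on each call.
var fullTranscript = "User: Hello\nAssistant: Hi!\nUser: What's the weather?";
var stateless = new StatelessExecutor(weights, parameters);
await foreach (var token in stateless.InferAsync(fullTranscript, new InferenceParams()))
    Console.Write(token);

// Stateful: the context's KV cache persists between calls, so only the
// newly appended message needs prompt processing.
using var context = weights.CreateContext(parameters);
var stateful = new InteractiveExecutor(context);
await foreach (var token in stateful.InferAsync("User: What's the weather?", new InferenceParams()))
    Console.Write(token);
```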
I'm considering writing a dedicated ChatCompletion for Semantic Kernel that doesn't use LLamaSharp's internal chat history, but instead interfaces directly with the ChatHistory defined by Semantic Kernel (a rough sketch of what I mean is below). Any thoughts from @AsakusaRinne and @xbotter? Tagging you two since Rinne made the initial implementation and xbotter rewrote it and made the newer one.
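
To illustrate the direction I'm thinking of, here's a rough sketch of such a service. The class name, delta-tracking scheme, and "Role: text" prompt formatting are hypothetical, not the existing LLamaSharp.SemanticKernel code: the idea is to implement Semantic Kernel's IChatCompletionService directly over a stateful InteractiveExecutor, and only feed the model the ChatHistory messages appended since the previous call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using LLama;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
// LLama.Common also defines ChatHistory and AuthorRole, so alias the one type
// needed here rather than importing the whole namespace.
using InferenceParams = LLama.Common.InferenceParams;

public sealed class StatefulLLamaChatCompletion : IChatCompletionService
{
    private readonly InteractiveExecutor _executor;
    private int _processedCount; // ChatHistory entries already evaluated into the KV cache

    public StatefulLLamaChatCompletion(InteractiveExecutor executor) => _executor = executor;

    public IReadOnlyDictionary<string, object?> Attributes { get; } = new Dictionary<string, object?>();

    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        // Only the messages appended since the last call need prompt processing.
        // A real implementation must also detect edits/deletions to earlier
        // messages and rewind or rebuild the context; this sketch assumes
        // append-only history and a naive prompt format.
        var newMessages = chatHistory.Skip(_processedCount)
            .Select(m => $"{m.Role}: {m.Content}");
        var prompt = string.Join("\n", newMessages) + "\nAssistant:";
        _processedCount = chatHistory.Count;

        var reply = new StringBuilder();
        await foreach (var token in _executor.InferAsync(prompt, new InferenceParams(), cancellationToken))
            reply.Append(token);

        // The generated reply is now in the KV cache too; count the assistant
        // message the caller will append so it isn't re-sent next turn.
        _processedCount++;
        return new[] { new ChatMessageContent(AuthorRole.Assistant, reply.ToString()) };
    }

    public IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
        => throw new NotImplementedException(); // streaming elided to keep the sketch short
}
```

The key design point is that Semantic Kernel's ChatHistory stays the single source of truth for the conversation, while the executor's KV cache is just an optimization over it, so deleting or editing a message would only require invalidating/rewinding the cached state rather than fighting a second, internal history.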