
Conversation


@Kotomi-Du Kotomi-Du commented Oct 3, 2025

Description

Recognize additional LLM models, specifically Phi-Silica models, so that the stateful-model transformation is triggered for them.

Motivation and Context

Combined with #830 and #831, these changes improve the memory footprint and performance of the Phi-Silica workload. Without them, the workload consumed 16 GB of memory and ran at 1 fps on the OVEP GPU backend; with them, memory usage drops to 3.7 GB and performance reaches 16 fps.

Open question

Should OVEP add a provider option to let users decide whether the model should be made stateful? If so, we would not need to hardcode input names for specific models.

@Kotomi-Du Kotomi-Du changed the base branch from master to ovep-develop October 3, 2025 00:24
@Kotomi-Du Kotomi-Du marked this pull request as draft October 3, 2025 00:26
if (gpu_or_npu) {
  prefill_use_full_chat_history = true;
}
// bool gpu_or_npu = ((device.find("NPU") != std::string::npos) || (device.find("GPU") != std::string::npos));

need to discuss with ORT-GenAI team how to handle this logic

@Kotomi-Du Kotomi-Du force-pushed the make_stateful_phisilica branch from 0fe0302 to 1e132f3 October 11, 2025 00:29
@Kotomi-Du Kotomi-Du marked this pull request as ready for review October 11, 2025 00:30