
Simulated OpenVINO Backend for Testing Unmerged PR Features with Memory Profiling #21491


Closed · wants to merge 26 commits

Conversation

@Mohamed-Ashraf273 (Contributor) commented on Jul 18, 2025

Device used for inference

CPU

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

GPT-2

Mentions

@rkazants

Performance issue description

During my GSoC project, I ran into the following issue: running the generate step with the OpenVINO backend shows very high memory usage. The numbers below were collected on top of this PR:
Keras_hub: keras-team/keras-hub#2310

OpenVINO performance

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)

1) without quantization (not serialized):
Memory used by compile_model: 2932.03 MB
Generated text: Keras is  a great library for Python.  Its features and ease of
Keras is using the backend: openvino
Latency: 6.41 seconds
Throughput: 1.87 tokens/sec
Peak CPU Memory Used: 3181.58 MB

2) without quantization (serialized):
Memory used by compile_model: 4228.38 MB
Generated text: Keras is  a great framework for creating scalable, highly concurrent, fault tolerant,
Keras is using the backend: openvino
Latency: 7.65 seconds
Throughput: 1.57 tokens/sec
Peak CPU Memory Used: 4541.11 MB

3) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)), serialized:
Memory used by compile_model: 3048.21 MB
Generated text: Keras is  a Python framework for creating high performance, scalable and fault tolerant web
Keras is using the backend: openvino
Latency: 17.79 seconds
Throughput: 0.79 tokens/sec
Peak CPU Memory Used: 4512.11 MB

4) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_ASYM)), serialized:
Memory used by compile_model: 3262.74 MB
Generated text: Keras is 《Keras》, a powerful machine-learning framework
Keras is using the backend: openvino
Latency: 17.79 seconds
Throughput: 0.39 tokens/sec
Peak CPU Memory Used: 4707.22 MB

5) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)) and dynamic model inputs:
Memory used by compile_model: 2800.11 MB
Generated text: Keras is  a Python package that allows you to write simple, fast and efficient
Keras is using the backend: openvino
Latency: 17.02 seconds
Throughput: 0.82 tokens/sec
Peak CPU Memory Used: 4327.54 MB
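
For reference, the "Memory used by compile_model" figures above can be collected with instrumentation along these lines (a minimal sketch using the openvino Python API directly; the IR path is hypothetical, and the actual runs build the model from the Keras graph rather than reading it from disk):

import os

import openvino as ov
import psutil

core = ov.Core()
ov_model = core.read_model("gpt2_medium_en.xml")  # hypothetical serialized IR path

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss / (1024 ** 2)

# compile_model is where the bulk of the allocation reported above happens
compiled_model = core.compile_model(ov_model, "CPU")

rss_after = process.memory_info().rss / (1024 ** 2)
print(f"Memory used by compile_model: {rss_after - rss_before:.2f} MB")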

For the float16 preset:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()
Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)

1) without quantization (not serialized):
Memory used by compile_model: 6324.46 MB
Generated text: Keras is  a Python library for creating high quality models and performing computations using
Keras is using the backend: openvino
Latency: 9.19 seconds
Throughput: 1.52 tokens/sec
Peak CPU Memory Used: 6645.12 MB

2) without quantization (serialized):
Memory used by compile_model: 6810.26 MB
Generated text: Keras is  a Python framework for machine learning and artificial intelligence.  It
Keras is using the backend: openvino
Latency: 10.04 seconds
Throughput: 1.19 tokens/sec
Peak CPU Memory Used: 7217.23 MB

3) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)), serialized:
Memory used by compile_model: 4664.95 MB
Generated text: Keras is  a great framework for building complex applications. It allows you to write
Keras is using the backend: openvino
Latency: 22.68 seconds
Throughput: 0.62 tokens/sec
Peak CPU Memory Used: 5844.22 MB

4) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_ASYM)), serialized:
Memory used by compile_model: 4890.85 MB
Generated text: Keras is  a powerful machine learning library for Python. 
Keras
Keras is using the backend: openvino
Latency: 20.75 seconds
Throughput: 0.48 tokens/sec
Peak CPU Memory Used: 5976.95 MB

5) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)) and dynamic model inputs:
Memory used by compile_model: 4532.57 MB
Generated text: Keras is  the software that allows you to write software.  It
Keras is using the backend: openvino
Latency: 21.50 seconds
Throughput: 0.51 tokens/sec
Peak CPU Memory Used: 5709.98 MB
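
The INT4 runs above apply NNCF weight-only compression to the OpenVINO model before compiling it. A minimal sketch of that step, assuming the standard nncf API and a hypothetical IR path:

import openvino as ov
from nncf import CompressWeightsMode, compress_weights

core = ov.Core()
ov_model = core.read_model("gpt2_medium_en.xml")  # hypothetical IR path

# Symmetric INT4 weight-only compression, as in runs (3) and (5) above;
# CompressWeightsMode.INT4_ASYM gives the asymmetric variant from run (4).
compressed_model = compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)

compiled_model = core.compile_model(compressed_model, "CPU")

Note the trade-off visible in the tables above: both INT4 modes reduce compile-time memory but roughly halve throughput.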

TensorFlow, by contrast, is far more memory-efficient:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)

1) Generate with max length = 20:
Keras is using the backend: tensorflow
Latency: 9.17 seconds
CPU Memory Used (end - start): 266.49 MB
Peak CPU Memory Used: 266.14 MB

2) Applying only one layer out of 24:
Generated text: Keras is                  
Keras is using the backend: tensorflow
Latency: 2.05 seconds
CPU Memory Used (end - start): 69.00 MB
Peak CPU Memory Used: 68.65 MB

For the float16 preset:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()
Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)

1) Generate with max length = 20:
Generated text: Keras is ____ and _____ is the function that makes it all work together.
Keras is using the backend: tensorflow
Latency: 11.93 seconds
CPU Memory Used (end - start): 475.20 MB
Peak CPU Memory Used: 475.03 MB

2) Applying only one layer out of 24:
Generated text: Keras is                  
Keras is using the backend: tensorflow
Latency: 2.55 seconds
CPU Memory Used (end - start): 248.85 MB
Peak CPU Memory Used: 248.85 MB

Step-by-step reproduction

Using this PR:
Keras_hub: keras-team/keras-hub#2310

run the following code:

import os

# Force CPU and select the OpenVINO backend before importing keras.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["KERAS_BACKEND"] = "openvino"

import time
import threading

import psutil
import keras
import keras_hub

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 ** 2)

# Sample the process RSS on a background thread so we capture the peak
# reached during generate(), not just the before/after delta.
peak_memory = mem_before
done = [False]
def monitor_memory():
    global peak_memory
    while not done[0]:
        mem_now = process.memory_info().rss / (1024 ** 2)
        if mem_now > peak_memory:
            peak_memory = mem_now
        time.sleep(0.05)

monitor_thread = threading.Thread(target=monitor_memory)
monitor_thread.start()

start_time = time.perf_counter()
output = causal_lm.generate("Keras is ", max_length=20)
end_time = time.perf_counter()

done[0] = True
monitor_thread.join()

mem_after = process.memory_info().rss / (1024 ** 2)
memory_used = mem_after - mem_before
latency = end_time - start_time
# Approximate the token count by whitespace-separated words.
tokens_generated = len(output.split())
throughput = tokens_generated / latency

print("Generated text:", output)
print(f"Keras is using the backend: {keras.backend.backend()}")
print(f"Latency: {latency:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/sec")
print(f"CPU Memory Used (end - start): {memory_used:.2f} MB")
print(f"Peak CPU Memory Used: {peak_memory - mem_before:.2f} MB")

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Mohamed-Ashraf273, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a draft aimed at advancing the OpenVINO backend for Keras by implementing critical operations previously marked as unsupported. The core objective is to enable comprehensive inference capabilities for large language models, specifically demonstrating the functionality of the generate method for models like GPT2, Gemma, and Mistral. The changes involve complex OpenVINO graph constructions for various array manipulations, slicing, and random number generation, alongside refinements to tensor conversion and general data handling within the backend.

Highlights

  • Expanded OpenVINO Backend Capabilities: Implemented several previously unsupported NumPy and core Keras operations, including slice_update, repeat, take_along_axis, tri, tril, triu, and categorical for the OpenVINO backend. This significantly broadens the range of Keras functionalities that can be executed with OpenVINO (see the sketch after this list).
  • Improved Tensor Handling and Interoperability: Enhanced get_ov_output to support list and tuple inputs, enabling more flexible data processing. Additionally, convert_to_tensor now directly handles ov.Output objects, and OpenVINOKerasTensor gained a __array__ method, streamlining conversions and improving interoperability with NumPy.
  • Enabled Large Language Model (LLM) Inference: The changes enable the generate method for GPT2CausalLM, GemmaCausalLM, and MistralCausalLM models on the OpenVINO backend, as confirmed by new test results. This is a crucial step towards full LLM support.
  • Test Suite Adjustments: Updated excluded_concrete_tests.txt to reflect newly supported operations by removing them from the exclusion list. Conversely, certain random operation tests were added to the exclusion list, indicating areas still under development or requiring further integration.
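
A minimal sketch exercising a few of the newly supported ops (assuming this PR's branch is installed, since it adds both the ops and the NumPy conversion used below; the calls themselves are standard keras.ops entry points):

import os
os.environ["KERAS_BACKEND"] = "openvino"  # must be set before importing keras

import numpy as np
from keras import ops

# slice_update: write a 2x2 patch into a 4x4 tensor at offset (1, 1)
x = ops.slice_update(ops.zeros((4, 4)), (1, 1), ops.ones((2, 2)))
print(ops.convert_to_numpy(x))

# repeat and tril, two more ops implemented in this PR
print(ops.convert_to_numpy(ops.repeat(ops.arange(3), 2)))  # [0 0 1 1 2 2]
print(ops.convert_to_numpy(ops.tril(ops.ones((3, 3)))))

# take_along_axis with per-row indices
x = ops.convert_to_tensor(np.arange(6).reshape(2, 3).astype("float32"))
idx = ops.convert_to_tensor(np.array([[0], [2]]))
print(ops.convert_to_numpy(ops.take_along_axis(x, idx, axis=1)))  # [[0.] [5.]]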

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds significant new functionality to the OpenVINO backend, including implementations for slice_update, repeat, take_along_axis, tri, tril, triu, and categorical. This is a great step towards feature parity with other backends. The code is generally well-structured, but some of the new implementations are very complex. I've found a critical bug in the categorical implementation and have a few suggestions to improve code clarity and maintainability in slice_update.

@codecov-commenter commented on Jul 18, 2025

Codecov Report

Attention: Patch coverage is 88.48369% with 60 lines in your changes missing coverage. Please review.

Project coverage is 82.86%. Comparing base (d55a767) to head (8bc66a1).
Report is 5 commits behind head on master.

Files with missing lines                          | Patch %  | Lines
keras/src/backend/openvino/core.py                | 71.42%   | 15 Missing and 5 partials ⚠️
keras/src/export/openvino.py                      | 84.00%   | 4 Missing and 8 partials ⚠️
keras/src/layers/core/einsum_dense.py             | 90.90%   | 3 Missing and 9 partials ⚠️
keras/src/backend/openvino/numpy.py               | 93.84%   | 4 Missing and 4 partials ⚠️
keras/src/backend/openvino/random.py              | 93.54%   | 1 Missing and 1 partial ⚠️
keras/src/layers/core/dense.py                    | 90.47%   | 1 Missing and 1 partial ⚠️
keras/src/models/model.py                         | 50.00%   | 2 Missing ⚠️
keras/api/_tf_keras/keras/ops/__init__.py         | 0.00%    | 1 Missing ⚠️
keras/api/_tf_keras/keras/ops/numpy/__init__.py   | 0.00%    | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21491      +/-   ##
==========================================
+ Coverage   82.81%   82.86%   +0.05%     
==========================================
  Files         565      566       +1     
  Lines       55520    55963     +443     
  Branches     8664     8733      +69     
==========================================
+ Hits        45977    46376     +399     
- Misses       7428     7455      +27     
- Partials     2115     2132      +17     
Flag             | Coverage Δ
keras            | 82.67% <87.90%> (+0.05%) ⬆️
keras-jax        | 63.80% <37.04%> (+0.42%) ⬆️
keras-numpy      | 58.29% <24.37%> (-0.30%) ⬇️
keras-openvino   | 34.73% <51.24%> (+0.74%) ⬆️
keras-tensorflow | 64.27% <39.15%> (+0.43%) ⬆️
keras-torch      | 63.14% <33.58%> (-0.36%) ⬇️

@Mohamed-Ashraf273 marked this pull request as ready for review on July 20, 2025 19:21
@Mohamed-Ashraf273 changed the title from "Draft PR for OpenVINO backend to simulate the unmerged PRs functionality" to "Draft PR for OpenVINO backend to simulate the unmerged PRs functionality with the high memory usage issue" on Jul 20, 2025
@Mohamed-Ashraf273 changed the title from "Draft PR for OpenVINO backend to simulate the unmerged PRs functionality with the high memory usage issue" to "Simulated OpenVINO Backend for Testing Unmerged PR Features with Memory Profiling" on Jul 20, 2025
@Mohamed-Ashraf273 force-pushed the gsoc2025 branch 2 times, most recently from 5cd93f9 to e26fc10 on July 21, 2025 15:09