
Simulated OpenVINO Backend for Testing Unmerged PR Features with Memory Profiling #21491


Closed · wants to merge 26 commits

Conversation

@Mohamed-Ashraf273 (Contributor) commented on Jul 18, 2025

Device used for inference

CPU

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

GPT-2

Mentions

@rkazants

Performance issue description

During my GSoC project, I ran into the following issue: running the generate step with the OpenVINO backend shows very high memory usage. The numbers below were collected on top of this PR:
Keras_hub: keras-team/keras-hub#2310

OpenVINO performance

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)

1) without quantization (not serialized):
Memory used by compile_model: 2932.03 MB
Generated text: Keras is  a great library for Python.  Its features and ease of
Keras is using the backend: openvino
Latency: 6.41 seconds
Throughput: 1.87 tokens/sec
Peak CPU Memory Used: 3181.58 MB

2) without quantization (serialized):
Memory used by compile_model: 4228.38 MB
Generated text: Keras is  a great framework for creating scalable, highly concurrent, fault tolerant,
Keras is using the backend: openvino
Latency: 7.65 seconds
Throughput: 1.57 tokens/sec
Peak CPU Memory Used: 4541.11 MB

3) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)), serialized:
Memory used by compile_model: 3048.21 MB
Generated text: Keras is  a Python framework for creating high performance, scalable and fault tolerant web
Keras is using the backend: openvino
Latency: 17.79 seconds
Throughput: 0.79 tokens/sec
Peak CPU Memory Used: 4512.11 MB

4) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_ASYM)), serialized:
Memory used by compile_model: 3262.74 MB
Generated text: Keras is 《Keras》, a powerful machine-learning framework
Keras is using the backend: openvino
Latency: 17.79 seconds
Throughput: 0.39 tokens/sec
Peak CPU Memory Used: 4707.22 MB

5) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)) and dynamic model inputs:
Memory used by compile_model: 2800.11 MB
Generated text: Keras is  a Python package that allows you to write simple, fast and efficient
Keras is using the backend: openvino
Latency: 17.02 seconds
Throughput: 0.82 tokens/sec
Peak CPU Memory Used: 4327.54 MB
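
For reference, the "Memory used by compile_model" figures above can be collected with instrumentation along these lines (a minimal sketch using the openvino Python API directly; the IR path is hypothetical, and the actual runs build the model from the Keras graph rather than reading it from disk):

import os

import openvino as ov
import psutil

core = ov.Core()
ov_model = core.read_model("gpt2_medium_en.xml")  # hypothetical serialized IR path

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss / (1024 ** 2)

# compile_model is where the bulk of the allocation reported above happens
compiled_model = core.compile_model(ov_model, "CPU")

rss_after = process.memory_info().rss / (1024 ** 2)
print(f"Memory used by compile_model: {rss_after - rss_before:.2f} MB")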

For the float16 preset:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()
Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)

1) without quantization (not serialized):
Memory used by compile_model: 6324.46 MB
Generated text: Keras is  a Python library for creating high quality models and performing computations using
Keras is using the backend: openvino
Latency: 9.19 seconds
Throughput: 1.52 tokens/sec
Peak CPU Memory Used: 6645.12 MB

2) without quantization (serialized):
Memory used by compile_model: 6810.26 MB
Generated text: Keras is  a Python framework for machine learning and artificial intelligence.  It
Keras is using the backend: openvino
Latency: 10.04 seconds
Throughput: 1.19 tokens/sec
Peak CPU Memory Used: 7217.23 MB

3) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)), serialized:
Memory used by compile_model: 4664.95 MB
Generated text: Keras is  a great framework for building complex applications. It allows you to write
Keras is using the backend: openvino
Latency: 22.68 seconds
Throughput: 0.62 tokens/sec
Peak CPU Memory Used: 5844.22 MB

4) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_ASYM)), serialized:
Memory used by compile_model: 4890.85 MB
Generated text: Keras is  a powerful machine learning library for Python. 
Keras
Keras is using the backend: openvino
Latency: 20.75 seconds
Throughput: 0.48 tokens/sec
Peak CPU Memory Used: 5976.95 MB

5) with quantization (compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)) and dynamic model inputs:
Memory used by compile_model: 4532.57 MB
Generated text: Keras is  the software that allows you to write software.  It
Keras is using the backend: openvino
Latency: 21.50 seconds
Throughput: 0.51 tokens/sec
Peak CPU Memory Used: 5709.98 MB
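
The INT4 runs above apply NNCF weight-only compression to the OpenVINO model before compiling it. A minimal sketch of that step, assuming the standard nncf API and a hypothetical IR path:

import openvino as ov
from nncf import CompressWeightsMode, compress_weights

core = ov.Core()
ov_model = core.read_model("gpt2_medium_en.xml")  # hypothetical IR path

# Symmetric INT4 weight-only compression, as in runs (3) and (5) above;
# CompressWeightsMode.INT4_ASYM gives the asymmetric variant from run (4).
compressed_model = compress_weights(ov_model, mode=CompressWeightsMode.INT4_SYM)

compiled_model = core.compile_model(compressed_model, "CPU")

Note the trade-off visible in the tables above: both INT4 modes reduce compile-time memory but roughly halve throughput.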

TensorFlow, by contrast, is far more memory-efficient:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)

1) Generate with max length = 20:
Keras is using the backend: tensorflow
Latency: 9.17 seconds
CPU Memory Used (end - start): 266.49 MB
Peak CPU Memory Used: 266.14 MB

2) Applying only one layer out of 24:
Generated text: Keras is                  
Keras is using the backend: tensorflow
Latency: 2.05 seconds
CPU Memory Used (end - start): 69.00 MB
Peak CPU Memory Used: 68.65 MB

For the float16 preset:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()
Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)

1) Generate with max length = 20:
Generated text: Keras is ____ and _____ is the function that makes it all work together.
Keras is using the backend: tensorflow
Latency: 11.93 seconds
CPU Memory Used (end - start): 475.20 MB
Peak CPU Memory Used: 475.03 MB

2) Applying only one layer out of 24:
Generated text: Keras is                  
Keras is using the backend: tensorflow
Latency: 2.55 seconds
CPU Memory Used (end - start): 248.85 MB
Peak CPU Memory Used: 248.85 MB

Step-by-step reproduction

Using this PR:
Keras_hub: keras-team/keras-hub#2310

run the following code:

import os

# Force CPU and select the OpenVINO backend before importing keras.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["KERAS_BACKEND"] = "openvino"

import time
import threading

import psutil
import keras
import keras_hub

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 ** 2)

# Sample the process RSS on a background thread so we capture the peak
# reached during generate(), not just the before/after delta.
peak_memory = mem_before
done = [False]
def monitor_memory():
    global peak_memory
    while not done[0]:
        mem_now = process.memory_info().rss / (1024 ** 2)
        if mem_now > peak_memory:
            peak_memory = mem_now
        time.sleep(0.05)

monitor_thread = threading.Thread(target=monitor_memory)
monitor_thread.start()

start_time = time.perf_counter()
output = causal_lm.generate("Keras is ", max_length=20)
end_time = time.perf_counter()

done[0] = True
monitor_thread.join()

mem_after = process.memory_info().rss / (1024 ** 2)
memory_used = mem_after - mem_before
latency = end_time - start_time
# Approximate the token count by whitespace-separated words.
tokens_generated = len(output.split())
throughput = tokens_generated / latency

print("Generated text:", output)
print(f"Keras is using the backend: {keras.backend.backend()}")
print(f"Latency: {latency:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/sec")
print(f"CPU Memory Used (end - start): {memory_used:.2f} MB")
print(f"Peak CPU Memory Used: {peak_memory - mem_before:.2f} MB")

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Mohamed-Ashraf273, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a draft aimed at advancing the OpenVINO backend for Keras by implementing critical operations previously marked as unsupported. The core objective is to enable comprehensive inference capabilities for large language models, specifically demonstrating the functionality of the generate method for models like GPT2, Gemma, and Mistral. The changes involve complex OpenVINO graph constructions for various array manipulations, slicing, and random number generation, alongside refinements to tensor conversion and general data handling within the backend.

Highlights

  • Expanded OpenVINO Backend Capabilities: Implemented several previously unsupported NumPy and core Keras operations, including slice_update, repeat, take_along_axis, tri, tril, triu, and categorical for the OpenVINO backend. This significantly broadens the range of Keras functionalities that can be executed with OpenVINO (see the sketch after this list).
  • Improved Tensor Handling and Interoperability: Enhanced get_ov_output to support list and tuple inputs, enabling more flexible data processing. Additionally, convert_to_tensor now directly handles ov.Output objects, and OpenVINOKerasTensor gained a __array__ method, streamlining conversions and improving interoperability with NumPy.
  • Enabled Large Language Model (LLM) Inference: The changes enable the generate method for GPT2CausalLM, GemmaCausalLM, and MistralCausalLM models on the OpenVINO backend, as confirmed by new test results. This is a crucial step towards full LLM support.
  • Test Suite Adjustments: Updated excluded_concrete_tests.txt to reflect newly supported operations by removing them from the exclusion list. Conversely, certain random operation tests were added to the exclusion list, indicating areas still under development or requiring further integration.
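
A minimal sketch exercising a few of the newly supported ops (assuming this PR's branch is installed, since it adds both the ops and the NumPy conversion used below; the calls themselves are standard keras.ops entry points):

import os
os.environ["KERAS_BACKEND"] = "openvino"  # must be set before importing keras

import numpy as np
from keras import ops

# slice_update: write a 2x2 patch into a 4x4 tensor at offset (1, 1)
x = ops.slice_update(ops.zeros((4, 4)), (1, 1), ops.ones((2, 2)))
print(ops.convert_to_numpy(x))

# repeat and tril, two more ops implemented in this PR
print(ops.convert_to_numpy(ops.repeat(ops.arange(3), 2)))  # [0 0 1 1 2 2]
print(ops.convert_to_numpy(ops.tril(ops.ones((3, 3)))))

# take_along_axis with per-row indices
x = ops.convert_to_tensor(np.arange(6).reshape(2, 3).astype("float32"))
idx = ops.convert_to_tensor(np.array([[0], [2]]))
print(ops.convert_to_numpy(ops.take_along_axis(x, idx, axis=1)))  # [[0.] [5.]]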

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds significant new functionality to the OpenVINO backend, including implementations for slice_update, repeat, take_along_axis, tri, tril, triu, and categorical. This is a great step towards feature parity with other backends. The code is generally well-structured, but some of the new implementations are very complex. I've found a critical bug in the categorical implementation and have a few suggestions to improve code clarity and maintainability in slice_update.

@codecov-commenter commented on Jul 18, 2025

Codecov Report

Attention: Patch coverage is 88.48369% with 60 lines in your changes missing coverage. Please review.

Project coverage is 82.86%. Comparing base (d55a767) to head (8bc66a1).
Report is 5 commits behind head on master.

Files with missing lines                          | Patch %  | Lines
keras/src/backend/openvino/core.py                | 71.42%   | 15 Missing and 5 partials ⚠️
keras/src/export/openvino.py                      | 84.00%   | 4 Missing and 8 partials ⚠️
keras/src/layers/core/einsum_dense.py             | 90.90%   | 3 Missing and 9 partials ⚠️
keras/src/backend/openvino/numpy.py               | 93.84%   | 4 Missing and 4 partials ⚠️
keras/src/backend/openvino/random.py              | 93.54%   | 1 Missing and 1 partial ⚠️
keras/src/layers/core/dense.py                    | 90.47%   | 1 Missing and 1 partial ⚠️
keras/src/models/model.py                         | 50.00%   | 2 Missing ⚠️
keras/api/_tf_keras/keras/ops/__init__.py         | 0.00%    | 1 Missing ⚠️
keras/api/_tf_keras/keras/ops/numpy/__init__.py   | 0.00%    | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21491      +/-   ##
==========================================
+ Coverage   82.81%   82.86%   +0.05%     
==========================================
  Files         565      566       +1     
  Lines       55520    55963     +443     
  Branches     8664     8733      +69     
==========================================
+ Hits        45977    46376     +399     
- Misses       7428     7455      +27     
- Partials     2115     2132      +17     
Flag             | Coverage Δ
keras            | 82.67% <87.90%> (+0.05%) ⬆️
keras-jax        | 63.80% <37.04%> (+0.42%) ⬆️
keras-numpy      | 58.29% <24.37%> (-0.30%) ⬇️
keras-openvino   | 34.73% <51.24%> (+0.74%) ⬆️
keras-tensorflow | 64.27% <39.15%> (+0.43%) ⬆️
keras-torch      | 63.14% <33.58%> (-0.36%) ⬇️

@Mohamed-Ashraf273 marked this pull request as ready for review on July 20, 2025 19:21
@Mohamed-Ashraf273 changed the title from "Draft PR for OpenVINO backend to simulate the unmerged PRs functionality" to "Draft PR for OpenVINO backend to simulate the unmerged PRs functionality with the high memory usage issue" on Jul 20, 2025
@Mohamed-Ashraf273 changed the title from "Draft PR for OpenVINO backend to simulate the unmerged PRs functionality with the high memory usage issue" to "Simulated OpenVINO Backend for Testing Unmerged PR Features with Memory Profiling" on Jul 20, 2025
@Mohamed-Ashraf273 force-pushed the gsoc2025 branch 2 times, most recently from 5cd93f9 to e26fc10 on July 21, 2025 15:09