Labels: bug (Something isn't working), category: transformations (OpenVINO Runtime library - Transformations), performance (Performance related topics)
Description
OpenVINO Version
No response
Operating System
Ubuntu 22.04 (LTS)
Device used for inference
CPU
OpenVINO installation
PyPI
Programming Language
Python
Hardware Architecture
x86 (64 bits)
Model used
GPT-2
Model quantization
No
Performance issue description
During my GSoC project I ran into this issue: running the generate step with the OpenVINO backend results in very high memory usage. The measurements below are based on these PRs:
Keras: keras-team/keras#21500
Keras_hub: keras-team/keras-hub#2310
```python
causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
```
With the OpenVINO backend, the model is serialized with size:

Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)
Generate with max_length=20:
OpenVINO:
Generated text: Keras is an open-source machine learning framework for Python, written by
Keras is using the backend: openvino
Latency: 7.02 seconds
Throughput: 1.57 tokens/sec
CPU Memory Used (end - start): 2708.63 MB
Peak CPU Memory Used: 2832.45 MB
Tensorflow:
Generated text: Keras is a powerful Python programming language that allows you to create powerful interactive models
Keras is using the backend: tensorflow
Latency: 8.97 seconds
Throughput: 1.67 tokens/sec
CPU Memory Used (end - start): 264.24 MB
Peak CPU Memory Used: 264.07 MB
JAX:
Generated text: Keras is an object-oriented framework for building complex models with Python. It
Keras is using the backend: jax
Latency: 11.65 seconds
Throughput: 1.03 tokens/sec
CPU Memory Used (end - start): 260.25 MB
Peak CPU Memory Used: 260.07 MB
Torch:
Generated text: Keras is _____ and you want to use Keras?
This tutorial explains
Keras is using the backend: torch
Latency: 4.13 seconds
Throughput: 2.91 tokens/sec
CPU Memory Used (end - start): 56.97 MB
Peak CPU Memory Used: 56.61 MB
For the float16 variant:
```python
causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()
```
With the OpenVINO backend, the model is serialized with size:

Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)
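As a quick sanity check (not part of the original report), the serialized sizes above are consistent with the raw parameter count times the element width:

```python
params = 354_823_168  # GPT-2 medium parameter count from the summaries above

# float32: 4 bytes per parameter
print(f"{params * 4 / 1024**3:.2f} GB")  # 1.32 GB, matching the float32 summary

# float16: 2 bytes per parameter
print(f"{params * 2 / 1024**2:.2f} MB")  # 676.77 MB, matching the float16 summary
```

So the serialized model itself is not the source of the extra memory; the multi-GB usage appears only during generation with the OpenVINO backend.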
Generate with max_length=20:
OpenVINO:
Generated text: Keras is an open source framework for rapid prototyping and automation that is designed
Keras is using the backend: openvino
Latency: 12.31 seconds
Throughput: 1.14 tokens/sec
CPU Memory Used (end - start): 5564.21 MB
Peak CPU Memory Used: 6352.55 MB
Tensorflow:
Generated text: Keras is a great language library for the JavaScript programming language. It provides you
Keras is using the backend: tensorflow
Latency: 11.43 seconds
Throughput: 1.22 tokens/sec
CPU Memory Used (end - start): 441.70 MB
Peak CPU Memory Used: 441.34 MB
JAX:
Generated text: Keras is _____. It's a language that is written in, or at least
Keras is using the backend: jax
Latency: 11.12 seconds
Throughput: 1.17 tokens/sec
CPU Memory Used (end - start): 364.71 MB
Peak CPU Memory Used: 1909.30 MB
Torch:
Generated text: Keras is a powerful machine learning library written by Martin Odersky
Keras is using the backend: torch
Latency: 19.95 seconds
Throughput: 0.55 tokens/sec
CPU Memory Used (end - start): 62.81 MB
Peak CPU Memory Used: 258.79 MB
Step-by-step reproduction
Apply these PRs:
Keras: keras-team/keras#21500
Keras_hub: keras-team/keras-hub#2310
Then run the following code:
```python
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["KERAS_BACKEND"] = "openvino"

import threading
import time

import psutil
import keras
import keras_hub

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 ** 2)
peak_memory = mem_before
done = [False]

def monitor_memory():
    # Poll RSS every 50 ms on a background thread to record the peak.
    global peak_memory
    while not done[0]:
        mem_now = process.memory_info().rss / (1024 ** 2)
        if mem_now > peak_memory:
            peak_memory = mem_now
        time.sleep(0.05)

monitor_thread = threading.Thread(target=monitor_memory)
monitor_thread.start()

start_time = time.perf_counter()
output = causal_lm.generate("Keras is ", max_length=20)
end_time = time.perf_counter()

done[0] = True
monitor_thread.join()

mem_after = process.memory_info().rss / (1024 ** 2)
memory_used = mem_after - mem_before
latency = end_time - start_time
tokens_generated = len(output.split())  # approximates tokens by whitespace-split words
throughput = tokens_generated / latency

print("Generated text:", output)
print(f"Keras is using the backend: {keras.backend.backend()}")
print(f"Latency: {latency:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/sec")
print(f"CPU Memory Used (end - start): {memory_used:.2f} MB")
print(f"Peak CPU Memory Used: {peak_memory - mem_before:.2f} MB")
```
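If it helps with triage, the polling-thread peak can also be cross-checked with the standard library's `resource` module, which asks the OS for the process's high-water RSS directly. A minimal sketch, assuming a POSIX system (`ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of this process, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 ** 2)  # macOS reports bytes
    return rss / 1024             # Linux reports kilobytes

print(f"Peak CPU Memory Used (OS high-water mark): {peak_rss_mb():.2f} MB")
```

Unlike the 50 ms polling loop, this captures short allocation spikes that fall between samples, so it should give an upper bound on the polled peak.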
Issue submission checklist
- I'm reporting a performance issue. It's not a question.
- I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
- There is reproducer code and related data files such as images, videos, models, etc.