[Performance] High Memory Usage During GPT-2 Generation Using OpenVINO Backend on Keras 3 Compared to other backends #31390

@Mohamed-Ashraf273

Description

OpenVINO Version

No response

Operating System

Ubuntu 22.04 (LTS)

Device used for inference

CPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

GPT-2

Model quantization

No

Mentions

@rkazants
@mvafin
@mlukasze

Performance issue description

During my GSoC project, I've faced this issue:
Running the generate step with the OpenVINO backend shows very high memory usage compared to the other backends. The measurements below are based on these PRs:
Keras: keras-team/keras#21500
Keras_hub: keras-team/keras-hub#2310

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
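
As a quick sanity check before generating, the active backend and the model's weight dtype can be printed (a minimal sketch that relies on the causal_lm object loaded above):

import keras

print("Backend in use:", keras.backend.backend())
print("Model weight dtype:", causal_lm.dtype)  # should reflect the dtype passed to from_preset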

For OpenVINO, the model is serialized with the following size:

Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)

Generate with max_length=20:

OpenVINO:
Generated text: Keras is  an open-source machine learning framework for Python, written by 
Keras is using the backend: openvino
Latency: 7.02 seconds
Throughput: 1.57 tokens/sec
CPU Memory Used (end - start): 2708.63 MB
Peak CPU Memory Used: 2832.45 MB

Tensorflow:
Generated text: Keras is  a powerful Python programming language that allows you to create powerful interactive models
Keras is using the backend: tensorflow
Latency: 8.97 seconds
Throughput: 1.67 tokens/sec
CPU Memory Used (end - start): 264.24 MB
Peak CPU Memory Used: 264.07 MB

JAX:
Generated text: Keras is  an object-oriented framework for building complex models with Python. It
Keras is using the backend: jax
Latency: 11.65 seconds
Throughput: 1.03 tokens/sec
CPU Memory Used (end - start): 260.25 MB
Peak CPU Memory Used: 260.07 MB


Torch:
Generated text: Keras is _____ and you want to use Keras?
This tutorial explains
Keras is using the backend: torch
Latency: 4.13 seconds
Throughput: 2.91 tokens/sec
CPU Memory Used (end - start): 56.97 MB
Peak CPU Memory Used: 56.61 MB

And for the float16 preset:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()

For OpenVINO, the model is serialized with the following size:

Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)

Generate with max_length=20:

OpenVINO:
Generated text: Keras is  an open source framework for rapid prototyping and automation that is designed
Keras is using the backend: openvino
Latency: 12.31 seconds
Throughput: 1.14 tokens/sec
CPU Memory Used (end - start): 5564.21 MB
Peak CPU Memory Used: 6352.55 MB

Tensorflow:
Generated text: Keras is  a great language library for the JavaScript programming language. It provides you
Keras is using the backend: tensorflow
Latency: 11.43 seconds
Throughput: 1.22 tokens/sec
CPU Memory Used (end - start): 441.70 MB
Peak CPU Memory Used: 441.34 MB

JAX:
Generated text: Keras is _____. It's a language that is written in, or at least
Keras is using the backend: jax
Latency: 11.12 seconds
Throughput: 1.17 tokens/sec
CPU Memory Used (end - start): 364.71 MB
Peak CPU Memory Used: 1909.30 MB

Torch:
Generated text: Keras is  a powerful machine learning library written by  Martin Odersky 
Keras is using the backend: torch
Latency: 19.95 seconds
Throughput: 0.55 tokens/sec
CPU Memory Used (end - start): 62.81 MB
Peak CPU Memory Used: 258.79 MB
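
For quick comparison, the peak CPU memory figures reported above can be put side by side (a small summary sketch; the numbers are copied directly from the runs above, in MB):

# Peak CPU memory during generate(): (float32 run, float16 run), taken from the logs above.
peak_mb = {
    "openvino":   (2832.45, 6352.55),
    "tensorflow": (264.07, 441.34),
    "jax":        (260.07, 1909.30),
    "torch":      (56.61, 258.79),
}
for backend, (fp32, fp16) in peak_mb.items():
    print(f"{backend:>10}: float32 {fp32:8.2f} MB | float16 {fp16:8.2f} MB")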

Step-by-step reproduction

Using these PRs:
Keras: keras-team/keras#21500
Keras_hub: keras-team/keras-hub#2310

run the following code:

import os

# Force CPU and select the Keras backend before importing keras.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["KERAS_BACKEND"] = "openvino"


import time
import psutil
import keras
import keras_hub
import threading


causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 ** 2)

peak_memory = mem_before
done = [False]

# Sample the process RSS in a background thread to record peak memory during generate().
def monitor_memory():
    global peak_memory
    while not done[0]:
        mem_now = process.memory_info().rss / (1024 ** 2)
        if mem_now > peak_memory:
            peak_memory = mem_now
        time.sleep(0.05)

monitor_thread = threading.Thread(target=monitor_memory)
monitor_thread.start()

start_time = time.perf_counter()
output = causal_lm.generate("Keras is ", max_length=20)

end_time = time.perf_counter()
done[0] = True
monitor_thread.join()

mem_after = process.memory_info().rss / (1024 ** 2)
memory_used = mem_after - mem_before
latency = end_time - start_time
tokens_generated = len(output.split())  # approximation: counts whitespace-separated words, not model tokens
throughput = tokens_generated / latency

print("Generated text:", output)
print(f"Keras is using the backend: {keras.backend.backend()}")
print(f"Latency: {latency:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/sec")
print(f"CPU Memory Used (end - start): {memory_used:.2f} MB")
print(f"Peak CPU Memory Used: {peak_memory - mem_before:.2f} MB")

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.

Metadata

Labels

bug (Something isn't working), category: transformations (OpenVINO Runtime library - Transformations), performance (Performance related topics)
