[Performance] High Memory Usage During GPT-2 Generation Using OpenVINO Backend on Keras 3 Compared to other backends #31390

@Mohamed-Ashraf273

Description

OpenVINO Version

No response

Operating System

Ubuntu 22.04 (LTS)

Device used for inference

CPU

OpenVINO installation

PyPi

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

GPT-2

Model quantization

No

Mentions

@rkazants
@mvafin
@mlukasze

Performance issue description

During my GSoC project, I've faced this issue:
Running the generate step with the OpenVINO backend shows very high memory usage compared to the other backends. The measurements below are based on these PRs:
Keras: keras-team/keras#21500
Keras_hub: keras-team/keras-hub#2310

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float32")
causal_lm.summary()
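
As a quick sanity check before generating, the active backend and the model's weight dtype can be printed (a minimal sketch that relies on the causal_lm object loaded above):

import keras

print("Backend in use:", keras.backend.backend())
print("Model weight dtype:", causal_lm.dtype)  # should reflect the dtype passed to from_preset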

For OpenVINO, the model is serialized with the following size:

Total params: 354,823,168 (1.32 GB)
Trainable params: 354,823,168 (1.32 GB)
Non-trainable params: 0 (0.00 B)

Generate with max_length=20:

OpenVINO:
Generated text: Keras is  an open-source machine learning framework for Python, written by 
Keras is using the backend: openvino
Latency: 7.02 seconds
Throughput: 1.57 tokens/sec
CPU Memory Used (end - start): 2708.63 MB
Peak CPU Memory Used: 2832.45 MB

Tensorflow:
Generated text: Keras is  a powerful Python programming language that allows you to create powerful interactive models
Keras is using the backend: tensorflow
Latency: 8.97 seconds
Throughput: 1.67 tokens/sec
CPU Memory Used (end - start): 264.24 MB
Peak CPU Memory Used: 264.07 MB

JAX:
Generated text: Keras is  an object-oriented framework for building complex models with Python. It
Keras is using the backend: jax
Latency: 11.65 seconds
Throughput: 1.03 tokens/sec
CPU Memory Used (end - start): 260.25 MB
Peak CPU Memory Used: 260.07 MB


Torch:
Generated text: Keras is _____ and you want to use Keras?
This tutorial explains
Keras is using the backend: torch
Latency: 4.13 seconds
Throughput: 2.91 tokens/sec
CPU Memory Used (end - start): 56.97 MB
Peak CPU Memory Used: 56.61 MB

And for the float16 preset:

causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")
causal_lm.summary()

For OpenVINO, the model is serialized with the following size:

Total params: 354,823,168 (676.77 MB)
Trainable params: 354,823,168 (676.77 MB)
Non-trainable params: 0 (0.00 B)

Generate with max_length=20:

OpenVINO:
Generated text: Keras is  an open source framework for rapid prototyping and automation that is designed
Keras is using the backend: openvino
Latency: 12.31 seconds
Throughput: 1.14 tokens/sec
CPU Memory Used (end - start): 5564.21 MB
Peak CPU Memory Used: 6352.55 MB

Tensorflow:
Generated text: Keras is  a great language library for the JavaScript programming language. It provides you
Keras is using the backend: tensorflow
Latency: 11.43 seconds
Throughput: 1.22 tokens/sec
CPU Memory Used (end - start): 441.70 MB
Peak CPU Memory Used: 441.34 MB

JAX:
Generated text: Keras is _____. It's a language that is written in, or at least
Keras is using the backend: jax
Latency: 11.12 seconds
Throughput: 1.17 tokens/sec
CPU Memory Used (end - start): 364.71 MB
Peak CPU Memory Used: 1909.30 MB

Torch:
Generated text: Keras is  a powerful machine learning library written by  Martin Odersky 
Keras is using the backend: torch
Latency: 19.95 seconds
Throughput: 0.55 tokens/sec
CPU Memory Used (end - start): 62.81 MB
Peak CPU Memory Used: 258.79 MB
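
For quick comparison, the peak CPU memory figures reported above can be put side by side (a small summary sketch; the numbers are copied directly from the runs above, in MB):

# Peak CPU memory during generate(): (float32 run, float16 run), taken from the logs above.
peak_mb = {
    "openvino":   (2832.45, 6352.55),
    "tensorflow": (264.07, 441.34),
    "jax":        (260.07, 1909.30),
    "torch":      (56.61, 258.79),
}
for backend, (fp32, fp16) in peak_mb.items():
    print(f"{backend:>10}: float32 {fp32:8.2f} MB | float16 {fp16:8.2f} MB")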

Step-by-step reproduction

Using these PRs:
Keras: keras-team/keras#21500
Keras_hub: keras-team/keras-hub#2310

run the following code:

import os

# Force CPU and select the Keras backend before importing keras.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
os.environ["KERAS_BACKEND"] = "openvino"


import time
import psutil
import keras
import keras_hub
import threading


causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_medium_en", dtype="float16")

process = psutil.Process(os.getpid())
mem_before = process.memory_info().rss / (1024 ** 2)

peak_memory = mem_before
done = [False]

# Sample the process RSS in a background thread to record peak memory during generate().
def monitor_memory():
    global peak_memory
    while not done[0]:
        mem_now = process.memory_info().rss / (1024 ** 2)
        if mem_now > peak_memory:
            peak_memory = mem_now
        time.sleep(0.05)

monitor_thread = threading.Thread(target=monitor_memory)
monitor_thread.start()

start_time = time.perf_counter()
output = causal_lm.generate("Keras is ", max_length=20)

end_time = time.perf_counter()
done[0] = True
monitor_thread.join()

mem_after = process.memory_info().rss / (1024 ** 2)
memory_used = mem_after - mem_before
latency = end_time - start_time
tokens_generated = len(output.split())  # approximation: counts whitespace-separated words, not model tokens
throughput = tokens_generated / latency

print("Generated text:", output)
print(f"Keras is using the backend: {keras.backend.backend()}")
print(f"Latency: {latency:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/sec")
print(f"CPU Memory Used (end - start): {memory_used:.2f} MB")
print(f"Peak CPU Memory Used: {peak_memory - mem_before:.2f} MB")

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.

Metadata

Labels

bug (Something isn't working), category: transformations (OpenVINO Runtime library - Transformations), performance (Performance related topics)
