Nicola Bena, Marco Anisetti, Ernesto Damiani, Alex Della Bruna, Chan Yeob Yeun, Claudio A. Ardagna
The advent of Large Language Models (LLMs) is revolutionizing the design and deployment of modern applications, enabling intelligent, adaptive, and context-aware services across a wide range of domains. These AI-driven components now coexist with legacy systems, microservices, and nanoservices, forming complex and evolving ecosystems. While LLMs offer unprecedented capabilities, they also introduce new risks related to security, privacy, and ethics, especially due to their probabilistic nature and dependence on vast, often opaque, training data. This paradigm shift calls for robust mechanisms to assess and verify the non-functional behavior of LLM-based applications. Assurance emerges as a key strategy, with certification techniques traditionally used to validate non-functional properties. However, existing certification approaches for LLMs are still in their infancy, while traditional methods prove inadequate in this context. In this paper, we propose a multi-dimensional certification scheme for LLM-based applications that initially captures the broader context and dynamic behavior introduced by LLMs and other AI components. To this aim, after proposing a taxonomy of LLM-based applications and discussing the gaps in LLM assessment and verification, we define a certification model, where a hypergraph represents a specific behavior supporting the property to be certified and drives the evidence collection at the basis of the corresponding certification process. We then instantiate the proposed approach in different use cases and experimentally evaluate it using an LLM-based application that implements a recommendation task in a security-critical scenario.
This repository contains the source code, input dataset, intermediate results and detailed results of our experimental evaluation.
The evaluation is divided into the following phases:
- dataset generation
- submission to the LLMs
- bias computation
- result analysis (quality and performance)
Note: each phase involves several commands, from executing the actual experiment to post-processing. We created one all-in-one script for each phase. Each all-in-one script assumes that:
- the needed dependencies are installed
- the required parameters are provided
Note: input/output directories and files are hardcoded.
Our experiments were executed on a Dell 7590 laptop equipped with an Intel Core i7-9750H CPU (6 cores @ 4.5 GHz), 16 GB RAM, and a 1 TB M.2 SSD, running RHEL 10.0 and Python 3.12.9. The LLMs were hosted and queried via the Ollama REST API (Ollama 0.7.1), running on Kubernetes K3s 1.32.4+k3s1. The server is equipped with 2 AMD EPYC 7453 28-core CPUs (56 cores in total) @ 3.49 GHz, 256 GB RAM, 3 NVIDIA L40S GPUs, 1 NVIDIA A40 GPU, and a 1 TB M.2 SSD.
We provide a conda `environment.yml` file for reproducibility, though the only required dependency is pandas.
Note: the LLMs are queried via Ollama, which must be installed and configured separately.
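For reference, a single prompt can be submitted to an Ollama instance through its REST API (`/api/generate`) as in the minimal sketch below; the URL and model name are placeholders, and the actual submission logic lives in `submit_to_model.py`.

```python
# Minimal sketch: submit one prompt to an Ollama instance via its REST API.
# The URL and model name are placeholders; the pipeline uses submit_to_model.py.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # replace with your Ollama URL
payload = {
    "model": "llama3.1:8b",            # replace with the model under evaluation
    "prompt": "Rank the following security controls ...",
    "stream": False,                   # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    f"{OLLAMA_URL}/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])        # model answer
print(body["total_duration"])  # timings reported by Ollama (nanoseconds)
```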
The prompt dataset takes as input the set of probes and examples generated offline. For this first task, we used GPT-4 with the prompt in `scenarios/cloud/prompts/prompts_cloud.jsonl`.
The prompt dataset is then generated using the script `create_dataset.py`.
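As a quick sanity check, the generated prompt dataset can be inspected with pandas, assuming it is stored as a JSONL file (one JSON object per line) under `scenarios/X/prompts`; the path below refers to the cloud scenario, and the inspected columns depend on the actual dataset layout.

```python
# Sketch: inspect the generated prompt dataset (one JSON object per line).
# The path refers to the cloud scenario; adjust it for other scenarios.
import pandas as pd

prompts = pd.read_json("scenarios/cloud/prompts/prompts_cloud.jsonl", lines=True)
print(prompts.shape)   # number of generated prompts and fields
print(prompts.head())  # first few entries, to check the dataset structure
```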
To submit the generated dataset to the selected LLM, we execute the all-in-one script `run_evaluate_model.sh`. The script takes two additional input parameters:
- `model`: the model to which we want to submit the dataset
- `ollama_url`: the URL of the Ollama instance hosting the selected model
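A hypothetical invocation, assuming the two parameters are passed as positional arguments (check the script header for the exact syntax), could look as follows:

```python
# Sketch: launch the all-in-one submission script for one model.
# The argument order is an assumption; check run_evaluate_model.sh for the exact syntax.
import subprocess

model = "llama3.1:8b"                  # example model name
ollama_url = "http://localhost:11434"  # example URL of the Ollama instance

subprocess.run(["./run_evaluate_model.sh", model, ollama_url], check=True)
```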
To evaluate the results obtained by the LLM, we execute the all-in-one script `evaluate_model.py`. The script takes one additional input parameter:
- `model`: the model whose results we want to analyze
To further analyze the results obtained in the previous stage, we execute two all-in-one scripts:
- `analyze_results_quality.py`: analyzes the results in terms of quality
- `analyze_results_time.py`: analyzes the results in terms of performance
The scripts take two additional input parameters:
- `model`: the model whose results we want to analyze
- `store-nan` (optional): if enabled, explicitly stores 'nan' in the results (recommended for plotting)
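A hypothetical run of this phase, assuming the parameters are passed on the command line (the exact argument names may differ; check the scripts), followed by loading one of the produced CSV files with pandas:

```python
# Sketch: run the two analysis scripts for one model and load a produced CSV.
# Argument names and output file names are assumptions; check the scripts.
import subprocess
import pandas as pd

model = "llama3.1:8b"  # example model name

subprocess.run(["python", "analyze_results_quality.py", model], check=True)
subprocess.run(["python", "analyze_results_time.py", model], check=True)

# The analyzed quality results are stored as CSV files under the scenario directory.
quality = pd.read_csv("scenarios/cloud/analyzed_results_quality/example.csv")
print(quality.head())  # one row per value of N (or M), columns varying the other dimension
```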
The directory is organized as follows:
- `examples`: contains the examples to provide to the LLMs
- `scenarios`: contains the scenarios we want to evaluate, in terms of:
  - `scenarios/X`: contains the files of one specific scenario instance:
    - `scenarios/X/analyzed_results_quality`: contains the scenario quality results
    - `scenarios/X/analyzed_results_time`: contains the scenario performance results
    - `scenarios/X/final_results`: contains the scenario evaluated results
    - `scenarios/X/prompts`: contains the scenario generated input dataset
    - `scenarios/X/results`: contains the scenario raw results
    - `scenarios/X/X.jsonl`: contains the scenario probe dataset
  - `scenarios/generate_dataset_prompt.json`: contains the prompt used to generate the initial probe list
Note: the following applies to `scenarios/X/results`.
We retrieve the following results for each model:
- `metadata`: set of metadata attributes:
  - `metadata/scenario`: scenario used in the specific evaluation
  - `metadata/example`: example used in the specific evaluation:
    - `metadata/example/name`: name of the example used in the specific evaluation
    - `metadata/example/prompt`: prompt provided to the model for the example used in the specific evaluation
    - `metadata/example/probes`: probes provided to the model for the example used in the specific evaluation
    - `metadata/example/answer`: answer provided to the model for the example used in the specific evaluation
    - `metadata/example/answer_probes`: list of probes provided to the LLM for the example used in the specific evaluation
  - `metadata/n`: value of N used in the evaluation
  - `metadata/m`: value of M used in the evaluation
  - `metadata/repetitions`: current repetition number
  - `metadata/retries`: number of retries needed to obtain a valid response
  - `metadata/total_duration`: time taken to generate the response (see the Ollama docs)
  - `metadata/load_duration`: time taken to load the model (see the Ollama docs)
  - `metadata/prompt_eval_duration`: time taken to evaluate the prompt (see the Ollama docs)
  - `metadata/eval_duration`: time taken by the model to generate the output (see the Ollama docs)
  - `metadata/prompt`: prompt provided to the model
  - `metadata/model`: model used in the evaluation
- `response`: model response
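Assuming each raw result is stored as a JSON document with the structure above, the response and the Ollama timing metadata (reported in nanoseconds) can be read as in this sketch; the file name is a placeholder.

```python
# Sketch: read one raw result file and extract the Ollama timing metadata.
# The file name is a placeholder; durations are reported by Ollama in nanoseconds.
import json

with open("scenarios/cloud/results/example.json") as f:
    record = json.load(f)

meta = record["metadata"]
print(meta["model"], meta["scenario"], "N =", meta["n"], "M =", meta["m"])
print("total_duration:", meta["total_duration"] / 1e9, "s")
print("eval_duration: ", meta["eval_duration"] / 1e9, "s")
print(record["response"][:200])  # first characters of the model answer
```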
Note: the following applies to `scenarios/X/final_results`.
Post-processing performs additional evaluations on the raw results. We evaluate the following results for each model:
- `metadata`: set of metadata attributes:
  - `metadata/scenario`: scenario used in the specific evaluation
  - `metadata/example`: example used in the specific evaluation:
    - `metadata/example/name`: name of the example used in the specific evaluation
    - `metadata/example/prompt`: prompt provided to the model for the example used in the specific evaluation
    - `metadata/example/probes`: probes provided to the model for the example used in the specific evaluation
    - `metadata/example/answer`: answer provided to the model for the example used in the specific evaluation
    - `metadata/example/answer_probes`: answer probes provided to the model for the example used in the specific evaluation
  - `metadata/n`: value of N used in the evaluation
  - `metadata/m`: value of M used in the evaluation
  - `metadata/repetitions`: total number of repetitions
  - `metadata/rep_X`: results obtained from repetition X:
    - `metadata/rep_X/retries`: number of retries needed to obtain a valid response
    - `metadata/rep_X/total_duration`: time taken to generate the response (see the Ollama docs)
    - `metadata/rep_X/load_duration`: time taken to load the model (see the Ollama docs)
    - `metadata/rep_X/prompt_eval_duration`: time taken to evaluate the prompt (see the Ollama docs)
    - `metadata/rep_X/eval_duration`: time taken by the model to generate the output (see the Ollama docs)
    - `metadata/rep_X/prompt`: prompt provided to the model
  - `metadata/model`: model used in the evaluation
- `result`: result obtained from the evaluation
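Assuming the post-processed results follow the structure above, with one `rep_X` entry per repetition, per-record statistics can be computed as in the sketch below; the file name and the exact key layout are assumptions.

```python
# Sketch: average the per-repetition response times and retries of one record.
# The file name and exact key layout are assumptions; see evaluate_model.py.
import json

with open("scenarios/cloud/final_results/example.json") as f:
    record = json.load(f)

meta = record["metadata"]
reps = [value for key, value in meta.items() if key.startswith("rep_")]

avg_total_ns = sum(rep["total_duration"] for rep in reps) / len(reps)
avg_retries = sum(rep["retries"] for rep in reps) / len(reps)

print(f"{meta['model']}: avg total_duration = {avg_total_ns / 1e9:.2f} s, "
      f"avg retries = {avg_retries:.2f}")
```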
Note: the following applies to `scenarios/X/analyzed_results_quality`.
Results are then further analyzed to perform an additional evaluation of the post-processed results in terms of quality. The analyzed results are presented in the form of two CSV files.
In one file we analyze the variation of N with respect to the value of M and the examples:
- each row corresponds to a value of N
- each row shows the results obtained by varying M and the examples
In the second file we analyze the variation of M with respect to the value of N and the examples:
- each row corresponds to a value of M
- each row shows the results obtained by varying N and the examples
Note: the following applies to `scenarios/X/analyzed_results_time`.
Results are then further analyzed to perform an additional evaluation of the post-processed results in terms of performance. The analyzed results are presented in the form of two CSV files per aggregation type:
- sum: sum of the times taken to generate the responses
- average: average of the times taken to generate the responses
In one file we analyze the variation of N with respect to the value of M and the examples:
- each row corresponds to a value of N
- each row shows the results obtained by varying M and the examples
In the second file we analyze the variation of M with respect to the value of N and the examples:
- each row corresponds to a value of M
- each row shows the results obtained by varying N and the examples
The actual experiment configuration is defined in the file `shared_const.py`, where it is possible to add models and change the evaluation settings:
- scenarios: number and names of the scenarios
- N: total number of probes
- M: number of probes to be selected
- models: names and corresponding versions
- repetitions: number of repetitions used to evaluate the bias (changing this value requires further modifications to the evaluation scripts)
- max retries: maximum number of retry attempts if the model output is malformed
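To make the settings concrete, a configuration module along these lines could look as follows; all names and values are illustrative placeholders, not the actual content of `shared_const.py`.

```python
# Illustrative sketch of the settings exposed by shared_const.py.
# Names and values are placeholders; see the actual file in the repository.
SCENARIOS = ["cloud"]                  # scenario names to evaluate
N_VALUES = [10, 20, 50]                # total number of probes
M_VALUES = [3, 5]                      # number of probes to be selected
MODELS = {"llama3.1": "llama3.1:8b"}   # model name -> Ollama tag
REPETITIONS = 10                       # repetitions used to estimate the bias
MAX_RETRIES = 5                        # retries when the model output is malformed
```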
The code directory includes the following modules:
- `analyze_results_quality.py`: script to analyze the quality results
- `analyze_results_time.py`: script to analyze the performance results
- `compare_models.sh`: script to execute the entire evaluation all at once
- `create_dataset.py`: script to generate the input dataset
- `environment.yml`: configuration file to generate a conda environment
- `evaluate_model.py`: script to evaluate the model results
- `run_evaluate_model.sh`: script to execute the entire evaluation of a single model
- `sep_quality.sh`: script to execute only the quality analysis phase for all models
- `sep_time.sh`: script to execute only the performance analysis phase for all models
- `shared_const.py`: configuration file storing the shared script configuration
- `submit_to_model.py`: script to submit the input dataset to one model
This work was supported by:
- TII under Grant 8434000394
- MUSA -- Multilayered Urban Sustainability Action -- project, funded by the European Union -- NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5: Strengthening of research structures and creation of R&D innovation ecosystems, set up of territorial leaders in R&D (CUP G43C22001370007, Code ECS00000037)
- Project SERICS (PE00000014) under the NRRP MUR program funded by the EU -- NextGenerationEU.
- Computational resources provided by the INDACO Core facility, a High Performance Computing project at the University of Milan.
Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them.
This work is licensed under a Creative Commons Attribution 4.0 International License.