Replies: 3 comments 1 reply
-
Hi @njelicic, I don't think you can draw conclusions based on the differences in cosine similarities without knowing the distribution of cosine similarities. It could be that a cosine similarity of 0.75 is near the top of your distilled model's range for cross-lingual sentence pairs. What matters more is that more similar sentences get higher cosine similarities, and vice versa. We have run extensive benchmarks, which are documented in our results. However, if you have access to real data that reflects the task you want to solve, I would always recommend running benchmarks yourself to see whether the performance is good enough. While we have benchmarked on a large number of tasks and datasets, there is no way to know whether the model will work for your task without testing and benchmarking it.
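For example, if you have a handful of labeled pairs from your own task, a quick sanity check is to look at how well the cosine similarities rank the pairs, e.g. with a Spearman correlation. This is a minimal sketch; the pairs, gold labels, and model path below are placeholders you would replace with your own data:

from model2vec import StaticModel
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

# Placeholder labeled pairs from your own task: (sentence1, sentence2, gold similarity in [0, 1]).
# In practice you would want many more pairs than this.
pairs = [
    ("I am going to the store", "Voy a la tienda", 1.0),
    ("I need help", "Necesito ayuda", 1.0),
    ("I am going to the store", "Elle étudie dur", 0.0),
    ("I need help", "Ich habe ein Auto", 0.0),
]

model = StaticModel.from_pretrained("m2v_model")  # path to your distilled model

predicted, gold = [], []
for s1, s2, label in pairs:
    e1, e2 = model.encode(s1), model.encode(s2)
    predicted.append(1 - cosine(e1, e2))
    gold.append(label)

# The ranking is what matters here, not the absolute similarity values.
correlation, p_value = spearmanr(predicted, gold)
print(f"Spearman correlation: {correlation:.3f} (p = {p_value:.3f})")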
-
I discovered this while testing model2vec in my RAG application (it was not retrieving any relevant cross-lingual documents) and tried to isolate a few examples for discussion. I did some more testing to demonstrate the effect. I created 4 small datasets: Cross-Lingual (translation pairs), Cross-Lingual Negative (unrelated sentences in different languages), Inter-Lingual (pairs with similar meaning in the same language), and Inter-Lingual Negative (unrelated sentences in the same language).
Next, I ran t-tests to compare the mean cosine similarities of the different datasets within each model. For the model2vec model, all pairs of distributions have statistically significantly different means (p < 0.05). The original model, however, shows no statistically significant difference in means for Cross-Lingual vs Inter-Lingual (p = 0.1251) or for Cross-Lingual Negative vs Inter-Lingual Negative (p = 0.7561). The distribution plots and the printed t-test results per model ("T-test Results Within Model2Vec", "T-test Results Within SentenceTransformer") show essentially the same picture. Here's the code to reproduce the results:
from sentence_transformers import SentenceTransformer
from model2vec import StaticModel
from scipy.spatial.distance import cosine
from scipy.stats import f_oneway, ttest_ind
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
models = {
"SentenceTransformer": SentenceTransformer("BAAI/bge-m3", device=0),
"model2vec": StaticModel.from_pretrained("m2v_model")
}
cross_lingual = [
{'sentence1': "I am going to the store", 'sentence2': "Voy a la tienda"}, # Spanish
{'sentence1': "She is studying hard", 'sentence2': "Elle étudie dur",}, # French
{'sentence1': "I love programming", 'sentence2': "Ich liebe Programmieren"}, # German
{'sentence1': "Good morning", 'sentence2': "Bom dia"}, # Portuguese
{'sentence1': "How are you?", 'sentence2': "Comment ça va?"}, # French
{'sentence1': "Where is the library?", 'sentence2': "Wo ist die Bibliothek?"}, # German
{'sentence1': "I need help", 'sentence2': "Necesito ayuda"}, # Spanish
{'sentence1': "Thank you very much", 'sentence2': "Muchas gracias"}, # Spanish
{'sentence1': "This is my cat", 'sentence2': "Ceci est mon chat"}, # French
{'sentence1': "My favorite color is blue", 'sentence2': "Mi color favorito es azul"}, # Spanish
{'sentence1': "Let's go out for lunch", 'sentence2': "Vamos a salir a almorzar"}, # Spanish
{'sentence1': "I am happy", 'sentence2': "Sono felice"}, # Italian
{'sentence1': "Good evening", 'sentence2': "Buenas noches"}, # Spanish
{'sentence1': "I am from the USA", 'sentence2': "Je viens des États-Unis"}, # French
{'sentence1': "What time is it?", 'sentence2': "Que hora es?"}, # Spanish
{'sentence1': "I have a car", 'sentence2': "Ich habe ein Auto"}, # German
{'sentence1': "I want to learn", 'sentence2': "Je veux apprendre"}, # French
{'sentence1': "My name is John", 'sentence2': "Meu nome é João"}, # Portuguese
{'sentence1': "I have a brother", 'sentence2': "J'ai un frère"}, # French
{'sentence1': "I am sleepy", 'sentence2': "Tengo sueño"}, # Spanish
{'sentence1': "Please help me", 'sentence2': "Por favor ayúdame"}, # Spanish
{'sentence1': "How old are you?", 'sentence2': "Quanti anni hai?"}, # Italian
{'sentence1': "I like music", 'sentence2': "Ik hou van muziek"}, # Dutch
]
inter_lingual = [
{'sentence1': "I am very tired", 'sentence2': "I feel exhausted"}, # English, same meaning
{'sentence1': "She enjoys reading", 'sentence2': "She likes to read books"}, # English, same meaning
{'sentence1': "I am learning Python", 'sentence2': "I am studying Python programming"}, # English, same meaning
{'sentence1': "The sky is blue", 'sentence2': "The clouds are white"}, # English, same meaning
{'sentence1': "I like coffee", 'sentence2': "I prefer coffee over tea"}, # English, same meaning
{'sentence1': "She is my friend", 'sentence2': "She is one of my best friends"}, # English, same meaning
{'sentence1': "It is very hot today", 'sentence2': "Today is extremely warm outside"}, # English, same meaning
{'sentence1': "I am tired", 'sentence2': "I need rest"}, # English, same meaning
{'sentence1': "The food was delicious", 'sentence2': "The meal was amazing"}, # English, same meaning
{'sentence1': "She sings beautifully", 'sentence2': "She has a lovely voice"}, # English, same meaning
{'sentence1': "Mi chiamo Luca", 'sentence2': "Il mio nome è Luca"}, # Italian, same meaning
{'sentence1': "Jag älskar att läsa", 'sentence2': "Jag tycker om att läsa böcker"}, # Swedish, same meaning
{'sentence1': "C'est un beau jour", 'sentence2': "Il fait beau aujourd'hui"}, # French, same meaning
{'sentence1': "Das Wetter ist schön", 'sentence2': "Es ist sonnig heute"}, # German, same meaning
{'sentence1': "Oggi è una giornata calda", 'sentence2': "Fa caldo oggi"}, # Italian, same meaning
{'sentence1': "今日は暑いです", 'sentence2': "今日はとても暑いです"}, # Japanese, same meaning
{'sentence1': "Быстро бегать полезно", 'sentence2': "Занятия спортом полезны"}, # Russian, same meaning
{'sentence1': "Estoy cansado", 'sentence2': "Tengo sueño"}, # Spanish, same meaning
{'sentence1': "I love pizza", 'sentence2': "I enjoy eating pizza"}, # English, same meaning
{'sentence1': "I like traveling", 'sentence2': "I love to visit new places"}, # English, same meaning
{'sentence1': "She is tired", 'sentence2': "She feels exhausted"}, # English, same meaning
{'sentence1': "I am learning Spanish", 'sentence2': "I am studying Spanish language"}, # English, same meaning
{'sentence1': "It is very cold today", 'sentence2': "The weather is freezing today"}, # English, same meaning
{'sentence1': "He plays football", 'sentence2': "He enjoys playing soccer"}, # English, same meaning
{'sentence1': "I am hungry", 'sentence2': "I want to eat something"}, # English, same meaning
{'sentence1': "I love nature", 'sentence2': "I enjoy the outdoors"}, # English, same meaning
{'sentence1': "Il pleut aujourd'hui", 'sentence2': "Il fait mauvais aujourd'hui"}, # French, same meaning
{'sentence1': "Eu gosto de ler", 'sentence2': "Eu amo livros"}, # Portuguese, same meaning
{'sentence1': "Schöne Blumen", 'sentence2': "Ich mag Blumen"}, # German, same meaning
{'sentence1': "Me gusta nadar", 'sentence2': "Me encanta nadar en el mar"}, # Spanish, same meaning
{'sentence1': "今日は暑い", 'sentence2': "今日は非常に暑い"}, # Japanese, same meaning
{'sentence1': "J'aime le chocolat", 'sentence2': "Le chocolat est délicieux"},
]
inter_lingual_negative = [
{'sentence1': "I am very tired", 'sentence2': "The car is parked outside"}, # English, different meaning
{'sentence1': "I love programming", 'sentence2': "The dog is barking loudly"}, # English, different meaning
{'sentence1': "I enjoy reading books", 'sentence2': "The sky is cloudy today"}, # English, different meaning
{'sentence1': "She is my friend", 'sentence2': "My cat is sleeping peacefully"}, # English, different meaning
{'sentence1': "I am learning Python", 'sentence2': "My favorite sport is basketball"}, # English, different meaning
{'sentence1': "I feel sad", 'sentence2': "She is making dinner for everyone"}, # English, different meaning
{'sentence1': "I want to go for a walk", 'sentence2': "The movie starts at 8pm"}, # English, different meaning
{'sentence1': "I am happy", 'sentence2': "My phone is on the table"}, # English, different meaning
{'sentence1': "It is raining", 'sentence2': "The sun is shining brightly"}, # English, different meaning
{'sentence1': "She sings beautifully", 'sentence2': "The train is leaving soon"}, # English, different meaning
{'sentence1': "Mi chiamo Luca", 'sentence2': "La pizza è deliziosa"}, # Italian, different meaning
{'sentence1': "Jag älskar att läsa", 'sentence2': "Fiskarna simmar i sjön"}, # Swedish, different meaning
{'sentence1': "C'est un beau jour", 'sentence2': "J'ai acheté une nouvelle voiture"}, # French, different meaning
{'sentence1': "Das Wetter ist schön", 'sentence2': "Ich fahre nach Berlin"}, # German, different meaning
{'sentence1': "Oggi è una giornata calda", 'sentence2': "Sto mangiando una mela"}, # Italian, different meaning
{'sentence1': "今日は暑いです", 'sentence2': "私は昨日本を読んだ"}, # Japanese, different meaning
{'sentence1': "Быстро бегать полезно", 'sentence2': "Мы поехали на дачу"}, # Russian, different meaning
{'sentence1': "Estoy cansado", 'sentence2': "Mi casa está cerca del parque"}, # Spanish, different meaning
{'sentence1': "I want some water", 'sentence2': "I like to swim in the ocean"}, # English, different meaning
{'sentence1': "It is raining", 'sentence2': "The sun is shining"}, # English, different meaning
{'sentence1': "I am learning French", 'sentence2': "She is cooking dinner"}, # English, different meaning
{'sentence1': "She loves ice cream", 'sentence2': "He loves to play basketball"}, # English, different meaning
{'sentence1': "I am so happy today", 'sentence2': "It is snowing outside"}, # English, different meaning
{'sentence1': "She is reading a book", 'sentence2': "He is running in the park"}, # English, different meaning
{'sentence1': "I want to watch a movie", 'sentence2': "My friend is visiting me"}, # English, different meaning
{'sentence1': "I am traveling to Paris", 'sentence2': "I am going to the supermarket"}, # English, different meaning
{'sentence1': "Je suis fatigué", 'sentence2': "Je mange une pomme"}, # French, different meaning
{'sentence1': "Ich spiele Gitarre", 'sentence2': "Ich koche Abendessen"}, # German, different meaning
{'sentence1': "Estoy cansado", 'sentence2': "Estoy comiendo pizza"}, # Spanish, different meaning
{'sentence1': "今日は暑い", 'sentence2': "私は旅行に行きます"}, # Japanese, different meaning
{'sentence1': "J'aime les chats", 'sentence2': "Je travaille demain"},
]
cross_lingual_negative = [
{'sentence1': "I am going to work", 'sentence2': "Ik hou van aardbeien"}, # English to Dutch (work vs. strawberries)
{'sentence1': "She is studying hard", 'sentence2': "Ik ben aan het zwemmen"}, # English to Dutch (studying vs. swimming)
{'sentence1': "I love programming", 'sentence2': "Me gusta bailar salsa"}, # English to Spanish (programming vs. salsa dancing)
{'sentence1': "Good morning", 'sentence2': "Je suis fatigué"}, # English to French (morning vs. tired)
{'sentence1': "How are you?", 'sentence2': "C'est mon anniversaire"}, # English to French (how are you? vs. birthday)
{'sentence1': "Where is the library?", 'sentence2': "Ich liebe Schokolade"}, # English to German (library vs. chocolate)
{'sentence1': "I need help", 'sentence2': "Du bist mein bester Freund"}, # English to German (help vs. best friend)
{'sentence1': "Thank you very much", 'sentence2': "Estoy enojado"}, # English to Spanish (thank you vs. angry)
{'sentence1': "This is my cat", 'sentence2': "Mon chien est très gentil"}, # English to French (cat vs. dog)
{'sentence1': "My favorite color is blue", 'sentence2': "Voy a la playa"}, # English to Spanish (blue vs. going to the beach)
{'sentence1': "Let's go out for lunch", 'sentence2': "Je vais courir au parc"}, # English to French (lunch vs. running in the park)
{'sentence1': "I am happy", 'sentence2': "Estoy triste y solo"} # English to Spanish (happy vs. sad and alone)
]
results = {model: {'res': {'Cross-Lingual': [], 'Cross-Lingual Negative': [], 'Inter-Lingual': [], 'Inter-Lingual Negative': []}} for model in models}
def evaluate_similarity(model_name, model, dataset, dataset_name):
    for d in dataset:
        sent1 = model.encode(d['sentence1'])
        sent2 = model.encode(d['sentence2'])
        dist = 1 - cosine(sent1, sent2)
        results[model_name]['res'][dataset_name].append(dist)

def plot_distributions(results):
    # One figure per model, with one histogram per category
    for model, data in results.items():
        plt.figure(figsize=(12, 8))
        for category, scores in data['res'].items():
            sns.histplot(scores, kde=True, label=f"{model} - {category}", bins=20, alpha=0.5)
        plt.legend()
        plt.xlabel("Cosine Similarity")
        plt.ylabel("Frequency")
        plt.title(f"Distribution of Cosine Similarities for {model}")
        plt.savefig(f'{model}.png')

for name, model in models.items():
    evaluate_similarity(name, model, cross_lingual, "Cross-Lingual")
    evaluate_similarity(name, model, cross_lingual_negative, "Cross-Lingual Negative")
    evaluate_similarity(name, model, inter_lingual, "Inter-Lingual")
    evaluate_similarity(name, model, inter_lingual_negative, "Inter-Lingual Negative")

plot_distributions(results)
def f_test_between_models(results):
    # Matrices to store F-test, t-test, and standard-deviation comparisons between models
    f_test_matrix_models = {}
    t_test_matrix_models = {}
    std_dev_comparison = {}
    # Iterate through each category (Cross-Lingual, Inter-Lingual, etc.)
    categories = list(results['SentenceTransformer']['res'].keys())
    for category in categories:
        scores_model1 = results['SentenceTransformer']['res'][category]
        scores_model2 = results['model2vec']['res'][category]
        # Compare standard deviations between the two models
        std_dev_model1 = np.std(scores_model1)
        std_dev_model2 = np.std(scores_model2)
        if std_dev_model1 > std_dev_model2:
            std_dev_comparison[category] = f"SentenceTransformer has a larger standard deviation (SD = {std_dev_model1:.4f})"
        else:
            std_dev_comparison[category] = f"model2vec has a larger standard deviation (SD = {std_dev_model2:.4f})"
        # f_oneway is a one-way ANOVA; with two groups it is equivalent to a two-sample t-test (F = t^2)
        f_stat, p_val_f = f_oneway(scores_model1, scores_model2)
        f_test_matrix_models[category] = (f_stat, p_val_f)
        # Two-sample t-test between models for each category
        t_stat, p_val_t = ttest_ind(scores_model1, scores_model2)
        t_test_matrix_models[category] = (t_stat, p_val_t)
    return f_test_matrix_models, t_test_matrix_models, std_dev_comparison

def f_test_within_models(results):
    # Matrices to store F-test and t-test results between categories within each model
    f_test_matrix_within = {}
    t_test_matrix_within = {}
    for model, data in results.items():
        f_test_matrix_within[model] = {}
        t_test_matrix_within[model] = {}
        # Extract the different categories for this model
        categories = list(data['res'].keys())
        scores = [data['res'][category] for category in categories]
        # Perform pairwise F-tests (one-way ANOVA) and t-tests within the same model
        for i, cat1 in enumerate(categories):
            for j, cat2 in enumerate(categories):
                if i < j:
                    f_stat, p_val_f = f_oneway(scores[i], scores[j])
                    f_test_matrix_within[model][(cat1, cat2)] = (f_stat, p_val_f)
                    t_stat, p_val_t = ttest_ind(scores[i], scores[j])
                    t_test_matrix_within[model][(cat1, cat2)] = (t_stat, p_val_t)
    return f_test_matrix_within, t_test_matrix_within
f_test_results_models, t_test_results_models, std_dev_comparison = f_test_between_models(results)
f_test_results_within, t_test_results_within = f_test_within_models(results)
print("\nF-test Results Between Models:")
for category, (f_stat, p_val) in f_test_results_models.items():
print(f" {category} | F-statistic: {f_stat:.4f}, p-value: {p_val:.4f}")
print("\nT-test Results Between Models:")
for category, (t_stat, p_val) in t_test_results_models.items():
print(f" {category} | T-statistic: {t_stat:.4f}, p-value: {p_val:.4f}")
print("\nStandard Deviation Comparison Between Models:")
for category, comparison in std_dev_comparison.items():
print(f" {category} | {comparison}")
for model, test_results in f_test_results_within.items():
print(f"\nF-test Results Within {model}:")
for (cat1, cat2), (f_stat, p_val) in test_results.items():
print(f" {cat1} vs {cat2} | F-statistic: {f_stat:.4f}, p-value: {p_val:.4f}")
for model, test_results in t_test_results_within.items():
print(f"\nT-test Results Within {model}:")
for (cat1, cat2), (t_stat, p_val) in test_results.items():
print(f" {cat1} vs {cat2} | T-statistic: {t_stat:.4f}, p-value: {p_val:.4f}") |
-
Hi,
After converting BAAI/bge-m3, the cross-lingual performance of the model drops significantly. I converted the model with Model2Vec's distillation, translated a few sentences, and compared the cosine similarity between them. The average cosine similarity between the translated pairs drops from 0.92 to 0.751 between the two models.
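For reference, a conversion and comparison along these lines reproduces the setup (a minimal sketch; the pca_dims value and the output path are illustrative assumptions, not necessarily the exact settings I used):

from model2vec.distill import distill
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Distill a static model from the original transformer (illustrative settings)
m2v_model = distill(model_name="BAAI/bge-m3", pca_dims=256)
m2v_model.save_pretrained("m2v_model")

# Compare cosine similarities for a translated pair with both models
original = SentenceTransformer("BAAI/bge-m3")
pair = ("I am going to the store", "Voy a la tienda")

for name, model in [("SentenceTransformer", original), ("model2vec", m2v_model)]:
    e1, e2 = model.encode(pair[0]), model.encode(pair[1])
    print(f"{name}: {1 - cosine(e1, e2):.3f}")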
I think that contextualization of the embeddings is necessary for this task. I can imagine that this would also impact other tasks, such as code retrieval? Perhaps you could add some more benchmarks to the repo so it is clear when model2vec should be used with caution?