feat: Implement synonym index management and API #2425

Status: Open, 6 commits to be merged into base branch v30

ozanarmagan
Contributor

Change Summary

PR Checklist

@tharropoulos
Contributor

It seems that some synonyms are no longer triggered after migrating to the new version.

Reproduction Steps

  1. Use the latest 29.0 build and run the following script:
#!/bin/bash

export TYPESENSE_HOST="http://localhost:8108"
export TYPESENSE_API_KEY="xyz"
export COLLECTION_NAME="books"

wait_for_typesense() {
    echo "Waiting for Typesense to be ready..."
    local max_attempts=30
    local attempt=1
    
    while [ $attempt -le $max_attempts ]; do
        if curl -s -o /dev/null -w "%{http_code}" "${TYPESENSE_HOST}/health" \
           -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | grep -q "200"; then
            echo "Typesense is ready"
            return 0
        fi
        echo "   Attempt ${attempt}/${max_attempts}..."
        sleep 2
        ((attempt++))
    done
    
    echo "Typesense failed to start after ${max_attempts} attempts"
    exit 1
}

# Block until the server reports healthy before issuing any requests.
wait_for_typesense

echo "Creating collection schema..."
curl -s -X POST "${TYPESENSE_HOST}/collections" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "name": "'${COLLECTION_NAME}'",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "title", "type": "string"},
            {"name": "author", "type": "string", "facet": true},
            {"name": "genre", "type": "string", "facet": true},
            {"name": "description", "type": "string"},
            {"name": "publication_year", "type": "int32", "facet": true},
            {"name": "rating", "type": "float"},
            {"name": "pages", "type": "int32"}
        ],
        "default_sorting_field": "rating"
    }'


curl -s -X POST "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "id": "1",
        "title": "The Great Gatsby",
        "author": "F. Scott Fitzgerald",
        "genre": "Classic Fiction",
        "description": "A classic American novel about the Jazz Age and the American Dream.",
        "publication_year": 1925,
        "rating": 4.2,
        "pages": 180
    }'

curl -s -X POST "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "id": "2",
        "title": "To Kill a Mockingbird",
        "author": "Harper Lee",
        "genre": "Literary Fiction",
        "description": "A powerful story of racial injustice and childhood innocence in the American South.",
        "publication_year": 1960,
        "rating": 4.5,
        "pages": 376
    }'

curl -s -X POST "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "id": "3",
        "title": "1984",
        "author": "George Orwell",
        "genre": "Dystopian Fiction",
        "description": "A dystopian novel about totalitarianism and surveillance.",
        "publication_year": 1949,
        "rating": 4.6,
        "pages": 328
    }'

curl -s -X POST "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "id": "4",
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "genre": "Romance",
        "description": "A romantic novel about manners, upbringing, morality, and marriage.",
        "publication_year": 1813,
        "rating": 4.3,
        "pages": 432
    }'

curl -s -X POST "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "id": "5",
        "title": "The Catcher in the Rye",
        "author": "J.D. Salinger",
        "genre": "Coming of Age",
        "description": "A novel about teenage rebellion and alienation.",
        "publication_year": 1951,
        "rating": 3.8,
        "pages": 277
    }'

echo "Documents indexed successfully"

echo "Creating synonyms..."

curl -s -X PUT "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/synonyms/classic-synonyms" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "synonyms": ["classic", "literature", "literary", "masterpiece"]
    }'

curl -s -X PUT "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/synonyms/scifi-synonyms" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "synonyms": ["dystopian", "sci-fi", "science fiction", "futuristic"]
    }'

curl -s -X PUT "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/synonyms/romance-synonyms" \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    -d '{
        "synonyms": ["romance", "romantic", "love story", "love"]
    }'

echo "Synonyms created successfully"

sleep 2

echo "Test 1: Searching for 'classic' (should find literary fiction via synonyms)"
curl -s -G "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents/search" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    --data-urlencode "q=classic" \
    --data-urlencode "query_by=title,description,genre" | jq -r '.hits[] | "- \(.document.title) by \(.document.author) [\(.document.genre)]"'

echo ""

echo "Test 2: Searching for 'sci-fi' (should find dystopian via synonyms)"
curl -s -G "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents/search" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    --data-urlencode "q=sci-fi" \
    --data-urlencode "query_by=title,description,genre" | jq -r '.hits[] | "- \(.document.title) by \(.document.author) [\(.document.genre)]"'

echo ""

echo "Test 3: Searching for 'love story' (should find romance via synonyms)"
curl -s -G "${TYPESENSE_HOST}/collections/${COLLECTION_NAME}/documents/search" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    --data-urlencode "q=love story" \
    --data-urlencode "query_by=title,description,genre" | jq -r '.hits[] | "- \(.document.title) by \(.document.author) [\(.document.genre)]"'

The script creates a collection, indexes five documents, creates three synonym rules, and then runs three searches that should each be satisfied through a synonym.

Results

Test 1: Searching for 'classic' (should find literary fiction via synonyms)
- The Great Gatsby by F. Scott Fitzgerald [Classic Fiction]
- To Kill a Mockingbird by Harper Lee [Literary Fiction]

Test 2: Searching for 'sci-fi' (should find dystopian via synonyms)
- 1984 by George Orwell [Dystopian Fiction]

Test 3: Searching for 'love story' (should find romance via synonyms)
- Pride and Prejudice by Jane Austen [Romance]

Current Synonym Configuration
=================================
ID: classic-synonyms
Synonyms: classic, literature, literary, masterpiece

ID: scifi-synonyms
Synonyms: dystopian, sci-fi, science fiction, futuristic

ID: romance-synonyms
Synonyms: romance, romantic, love story, love
  2. Run the same script with the new version, using a backup snapshot of the data directory

Results

Test 1: Searching for 'classic' (should find literary fiction via synonyms)
- The Great Gatsby by F. Scott Fitzgerald [Classic Fiction]

Test 2: Searching for 'sci-fi' (should find dystopian via synonyms)

Test 3: Searching for 'love story' (should find romance via synonyms)
- To Kill a Mockingbird by Harper Lee [Literary Fiction]

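There are some discrepancies: the 'sci-fi' search returns nothing, the 'love story' search no longer finds "Pride and Prejudice", and the 'classic' search no longer returns "To Kill a Mockingbird". In addition, "To Kill a Mockingbird" is now returned for 'love story', even though its genre is Literary Fiction and no synonym links it to that query.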
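The comparison between the two builds can be made mechanical by saving the script output from each run and diffing the files. The following is a minimal sketch, not part of the PR: the `compare_results` function and the file names in the comments are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: compare the search output captured from two builds.
# Prints a unified diff when the files differ and returns non-zero in that case.
compare_results() {
    local old_file="$1"
    local new_file="$2"

    if diff -u "$old_file" "$new_file"; then
        echo "Results match"
        return 0
    fi

    echo "Results differ between builds"
    return 1
}

# Example usage (file names are assumptions):
#   ./repro.sh > results-29.txt    # run against the 29.0 build
#   ./repro.sh > results-new.txt   # run against the new build
#   compare_results results-29.txt results-new.txt
```

A non-zero exit status makes the check easy to wire into a migration test, since any change in the returned hits fails the run.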

Another valuable addition would be to migrate existing synonyms into new synonym sets associated with their collection, so users can still see their older synonyms.

On the new synonyms build, querying the synonym sets endpoint returns an empty array, even though synonyms are being triggered:

curl "http://localhost:8108/synonym_sets/" -X GET \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

[]
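The empty response can be turned into an explicit check that fails when no synonym sets exist after the migration. This is a rough sketch under stated assumptions: it reuses the `/synonym_sets/` URL and the `TYPESENSE_HOST` and `TYPESENSE_API_KEY` variables from the commands above, while `is_empty_array` and `check_synonym_sets` are hypothetical helper names.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when the given text is an empty JSON array,
# ignoring any whitespace around or inside the brackets.
is_empty_array() {
    local compact
    compact="$(printf '%s' "$1" | tr -d '[:space:]')"
    [ "$compact" = "[]" ]
}

# Hypothetical helper: fetch the synonym sets and fail if none are returned.
# Assumes TYPESENSE_HOST and TYPESENSE_API_KEY are exported as in the script above.
check_synonym_sets() {
    local body
    body="$(curl -s "${TYPESENSE_HOST}/synonym_sets/" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}")"

    if is_empty_array "$body"; then
        echo "No synonym sets found after migration"
        return 1
    fi

    echo "$body"
}
```

Calling `check_synonym_sets` after starting the new build on the restored data directory would report the empty result shown above as a failure.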

2 participants