New index building APIs for simpler and flexible usage

In SVS, the dataset and the index are separate concepts. This lets an index accept different datasets, such as FP32, Scalar Quantization (`SQDataset`), LVQ, and LeanVec.
Currently, taking `SQDataset` as an example, users must call `compress` to obtain an `SQDataset` and then pass it to the index. Most use cases do not need the dataset itself, so this two-step process is often unnecessary. It also makes runtime fallback impossible to implement in SVS, because each dataset is a distinct type that is fixed at compile time.
```cpp
// `threadpool` and `num_threads` are assumed to be defined by the surrounding code.
auto loaded =
    svs::VectorDataLoader<float>(std::filesystem::path(SVS_DATA_DIR) / "data_f32.svs").load();
// The dataset type (`SQDataset<std::int8_t>`) is fixed at compile time.
auto data = svs::scalar::SQDataset<std::int8_t>::compress(loaded, threadpool);
auto parameters = svs::index::vamana::VamanaBuildParameters{};
svs::Vamana index = svs::Vamana::build<float>(
    parameters, data, svs::distance::DistanceL2(), num_threads
);
```

I propose adding index-building APIs that accept uncompressed data directly and take the dataset type as a parameter, which determines the dataset format used internally.
```cpp
// Proposed API: `svs::SQ8` and the extra `build` argument do not exist yet.
auto loaded =
    svs::VectorDataLoader<float>(std::filesystem::path(SVS_DATA_DIR) / "data_f32.svs").load();
auto parameters = svs::index::vamana::VamanaBuildParameters{};
svs::Vamana index = svs::Vamana::build<float>(
    parameters, loaded, svs::distance::DistanceL2(), num_threads, svs::SQ8
); // internally, SVS falls back to uncompressed data if scalar quantization fails
```
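To make the runtime-fallback idea concrete, here is a minimal, self-contained sketch of how a storage format could be chosen from a runtime parameter. It does not use SVS at all: `StorageKind`, `Fp32Data`, `Sq8Data`, `try_compress_sq8`, and `make_storage` are hypothetical names invented for illustration, and the "quantization fails on non-finite input" rule is only an assumed failure condition.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names are SVS APIs.
enum class StorageKind { FP32, SQ8 };

struct Fp32Data {
    std::vector<float> values;
};

struct Sq8Data {
    std::vector<std::int8_t> codes;
    float scale; // value represented by one int8 step
};

// The concrete storage type is now a runtime property of the variant.
using Storage = std::variant<Fp32Data, Sq8Data>;

// Symmetric int8 quantization. Returns std::nullopt when the input contains
// a non-finite value (an assumed failure condition for this sketch).
std::optional<Sq8Data> try_compress_sq8(const std::vector<float>& values) {
    float max_abs = 0.0f;
    for (float v : values) {
        if (!std::isfinite(v)) {
            return std::nullopt;
        }
        max_abs = std::max(max_abs, std::fabs(v));
    }
    float scale = max_abs / 127.0f;
    if (scale == 0.0f) {
        scale = 1.0f; // all-zero (or underflowing) input: every code becomes 0
    }
    assert(scale > 0.0f);
    Sq8Data out{{}, scale};
    out.codes.reserve(values.size());
    for (float v : values) {
        out.codes.push_back(static_cast<std::int8_t>(std::lround(v / scale)));
    }
    return out;
}

// Pick the storage at runtime and fall back to FP32 if compression fails.
Storage make_storage(const std::vector<float>& values, StorageKind requested) {
    if (requested == StorageKind::SQ8) {
        if (auto compressed = try_compress_sq8(values)) {
            return std::move(*compressed);
        }
        // Compression failed: fall through to the uncompressed representation.
    }
    return Fp32Data{values};
}
```

A caller can inspect the result with `std::holds_alternative<Sq8Data>(storage)` or dispatch on it with `std::visit`. Whether SVS would use a variant, type erasure, or another mechanism is an open design choice.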
Another advantage is that SVS could optionally build the graph from the uncompressed data rather than the compressed data. Building from compressed data typically yields a lower-quality graph, because compression introduces approximation errors that can degrade the graph structure during construction.
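As a small illustration of that approximation error, the sketch below round-trips two vectors through a toy symmetric int8 quantizer and compares squared L2 distances before and after. The helpers `quantize_roundtrip` and `squared_l2` are hypothetical and are not taken from SVS; the actual `SQDataset` scheme may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantize each value to an int8-range code with step `scale`, then map the
// code back to float. Hypothetical helper, not an SVS function.
std::vector<float> quantize_roundtrip(const std::vector<float>& v, float scale) {
    assert(scale > 0.0f);
    std::vector<float> out;
    out.reserve(v.size());
    for (float x : v) {
        long code = std::lround(x / scale);
        code = std::max(-127L, std::min(127L, code)); // clamp to int8 range
        out.push_back(static_cast<float>(code) * scale);
    }
    return out;
}

// Squared Euclidean distance between two equally sized vectors.
float squared_l2(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```

For example, with `scale = 100.0f / 127.0f`, the vectors `{0.0f, 100.0f}` and `{0.3f, 100.0f}` are distinct in FP32 but map to the same codes, so their distance after the round trip is zero. A graph built on such data cannot tell these two points apart.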
Note that if the graph is built from the uncompressed data, `build` can compress the data and construct the graph in parallel, since neither step depends on the other's output.
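The overlap of the two steps could look roughly like the following sketch, which uses `std::async` from the standard library. Everything else here is a toy stand-in: `toy_compress`, `toy_build_graph`, `ToyIndex`, and `build_index` are hypothetical names, the "graph" is just one nearest neighbor per point, and SVS would presumably use its own thread pool rather than `std::async`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <future>
#include <limits>
#include <utility>
#include <vector>

using Vectors = std::vector<std::vector<float>>;
using Codes = std::vector<std::vector<std::int8_t>>;
using Graph = std::vector<std::size_t>; // toy graph: one nearest neighbor per point

// Toy compression: round each value to the nearest integer in [-127, 127].
Codes toy_compress(const Vectors& data) {
    Codes codes;
    codes.reserve(data.size());
    for (const auto& row : data) {
        std::vector<std::int8_t> out;
        out.reserve(row.size());
        for (float x : row) {
            const float clamped = std::clamp(x, -127.0f, 127.0f);
            out.push_back(static_cast<std::int8_t>(std::lround(clamped)));
        }
        codes.push_back(std::move(out));
    }
    return codes;
}

// Toy graph build: brute-force nearest neighbor on the uncompressed data.
// A single-point dataset maps the point to itself.
Graph toy_build_graph(const Vectors& data) {
    Graph nearest(data.size(), 0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t j = 0; j < data.size(); ++j) {
            if (i == j) {
                continue;
            }
            assert(data[i].size() == data[j].size());
            float dist = 0.0f;
            for (std::size_t k = 0; k < data[i].size(); ++k) {
                const float d = data[i][k] - data[j][k];
                dist += d * d;
            }
            if (dist < best) {
                best = dist;
                nearest[i] = j;
            }
        }
    }
    return nearest;
}

struct ToyIndex {
    Graph graph;
    Codes codes;
};

// Both tasks only read `data`, so they can run concurrently without locking.
ToyIndex build_index(const Vectors& data) {
    auto codes_future =
        std::async(std::launch::async, [&data] { return toy_compress(data); });
    Graph graph = toy_build_graph(data); // runs on the calling thread meanwhile
    return ToyIndex{std::move(graph), codes_future.get()};
}
```

The key property is that both tasks treat the input as read-only, so the only synchronization point is the final `get()` on the future.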


New index building APIs for simpler and flexible usage #189
