This repository contains the code and data for the paper "Evaluating Methods for Efficient Entity Count Estimation", which has been accepted at the 51st International Conference on Very Large Data Bases (VLDB 2025).
The problem of estimating the size of a query result has a long history in data management. When the query performs entity resolution (also known as record linkage or deduplication), the problem becomes that of estimating the number of distinct entities, referred to as the entity count. This problem has received much attention from the statistics community, but it has been largely overlooked in the data management literature. In this work, we formally define the entity count problem from a data management perspective and decompose it into a framework of fundamental steps. We explore approaches from both statistics and data management, systematically identifying a design space of pipelines that address this problem. Finally, we provide extensive experiments that highlight the strengths and weaknesses of these approaches on real-world benchmarks.
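As a concrete illustration of the quantity being estimated: once entity resolution has produced a set of matching record pairs, one common way to obtain the entity count is to count the connected components of the resulting match graph, each component corresponding to one distinct entity. The following toy sketch (not part of the repository code) computes this with the Networkit library used by the experiments; the records and matches are made up for illustration.

```python
import networkit as nk

# Toy example: 6 records, with predicted matches linking duplicates of the same entity.
num_records = 6
predicted_matches = [(0, 1), (1, 2), (3, 4)]  # record 5 matches nothing

# Build the match graph: one node per record, one edge per predicted match.
graph = nk.Graph(num_records)
for u, v in predicted_matches:
    graph.addEdge(u, v)

# Each connected component corresponds to one distinct entity.
cc = nk.components.ConnectedComponents(graph)
cc.run()
print("Estimated entity count:", cc.numberOfComponents())  # 3: {0,1,2}, {3,4}, {5}
```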
First, please download the code and data from Zenodo. Note that the `.zip` file is 33 GB in size and the unzipped folder is 123 GB in size.
The experiments were conducted using Python 3.11.5 and R 3.6.1. They make use of the Networkit library, which can be installed by following the instructions on the Networkit installation page for your operating system. The remaining Python dependencies are listed in `requirements.txt`, which can be found in this repository (not in the `.zip` file). We recommend installing them inside a virtual environment by running:

```
pip install -r requirements.txt
```
To install Blink and the R dependencies, run the following commands:

```
cd src/R/Blink
R -e "install.packages('stringdist', repos='https://cloud.r-project.org/')"
R -e "install.packages('plyr', repos='https://cloud.r-project.org/')"
R -e "install.packages('reticulate', repos='https://cloud.r-project.org/')"
R -e "install.packages('knitr', repos='https://cloud.r-project.org/')"
R -e "install.packages('glue', repos='https://cloud.r-project.org/')"
R -e "install.packages('stringr', repos='https://cloud.r-project.org/')"
R CMD INSTALL blink_1.1.0.tar.gz
```
The experiments also make use of the `text-embedding-3-large` model from OpenAI, which requires an API key. To set it up, edit the `.env` file in the root directory and replace the `<API_KEY>` placeholder with your API key.
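To check that the key is picked up correctly, a minimal sketch along these lines can be used. It assumes the `python-dotenv` package and the official `openai` Python client, and that the `.env` file exposes the key under the variable name `OPENAI_API_KEY`; the repository scripts may read the key differently.

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed
from openai import OpenAI

# Load the variables defined in .env into the environment.
load_dotenv()

# Assumption: the key is stored under OPENAI_API_KEY; adjust to match your .env file.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Request an embedding for a single toy record with the model used in the experiments.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["iPhone 13 Pro, 128GB, graphite"],
)
print(len(response.data[0].embedding))  # dimensionality of the returned embedding vector
```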
Code, data, and experimental results can be found in the `vldb-data-and-code` folder. To unzip the file, run the following commands:

```
unzip vldb-data-and-code.zip
cd vldb-data-and-code
```
Within the `vldb-data-and-code` folder, the code is located in the `src` directory, the data in the `data` directory, and the experimental results in the `experiments-final` directory.
The code for generating the plots is located in the `experiments-final/parse_results` directory. To generate the plots, run the following commands:

```
cd experiments-final/parse_results
python generate_plots.py
```

The plots will be saved in the `parse_results` directory. The script generates the following plots, named after the figure numbers in the paper:
- `Figure_1.pdf`: The results for the base ML ER pipeline on a selection of small and large datasets in our experiments, in terms of F1-score, approximation error, and runtime.
- `Figure_3.pdf`: The results for the base ML ER, Simulation, Statistical, and LLM Embedding pipelines on the small datasets in our experiments.
- `Figure_4.pdf`: The results for the sampling variant of the ML ER, Simulation, Statistical, and LLM Embedding pipelines on a selection of small datasets in our experiments.
- `Figure_5.pdf`: The results for the sampling variant of the ML ER, Simulation, Statistical, and LLM Embedding pipelines on the large datasets in our experiments.
Results for the experiments can be found in the `experiments-final` folder, with a separate subfolder for each dataset. The datasets are: `Music-Brainz-20k`, `Music-Brainz-200k`, `Music-Brainz-2M`, `North-Carolina-Voters-5M`, `Cars`, `Alaska-monitor`, `DBLP-Scholar`, `WDC-xlarge-computers`, `Cars-1M`, and `WDC-xlarge-computers-1M`. Each dataset folder contains the following subfolders:
- `ER_baseline`: The results for the base ML ER pipeline with RoBERTa and DistilBERT. The folder contains two subfolders, one for each model.
- `ER_sampling`: The results for the sampling variant of the ML ER pipeline with RoBERTa and DistilBERT. The folder contains two subfolders, one for each model.
- `Simulation_baseline`: The results for the base Simulation pipeline with a calibration process based on RoBERTa and DistilBERT. The folder contains two subfolders, one for each model.
- `Simulation_sampling`: The results for the sampling variant of the Simulation pipeline with a calibration process based on RoBERTa and DistilBERT. The folder contains two subfolders, one for each model.
- `Statistical_baseline`: The results for the base Statistical pipeline with Blink. The folder contains two subfolders, one for each model (edit distance and the mpnet-based sentence transformer model).
- `Statistical_sampling`: The results for the sampling variant of the Statistical pipeline with Blink. The folder contains two subfolders, one for each model (edit distance and the mpnet-based sentence transformer model).
- `LLM_embedding_baseline`: The results for the base LLM Embedding pipeline using the text-embedding-3-large model from OpenAI and edit distance. The folder contains two subfolders, one for each model.
- `LLM_embedding_sampling`: The results for the sampling variant of the LLM Embedding pipeline using the text-embedding-3-large model from OpenAI and edit distance. The folder contains two subfolders, one for each model.
For each pipeline, the results are stored in a `.json` file whose name starts with the prefix `results_`.
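To collect these result files programmatically, a small script along the following lines can be used. It relies only on the directory layout described above; the internal structure of each JSON file is not documented here, so the script simply reports the top-level keys it finds.

```python
import json
from pathlib import Path

# Assumes the current working directory is the unzipped vldb-data-and-code folder.
results_root = Path("experiments-final")

# Walk every dataset/pipeline folder and load each results_*.json file.
for result_file in sorted(results_root.glob("**/results_*.json")):
    with result_file.open() as f:
        results = json.load(f)
    # The JSON schema is not documented here, so only report what was found.
    summary = list(results) if isinstance(results, dict) else f"{len(results)} entries"
    print(f"{result_file}: {summary}")
```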
The code to run the pipelines is located in the `experiments-final` directory. The scripts are listed below:

- `run_er_baseline.py`: Runs the base ML ER pipeline on all datasets for both RoBERTa and DistilBERT.
- `run_er_baseline_control_queries.py`: Runs the base ML ER pipeline on all datasets for both RoBERTa and DistilBERT with control queries.
- `run_er_sampling.py`: Runs the sampling variant of the ML ER pipeline on all datasets for both RoBERTa and DistilBERT.
- `run_er_sampling_srs.py`: Runs the sampling variant of the ML ER pipeline on all datasets for both RoBERTa and DistilBERT, with simple random sampling instead of Bernoulli sampling.
- `run_simulation_baselines.py`: Runs the base Simulation pipeline on all datasets for both RoBERTa- and DistilBERT-based calibration.
- `run_simulation_sampling.py`: Runs the sampling variant of the Simulation pipeline on all datasets for both RoBERTa- and DistilBERT-based calibration.
- `run_simulation_sampling_srs.py`: Runs the sampling variant of the Simulation pipeline on all datasets for both RoBERTa- and DistilBERT-based calibration, with simple random sampling instead of Bernoulli sampling.
- `run_statistical_baselines.py`: Runs the base Statistical pipeline on all datasets for both edit distance and the mpnet-based sentence transformer model.
- `run_statistical_sampling.py`: Runs the sampling variant of the Statistical pipeline on all datasets for both edit distance and the mpnet-based sentence transformer model.
- `run_clustering_baselines.py`: Runs the base LLM Embedding pipeline on all datasets for both the text-embedding-3-large model from OpenAI and edit distance.
- `run_clustering_sampling.py`: Runs the sampling variant of the LLM Embedding pipeline on all datasets for both the text-embedding-3-large model from OpenAI and edit distance.
- `run_clustering_sampling_srs.py`: Runs the sampling variant of the LLM Embedding pipeline on all datasets for both the text-embedding-3-large model from OpenAI and edit distance, with simple random sampling instead of Bernoulli sampling (the difference between the two sampling schemes is illustrated in the sketch after this list).
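For readers unfamiliar with the distinction, the following toy sketch (not taken from the repository code) contrasts Bernoulli sampling, which includes each record independently with a fixed probability, with simple random sampling, which draws a fixed number of records without replacement:

```python
import random

def bernoulli_sample(records, p, seed=0):
    """Include each record independently with probability p; the sample size is random."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < p]

def simple_random_sample(records, n, seed=0):
    """Draw exactly n records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)

records = [f"record_{i}" for i in range(1000)]
print(len(bernoulli_sample(records, p=0.1)))      # roughly 100; varies with the seed
print(len(simple_random_sample(records, n=100)))  # exactly 100
```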
To run a pipeline, use the following command:

```
python <pipeline_script>
```

where `<pipeline_script>` is the name of the pipeline script you want to run. The results will be stored in the `experiments-final` directory, in a folder named after the dataset and the pipeline.