Only tested with Python 3.10 so far.
```shell
pip install git+https://github.com/zmzhang2000/text-minhash-priority
```
This repository implements the MinHash Near Deduplication with Priority algorithm. Specifically, this algorithm differs from the original MinHash Near Deduplication algorithm in that
- it supports selecting which item to keep from a set of duplicates according to a priority
</gr-replace>
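The selection rule can be sketched as follows. This is a minimal illustration of the idea, not the repository's actual internals; `select_keeper` and the record layout are assumptions for this example:

```python
def select_keeper(cluster):
    """Given records flagged as near-duplicates of each other, return the
    one to keep: any record with __keep__ set wins, otherwise the record
    with the highest __minhash_priority__ (illustrative rule only)."""
    flagged = [r for r in cluster if r.get("__keep__")]
    if flagged:
        return flagged[0]
    return max(cluster, key=lambda r: r.get("__minhash_priority__", 0))

cluster = [
    {"text": "What's your name?", "__minhash_priority__": 20},
    {"text": "Whats your name?", "__minhash_priority__": 1},
]
keeper = select_keeper(cluster)  # the record with priority 20 is kept
```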
- Process your dataset into `huggingface dataset` format. A sample dataset in `jsonl` format:

  ```json
  {"source": "ABC", "text": "What's your name?"}
  {"source": "ABC", "text": "My name is John."}
  ```
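  A local `jsonl` file like the one above can be loaded into `huggingface dataset` format with the `datasets` library's `json` builder. A short sketch (the file name is only an example; adjust to your own data):

  ```python
  import json
  from datasets import load_dataset

  # Materialize the sample jsonl file from above.
  rows = [
      {"source": "ABC", "text": "What's your name?"},
      {"source": "ABC", "text": "My name is John."},
  ]
  with open("dataset.jsonl", "w") as f:
      for row in rows:
          f.write(json.dumps(row) + "\n")

  # Load it as a Hugging Face dataset using the built-in "json" builder.
  dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
  ```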
- Add a `__keep__` or `__minhash_priority__` key to your dataset:

  ```json
  {"source": "ABC", "text": "What's your name?", "__keep__": true}
  {"source": "ABC", "text": "My name is John.", "__keep__": false}
  ```

  or

  ```json
  {"source": "ABC", "text": "What's your name?", "__minhash_priority__": 20}
  {"source": "ABC", "text": "My name is John.", "__minhash_priority__": 1}
  ```
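  One way to attach the priority key is to derive it from an existing field, e.g. the data source. A stdlib-only sketch, where `priority_by_source` is a hypothetical mapping you would define for your own data:

  ```python
  import json

  # Hypothetical mapping from data source to dedup priority.
  priority_by_source = {"ABC": 20}

  lines = [
      '{"source": "ABC", "text": "What\'s your name?"}',
      '{"source": "ABC", "text": "My name is John."}',
  ]
  annotated = []
  for line in lines:
      row = json.loads(line)
      # Records from unknown sources fall back to the lowest priority, 0.
      row["__minhash_priority__"] = priority_by_source.get(row["source"], 0)
      annotated.append(row)
  ```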
- Run the minhash deduplication script. Use `--column` to specify the column to deduplicate.

  ```shell
  python -m text_dedup.minhash \
    --path "json" \
    --data_files "dataset.jsonl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "dataset_deduplicated" \
    --column "text" \
    --ngram 4 \
    --threshold 0.8 \
    --batch_size 10000 \
    --use_auth_token true
  ```
- The results will be saved in `huggingface dataset` format. You can load them with `datasets.load_from_disk()`.
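  A round-trip sketch of saving and reloading a dataset in this format (the directory name here is illustrative; the script writes to whatever `--output` path you pass, e.g. `dataset_deduplicated`):

  ```python
  from datasets import Dataset, load_from_disk

  # Build a tiny stand-in dataset and save it in huggingface dataset format.
  ds = Dataset.from_dict({"source": ["ABC"], "text": ["What's your name?"]})
  ds.save_to_disk("deduped_example")

  # Reload the saved results, as you would with the script's output directory.
  reloaded = load_from_disk("deduped_example")
  ```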
This repository is based on ChenghaoMou/text-dedup. More details can be found in the original repository.