Hadoop Map-Reduce job for running distributed cardinality estimation.
To build the job, assuming Apache Maven is installed and configured, run:
mvn package
The Maven assembly plugin will output the job jar, Cardinality.MapReduce-1.0-SNAPSHOT-job.jar.
The Hadoop job can be run using:
bin/hadoop jar <jar-location>/Cardinality.MapReduce-1.0-SNAPSHOT-job.jar <input-dir> <output-dir>
The input is expected to consist of files containing string identifiers, one per line. The job estimates the cardinality of these strings, that is, the number of distinct identifiers across all input files.
The job will output a single file containing the estimated cardinality.
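To sanity-check the estimate on a small data set, it can help to compare it against an exact distinct count computed locally. The sketch below builds a tiny sample input in the expected format (one identifier per line) and counts the distinct identifiers with standard Unix tools. The directory name sample-input, the file names, and the identifiers are made up for illustration and are not part of this project.

```shell
# Hypothetical sample data: directory, file names and identifiers are illustrative only.
mkdir -p sample-input
printf 'alice\nbob\nalice\n' > sample-input/part-a.txt
printf 'carol\nbob\n' > sample-input/part-b.txt

# Exact number of distinct identifiers across all files,
# for comparison with the estimate produced by the job.
exact=$(cat sample-input/*.txt | sort -u | wc -l | tr -d ' ')
echo "$exact"
```

Here the exact count is 3 (alice, bob, carol). The job's output for the same data is an estimate, so it may differ slightly from the exact figure. Depending on how Hadoop is configured, the sample directory may need to be copied to the cluster's filesystem before it is passed as <input-dir>.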