Hello! I ran sourmash compute on ~50k single-cell (SmartSeq2 library prep) RNA-seq samples (they are here if you would like to see them: aws s3 ls s3://olgabot-maca/facs/sourmash/) and wanted to index/compare them all against our cell-cell distances/clusters/annotations from gene count tables.
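(For context, each per-sample signature was computed with something along these lines; the -k, --scaled, and file names here are assumptions for illustration, not taken from the actual run.)
sourmash compute -k 21 --scaled 1000 -o sample1.sig sample1.fastq.gz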
At first, I thought sourmash compare was broken because it said it was only loading 3444 signatures out of the 50k:
Wed 13 Jun - 18:34 /mnt/data/maca/facs/sourmash
ubuntu@olgabot-reflow ls | xargs sourmash compare --ksize 21 --csv ../ksize21.csv
loaded 3444 signatures total.
But there are 51,446 files here!!
Wed 13 Jun - 18:54 /mnt/data/maca/facs/sourmash
ubuntu@olgabot-reflow ls | wc -l
51446
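(As an aside, a quick diagnostic, not from the original session, for how many separate invocations the xargs pipeline above splits the file list into: each xargs batch runs its command once, so counting echo's output lines counts the batches.)
ls | xargs echo | wc -l    # number of batches xargs would create
getconf ARG_MAX            # kernel limit on the length of a single command line
xargs batches its input to keep each command line under the system's argument-length limits, which would be consistent with batches of a few thousand filenames.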
But then sourmash index was more explicit in showing that it was chunking the data:
Wed 13 Jun - 18:05 /mnt/data/maca/facs/sourmash
ubuntu@olgabot-reflow ls | xargs sourmash index --ksize 21 ../ksize21db
loading 3444 files into SBT
loaded 3444 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3441 files into SBT
loaded 3441 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3441 files into SBT
loaded 3441 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3439 files into SBT
loaded 3439 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3699 files into SBT
loaded 3699 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3437 files into SBT
loaded 3437 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3434 files into SBT
loaded 3434 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3449 files into SBT
loaded 3449 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3447 files into SBT
loaded 3447 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3436 files into SBT
loaded 3436 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3435 files into SBT
loaded 3435 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3435 files into SBT
loaded 3435 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3439 files into SBT
loaded 3439 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3442 files into SBT
loaded 3442 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3028 files into SBT
So it seems that sourmash compute was not broken after all; it was just taking its time working through all the samples.
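(A quick sanity check, sketched here against a hypothetical index.log capture of the output above, is to sum the per-batch counts and compare against the file count; the 15 batch sizes shown above do in fact add up to 51,446.)
grep -o 'loaded [0-9]* sigs' index.log | awk '{sum += $2} END {print sum}'   # total sigs across batches
ls | wc -l                                                                   # total files on disk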
Here are my questions:
- Could there be a more explicit description of how sourmash chunks the data, either behind a flag or in the documentation?
- Is the chunking done on a per-CPU basis? If so, that would be helpful to know when starting up an EC2 instance, to decide how much to allocate.
- Could progress output, e.g. "on chunk 1/20", be printed to stdout? (A rough sketch of what this might look like is below.)
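(To illustrate the last point only: a hypothetical wrapper that batches the signature files itself and prints that kind of progress line. The batch size, database naming, one-SBT-per-batch layout, and .sig file extension are all made up for the sketch, not how sourmash behaves.)
files=( *.sig )
chunk_size=3500
total=$(( (${#files[@]} + chunk_size - 1) / chunk_size ))
for (( i = 0; i < ${#files[@]}; i += chunk_size )); do
    n=$(( i / chunk_size + 1 ))
    echo "on chunk ${n}/${total}" >&2
    # index each batch into its own SBT (hypothetical naming scheme)
    sourmash index --ksize 21 "../ksize21db_${n}" "${files[@]:i:chunk_size}"
done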
Thank you!
EDIT: This was run on an AWS EC2 m4.large
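(Related to the per-CPU question: on the instance itself, nproc and free -h report the available vCPUs and memory; an m4.large has 2 vCPUs and 8 GiB of RAM.)
nproc     # vCPUs visible to the OS
free -h   # memory in human-readable units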