Hello! I ran sourmash compute on ~50k single-cell (SmartSeq2 library prep) RNA-seq samples (they are here if you would like to see them: aws s3 ls s3://olgabot-maca/facs/sourmash/) and wanted to index/compare them all against our cell-cell distances/clusters/annotations from gene count tables.
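(For context, each per-sample signature was computed with something along these lines; the -k, --scaled, and file names here are assumptions for illustration, not taken from the actual run.)
sourmash compute -k 21 --scaled 1000 -o sample1.sig sample1.fastq.gz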
At first, I thought sourmash compare was broken because it said it was only loading 3444 signatures out of the 50k:
Wed 13 Jun - 18:34 /mnt/data/maca/facs/sourmash
ubuntu@olgabot-reflow ls | xargs sourmash compare --ksize 21 --csv ../ksize21.csv
loaded 3444 signatures total.
But there are 51,446 files here!!
Wed 13 Jun - 18:54 /mnt/data/maca/facs/sourmash
ubuntu@olgabot-reflow ls | wc -l
51446
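(As an aside, a quick diagnostic, not from the original session, for how many separate invocations the xargs pipeline above splits the file list into: each xargs batch runs its command once, so counting echo's output lines counts the batches.)
ls | xargs echo | wc -l    # number of batches xargs would create
getconf ARG_MAX            # kernel limit on the length of a single command line
xargs batches its input to keep each command line under the system's argument-length limits, which would be consistent with batches of a few thousand filenames.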
But then sourmash index was more explicit in showing that it was chunking the data:
Wed 13 Jun - 18:05 /mnt/data/maca/facs/sourmash
ubuntu@olgabot-reflow ls | xargs sourmash index --ksize 21 ../ksize21db
loading 3444 files into SBT
loaded 3444 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3441 files into SBT
loaded 3441 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3441 files into SBT
loaded 3441 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3439 files into SBT
loaded 3439 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3699 files into SBT
loaded 3699 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3437 files into SBT
loaded 3437 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3434 files into SBT
loaded 3434 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3449 files into SBT
loaded 3449 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3447 files into SBT
loaded 3447 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3436 files into SBT
loaded 3436 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3435 files into SBT
loaded 3435 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3435 files into SBT
loaded 3435 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3439 files into SBT
loaded 3439 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3442 files into SBT
loaded 3442 sigs; saving SBT under "../ksize21db"
Finished saving nodes, now saving SBT json file.
loading 3028 files into SBT
So it seems that sourmash compute was not broken after all; it was just taking its time working through all the samples.
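(A quick sanity check, sketched here against a hypothetical index.log capture of the output above, is to sum the per-batch counts and compare against the file count; the 15 batch sizes shown above do in fact add up to 51,446.)
grep -o 'loaded [0-9]* sigs' index.log | awk '{sum += $2} END {print sum}'   # total sigs across batches
ls | wc -l                                                                   # total files on disk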
Here are my questions:
- Could there be a more explicit description of how sourmash chunks the data, either behind a flag or in the documentation?
- Is the chunking done on a per-CPU basis? If so, that would be helpful to know when starting up an EC2 instance, to decide how much to allocate.
- Could progress output, e.g. "on chunk 1/20", be printed to stdout? (A rough sketch of what this might look like is below.)
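(To illustrate the last point only: a hypothetical wrapper that batches the signature files itself and prints that kind of progress line. The batch size, database naming, one-SBT-per-batch layout, and .sig file extension are all made up for the sketch, not how sourmash behaves.)
files=( *.sig )
chunk_size=3500
total=$(( (${#files[@]} + chunk_size - 1) / chunk_size ))
for (( i = 0; i < ${#files[@]}; i += chunk_size )); do
    n=$(( i / chunk_size + 1 ))
    echo "on chunk ${n}/${total}" >&2
    # index each batch into its own SBT (hypothetical naming scheme)
    sourmash index --ksize 21 "../ksize21db_${n}" "${files[@]:i:chunk_size}"
done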
Thank you!
EDIT: This was run on an AWS EC2 m4.large
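(Related to the per-CPU question: on the instance itself, nproc and free -h report the available vCPUs and memory; an m4.large has 2 vCPUs and 8 GiB of RAM.)
nproc     # vCPUs visible to the OS
free -h   # memory in human-readable units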