christiam edited this page Aug 12, 2014 · 3 revisions

There are several routines whose job is to collect seed hits from the BLAST lookup table. They are:

  • BlastAaScanSubject (BLAST_aalookup.h)

  • BlastRPSScanSubject (BLAST_aalookup.h)

  • A large collection of routines in BLAST_nalookup.h

  • PHIBlastScanSubject (phi_lookup.h)

For most database searches these routines will touch every letter of the database being searched. All of the routines fill an array of type BlastOffsetPair (lookup_wrap.h) with the offsets of hits found. For every BLAST program except phiblast, each BlastOffsetPair represents one high-scoring alignment between a concatenated set of query sequences and a single subject sequence. In the case of phiblast, the query sequence is scanned once and PHIBlastScanSubject is called once for each subject sequence; every BlastOffsetPair then contains the start and stop point of each pattern hit to the subject sequence.

In general, a ScanSubject routine works from the beginning to the end of a subject sequence. The ScanSubject can be called several times on the same sequence, if there happen to be many hits to that sequence. There is a limit to the number of hits that a single ScanSubject call can retrieve; this limit is calculated by GetOffsetArraySize (lookup_wrap.c) and is BLAST-program- and BLAST-lookup-table-specific. The limit is chosen so that every time the lookup table is accessed, there is guaranteed to be room for all of the query offsets at that lookup table entry. An upper bound on the number of lookup entries is computed when the lookup table is constructed. BLAST lookup table structures all contain a longest_chain field that gives this upper bound (note that to reduce the construction time for the megablast lookup table, longest_chain in this case is given a conservative value that avoids having to examine every lookup table cell). The maximum number of hits returned by a ScanSubject call is longest_chain plus a fixed amount; choosing the limit this way has several advantages:

  • Since the lookup table is accessed once per subject sequence word, if room in the array of BlastOffsetPairs is guaranteed for at least one lookup table access then a single ScanSubject call is guaranteed to make forward progress through the subject sequence

  • Being able to finish a complete lookup table access means that no extra bookkeeping is needed to remember how many query offsets remain to be extracted from a single cell on the next ScanSubject call. This in turn allows the lookup table structure to be abstracted away to a certain extent, and allows different lookup tables to do the same job more efficiently

Optimizations to ScanSubject

For protein sequences, the number of letters in a subject word matches the 'width' of the lookup table; this means the protein-based ScanSubject routines can walk one letter at a time through the subject sequence. However, for most nucleotide searches the number of letters in a lookup table word is less than the actual word size for the BLAST search (otherwise, for example, a word size of 28 in a megablast search would require a lookup table with 4^28 entries, far larger than is practical). Apple and Genentech, in their version of the BLAST source, use this difference in word width to accelerate nucleotide ScanSubject. The changes they made to the old BLAST engine to achieve this speedup have been implemented in the current BLAST engine, and hence can benefit searches running on any platform.

In the old BLAST engine, since the subject sequence is always assumed to be in ncbi2na format for the nucleotide wordfinders, a subject word is formed and used to access the lookup table for every fourth base (i.e. every byte of the compressed subject sequence). This 'stride' of 4 is independent of the number of letters in a lookup table word and independent of the word size used in the BLAST search. It becomes increasingly wasteful when the disparity between these two quantities is large.

Words used in the lookup table for blastn are usually 8 letters wide; if the word size of the BLAST search is 20 then any hit between query and subject involves at least 20 contiguous exact matches. You do not need to test the subject sequence every four letters to find runs of 8 matches that may be part of a 20-letter run of exact matches; you can get away with less testing than that. Since every test of the subject sequence involves accessing the lookup table, as the disparity between the BLAST word size and the lookup table word size increases, the number of lookup table accesses (and hence the runtime of the ScanSubject routine) can in principle decrease linearly. Especially for megablast, where large word sizes are common and up to 70% of the runtime can be tied up in ScanSubject, there is potential for a significant speedup.

What is the largest possible stride that will not miss any hits between query and subject? Let the BLAST word size be W and the lookup table word size be w. A hit of size W will be missed if, in the course of a ScanSubject call, we never form a subject word that lies completely inside the hit. The worst case to consider is thus two scanned words of size w on consecutive strides, with the hit falling between them.

If two consecutively scanned words of size w are far enough apart, a word of size W can fit between them so that neither of the two smaller words lies completely inside it. The largest stride that avoids this is W - w + 1. With W = 20 and w = 8 this means a stride of 13, more than triple the ordinary stride of 4; hence for these settings a strided ScanSubject can be expected to run more than three times faster than the old engine. Note also that any stride from 1 to W - w + 1 will find hits equally well, though the larger the stride the faster the scanning goes. Because a range of strides is possible, it may sometimes be advantageous to use a stride slightly smaller than the maximum; for example, if the stride is a multiple of 4 then the bit-twiddling needed to form subject words is slightly reduced. With the current state of the nucleotide subject scanning code this turns out not to be needed, and the largest possible stride is always chosen.

Another advantage of this striding method appears when W < 11. Without striding we are forced to use lookup table words of width 4, and ScanSubject takes huge amounts of time (due to the enormous number of 4-letter matches that occur by chance in nucleotide sequences). Apple/Genentech striding allows a much larger lookup table width: for example, word size 7 can use a lookup table of width 7 with a stride of 1, or a table of width 6 with a stride of 2. The wider lookup table drastically reduces the number of seeds to process, and thus also the number of extensions of those seeds. This can make nucleotide searches with small word sizes up to several times faster. Whether a slightly larger stride turns out faster depends on whether the improved cache performance of a smaller lookup table outweighs the extra time needed by an increased number of ungapped extensions.

Discontiguous words

Discontiguous megablast works slightly differently from conventional megablast. It does not require complete runs of matches between query and subject words, but instead only x nucleotide matches within a window of y positions (for x = 11 or 12 and y = 16, 18 or 21). The exact positions in a word where a match is required are described by a bitfield called a 'template'. Use of discontiguous words is justified on theoretical grounds; one reason is that when a nucleotide sequence codes for protein, the third position of each codon is less conserved, so mismatches there matter less. Hence discontiguous megablast can be expected to find hits that are too dissimilar for megablast to find.

Lookup tables are built in the same way as for conventional megablast, except that the template is applied to generate offsets into the lookup table. ScanSubject routines work the same way as in the contiguous case, except that the stride is always fixed and the template is applied to generate subject words. Another difference in discontiguous megablast is that one can optionally use two simultaneous lookup tables, in the hope of recovering alignments that would be missed if just a single discontiguous template were used. Both tables are accessed for every subject word, which means the two templates must evaluate to the same lookup table size. Also, longest_chain is a worst-case value for both lookup tables combined, to accommodate the possibility of twice as many hits in a single ScanSubject call.

The only allowed stride for discontiguous megablast is 1. In principle the code would run with any stride, but any stride larger than 1 will potentially miss hits, because here the word size matches the lookup table width (W = w) and so the maximum safe stride W - w + 1 is 1. BLAST uses two types of templates, 'coding' and 'optimal'; the former are designed to selectively ignore every third base, while the latter are constructed to maximize the probability that a pair of random words of size template_length containing W matches will be found.

Designing templates for nucleotide scanning is something of a cottage industry; several researchers have used a variety of techniques to develop templates that optimize some kind of probability. The search space is very large when the template length is long. In principle it is straightforward to incorporate new templates into discontiguous megablast, but to date there has never been a comparative study of the effectiveness of even the existing templates.

Nucleotide ScanSubject Architecture

The nucleotide ScanSubject operation is usually critical for performance, but packing all of the needed flexibility for every ScanSubject operation into a single routine is necessarily wasteful. For example, if a ScanSubject routine has to handle word size 8 with arbitrary stride, it must assume that a whole new 8-mer must be constructed each time the lookup table is accessed. Since that 8-mer may not be aligned on a 4-base boundary, in the general case three bytes of the subject sequence must be loaded to form each 8-mer. If the stride were 1, however, loading one byte of the subject sequence would suffice for 4 lookup table accesses. As another example, if the stride is a multiple of 4 then several mask and shift operations become unnecessary. Even when the masks and shifts are required, knowing the stride beforehand allows the compiler to use compile-time constants instead of variables, and this can improve performance.

In order to achieve maximum ScanSubject performance, for nucleotide searches the BLAST engine chooses a ScanSubject routine from a pool of about two dozen candidates; that routine is used for the entirety of the search. The choice depends on the lookup table type, word size, and scanning stride. A pointer to the appropriate ScanSubject routine is determined by BlastChooseNucleotideScanSubject, and the pointer is stored as a void pointer within the lookup table structure (to avoid artificial dependencies between lookup table construction and scanning). This routine in turn calls lookup-table-type-specific routines that know which specialized ScanSubject routine is available for a particular search, while the scanning routines themselves are static functions in BLAST_nascan.c. If a specialized routine is not available for a particular combination of word size and scanning stride, generic routines that can handle any combination (at reduced efficiency) are chosen.

The specialized routines each use as many of the following optimizations as are applicable:

  • If the stride is 1 or 3 modulo 4, the scanning loop is unrolled by 4; if it is 2 modulo 4, the loop is unrolled by 2. The unrolling eliminates several shift and mask operations, and turns the shift and mask quantities into compile-time constants. The unrolled routines use a switch statement to jump into the middle of the unrolled loop when scanning starts

  • If the stride is less than 4, the scanning uses an accumulator (to store subject words across scanning iterations instead of forming the subject word from scratch). This reduces the number of memory accesses to the subject sequence, and the number of pointer increments

  • If the word size is a multiple of 4, mask operations are removed when the offset into the subject sequence is known to be a multiple of 4. Because the loops are unrolled, all strides can take advantage of this optimization once in a while

  • In the specific case of lookup table width 8 and stride 4, all the unrolled loop iterations look the same, so that scanning can be made to finish only at the bottom of the loop. This eliminates several bounds checks, and makes this case the fastest of all the scanning routines. Happily, this loop is also the most common for ordinary blastn

  • Discontiguous megablast always uses an accumulator since it scans at stride 1. For the most common discontiguous templates, the mask and shift operations to form a single subject word are permuted so that the accumulator does not have to be shifted first. The reduction in arithmetic for these templates makes them more than 20% faster

Finally, note that when the stride is exactly a multiple of 4, the generic scanning routines contain a special case that handles it essentially as fast as possible, so the generic routine is used there. This cuts down greatly on what is already a large number of static functions.

The speedup obtained from this drastic increase in complexity varies with the search. Megablast, and discontiguous megablast with the most common templates and a short-to-moderate-size query, run about twice as fast as before. For blastn the more common speedup is 10-15%.

RPSBLAST Optimizations

rpsblast is a special case of protein BLAST in which all the protein sequences in a small database are concatenated. Each sequence has a position-specific score matrix, and both the concatenation of the score matrices and the lookup table derived from it are precomputed and stored on disk. Given the query to an rpsblast search, we scan the query and collect the hits against the database; compare this to ordinary blastp, which scans the sequences in the database and collects hits against the query. In rpsblast the lookup table is derived from the database, while in blastp it is derived from the query.

There are several optimizations in the current engine that significantly improve the performance of rpsblast. These optimizations depend on the fact that a single entry in the lookup table contains thousands of sequence offsets, scattered throughout all the protein sequences in the RPS database. These sequence offsets have no locality at all in the database, and so while it is possible to collect seeds in the same manner as ordinary blastp, the amount of 'hopping around' in the wordfinder will cause huge numbers of cache misses. We can reduce the number of cache misses by

  • collecting a very large number of seeds in a single ScanSubject call

  • organizing those seeds so that the protein wordfinder still works but the amount of locality is increased

Here 'locality' refers to the amount of clustering in the values of sequence offsets that appear in the list of offset pairs returned by BlastRPSScanSubject. The more locality in the list, the more often all of the relevant quantities (sequence data, score matrix columns, wordfinder data structures) will fit in cache and the higher the performance achieved. The number of seeds returned at once must be large so that when everything is loaded into cache the first time, it gets reused thoroughly before going on to the next batch of data.

Correct operation of the wordfinder requires that offsets into the sequence being scanned appear in increasing order. There is no restriction on the order in which the lookup table offsets appear. Optimal locality could be achieved by pulling a large number of seeds out of the lookup table and then sorting them, first by query offset and then by subject offset. But this is far too much work: a call to qsort would involve dozens of passes through the list when there are millions of seeds.

A better approach is to use one pass of a radix sort: we divide the RPS database into chunks of fixed power-of-two size, and for every seed that comes out of the lookup table we locate the chunk that corresponds to the database offset of the seed, and append the seed to a list (or 'bucket') of seeds located within the chunk. Later, once a large number of seeds have been collected, the wordfinder then operates on each chunk in turn, iterating linearly through the bucket for that chunk. This is implemented as an array of RPSBucket structures.

Changing the order of seed access in this way technically violates the requirement that scan offsets always are handled in increasing order. However, the real reason for the ordering restriction is that it guarantees that seeds falling on the same diagonal are processed in order of increasing distance down that diagonal. Proceeding a bucket at a time still does that: seeds on the same diagonal and inside one bucket are handled in the same order as before, while two seeds in different buckets but on the same diagonal still occur in order (since buckets are handled in order of increasing DB offset).

In real rpsblast searches, the concatenated database is much larger than the sequence being scanned, so a plot of seed positions (DB offset against scan offset) would be very 'tall' and 'thin'. This means the number of diagonals (DB offset minus query offset) that are simultaneously active is limited. We want a large number of hits at once, so that we accumulate a large number of scan offsets and reuse the same range of DB offsets for as long as possible.

The performance difference is pretty dramatic: rpsblast is 25-50% faster than it used to be. Approximately half of the benefit comes from reusing data structures in the wordfinder (since the number of simultaneously active diagonals is limited), and the other half comes from needing only a small portion of the huge score matrix at a time, when calculating ungapped extensions.

Currently a bucket is of size RPS_BUCKET_SIZE (= 2048), and this produces enough locality to be fast without needing a huge number of buckets. As RPS databases grow relative to cache sizes, this granularity will become inappropriate. For a sense of scale, the largest RPS database has about 23000 sequences, and this is more than 200 times smaller than our largest protein database. Maintaining optimal performance while avoiding huge data structures will require further restructuring of the scanning code.
