• A MULTITHREADED PARALLEL IMPLEMENTATION OF A DYNAMIC ...


  •   
  • FileName: martins.pdf [preview-online]
    • Abstract: programming algorithm for biological sequence comparison on a general ... Parallel hardware for sequence comparison and align- ment. Computer Applications in the Biosciences, 12:473 ...

Download the ebook

Pacific Symposium on Biocomputing 6:311-322 (2001)
A MULTITHREADED PARALLEL IMPLEMENTATION OF A
DYNAMIC PROGRAMMING ALGORITHM FOR SEQUENCE
COMPARISON
W.S. MARTINS, J.B. DEL CUVILLO, F.J. USECHE,
K.B. THEOBALD, G.R. GAO
Department of Electrical and Computer Engineering
University of Delaware, Newark, DE 19716, USA
Abstract
This paper discusses the issues involved in implementing a dynamic
programming algorithm for biological sequence comparison on a general-
purpose parallel computing platform based on a ne-grain event-driven
multithreaded program execution model. Fine-grain multithreading per-
mits e cient parallelism exploitation in this application both by taking
advantage of asynchronous point-to-point synchronizations and commu-
nication with low overheads and by e ectively tolerating latency through
the overlapping of computation and communication. We have imple-
mented our scheme on EARTH, a ne-grain event-driven multithreaded
execution and architecture model which has been ported to a number of
parallel machines with o -the-shelf processors. Our experimental results
show that the dynamic programming algorithm can be e ciently imple-
mented on EARTH systems with high performance e.g., speedup of 90
on 120 nodes, good programmability and reasonable cost.
1 Introduction
Today, one of the most powerful methods for inferring the biological function
of a gene or the protein that it encodes is by sequence similarity searching on
protein and DNA sequence databases. With the development of rapid methods
for sequence comparison, discoveries based solely on sequence homology have
become routine. A good introduction can be found in a book by Waterman.1
Although sequence comparison algorithms based on the dynamic program-
ming method | such as Needleman-Wunsch 2 and Smith-Waterman 3 | pro-
vide optimal solutions, they are computationally expensive. Therefore, most
current sequence comparison methods used in practice, e.g., BLAST 4 and
FASTA,5 are based on heuristics which are much faster, but do not produce
optimal results. Speed is important given the sizes of the sequence databases
currently available, but it comes at the price of getting incomplete results.
In this paper, we are interested in studying how to apply the computational
power of parallel computers to speed up the process of comparing sequences,
Pacific Symposium on Biocomputing 6:311-322 (2001)
but without having to compromise by missing some optimal results. We look
at dynamic programming algorithms for sequence comparison. The rst al-
gorithm introduced for nding the optimal alignment between sequences, the
Needleman-Wunsch algorithm,2 used this technique. This algorithm had a
great impact on later sequence alignment algorithms, such as the well-known
Smith-Waterman method 3 and others.6 7 Therefore, speeding up dynamic pro-
;
gramming algorithms for nding optimal solutions to sequence comparisons is
an important problem in computational biology and bioinformatics.
The dynamic programming algorithm works by computing the so-called
similarity matrix. As will be discussed in detail in Section 2, the computation
at each element in this matrix depends on the results of three other elements:
its nearest west, northwest and north neighbors in the matrix. Such ne-grain
data dependences present serious challenges for e cient parallel execution on
current parallel computers. To meet such challenges, we exploit the power of
a multithreaded execution and architecture model, such as the EARTH E -
cient Architecture for Running THreads model,8 where ne-grain parallelism
can be e ciently exploited on top of a parallel machine based on o -the-shelf
microprocessors. Under the EARTH model, the computation of an element
or a block of elements of the similarity matrix can be assigned to one thread.
The thread scheduling under EARTH is event-driven; a thread will become
enabled if and only if the events on which it depends have arrived. There-
fore, we can map the ne-grain data dependences into such events, and the
enabling and execution of the threads in di erent points are performed in an
asynchronous fashion. With similarity matrices above a reasonable threshold,
this mapping provides ample thread parallelism to keep the processors usefully
busy. Maintaining multiple enabled threads in the same processor also pro-
vides the ability to tolerate interprocessor communication and communication
latencies, and to sustain high and smooth scalability.
2 Sequence Comparison Using Dynamic Programming
The rst algorithm for comparing biological sequences using the dynamic pro-
gramming technique was proposed by Needleman and Wunsch in 1970.2 The
algorithm consists of two parts: the calculation of the total score indicating
the similarity between the two given sequences, and the identi cation of the
alignments that lead to the score. In this paper we will concentrate on the
calculation of the score, since this is the most computationally expensive part.
The idea behind using dynamic programming is to build up the solution
by using previous solutions for smaller subsequences. The comparison of the
two sequences X and Y, using the dynamic programming algorithm, is illus-
Pacific Symposium on Biocomputing 6:311-322 (2001)
trated in Figure 1. This algorithm nds global alignments by comparing entire
sequences. The sequences are placed along the left margin X and on the top
Y. A similarity matrix is initialized with decreasing values 0; ,1; ,2; ,3; : : :
along the rst row and rst column to penalize for consecutive gaps insertions
or deletions.
Sequence
A T G C A G T Y
0 -1 -2 -3 -4 -5 -6 -7
A -1 1 0 -1 -2 -3 -4 -5
T -2 0 2 1 0 -1 -2 -3
sequence X A T - A A G T
Sequence A -3 -1 1 2 1 1 0 -1 sequence Y A T G C A G T
X score 1 1 -1 0 1 1 1
-4 -2 0 1 2 2 1 0 TOTAL= 4
A
sequence X A T A - A G T
G -5 -3 -1 1 1 2 3 2 sequence Y A T G C A G T
score 1 1 0 -1 1 1 1
T -6 -4 -2 0 1 1 2 4 TOTAL= 4
Figure 1: Similarity matrix and global alignments
The other elements of the matrix are calculated by nding the maximum
value among the following three values: the left element plus gap penalty, the
upper-left element plus the score of substituting the horizontal symbol for the
vertical symbol, and the upper element plus the gap penalty. For the general
case where X = x1 ; : : : ; x and Y = y1 ; : : : ; y , for i = 1; : : : ; n and j = 1; : : : ; m,
i j
the similarity matrix SM n; m is built by applying the following recurrence
equation, where gp is the gap penalty and ss is the substitution score:
8 SM i; j , 1 + gp
SM i; j = max : SM i , 1; j , 1 + ss
SM i , 1; j + gp
In our example, gp is -1, and ss is 1 if the elements match and 0 otherwise.
However, other general values can be used instead.
Following this recurrence equation, the matrix is lled from top left to
bottom right with entry i; j requiring the entries i; j , 1 , i , 1; j , 1 , and
i , 1; j . Notice that SM i; j corresponds to the best score of the subsequences
x1 ; : : : ; x and y1 ; : : : ; y . Since global alignment takes into account the entire
i j
sequences, the nal score will always be found in the bottom right hand corner
Pacific Symposium on Biocomputing 6:311-322 (2001)
of the matrix. In our example, the nal score 4 gives us a measure of how
a
similar the two sequences are. Figure 1 shows the similarity matrix and the
two possible alignments arrows going up and left.
3 Parallel Computation | Challenges and Problem Formulation
A parallel version of the sequence comparison algorithm using dynamic pro-
gramming must handle the data dependences presented by this method, yet
it should perform as many operations as possible independently. This may
present a serious challenge for e cient parallel execution on current general
purpose parallel computers, i.e., MIMD Multiple Instruction stream, Multiple
Data stream computers.
Given the data dependences presented by the algorithm, the similarity
matrix can be lled row by row, column by column, or anti-diagonal by anti-
diagonal i.e., all elements i; j  for which i + j is a xed value. The problem
with the rst two approaches is that most of the elements in a row or col-
umn depend on other elements in the same row column. This means the
row or column cannot be computed in parallel. On the other hand, the ele-
ments in an anti-diagonal depend only on previously calculated anti-diagonals.
This means that parallel computation can proceed as a wave front across the
similarity matrix, i.e., by computing successive anti-diagonals of the matrix
simultaneously, during successive time steps.
Although it exposes parallelism, the anti-diagonal approach faces a few
challenges when it comes to an e cient parallel implementation. First, the
sizes of the anti-diagonals vary during the computation, which leads to unbal-
anced work among processors. For example, assume six processors, one per
symbol row of sequence X, are available to compute the matrix in Figure 1.
The computation would start with processor 1 calculating the element 1,1
assuming rows and columns are numbered 0,1,2 . . . . Then, in the next time
step, processors 1 and 2 would calculate the elements 1,2 and 2,1 respec-
tively, and so on. This way, processor 6, for instance, would have to wait 6
time steps before starting to work, and by the time we get to time step 8,
processor 1 would already be idle. In the worst case, where X and Y are the
same length, each processor would only be used half of the time on average.
Another challenge has to do with the number of elements to be computed
by each processor in each time step. In the previous example we assumed that
each processor would calculate one element of the matrix at a time. However,
a When this information is stored during the calculation of the matrix, the second part
of the algorithm, identi cation of the alignment, can be easily computed by following the
pointers from the lowest right corner to the upper left corner.
Pacific Symposium on Biocomputing 6:311-322 (2001)
this ne-grain computation would require a very large number of processors
if real biological data is to be considered. An additional problem is the high
communication overheads for such an implementation, which would require
data exchange among all active processors at every time step.
In this paper, we are interested in the following challenging question: Can
the dynamic programming algorithm be e ciently implemented on general-
purpose parallel computers with high performance and e ciency, good pro-
grammability, and reasonable cost? Here we are particularly interested in
parallel machines made mainly of commodity o -the-shelf microprocessors
and stock hardware. The requirement of high performance" implies good
speedup and scalability, and good programmability" and reasonable cost" fa-
vor general-purpose parallel computer solutions as opposed to special-purpose
hardware or exotic processor technology.
4 Parallel Implementation of the Dynamic Programming Algorithm
The previous section mentioned that computing the anti-diagonal element by
element would lead to expensive communication overheads. For each element,
the program would compute a single maximum, yet would have to send the
result to three processors though one of these may be the same processor.
One solution to this problem is to divide the similarity matrix into rect-
angular blocks, as shown in Figure 2a. In this example, the program would
compute block 1 rst, followed by 2 and 5, etc. If each block has q rows and r
columns, then the computation of a given block requires only the row segment
immediately above the block, the column segment to its immediate left, and
the element above and to the left | a total of q + r +1 elements. For instance,
if each block has 4 rows and 4 columns, then each block has to compute 16
maxima after receiving 9 input values. The communication-to-computation
ratio drops from 3:1 to 9:16 | an 81 reduction!
Note that this blocking will decrease the maximum achievable parallelism
somewhat, by introducing some sequential dependences in the code. However,
given the sizes of the current problems and the parallel machines currently
used, this potential loss will not be a limiting factor.
The load-balancing problem can be addressed by putting several rows of
blocks or strips" on the same processor. Figure 2b illustrates this ap-
proach when four processors are used. The rst and fth strips are assigned
to processor 1, the second and sixth strips are assigned to processor 2 and so
on. This helps to keep all processors busy through most of the computation.
For example, processor 1 initially works with the rst strip, then simultane-
ously with the rst and fth strip, then nally only with the fth strip. The
Pacific Symposium on Biocomputing 6:311-322 (2001)
P1 1 2 3 4
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P1
111111111111111111111
000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
P2 5 6 7 8 P2 111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
111111111111111111111
000000000000000000000
P3 9 10 11 12 000000000000000000000
111111111111111111111
P3
000000000000000000000
111111111111111111111
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P4 13 14 15 16 111111111111111111111
000000000000000000000
P4
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P1 17 18 19 20 111111111111111111111
000000000000000000000
P1
111111111111111111111
000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
P2 111111111111111111111111111111111111111111
P2 21 22 23 24 000000000000000000000000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
111111111111111111111
000000000000000000000
P3 25 26 27 28 111111111111111111111
000000000000000000000
P3
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P4 29 30 31 32 111111111111111111111
000000000000000000000
P4
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
a Block Division b Processor Distribution
Figure 2: Partition of the similarity matrix
processor utilization rises to 75.
4.1 The EARTH Multithreaded Architecture | Our Platform
EARTH 8 9 supports a multithreaded program execution model in which a
;
program is viewed as a collection of threads whose execution ordering is deter-
mined by data and control dependences explicitly identi ed in the program.
Threads, in turn, are further divided into bers which are non-preemptive and
scheduled according to data ow-like ring rules, i.e., all needed data must be
available before it becomes ready for execution. Programs structured using
this two-level hierarchy can take advantage of both local synchronization and
communication between bers within the same thread, exploiting data local-
ity. In addition, an e ective overlapping of communication and computation is
made possible by providing a pool of ready-to-run bers from which the pro-
cessor can fetch new work as soon as the current ber ends and the necessary
communication is initiated.
The EARTH model de nes a common set of primitive operations required
for the management, synchronization and data communication of threads.
Each node in an EARTH system consists of an execution unit EU, a syn-
chronization unit SU, queues linking the EU and SU, local memory, and an
interface to interconnection network. While the EU merely executes bers, i.e.,
does the computation, the SU is responsible for scheduling and synchronizing
threads, handling remote accesses and performing dynamic load balancing.
Although designed to deal with multiple threads per node, the EARTH
model does not require any support for rapid context switching since bers
are non-preemptive and is well-suited to running on o -the-shelf processors.
EARTH systems have been implemented on a number of platforms: MANNA
and PowerMANNA, IBM SP2, Sun SMP cluster and Beowulf. EARTH pro-
Pacific Symposium on Biocomputing 6:311-322 (2001)
Inactive fiber P1
111
000
111
000 Active fiber
111
000
111
000 Thread A Thread B
data E Fibers 1111
0000 E
11111
00000
Fibers
sync 1111
0000 11111
00000
P2 1111
0000
1111
0000 11111
00000
11111
00000
acknowledgement O O
P3
* 4fibers per processor
O 111111111
000000000
* Event-driven 111111111
000000000
P1 111111111
000000000 E
* Asynchronous 11111
00000 111111111
000000000
E 11111
00000
* Non-preemptive fibers P2 11111
00000
11111
00000 O
11111
00000
11111
00000
* Local communication/ O 11111
00000
11111
00000
11111
00000
11111
00000
synchronization between P3 11111
00000
11111
00000
11111
00000
11111
00000 E
fibers
E 11111
00000
P4 11111
00000
11111
00000 O
111111111
000000000 11111
00000
O 111111111
000000000
P1 111111111
000000000
111111111
000000000 E
1111
0000 111111111
000000000
E 1111
0000
P2 1111
0000
1111
0000 O
1111
0000
O 1111
0000
1111
0000
1111
0000
1111
0000
P3 1111
0000
1111
0000
1111
0000
1111
0000 E
1111
0000
P4 1111
0000
1111
0000 O
1111
0000
Figure 3: Computation of the similarity matrix on EARTH
grams are written using the programming language Threaded-C.8 9 This is ;
an extension of the ANSI-C programming language which, by incorporating
EARTH operations, allows the user to indicate parallelism explicitly.
4.2 Our Multithreaded Implementation
Our multithreaded implementation follows the description given at the begin-
ning of this section. Generally speaking, it assigns the computation of each
strip to a thread, having 2 independent threads per node. However, in order
to better overlap computation and communication, blocks on a strip are ac-
tually calculated by two bers within a thread. These bers are repeatedly
instantiated to compute one block at a time, and only one of the two bers of
each thread can be active at a particular time.
The decision of having two alternating bers within each thread was based
on the following reasoning. It would be a waste of resources if we had one sep-
arate ber for each block, in each strip, since only one block can be calculated
at a time. Having just one ber for all blocks is also not a good idea because
Pacific Symposium on Biocomputing 6:311-322 (2001)
this ber would get delayed due to the synchronization signal coming from
the ber immediately below. This signal acknowledges the receipt of data |
without it the ber, re-instantiated, would be allowed to overwrite the previous
data. Thus, with just one ber, computation would not be allowed to proceed
until this acknowledgment signal is received. With the addition of an extra
ber we can further overlap computation and communication since one of the
bers can wait for the acknowledgment while the other starts working on the
following block. This double-bu ering and acknowledgment scheme is used
with other parallel applications on EARTH.8 10 
;
A snapshot of the computation of the similarity matrix using our multi-
threaded implementation is illustrated in Figure 3. A thread is assigned to each
horizontal strip and the actual computation is done by bers labeled Even
and Odd. The gure shows the computation of the main anti-diagonal of
the matrix. The arrows indicate data and synchronization signals. For exam-
ple, processor 2 sends data downward arrows to processor 3 and receives data
from processor 1 | i.e., bers E of strips 2 and 6 send data to bers E of strips
3 and 7, and bers O of strips 1 and 5 send data to bers O of strips 2 and 6.
Fibers within a same thread, that is, associated with the same strip, send only
a synchronization signal horizontal arrows since they share data local to the
thread to which they belong. Finally, dotted upward arrows acknowledge the
receipt of data so that the ber receiving this signal can be re-instantiated to
calculate another block of the same strip.
During the initialization phase, each thread grabs a piece of the input
sequence X. This piece is all a thread needs from sequence X so the whole
sequence need not be stored. Moreover, after computing a block, each ber
sends to the ber beneath a piece of the sequence Y being compared. By
doing so, we minimize the initialization delay that occurs when the nodes are
reading sequence X from the server. Furthermore, since subsequent pieces of
sequence Y can be stored in the same memory area, the demands for space are
considerably reduced.
5 Results
The experiments in this study are based on the EARTH implementation for
the MANNA parallel machine. Our experiments were run using both a 20-node
MANNA and SEMi, an accurate simulator of the MANNA.8 9 The di erence
;
in the clock cycle counts between the simulator and the real MANNA have
been measured and were less than 3 for the same Threaded-C code. The
simulator allows one to test di erent con gurations for the EARTH system.
In this paper we consider only the most conservative of them, one in which each
Pacific Symposium on Biocomputing 6:311-322 (2001)
node contains only a single o -the-shelf microprocessor, and the operations of
the EU and SU must be performed therefore on the same processor.
The proposed multithreaded dynamic-programming algorithm for sequence
comparison has been implemented under the EARTH experimental platform.
The results are reported for sequences ranging from 512 to 10K elements. Fig-
ure 4 shows absolute speedups for various problem sizes sizes list the length
of the X and Y sequences. Absolute speedup compares the running time of
the parallel code to the best sequential program, so that the results account
for multithreading overheads.
The main experimental results include: b
Speedup: The multithreaded implementation has achieved impressive
speedup: up to 90 on a 120-node platform.
Scalability: A good scalability is obtained from 4 up to 120 nodes.
Processor E ciency: Very good e ciency has been achieved; on 8
nodes, processors are utilized 99 as much as on one node running se-
quential code. Even on 120 nodes, processors achieve 75 utilization on
the largest problem size.
Programmability: The good performance is achieved using a straight-
forward partitioning method, without requiring expensive compiler sup-
port for optimal partitioning algorithms.
Reasonable cost: This is achieved on an EARTH platform based on
o -the-shelf general-purpose microprocessor technology.
6 Related Work
Di erent parallel implementations of pair-wise sequence comparison algorithms
using dynamic programming techniques range from the exploitation of instruc-
tion level parallelism in uniprocessor machines 11 12 to SIMD 13 14 Single In-
; ;
struction stream, Multiple Data stream and MIMD implementations.15 16 Re- ;
lated problems such as sequence alignment and database search have also made
extensive use of parallel hardware, from special-purpose VLSI to recon gurable
hardware and programmable co-processor designs.14 17 Some work has been re-
;
ported in the eld of hardware accelerators 18 and some basic software platforms
b The authors are not aware of any published work reporting results, for a parallel MIMD
implementation of the dynamic programming algorithm, for the range of input data and
number of processors reported in this paper.
Pacific Symposium on Biocomputing 6:311-322 (2001)
96
512x1k
1kx2k
2kx4k
4kx4k


Use: 0.5754