docs/long_mult_walkthrough/long_mult_walkthrough.html
<html>
<head>
<title>Long Multiplication pipeline walkthrough</title>
</head>
<body>
<center><h1>Long Multiplication pipeline walkthrough</h1></center>
<ol start=0>
<br/><hr width=50% /><br/>
<li><h3>This is a walkthrough of a simple 3-analysis example pipeline.</h3>
<p>
The goal of the pipeline is to multiply two long numbers.
We pretend that it cannot be done in one operation on a single machine.
So we decide to split the task into subtasks of multiplying the first long number
by individual digits of the second long number for the sake of an example.
At the last step the partial products are shifted and added together to yield the final product.
</p>
<br/>
<br/>
<center>
We demonstrate what happens in the pipeline with the help of two types of diagrams:
job-level dependency (J-)diagrams and analysis-rule (A-)diagrams:
</center>
<table border=0 cellpadding=10>
<tr>
<td>
<p>
A <b>J-diagram</b> is a directed acyclic graph where nodes represent Jobs, Semaphores or Accumulators
with edges representing relationships and dependencies. Most of these objects are created dynamically
during the pipeline execution, so here you'll see a lot of action - the J-diagram will be growing.
</p>
<p><i>
J-diagrams can be generated at any moment during a pipeline's execution by running Hive's <b>visualize_jobs.pl</b> script (new in version/2.5) .
</i></p>
</td>
<td>
<p>
An <b>A-diagram</b> is a directed graph where most of the nodes represent Analyses and edges represent Rules.
As a whole it represents the structure of the pipeline which is normally static.
The only changing elements will be job counts and analysis colours.
</p>
<p><i>
A-diagrams can be generated at any moment during a pipeline's execution by running Hive's <b>generate_graph.pl</b> script.
</i></p>
</td>
</tr>
</table>
<br/>
<br/>
<br/>
<p>
The main bulk of this document is a commented series of snapshots of both types of diagrams during the execution of the pipeline.
<br/>
They can be approximately reproduced by running a sequence of commands, similar to these, in a terminal:
</p>
<table border=1><tr><td>
<pre>
<b>export PIPELINE_URL=sqlite:///lg4_long_mult.sqlite</b> <i># An SQLite file is enough to handle this pipeline</i>
<b>init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -pipeline_url $PIPELINE_URL</b> <i># Initialize the pipeline database from a PipeConfig file</i>
<b>runWorker.pl -url $PIPELINE_URL -job_id $JOB_ID</b> <i># Run a specific job - this allows you to force your own order of execution. Run a few of these</i>
<b>beekeeper.pl -url $PIPELINE_URL -analyses_pattern $ANALYSIS_NAME -sync</b> <i># Force the system to recalculate job counts and determine states of analyses</i>
<b>visualize_jobs.pl -url $PIPELINE_URL -out long_mult_jobs_${STEP_NUMBER}.png</b> <i># To make a J-diagram snapshot (it is convenient to have synchronized numbering)</i>
<b>generate_graph.pl -url $PIPELINE_URL -out long_mult_analyses_${STEP_NUMBER}.png</b> <i># To make an A-diagram snapshot (it is convenient to have synchronized numbering)</i>
</pre>
</td></tr></table>
</li>
<br/><br/>
<li><h3>This is our pipeline just after the initialization:</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_01.png><img src=long_mult_jobs_01.png /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_01.png><img src=long_mult_analyses_01.png /></a>
</td>
</tr>
<tr>
<td>
<p>
The J-diagram shows a couple of 3d-boxes, they represent specific Jobs.
Each Job is an individual task that can be run on an individual machine.
We need at least one initial Job to run a pipeline. However that one Job
may generate many more as it gets executed.
</p>
<p>
In this example we have two initial Jobs. They were created automatically during the pipeline's initialization process.
These two initial Jobs will generate two independent "streams" of execution which will yield their own independent results.
Since in this particular pipeline we are simply multiplying same two numbers in different orders, we expect the final results to be identical.
</p>
</td>
<td>
<p>
The A-diagram shows how execution of the pipeline is guided by Rules.
Since the Rules are mostly static, the diagram will also be changing very little.
</p>
<p>
The main objects on A-diagram are rectangles with rounded corners, they represent Analyses.
Analyses are "types" of Jobs (Analyses broadly define which code to run, where and how,
but miss specific parameters which become defined in Jobs).
In this pipeline we have three types: 'take_b_apart', 'part_multiply' and 'add_together'.
</p>
<p>
The "take_b_apart" Analysis contains two Jobs, which are in 'READY' state (can be checked-out for execution).
Our colour for 'READY' is green, so both the Analysis and the specific Jobs are shown in green.
</p>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>After running the first Job we see a lot of changes on the J-diagram:</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_02.png><img src=long_mult_jobs_02.png /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_02.png><img src=long_mult_analyses_02.png /></a>
</td>
</tr>
<tr>
<td>
<p>
Job_1 has finished running and is now in 'DONE' state (colour-coded blue).
It has generated 6 more Jobs: five in Analysis 'part_multiply' (splitting its own task into parts)
and one in Analysis 'add_together' (which will recombine the results of the former into the final result).
</p>
<p>
The newly created 'part_multiply' Jobs also control a Semaphore which blocks the 'add_together' Job
which is in 'SEMAPHORED' state and cannot be executed yet (grey).
The Semaphore is essentially a counter that gets decremented each time one of the controlling Jobs becomes 'DONE'.
It is our primary mechanism for synchronization of control- and dataflow.
</p>
</td>
<td>
<p>
The topology of A-diagram doesn't normally change, so pay attention at more subtle changes of colours and labels:
<ul>
<li>'take_b_apart' analysis is now yellow (in progress);<br/>
<i>"1r+1d" stands for "1 READY and 1 DONE"</i><br/><br/>
</li>
<li>'part_multiply' analysis is now green (ready);<br/>
<i>"5r" means "5 READY"</i><br/><br/>
</li>
<li>'add_together' analysis is now grey (all jobs are waiting);<br/>
<i>"1s" means "1 SEMAPHORED" (or blocked).</i>
</li>
</ul>
</p>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>After running the second Job more jobs have been added to Analyses 'part_multiply' and 'add_together'.</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_03.png><img src=long_mult_jobs_03.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_03.png><img src=long_mult_analyses_03.png /></a>
</td>
</tr>
<tr>
<td>
<p>
There is a new Semaphore, a new group of 'part_multiply' Jobs to control it, and a new 'add_together' Job blocked by it.
</p>
<p>
Note that the child jobs sometimes inherit some of their parameters from their parent Job ("params from: 1", "params from: 2").
This is done to save some space.
</p>
</td>
<td>
<p>
<ul>
<li>'take_b_apart' Analysis is completed (no more Jobs to run) and turns blue ('DONE')</li>
<li>more 'part_multiply' jobs have been generated, all are 'READY'</li>
<li>one more 'add_together' job has been generated, and it is also 'SEMAPHORED'</li>
</ul>
</p>
</td>
</tr>
</table>
<center><i>
Note that the job counts of A-diagram do not provide enough resolution to tell which Jobs are semaphored by which.
Not even the distribution of the Jobs that control Semaphores.
This is where J-diagram becomes useful.
</i></center>
</li>
<br/><hr width=50% /><br/>
<li><h3>We finally get to run a Job from the second Analysis.</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_04.png><img src=long_mult_jobs_04.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_04.png><img src=long_mult_analyses_04.png /></a>
</td>
</tr>
<tr>
<td>
<p>
Once it's done, two things happen:
<ul>
<li>one of the links to the Semaphore turns green and its counter gets decremented by 1 (control flow)</li>
<li>some data intended for the Job_3 is sent from Job_4 and arrives at an Accumulator (data flow).</li>
</ul>
</p>
</td>
<td>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>A couple more Jobs get executed with a similar effect</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_05.png><img src=long_mult_jobs_05.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_05.png><img src=long_mult_analyses_05.png /></a>
</td>
</tr>
<tr>
<td>
<p>
After executing these two jobs:
<ul>
<li>the Semaphore counter gets decremented by 2 (by the number of completed jobs)</li>
<li>the data that they generated gets sent to the corresponding Accumulator.</li>
</ul>
</p>
</td>
<td>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>And another couple more Jobs...</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_06.png><img src=long_mult_jobs_06.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_06.png><img src=long_mult_analyses_06.png /></a>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>Finally, one of the Semaphores gets completely unblocked, which turns Job_9 into 'READY' state.</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_07.png><img src=long_mult_jobs_07.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_07.png><img src=long_mult_analyses_07.png /></a>
</td>
</tr>
<tr>
<td>
<p>
To recap:
<ul>
<li>Semaphores help us to funnel multiple control sub-threads into one thread of execution.</li>
<li>Accumulators help to assemble multiple data sub-structures into one data structure.</li>
</ul>
Their operation is synchronized, so that when a Semaphore opens its Accumulators are ready for consumption.
</p>
</td>
<td>
<ul>
<li>'add_together' analysis has turned green, which means it finally contains something 'READY' to run</li>
<li>the label changed to '1s+1r', which stands for "1 SEMAPHORED and 1 READY"</li>
</ul>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>Job_9 gets executed.</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_08.png><img src=long_mult_jobs_08.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_08.png><img src=long_mult_analyses_08.png /></a>
</td>
</tr>
<tr>
<td>
<p>
We can see that the stream of execution starting at Job_2 finished first.
In general, there is no guarantee for the order of execution of jobs that are in 'READY' state.
</p>
</td>
<td>
<ul>
<li>The results of Job_9 are deposited into the 'final_result' table.</li>
<li>Unlike Accumulators, 'final_result' is a pipeline-specific non-Hive table,
so no link is retained between the job that generated the data and the data in the table.</li>
<li>There are no more runnable jobs in 'add_together' analysis, so it turns grey again, with '1s+1d' label for "1 SEMAPHORED and 1 DONE"</li>
</ul>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>The last 'part_multiply' job gets run...</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_09.png><img src=long_mult_jobs_09.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_09.png><img src=long_mult_analyses_09.png /></a>
</td>
</tr>
<tr>
<td>
<ul>
<li>Once Job_7 has run the second Semaphore gets unblocked.</li>
<li>This makes the second Accumulator ready for consumption and Job_3 becomes 'READY'.</li>
</ul>
</td>
<td>
</td>
</tr>
</table>
</li>
<br/><hr width=50% /><br/>
<li><h3>Job_3 gets executed.</h3>
<table border=0 cellpadding=10>
<tr>
<td style="vertical-align: top;">
<a href=long_mult_jobs_10.png><img src=long_mult_jobs_10.png width=1000pt /></a>
</td>
<td style="vertical-align: top;">
<a href=long_mult_analyses_10.png><img src=long_mult_analyses_10.png /></a>
</td>
</tr>
<tr>
<td>
<ul>
<li>Finally, all the jobs are 'DONE' (displayed in blue)</li>
<li>The stream of execution starting at Job_1 finished second (it could easily be the other way around).</li>
</ul>
</td>
<td>
<p>
The result also goes into 'final_result' table. We can verify that the two results are identical.
</p>
</td>
</tr>
</table>
</li>
</ol>
</body></html>