<html>
    <head>
        <title>Running eHive pipelines</title>
        <link rel="stylesheet" type="text/css" media="all" href="ehive_doc.css" />
    </head>
<body>

    <center><h1>Running eHive pipelines</h1></center>

    <hr width=50% />

    <h2>A quick overview</h2>

    <p>Each eHive pipeline is a potentially complex computational process.</p>

    <p>Whether it runs locally, on the farm, or on multiple compute resources, this process is centred around a "blackboard"
    (a MySQL, SQLite or PostgreSQL database) where individual jobs of the pipeline are created, claimed by independent Workers
    and later recorded as done or failed.</p>

    <p>Running the pipeline involves the following steps (a condensed command sequence is sketched just after this list):</p>
    <ul>
    <li>
        Using the <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> script to create an instance of the pipeline database from a "PipeConfig" file
    </li>
    <li>
        (optionally) Using the <a href=scripts/seed_pipeline.html><b>seed_pipeline.pl</b></a> script to add jobs to the "blackboard"
    </li>
    <li>
        Running the <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> script, which will look after the pipeline and maintain a population of Worker processes
            on the compute resource; these Workers will claim and perform all the jobs of the pipeline
    </li>
    <li>
        (optionally) Monitoring the state of the running pipeline
        <ol>
            <li>
                by periodically running <a href=scripts/generate_graph.html><b>generate_graph.pl</b></a> that will produce a fresh snapshot of the pipeline diagram,
            </li>
            <li>
                by using the <b>guiHive</b> web interface,
            </li>
            <li>
                by connecting to the database using the <a href=scripts/db_cmd.html><b>db_cmd.pl</b></a> script and issuing SQL commands.
            </li>
        </ol>
    </li>
    </ul>
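    <p>Condensed into a single session, a typical run might look like the sketch below
    (assembled from the examples detailed in the sections that follow; it uses the LongMult example pipeline
    and an SQLite blackboard for illustration, and omits the optional seeding and monitoring steps):</p>
    <pre>
        <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -pipeline_url sqlite:///long_mult
        <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> -url sqlite:///long_mult -loop
        <a href=scripts/generate_graph.html><b>generate_graph.pl</b></a> -url sqlite:///long_mult -out long_mult_diagram.png
    </pre>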

    <hr width=50% />

    <h2>Initialization of the pipeline database</h2>

    <p>Every eHive pipeline is centred around a "blackboard", which is usually a MySQL/SQLite/PostgreSQL database.
    This database contains both static information
    (the general definition of analyses, associated runnables, parameters and resources, dependency rules, etc)
    and runtime information about the states of individual jobs running on the farm or locally.</p>

    <p>By initialization we mean the act of creating one such new pipeline database from a PipeConfig file.
    This is done by feeding the PipeConfig file to the ensembl-hive/scripts/<a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> script.<br/>
    A typical example:
    <pre>
        <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -pipeline_url mysql://user:password@host:port/long_mult
    </pre>

    <p>This will create a MySQL pipeline database called 'long_mult' with the given connection parameters.
    In the case of newer PipeConfig files these may be the only parameters needed, as the rest can be set at a later stage via "seeding" (see below).</p>

    <p>If heavy concurrent traffic to the database is not expected, we may choose to keep the blackboard in a local SQLite file:
    <pre>
        <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -pipeline_url sqlite:///long_mult
    </pre>
    <p>
    In the latter case no other connection parameters except for the filename are necessary, so they are skipped.
    </p>

    <p>A couple of more complicated examples:
    <pre>
        <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> Bio::EnsEMBL::Compara::PipeConfig::Example::EnsemblProteinTrees_conf -user "my_db_username" -password "my_db_password" -mlss_id 12345
    </pre>
    <p>This sets the 'user', 'password' and 'mlss_id' parameters via command-line options.
    At this stage you can also override any of the other options defined in the default_options section of the PipeConfig file.</p>

    <p>If you need to modify second-level values of a "hash option" (such as the '-host' or '-port' of the 'pipeline_db' option),
    the syntax is as follows (it uses the extended syntax of Getopt::Long):
    <pre>
        <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> Bio::EnsEMBL::Compara::PipeConfig::Example::EnsemblProteinTrees_conf -pipeline_db -host=myhost -pipeline_db -port=5306
    </pre>

    <p><font color=red>PLEASE NOTE</font>:
    Although many older PipeConfig files make extensive use of command-line options such as -password and -mlss_id above (the so-called o() syntax),
    this is no longer the only, nor the recommended, way of pre-configuring pipelines. There are better ways to configure pipelines,
    so if you find yourself struggling to make sense of an existing PipeConfig's o() syntax, please talk to eHive developers or power-users
    who are usually happy to help.<br/>
    Some pipelines may have other dependencies beyond eHive (e.g. the Ensembl Core API, BioPerl, etc). Make sure you have installed them and configured your environment (PATH and PERL5LIB). <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> will try to compile all the analysis modules, which ensures that most of the dependencies are installed, but some others can only be found at runtime.
   </p>

    <p>Normally, one run of <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> should create a pipeline database for you.<br/>

    If anything goes wrong and the process does not complete successfully, you will need to drop the partially created database in order to try again.
    You can either drop the database manually, or use the "-hive_force_init 1" option, which will automatically drop the database before trying to create it.
    </p>
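    <p>For example, to drop any partially created 'long_mult' database and re-initialize it in one go:</p>
    <pre>
        <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -pipeline_url sqlite:///long_mult -hive_force_init 1
    </pre>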

    <p>If <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> completes successfully, it will print a list of commands that can be run next:
    <ul>
        <li>how to "seed" jobs into the pipeline database</li>
        <li>how to run the pipeline</li>
        <li>how to connect to the pipeline database and monitor the progress</li>
        <li>how to visualize the pipeline's diagram or resource usage statistics</li>
        <li>etc...</li>
    </ul>
    <p>
    Please remember that these command lines apply only to this particular pipeline database,
    and are likely to be different the next time you run the pipeline. Moreover, they will contain a sensitive password,
    so don't write them down.
    </p>

    <hr width=50% />

    <h2>Generating a pipeline's flow diagram</h2>

    <p>As soon as the pipeline database is ready you can store its visual flow diagram in an image file.
        This diagram is a much better tool for understanding what is going on in the pipeline.
        Run the following command to produce it:
    <pre>
        <a href=scripts/generate_graph.html><b>generate_graph.pl</b></a> -url sqlite:///my_pipeline_database -out my_diagram.png
    </pre>
    <p>
        You only have to choose the format (gif, jpg, png, svg, etc) by setting the output file extension.
    </p>

    <table><tr>
    <td><img src=LongMult_diagram.png height=450 alt="example_diagram"></td>

    <td width=50></td>

    <td>
    <h3>LEGEND:</h3>
    <ul>
        <li>The rounded nodes on the flow diagram represent Analyses (classes of jobs).</li>
        <li>The white rectangular nodes represent Tables that hold user data.</li>
        <li>The blue solid arrows are called "dataflow rules". They either generate new jobs (if they point to an Analysis node) or store data (if they point at a Table node).</li>
        <li>The red solid arrows with T-heads are "analysis control rules". They block the pointed-at Analysis until all the jobs of the pointing Analysis are done.</li>
        <li>Light-blue shadows behind some analyses stand for "semaphore rules". Together with red and green dashed lines they represent our main job control mechanism that will be described elsewhere.</li>
    </ul>

    <p>Each flow diagram thus generated is a momentary snapshot of the pipeline state, and these snapshots will be changing as the pipeline runs.
        One of the things changing will be the colour of the Analysis nodes. The default colour legend is as follows:</p>
        <ul>
            <li><span style="background-color:white">&nbsp;[&nbsp;EMPTY&nbsp;]&nbsp;</span> : the Analysis never had any jobs to do. Since pipelines are dynamic it may be ok for some Analyses to stay EMPTY until the very end.</li>
            <li><span style="background-color:DeepSkyBlue">&nbsp;[&nbsp;DONE&nbsp;]&nbsp;</span> : all jobs of the Analysis are DONE. Since pipelines are dynamic, it may be a temporary state, until new jobs are added.</li>
            <li><span style="background-color:green">&nbsp;[&nbsp;READY&nbsp;]&nbsp;</span> : some jobs are READY to be run, but nothing is running at the moment.</li>
            <li><span style="background-color:yellow">&nbsp;[&nbsp;IN PROGRESS&nbsp;]&nbsp;</span> : some jobs of the Analysis are being processed at the moment of the snapshot.</li>
            <li><span style="background-color:grey">&nbsp;[&nbsp;BLOCKED&nbsp;]&nbsp;</span> : none of the jobs of this Analysis can be run at the moment because of job dependency rules.</li>
            <li><span style="background-color:red">&nbsp;[&nbsp;FAILED&nbsp;]&nbsp;</span> : the number of FAILED jobs in this Analysis has gone over a threshold (which is 0 by default). By default <b>beekeeper.pl</b> will exit if it encounters a FAILED analysis.</li>
        </ul>

    <p>
        Another thing that will be changing from snapshot to snapshot is the job "breakout" formula displayed under the name of the Analysis.
        It shows how many jobs are in which state and the total number of jobs. Separate parts of this formula are similarly colour-coded:</p>
        <ul>
            <li>grey : <span style="background-color:grey">&nbsp;s&nbsp;</span> (SEMAPHORED) - individually blocked jobs</li>
            <li>green : <span style="background-color:green">&nbsp;r&nbsp;</span> (READY) - jobs that are ready to be claimed by Workers</li>
            <li>yellow : <span style="background-color:yellow">&nbsp;i&nbsp;</span> (IN PROGRESS) - jobs that are currently being processed by Workers</li>
            <li>skyblue : <span style="background-color:DeepSkyBlue">&nbsp;d&nbsp;</span> (DONE) - successfully completed jobs</li>
            <li>red : <span style="background-color:red">&nbsp;f&nbsp;</span> (FAILED) - unsuccessfully completed jobs</li>
        </ul>
    </td>
    </tr></table>

    <p>Actually, you don't even need to create a pipeline database to see its diagram,
        as the diagram can be generated directly from the PipeConfig file:
    <pre>
        <a href=scripts/generate_graph.html><b>generate_graph.pl</b></a> -pipeconfig Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -out my_diagram2.png
    </pre>
    <p>
        Such a "standalone" diagram may look slightly different (analysis_ids will be missing).
    </p>

    <p><font color=red>PLEASE NOTE</font>:
    A very friendly <b>guiHive</b> web interface can periodically regenerate the pipeline flow diagram for you,
    so you can now monitor (and to a certain extent control) your pipeline from a web browser.
    </p>

    <hr width=50% />

    <h2>Seeding jobs into the pipeline database</h2>

    <p>The pipeline database contains a dynamic collection of jobs (tasks) to be done.
        Jobs can be added to the "blackboard" either by the user (we call this process "seeding") or dynamically, by already running jobs.
        When a database is created using <a href=scripts/init_pipeline.html><b>init_pipeline.pl</b></a> it may or may not be already seeded, depending on the PipeConfig file
        (you can always check whether it has been automatically seeded by looking at the flow diagram).
        If the pipeline needs seeding, this is done by running the <a href=scripts/seed_pipeline.html><b>seed_pipeline.pl</b></a> script,
        providing both the Analysis to be seeded and the parameters of the job being created:
    <pre>
        <a href=scripts/seed_pipeline.html><b>seed_pipeline.pl</b></a> -url sqlite:///my_pipeline_database -logic_name "analysis_name" -input_id '{ "paramX" =&gt; "valueX", "paramY" =&gt; "valueY" }'
    </pre>
    <p>
    It only makes sense to seed certain analyses, typically the ones that do not have any incoming dataflow on the flow diagram.
    </p>

    <hr width=50% />

    <h2>Synchronizing ("sync"-ing) the pipeline database</h2>

    <p>In order to function properly (to monitor progress, block and unblock analyses, and submit the correct number of Workers to the farm)
    the eHive system needs to maintain a certain number of job counters. These counters and the associated analysis states are updated
    in a process called "synchronization" (or "sync"). This has to be done once before running the pipeline; normally the pipeline
    will take care of synchronization by itself and trigger the 'sync' process automatically.
    However, sometimes things go out of sync, especially when people try to outsmart the scheduler by manually stopping and running jobs :)
    This is when you may want to re-sync the database. It is done by running ensembl-hive/scripts/<a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> in "sync" mode:

    <pre>
        <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> -url sqlite:///my_pipeline_database -sync
    </pre>

    <hr width=50% />

    <h2>Running the pipeline in automatic mode</h2>

    <p>As mentioned previously, the usual life-cycle of an eHive pipeline revolves around the pipeline database.
    There are several "Worker" processes that run on the farm.
    The Workers pick suitable tasks from the database, run them, and report back to the database.
    There is also one "Beekeeper" process that normally loops on a head node of the farm (it is very light-weight).
    It monitors the progress of the Workers and, whenever needed, submits more Workers to the farm
    (since Workers die from time to time for natural and not-so-natural reasons, the Beekeeper maintains the correct load).<br/>
The Beekeeper reports back the completion
percentage of the pipeline and how many jobs remain to do, have failed or are done for each individual
analysis. As analyses progress they usually create new jobs for analyses further down the
pipeline. Once the pipeline is complete (or breaks), the Beekeeper exits.
</p>

    <p>So to "run the pipeline" all you have to do is to run the Beekeeper:
    <pre>
        <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> -url sqlite:///my_pipeline_database -loop
    </pre>

    <p>You can also restrict running to a subset of Analyses (either by analysis_id or by name pattern) :</p>
    <pre>
        <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> -url sqlite:///my_pipeline_database -analyses_pattern 'alignment_%' -loop       <i># all analyses whose name starts with 'alignment_'</i>
    </pre>
    or
    <pre>
        <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> -url sqlite:///my_pipeline_database -analyses_pattern '1..5,fasta_check' -loop  <i># only analyses with analysis_id between 1 and 5 and 'fasta_check'</i>
    </pre>

    <p>In order to make sure the <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> process doesn't die when you disconnect your ssh session from the farm,
    it is normally run in a "screen session".
    If your Beekeeper process gets killed for some reason, don't worry - you can re-sync and start another Beekeeper process.
    It will pick up from where the previous Beekeeper left off.
    </p>
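    <p>For example (the session name is arbitrary and only used for illustration):</p>
    <pre>
        screen -S my_pipeline_beekeeper                                  <i># start a named screen session</i>
        <a href=scripts/beekeeper.html><b>beekeeper.pl</b></a> -url sqlite:///my_pipeline_database -loop            <i># run the Beekeeper inside it</i>
        <i># detach with Ctrl-a d ; re-attach later with:</i>
        screen -r my_pipeline_beekeeper
    </pre>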

<p>
At each loop, Beekeeper will print out
the status of each analysis and the completion level of the pipeline, and then go to sleep for 1 minute.
Here is an example
of the output from the <a href="scripts/beekeeper.html"><b>beekeeper.pl</b></a> script running the <a href="http://www.ensembl.org/info/genome/compara/index.html">Ensembl Compara</a> <a href="https://github.com/Ensembl/ensembl-compara/blob/release/77/modules/Bio/EnsEMBL/Compara/PipeConfig/Lastz_conf.pm">LastZ pipeline</a>:
</p>

<pre>
======= beekeeper loop ** 19 **==========
GarbageCollector: Checking for lost Workers...
GarbageCollector: [Queen:] out of 32 Workers that haven't checked in during the last 5 seconds...
GarbageCollector: [LSF Meadow:]   RUN:32

get_species_list           ( 1)        <span style="color:black; background-color:DeepSkyBlue">DONE</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:1, <span style="color:black; background-color:red">Fail</span>:0)=1 Ave_msec:23, workers(Running:0, Reqired:0)   h.cap:-  a.cap:-  (sync'd 552 sec ago)
populate_new_database      ( 2)        <span style="color:black; background-color:DeepSkyBlue">DONE</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:1, <span style="color:black; background-color:red">Fail</span>:0)=1 Ave_msec:50810, workers(Running:0, Reqired:0)   h.cap:-  a.cap:-  (sync'd 552sec ago)
parse_pair_aligner_conf    ( 3)        <span style="color:black; background-color:DeepSkyBlue">DONE</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:1, <span style="color:black; background-color:red">Fail</span>:0)=1 Ave_msec:563, workers(Running:0, Reqired:0)   h.cap:-  a.cap:-  (sync'd 552 sec ago)
chunk_and_group_dna        ( 4)        <span style="color:black; background-color:DeepSkyBlue">DONE</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:2, <span style="color:black; background-color:red">Fail</span>:0)=2 Ave_msec:9044, workers(Running:0, Reqired:0)   h.cap:-  a.cap:-  (sync'd 552 sec ago)
store_sequence             ( 5)        <span style="color:black; background-color:DeepSkyBlue">DONE</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:853, <span style="color:black; background-color:red">Fail</span>:0)=853 Ave_msec:20692, workers(Running:0, Reqired:0)   h.cap:50  a.cap:-  (sync'd 518 sec ago)
store_sequence_again       ( 6)       EMPTY jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:0, <span style="color:black; background-color:red">Fail</span>:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:100  a.cap:-  (sync'd 552 sec ago)
dump_dna_factory           ( 7)       EMPTY jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:0, <span style="color:black; background-color:red">Fail</span>:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:-  a.cap:-  (sync'd 492 sec ago)
dump_dna                   ( 8)       EMPTY jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:0, <span style="color:black; background-color:red">Fail</span>:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:10  a.cap:-  (sync'd 552 sec ago)
create_pair_aligner_jobs   ( 9)        <span style="color:black; background-color:DeepSkyBlue">DONE</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:1, <span style="color:black; background-color:red">Fail</span>:0)=1 Ave_msec:394188, workers(Running:0, Reqired:0)   h.cap:10  a.cap:-  (sync'd 93 sec ago)
LastZ                      (10)     <span style="color:black; background-color:yellow">WORKING</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:162524, <span style="color:black; background-color:yellow">InProg</span>:192, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:3573, <span style="color:black; background-color:red">Fail</span>:0)=166289 Ave_msec:9572, workers(Running:100, Reqired:54437)   h.cap:100  a.cap:-  (sync'd 93 sec ago)
.
.
.
.
.
pairaligner_stats          (37)     <span style="color:black; background-color:grey">BLOCKED</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:1, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:0, <span style="color:black; background-color:red">Fail</span>:0)=1 Ave_msec:0, workers(Running:0, Reqired:1)   h.cap:-  a.cap:-  (sync'd 61 sec ago)
coding_exon_stats          (38)       EMPTY jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:0, <span style="color:black; background-color:red">Fail</span>:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:10  a.cap:-  (sync'd 552 sec ago)
coding_exon_stats_summary  (39)     <span style="color:black; background-color:grey">BLOCKED</span> jobs(<span style="color:black; background-color:grey">Sem</span>:0, <span style="color:black; background-color:green">Rdy</span>:0, <span style="color:black; background-color:yellow">InProg</span>:0, <span style="color:black; background-color:DeepSkyBlue">Done+Pass</span>:0, <span style="color:black; background-color:red">Fail</span>:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:-  a.cap:-  (sync'd 61 sec ago)

===== Stats of active Roles as recorded in the pipeline database: ======
                         LastZ : 100 active Roles
         ======= TOTAL ======= : 100 active Roles

Not submitting any workers this iteration
                          hive 2.651% complete (&lt; 432.133 CPU_hrs) (162726 todo + 4432 done + 0 failed = 167158 total)
sleep 1.00 minutes. Next loop at Mon Sep  8 15:27:48 2014
</pre>
<p>NB: colours are normally not used in the beekeeper's output. They have been inserted here to refer to the above-mentioned legend.</p>

<p>
One last note about the Beekeeper: you can think of it as a <em>pump</em>.
Its task is to submit new Workers to maintain the job flow. If you kill the Beekeeper, you stop the pump, but the water keeps flowing, i.e. the Workers are not killed and keep running.
To actually kill the Workers, you have to use the specific commands of your grid engine (e.g. <code>bkill</code> for Platform LSF).
</p>
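<p>
For example, on Platform LSF (the job id below is purely illustrative; use <code>bjobs</code> or the <code>worker</code> table to find your Workers):
</p>
<pre>
        bjobs               <i># list your LSF jobs, including the eHive Workers</i>
        bkill 1234567       <i># kill one Worker by its LSF job id</i>
</pre>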

    <hr width=50% />

    <h2>Monitoring the progress via a direct database session</h2>

    <p>In addition to monitoring the visual flow diagram (which can be generated manually using <a href=scripts/generate_graph.html><b>generate_graph.pl</b></a> or via the <b>guiHive</b> web interface)
    you can also connect to the pipeline database directly and issue SQL commands. To avoid typing in all the connection details (the syntax differs
    depending on the particular database engine used) you can use the bespoke <a href=scripts/db_cmd.html><b>db_cmd.pl</b></a> script, which takes the eHive database URL and performs the connection for you:

    <pre>
        <a href=scripts/db_cmd.html><b>db_cmd.pl</b></a> -url sqlite:///my_pipeline_database
    </pre>
    or
    <pre>
        <a href=scripts/db_cmd.html><b>db_cmd.pl</b></a> -url mysql://user:password@host:port/long_mult
    </pre>
    or
    <pre>
        <a href=scripts/db_cmd.html><b>db_cmd.pl</b></a> -url pgsql://user:password@host:port/long_mult
    </pre>

    <p>
      Once connected, you can list the tables and views (in MySQL, with <code>SHOW TABLES;</code>).
      The default set of tables should look something like this:
    </p>

<pre>
+----------------------------+
| Tables_in_hive_pipeline_db |
+----------------------------+
| accu                       |
| analysis_base              |
| analysis_ctrl_rule         |
| analysis_data              |
| analysis_stats             |
| analysis_stats_monitor     |
| dataflow_rule              |
| hive_meta                  |
| job                        |
| job_file                   |
| log_message                |
| msg                        |
| pipeline_wide_parameters   |
| progress                   |
| resource_class             |
| resource_description       |
| resource_usage_stats       |
| role                       |
| worker                     |
| worker_resource_usage      |
+----------------------------+
 </pre>

<p>
Some of these tables, such as <code>analysis_base</code>, <code>job</code> and <code>resource_class</code>, may be populated with
entries depending on what is in your configuration file. At the very least you should expect to
find your analyses in <code>analysis_base</code>. Some tables, such as <code>log_message</code>, will only get populated
while the pipeline is running (for example <code>log_message</code> will get an entry when a job exceeds
the memory limit and dies).
</p>
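<p>
For example, a quick check that your analyses have been loaded from the PipeConfig file
(a sketch assuming the standard eHive schema column names):
</p>
<pre>
    SELECT analysis_id, logic_name, resource_class_id FROM analysis_base;
</pre>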
<p>
Please refer to the eHive schema (see <a href=hive_schema.png>eHive schema diagram</a> and <a href=hive_schema.html>eHive schema description</a>) to find out how those tables are related.
</p>

    <p>In addition to the tables, there is a "progress" view from which you can select and see how your jobs are doing:</p>
    <pre>
        SELECT * from progress;
    </pre>

    <p>If you see jobs in 'FAILED' state or jobs with retry_count&gt;0 (which means they have failed at least once and had to be retried),
    you may need to look at the "msg" view in order to find out the reason for the failures:</p>
    <pre>
        SELECT * FROM msg WHERE job_id=1234;    # a specific job
    </pre>
    or
    <pre>
        SELECT * FROM msg WHERE analysis_id=15;    # jobs of a specific analysis
    </pre>
    or
    <pre>
        SELECT * FROM msg;    # show me all messages
    </pre>

    <p>Some of the messages indicate temporary errors (such as a temporary lack of connectivity with a database or file),
    but others may be critical (e.g. a wrong path to a binary) and will eventually make all jobs of an analysis fail.
    If the "is_error" flag of a message is false, it is probably just a diagnostic message which is not critical.</p>

    <hr width=50% />

    <h2>Monitoring the progress via guiHive</h2>

    <p>guiHive is a web interface to an eHive database that allows you to monitor the state of the pipeline.
It displays flow diagrams of all the steps in the pipeline and their relationships to one another. In addition it colours
analyses based on completion, and each analysis has a progress circle which indicates the
number of completed, running and failed jobs. guiHive also offers the ability to directly
modify analyses; for example you can change the resource class used by an analysis
through guiHive.
</p>
    <p>guiHive is already installed at the <a href="http://guihive.internal.sanger.ac.uk:8080/">Sanger</a> and at the <a href="http://guihive.ebi.ac.uk:8080/">EBI</a> (both for internal use only), but can also be installed locally. Instructions for this are on <a href="https://github.com/Ensembl/guiHive">GitHub</a>.
    </p>

    <hr width=50% />

    <h2>Testing the pipeline and debugging failed jobs</h2>

<p>
During the course of running a pipeline issues may arise from things like incorrect
resource allocation, issues with input data, problems in the configuration itself or in the
modules that are run for each analysis. Alternatively you might be building a pipeline
from scratch and instead of wanting to run the entire pipeline from scratch you may
want to test a single job at a time.
eHive provides a script that is useful in both cases: <a href="scripts/runWorker.html"><b>runWorker.pl</b></a>.
</p>

<p>
Before debugging, the first thing you should do is look in the database or in guiHive and query the
<code>job</code> table for jobs that have failed (visible from the <samp>status</samp> column). Then
you can search for these job ids in the <code>log_message</code> table and see whether there are any log
messages that would explain why the jobs are failing.
guiHive also directly reports the last error message for each job.
If the jobs are failing because
of insufficient resources, you should be able to tell from the message itself.
If this turns out to be the case, the best thing to do is either to make a new entry
in the <code>resource_class</code> table with sufficient resources and point the analysis in
<code>analysis_base</code> to that <code>resource_class_id</code>, or to modify the existing
resource class entry directly. Again, the same can be achieved entirely in guiHive by editing the "Resources" tab, and then
the analysis itself (by clicking on it).
</p>
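<p>
For example, a minimal pair of queries for this first step (a sketch assuming the standard eHive schema; the job id is illustrative):
</p>
<pre>
        SELECT job_id, analysis_id, status, retry_count FROM job WHERE status='FAILED';    # list the failed jobs
        SELECT * FROM log_message WHERE job_id=1234;                                       # read the messages logged for one of them
</pre>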

<p>
If there is no obvious reason why the jobs are failing then it is probably time to use
<a href="scripts/runWorker.html"><b>runWorker.pl</b></a>. Here is an example of how to run it:</p>

<pre>
        <a href="scripts/runWorker.html"><b>runWorker.pl</b></a> -url mysql://user:pass@host:port/my_pipeline_db -job_id 1 -debug 1
</pre>

<p>
This runs job id 1 directly. There are a few things to note. Firstly, if
the job is considered <samp>'DONE'</samp> but you know it has failed, you will need to force <a href="scripts/runWorker.html"><b>runWorker.pl</b></a>
to run it using the <code>-force</code> flag. Alternatively you can reset the job state to <samp>READY</samp>
through <a href="scripts/beekeeper.html"><b>beekeeper.pl</b></a> using <code>-reset_job_id &lt;num&gt;</code>, and then just run <a href="scripts/runWorker.html"><b>runWorker.pl</b></a> with the
job id. Secondly, by default <a href="scripts/runWorker.html"><b>runWorker.pl</b></a> will try to run the job entirely, which may write things into files or into the database itself.
If you are worried this may happen, and if the module is well implemented, you can use the <code>-no_write</code> flag to disable the <code>write_output()</code>
method of whatever module the analysis uses. Note that some modules may
misbehave and still write output outside of <code>write_output()</code>!
Using <code>-no_write</code> will also prevent any resulting jobs that
are created from being written to the pipeline database. This is useful for both testing and
debugging.
</p>
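<p>
For example, to reset a job back to <samp>READY</samp> and re-run it without letting it write its output
(the job id is illustrative):
</p>
<pre>
        <a href="scripts/beekeeper.html"><b>beekeeper.pl</b></a> -url mysql://user:pass@host:port/my_pipeline_db -reset_job_id 1
        <a href="scripts/runWorker.html"><b>runWorker.pl</b></a> -url mysql://user:pass@host:port/my_pipeline_db -job_id 1 -no_write -debug 1
</pre>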

<p>
When trying to track down a problem via <a href="scripts/runWorker.html"><b>runWorker.pl</b></a>, the difficulty varies from analysis
to analysis and from problem to problem. If the code throws an exception, it is usually
easy to track down where to start debugging. On the other hand, sometimes the output
from <a href="scripts/runWorker.html"><b>runWorker.pl</b></a> will simply say that the job died. In these cases it is useful to consider
the general structure of analysis modules, which contain the following subroutines:
<code>fetch_input()</code>, <code>run()</code>, and <code>write_output()</code>.
A failing job should report in which of these stages the error occurred; you can then go for more fine-grained debugging.</p>

<p>As Perl objects
are essentially just hashes, the <code>Data::Dumper</code> module is a good way of checking the
contents of variables. You can do this by putting <code>use Data::Dumper;</code>
at the top of the module and then, wherever you want to inspect a variable, adding:</p>
<pre>
    $self-&gt;warning("Dumped var ".Dumper($myvar));
</pre>
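<p>
In context, inside one of the module's stages it might look like the sketch below
(the module name and parameter are hypothetical, and the module is assumed to inherit from <code>Bio::EnsEMBL::Hive::Process</code> as eHive runnables normally do):
</p>
<pre>
    package MyAnalysisModule;            # hypothetical runnable, for illustration only

    use strict;
    use warnings;
    use Data::Dumper;

    use base ('Bio::EnsEMBL::Hive::Process');

    sub fetch_input {
        my $self = shift;

        my $myvar = $self-&gt;param('paramX');                       # fetch a job parameter
        $self-&gt;warning("Dumped paramX: ".Dumper($myvar));         # record its structure as a message
    }

    1;
</pre>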
<p>
With enough warning statements any debugging problem can be overcome. Of course, you can also use the
Perl debugger.
</p>

<p>
A final note on testing/debugging is that if you want to restart an entire analysis
you can run the following command:
</p>
<pre>
        <a href="scripts/beekeeper.html"><b>beekeeper.pl</b></a> -url mysql://user:pass@host:port/my_pipeline_db -analyses_pattern MyAnalysisName -reset_all_jobs
</pre>
<p>
If you do this, it is important to note that it is your responsibility to then clean up
all the output from any jobs that had already completed. So if you were writing objects to a
database, you would have to delete them yourself, or delete any flat files that were
generated, as the pipeline is not designed to do this.
</p>

</body>
</html>