Ensembl/ensembl-hive

View on GitHub
docs/long_mult_example_pipeline.txt

Summary

Maintainability
Test Coverage

4   Long multiplication example pipeline.

4.1 Long multiplication pipeline solves a problem of multiplying two very long integer numbers by pretending the computations have to be done in parallel on the farm.
    While performing the task it demonstates the use of the following features:

    A) A pipeline can have multiple analyses (this one has three: 'take_b_apart', 'part_multiply' and 'add_together').

    B) A job of one analysis can create jobs of other analyses by 'flowing the data' down numbered channels or branches.
       These branches are then assigned specific analysis names in the pipeline configuration file
       (one 'take_b_apart' job flows partial multiplication subtasks down branch #2 and a task of adding them together down branch #1).

    C) Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed
    ('add_together' is blocked both by 'part_multiply').

    D) As filesystems are frequently a bottleneck for big pipelines, it is advised that eHive processes store intermediate
    and final results in a database (in this pipeline, 'accu' and 'final_result' tables are used).

4.2 The pipeline is defined in 4 files:

        * ensembl-hive/modules/Bio/EnsEMBL/Hive/Examples/LongMult/RunnableDB/DigitFactory.pm     splits a multiplication job into sub-tasks and creates corresponding jobs

        * ensembl-hive/modules/Bio/EnsEMBL/Hive/Examples/LongMult/RunnableDB/PartMultiply.pm     performs a partial multiplication and stores the intermediate result in a table

        * ensembl-hive/modules/Bio/EnsEMBL/Hive/Examples/LongMult/RunnableDB/AddTogether.pm      waits for partial multiplication results to compute and adds them together into final result

        * ensembl-hive/modules/Bio/EnsEMBL/Hive/Examples/LongMult/PipeConfig/LongMult_conf.pm    the pipeline configuration module that links the previous Runnables into one pipeline

4.3 The main part of any PipeConfig file, pipeline_analyses() method, defines the pipeline graph whose nodes are analyses and whose arcs are control and dataflow rules.
    Each analysis hash must have:
        -logic_name     string name by which this analysis is referred to,
        -module         a name of the Runnable module that contains the code to be run (several analyses can use the same Runnable)
    Optionally, it can also have:
        -input_ids      an array of hashes, each hash defining job-specific parameters (if empty it means jobs are created dynamically using dataflow mechanism)
        -parameters     usually a hash of analysis-wide parameters (each such parameter can be overriden by the same name parameter contained in an input_id hash)
        -wait_for       an array of other analyses, *controlling* this one (jobs of this analysis cannot take_b_apart before all jobs of controlling analyses have completed)
        -flow_into      usually a hash that defines dataflow rules (rules of dynamic job creation during pipeline execution) from this particular analysis.

    The meaning of these parameters should become clearer after some experimentation with the pipeline.


5   Initialization and running the long multiplication pipeline.

5.1 Before running the pipeline you will have to initialize it using init_pipeline.pl script supplying PipeConfig module and the necessary parameters.
    Have another look at LongMult_conf.pm file. The default_options() method returns a hash that pretty much defines what parameters you can/should supply to init_pipeline.pl .
    You will probably need to specify the following:

        $ init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf \
            -host=<your_mysql_host> \
            -port=<your_mysql_port> \
            -user=<your_mysql_username> \
            -password=<your_mysql_password> \

    This should create a fresh eHive database and initalize it with long multiplication pipeline data (the two numbers to be multiplied are taken from defaults).

    Upon successful completion init_pipeline.pl will print several beekeeper commands and
    a mysql command for connecting to the newly created database.
    Copy and run the mysql command in a separate shell session to follow the progress of the pipeline.

5.2 Run the first beekeeper command that contains '-sync' option. This will initialize database's internal stats and determine which jobs can be run.

5.3 Now you have two options: either to run the beekeeper.pl in automatic mode using '-loop' option and wait until it completes,
    or run it in step-by-step mode, initiating every step by separate executions of 'beekeeper.pl ... -run' command.
    We will use the step-by-step mode in order to see what is going on.

5.4 Go to mysql window and check the contents of job table:

        MySQL> SELECT * FROM job;

    It will only contain jobs that set up the multiplication tasks in 'READY' mode - meaning 'ready to be taken by workers and executed'.

    Go to the beekeeper window and run the 'beekeeper.pl ... -run' once.
    It will submit a worker to the farm that will at some point get the 'take_b_apart' job(s).

5.5 Go to mysql window again and check the contents of job table. Keep checking as the worker may spend some time in 'pending' state.

    After the first worker is done you will see that 'take_b_apart' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created.
    Also check the contents of 'accu' table, it should be empty at that moment:

        MySQL> SELECT * from accu;

    Go back to the beekeeper window and run the 'beekeeper.pl ... -run' for the second time.
    It will submit another worker to the farm that will at some point get the 'part_multiply' jobs.

5.6 Now check both 'job' and 'accu' tables again.
    At some moment 'part_multiply' jobs will have been completed and the results will go into 'accu' table;
    'add_together' jobs are still to be done.
    
    Check the contents of 'final_result' table (should be empty) and run the third and the last round of 'beekeeper.pl ... -run'

5.7 Eventually you will see that all jobs have completed and the 'final_result' table contains final result(s) of multiplication.