How to wrap new tools
The library bistro.bioinfo
offers a handful of functions to call
various tools in computational biology, but of course many are
missing. The purpose of this chapter is to demonstrate the few steps
required to make a new tool available in bistro
(a.k.a. wrapping).
A (very) basic example
As a starting point, let’s see how we’d proceed with a very silly
example: wrapping the touch command. To do so, we will use the
Bistro.Shell_dsl module, which provides many convenient functions to
create new workflow values. Here’s what it looks like:
open Bistro.Shell_dsl

let touch =
  Workflow.shell ~descr:"touch" [
    cmd "touch" [ dest ] ;
  ]
Let’s describe what we wrote:
- the first line (open statement) makes the many handy functions from Bistro.Shell_dsl visible in the current scope; many functions we describe below come from this module
- we define touch by calling a function from Bistro.Workflow named shell. As the name suggests, the workflow steps it defines are built by calling a command line in a shell
- this function takes an argument descr which can be used to give a name to the workflow. This argument is optional and is only used for display purposes, but it helps bistro display readable information when logging
- the second and last argument of Workflow.shell is a list of commands that will be executed when the workflow is run
- a command can be built with the cmd function from Bistro.Shell_dsl, which takes a string providing the name of the executable to run and a list of arguments
- arguments are of type Bistro.Shell_dsl.template, which can be seen as a representation of text with some special tokens inside, which are replaced by actual values when the command is executed
- the single argument to our command (dest) is an example of these special tokens, and represents a path where bistro expects to find the result file or directory of the workflow
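To make the idea of a template with special tokens more concrete, here is a minimal, self-contained sketch in plain OCaml. It is a toy model and not bistro’s actual implementation: the type token and the functions render_token and render_cmd are hypothetical names introduced only for this illustration, and /tmp/result is an arbitrary stand-in for the location bistro would choose.

```ocaml
(* Toy model of a command template; NOT bistro's real implementation. *)
type token =
  | S of string   (* literal text *)
  | Dest          (* placeholder for the path of the workflow result *)

(* Replace a token by concrete text once the output path is known. *)
let render_token ~dest = function
  | S s -> s
  | Dest -> dest

(* Build a command line: program name followed by rendered arguments. *)
let render_cmd ~dest prog args =
  String.concat " " (prog :: List.map (render_token ~dest) args)

let () =
  (* "/tmp/result" is a made-up destination used for the example *)
  print_endline (render_cmd ~dest:"/tmp/result" "touch" [ Dest ])
```

The point of the sketch is that the wrapper never mentions a concrete path: the Dest token is only given a value when the command is about to be run.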
Basically, defining a workflow amounts to providing a list of commands
that are expected to produce a result at the location represented by
the token dest. Note that a workflow that doesn’t use dest
is necessarily incorrect, since it has no means to produce its output
at the expected location. The value touch we have defined has type
'a path workflow, and represents a recipe (admittedly a very simple
one) to produce a result file. This type is too general and we’ll have
to restrict it to prevent run-time errors, but we’ll see that
later. Let’s now see how to make a workflow depend on a parameter.
Parameterizing workflows
Our touch workflow is a perfectly ordinary OCaml value. It’s a
data structure that describes a recipe to produce a file. Let’s write
another one which is very similar:
let echo_hello =
  Workflow.shell ~descr:"echo_hello" [
    cmd "echo" ~stdout:dest [ string "hello" ] ;
  ]
There are a few newcomers here:
- there is an argument stdout to the cmd function, which adds to the command what’s necessary to redirect its standard output to a file. Here we redirect to dest
- we see that we can form arguments from simple strings with the string function. There are other such argument constructors, like int, float and other more sophisticated ones
With this wrapper, we’ve encoded the following command line:
$ echo "hello" > $DEST
So far so good. But do we really have to write a new wrapper each time we want to change a small detail in the workflow? Of course not, instead we can simply write a function that produces our workflow:
let echo msg =
  Workflow.shell ~descr:"echo" [
    cmd "echo" ~stdout:dest [ string msg ] ;
  ]
Our workflow is now a lot more generic, since it can be used to
produce files with any content. Well, saying workflow here is slightly
incorrect, because the value echo has type string -> 'a path workflow.
It’s a function that produces workflows, but since this will
be so common, I’ll just call such functions workflows. To put it another way,
instead of writing a single script, we now have a function that can
produce a particular kind of script given a string.
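The "function that produces scripts" view can be illustrated with a small stand-alone sketch in plain OCaml. This is a toy model under stated assumptions, not bistro code: the record type command, the function render and the path /tmp/out are all hypothetical and exist only for this example. Literal arguments are not shell-quoted here, which a real tool would have to handle.

```ocaml
(* Toy model: a command with an optional redirection of standard output. *)
type token = S of string | Dest
type command = { prog : string ; args : token list ; stdout : token option }

let render_token ~dest = function S s -> s | Dest -> dest

(* Render a command as text, adding "> path" when stdout is redirected.
   No shell quoting is performed in this sketch. *)
let render ~dest c =
  let words = c.prog :: List.map (render_token ~dest) c.args in
  let redirection =
    match c.stdout with
    | None -> []
    | Some t -> [ ">" ; render_token ~dest t ]
  in
  String.concat " " (words @ redirection)

(* A function from a message to a recipe, mirroring the shape of echo. *)
let echo msg = { prog = "echo" ; args = [ S msg ] ; stdout = Some Dest }

let () =
  List.iter
    (fun msg -> print_endline (render ~dest:"/tmp/out" (echo msg)))
    [ "hello" ; "goodbye" ]
```

Each call to echo yields a distinct recipe value, so one definition covers every message we may want to write to a file.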
Depending on others
Most of the time, a computational step in a workflow will take as
input the results obtained from another one. This can be expressed
thanks to the function dep. Let’s see right away how it can be
used to wrap the program sort:
let sort text_file =
  Workflow.shell ~descr:"sort" [
    cmd "sort" ~stdout:dest [ dep text_file ] ;
  ]
The value sort thus defined is again a function, but this time its
argument is a workflow. If you ask OCaml, it will say that sort
has type 'a path workflow -> 'b path workflow. That is, given a
first workflow, this function is able to build a new one. This new
workflow will call sort, redirecting the standard output to the
expected destination and giving it text_file as an argument. More
precisely, bistro will inject the location it decided for the
output of workflow text_file in the command invoking
sort. By combining the use of dep and dest, you can write
entire collections of interdependent scripts without ever caring about
where the generated files are stored.
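The way a dependency’s location can be injected into a command may be sketched in plain OCaml as follows. This is only a toy model: the text above says that bistro decides where outputs go, but not how, so naming each result after a digest of its recipe under a cache directory is an assumption made for this illustration, and all the names below (workflow, recipe, id, path, command_line) are hypothetical.

```ocaml
(* Toy model of workflows that depend on each other; not bistro's code. *)
type workflow = { prog : string ; args : token list }
and token =
  | S of string        (* literal text *)
  | Dest               (* where this workflow must write its result *)
  | Dep of workflow    (* where another workflow wrote its result *)

(* Textual description of a recipe, including those of its dependencies. *)
let rec recipe w =
  let describe = function
    | S s -> s
    | Dest -> "<dest>"
    | Dep d -> "<" ^ id d ^ ">"
  in
  String.concat " " (w.prog :: List.map describe w.args)

(* ASSUMPTION: a result is identified by the digest of its recipe. *)
and id w = Digest.to_hex (Digest.string (recipe w))

(* Hypothetical storage layout: one entry per workflow under "cache". *)
let path w = Filename.concat "cache" (id w)

(* Replace Dest and Dep tokens by concrete paths. *)
let command_line w =
  let render = function
    | S s -> s
    | Dest -> path w
    | Dep d -> path d
  in
  String.concat " " (w.prog :: List.map render w.args)

let echo msg = { prog = "echo" ; args = [ S msg ; S ">" ; Dest ] }
let sort text_file = { prog = "sort" ; args = [ Dep text_file ; S ">" ; Dest ] }

let () =
  let hello = echo "hello" in
  print_endline (command_line hello) ;
  print_endline (command_line (sort hello))
```

In this model the author of sort never sees a path: the input location is derived from the dependency itself, and the output location from the new workflow.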
Utility functions to describe a command’s arguments
The functions string and dep are enough to describe virtually
any command-line argument to a program. In addition, the module
Bistro.Shell_dsl provides a few more utility functions that help
write concise and readable wrappers. The following code illustrates
the use of a few of them on a simplified wrapper for the bowtie
command:
let bowtie ?v index fq1 fq2 =
  Workflow.shell ~descr:"bowtie" [
    cmd "bowtie" [
      string "-S" ;
      opt "-1" dep fq1 ;
      opt "-2" dep fq2 ;
      option (opt "-v" int) v ;
      seq ~sep:"" [ dep index ; string "/index" ] ;
      dest ;
    ]
  ]
Let us examine each parameter to this command from top to bottom:
- the first argument is a simple -S switch; we encode it directly with the string function
- the second and third arguments are paths to input files, each introduced by a switch; here writing [ ... ; opt "-1" dep fq1 ; ... ] is equivalent to writing [ ... ; string "-1" ; dep fq1 ; ... ] but is shorter and more readable
- the fourth argument is optional; notice that the variable v is an optional argument to the bowtie function, so it is of type 'a option; the option function from Bistro.Shell_dsl will add nothing to the command line if v is None, or else apply its first argument to the value it holds. In that case, the applied function adds an integer argument introduced by a -v switch
- the fifth argument features a constructor called seq that can be used to concatenate a list of other chunks interspersed with a string (here the empty string); here we use it to describe a subdirectory of a workflow result
- the last argument is simply the destination where to build the result.
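The behaviour described for opt, option and seq can be mimicked with a few lines of plain OCaml. The functions below reuse the same names, but they are simplified re-implementations written for this illustration and not the ones from Bistro.Shell_dsl: here an argument is just the list of words it contributes to the command line, and the index is passed as a plain string rather than as a workflow.

```ocaml
(* Toy model: an argument is the list of words it adds to the command. *)
let string s = [ s ]
let int i = [ string_of_int i ]

(* opt flag f x: the flag, followed by whatever f produces for x. *)
let opt flag f x = flag :: f x

(* option f v: nothing for None, otherwise f applied to the held value. *)
let option f = function
  | None -> []
  | Some x -> f x

(* seq ~sep chunks: glue chunks into a single word, separated by sep. *)
let seq ~sep chunks = [ String.concat sep (List.concat chunks) ]

let cmd prog args = String.concat " " (prog :: List.concat args)

(* Simplified counterpart of the bowtie wrapper, keeping three arguments. *)
let bowtie ?v index =
  cmd "bowtie" [
    string "-S" ;
    option (opt "-v" int) v ;
    seq ~sep:"" [ string index ; string "/index" ] ;
  ]

let () =
  print_endline (bowtie "idx") ;
  print_endline (bowtie ~v:2 "idx")
```

Without the optional argument, no -v switch appears at all; with ~v:2 the switch and its value are added together, which is what makes option (opt "-v" int) v convenient.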
Typing workflows
We have seen that the Workflow.shell function, together with
Bistro.Shell_dsl, can be used to make new workflows that call
external programs. This function has of course no means to know what
the format of the result file or directory will be. For this reason,
it outputs a value of type 'a path workflow, which means a result
whose format is compatible with any other. This is obviously wrong in
the general case, and could lead to run-time errors by feeding a tool
with inputs of an unsupported format. In order to prevent such
run-time errors, we can give more precise types to our functions
producing workflows, when we have more information. Let’s see this on
an example. FASTA files have the property that when you concatenate
several of them, the result is still a FASTA file (this is of course
false for file formats in general). We are now going to write a workflow that
concatenates several FASTA files, and make sure its typing reflects
this property.
Both Bistro and Bistro_bioinfo provide a few type definitions
for annotating workflows. In particular we’ll use
Bistro_bioinfo.fasta for our example. Here’s how it looks:
open Bistro
open Bistro.Shell_dsl
open Bistro_bioinfo

let fasta_concat (x : fasta pworkflow) (y : fasta pworkflow) : fasta pworkflow =
  Workflow.shell ~descr:"fasta-concat" [
    cmd "cat" ~stdout:dest [ dep x ; dep y ] ;
  ]
Note the 'a pworkflow type which is used here, and which is a
synonym for 'a path workflow. Alternatively, you can define your
workflow in a .ml file:
open Bistro.Shell_dsl

let fasta_concat x y =
  Workflow.shell ~descr:"fasta-concat" [
    cmd "cat" ~stdout:dest [ dep x ; dep y ] ;
  ]
and constrain its type in the corresponding .mli
file:
open Bistro
open Bistro_bioinfo
val fasta_concat : fasta pworkflow -> fasta pworkflow -> fasta pworkflow
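Why such annotations catch format mix-ups at compile time can be seen on a stand-alone sketch in plain OCaml, where the type parameter is a pure label that carries no data. This is a toy model and not bistro’s definitions: the types fasta, bed and pworkflow below are local stand-ins declared for the example, shell is a hypothetical constructor, and concatenation is represented by joining file names.

```ocaml
(* Format labels: types with no values, used only as annotations. *)
type fasta
type bed

(* A result tagged with its format; the tag does not exist at run time. *)
type 'a pworkflow = Path of string

(* Like a freshly wrapped tool: the format of the result is unconstrained. *)
let shell (path : string) : 'a pworkflow = Path path

(* The annotations state that FASTA inputs give a FASTA output. *)
let fasta_concat (Path x : fasta pworkflow) (Path y : fasta pworkflow)
  : fasta pworkflow =
  Path (x ^ "+" ^ y)

let genome : fasta pworkflow = shell "genome.fa"
let peaks : bed pworkflow = shell "peaks.bed"

let ok = fasta_concat genome genome

(* The next line is rejected by the type checker, since peaks is not FASTA:
   let bad = fasta_concat genome peaks *)

let () =
  match ok with
  | Path p -> print_endline p
```

The unannotated shell plays the role of a value of type 'a path workflow: it fits anywhere, which is exactly why wrappers benefit from a more precise signature.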