.. _subpipelines:

# Reusing pipeline code

.. tip::
   This page covers an advanced pipeline concept aimed at experienced RiskScape users. If you are fairly new to RiskScape, you should make sure you are comfortable with :ref:`advanced_pipelines` first.

You may find yourself duplicating pipeline code if you need to add several models that do similar, but slightly different, things. For example, you may want to process the input data in exactly the same way, but report the results in different ways. Or the input data may be organized in different ways, but you always want to generate the same result files.

It can seem simple to just copy-paste pipeline files, but this can lead to a maintenance headache. For example, if you later want to change a common part of the pipelines, there are now several places that require changing.

There are a few approaches to avoid duplicating pipeline code:

- :ref:`Pipeline model parameters `. If the difference between two models is simple, then often you can make the pipeline more reusable simply by making something a model parameter. For example, varying the input data file or making an `if()`/else condition a model parameter.
- :ref:`expression_functions`. These let you easily reuse a complicated expression, such as an `if()` or `switch()` statement, in multiple places. This is useful when the duplicated pipeline code is contained within a single pipeline step.
- Sub-pipelines. These let you pull out a series of pipeline steps into a separate file, and then use that sub-pipeline code from multiple model pipelines. This page covers sub-pipelines in more detail.

.. tip::
   Sub-pipelines are also useful for managing complexity in your pipeline. For example, if you have a lot of fairly trivial pipeline code that just customizes the output results, it can be useful to split that code out into a separate sub-pipeline, so that it doesn't distract from the core risk-modelling section of the pipeline.
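As a rough sketch of the first approach, the hello-world style pipeline below makes a cut-off condition a model parameter, so two models can share the same pipeline file and differ only in the parameter value they supply. Here `$min_count` is a hypothetical parameter name, and using a `filter()` step this way is an illustrative assumption, not a recipe from this guide:

```
input(value: { range(1, 4) as count })
 -> unnest(count)
 # the same pipeline file behaves differently per model,
 # because each model supplies its own $min_count value
 -> filter(count >= $min_count)
```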
### Setup

This page will refer to some example sub-pipeline files. Click [here](../subpipelines.zip) to download the example project we will use in this guide. Unzip the file into the :ref:`top_level_dir` where you normally keep your RiskScape projects.

.. note::
   The ``subpipeline`` step is currently a beta feature. In order to use the ``subpipeline`` step, you need to have the :ref:`beta-plugin` enabled.

## A simple sub-pipeline

The RiskScape `subpipeline()` pipeline step simply executes pipeline code located in another pipeline `.txt` file. This essentially lets you run a pipeline within a pipeline, which means you can easily reuse the same sub-pipeline code in multiple models.

For example, say we have a `hello_world.txt` file containing the following pipeline code:

```
input(value: { range(1, 4) as count })
 -> unnest(count)
 -> select({ 'Hello, world x' + str(count) as greeting })
```

This is just a _really_ basic pipeline that will generate a few rows of output data for demonstrative purposes. We can then execute this pipeline code from within _another_ pipeline, e.g.

```none
riskscape pipeline eval "subpipeline('hello_world.txt')"
```

This produces a `subpipeline.csv` output file:

```none
greeting
"Hello, world x1"
"Hello, world x2"
"Hello, world x3"
```

Note that this is not a particularly useful sub-pipeline (we'll start introducing more complexity, and build up to a more realistic example), but it demonstrates that the `subpipeline()` step can execute a pipeline file from within another pipeline.

## Sub-pipeline parameters

Typically, a sub-pipeline will have parameters that let it be reused in more useful ways. Parameters are denoted with a `$` and work the same way as model parameters. The difference here is that the parameters need to be passed through in the `subpipeline()` step.
For example, say we have a `hello_world_params.txt` file containing the following pipeline code:

```
input(value: { range(1, $number + 1) as count })
 -> unnest(count)
 -> select({ 'Hello, ' + $who + ' x' + str(count) as greeting })
```

To use this sub-pipeline file, we now need to pass parameters through in the `subpipeline()` step, e.g.

```none
riskscape pipeline eval "subpipeline('hello_world_params.txt', parameters: { who: 'Bob', number: 2 })"
```

This would then produce a `subpipeline.csv` output file like this:

```none
greeting
"Hello, Bob x1"
"Hello, Bob x2"
```

## Connecting sub-pipeline steps

The `subpipeline()` step becomes more versatile once you start connecting it to other pipeline steps. So far our example has just been executing a standalone pipeline file, but the `subpipeline()` step can be treated like any other pipeline step. You can :ref:`connect ` steps before the `subpipeline()` step, after it, or both.

When connecting the `subpipeline()` step to pipeline steps before it, the sub-pipeline `.txt` file must use the `in` step reference as the starting point of the sub-pipeline. This refers to the pipeline data coming into the sub-pipeline.

For example, say we have a `hello_world_input.txt` file containing the following pipeline code:

```
in
 -> select({ *, range(1, number + 1) as count })
 -> unnest(count)
 -> select({ 'Hello, ' + who + ' x' + str(count) as greeting })
```

To use this sub-pipeline file, we now need to pass pipeline input data through to the `subpipeline()` step, e.g.

```none
riskscape pipeline eval "
input(value: { who: 'Bob', number: 2 })
 -> subpipeline('hello_world_input.txt')
"
```

This would then produce a `subpipeline.csv` output file similar to the previous parameter example:

```none
greeting
"Hello, Bob x1"
"Hello, Bob x2"
```

The difference here is that the values are being passed through as pipeline data instead of parameters. However, this is still just a contrived example to demonstrate the pipeline syntax.
Let's look at a more practical example now.

## Exposed population example

Let's go back to the :ref:`getting-started example data ` for the 2009 South Pacific Tsunami (SPT) event. Say we want to see the household population that was exposed to the tsunami inundation. The only problem is that we don't have any household population data.

However, we do have total population counts for each region, and we have residential building data. We could combine the two data sources and simply average the regional population across the households in each region to produce a rudimentary household population. We can also apply a weighting to the household population based on building size, so that a larger residential dwelling has a larger household population.

.. note::
   This example is demonstrating pipeline concepts rather than robust modelling techniques. Linking a building to regional census data and distributing a weighted average based on building characteristics may be things you want to repeat in your own models. However, the Samoa data here has too many limitations to produce a reliable household population, e.g. Vaimauga West only has a subset of the total buildings.

The `household_population.txt` file is an example of a pipeline that distributes a regional population among the residential buildings, weighted by building size. You can look at the pipeline code in more detail if you want, but this exercise will simply use it as a sub-pipeline file.

Assigning a household population is not a simple operation that can be done in a RiskScape bookmark, because it involves loading the entire dataset to calculate totals before we can start assigning a household population to individual buildings.

.. tip::
   Alternatively, we could build the household population as a one-off operation, and then simply use it as a new input layer for models.
   However, using a sub-pipeline here allows us to easily use a different building or population layer in the model as new data becomes available.

### Running the model

We can split the full pipeline up into three sub-pipeline operations:

- Build a new household population exposure-layer
- Determine the features in this exposure-layer that are exposed to the hazard
- Summarize the results with a regional breakdown

The overall pipeline looks like this:

```
# generate a household population for the given exposure-layer
subpipeline('household_population.txt',
    parameters: {
        exposure: $exposure,
        regional_population: $regional_population,
        region_buffer_m: $region_buffer_m
    })
->
# work out the household population exposed to the hazard
subpipeline('is_exposed.txt',
    parameters: {
        hazard: $hazard,
        min_depth: $min_depth,
        exposed_value: Est_HH_Population
    })
->
# report the exposed results by region
subpipeline('regional_report.txt',
    parameters: {
        report_depths: $report_depths,
        aggregate: {
            count(exposed) as Exposed_buildings,
            round(sum(exposed_value)) as Exposed_population
        },
    })
```

The sub-pipelines are designed to be reusable components that can be used in other model pipelines. The model parameters can customize the behaviour of the sub-pipelines, for example, changing the attribute(s) being aggregated.

You can run the model using the following command:

```none
riskscape model run Exposed-Population
```

This model will produce two output files that contain a regional breakdown of the estimated exposed population: a geospatial GeoJSON file and a tabular CSV summary.
### The sub-pipelines in more detail

Sub-pipelines can be used in one of three ways, each demonstrated in the Exposed-Population model:

| Behaviour                | Data in | Data out | Example sub-pipeline       |
| ------------------------ | ------- | -------- | -------------------------- |
| Produces pipeline data   | ✘       | ✔        | `household_population.txt` |
| Transforms pipeline data | ✔       | ✔        | `is_exposed.txt`           |
| Consumes pipeline data   | ✔       | ✘        | `regional_report.txt`      |

For the `household_population.txt` sub-pipeline file, the `subpipeline()` step acts a lot like an `input()` step. However, instead of loading static input data from file, the data is generated dynamically by executing the sub-pipeline steps (which combine the building and population input data and turn it into a new layer).

Whereas for the `is_exposed.txt` sub-pipeline file, the `subpipeline()` step acts more like a `select()` or `group()` step, in that it both consumes and produces pipeline data. The `is_exposed.txt` sub-pipeline is fairly simple:

```
in
 -> select({ *, sample_one(geometry: exposure, coverage: bookmark($hazard)) as hazard })
 -> select({ *, if_null(hazard > $min_depth, false) as exposed })
 -> select({ *, if(exposed, float($exposed_value), 0.0) as exposed_value }) as out
```

In this case, the `in` step reference is used to denote the start of the pipeline. The `in` step reference connects the steps within the sub-pipeline `.txt` file to the pipeline data that is going into the ``subpipeline()`` step in the parent pipeline.

The last step in the sub-pipeline is named `out`. This is the pipeline data that the ``subpipeline()`` step produces, or emits.

.. tip::
   Naming the last step ``out`` is optional, unless the sub-pipeline ends in multiple pipeline branches. An alternative approach is to make sure any other pipeline branches have explicit ``save()`` steps.

Finally, for `regional_report.txt`, the `subpipeline()` step acts more like a `save()` step: it consumes pipeline data, but does not emit any data.
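To convey the shape of the consuming case, a sub-pipeline that only consumes data would typically start with `in` and end in an explicit `save()` step rather than an `out` step. The sketch below is *not* the actual contents of `regional_report.txt`; the `group()` and `save()` usage here is an illustrative assumption, shown only to suggest what such a file might look like:

```
# a consuming sub-pipeline: data flows in, but nothing is emitted back
in
 -> group(by: Region,
          select: { Region, $aggregate })
 -> save('regional-summary', format: 'csv')
```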
### Parameter expressions

You may have noticed that the parameters for the `regional_report.txt` sub-pipeline accept an `aggregate` expression, which lets you customize the information reported in the regional results. In this case, the `aggregate` parameter's expression gets evaluated _within_ the sub-pipeline code, i.e. it gets evaluated separately for each region. This is in contrast to a _function call argument_, where the expression would be evaluated _before_ it gets passed to the function. This difference is worth noting because, although they may look similar, pipeline steps and function calls are two different things, and so they behave differently.

You can test out this behaviour for yourself using the hello-world sub-pipeline. Try running the following command:

```none
riskscape pipeline eval "subpipeline('hello_world_params.txt', parameters: { who: random_choice(['Alice', 'Bob']), number: 10 })"
```

The output should give you 10 rows of data, randomly switching between Alice and Bob. This is because the `random_choice()` expression gets evaluated 10 times _within_ the sub-pipeline.

Now try running the following command, which is a _function_ equivalent of the hello-world sub-pipeline:

```none
riskscape expression eval "hello_world(who: random_choice(['Alice', 'Bob']), number: 10)"
```

This will evaluate the `who` function argument _beforehand_, and so the same random choice is used repeatedly across all 10 rows.