Reusing pipeline code
Tip
This page covers an advanced pipeline concept aimed at experienced RiskScape users. If you are fairly new to RiskScape, you should make sure you are comfortable with How to write advanced pipelines first.
You may find yourself duplicating pipeline code if you need to add several models that do similar, but slightly different, things. For example, you may want to process the input data in exactly the same way, but report the results in different ways. Or the input data may be organized in different ways, but you always want to generate the same result files.
It can seem simple to just copy-paste pipeline files, but this can lead to a maintenance headache. For example, if you later want to change a common part of the pipelines, there are now several places that require changing.
There are a few approaches to avoid duplicating pipeline code:

- Pipeline model parameters. If the difference between two models is simple, then often you can make the pipeline more reusable simply by making something a model parameter. For example, varying the input data file or making an if()/else condition a model parameter (see the sketch after this list).
- Expression language functions. These let you easily reuse a complicated expression, such as an if() or switch() statement, in multiple places. This is useful when the duplicated pipeline code is contained within a single pipeline step.
- Sub-pipelines. These let you pull out a series of pipeline steps into a separate file, and then use that sub-pipeline code from multiple model pipelines. This page will cover sub-pipelines in more detail.
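For instance, here is a hedged sketch of the first approach, where a hypothetical $is_exposed model parameter supplies the if() condition, so each model can plug in its own threshold logic:

# $is_exposed is a hypothetical model parameter, e.g. one model might
# supply 'hazard > 0.5' and another 'hazard > 1.0'; the rest of the
# pipeline stays identical across models
select({ *, if($is_exposed, 'Exposed', 'Not exposed') as status })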
Tip
Sub-pipelines are also useful for managing complexity in your pipeline. For example, if you have a lot of fairly trivial pipeline code that just customizes the output results, it can be useful to split that code out into a separate sub-pipeline, so that it doesn’t distract from the core risk-modelling section of the pipeline.
Setup
This page will refer to some example sub-pipeline files. Click here to download the example project we will use in this guide. Unzip the file into the top-level project directory where you normally keep your RiskScape projects.
Note
The subpipeline step is currently a beta feature.
In order to use the subpipeline step, you need to have the Beta plugin enabled.
A simple sub-pipeline
The RiskScape subpipeline() pipeline step simply executes pipeline code located in another pipeline .txt file.
This essentially lets you run a pipeline within a pipeline, which means you can easily reuse the same sub-pipeline code in multiple models.
For example, say we have a hello_world.txt file containing the following pipeline code:
input(value: { range(1, 4) as count })
-> unnest(count)
-> select({ 'Hello, world x' + str(count) as greeting })
This is just a really basic pipeline that will generate a few rows of output data for demonstrative purposes. We can then execute this pipeline code from within another pipeline, e.g.
riskscape pipeline eval "subpipeline('hello_world.txt')"
This produces a subpipeline.csv output file:
greeting
"Hello, world x1"
"Hello, world x2"
"Hello, world x3"
Note that this is not a particularly useful sub-pipeline (we’ll start introducing more complexity,
and build up to a more realistic example), but it demonstrates that the subpipeline() step can
execute a pipeline file from within another pipeline.
Sub-pipeline parameters
Typically, a sub-pipeline will have parameters that let it be reused in more useful ways.
Parameters are denoted with a $ and work the same way as model parameters.
The difference here is that the parameters need to be passed through in the subpipeline() step.
For example, say we have a hello_world_params.txt file containing the following pipeline code:
input(value: { range(1, $number + 1) as count })
-> unnest(count)
-> select({ 'Hello, ' + $who + ' x' + str(count) as greeting })
To use this sub-pipeline file, we now need to pass parameters through in the subpipeline() step, e.g.
riskscape pipeline eval "subpipeline('hello_world_params.txt', parameters: { who: 'Bob', number: 2})"
This would then produce a subpipeline.csv output file like this:
greeting
"Hello, Bob x1"
"Hello, Bob x2"
Connecting sub-pipeline steps
The subpipeline() step becomes more versatile once you start connecting it to other pipeline steps.
So far our example has just been executing a standalone pipeline file, but the subpipeline() step
can be treated like any other pipeline step. You can connect other steps before the
subpipeline() step, after it, or both.
When connecting the subpipeline() step onto pipeline steps before it, the sub-pipeline .txt file must use
the in step reference as the starting point of the sub-pipeline.
This refers to the pipeline data coming into the sub-pipeline.
For example, say we have a hello_world_input.txt file containing the following pipeline code:
in -> select({ *, range(1, number + 1) as count })
-> unnest(count)
-> select({ 'Hello, ' + who + ' x' + str(count) as greeting })
To use this sub-pipeline file, we now need to pass pipeline input data through to the subpipeline() step, e.g.
riskscape pipeline eval "
input(value: { who: 'Bob', number: 2})
-> subpipeline('hello_world_input.txt')
"
This would then produce a subpipeline.csv output file similar to the previous parameter example:
greeting
"Hello, Bob x1"
"Hello, Bob x2"
The difference here is that the values are being passed through as pipeline data instead of parameters. However, this is still just a contrived example to demonstrate the pipeline syntax. Let’s look at a more practical example now.
Exposed population example
Let’s go back to the getting-started example data for the 2009 South Pacific Tsunami (SPT) event. Let’s say we want to see the household population that was exposed to the tsunami inundation. The only problem is that we don’t have any household population data.
However, we do have total population counts for each region, and we have residential building data. We could combine the two data sources and simply average the regional population across the households in each region to produce a rudimentary household population. We can also apply a weighting to the household population based on building size, so that a larger residential dwelling has a larger household population.
Note
This example is demonstrating pipeline concepts rather than robust modelling techniques. Linking a building to regional census data and distributing a weighted average based on building characteristics may be things you want to repeat in your own models. However, the Samoa data here has too many limitations to produce a reliable household population, e.g. Vaimauga West only has a subset of the total buildings.
The household_population.txt file is an example of a pipeline that distributes a regional population
among the residential buildings, weighted by building size.
You can look at the pipeline code in more detail if you want, but this exercise will simply use it as a sub-pipeline file.
Assigning a household population is not a simple operation that can be done in a RiskScape bookmark: it involves loading the entire dataset to calculate regional totals before any population can be assigned to individual buildings.
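As a rough illustration of why (the floor_area and region attribute names below are placeholders, not the actual names used in household_population.txt):

# sketch only: the entire building layer must be read and grouped to get
# regional totals before any individual building can be assigned a share
input($exposure, name: 'building')
-> group(select: { building.region as region,
                   sum(building.floor_area) as total_floor_area },
         by: building.region)

The regional totals can then be joined back against each building, so that every dwelling receives its weighted share of the regional population (roughly population * floor_area / total_floor_area).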
Tip
Alternatively, we could build the household population as a one-off operation, and then simply use it as a new input layer for models. However, using a sub-pipeline here allows us to easily use a different building or population layer in the model, as new data becomes available.
Running the model
We can split the full pipeline up into three sub-pipeline operations:

1. Build a new household population exposure-layer.
2. Determine the features in this exposure-layer that are exposed to the hazard.
3. Summarize the results with a regional breakdown.
The overall pipeline looks like this:
# generate a household population for the given exposure-layer
subpipeline('household_population.txt', parameters: { exposure: $exposure,
regional_population: $regional_population,
region_buffer_m: $region_buffer_m
})
->
# work out the household population exposed to the hazard
subpipeline('is_exposed.txt', parameters: { hazard: $hazard,
min_depth: $min_depth,
exposed_value: Est_HH_Population
})
->
# report the exposed results by region
subpipeline('regional_report.txt', parameters: { report_depths: $report_depths,
aggregate: {
count(exposed) as Exposed_buildings,
round(sum(exposed_value)) as Exposed_population
},
})
The sub-pipelines are designed to be reusable components that can be used in other model pipelines. The model parameters can customize the behaviour of the sub-pipelines, for example, changing the attribute(s) being aggregated.
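For instance, a model could report the deepest inundation each region experienced instead of the exposed population, just by passing a different aggregate expression. This is a hypothetical variation (it assumes the hazard attribute produced by is_exposed.txt is still present in the data):

# report the deepest sampled inundation per region instead of population
subpipeline('regional_report.txt', parameters: { report_depths: $report_depths,
                aggregate: {
                    count(exposed) as Exposed_buildings,
                    max(hazard) as Max_inundation_depth
                },
            })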
You can run the model using the following command:
riskscape model run Exposed-Population
This model will produce two output files that contain a regional breakdown of the estimated exposed population: a geospatial GeoJSON file and a tabular CSV summary.
The sub-pipelines in more detail
Sub-pipelines can be used in one of three ways, each demonstrated in the Exposed-Population model:

| Behaviour                | Data in | Data out | Example sub-pipeline     |
|--------------------------|---------|----------|--------------------------|
| Produces pipeline data   | ✘       | ✔        | household_population.txt |
| Transforms pipeline data | ✔       | ✔        | is_exposed.txt           |
| Consumes pipeline data   | ✔       | ✘        | regional_report.txt      |
For the household_population.txt sub-pipeline file, the subpipeline() step acts a lot like an input() step.
However, instead of loading static input data from file, the data is generated dynamically by executing the sub-pipeline steps
(which combine the building and population input data and turn it into a new layer).
In contrast, for the is_exposed.txt sub-pipeline file, the subpipeline() step acts more like a select() or group() step
in that it both consumes and produces pipeline data. The is_exposed.txt sub-pipeline is fairly simple:
in
->
select({ *, sample_one(geometry: exposure, coverage: bookmark($hazard)) as hazard})
->
select({ *, if_null(hazard > $min_depth, false) as exposed })
->
select({ *, if(exposed, float($exposed_value), 0.0) as exposed_value }) as out
In this case, the in step reference is used to denote the start of the pipeline.
The in step reference connects the steps within the sub-pipeline .txt file to
the pipeline data that is going into the subpipeline() step in the parent pipeline.
The last step in the sub-pipeline is named out.
This is the pipeline data that the subpipeline() step produces or emits.
Tip
Naming the last step out is optional, unless the sub-pipeline ends in multiple pipeline branches.
An alternative approach is to make sure any other pipeline branches have explicit save() steps.
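For example, a hypothetical branching sub-pipeline might look like this, where only the branch named out flows back to the parent pipeline:

# the 'out' branch is what the parent pipeline receives
in -> select({ *, hazard > $min_depth as exposed }) as classify
classify -> filter(exposed) as out
# this side branch saves its own output file, so it doesn't need a name
classify -> save('all-buildings', format: 'csv')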
Finally, for regional_report.txt, the subpipeline() step acts more like a save() step and
consumes pipeline data, but does not emit any data.
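In other words, a consuming sub-pipeline starts from in but ends in save() steps, with nothing named out. A minimal sketch (not the actual regional_report.txt code, and with placeholder attribute names) might look like:

# sketch only: aggregate the incoming data and save it; nothing flows back
in
-> group(select: { region, count(exposed) as Exposed_buildings }, by: region)
-> save('regional-summary', format: 'csv')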
Parameter expressions
You may have noticed that the parameters for the regional_report.txt sub-pipeline accept an aggregate
expression, which lets you customize the information reported in the regional results.
In this case, the aggregate parameter’s expression gets evaluated within the sub-pipeline code,
i.e. it gets evaluated separately for each region.
This is unlike a function call argument, where the expression would be evaluated once, before it gets passed to the function.
This difference is worth noting because, although they may look similar, pipeline steps and function calls are
two different things, and so behave differently.
You can test out this behaviour for yourself using the hello-world sub-pipeline. Try running the following command:
riskscape pipeline eval "subpipeline('hello_world_params.txt',
parameters: { who: random_choice(['Alice', 'Bob']), number: 10 })"
The output should give you 10 rows of data, randomly switching between Alice and Bob.
This is because the random_choice() expression gets evaluated 10 times within the sub-pipeline.
Now try running the following command, which is a function equivalent of the hello-world sub-pipeline:
riskscape expression eval "hello_world(who: random_choice(['Alice', 'Bob']), number: 10)"
This will evaluate the who function argument beforehand, and so the same randomly chosen name is used for all 10 rows.