Pipelines

A pipeline is a series of data-processing steps in RiskScape. RiskScape uses pipelines to transform rows of data, or tuples, as they flow through the system.

All RiskScape models are implemented using an underlying pipeline. Advanced users can also define their own models directly in pipeline code, which lets you interact with RiskScape’s data processing at a much lower level.

Tip

For a how-to guide to writing pipelines, see How to write basic pipelines.

Steps

A pipeline is made up of one or more steps. A step is a component that processes tuples. For example, the input step outputs tuples from a data source, whilst the filter step removes tuples that do not match the filter expression.

Steps get chained together, so that the output from one step feeds into the input for another step.

Most steps have parameters that are used to alter how the step processes tuples.

RiskScape has many built-in pipeline steps. The available steps, and their associated parameters, are described in the pipeline step reference documentation.

The same help can also be accessed from within RiskScape, using the following commands:

riskscape pipeline step list
riskscape pipeline step info STEP_NAME

Defining pipelines

Pipelines can be defined in the project file as a pipeline model.

The following sections use a simple pipeline example that:

  • takes some assets

  • filters them to remove assets that are not constructed from timber

  • joins the assets to their region

We will explain how the definition works in the following sections.

Pipeline file

The pipeline can be defined in a separate text file, e.g. my_pipeline.txt:

input('assets.csv', name: 'asset')
-> filter(filter: asset.construction = 'timber')
-> join(
     on: region.name = asset.region,
     rhs: <- input('regions.shp', name: 'region')
   ) as with_region

Then a pipeline model entry is added to the project INI file with a location pointing to that file.

[model my_pipeline]
framework = pipeline
description = demonstrates how to write a pipeline

location = my_pipeline.txt

In project INI

Pipeline models can also be defined in the project INI file itself with a pipeline entry. For example:

[model my_pipeline]
framework = pipeline
description = demonstrates how to write a pipeline

pipeline = """
  input('assets.csv', name: 'asset')
  -> filter(filter: asset.construction = 'timber')
  -> join(
       on: region.name = asset.region,
       rhs: <- input('regions.shp', name: 'region')
     ) as with_region
"""

Note

Pipelines defined in an INI file must be enclosed in triple quotes (""").

Defining steps

Steps are defined in a similar syntax to functions:

step_id([optional parameter, ...])[as optional_name]

The step_id specifies the type of step to use. It must correspond to one of the IDs listed in riskscape pipeline step list. The optional_name is used to uniquely identify the step in the pipeline.

Some examples of defining the same pipeline step:

input('my_bookmark')

With a step name:

input('my_bookmark') as input_assets

With parameter keywords:

input(relation: 'my_bookmark')

If you do not specify a step name, RiskScape will assign a default identifier to each step. The default name is simply the step_id, e.g. input. To ensure the name is unique, RiskScape may also append a number that reflects the order the step appeared in the pipeline, e.g. input_2, input_3, etc.
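
As a rough illustration of the default naming, the pipeline below contains two unnamed filter steps. The asset.region value 'Wellington' is a hypothetical example, not something taken from the example data.

input('assets.csv', name: 'asset')
-> filter(filter: asset.construction = 'timber')
-> filter(filter: asset.region = 'Wellington')

Here the input step would get the default name input. The two filter steps cannot both be named filter, so RiskScape appends a number to keep them unique. The exact numbers depend on where each step sits in the pipeline, so treat names such as filter_2 as illustrative only.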

Tip

You only need to assign a step name if you want to reference your step from elsewhere in the pipeline. For example, a join step requires two inputs, so at least one of them will be a reference to another step.

Connecting steps

Steps are connected together so that the output of one step is passed to the input of the next. When multiple steps are connected together, they are called a pipeline chain or branch.

Connecting steps is done with the -> operator. For example:

source -> destination

The source and destination in this example must each be a distinct pipeline step. Each one can be either:

  • a step definition e.g. input() as my_input

  • the name of a previously defined step e.g. my_input

Tip

When you reference a previously defined step, make sure you have explicitly named the step. RiskScape will implicitly assign a unique name to each unnamed step, e.g. select({*}) might be named select_2. However, these implicit names can change as you edit your pipeline, so it is not recommended to reference an implicit step name.
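
A minimal sketch of referencing a named step follows. It assumes the same assets.csv data as the earlier example. The step names assets_input, timber_assets and brick_assets, and the 'brick' construction value, are hypothetical.

input('assets.csv', name: 'asset') as assets_input
-> filter(filter: asset.construction = 'timber') as timber_assets

assets_input
-> filter(filter: asset.construction = 'brick') as brick_assets

The first chain defines and names the input step. The second chain uses that name as its source, so the same input data feeds two separate branches.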

Connections within a step

The -> operator represents the pipeline data that’s flowing between the different steps in your pipeline. This is the operator that you will end up using most often.

There is a second, less common <- operator which is used to connect pipeline data within a pipeline step. You may see this <- operator used within the join pipeline step, for example:

input('assets.csv', name: 'asset')
-> join(
     on: region.name = asset.region,
     rhs: <- input('regions.shp', name: 'region')
   )

Join steps are something of a special case, as they connect two pipeline chains together (three pipeline steps in total). In this example, the input data from assets.csv and the input data from regions.shp are both connected to the join step. The -> operator connects the assets input to the join step. The <- operator is used within the join step to connect the regions input data as well.

Refer to Joins for more details about how this works.

The <- is used for convenience, to simplify join steps. The pipeline chain specified by the <- operator is called an anonymous pipeline, and cannot be referenced from elsewhere in your pipeline.
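
For comparison, the same join could be written with a named step instead of an anonymous pipeline. This is only a sketch: it assumes the join step's second input can be targeted as with_region.rhs, mirroring the rhs parameter name, which this page does not confirm. Check the Joins documentation before relying on it. The step name with_region comes from the earlier example.

input('assets.csv', name: 'asset')
-> join(on: region.name = asset.region) as with_region

input('regions.shp', name: 'region')
-> with_region.rhs

Written this way, the regions input is an ordinary chain, so it could be given a name and referenced from elsewhere in the pipeline.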

Saving results

When data reaches the end of a chain of pipeline steps, it gets saved to an output file. By default, the results are written as a shapefile if geometry data is present, or as CSV otherwise. The name of the last step is used as the filename, e.g. select_2.csv.

The save() pipeline step can be used to customize how the output data is saved, such as the filename and file format. For example, if you wanted to convert some shapefile data to GeoPackage, you could do so with the following simple RiskScape pipeline:

input('foo.shp')
-> save('foo', format: 'geopackage')
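
Along the same lines, save() can be placed at the end of a longer chain to control the output filename. The sketch below reuses the timber filter from the earlier example. The output name timber_assets is hypothetical, and the sketch assumes 'csv' is accepted as a format value.

input('assets.csv', name: 'asset')
-> filter(filter: asset.construction = 'timber')
-> save('timber_assets', format: 'csv')

Without the save() step, the default filename would be taken from the name of the last step in the chain.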

Some output formats also accept format-specific options. For example, the following pipeline would rasterize the shapefile data and save it to a 100m-grid GeoTIFF.

input('foo.shp')
-> save('foo', format: 'geotiff',
        options: {
          bounds: bounds(bookmark('foo.shp')),
          grid-resolution: 100
        })

RiskScape supports the output formats listed below. For more details, see also Results output.

Note

Geospatial output formats can only be used if geometries are present in the data being saved.

Advanced pipeline features

This has covered the basics of defining pipelines. You may be interested in more advanced topics, such as: