Pipeline data manipulation

At its core, RiskScape is a general-purpose data processing engine. It uses the RiskScape expression language to manipulate data, and it uses a data type called a struct to organize the data. This page covers how the two concepts interact to produce complex data-processing pipelines.

This page is aimed at advanced RiskScape users. We expect that you have read and are familiar with:

Note

This page specifically covers manipulating data from within a RiskScape model pipeline. You can alternatively manipulate input data before/as it gets loaded into the model, via bookmarks. Refer to Bookmarks for data sources for more details.

Introduction

Data in RiskScape

Input data to a RiskScape model is often relational, meaning it has rows and columns, like a CSV file. With geospatial data formats, the columns are often called attributes and the rows called features. For example, features (rows) and attributes (columns) in a simple geospatial file might look like this:

| ID  | Cons_Frame | the_geom          | Use_Cat      |
|-----|------------|-------------------|--------------|
| 708 | Masonry    | -14.034, -171.611 | Residential  |
| 709 | Timber     | -14.042, -171.501 | Tourist Fale |
| 713 | Timber     | -14.040, -171.661 | Hotel        |

Structs are what RiskScape uses to keep related clumps of data together, for example, keeping the exposure-layer attributes separate from the regional area-layer attributes. A struct is used to represent a single row of data, or feature, and can consist of several different attributes, which are sometimes referred to as struct members.

Why it gets tricky

The RiskScape expression language is strongly typed and uses type inference. This means that RiskScape typically needs to know the exact data-types (e.g. integer, text-string) and attribute names before it can execute a pipeline. However, RiskScape can usually work out the types itself, without you having to specify them explicitly.

One of the benefits of RiskScape models is the ability to change the input data on the fly. However, this can completely change the attribute names and types in the model.

For example, the geometry in one exposure-layer might be called the_geom, whereas it’s called Shape in another dataset. If the pipeline tries to access the geometry by using an exposure.the_geom expression, then the model will fail to run with the second exposure-layer.

Accessing attributes

To start with, let’s look at the flexible ways you can access attributes from a struct.

To recap, the . operator lets you access a single struct member by name. For example, the expression exposure.the_geom would access the the_geom attribute from the exposure struct.

To access all attributes in a struct, you can use * (aka the ‘splat’ operator), for example exposure.* would access all the attributes in the exposure struct.
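As a rough mental model only (this is Python, not RiskScape syntax), a struct behaves much like a dictionary with named members. The sketch below uses values from the example table plus a made-up hazard attribute; indexing by name plays the role of exposure.the_geom, and dictionary unpacking plays the role of exposure.*.

```python
# Python analogy only: a RiskScape struct modelled as a plain dict.
exposure = {
    "ID": 708,
    "Cons_Frame": "Masonry",
    "the_geom": (-14.034, -171.611),
    "Use_Cat": "Residential",
}

# Like `exposure.the_geom`: access a single member by name.
geom = exposure["the_geom"]

# Like `exposure.*`: spread every member into a new struct.
# 'hazard' is a hypothetical extra attribute added for illustration.
row = {**exposure, "hazard": 1.2}
```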

Determining the geometry attribute dynamically

When dealing with geometry attributes, there’s a simple way to access the geometry without using a fixed attribute name, which might only work for some layers. Typically a struct only has a single geometry attribute, and so RiskScape can usually figure out the correct attribute to use on the fly (called coercion).

For example, to spatially sample the hazard-layer you might use an expression like this:

sample_closest(exposure.the_geom, bookmark($hazard)) as hazard

However, you do not need to be so explicit about the geometry attribute here. Instead, you can use the following expression, which would work with any geospatial input layer:

sample_closest(exposure, bookmark($hazard)) as hazard

Note

This approach only works when you have a single geometry attribute in the struct. Some RiskScape operations can add a second geometry attribute, for example if a bookmark contained set-attribute.geom = the_geom you would end up with two geometry attributes in the struct: geom and the_geom. You could avoid this problem by either omitting the geom attribute, or by using map-attribute instead of set-attribute.
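To see why a second geometry attribute causes trouble, here is a minimal Python sketch of the idea. It is not how RiskScape implements coercion; the Point class, the find_geometry() helper and the sample layers are all hypothetical stand-ins. The lookup succeeds whatever the geometry attribute is called, as long as there is exactly one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Stand-in for a geometry value (hypothetical, for illustration)."""
    x: float
    y: float


def find_geometry(struct: dict) -> Point:
    """Return the only geometry member, whatever it is called."""
    geoms = [v for v in struct.values() if isinstance(v, Point)]
    if len(geoms) != 1:
        raise ValueError(
            f"expected exactly one geometry attribute, found {len(geoms)}"
        )
    return geoms[0]


layer_a = {"ID": 708, "the_geom": Point(-14.034, -171.611)}
layer_b = {"ID": 1, "Shape": Point(-14.042, -171.501)}
ambiguous = {"geom": Point(0.0, 0.0), "the_geom": Point(0.0, 0.0)}

# Two geometry members: there is no single obvious choice any more.
try:
    find_geometry(ambiguous)
    ambiguous_ok = True
except ValueError:
    ambiguous_ok = False
```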

Optional attributes

One common situation is where an attribute may or may not be present in the input data. However, if it’s not present, then you may be able to use a suitable default value instead, so it’s not the end of the world. For example, you might want to do something like this:

# use a default seismic Z-factor of 0.3 if unknown
if_null(exposure.zfactor, 0.3) as seismic_zfactor

However, that expression will only work if the exposure-layer input data contains a zfactor attribute, and its value happens to be null. If no zfactor attribute is present in the input data, then the pipeline will fail to run.

RiskScape provides a way around this via the get_attr() function. For example, the following expression will safely use the exposure-layer zfactor attribute if it exists, and the default value 0.3 if it doesn’t exist:

get_attr(exposure, 'zfactor', default: 0.3)

Tip

This behaviour is similar to Python dictionaries, i.e. using dictionary.get('attribute') instead of dictionary.attribute.
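The Python side of that comparison can be sketched as follows. This is an analogy rather than RiskScape code: the if_null() helper below is a hypothetical Python stand-in, and a Python KeyError happens at run time, whereas RiskScape rejects the pipeline before it runs. The point is the difference between an attribute that is present but null and one that is missing altogether.

```python
def if_null(value, default):
    """Python stand-in for if_null(): swap None for a default."""
    return default if value is None else value


with_null = {"ID": 708, "zfactor": None}  # attribute present, value is null
without = {"ID": 709}                     # attribute missing entirely

# Present but null: direct access works and the default is used.
a = if_null(with_null["zfactor"], 0.3)

# Missing: direct access fails before if_null() is even called.
try:
    if_null(without["zfactor"], 0.3)
    direct_access_failed = False
except KeyError:
    direct_access_failed = True

# dict.get() with a default copes with the missing attribute.
b = without.get("zfactor", 0.3)
```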

Avoiding silent get_attr() failures

When using get_attr() you still have to be careful that your attribute name matches the actual data. For example, if get_attr() is looking for zfactor, but the attribute in the input data is actually called Z_FACTOR, then the value in the input data will be silently ignored.

The warning() function can help you avoid unintended effects from silently occurring in your model. For example, you could produce a warning making it clear that no zfactor attribute was found in the input data by adding the following to your pipeline.

input($exposures, name: 'exposure', limit: 1)
-> select({
            warning(is_null(get_attr(exposure, 'zfactor')),
                    'No `zfactor` attribute present in exposure-layer')
          }) as warnings

Tip

Warnings can also be helpful to highlight unexpected values that might be present in the input data, e.g. warning(Loss < 0, 'Bad Loss: ' + str(Loss)). The assert() function serves a similar purpose, except your model will immediately exit if the assert condition is ever hit.
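The silent failure described in this section is easy to reproduce in a small Python analogy. The get_attr() function below is a hypothetical dict-based stand-in, not the RiskScape function, and the exposure values are made up. The lookup falls back to the default without complaint, and an explicit check up front is what makes the mismatch visible.

```python
import warnings


def get_attr(struct, name, default=None):
    """Dict-based stand-in for get_attr() (illustrative only)."""
    return struct.get(name, default)


exposure = {"ID": 708, "Z_FACTOR": 0.45}  # note the different spelling

# The lookup quietly falls back to the default; the real 0.45 is ignored.
zfactor = get_attr(exposure, "zfactor", 0.3)

# Checking once up front turns the silent fallback into a visible warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    if get_attr(exposure, "zfactor") is None:
        warnings.warn("No `zfactor` attribute present in exposure-layer")

messages = [str(w.message) for w in caught]
```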

Dynamic data manipulations

Conditional data types

Say you have a model that needs to run over either building input data or roads. In the buildings case, you want the model to report the total building value exposed, whereas for roads you need to report the total length in kilometres of road exposed.

One approach is to create two separate models that are almost identical apart from a few lines. Another approach is to output attributes in the results that are vaguely named or don’t really make sense, e.g. the building results might contain a Road_Length_km column.

RiskScape provides an if() function, which can give you the flexibility to change the data-types (i.e. results) that your model produces on the fly. For example, you could have an asset_type model parameter that specifies whether it is building or road data, and then change the results reported dynamically, like this:

group({
        sum(
            if($asset_type = 'building',
               then: () -> { get_attr(exposure, 'Rep_Cost', 0) as Building_Value },
               else: () -> { measure(exposure) / 1000 as Road_Length_km })
            ) as Exposed
      })

Note that there are two important prerequisites for changing the type dynamically in an if/else like this:

  • the true/false condition in the if() statement must be a constant expression. For example, checking the value of a model parameter is typically constant, whereas checking an attribute in the exposure-layer isn’t.

  • the then/else arguments in the if() expression must be lambdas, i.e. the expressions need to start with () ->. This means that RiskScape will lazily evaluate the expression, i.e. it only does so when that path through the conditional is taken.
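The reason lazy evaluation matters can be shown with a small Python analogy (not RiskScape syntax). The lazy_if() helper and the length_m attribute are hypothetical; length_m stands in for measure(exposure). Because each branch is wrapped in a lambda, the road-only expression is never evaluated for building data, where it would fail.

```python
def lazy_if(condition, then, otherwise):
    """Call only the branch selected by the condition."""
    return then() if condition else otherwise()


def exposed(asset_type, exposure):
    return lazy_if(
        asset_type == "building",
        then=lambda: {"Building_Value": exposure.get("Rep_Cost", 0)},
        # Only roads have a length; evaluating this for a building
        # would raise KeyError, so it must not run unless selected.
        otherwise=lambda: {"Road_Length_km": exposure["length_m"] / 1000},
    )


building = {"Rep_Cost": 250000}
road = {"length_m": 1500}

building_result = exposed("building", building)
road_result = exposed("road", road)
```

Note how the two results have differently named members, which is the same situation the Exposed struct ends up in.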

Anonymous attribute manipulation

RiskScape allows you to manipulate the attributes in a struct without knowing their names. This can be handy when the data type may change conditionally.

For example, in our previous example we could end up with Exposed.Building_Value for buildings and Exposed.Road_Length_km for roads. Say we want to round the value in the Exposed struct - that becomes awkward to do, because we can no longer be certain what the attribute is called.

Fortunately, the map_struct() function provides a generic way to manipulate struct members. To round all the values in a struct, you don’t need to know the specific attribute names - you just need to be certain that the struct contains numeric values. For example:

map_struct(Exposed, value -> round(value)) as Exposed
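In Python terms, this is roughly a dictionary comprehension that keeps the member names and transforms only the values. The sketch below is an analogy with made-up numbers, and its map_struct() is a hypothetical Python helper rather than the RiskScape function.

```python
def map_struct(struct, fn):
    """Apply fn to every value, keeping the member names unchanged."""
    return {name: fn(value) for name, value in struct.items()}


# The same call works whichever attribute name the struct ended up with.
buildings = map_struct({"Building_Value": 1234.56}, round)
roads = map_struct({"Road_Length_km": 7.8}, round)
```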

You can also combine several different RiskScape functions to perform more complicated operations. For example, say you wanted to round AAL results to 4 decimal places, but leave the other results as is. You could do so with the following pipeline step:

select({
          *,
          merge(consequence,
                map_struct(
                           select_attr(consequence, 'AAL'),
                           AAL -> round(AAL, 4)
                          )
               ) as consequence
       })

There are several RiskScape functions involved here:

  • select_attr() picks a subset of attributes to manipulate (any attributes with ‘AAL’ in their name)

  • map_struct() rounds this subset of the results

  • merge() adds the updated attribute subset back into the original consequence struct, i.e. replacing the old, unrounded AAL values.
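The three steps can be mimicked with plain Python dictionaries to make the data flow easier to follow. This is an analogy only: the helpers are hypothetical Python functions named after their RiskScape counterparts, the name matching simply follows the description above (any attribute with ‘AAL’ in its name), and the values are made up.

```python
def select_attr(struct, text):
    """Keep only the members whose name contains the given text."""
    return {k: v for k, v in struct.items() if text in k}


def map_struct(struct, fn):
    """Apply fn to every value, keeping the member names unchanged."""
    return {k: fn(v) for k, v in struct.items()}


def merge(original, updates):
    """Members in updates replace same-named members in original."""
    return {**original, **updates}


consequence = {"AAL": 1520.123456, "Max_Loss": 98765.4321}

subset = select_attr(consequence, "AAL")             # pick the AAL members
rounded = map_struct(subset, lambda v: round(v, 4))  # round just those
result = merge(consequence, rounded)                 # put them back
```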

Saving data

Finally, a couple of RiskScape functions can be helpful when saving your model results to file.

Removing attributes

Sometimes the model pipeline can contain a lot of information and you do not need to save it all to file. A common annoyance is trying to save results to CSV and having geometry WKT clutter the output file. If you know the name of an attribute, you can easily remove it from the results using the following pipeline code:

# remove the geometry attribute from the exposure struct before saving
select({ remove_attr(exposure, 'geom').* })

To remove a top-level attribute, for example the exposure struct itself, you would use * as the input struct:

# remove the exposure struct completely
select({ remove_attr(*, 'exposure').* })
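Modelled with Python dictionaries, the two cases look like this. It is an analogy with made-up data, and remove_attr() here is a hypothetical Python helper. In the first case only the remaining exposure members are selected, so they end up at the top level of the result.

```python
def remove_attr(struct, name):
    """Return a copy of the struct without the named member."""
    return {k: v for k, v in struct.items() if k != name}


row = {
    "exposure": {"ID": 708, "geom": "POINT (-171.611 -14.034)"},
    "hazard": 0.8,
}

# Like `remove_attr(exposure, 'geom').*`: the exposure members, minus geom.
without_geom = {**remove_attr(row["exposure"], "geom")}

# Like `remove_attr(*, 'exposure').*`: every top-level member but exposure.
without_exposure = remove_attr(row, "exposure")
```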

Flattening structs

How attributes get saved to file varies somewhat depending on the file format. For example, an exposure struct containing foo and bar attributes would be saved to a CSV or GeoJSON output file with column/attribute names like exposure.foo and exposure.bar (i.e. the exposure struct is preserved in the name). In contrast, the same struct would be saved to a GeoPackage file with attributes foo and bar (i.e. the exposure struct gets dropped from the name).

In either case, using the flatten_struct() function before saving can be useful, e.g.

select({ flatten_struct(*).* })
-> save()

Using flatten_struct() would save our example exposure struct as exposure_foo and exposure_bar for most file formats.
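The renaming can be pictured as flattening a nested dictionary, joining the struct name and the member name with an underscore. The sketch below is a Python analogy with a hypothetical flatten_struct() helper and made-up values, not a description of the RiskScape implementation.

```python
def flatten_struct(struct, prefix=""):
    """Flatten nested dicts into one level, joining names with '_'."""
    flat = {}
    for name, value in struct.items():
        key = f"{prefix}_{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_struct(value, key))
        else:
            flat[key] = value
    return flat


row = {"exposure": {"foo": 1, "bar": 2}, "hazard": 0.8}
flat = flatten_struct(row)
```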

Tip

This is particularly helpful when trying to save bucket() or bucket_range() results to GeoPackage, as the bucket names get dropped from the output otherwise. It is also useful if you need to use a pipeline output file as input data to another RiskScape pipeline, as the . in the attribute names would need to be double-quoted otherwise.

Note that Shapefile attributes are limited to 10 characters, so in this example attribute names would be truncated to exposure_f and exposure_b. Using the GeoPackage format instead of Shapefile would result in more meaningful attribute names. Alternatively, shortening the struct name from exposure to e would result in e_foo and e_bar.

# shapefile case
select({ flatten_struct(exposure) as e })
-> save()
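The effect of the 10-character limit is easy to check in Python. This sketch simply cuts each name at 10 characters, which is an assumption for illustration; the actual Shapefile writer may handle long or clashing names differently.

```python
SHAPEFILE_NAME_LIMIT = 10  # maximum attribute name length in a Shapefile


def truncate_names(names, limit=SHAPEFILE_NAME_LIMIT):
    """Cut each attribute name down to the given length."""
    return [name[:limit] for name in names]


long_names = truncate_names(["exposure_foo", "exposure_bar"])
short_names = truncate_names(["e_foo", "e_bar"])
```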