Pipeline data manipulation
At its core, RiskScape is a general-purpose data processing engine. It uses the RiskScape expression language to manipulate data, and it uses a data-type called structs to organize the data. This page will cover how the two concepts can interact to produce complex data-processing pipelines.
This page is aimed at advanced RiskScape users. We expect that you have read and are familiar with:
Note
This page specifically covers manipulating data from within a RiskScape model pipeline. You can alternatively manipulate input data before/as it gets loaded into the model, via bookmarks. Refer to Bookmarks for data sources for more details.
Introduction
Data in RiskScape
Input data to a RiskScape model is often relational, meaning it has rows and columns, like a CSV file. With geospatial data formats, the columns are often called attributes and the rows called features. For example, features (rows) and attributes (columns) in a simple geospatial file might look like this:
| ID | Cons_Frame | the_geom | Use_Cat |
|---|---|---|---|
| 708 | Masonry | -14.034, -171.611 | Residential |
| 709 | Timber | -14.042, -171.501 | Tourist Fale |
| 713 | Timber | -14.040, -171.661 | Hotel |
Structs are what RiskScape uses to keep related clumps of data together, for example, keeping the exposure-layer attributes separate from the regional area-layer attributes. A struct is used to represent a single row of data, or feature, and can consist of several different attributes, which are sometimes referred to as struct members.
Why it gets tricky
The RiskScape expression language is strongly typed and uses type inference. This means that RiskScape typically needs to know the exact data-types (e.g. integer, text-string) and attribute names before it can execute a pipeline. However, RiskScape can usually work out the types itself, without you having to specify them explicitly.
One of the benefits of RiskScape models is the ability to change the input data on the fly. However, this can completely change the attribute names and types in the model.
For example, the geometry in one exposure-layer might be called the_geom, whereas it’s called
Shape in another dataset.
If the pipeline tries to access the geometry by using an exposure.the_geom expression,
then this model will fail to run with the second exposure-layer.
Accessing attributes
To start with, let’s look at the flexible ways you can access attributes from a struct.
To recap, the . operator lets you access a single struct member by name.
For example, the expression exposure.the_geom would access the the_geom attribute from the exposure struct.
To access all attributes in a struct, you can use * (aka the ‘splat’ operator),
for example exposure.* would access all the attributes in the exposure struct.
Determining the geometry attribute dynamically
When dealing with geometry attributes, there’s a simple way to access the geometry without using a fixed attribute name, which might only work for some layers. Typically a struct only has a single geometry attribute, and so RiskScape can usually figure out the correct attribute to use on the fly (this is called coercion).
For example, to spatially sample the hazard-layer you might use an expression like this:
sample_closest(exposure.the_geom, bookmark($hazard)) as hazard
However, you do not need to be so explicit about the geometry attribute here. Instead, you can use the following expression, which would work with any geospatial input layer:
sample_closest(exposure, bookmark($hazard)) as hazard
Note
This approach only works when you have a single geometry attribute in the struct.
Some RiskScape operations can add a second geometry attribute, for example
if a bookmark contained set-attribute.geom = the_geom you would end up with two
geometry attributes in the struct: geom and the_geom. You could avoid this
problem by either omitting the geom attribute, or by using map-attribute
instead of set-attribute.
Optional attributes
A common situation is that an attribute may or may not be present in the input data. If it’s not present, you may be able to use a suitable default value instead. For example, you might want to do something like this:
# use a default seismic Z-factor of 0.3 if unknown
if_null(exposure.zfactor, 0.3) as seismic_zfactor
However, that expression will only work if the exposure-layer input data contains a zfactor attribute,
and its value happens to be null. If no zfactor attribute is present in the input data, then the
pipeline will fail to run.
RiskScape provides a way around this via the get_attr() function.
For example, the following expression will safely use the exposure-layer zfactor attribute if it exists,
and the default value 0.3 if it doesn’t exist:
get_attr(exposure, 'zfactor', default: 0.3)
Tip
This behaviour is similar to Python dictionaries, i.e. using dictionary.get('attribute', default)
instead of dictionary['attribute'].
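To make the analogy in the tip above concrete, here is a minimal Python sketch. It is not RiskScape code: a plain dictionary stands in for an exposure struct, and the `get_attr` helper is a hypothetical stand-in that only mimics the "use a default when the attribute is missing" behaviour described in this section.

```python
# Python analogy only: the dictionary stands in for one exposure struct
# whose input data has no zfactor attribute at all.
row_without = {"ID": 709, "Cons_Frame": "Timber"}


def get_attr(struct, name, default=None):
    """Hypothetical stand-in: return the named attribute, or the default if it is absent."""
    return struct.get(name, default)


# Direct access only works when the attribute exists.
try:
    row_without["zfactor"]
    direct_access_failed = False
except KeyError:
    direct_access_failed = True

# The lookup with a default succeeds even though the attribute is missing.
zfactor = get_attr(row_without, "zfactor", default=0.3)
```

In this sketch the direct lookup fails, much as a pipeline that refers to a missing exposure.zfactor fails to run, while the lookup with a default falls back to 0.3.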
Avoiding silent get_attr() failures
When using get_attr() you still have to be careful that your attribute name matches the
actual data. For example, if get_attr() is looking for zfactor, but the attribute in the input
data is actually called Z_FACTOR, then the value in the input data will be silently ignored.
The warning() function can help you avoid unintended effects from silently occurring in your model.
For example, you could produce a warning making it clear that no zfactor attribute was found in
the input data by adding the following to your pipeline.
input($exposures, name: 'exposure', limit: 1)
-> select({
warning(is_null(get_attr(exposure, 'zfactor')),
'No `zfactor` attribute present in exposure-layer')
}) as warnings
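The same check can be sketched as a Python analogy (not RiskScape code). The `check_attribute` helper and the sample row are hypothetical; the sketch only illustrates why checking the first row for the expected attribute name exposes a mismatch such as Z_FACTOR versus zfactor.

```python
import warnings

# Python analogy only: the input data spells the attribute differently
# from what the lookup expects.
first_row = {"ID": 708, "Z_FACTOR": 0.13}


def check_attribute(struct, name):
    """Warn when the expected attribute is absent, instead of silently using a default."""
    if name not in struct:
        warnings.warn(f"No `{name}` attribute present in exposure-layer")
        return False
    return True


# Capture the warning so it can be inspected rather than printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found = check_attribute(first_row, "zfactor")

messages = [str(w.message) for w in caught]
```

Because attribute names are the same for every row of a layer, inspecting a single row is enough, which is the role of limit: 1 in the pipeline step above.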
Tip
Warnings can also be helpful to highlight unexpected values that might be present in the input data,
e.g. warning(Loss < 0, 'Bad Loss: ' + str(Loss)). The assert() function serves a similar purpose,
except your model will immediately exit if the assert condition is ever hit.
Attribute search
RiskScape also allows you to access attributes based on a sub-string match.
For example, to extract any attributes related to the cost of the asset,
like Replacement_Value or Dwelling_Value, you could use the following
select_attr() expression:
select_attr(exposure, 'Value')
The second function argument here is a regular expression, which means that you can craft complex attribute-matching expressions. For example, if you were not sure of the geometry attribute’s name, but knew there were several common possibilities, you could use expressions like this:
# look for an exact match on one of these attribute names
select_attr(exposure, '^(?:the_geom|geometry|geom|Shape|SHAPE|GEOMETRY)$')
# or, look for any case-insensitive sub-string match
select_attr(exposure, '(?i)geom|shape')
Note
select_attr() will always return a struct, whereas
get_attr() will always return a single attribute.
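To see how the two patterns above behave, here is a small Python analogy. It is not RiskScape code: the `select_attr` helper is a hypothetical stand-in built on Python's `re` module, the attribute values are illustrative only, and RiskScape's regular-expression dialect may differ from Python's in its details.

```python
import re

# Python analogy only: a dictionary stands in for the exposure struct.
exposure = {
    "ID": 708,
    "Replacement_Value": 250000,
    "Dwelling_Value": 180000,
    "Shape": "POINT (-171.611 -14.034)",
}


def select_attr(struct, pattern):
    """Hypothetical stand-in: keep every member whose name matches the regular expression."""
    regex = re.compile(pattern)
    return {name: value for name, value in struct.items() if regex.search(name)}


# Sub-string match: both cost-related attributes are kept.
values = select_attr(exposure, "Value")

# Exact match against a list of known geometry attribute names.
exact = select_attr(exposure, "^(?:the_geom|geometry|geom|Shape|SHAPE|GEOMETRY)$")

# Case-insensitive sub-string match.
loose = select_attr(exposure, "(?i)geom|shape")
```

Note that the stand-in always returns a dictionary, even when only one name matches, which mirrors select_attr() always returning a struct.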
Tip
select_attr() will not recursively search nested structs,
which is when a struct contains another struct as a member.
For example, the return values from bucket() and bucket_range() expressions
are nested structs. However, you can use flatten_struct() to remove
the nesting, and then use select_attr().
Dynamic data manipulations
Conditional data types
Say you have a model that needs to run over either building input data or roads. In the buildings case, you want the model to report the total building value exposed, whereas for roads you need to report the total length of road exposed in kilometres.
One approach is to create two separate models, that are almost identical apart from a few lines.
Another approach is to output attributes in the results that are vaguely named or don’t really make sense, e.g.
the building results might contain a Road_Length_km column.
RiskScape provides an if() function, which can give you the flexibility to change the data-types
(i.e. results) that your model produces on the fly.
For example, you could have an asset_type model parameter that specifies whether it is building or
road data, and then change the results reported dynamically, like this:
group({
sum(
if($asset_type = 'building',
then: () -> { get_attr(exposure, 'Rep_Cost', 0) as Building_Value },
else: () -> { measure(exposure) / 1000 as Road_Length_km })
) as Exposed
})
Note that there are two important prerequisites for changing the type dynamically in an if/else like this:
- the true/false condition in the if() statement must be a constant expression. For example, checking the value of a model parameter is typically constant, whereas checking an attribute in the exposure-layer isn’t.
- the then/else arguments in the if() expression must be lambdas, i.e. the expressions need to start with () ->. This means that RiskScape will lazily evaluate the expression, i.e. it only does so when that path through the conditional is taken.
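Lazy evaluation is the reason the lambdas matter, and a short Python analogy can illustrate it. This is not RiskScape code: the `if_` helper, the `length_m` member and the sample values are all hypothetical. The point is only that the branch not taken is never evaluated, so it may refer to data that does not exist for the current asset type.

```python
# Python analogy only: lambdas defer evaluation, like `() -> ...` in the if() example.
def if_(condition, then, otherwise):
    """Call only the branch selected by the condition."""
    return then() if condition else otherwise()


asset_type = "building"            # stands in for the $asset_type model parameter
exposure = {"Rep_Cost": 250000}    # building data: no road length available


def road_length_km(struct):
    # Would raise KeyError for building data, which has no "length_m" member.
    return struct["length_m"] / 1000


result = if_(
    asset_type == "building",
    then=lambda: {"Building_Value": exposure.get("Rep_Cost", 0)},
    otherwise=lambda: {"Road_Length_km": road_length_km(exposure)},
)
```

If the two branches were passed as already-evaluated values instead of lambdas, the road branch would fail on building data before the condition was even checked.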
Anonymous attribute manipulation
RiskScape allows you to manipulate the attributes in a struct without knowing their names. This can be handy when the data type may change conditionally.
For example, in our previous example we could end up with Exposed.Building_Value for buildings
and Exposed.Road_Length_km for roads. Say we want to round the value in the Exposed
struct - that becomes awkward to do, because we can no longer be certain what the attribute is called.
Fortunately, the map_struct() function provides a generic way to manipulate struct members.
To round all the values in a struct, you don’t need to know the specific attribute
names - you just need to be certain that the struct contains numeric values.
For example:
map_struct(Exposed, value -> round(value)) as Exposed
You can also combine several different RiskScape functions to perform more complicated operations. For example, say you wanted to round AAL results to 4 decimal places, but leave the other results as is. You could do so with the following pipeline step:
select({
*,
merge(consequence,
map_struct(
select_attr(consequence, 'AAL'),
AAL -> round(AAL, 4)
)
) as consequence
})
There are several RiskScape functions involved here:
- select_attr() picks a subset of attributes to manipulate (any attributes with ‘AAL’ in their name)
- map_struct() rounds this subset of the results
- merge() adds the updated attribute subset back into the original consequence struct, i.e. replacing the old, unrounded AAL values.
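The way these three functions compose can be sketched as a Python analogy. This is not RiskScape code: the three helpers are hypothetical dictionary-based stand-ins, and the attribute names and values in `consequence` are made up for illustration.

```python
import re

# Python analogy only: a dictionary stands in for the consequence struct.
consequence = {
    "AAL_building": 1234.567891,
    "AAL_contents": 0.123456,
    "max_loss": 98765.4321,
}


def select_attr(struct, pattern):
    """Keep the members whose names match the regular expression."""
    return {k: v for k, v in struct.items() if re.search(pattern, k)}


def map_struct(struct, fn):
    """Apply fn to every member value, keeping the member names."""
    return {k: fn(v) for k, v in struct.items()}


def merge(original, updates):
    """Overlay the updated members on the original struct."""
    return {**original, **updates}


rounded = merge(
    consequence,
    map_struct(select_attr(consequence, "AAL"), lambda aal: round(aal, 4)),
)
```

Only the members matching 'AAL' pass through the rounding step; max_loss is carried over untouched by the merge.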
Saving data
Finally, a couple of RiskScape functions can be helpful when saving your model results to file.
Removing attributes
Sometimes the model pipeline can contain a lot of information and you do not need to save it all to file. A common annoyance is trying to save results to CSV and having geometry WKT clutter the output file. If you know the name of an attribute, you can easily remove it from the results using the following pipeline code:
# remove the geometry attribute from the exposure struct before saving
select({ remove_attr(exposure, 'geom').* })
To remove a top-level attribute, for example the exposure struct itself, you would use * as the input struct:
# remove the exposure struct completely
select({ remove_attr(*, 'exposure').* })
Flattening structs
How attributes get saved to file varies somewhat depending on the file format.
For example, an exposure struct containing foo and bar attributes would be saved to
a CSV or GeoJSON output file with column/attribute names like exposure.foo and exposure.bar
(i.e. the exposure struct is preserved in the name).
Whereas, the same struct would be saved to a GeoPackage file with attributes foo and bar (i.e. the exposure
struct gets dropped from the name).
In either case, using the flatten_struct() function before saving can be useful, e.g.
select({ flatten_struct(*).* })
-> save()
Using flatten_struct() would save our example exposure struct as exposure_foo and exposure_bar
for most file formats.
Tip
This is particularly helpful when trying to save bucket() or bucket_range()
results to GeoPackage, as the bucket names get dropped from the output otherwise.
It is also useful if you need to use a pipeline output file as input data to another
RiskScape pipeline, as the . in the attribute names would need to be double-quoted otherwise.
Note that Shapefile attributes are limited to 10 characters, so in this example attribute
names would be truncated to exposure_f and exposure_b. Using the GeoPackage format
instead of Shapefile would result in more meaningful attribute names. Alternatively,
shortening the struct name from exposure to e would result in e_foo and e_bar.
# shapefile case
select({ flatten_struct(exposure) as e })
-> save()
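The naming effects described in this section can be checked with a short Python analogy. This is not RiskScape code: `flatten_struct` and `shapefile_names` are hypothetical stand-ins that assume nested names are joined with an underscore and that Shapefile names are simply cut off at 10 characters, as the examples above suggest.

```python
# Python analogy only: nested dictionaries stand in for nested structs.
def flatten_struct(struct, prefix=""):
    """Join nested member names with underscores, e.g. exposure.foo -> exposure_foo."""
    flat = {}
    for name, value in struct.items():
        full_name = f"{prefix}_{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_struct(value, full_name))
        else:
            flat[full_name] = value
    return flat


def shapefile_names(flat):
    """Truncate attribute names to the 10-character Shapefile limit."""
    return [name[:10] for name in flat]


long_names = flatten_struct({"exposure": {"foo": 1, "bar": 2}})
short_names = flatten_struct({"e": {"foo": 1, "bar": 2}})

truncated_long = shapefile_names(long_names)    # names lose their endings
truncated_short = shapefile_names(short_names)  # names survive intact
```

With the long struct name, the flattened names exceed 10 characters and are truncated to exposure_f and exposure_b, whereas the shortened struct name keeps e_foo and e_bar whole.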