.. _manipulating_pipeline_data:

# Pipeline data manipulation

At its core, RiskScape is a general-purpose data processing engine.
It uses the :ref:`expressions` to _manipulate_ data, and it uses a data-type called **structs** to _organize_ the data.
This page will cover how the two concepts can interact to produce complex data-processing pipelines.

This page is aimed at _advanced_ RiskScape users. We expect that you have read and are familiar with:
- :ref:`types`
- :ref:`expression_tutorial`
- :ref:`advanced_pipelines`

.. note::
    This page specifically covers manipulating data from *within* a RiskScape model pipeline.
    You can alternatively manipulate input data *before* or *as* it gets loaded into the model, via bookmarks.
    Refer to :ref:`bookmarks` for more details.

## Introduction

### Data in RiskScape

Input data to a RiskScape model is often _relational_, meaning it has rows and columns, like a CSV file.
With geospatial data formats, the columns are often called _attributes_ and the rows called _features_.
For example, features (rows) and attributes (columns) in a simple geospatial file might look like this:

| ID   | Cons_Frame |    the_geom       | Use_Cat      |
|----- |------------|-------------------|--------------|
| 708  | Masonry    | -14.034, -171.611 | Residential  |
| 709  | Timber     | -14.042, -171.501 | Tourist Fale |
| 713  | Timber     | -14.040, -171.661 | Hotel        |

Structs are what RiskScape uses to keep related clumps of data together,
for example, keeping the exposure-layer attributes separate from the regional area-layer attributes.
A struct is used to represent a single row of data, or feature, 
and can consist of several different attributes, which are sometimes referred to as _struct members_.
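
As a loose analogy only (this is Python, not RiskScape syntax), one row of the table above can be pictured as a nested dictionary, where the `exposure` key plays the role of a struct and its keys are the struct members. The `region` struct and its `name` member are hypothetical, added purely to illustrate keeping two layers' attributes apart.

```python
# Analogy only: a Python dict standing in for a RiskScape struct.
# One feature (row) from the example table, held in an 'exposure' struct.
row = {
    "exposure": {
        "ID": 708,
        "Cons_Frame": "Masonry",
        "the_geom": (-14.034, -171.611),
        "Use_Cat": "Residential",
    },
    # Hypothetical second struct, e.g. attributes from a regional area-layer.
    "region": {"name": "Example region"},
}

# Struct members are addressed by name, much like exposure.Cons_Frame.
construction = row["exposure"]["Cons_Frame"]
member_names = sorted(row["exposure"])
```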

### Why it gets tricky

The RiskScape expression language is strongly _typed_ and uses type inference.
This means that RiskScape typically needs to
know the _exact_ data-types (i.e. integer, text-string) and attribute names before it can execute a pipeline.
However, usually RiskScape can work out the types itself, without you having to specify them explicitly.

One of the benefits of RiskScape models is the ability to change the input data on the fly.
However, this can completely change the attribute names and types in the model.

For example, the geometry in one exposure-layer might be called `the_geom`, whereas it's called
`Shape` in another dataset.
If the pipeline tries to access the geometry by using an `exposure.the_geom` expression,
then this model will fail to run with the second exposure-layer.

## Accessing attributes

To start with, let's look at the flexible ways you can access attributes from a struct.

To recap, the `.` operator lets you access a _single_ struct member by name.
For example, the expression `exposure.the_geom` would access the `the_geom` attribute from the `exposure` struct.

To access _all_ attributes in a struct, you can use `*` (aka the 'splat' operator),
for example `exposure.*` would access all the attributes in the `exposure` struct.

### Determining the geometry attribute dynamically

When dealing with geometry attributes, there's a simple way to access the geometry without
using a fixed attribute name, which might only work for some layers.
Typically a struct only has a single geometry attribute, and so RiskScape can usually figure out
the correct attribute to use on the fly (called _coercion_).

For example, to spatially sample the hazard-layer you might use an expression like this:

```
sample_closest(exposure.the_geom, bookmark($hazard)) as hazard
```

However, you do not need to be so explicit about the geometry attribute here.
Instead, you can use the following expression, which would work with any geospatial input layer:

```
sample_closest(exposure, bookmark($hazard)) as hazard
```

.. note::
   This approach only works when you have a *single* geometry attribute in the struct.
   Some RiskScape operations can add a *second* geometry attribute, for example
   if a bookmark contained ``set-attribute.geom = the_geom`` you would end up with *two*
   geometry attributes in the struct: ``geom`` *and* ``the_geom``. You could avoid this
   problem by either omitting the ``geom`` attribute, or by using ``map-attribute``
   instead of ``set-attribute``.
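
The idea of picking the geometry by *type* rather than by *name* can be sketched in Python. This is an illustration of the concept, not how RiskScape implements coercion: the `Point` class and `find_geometry()` helper are hypothetical, and the sketch simply refuses to guess when more than one geometry is present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical stand-in for a geometry value."""
    x: float
    y: float


def find_geometry(struct: dict) -> Point:
    """Return the only geometry-typed member, whatever it is called."""
    geoms = [v for v in struct.values() if isinstance(v, Point)]
    if len(geoms) != 1:
        raise ValueError(f"expected exactly one geometry, found {len(geoms)}")
    return geoms[0]


# Two layers that name their geometry attribute differently.
layer_a = {"ID": 708, "the_geom": Point(-14.034, -171.611)}
layer_b = {"ID": 713, "Shape": Point(-14.040, -171.661)}

# A struct with two geometry members, as described in the note above.
ambiguous = {"geom": Point(0.0, 0.0), "the_geom": Point(0.0, 0.0)}
```

Calling `find_geometry()` on `layer_a` or `layer_b` works without mentioning the attribute name, whereas `ambiguous` raises an error because there is no single obvious choice.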

### Optional attributes

One common situation is where an attribute may or may not be present in the input data.
However, if it's not present, then you may be able to use a suitable default value instead, so it's not the end of the world.
For example, you might want to do something like this:

```
# use a default seismic Z-factor of 0.3 if unknown
if_null(exposure.zfactor, 0.3) as seismic_zfactor
```

However, that expression will only work if the exposure-layer input data contains a `zfactor` attribute,
and its value happens to be null. If no `zfactor` attribute is present in the input data, then the
pipeline will fail to run.

RiskScape provides a way around this via the `get_attr()` function.
For example, the following expression will safely use the exposure-layer `zfactor` attribute if it exists,
and the default value 0.3 if it doesn't exist:

```
get_attr(exposure, 'zfactor', default: 0.3)
```

.. tip::
    This behaviour is similar to Python dictionaries, i.e. using ``dictionary.get('attribute', default)``
    instead of ``dictionary['attribute']``.
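
To make the Python comparison concrete, here is a minimal sketch of the same pattern. The dictionaries and the `seismic_zfactor()` helper are hypothetical; only the 0.3 default comes from the example above.

```python
exposure_with = {"ID": 708, "zfactor": 0.45}
exposure_without = {"ID": 709}


def seismic_zfactor(exposure: dict) -> float:
    # dict.get() returns the default when the key is absent,
    # instead of failing.
    return exposure.get("zfactor", 0.3)


# Direct indexing fails outright when the key is missing.
try:
    exposure_without["zfactor"]
    missing_raises = False
except KeyError:
    missing_raises = True
```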

#### Avoiding silent `get_attr()` failures

When using `get_attr()` you still have to be careful that your attribute name matches the
actual data. For example, if `get_attr()` is looking for `zfactor`, but the attribute in the input
data is actually called `Z_FACTOR`, then the value in the input data will be silently ignored.

The `warning()` function can help you avoid unintended effects from silently occurring in your model.
For example, you could produce a warning making it clear that no `zfactor` attribute was found in
the input data by adding the following to your pipeline:

```
input($exposures, name: 'exposure', limit: 1)
-> select({
            warning(is_null(get_attr(exposure, 'zfactor')),
                    'No `zfactor` attribute present in exposure-layer')
          }) as warnings
```

.. tip::
    Warnings can also be helpful to highlight unexpected values that might be present in the input data,
    e.g. ``warning(Loss < 0, 'Bad Loss: ' + str(Loss))``. The ``assert()`` function serves a similar purpose,
    except your model will immediately exit if the assert condition is ever hit.
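
The same "check one row and warn" idea can be sketched in Python with the standard `warnings` module. This is an analogy rather than RiskScape code; `check_attribute()` and the sample row are hypothetical.

```python
import warnings


def check_attribute(first_row: dict, name: str) -> None:
    """Warn (but keep going) if `name` is missing from the first row."""
    if first_row.get(name) is None:
        warnings.warn(f"No `{name}` attribute present in exposure-layer")


# The input data uses a different name to the one the model expects.
first_row = {"ID": 708, "Z_FACTOR": 0.45}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_attribute(first_row, "zfactor")   # mismatch, so this warns
    check_attribute(first_row, "Z_FACTOR")  # present, so this stays quiet

messages = [str(w.message) for w in caught]
```

Only the first row is inspected, which mirrors the `limit: 1` in the pipeline above: the check concerns which attributes exist, not the individual values.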

### Attribute search

RiskScape also allows you to access attributes based on a sub-string match.
For example, to extract any attributes related to the cost of the asset,
like `Replacement_Value` or `Dwelling_Value`, you could use the following
`select_attr()` expression:
 
```
select_attr(exposure, 'Value')
```

The second function argument here is a regular expression, which means that
you can craft complex attribute-matching expressions.
For example, if you were not sure of the geometry attribute's name,
but knew there were several common possibilities, you could use expressions like this:

```
# look for an exact match on one of these attribute names
select_attr(exposure, '^(?:the_geom|geometry|geom|Shape|SHAPE|GEOMETRY)$')
# or, look for any case-insensitive sub-string match
select_attr(exposure, '(?i)geom|shape')
```

.. note::
    ``select_attr()`` will always return a **struct**, whereas
    ``get_attr()`` will always return a single attribute.

.. tip::
    ``select_attr()`` will not recursively search *nested* structs,
    which is when a struct contains another struct as a member.
    For example, the return values from ``bucket()`` and ``bucket_range()`` expressions
    are nested structs. However, you can use ``flatten_struct()`` to remove
    the nesting, and *then* use ``select_attr()``.
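
If you want to try out a pattern before putting it in a pipeline, a rough Python equivalent of the sub-string matching is sketched below. The `select_keys()` helper and the sample struct are hypothetical, and the regex dialect RiskScape uses may differ from Python's `re` module in the details, so treat this as a quick sanity check only.

```python
import re


def select_keys(struct: dict, pattern: str) -> dict:
    """Keep the members whose name contains a match for `pattern`."""
    regex = re.compile(pattern)
    return {k: v for k, v in struct.items() if regex.search(k)}


exposure = {
    "ID": 708,
    "Replacement_Value": 250000,
    "Dwelling_Value": 180000,
    "SHAPE": "POINT (-171.611 -14.034)",
}

# Sub-string match: both *_Value attributes are kept.
values = select_keys(exposure, "Value")

# Exact match on one of several known geometry names.
exact = select_keys(exposure, r"^(?:the_geom|geometry|geom|Shape|SHAPE|GEOMETRY)$")

# Case-insensitive sub-string match.
loose = select_keys(exposure, r"(?i)geom|shape")
```

Like `select_attr()`, the helper always returns a collection of members (here a dictionary), even when only one name matches.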

## Dynamic data manipulations

### Conditional data types

Say you have a model that needs to run over _either_ building input data or roads.
In the buildings case, you want the model to report the total building value exposed,
whereas for roads you need to report the total length of road exposed, in kilometres.

One approach is to create two separate models that are almost identical apart from a few lines.
Another approach is to output attributes in the results that are vaguely named or don't really make sense, e.g.
the building results might contain a `Road_Length_km` column.

RiskScape provides an `if()` function, which can give you the flexibility to change the data-types
(i.e. results) that your model produces on the fly.
For example, you could have an `asset_type` model parameter that specifies whether it is building or
road data, and then change the results reported dynamically, like this:

```
group({
        sum(
            if($asset_type = 'building',
               then: () -> { get_attr(exposure, 'Rep_Cost', 0) as Building_Value },
               else: () -> { measure(exposure) / 1000 as Road_Length_km })
            ) as Exposed
      }) 
```

Note that there are two important prerequisites for changing the type dynamically in an if/else like this:
- the true/false condition in the `if()` expression must be a _constant_ expression.
  For example, checking the value of a model parameter is typically constant, whereas checking an attribute in the exposure-layer isn't.
- the `then`/`else` arguments in the `if()` expression must be _lambdas_, i.e. the expressions need to start with `() ->`.
  RiskScape then evaluates each lambda _lazily_: it only evaluates the expression when that path through the conditional is taken.
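
The reason lazy evaluation matters can be illustrated with a small Python sketch. It is an analogy only: `lazy_if()` is a hypothetical helper, and `length_m` is a hypothetical attribute standing in for the result of `measure(exposure)`. The sketch also works per row rather than summing over a group.

```python
def lazy_if(condition: bool, then, otherwise):
    """Call only the branch that the condition selects."""
    return then() if condition else otherwise()


def exposed(asset_type: str, exposure: dict) -> dict:
    return lazy_if(
        asset_type == "building",
        # Each branch is wrapped in a lambda, so nothing runs until chosen.
        then=lambda: {"Building_Value": exposure.get("Rep_Cost", 0)},
        otherwise=lambda: {"Road_Length_km": exposure["length_m"] / 1000},
    )


building = {"Rep_Cost": 250000}
road = {"length_m": 1500.0}

building_result = exposed("building", building)
road_result = exposed("road", road)
```

If both branches were evaluated eagerly, the building case would fail, because a building has no `length_m` to divide. Wrapping each branch in a lambda means the branch that does not apply is never evaluated.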

### Anonymous attribute manipulation

RiskScape allows you to manipulate the attributes in a struct without knowing their name.
This can be handy when the data type may change conditionally.

For example, in our previous example we could end up with `Exposed.Building_Value` for buildings
and `Exposed.Road_Length_km` for roads. Say we want to round the value in the `Exposed`
struct - that becomes awkward to do, because we can no longer be certain what the attribute is called.

Fortunately, the `map_struct()` function provides a generic way to manipulate struct members.
To round _all_ the values in a struct, you don't need to know the specific attribute
_names_ - you just need to be certain that the struct contains numeric values.
For example:

```
map_struct(Exposed, value -> round(value)) as Exposed
```
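
Conceptually, this is similar to applying a function to every value in a Python dictionary while leaving the keys alone. In the sketch below, `map_values()` is a hypothetical helper and the numbers are made up. Python's built-in `round()` is used for illustration and may not behave identically to RiskScape's `round()`.

```python
def map_values(struct: dict, fn) -> dict:
    """Apply `fn` to every value, keeping the member names unchanged."""
    return {name: fn(value) for name, value in struct.items()}


# The member name differs between the two cases, but the code does not care.
buildings = {"Building_Value": 1234567.89}
roads = {"Road_Length_km": 12.3456}

rounded_buildings = map_values(buildings, round)
rounded_roads = map_values(roads, round)
```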

You can also combine several different RiskScape functions to perform more complicated operations.
For example, say you wanted to round AAL results to 4 decimal places, but leave the other results as is.
You could do so with the following pipeline step:

```
select({
          *,
          merge(consequence,
                map_struct(
                           select_attr(consequence, 'AAL'),
                           AAL -> round(AAL, 4)
                          )
               ) as consequence
       })
```

There are several RiskScape functions involved here:
- `select_attr()` picks a subset of attributes to manipulate (any attributes with 'AAL' in their name)
- `map_struct()` rounds this subset of the results
- `merge()` adds the updated attribute subset back into the original `consequence` struct,
i.e. replacing the old, unrounded AAL values.
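
The select, map, then merge sequence can be traced step by step with a Python sketch. This is an analogy, not RiskScape code, and the attribute names `Max_Loss` and `AAL_ratio` and all the values are hypothetical.

```python
import re

consequence = {"AAL": 1234.567891, "Max_Loss": 98765.4321, "AAL_ratio": 0.01234567}

# 1. Pick the subset of members with 'AAL' in their name.
subset = {k: v for k, v in consequence.items() if re.search("AAL", k)}

# 2. Round just that subset to 4 decimal places.
rounded = {k: round(v, 4) for k, v in subset.items()}

# 3. Merge back: later entries replace earlier ones with the same name,
#    so the rounded values overwrite the unrounded originals.
merged = {**consequence, **rounded}
```

After the merge, `Max_Loss` keeps its original value while both AAL members hold the rounded values.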

## Saving data

Finally, a couple of RiskScape functions can be helpful when saving your model results to file.

### Removing attributes

Sometimes the model pipeline can contain a lot of information and you do not need to save it all to file.
A common annoyance is trying to save results to CSV and having geometry WKT clutter the output file.
If you know the name of an attribute, you can easily remove it from the results using the following pipeline code:

```
# remove the geometry attribute from the exposure struct before saving
select({ remove_attr(exposure, 'geom').* })
```

To remove a top-level attribute, for example the `exposure` struct itself, you would use `*` as the input struct:

```
# remove the exposure struct completely
select({ remove_attr(*, 'exposure').* })
```
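
The difference between the two cases is only *which* struct the attribute is removed from. A Python sketch of the idea, using a hypothetical `without()` helper and made-up data:

```python
def without(struct: dict, name: str) -> dict:
    """Return a copy of `struct` with the member `name` left out."""
    return {k: v for k, v in struct.items() if k != name}


row = {
    "exposure": {"ID": 708, "geom": "POINT (-171.611 -14.034)"},
    "hazard": 1.2,
}

# Remove a member from inside the exposure struct.
exposure_no_geom = without(row["exposure"], "geom")

# Remove a top-level member: the input is the whole row.
row_no_exposure = without(row, "exposure")
```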

### Flattening structs

How attributes get saved to file varies somewhat depending on the file format.
For example, an `exposure` struct containing `foo` and `bar` attributes would be saved to
a CSV or GeoJSON output file with column/attribute names like `exposure.foo` and `exposure.bar`
(i.e. the `exposure` struct is preserved in the name).
By contrast, the same struct would be saved to a GeoPackage file with attributes `foo` and `bar` (i.e. the `exposure`
struct gets dropped from the name).

In either case, using the `flatten_struct()` function before saving can be useful, e.g.

```
select({ flatten_struct(*).* })
-> save()
```

Using `flatten_struct()` would save our example `exposure` struct as `exposure_foo` and `exposure_bar`
for most file formats.

.. tip::
    This is particularly helpful when trying to save ``bucket()`` or ``bucket_range()``
    results to GeoPackage, as the bucket names get dropped from the output otherwise.
    It is also useful if you need to use a pipeline output file as input data to another
    RiskScape pipeline, as the ``.`` in the attribute names would need to be double-quoted otherwise.

Note that Shapefile attributes are limited to 10 characters, so in this example attribute
names would be truncated to ``exposure_f`` and ``exposure_b``. Using the GeoPackage format
instead of Shapefile would result in more meaningful attribute names. Alternatively,
shortening the struct name from ``exposure`` to ``e`` would result in ``e_foo`` and ``e_bar``.

```
# shapefile case
select({ flatten_struct(exposure) as e })
-> save()
```
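
The naming effects described above can be reproduced with a short Python sketch. It is an analogy only: the `flatten()` helper is hypothetical, and the 10-character truncation is simulated with a string slice rather than by writing a real Shapefile.

```python
def flatten(struct: dict, prefix: str = "", sep: str = "_") -> dict:
    """Collapse nested dicts into one level, joining names with `sep`."""
    flat = {}
    for name, value in struct.items():
        full = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, full, sep))
        else:
            flat[full] = value
    return flat


row = {"exposure": {"foo": 1, "bar": 2}}

# Flattened names, as most file formats would store them.
flat_names = list(flatten(row))

# Simulate the Shapefile limit of 10 characters per attribute name.
shapefile_names = [name[:10] for name in flat_names]

# Renaming the struct to 'e' keeps the names under the limit.
short_names = list(flatten({"e": row["exposure"]}))
```

The truncated names only differ in their final character, which is why a shorter struct name (or a different output format) gives more readable results.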