.. _multi_file_hazard:

# Multi-file hazard data

There are a few different cases where you might have multiple hazard input files that you want to run through your model:

- Probabilistic modelling, where each hazard file represents a specific ARI (Average Recurrence Interval).
- Time-series modelling, where each hazard file represents a particular snapshot in a time sequence. For example, a volcanic eruption sequence may be represented by a series of hazard layers that span several months.
- Cases where you are 'stitching together' hazard data that spans multiple files. For example, a national-scale flood model may use GeoTIFFs produced for one river catchment at a time.
- Letting the user pick the hazard scenario at run-time, such as picking out a particular sea level rise increment for a climate-change model.

Specifying each hazard-layer as a separate model parameter quickly becomes impractical, as you might have a dozen or more hazard-layers, or the number of hazard-layers may vary depending on the situation. Fortunately, the RiskScape Engine's data processing abilities are flexible enough to support all these models. The approach we recommend is to load the hazard-layers via a CSV file.

.. note::
   This page covers using multiple hazard-layers at the *same* time. If you have multiple hazard-layers that you want to run through your model *one* at a time, then refer to :ref:`model_batch`.

.. tip::
   This page is aimed at advanced users who are comfortable writing pipeline code. If pipeline code is new to you, try looking at :ref:`pipelines_tutorial` and :ref:`advanced_pipelines` first.

## Loading data via CSV

To start off with, you will probably have a directory full of hazard files. The directory listing may look like this, for example:

```
Hazards/FloodDepth_upolu_5_0.tif
Hazards/FloodDepth_upolu_10_0.tif
Hazards/FloodDepth_upolu_25_0.tif
Hazards/FloodDepth_upolu_50_0.tif
Hazards/FloodDepth_upolu_100_0.tif
Hazards/FloodDepth_upolu_250_0.tif
```

The first step is to turn this directory listing into a CSV file. In addition to the filepath, you should also include any metadata you will need in your model as separate CSV columns. For a probabilistic model, this will be the return period or ARI. For a time-series model, this will be the time step or event-ID.

From the example directory listing earlier, we could create a `hazards.csv` file that looks like this:

```
filepath,island,return_period,SLR
Hazards/FloodDepth_upolu_5_0.tif,upolu,5,0
Hazards/FloodDepth_upolu_10_0.tif,upolu,10,0
Hazards/FloodDepth_upolu_25_0.tif,upolu,25,0
Hazards/FloodDepth_upolu_50_0.tif,upolu,50,0
Hazards/FloodDepth_upolu_100_0.tif,upolu,100,0
Hazards/FloodDepth_upolu_250_0.tif,upolu,250,0
```

RiskScape can now load this CSV as an input to your model. Each row of data in the CSV file will be processed one at a time in your model. However, the `filepath` attribute is simply loaded into the model as a text string, whereas in order to spatially query it, we need to turn it into :ref:`coverage_ref`.

.. note::
   The filepath in the CSV file should always be a path relative to the directory containing your ``project.ini`` file. This is how the RiskScape Engine will try to resolve the files. Alternatively, you can use full filepaths, but this makes it more difficult to share your model with others or upload the model into the RiskScape Platform.
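Before wiring the CSV into the full model, it can be worth checking that RiskScape loads it the way you expect. A minimal sketch that simply reads each row and saves it back out, so you can confirm the `filepath`, `return_period`, etc. attributes are present (the `'event-list'` output name here is arbitrary):

```
input('hazards.csv', name: 'event')
 -> save('event-list')
```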
## Building a coverage dynamically

The `bookmark()` function will turn a GeoTIFF filepath from a text-string into a coverage that can be spatially sampled. For example, `bookmark('Hazards/FloodDepth_upolu_10_0.tif')` would return a coverage that could then be passed to `sample()` functions. In this case, we simply need to call `bookmark(filepath)` for each row of data in our input `hazards.csv` file.

However, the dynamic nature of the `hazards.csv` file presents a problem. RiskScape pipeline code is a *typed* language, which means that the RiskScape Engine needs to figure out the shape of your input data (e.g. raster or vector data) *before* the model is run. However, RiskScape does not know what sort of file `filepath` is until the model is actually run.

We can solve this by defining a placeholder bookmark in our `project.ini` file. RiskScape can use this placeholder bookmark to work out the shape of the input data before the model is run. For our example data, the placeholder bookmark might look like this:

```ini
[bookmark floodmap]
location = Hazards/FloodDepth_upolu_10_0.tif
```

In the model pipeline, we can then use the placeholder bookmark ID in our `bookmark()` expression, e.g. `bookmark('floodmap')`. Here, the bookmark ID `'floodmap'` is a *constant* expression, which means it always has the same value for every row of data processed by the model. In this case, it means that the `bookmark()` expression will always return raster data (i.e. a coverage), because that matches the type of the `floodmap` bookmark.

The second part is swapping the bookmark's location on the fly, so that it uses the `filepath` from the CSV file. As long as the *type* doesn't change, the `bookmark()` expression can return a different GeoTIFF coverage for each row of data loaded from the CSV. In our example, the pipeline might look like this:

```
input('hazards.csv')
 -> select({
        *,
        bookmark('floodmap', { location: filepath }) as coverage
    })
```

.. tip::
   Make sure the placeholder bookmark you use is valid. You can check it using the command ``riskscape bookmark info BOOKMARK_ID``. You will get an error if the bookmark ID does not exist, or the ``location`` filepath in the placeholder bookmark is not valid.

### Vector data

This example uses raster data (i.e. GeoTIFF) for simplicity, but you can use the same approach if your hazard-layer is vector data (i.e. shapefile, GeoPackage, etc). When loading vector data dynamically via CSV, the input data in the files must match *exactly*. This means that:

- all files in the CSV must have the exact same attributes.
- all files in the CSV must be in the exact same CRS.

Typically when using this approach, the hazard data would have all been generated in the same manner, so *should* have the same attributes. However, if some vector files contain spurious extra attributes, it can be helpful to define a `type` in your bookmark to exclude the attributes that you don't need. For example, the following placeholder bookmark would only include the attributes `the_geom` and `Depth`:

```ini
[bookmark floodmap]
location = Hazards/FloodDepth_upolu_10_0.shp
type = struct(the_geom: geometry, Depth: floating)
```

.. tip::
   If the vector data files span several different CRS, then you could try grouping the data by CRS, e.g. if you are dealing with two different CRS then split the data into two different CSV files. Alternatively, try working in a common CRS, such as WGS84.
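The pipeline code itself does not need to change for vector data. Because the `floodmap` placeholder bookmark now describes a shapefile, the same `bookmark()` expression returns a vector coverage instead. A sketch repeating the earlier step (only the `filepath` values in the CSV differ):

```
input('hazards.csv')
 -> select({
        *,
        bookmark('floodmap', { location: filepath }) as coverage
    })
```

The difference shows up downstream: sampling a vector coverage gives you the matching feature's attributes (such as `Depth` in this example) rather than a numeric raster cell value.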
### Alternative raster approach

Instead of using a placeholder bookmark, a slightly more complicated alternative is to specify the type of the data being loaded as a third `type` argument to the `bookmark()` function. This approach works best for raster input files. Using this approach, the example pipeline would look like this:

```
input('hazards.csv')
 -> select({
        *,
        bookmark('N/A', { location: filepath }, type: 'coverage(floating)') as coverage
    })
```

.. note::
   The ``type`` argument is currently not supported for geospatial vector data. However, it could be used to load non-geospatial relations, such as CSV input data or NetCDF files.

## The rest of the model

So far we have only looked at loading the hazard-layers into the model dynamically. The rest of the model will still need to:

- Join the exposure-layer to the events
- Calculate the consequence (i.e. loss) for each event
- Aggregate the results by event

Calculating the consequence for each event works essentially the same as in a single-event model, but we will look at joining and aggregating the data in a bit more detail.

### Joining the exposure-layer

Each exposure-layer feature will be potentially exposed to every event. To do this in pipeline code, we can simply use the `join(on: true)` step to join the exposure data to the hazard data. Essentially, every row of exposure-data will be duplicated for each event (i.e. each row of data in the `hazards.csv` file).

For example, say we only had two buildings (A and B) and two events (1 and 2), then after joining the datasets and calculating the loss, the pipeline data might look like this:

| exposure   | hazard  | loss    |
| ---------- | ------- | ------- |
| Building A | event 1 | $10,000 |
| Building A | event 2 | $25,000 |
| Building B | event 1 | $0      |
| Building B | event 2 | $5,000  |

The pipeline code to join together the exposure and hazard input data might look something like this:

```
input($exposure_layer, name: 'exposure')
 -> join(on: true) as join_exposures_and_events

input('hazards.csv', name: 'event')
 -> select({
        *,
        bookmark('floodmap', { location: event.filepath }) as coverage
    })
 -> join_exposures_and_events.rhs
```

.. tip::
   You will get better performance if the smaller dataset is on the right-hand side of the join (i.e. ``.rhs``). Typically the exposure dataset will be larger than the hazard dataset, in terms of rows of data, and so the hazard dataset should connect to the ``.rhs`` of the join step.

### Aggregating the results by event

When working with multiple hazard layers, you will often want to view the total loss by event. Going back to the two-building example from the previous section, the aggregated loss would look like this:

| hazard  | loss    |
| ------- | ------- |
| event 1 | $10,000 |
| event 2 | $30,000 |

.. tip::
   For probabilistic models, you could pass the total loss to ``aal_trapz()`` to calculate the Average Annual Loss.

The pipeline code for the whole model, including aggregation, might look something like this:

```
input($exposure_layer, name: 'exposure')
 -> join(on: true) as join_exposures_and_events
 -> select({
        *,
        sample_centroid(exposure, coverage) as hazard_intensity
    })
 -> select({
        *,
        loss_function(exposure, hazard_intensity) as loss
    })
 -> group(by: event,
          select: {
              *,
              sum(loss) as Total_Loss,
              mean(hazard_intensity) as Mean_Hazard
          }) as event_loss_table
 -> save('event-loss')

input('hazards.csv', name: 'event')
 -> filter(true)
 -> select({
        *,
        bookmark('floodmap', { location: event.filepath }) as coverage
    })
 -> join_exposures_and_events.rhs
```

Note that:

- `sample_centroid()` is used here for simplicity, but you may want to use different :ref:`spatial_sampling` depending on your model.
- `loss_function()` is a placeholder for where your own loss or consequence function would go.
- If you were letting the user pick the hazard scenario at run-time, then you would simply change the `filter(true)` step to something more appropriate, e.g. `filter(event.SLR = $SLR)`.
- How you aggregate by event may vary slightly depending on your model. For example, if you were 'stitching together' hazard data that spans multiple files, you would need to be careful to aggregate by `event.id` or `event.return_period` instead of the whole `event` (which only represents *one* GeoTIFF file out of many for the event), as in the sketch after this list.
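For instance, in the 'stitching together' case the `group()` step from the pipeline above could group on the return period rather than the whole `event` struct. This is only a hedged sketch, reusing the `return_period` column from the earlier `hazards.csv` example:

```
group(by: event.return_period,
      select: {
          *,
          sum(loss) as Total_Loss,
          mean(hazard_intensity) as Mean_Hazard
      }) as event_loss_table
```

This way, losses from all the GeoTIFF files that make up a single return period end up summed into one row.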
## Event dependencies

The example pipeline so far has been processing each GeoTIFF as a separate, independent event. In some cases you may want to keep the event data together. For example:

- For a time-series model, such as cascading hazards, where the damage state of the building needs to be factored into the next event.
- For memory efficiency, when calculating a per-property AAL for a probabilistic model with a large exposure dataset.

In these cases, we can aggregate the coverages into a single list. That allows each exposure to process *all* events in one go. The pipeline code to do this would look more like this:

```
input($exposure_layer, name: 'exposure')
 -> join(on: true) as join_exposures_and_events
 -> select({
        *,
        map(events, event -> sample_centroid(exposure, event.coverage)) as hazard_intensities
    })
 -> select({
        *,
        cascading_loss_function(exposure, hazard_intensities) as loss
    })

input('hazards.csv', name: 'event')
 -> filter(true)
 -> select({
        *,
        bookmark('floodmap', { location: event.filepath }) as coverage
    })
 -> group({ to_list({ coverage, event }) as events })
 -> join_exposures_and_events.rhs
```

.. note::
   For cascading hazards, you would also need to sort the ``events`` list. When aggregating data into a list, RiskScape does not guarantee the order of the list items in any way.
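For the per-property AAL case, the same `map()` approach can turn the list of hazard intensities into a list of per-event losses for each property, which can then be fed into your AAL calculation. This is only a sketch of the exposure branch (the `hazards.csv` branch stays the same as above), reusing the placeholder `loss_function()` from earlier; the `losses` attribute name is just illustrative:

```
input($exposure_layer, name: 'exposure')
 -> join(on: true) as join_exposures_and_events
 -> select({
        *,
        map(events, event -> sample_centroid(exposure, event.coverage)) as hazard_intensities
    })
 -> select({
        *,
        map(hazard_intensities, h -> loss_function(exposure, h)) as losses
    })
```

Each property then ends up with a single row containing a list of losses (one per event), rather than one row per property-event combination, which is what makes this approach more memory-friendly for large exposure datasets.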