.. _progress_monitor:
# Progress statistics
When RiskScape runs a model, it displays real-time progress statistics on the CLI.
These numbers give you an indication of how long the model will take to run.
## Example output
The progress statistics output focuses on the number of tuples (rows of data) that RiskScape processes.
Example output from a wizard model might look something like this.
```none
Progress:
995917 / 2025135 49.18% exposures_input.complete
995727 total, 20122.089/s avg: exposures.in
995727 total, 20122.091/s avg: exposures.out
995921 total, 20140.616/s avg: exposures_input.out
994823 total, 19947.583/s avg: exposures_join_areas.in
994823 total, 19947.583/s avg: exposures_join_areas.out
995727 total, 20121.692/s avg: exposures_join_hazards.in
995727 total, 20121.692/s avg: exposures_join_hazards.out
995242 total, 20058.763/s avg: sample_hazard_layer.in
995239 total, 20058.474/s avg: sample_hazard_layer.out
993367 total, 19928.925/s avg: steps-select_3~>report_event-impact-sink.in
```
The first line shows you how far through the input data RiskScape is (i.e. 49.18%
complete for the exposure-layer). Often you can use this as a rough guide to how far through
the overall model RiskScape is.
.. note::
The percentage complete does not include any invalid tuples or tuples that get removed by
a bookmark filter. The reported progress could be inaccurate if your bookmark has a ``filter``
that removes a large number of features. You could potentially move the ``filter`` from your
bookmark into your model for more accurate progress.
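For example, a bookmark filter might look something like the sketch below (the bookmark name, file path, and filter expression are made up purely for illustration). Features removed by the filter are not counted as processed, so the reported percentage complete may not reflect how far through the file RiskScape actually is.

```ini
[bookmark buildings]
location = data/buildings.shp
# hypothetical filter - features it removes are not counted as processed
filter = region = 'Hawkes Bay'
```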
The remaining lines show you a breakdown of the pipeline steps that are currently in progress.
Each line shows you:
- The `total` tuples processed so far.
- The `avg` tuples processed per second.
- The name of the pipeline step(s) that are doing the work.
The pipeline step names are further broken down into `in` and `out` counts.
This is because some pipeline steps can emit more tuples than they consume (e.g. `unnest` and `join` steps),
while others can emit fewer (e.g. `filter` steps).
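For example, the statistics for a `filter` step might look like this (the step name and numbers here are hypothetical), where the `out` count is lower than the `in` count because the step has removed some tuples:

```none
152340 total, 18250.211/s avg: filter_by_region.in
48102 total, 5763.554/s avg: filter_by_region.out
```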
.. note::
By default, these statistics are also saved to a ``stats.txt`` file in the output directory.
However, viewing the statistics in real time makes it easier to see what is happening.
## Pipeline processing
In general, RiskScape tries to 'stream' your input data so that it is spread out across the entire pipeline.
Model processing can involve a large number of data-points, potentially more than it is practical
to hold in memory all at once (this is especially true of probabilistic models).
So RiskScape moves each piece of data from one end of the pipeline to the other as quickly as possible,
so that it can then be discarded from memory.
.. mermaid::
graph LR
EL("Exposure-layer
Input");
EL --> GS("Geospatial
matching");
GS --> CA("Analyse
consequences");
CA --> REP("Save
results");
%% Styling
class EL rs-box-navy
style EL color:#FFF
class CA,GS rs-box
class REP rs-box-green
In a 'waterfall' approach, *all* the input data would be read before moving on to the next step (geospatial matching), and so on.
RiskScape does not do this.
Instead, RiskScape will read just enough input data to keep the rest of the pipeline steps busy.
When the 'geospatial matching' step starts to run out of data, RiskScape will read some more input data.
The goal of this approach is to make maximal use of your CPU cores, by parallelizing work,
while using your available RAM efficiently, by holding minimal data in memory at once.
This is why the percentage-complete metric for the input step (i.e. `exposures_input.complete`)
is usually a good indication of progress through the model as a whole.
## FAQ
### Why do I not see a percentage complete?
Not all input formats support a percentage complete metric, e.g. WFS input data does not.
Some formats, like Shapefile, may not report a percentage complete when a `filter` is configured
for the bookmark.
Furthermore, some model pipelines are organized such that it is hard to report their progress.
For example, you might have an exposure-layer that contains a few hundred roads, but your model then cuts these
roads into small (e.g. 1-metre) pieces.
In this case, RiskScape may read *all* the input data before it gets to the slow part of the model (the cutting).
Unfortunately, it is not practical to report a percentage complete for *every* step in the pipeline.
Due to the nature of pipelines, most steps can easily change the total number of tuples in the model,
which makes it hard to know exactly how many tuples the next step can expect to process.
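For example, in a roads model like the one described above, the progress output might show the input step as 100% complete while a later, slower step is still working through the tuples it produced (the step names and numbers below are made up for illustration):

```none
Progress:
350 / 350 100.00% exposures_input.complete
350 total, 41.532/s avg: exposures_input.out
281204 total, 1203.118/s avg: cut_roads.out
```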
.. _slow_model_tips:
### Why is my model slow?
Typically, the slowest processing in a model pipeline involves geometry operations.
If your model is running particularly slowly, here are some things to check.
- Are you loading remote input data? E.g. your model uses WFS bookmarks or HTTP links.
If so, try saving a local copy of the data to your file system and using that instead.
- Are you geoprocessing your input data? E.g. cutting or buffering the input geometry.
If so, you could try:
- Filtering the input data *before* your cut or buffer.
This means you do the expensive geometry processing for less of the input data (i.e. only the data you care about).
- If cutting by distance, try using a larger distance.
- Avoid doing the geoprocessing *every* time you run the model.
Instead, you could cut or buffer the geometry once, save the result, and then use that as the input layer to your models.
- Do you have a large (e.g. national-level) dataset and a localized hazard?
If so, you could try filtering your input data by the bounds of your hazard.
Refer to the geoprocessing options in the wizard.
- Do you have large polygons in your exposure-layer and a small grid distance for your hazard-layer?
And are you using *all-intersections* or *closest* sampling?
E.g. if you had farmland polygons and a 1-metre resolution hazard grid, then the sampling operation
may still end up cutting each farm polygon into 1-metre segments.
You could use centroid sampling instead.
For better sampling accuracy, you could cut the polygons into more reasonably sized pieces first (e.g. 100-metre segments),
and then use centroid sampling.
- Are you using shapefiles for the hazard-layer or area-layer?
These can hold complex geometry, which can slow down spatial sampling operations.
Here are a few things to check:
- Does your area-layer contain a marine or oceanic feature that encloses the other features?
If so, we recommend removing this feature using a bookmark :ref:`bookmark_filter`, as features like this can slow down spatial matching.
- Try running the `riskscape bookmark info` command for your shapefile bookmarks (see the sketch after this list).
If this command seems to hang for a minute or more, it may indicate that something is wrong with your geometry input data.
You can use `Ctrl+C` to stop the command. Try setting `validate-geometry = off` in your bookmark and repeat.
Refer to :ref:`invalid-geometry` for how to fix invalid geometry.
- If you have large hazard-layer or area-layer shapes, try cutting the geometry into smaller pieces (e.g. 1km by 1km polygon segments)
using the geoprocessing features in the wizard.
This should not make any difference to the results, but it can mean that the geospatial matching is quicker
because it is matching against smaller geometry.
- Are you filtering your *output* results when you could be filtering your *input*?
It's more efficient to filter out any data as early in the model as you can.
So if you were filtering by region, say, you would want to filter in the *geoprocessing* phase
of your model, rather than in the *reporting* phase (which is after the bulk of the processing work has been done).
- If you are aggregating a large number of rows of data, note that some aggregation operations are more efficient than others.
For example, `count()`, `sum()`, and `mean()` should be fairly efficient, whereas `stddev()`, `percentile()`, and others
can consume a lot more memory. You could try temporarily replacing aggregation functions with `sum()` or `count()`
as a sanity-check, and see if performance improves. You could also try filtering out unnecessary data
before you aggregate it.
- By default, the memory that RiskScape can consume is capped by Java.
If you have plenty of free RAM available on your system, you could try increasing this Java limit and see if RiskScape runs more efficiently.
For more details on how to do this, see :ref:`java_memory`.
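As a rough sketch of the shapefile geometry check mentioned above (the bookmark ID, name, and file path are made up for illustration):

```none
riskscape bookmark info road_network
```

If the command hangs for a minute or more, you could try turning geometry validation off for that bookmark and repeating the check:

```ini
[bookmark road_network]
location = data/road_network.shp
# work-around to test whether validating the geometry is the slow part
validate-geometry = off
```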
If you are an advanced user who has written your own pipeline code, you could also check:
- Are you running a probabilistic model with a large number of data-points?
The total data-points will be your number of elements-at-risk multiplied by the number of events.
If this seems like a big number, try thinking of ways to process the data more efficiently.
For example, you could filter (i.e. remove) data you are not interested in,
or use interpolation (refer to the `create_continuous()` function) so that your Python code gets called less often.
- Are you manually joining two different datasets or layers together, e.g. based on a common attribute?
If so, make sure that the *smaller* dataset is on the `rhs` of the `join` step (see the sketch at the end of this list).
- Are you doing any unnecessary spatial operations?
E.g. if you are filtering your data, do any geospatial matching (such as matching to a region) *after* the filter, not before.
- Do you have large polygons (i.e. land area) that you are sampling repeatedly against different GeoTIFFs?
Each sampling operation will cut your polygons all over again, so it can be quicker to cut your polygons
*once* up front, to match the GeoTIFF grid resolution (use the `segment_by_grid()` function).
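As a very rough sketch of the `join` advice above (the file names, attribute names, and step names are invented, and the exact pipeline syntax may vary between RiskScape versions):

```none
# the larger dataset feeds the join's default (lhs) input
input('buildings.csv', name: 'building') as buildings_input
# the smaller lookup table gets connected to the join's rhs input
input('regions.csv', name: 'region') as regions_input

buildings_input -> join(on: building.region_id = region.region_id) as add_region_info
regions_input -> add_region_info.rhs
```

Putting the smaller dataset on the `rhs` generally means RiskScape holds less data in memory while the larger dataset streams through the join.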