Progress statistics

When RiskScape runs a model, it displays real-time progress statistics on the CLI. These numbers give you an indication of how long the model will take to run.

Example output

The progress statistics focus on the number of tuples (rows of data) that RiskScape processes. Example output from a wizard model might look something like this:

Progress:
  995917 /  2025135      49.18%  exposures_input.complete
  995727 total, 20122.089/s avg: exposures.in
  995727 total, 20122.091/s avg: exposures.out
  995921 total, 20140.616/s avg: exposures_input.out
  994823 total, 19947.583/s avg: exposures_join_areas.in
  994823 total, 19947.583/s avg: exposures_join_areas.out
  995727 total, 20121.692/s avg: exposures_join_hazards.in
  995727 total, 20121.692/s avg: exposures_join_hazards.out
  995242 total, 20058.763/s avg: sample_hazard_layer.in
  995239 total, 20058.474/s avg: sample_hazard_layer.out
  993367 total, 19928.925/s avg: steps-select_3~>report_event-impact-sink.in

The first line shows you how far through the input data RiskScape is (i.e. 49.18% complete for the exposure-layer). Often you can use this as a rough guide to how far through the overall model RiskScape is.

Note

The percentage complete does not include any invalid tuples, or tuples that get removed by a bookmark filter. This means the reported progress can be inaccurate if your bookmark has a filter that removes a large number of features. Moving the filter from your bookmark into your model can make the reported progress more accurate.
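For example, say your bookmark looked something like this (a hypothetical sketch, where the ‘use’ attribute is made up for illustration):

[bookmark Residential-Exposures]
location = YOUR_EXPOSURE_DATA.gpkg
description = Only includes residential buildings in the model
# 'use' is a hypothetical attribute in your exposure data
filter = use = 'Residential'

The percentage complete is measured against every row in the file, so if only a small proportion of the rows are residential buildings, the model may finish well before the reported progress reaches 100%.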

The remaining lines show you a breakdown of the pipeline steps that are currently in progress. Each line shows you:

  • The total tuples processed so far.

  • The avg tuples processed per second.

  • The name of the pipeline step(s) that are doing the work.

The names of the pipeline steps are further broken down into in and out counts. This is because some pipeline steps can emit more tuples than they consume (e.g. unnest and join steps), and others can emit fewer tuples (e.g. filter steps).

Note

By default, a summary of these statistics is also saved to a model-run-stats.csv file in the output directory. However, viewing the statistics in real time makes it easier to see what is happening as the model runs.

Pipeline processing

In general, RiskScape tries to ‘stream’ your input data so that it is spread out through the entire pipeline.

Model processing can involve a large number of data-points, potentially more than it is practical to hold in memory all at once (especially true of probabilistic models). So RiskScape tries to move data out of memory as quickly as it can, by pushing each tuple from one end of the pipeline to the other.

Exposure-layer Input -> Geospatial matching -> Analyse consequences -> Save results

In a ‘waterfall’ approach, all the input data would be read before moving on to the next step (geospatial matching), and so on. RiskScape does not do this. Instead, RiskScape reads just enough input data to keep the rest of the pipeline steps busy. When the ‘geospatial matching’ step starts to run out of data, RiskScape reads some more input data.

The goal of this approach is to make maximal use of your CPU cores, by parallelizing work, while using your available RAM efficiently, by holding minimal data in memory at once.

This generally means that the percentage complete metric for the input step (i.e. exposures_input.complete) is often a good indication of progress through the model as a whole.

Troubleshooting performance bottlenecks

The CLI progress will highlight when a model is running slowly, but it can still be hard to pinpoint the specific bottleneck. This is partly due to the ‘streaming’ approach RiskScape takes through the data - if one step in the pipeline is particularly slow, all the pipeline steps before it also slow down, because data cannot move through the pipeline as quickly.

Tip

Before troubleshooting, check the Why is my model slow? section below to see if anything obvious applies to your model.

The simplest way to troubleshoot performance is to look at the model-run-stats.csv produced once the model completes.

Analysing performance

Once a model run completes, a model-run-stats.csv file is always written to the output directory. This file contains some basic metrics about how much data each pipeline step processed and how long it took.

The following is an example of the output from a simple pipeline that had a performance issue (since fixed - see issue GL1517).

name                                    | runtime-ms | runtime-average-ms | tuples-in | tuples-in-per-sec | tuples-out | tuples-out-per-sec | context-switches
----------------------------------------|------------|--------------------|-----------|-------------------|------------|--------------------|-----------------
steps-sample_region~>sample_region-sink | 416360     | 27045.07           | 50487     | 765.21            |            |                    | 16
exposures_input                         | 1342       | 130.31             |           |                   | 50487      | 1117.37            | 9
aggregate_by_region                     | 3          | 3                  |           |                   | 2          | 0                  | 1
aggregate_by_region-capped-sink         | 0          | 0                  | 2         | 0                 |            |                    | 1

The model-run-stats.csv file contains the following columns:

  • name: This is the name of a pipeline step, or a related group of pipeline steps, from the underlying model pipeline code. Where multiple pipeline steps have been grouped together, the name is in the format steps-FIRST_STEP~>LAST_STEP. This lets you map the performance metrics back to a particular step in your pipeline code. Note that explicitly naming each pipeline step may help you to debug the performance bottleneck.

  • runtime-ms: This is the total execution time in milliseconds for the pipeline steps, across all CPU cores. Some pipeline steps use multi-threaded processing, whereas others (for example, input()) are single-threaded. In this case the regional sampling was multi-threaded across 8 CPU cores, so (416360 / 8) / 1000 is approximately 52 seconds of elapsed time (see the worked calculation after this list). This particular model run took 60 seconds of elapsed real time, which typically includes around 5 seconds of initialization and Java start-up time that is not captured in model-run-stats.csv.

  • context-switches: This is the number of times that a ‘worker thread’ started processing this particular pipeline task. When a worker thread needs to wait for either upstream or downstream pipeline steps to catch up, it “switches context” and processes data for a different pipeline step.

  • runtime-average-ms: This is the average time a worker thread spent processing data for this particular task in one go, i.e. the average time before a context switch was needed.

  • tuples-in, tuples-out: The total rows of data either consumed (in) or produced (out) by the pipeline steps. Note that it is normal for some pipeline steps to consume no data (e.g. input() steps) or produce no data (e.g. save() steps).

  • tuples-in-per-sec, tuples-out-per-sec: The approximate number of rows of data, on average, consumed (in) or produced (out) by the pipeline steps per second. Note that this average includes time where the task was idle, waiting for upstream or downstream pipeline steps to catch up, so a slow neighbouring step can skew it.
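To make the runtime-ms arithmetic concrete, here is the worked calculation for the sample_region example above, using the numbers from the table:

    total CPU time (runtime-ms):  416360 ms
    CPU cores:                    8
    elapsed sampling time:        416360 / 8 = 52045 ms, or roughly 52 seconds

    total elapsed run time:       ~60 seconds
    start-up and initialization:  ~5 seconds (not captured in the stats)

In other words, nearly all of this model run was spent in the regional sampling step.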

The model-run-stats.csv file is sorted by runtime-ms, which means that the pipeline steps that consumed the most CPU time will be at the top of the file.

In this example, the GL1517 performance issue occurred when sampling the region. The model-run stats highlight that the sample_region pipeline step is the bottleneck, accounting for the vast majority of the CPU runtime (i.e. runtime-ms).

Note

Be careful when examining the average metrics, as slow upstream or downstream pipeline steps can skew these values. The most reliable metric is runtime-ms, as it is the total CPU runtime consumed by a pipeline step, and so is harder to misinterpret.

Temporary speed-up for debugging

If your model takes several hours to run, or is not completing at all, it is going to be very painful to troubleshoot. See if you can reduce the model run-time to 5-10 minutes by reducing the number of exposure-layer features used in the model. For example, if your model takes two hours to run with 100,000 buildings, then try reducing that down to 5,000 buildings.

One way to reduce your exposure dataset is via a RiskScape bookmark with a filter statement. For example, you could use something like the following bookmark specifically for testing performance:

[bookmark Reduced-Exposure-layer]
location = YOUR_EXPOSURE_DATA.gpkg
description = Selects a random 5% of the data to include in the model
filter = random_uniform(0, 100) <= 5

The filter statement only includes rows of data when the filter condition is true. In this case, the filter condition is generating a random number between 0 and 100 and checking if that number is less than or equal to 5, which should happen roughly 5% of the time. You can adjust the 5 in the filter line to include more or less data.
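If 5% of the data is still too slow, you can shrink the sample further. For example, changing the filter line to select roughly 1% of the rows:

filter = random_uniform(0, 100) <= 1

Bear in mind that a random filter selects a different subset of rows each time the model runs, so the results will vary slightly between runs.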

Tip

If your model is still just as slow, it may mean the problem is with the exposure-layer itself. For example, loading all the exposure data via WFS could be very slow if it is a large dataset. Another problem may be that the model is still trying to consume more RAM than is available on your system. For example, cutting a road into 1cm segments is still going to be too much work, even if you only have one road.

FAQ

Why do I not see a percentage complete?

Not all input formats support a percentage complete metric, e.g. WFS input data does not. Some formats, like Shapefile, may not report a percentage complete when a filter is configured for the bookmark.

Furthermore, some model pipelines are organized such that it is hard to report their progress. For example, you might have an exposure-layer that contains a few hundred roads, but your model then cuts these roads into small (e.g. 1-metre) pieces. In this case, RiskScape may read all the input data before it gets to the slow part of the model (the cutting).

Unfortunately, it is not practical to report a percentage complete for every step in the pipeline. Due to the nature of pipelines, most steps can easily change the total number of tuples in the model, which makes it hard to know exactly how many tuples the next step can expect to process.

Why is my model slow?

Typically, the slowest processing in a model pipeline involves geometry operations. If your model is running particularly slowly, here are some things to check.

  • Are you loading remote input data? E.g. your model uses WFS bookmarks or HTTP links. If so, try saving a local copy of the data to your file system and using that instead.

  • Are you geoprocessing your input data? E.g. cutting or buffering the input geometry. If so, you could try:

    • Filtering the input data before your cut or buffer. This means you do the expensive geometry processing for less of the input data (i.e. only the data you care about).

    • If cutting by distance, try using a larger distance.

    • Avoid doing the geoprocessing every time you run the model. Instead, you could cut or buffer the geometry once, save the result, and then use that as the input layer to your models.

  • Do you have a large (e.g. national-level) dataset and a localized hazard? If so, you could try filtering your input data by the bounds of your hazard. Refer to the geoprocessing options in the wizard.

  • Do you have large polygons in your exposure-layer and a small grid distance for your hazard-layer? And are you using all-intersections or closest sampling? E.g. if you had farmland polygons and a 1-metre resolution hazard grid, then the sampling operation may end up cutting each farm polygon into 1-metre segments. You could use centroid sampling instead. For better sampling accuracy, you could cut the polygons into a more reasonable size first (e.g. 100-metre segments), and then use centroid sampling.

  • Are you using shapefiles for the hazard-layer or area-layer? These can hold complex geometry, which can slow down spatial sampling operations. Here are a few things to check:

    • Does your area-layer contain a marine or oceanic feature that encloses the other features? If so, we recommend using bookmark filtering to remove this feature, as it will slow down spatial matching.

    • Try running the riskscape bookmark info command for your shapefile bookmarks. If this command seems to hang for a minute or more, it may indicate something is wrong with your geometry input data. You can use Ctrl+C to stop the command. Try setting validate-geometry = off in your bookmark (see the example bookmark after this list) and repeat. Refer to Invalid geometry on how to fix invalid geometry.

    • If you have large hazard-layer or area-layer shapes, try cutting the geometry into smaller pieces (e.g. 1km by 1km polygon segments) using the geoprocessing features in the wizard. This should not make any difference to the results, but it can mean that the geospatial matching is quicker because it is matching against smaller geometry.

  • Processing vector-layer multi-polygon data can be slow if the multi-polygons span distant locations. For example, if a single multi-polygon contains a polygon sub-component in Christchurch, and another polygon sub-component in Auckland. In this case, segmenting the multi-polygons may improve performance (you should be able to use a large cut-distance). This can also be an issue when using a region-layer that is a single complex multi-polygon for the entire country.

  • Are you filtering your output results when you could be filtering your input? It’s more efficient to filter out any data as early in the model as you can. So if you were filtering by region say, you would want to filter in the geoprocessing phase of your model, rather than in the reporting phase (which is after the bulk of the processing work has been done).

  • If you have lots of rows of data and are aggregating it, some aggregation operations are more efficient than others.

    • For example, count(), sum(), and mean() should be fairly efficient, whereas stddev(), percentile(), and others can consume a lot more memory. You could try temporarily replacing aggregation functions with sum() or count() as a sanity-check, and see if performance improves. You could also try filtering out unnecessary data before you aggregate it.

    • The to_list() and to_lookup_table() functions can also consume a large amount of RAM. Try reducing the number of attributes being stored, especially if they are geometry or text. Large amounts of floating point numbers can be stored more efficiently using the smallfloat type.

  • By default, the memory that RiskScape can consume is capped by Java. If you have plenty of free RAM available on your system, you could try increasing this Java limit and see if RiskScape runs more efficiently. For more details on how to do this, see Java memory utilization.

  • If your exposure-layer and hazard-layer are in different CRSs, then it can be more efficient to reproject the exposure-layer into the hazard-layer’s CRS, particularly if you have large exposure-layer features and a small hazard grid. You should be able to reproject geometry by adding a line similar to this to your bookmark:

    set-attribute.geom = reproject(*, 'EPSG:2197')
    
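As mentioned in the shapefile troubleshooting tips above, turning off geometry validation can help you diagnose slow bookmarks. A test bookmark might look something like this (a sketch - MY_AREAS.shp is a placeholder for your own file):

[bookmark Areas-No-Validation]
# MY_AREAS.shp is a placeholder - use your own area or hazard file
location = MY_AREAS.shp
description = Skips geometry validation, for performance testing only
validate-geometry = off

If riskscape bookmark info runs noticeably faster with validation turned off, that suggests the file contains invalid or overly complex geometry that is worth fixing.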

For advanced users who have written their own pipeline code, there are some additional things to check:

  • Are you running a probabilistic model with a large number of data-points? The total data-points will be your number of elements-at-risk multiplied by the number of events (see the worked example after this list). If this seems like a big number, try thinking of ways to process the data more efficiently. For example, you could filter out (i.e. remove) data you are not interested in, or use interpolation (refer to the create_continuous() function) so that your Python code gets called less often.

  • Are you manually joining two different datasets or layers together, e.g. based on a common attribute? If so, make sure that the smaller dataset is on the right-hand side (rhs) of the join step.

  • Are you doing any unnecessary spatial operations? E.g. if you are filtering your data, then do any geospatial matching to region after the filter, not before.

  • Do you have large polygons (i.e. land area) that you are sampling repeatedly against different GeoTIFFs? The repeated sampling operations will repeatedly cut your polygons, so it can be quicker to cut your polygons once up front, to match the GeoTIFF grid resolution (use the segment_by_grid() function).
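To put the probabilistic data volume mentioned above in perspective, here is a worked example (the numbers are made up for illustration):

    elements-at-risk:   100,000 buildings
    events:             10,000 simulated events
    total data-points:  100,000 x 10,000 = 1,000,000,000 tuples

A billion tuples is far more data than can be held in memory at once, which is why filtering data early, or reducing how often your Python functions get called, can make such a large difference to run-time.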