Python step
As well as supporting CPython functions within RiskScape expressions, RiskScape also supports
a python() pipeline step that can process the entire dataset using CPython code. This feature
lets you use libraries like Pandas and NumPy across the whole dataset, rather than a row at a time.
Note
In order to use the python step, you need to have the Beta plugin enabled and have configured RiskScape to use CPython.
The following table highlights some of the differences between a Python function and a Python pipeline step in RiskScape:
| Python function | Python step |
|---|---|
| Executed for one row of data at a time | Executed for the whole dataset at once |
| Called from any RiskScape expression | Called from pipeline code |
| Impractical to use with a Pandas DataFrame | Integrates nicely with a Pandas DataFrame |
| Intended for calculating a single loss/damage | Intended for aggregated statistics, plots, outputs |
Note
This page is for advanced users experienced with writing pipeline code. If you are just looking for a simple way to process your data with Python, we recommend starting with Python functions.
A loss statistics example
Consider the case where you want to use numpy to compute some loss statistics
(such as an Average Annual Loss, or AAL) from an event
loss table in RiskScape. Say we have a compute-aal.py file with the following skeleton Python code:
# some kind of pandas/numpy concoction
def compute_aal(dataframe):
    aal = 0
    # figure out aal properly somehow...
    return aal
To integrate this code into your pipeline, you could add the following to your pipeline code:
event_loss
->
python(
script: 'compute-aal.py',
result-type: 'struct(aal: floating, peril: text)'
)
This would send all the tuples (rows of pipeline data) from the event_loss step in your pipeline to
the compute-aal.py script.
Then add to your compute-aal.py Python script:
import pandas as pd

def function(rows):
    # 1. construct a dataframe from all the rows
    df = pd.DataFrame(rows)
    # 2. call your aal function (from the first example)
    aal_eq = compute_aal(df['eq_loss'])
    # 3. return a result to riskscape
    yield {'aal': aal_eq, 'peril': 'earthquake'}
First, the script turns all the rows of pipeline data into a Pandas DataFrame, and then
passes that DataFrame to your AAL Python function. Lastly, the function ‘yields’ the result
as a dictionary - this dictionary will be converted back into a RiskScape pipeline tuple,
which will be output from the python() step.
This feature is not limited to returning a single result. The example can be adapted to return
multiple rows back to RiskScape, simply by adding more yield statements:
    # 4. call your aal function (from the first example)
    aal_flood = compute_aal(df['flood_loss'])
    # 5. return a second result to riskscape
    yield {'aal': aal_flood, 'peril': 'fluvial_flooding'}
Only once the final yield is called will the script finish.
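Putting the pieces together, a minimal self-contained sketch of compute-aal.py might look like this. Note that the compute_aal body here is a stand-in (it simply averages the losses), and the eq_loss/flood_loss column names come from the examples above:

```python
import pandas as pd

def compute_aal(losses):
    # stand-in implementation: a real AAL calculation would weight
    # each event loss by its annual probability of occurrence
    return losses.mean()

def function(rows):
    # build one DataFrame from the whole event loss table
    df = pd.DataFrame(rows)
    # yield one output tuple per peril
    yield {'aal': compute_aal(df['eq_loss']), 'peril': 'earthquake'}
    yield {'aal': compute_aal(df['flood_loss']), 'peril': 'fluvial_flooding'}
```

Each yield produces one tuple from the python() step, so this step would output two rows.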
Generator functions
RiskScape makes use of a feature of the Python language called
generator functions to support whole-dataset processing. Tuples come into
the function via a generator, and rows are sent back to RiskScape the same way. For the most part, you
don’t need to know much about how these work, as long as you remember to return rows back to RiskScape using the yield
keyword instead of return.
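The contract can be sketched in plain Python, without RiskScape (the loss attribute name here is hypothetical):

```python
def function(rows):
    # 'rows' is a lazy iterable of dicts, one per pipeline tuple
    total = 0.0
    for row in rows:
        total += row['loss']
    # use yield, not return, to send rows back to RiskScape
    yield {'total_loss': total}

# RiskScape effectively drives the generator like this:
results = list(function(iter([{'loss': 10.0}, {'loss': 5.0}])))
```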
Outputs
In addition to being able to process your data, CPython also has robust tools
for displaying it. For example, matplotlib allows you to easily make
plots and figures.
In order to ‘save’ a Python plot or figure as a model result, RiskScape provides a special
model_output(file_name) function that can be used in Python code that runs from a
python() pipeline step. Calling the model_output() function with a file name
registers that file with RiskScape. RiskScape then knows that it’s an output
file, and will move it to the output directory along with your other outputs
when the model completes.
You do not need to add anything to your pipeline file in order to register
additional outputs. In fact, if your Python file only registers outputs and
does not yield any rows, you can omit the result-type from the python step
definition.
For a worked example of using the python() step to produce a PDF report, see Creating custom model outputs with Python.
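As a sketch of how model_output() fits in, the script below writes a small CSV summary and registers it as a model output. The model_output() function is injected by RiskScape at runtime, so the stub here only exists to let the sketch run standalone; the peril/loss attribute names are made up:

```python
import csv

# stub: inside a real python() step, RiskScape provides model_output()
def model_output(file_name):
    return file_name

def function(rows):
    # write a summary file and register it as a model output
    with open(model_output('loss-summary.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['peril', 'loss'])
        for row in rows:
            writer.writerow([row['peril'], row['loss']])
    # no yield: a step that only registers outputs can omit result-type
```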
Sub-directories
RiskScape will move all registered outputs into a flat output directory. If you
register outputs that have the same file name but are in different directories
(e.g. flood/map.png and landslide/map.png), only the first can be moved to the output
directory (e.g. output/map.png); the other output will be discarded.
We recommend writing all your Python outputs to a single directory, with unique
file names (e.g. flood-map.png and landslide-map.png).
Parameters
The python() step can optionally accept parameters, which can be passed directly to your Python code.
This lets you pass model parameters through to your Python code easily, and lets you write
reusable Python scripts that can be used in several different model pipelines.
For example, to customize the title and filename that gets used in a PDF report, you might have report.py Python
code that looks something like this:
from markdown_pdf import MarkdownPdf, Section

def function(rows, parameters={}):
    filename = parameters.get('filename', 'report.pdf')
    title = parameters.get('title', 'Default title')
    pdf = MarkdownPdf()
    pdf.add_section(Section("# " + title + "\n"))
    pdf.save(model_output(filename))
The first argument to the function is the rows of pipeline data (via a generator function),
and the second argument is the parameter values (as a Python dictionary) passed through from the python() pipeline step.
Tip
Defining the parameters function argument as parameters={}, and using parameters.get(KEY_NAME, DEFAULT_VALUE)
to retrieve values from the dictionary, helps make your Python script more reusable.
These approaches mean that the Python code will still work even if the python() pipeline step does
not specify any parameters.
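This pattern is easy to check in plain Python (the title parameter and rows_processed attribute are purely illustrative):

```python
def function(rows, parameters={}):
    # fall back to a default when the step passes no parameters
    title = parameters.get('title', 'Default title')
    count = sum(1 for _ in rows)
    yield {'title': title, 'rows_processed': count}

# called without parameters, the defaults apply
no_params = next(function(iter([{}, {}])))
# parameters from the python() step override the defaults
with_params = next(function(iter([]), {'title': 'Summary'}))
```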
In the pipeline python() step, we can specify the actual parameters to change the details of the PDF that
the Python function produces. For example:
event_loss
->
python(
script: 'report.py',
parameters: {
filename: 'science_report.pdf',
title: 'Summary of RiskScape model results'
}
)
For a worked example of using the python() step with parameters, see the bar graph example in Creating custom model outputs with Python.
More examples
Batch-processing
This example shows how computation can be batched up, which can be beneficial when using advanced features like GPU offloading.
import itertools
import pandas as pd

BATCH_SIZE = 100

def function(rows):
    # use python stdlib itertools to batch the rows coming in so we
    # can operate on them en masse (itertools.batched requires Python 3.12+)
    for batch in itertools.batched(rows, BATCH_SIZE):
        df = pd.DataFrame(batch)
        # call the function that benefits from running across many rows at once
        df = df.reticulate_splines()
        # return each result from the dataframe back to RiskScape,
        # one dict per row (iterating the DataFrame directly would
        # only yield column names)
        for new_row in df.to_dict('records'):
            yield new_row
Row-at-a-time
This example shows how you can call a function more like a traditional CPython function in RiskScape. Assume you already have a script that has a compute_damage and compute_loss function:
def function(rows):
    for row in rows:
        dr = compute_damage(row)
        loss = compute_loss(row, dr)
        # Return a row back to RiskScape for each row we are given
        yield {'dr': dr, 'loss': loss}
Note that unlike a standard RiskScape function used in a select() step, only the
attributes that the function returns appear in the output - the incoming row’s attributes
are not passed through automatically.