Elements of a Nowcast System
TODO:
Python package; e.g. SalishSeaNowcast
version control is highly recommended
nowcast system configuration
workers
next_workers module
process management with supervisor
distribute releases via an anaconda.org channel or conda-forge
Logging
TODO:
logging levels and readability of log files
log file rotation and growth limitation
environment variable substitution in logging config
serve log files on web page if possible, or use log aggregation service
exception logging to Sentry
machine readable logging; JSON via Driftwood
Next Workers Module
TODO
Handling Worker Race Conditions
Occasionally, when a collection of workers is launched to run concurrently by returning a list of nemo_nowcast.worker.NextWorker instances from a next_workers.after_*() function, a race condition is created among two or more of the workers.
When that happens it is impossible to know which of the racing workers' after_*() functions should return the NextWorker instance(s) for the next step of the automation.
A concrete example of that situation is in the Salish Sea Nowcast system, where nowcast.next_workers.after_collect_weather() includes the grib_to_netcdf and download_live_ocean workers in the list of next workers that it returns.
That results in a race condition between the grib_to_netcdf and make_live_ocean_files workers that can allow the upload_forcing workers to run before the grib_to_netcdf worker finishes,
causing the atmospheric forcing files to be incomplete for some of the NEMO runs.
To mitigate that situation we can return a 2-tuple from after_collect_weather().
The first element of the tuple is the usual list of NextWorker instances.
The second element is a set of worker names involved in the race condition,
for example:
    next_workers = [
        NextWorker("nowcast.workers.get_NeahBay_ssh", args=["nowcast"]),
        NextWorker("nowcast.workers.grib_to_netcdf", args=["nowcast+"]),
        NextWorker("nowcast.workers.download_live_ocean"),
    ]
    race_condition_workers = {"grib_to_netcdf", "make_live_ocean_files"}
    return next_workers, race_condition_workers
When the Manager sees a set of race condition workers returned from an after_*() function, it sets up a data structure to manage the race condition.
As each of the workers in the race condition set finishes, the NextWorker instance(s) that its after_*() function returns are accumulated in a list instead of being launched immediately.
Once all of the race condition workers have finished, the accumulated list of NextWorker instances is launched.
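The accumulate-then-launch behaviour described above can be illustrated with a minimal sketch. It is not the Manager's actual implementation: the RaceConditionTracker class and its worker_finished() method are hypothetical names invented for this example, the NextWorker dataclass is a simplified stand-in for nemo_nowcast.worker.NextWorker, and the example_worker module name and all args values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class NextWorker:
    """Simplified stand-in for nemo_nowcast.worker.NextWorker (sketch only)."""

    module: str
    args: list = field(default_factory=list)


class RaceConditionTracker:
    """Hypothetical helper that holds back next workers until every
    worker in the race condition set has finished.
    """

    def __init__(self, race_condition_workers):
        # Names of the racing workers that have not finished yet
        self.pending = set(race_condition_workers)
        # NextWorker instances held back until the race is resolved
        self.accumulated = []

    def worker_finished(self, worker_name, next_workers):
        """Record that a worker finished; return the workers to launch now."""
        if worker_name not in self.pending:
            # Worker is not part of the race; launch its next workers at once
            return list(next_workers)
        self.pending.discard(worker_name)
        self.accumulated.extend(next_workers)
        if self.pending:
            # At least one racing worker is still running; launch nothing yet
            return []
        # All racing workers have finished; release everything accumulated
        to_launch, self.accumulated = self.accumulated, []
        return to_launch


tracker = RaceConditionTracker({"grib_to_netcdf", "make_live_ocean_files"})

# make_live_ocean_files finishes first; its next worker is held back
first = tracker.worker_finished(
    "make_live_ocean_files",
    [NextWorker("nowcast.workers.upload_forcing", args=["nowcast+"])],
)

# grib_to_netcdf finishes last; the whole accumulated list is released
second = tracker.worker_finished(
    "grib_to_netcdf",
    [NextWorker("nowcast.workers.example_worker")],  # placeholder module name
)
```

In this sketch first is an empty list because grib_to_netcdf is still pending, and second holds both accumulated NextWorker instances in the order in which they were collected.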
Logging messages at the debug() level that describe the progress of the race condition management are emitted.
Note
At present only one race condition can be managed at a time.