Elements of a Nowcast System

TODO:

  • Python package; e.g. SalishSeaNowcast

  • version control is highly recommended

  • nowcast system configuration

  • workers

  • next_workers module

  • process management with supervisor

  • distribute releases via an anaconda.org channel or conda-forge

Logging

TODO:

  • logging levels and readability of log files

  • log file rotation and growth limitation

  • environment variable substitution in logging config

  • serve log files on web page if possible, or use log aggregation service

  • exception logging to Sentry

  • machine readable logging; JSON via Driftwood

Next Workers Module

TODO

Handling Worker Race Conditions

Occasionally when a collection of workers are launched to run concurrently by returning a list of nemo_nowcast.worker.NextWorker instances from a next_workers.after_*() function a race condition is created among two or more of the workers. When that happens it is impossible to know when of the racing workers after_*() functions to have return NextWorker instance(s) for the next step of the automation.

A concrete example of that situation is in the Salish Sea Nowcast system where nowcast.next_workers.after_collect_weather() includes the grib_to_netcdf and download_live_ocean workers in the list of next workers that it returns. That results in a race conditions between the grib_to_netcdf and make_live_ocean_files workers that can allow the upload_forcing workers to run before the grib_to_netcdf worker finishes, causing the atmospheric forcing files to be incomplete for some of the NEMO runs.

To mitigate that situation we can return a 2-tuple from after_collect_weather(). The first element of the tuple is the ususal list of NextWorker instances. The second element is a set of worker names involved in the race condition, for example:

next_workers =
    [
        NextWorker("nowcast.workers.get_NeahBay_ssh", args=["nowcast"]),
        NextWorker("nowcast.workers.grib_to_netcdf", args=["nowcast+"]),
        NextWorker("nowcast.workers.download_live_ocean"),
    ]
)
race_condition_workers = {"grib_to_netcdf", "make_live_ocean_files"}
return next_workers, race_condition_workers

When the Manager sees a set of race condition workers returned from an after_*() function it sets up a data structure to manage the race condition. As each of the workers in the race condition set finishes the NextWorker instance(s) they return are accumulated in a list instead of being launched immediately. Once all of the race condition workers have finished the accumulated list of NextWorker instances is launched. debug() level logging messages that describe the progress of the race conditions management are emitted.

Note

At present only one race condition can be managed at a time.