Data preparation
================

Configuration files
-------------------

Steps for preparing the various datasets used in this documentation are specified in yaml files stored in :code:`config/prepare_data`. Here is an example config yaml file for preparing annual precipitation and its anomalies for the CAFE-f6 hindcasts/forecasts:

.. code-block:: yaml

   name: "CAFEf6"                      # <- The name of the dataset. This must match
                                       #    the name of a corresponding method in
                                       #    src.prepare_data._open
   prepare:
     annual.full.precip:               # <- Unique identifier for the output variable
                                       #    being processed. This will be used to save
                                       #    the variable as {name}.{identifier}.zarr
       uses:                           # <- List of input variables required to compute
         atmos_isobaric_month:         #    the output variable. For some datasets this
           - "precip"                  #    should be further broken down into subkeys
                                       #    indicating the realm for each list of
                                       #    variables (e.g. atmos_isobaric_month).
                                       #    Users can also provide the identifier of a
                                       #    previously prepared variable using:
                                       #    uses:
                                       #      prepared:
                                       #        -
       preprocess:                     # <- Functions and kwargs from src.utils to be
         normalise_by_days_in_month:   #    applied sequentially prior to concatenation
         convert_time_to_lead:         #    (for datasets comprised of multiple
           time_freq: "months"         #    concatenated files) and/or prior to merging
         truncate_latitudes:           #    input variables from multiple realms where
         coarsen:                      #    more than one are specified
           window_size: 12
           dim: "lead"
       apply:                          # <- Functions and kwargs from src.utils to be
         rename:                       #    applied sequentially to the opened (and
           ensemble: "member"          #    concatenated/merged, where appropriate)
         convert:                      #    dataset
           precip:
             multiply_by: 86400
         round_to_start_of_month:
           dim: ["init", "time"]
         rechunk:
           init: -1
           lead: 1
           member: -1
           lat: 10
           lon: 12
     annual.anom_1991-2020.precip:
       uses:
         prepared:
           - "annual.full.precip"
       apply:
         anomalise:
           clim_period: ["1991-01-01", "2020-12-31"]
         rechunk:
           init: -1
           lead: 1
           member: -1
           lat: 10
           lon: 12

Code for preparing data from a specified yaml file is in :code:`src/prepare_data.py`:

.. code-block:: console

   $ python src/prepare_data.py -h
   usage: prepare_data.py [-h] [--config_dir CONFIG_DIR] [--save_dir SAVE_DIR] config

   Process a raw dataset according to a provided config file

   positional arguments:
     config                Configuration file to process

   optional arguments:
     -h, --help            show this help message and exit
     --config_dir CONFIG_DIR
                           Location of directory containing config file(s) to use,
                           defaults to /config/prepare_data/
     --save_dir SAVE_DIR   Location of directory to save processed data to, defaults
                           to /data/processed/

To prepare a particular dataset, run:

.. code-block:: console

   make data config=

This will submit a batch job to prepare all of the diagnostics specified in :code:`config/prepare_data/`. An output file (named :code:`data_.o????????`) for this batch job will be written to the current directory once the job is complete. Alternatively, users can process multiple datasets in multiple jobs with:

.. code-block:: console

   make data config=" "

or process all available datasets with:

.. code-block:: console

   make data

Adding a new dataset for preparation
------------------------------------

There are a few steps to adding a new dataset.

#. Add a step to the 'data' target within :code:`Makefile` symlinking the location of the data in :code:`data/raw`. (This is really just to keep things tidy and easily traceable.)

#. Add a new, appropriately named method to :code:`src/prepare_data._open`. Choose a name that uniquely identifies the dataset being added, e.g. "JRA55".

#. Prepare a config file for the new dataset. This file can be named anything; however, the "name" key must match the name of the new method added in step 2. Functions for executing new steps should be added to :code:`src/utils.py`.

#. Add the new config file to the list of default configs to process (variable :code:`data_config`) in :code:`Makefile`.
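
As a rough illustration of how step functions relate to the :code:`preprocess` and :code:`apply` sections of a config, the sketch below applies a mapping of function names to kwargs in order. It is not the code in :code:`src/prepare_data.py`: the names :code:`STEP_FUNCTIONS`, :code:`apply_steps` and :code:`clip_negative` are hypothetical, the step functions are simplified stand-ins for those in :code:`src/utils.py`, and plain dictionaries are used in place of real datasets so that the example is self-contained. One detail worth noting is that a yaml key with no value (such as :code:`truncate_latitudes:` in the example config) is parsed as :code:`None`, so a dispatcher needs to treat that as "no kwargs".

.. code-block:: python

   # Hypothetical, simplified stand-ins for step functions in src/utils.py.
   # A "dataset" here is just a dict mapping variable names to lists of values.
   def rename(ds, **names):
       """Rename variables, e.g. rename(ds, pr="precip")."""
       return {names.get(key, key): value for key, value in ds.items()}


   def convert(ds, **conversions):
       """Scale variables, e.g. convert(ds, precip={"multiply_by": 86400})."""
       out = dict(ds)
       for variable, ops in conversions.items():
           out[variable] = [value * ops["multiply_by"] for value in out[variable]]
       return out


   def clip_negative(ds):
       """Set negative values to zero (a step that takes no kwargs)."""
       return {key: [max(value, 0) for value in values] for key, values in ds.items()}


   # Assumed lookup from the step names used in a config to functions
   STEP_FUNCTIONS = {
       "rename": rename,
       "convert": convert,
       "clip_negative": clip_negative,
   }


   def apply_steps(ds, steps):
       """Apply each step in order, passing its kwargs to the named function."""
       for name, kwargs in steps.items():
           # Steps written without a value in yaml are parsed as None
           ds = STEP_FUNCTIONS[name](ds, **(kwargs or {}))
       return ds


   # The structure a yaml "apply" section would have once parsed
   steps = {
       "rename": {"pr": "precip"},
       "clip_negative": None,
       "convert": {"precip": {"multiply_by": 86400}},
   }

   result = apply_steps({"pr": [-1, 2]}, steps)
   print(result)  # {'precip': [0, 172800]}

Under this pattern, adding a new processing step amounts to writing one function that takes the dataset as its first argument and registering it under the name used in the config.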