Instructions for data provider

The data provider is usually the scientist performing the measurements in the field.

System architecture

[Figure: Data flow model specification (../_images/DataFlowModel.svg)]

The data provider must make sure that:

  1. For online sensor data:
     a. The raw data arrive in the landing-zone at the right place,
     b. A conversion script is provided to interpret the raw data, and
     c. Meta-data are provided.
  2. For offline laboratory data:
     a. The raw data are copied to the WHDP server,
     b. A conversion script is available and running to process the data into the lab_results.csv file, and
     c. Meta-data are provided.

General workflow

Working on the real landing-zone would be dangerous. Therefore, all development and testing is done on a copy of the landing-zone. The WHDP provides a command to create such development landing-zones. A development landing-zone can have any name, but let’s call it dlz for now:

$ whdp start-develop dlz

This creates a folder (a copy of the real landing-zone) called dlz in the home directory. You can inspect the created landing zone with ls -l dlz.
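The exact content depends on the state of the real landing-zone; based on the folders referenced later in this document, the listing looks roughly like this (output abbreviated, details will differ):

$ ls -l dlz
drwxrwxr-x ... data
drwxrwxr-x ... lab_data
drwxrwxr-x ... sites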

The WHDP provides various checks to ensure that the provided conversion scripts and meta-data are consistent. The checks are run by:

$ whdp check dlz

If everything is fine, modify the development landing-zone (e.g., add a new sensor) according to the instructions given below. After the modifications, run the checks again:

$ whdp check dlz

It is recommended to execute these checks after every small change. If the checks succeed, update the operational landing zone:

$ whdp update-operational dlz

All future raw data should be delivered directly into the operational landing-zone.

In the following sections, the different types of modifications/additions are explained.

Add site

In order to add a new measuring site, the information about this site has to be provided as a site.yaml file in a new folder for the site, within the sites folder of the landing-zone. The information to be specified is:

  • Name: Name of the site
  • Description: Free text describing the site
  • Unit: A unique name for the room/space (e.g., BU_B09, FC_C24, NE_A017)
  • Area: The section of the room/space floor plan (e.g., GW).
  • Setup: The experimental setup (e.g., GW).
  • Component: A part of the experimental setup (e.g., MBR-Tank1).
  • Status: The current state of operation of the site (e.g., operational).
  • Pictures (optional): Pictures relating to the site can be specified. Pictures are normally stored in the images folder of the specific site.

The structure of the file has to be the same as in the example below:

name: NE_WaterHub_GW_GW_MBR-Tank1
description: MBR Tank 1 - Buffer tank
unit: NE_A17
area: GW
setup: GW
component: MBR-Tank1
status: operational

# pictures are optional:
pictures:
    -
      path: images/installation.png
      description: Installation of Lora Ultrasonic Sensor 00769
      # date is optional:
      date: 2016/08/22 12:00:00
    -
      path: images/impact.jpg
      description: Impact Zone of Sensor 00769

Adding a site is fairly trivial using the interactive input program. Typing a dot (.) at any time will abort the entry:

$ whdp add dlz site

Semi-automatic site updates

The WHDP-provider-tool-kit (https://gitlab.com/krisvillez/whdp-provider-tool-kit) includes a script dlz_site.py to automatically generate all sites in the dlz. This script reads an Excel file (whdp_table_site.xlsx) to obtain the list of all sites.
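The tool-kit script itself is maintained in the repository above. As an illustration only, a minimal sketch of such a generator could look as follows; the column layout of whdp_table_site.xlsx (one row per site, columns matching the site.yaml fields) is an assumption here:

# minimal sketch, not the tool-kit script itself
import pathlib

import pandas
import yaml

def generate_sites(xlsx_file, dlz):
    # assumption: one row per site, columns matching the site.yaml
    # fields (name, description, unit, area, setup, component, status)
    table = pandas.read_excel(xlsx_file)
    for _, row in table.iterrows():
        site = {key: str(value) for key, value in row.items()}
        site_dir = pathlib.Path(dlz) / "sites" / site["name"]
        site_dir.mkdir(parents=True, exist_ok=True)
        with open(site_dir / "site.yaml", "w") as fh:
            yaml.safe_dump(site, fh, default_flow_style=False)

generate_sites("whdp_table_site.xlsx", "dlz")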

Note on sampling for laboratory data analysis

  • If the added site is a location where laboratory samples can be taken, it is likely necessary to modify the laboratory workflow accordingly (e.g., programmable drop-down menus in Excel sheets).
  • The laboratory data conversion script is programmed so that adding a site does not require modification of this script.

Add or modify variables

The file variables.yaml is stored in the data folder and contains all variables recognized within the WHDP. This file is modified to add new variables. The information to be included is:

  • Name: Shorthand name of the variable (e.g., sensor_bp)
  • Description: Additional description of the variable. In case there is no description required, the field can remain empty.
  • Unit: Specifies the measurement unit of the variable

Within variables.yaml, each variable is described by 4 text lines with the following structure (note the dash in the first line):

-
   name: sensor_bp
   unit: mbar
   description: pressure

To modify variable.yaml with the nano editor, type the following in the command line:

$ nano dlz/data/variables.yaml

Semi-automatic variable updates

The WHDP-provider-tool-kit (https://gitlab.com/krisvillez/whdp-provider-tool-kit) includes a script dlz_variable.py to automatically generate all variables in the dlz. This script reads an Excel file (whdp_table_variable.xlsx) to obtain the list of all variables.

Note on name convention

In conventional WHDP practice, each variable name is composed of the following parts separated by an underscore (_):

  • An actuator/sensor/lab specification (e.g., ‘actuator’ for valve state; ‘sensor’ for pressure measurement; ‘lab’ for laboratory measurement)
  • A variable description, according to the EN_81346 norm (see https://de.wikipedia.org/wiki/EN_81346 ), e.g. sensor_bp for pressure measurement
  • [optional] An additional variable descriptor when the EN_81346 norm is too vague, e.g., sensor_bq_doconc for dissolved oxygen concentration (a sketch for checking this convention follows below)
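This convention can be checked mechanically before running whdp check. A minimal sketch; the exact regular expression is an assumption based on the rules above, not part of the WHDP:

import re

# prefix (actuator/sensor/lab), a two-letter EN 81346 code, and an
# optional extra descriptor; the precise pattern is an assumption
VARIABLE_NAME = re.compile(r"^(actuator|sensor|lab)_[a-z]{2}(_[a-z0-9]+)?$")

for name in ["sensor_bp", "sensor_bq_doconc", "lab_bq_toc", "Pressure"]:
    status = "ok" if VARIABLE_NAME.match(name) else "violates the convention"
    print(name, "->", status)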

Note on laboratory data

  • If the added variable is a new laboratory measurement, it is likely necessary to modify the laboratory workflow accordingly (e.g., programmable drop-down menus in Excel sheets).
  • Adding a variable requires modification of the laboratory data conversion script.

Add or modify persons (offline laboratory data)

The file persons.yaml is stored in the dlz/lab_data folder and contains all current and previous staff members recognized within the WHDP. The information to be included is:

  • Abbrev: Unique shorthand identifier of the person (e.g., villezkr)
  • Name: Full name of the staff member (e.g., Kris Villez)
  • Department: Affiliation of the staff member (e.g., eawag-eng)

Within persons.yaml, each person is described by 4 text lines with the following structure (note the dash in the first line):

-
    abbrev: villezkr
    name: Kris Villez
    department: eawag-eng

To modify persons.yaml with the nano editor, type the following in the command line:

$ nano dlz/lab_data/persons.yaml

Semi-automatic person updates

The WHDP-provider-tool-kit (https://gitlab.com/krisvillez/whdp-provider-tool-kit) includes a script dlz_person.py to automatically generate all persons in the dlz. This script reads the sheet person in an Excel file (last known: LabSamples2019.xlsx) to obtain the list of all persons.
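As with sites, the generation can be scripted. A minimal sketch; the sheet name person is documented above, while the column names are assumptions:

import pandas
import yaml

def generate_persons(xlsx_file, persons_yaml):
    # assumption: columns named abbrev, name, and department
    table = pandas.read_excel(xlsx_file, sheet_name="person")
    entries = [
        {"abbrev": str(row["abbrev"]),
         "name": str(row["name"]),
         "department": str(row["department"])}
        for _, row in table.iterrows()
    ]
    with open(persons_yaml, "w") as fh:
        yaml.safe_dump(entries, fh, default_flow_style=False)

generate_persons("LabSamples2019.xlsx", "dlz/lab_data/persons.yaml")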

Conversion of raw data (offline laboratory data)

Standardized file format

The data arriving in the landing zone are called raw data. The raw data must be converted into a so-called standardized file by a conversion script. The format of the standardized files is defined below. In current WHDP practice, a single Python script is used to collect all laboratory data and process it into a single file, lab_results.csv. This file is overwritten every day. The WHDP reads this file in and also overwrites existing entries when results are changed. In normal operation, data entries in the WHDP can be overwritten but cannot be removed.

Standardized file format for lab_results.csv

The standardized file format for the input data is a csv file with 18 columns. It must adhere to the following standards:

  • File format: csv file with semicolon as delimiter (;)

  • Columns: The first row contains the column names. The columns are always:

    • lab_identifier
    • sample_identifier: must be unique
    • variable: must exist in variables.yaml and have the exact same name (see above)
    • filter_lab
    • dilution_lab [can be empty]
    • method_lab
    • value_lab: must contain only numerical values. Missing values (NULL, NA, or similar) are not allowed.
    • description_lab [can be empty]
    • person_abbrev_lab
    • timestamp_start_lab [can be empty]: format yyyy-mm-dd hh:mm:ss
    • timestamp_end_lab: format yyyy-mm-dd hh:mm:ss
    • site: must be defined as such and have the exact same name (see above).
    • person_abbrev_sample: must be defined in persons.yaml and have the exact same value for abbrev (see above).
    • filter_sample
    • dilution_sample [can be empty]
    • timestamp_sample: format yyyy-mm-dd hh:mm:ss
    • description_sample [can be empty]
    • method_sample [can be empty]

Example:

lab_identifier sample_identifier variable filter_lab dilution_lab method_lab value_lab description_lab person_abbrev_lab timestamp_start_lab timestamp_end_lab site person_abbrev_sample filter_sample dilution_sample timestamp_sample description_sample method_sample
NE19063_000_labTOC NE19063 lab_bq_toc None   Shimadzu_TOC-L 1.39 D villezkr 2019-02-06 23:59:59   NE_WaterHub_GW_GW_Batch-Test hessange GF/_PET-45/25 1.0 2019-02-06_17:15:00 D manual
NE19054_000_labNO3N NE19054 lab_bq_no3n None   IC_Anionen 14.9   villezkr 2019-02-07 23:59:59   NE_WaterHub_GW_GW_M2 braunfab None 1.0 2019-02-07_10:00:00   manual
NE19055_000_labNO3N NE19055 lab_bq_no3n None   IC_Anionen 10.1   villezkr 2019-02-07 23:59:59   NE_WaterHub_GW_GW_M3 braunfab None 1.0 2019-02-07_10:00:00   manual
NE19054_000_labPO4P NE19054 lab_bq_po4p None   IC_Anionen 1.5   villezkr 2019-02-07 23:59:59   NE_WaterHub_GW_GW_M2 braunfab None 1.0 2019-02-07_10:00:00   manual
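Before delivering lab_results.csv, it can help to verify the file against the rules above. A minimal sketch; the column list is taken from the specification, but the function itself is not part of the WHDP:

import pandas

REQUIRED_COLUMNS = [
    "lab_identifier", "sample_identifier", "variable", "filter_lab",
    "dilution_lab", "method_lab", "value_lab", "description_lab",
    "person_abbrev_lab", "timestamp_start_lab", "timestamp_end_lab",
    "site", "person_abbrev_sample", "filter_sample", "dilution_sample",
    "timestamp_sample", "description_sample", "method_sample",
]

def validate_lab_results(path):
    data = pandas.read_csv(path, sep=";")
    missing = set(REQUIRED_COLUMNS) - set(data.columns)
    if missing:
        raise ValueError("missing columns: %s" % sorted(missing))
    # value_lab must contain numerical values only (no NULL/NA placeholders)
    values = pandas.to_numeric(data["value_lab"], errors="coerce")
    if values.isna().any():
        raise ValueError("non-numeric or missing entries in value_lab")

validate_lab_results("lab_results.csv")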

Conversion script

Unlike the case for online sensor data, the conversion script is not executed automatically by the WHDP software. Instead, the data provider must execute the conversion script manually or set up a time-based execution schedule (e.g., with cron).

Note on automated data transfer and processing

  • To automate this process, the following scripts are currently in use:

    • /home/whdp-provider/laboratory/t_lab_central.py (copy *.xlsx files from the Q drive)
    • /home/whdp-provider/laboratory/t_lab_project.py (copy *.xlsx files from the Q drive)
    • /home/whdp-provider/laboratory/c_lab.py (convert *.xlsx files to *.csv format)
  • The above scripts are executed once per day by means of cron, a time-based job scheduler (a crontab sketch follows below).
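A typical crontab for this purpose could look as follows; the run times and the python3 invocation are assumptions, only the script paths are taken from the list above:

# m h dom mon dow command
30 1 * * * python3 /home/whdp-provider/laboratory/t_lab_central.py
40 1 * * * python3 /home/whdp-provider/laboratory/t_lab_project.py
0  2 * * * python3 /home/whdp-provider/laboratory/c_lab.py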

Add a new source_type (online sensor data)

In order to add a new signal source_type, the information about this source_type has to be provided as a source_type.yaml file in a new folder for the source_type within the dlz/data folder of the landing-zone. The information to be specified is:

  • Name: unique shorthand name, e.g., obtained by combining the product name and manufacturer (e.g., concube_scan)
  • Description: Detailed description

Adding a source_type is fairly trivial using the interactive input program. Typing a dot (.) at any time will abort the entry:

$ whdp add dlz source_type

Note on programmable logic controllers (PLCs)

In conventional WHDP practice, each programmable logic controller (PLC) is considered a source_type of its own for the following reasons:

  • The conversion scripts for devices (source) of the same type (source_type) are expected to be similar. Because every PLC is programmed very differently, one cannot expect the conversion scripts to be very similar, at least not to the same degree that one expects for stand-alone sensors.
  • The use of a source_type to describe a single PLC device makes it possible to provide a conversion script specific to the PLC, which can be used to assign the data automatically to the sources (actuators/sensors) associated with the given PLC.

Add a source (online sensor data, instance of source_type)

In order to add a new signal source, the information about this source has to be provided as a source.yaml file in a new folder for the source, within the folder of the corresponding source_type in the dlz/data folder of the landing-zone. The information to be specified is:

  • Name: unique shorthand name for the source (e.g., concube_grey1, prgB615)
  • Description: detailed description
  • Serial: [optional] serial number
  • Manufacturer: [optional] the manufacturer of the device

Adding a source is fairly trivial using the interactive input program. Typing a dot (.) at any time will abort the entry:

$ whdp add dlz source

Semi-automatic source updates

To avoid errors, the WHDP-provider-tool-kit (https://gitlab.com/krisvillez/whdp-provider-tool-kit) includes scripts to automatically generate all sources of a certain source_type in the dlz. Each script reads the data from an Excel file which lists the available sources. Currently, the following scripts and associated Excel files are available:

script                         Excel file
dlz_sources_memographgrey.py   memographgrey_config.xlsx
dlz_sources_plcgrey.py         plcgrey_config.xlsx

Note on programmable logic controllers (PLCs)

  • In conventional WHDP practice, each actuator and sensor associated with the same programmable logic controller (PLC) is considered a separate source. A single source can, however, be associated with multiple variables. E.g., continuous-scale sensors (e.g., pH) typically have a status variable (manual on/manual off/auto on/auto off) and an actual measured value.
  • In conventional WHDP practice, the name of a source representing a PLC-connected actuator or sensor is equal to the name of the PLC program that registers the state (and value) of the actuator (sensor), e.g., prgB615.

Add raw data to existing source (online sensor data)

Raw data files are written to the respective data/ folders in the operational landing zone as follows:

  1. A new file, for example data.incomplete, is created and data are written to this file.
  2. Once the file content is complete and the corresponding file handle is closed, the file is renamed to data-TIMESTAMP.raw.

Note, the file must end with ``.raw``! The actual format of TIMESTAMP is not fixed, but it must be a unique string that starts with a dash (-) and can be temporally ordered. Encoding a full date and time string will help users and developers to inspect and find files, especially if present in the backup zone.

This procedure is called the write-rename pattern and avoids conversion of incomplete data files. The risk of such a race condition depends on the size of the incoming data files and other factors and is probably very low, but running a data pool over a longer time span increases this risk and could result in missing data in the database.
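In a delivery script, the pattern amounts to writing under a temporary name and renaming only after the file handle is closed. A minimal Python sketch; the file names are placeholders:

import os
import time

def deliver(lines, data_dir):
    # 1.) write under a temporary name without the .raw extension, so the
    #     WHDP ignores the file while it is still incomplete
    incomplete = os.path.join(data_dir, "data.incomplete")
    with open(incomplete, "w") as fh:
        fh.writelines(lines)
    # 2.) rename once the handle is closed; the TIMESTAMP part starts with
    #     a dash and orders the files by time
    stamp = time.strftime("%Y-%m-%dT%H-%M-%S")
    os.rename(incomplete, os.path.join(data_dir, "data-%s.raw" % stamp))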

Note on automated data transfers

  • To automate this process, the following script is currently in use:

    • t_plcgrey.py (data transfer from the plcgrey PLC)
  • The above script is executed once per day by means of cron, a time-based job scheduler.

  • The conversion script for data transferred from the plcgrey PLC is located here:

    • /home/whdp-provider/dlz/data/plcgreyconversion.py

Conversion of raw data (online sensor data)

The files arriving in the landing zone are called raw data. Every raw data file must be converted into a so-called standardized file by a conversion script. The format of the standardized files is defined below. Most typically, every source needs an individually adapted conversion script. As an alternative, the WHDP also allows an individually adapted conversion script for a source_type.

Standardized file format for association with a source

The standardized file format for the input data is a csv file with 4 columns. It must adhere to the following standards:

  • File format: csv file with semicolon as delimiter (;)
  • Timestamp format: yyyy-mm-dd hh:mm:ss
  • Column names: The first row contains the column names. The columns are always: timestamp, variable, value, and site. The variable must exist in variables.yaml and have the exact same name (see above). The site must be defined as such and have the exact same name (see above).
  • value column: Must contain only numerical values. Missing values (NULL, NA, or similar) are not allowed.

Example:

timestamp;variable;value;site
2019-01-29 12:29:29;sensor_bp;18.605044;NE_WaterHub_GW_GW_M1
2019-01-29 12:29:30;sensor_bp;19.225162;NE_WaterHub_GW_GW_M1
2019-01-29 12:29:31;sensor_bp;19.535282;NE_WaterHub_GW_GW_M1
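For a simple sensor, a conversion to exactly this 4-column layout can be very short. A minimal pandas sketch; the layout of the raw file (tab-separated columns time and pressure) and the fixed site name are assumptions:

import pandas

def convert(raw_file, output_file):
    # assumption: tab-separated raw data with columns 'time' and 'pressure'
    raw = pandas.read_csv(raw_file, sep="\t")
    out = pandas.DataFrame({
        "timestamp": pandas.to_datetime(raw["time"]).dt.strftime("%Y-%m-%d %H:%M:%S"),
        "variable": "sensor_bp",
        "value": raw["pressure"],
        "site": "NE_WaterHub_GW_GW_M1",
    })
    # missing values (NULL, NA, or similar) are not allowed
    out = out.dropna()
    out.to_csv(output_file, sep=";", index=False)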

Standardized file format for association with a source_type

The standardized file format for the input data is a csv file with 5 columns. It must adhere to the following standards:

  • File format: csv file with semicolon as delimiter (;)
  • Timestamp format: yyyy-mm-dd hh:mm:ss
  • Column names: The first row contains the column names. The columns are always: timestamp, variable, value, site, and source. The variable must exist in variables.yaml and have the exact same name (see above). The site must be defined as such and have the exact same name (see above). The source must be defined as an instance of the considered source_type and have the exact same name (see above).
  • value column: Must contain only numerical values. Missing values (NULL, NA, or similar) are not allowed.

Example:

timestamp;variable;value;site;source
2019-01-29 12:29:29;sensor_bp;18.605044;NE_WaterHub_GW_GW_M1;prgB615
2019-01-29 12:29:30;sensor_bp;19.225162;NE_WaterHub_GW_GW_M1;prgB615
2019-01-29 12:29:31;sensor_bp;19.535282;NE_WaterHub_GW_GW_M1;prgB615

Conversion script

The conversion script must define a function which reads raw data and writes an output file (a standardized file). The first argument of this function is the path to the input raw data, the second argument the path to the resulting file.

The following points should be considered when writing a conversion script:

  • Indicate corrupt input data by throwing an exception within the conversion script. An informative error message is helpful and will be logged.
  • If a conversion script writes to stdout (i.e., normal print() commands), this may not appear in the WHDP log file and thus might be overlooked.
  • All required third-party modules, packages, or libraries must be installed globally. Do not try to install them within a script.

The following code snippets show how a conversion script could look for different languages.

R

  • The file must be named conversion.r.
  • The function must be named convert.
# Example R conversion script
# September  27, 2016 -- Alex Hunziker

library(reshape2)

convert <- function(raw_file, output_file){

   data.raw <- utils::read.table(raw_file, sep="\t", skip=1, header=F)

   # check the column count before assigning names, otherwise the
   # assignment below fails with a less informative error:
   if(ncol(data.raw) != 12)
       stop(paste("Error: Input File has", ncol(data.raw),
                  "columns, instead of the expected 12 columns."))

   names(data.raw) <- c("Date Time", "Water Level", "Average Flow Velocity", "Flow",
                        "Temperature", "Surface Flow Velocity", "Distance",
                        "Distance Reading Count", "Surcharge Level", "Peak to Mean Ratio",
                        "Number of Samples", "Battery Voltage")

   if(!all(sapply(data.raw[2:ncol(data.raw)], is.numeric)))
       stop("Error: Non-numeric input where numeric values were expected.")

   # define coordinate
   xcoor <- 682558
   ycoor <- 239404
   zcoor <- ""

   ## reformat data

   time <- strptime(data.raw$"Date Time", "%d.%m.%Y %H:%M")
   data.raw$"Date Time" <- format(time, "%Y-%m-%d %H:%M:%S")

   data.form <- reshape2::melt(data.raw, id.vars = c("Date Time"))

   colnames(data.form) <- c("timestamp", "parameter", "value")
   data.form$X <- xcoor
   data.form$Y <- ycoor
   data.form$Z <- zcoor

   # remove NA values
   data.form <- stats::na.omit(data.form)

   utils::write.table(data.form, file=output_file, row.names=FALSE, col.names=TRUE,
                      quote=FALSE, sep=";")

}

Julia

  • The function must be named convert.
  • The name of the julia file and the declared module must be the same (up to the .jl file extension). So the file containing the module conversion_lake_zurich must be saved as conversion_lake_zurich.jl.
  • Further the module and file name must be unique within the landing zone.
# Example Julia conversion script
# September  27, 2016 -- Alex Hunziker

module conversion_FloDar_Fehraltorf_2

# ---> 1.) load required package (optional)

using DataFrames

function convert(raw_file, output_file)

  # ---> 2.) read file

  if(!isfile(raw_file))
    error("Error: raw_file does not exist.")
  end

  # the header line contains non-utf8 encoded characters, so we skip this:
  dataraw = DataFrame(readtable(raw_file, separator = '\t', skipstart=1, header=false))

  names!(dataraw, map(symbol, ["Date Time", "Water Level", "Average Flow Velocity", "Flow",
                               "Temperature", "Surface Flow Velocity", "Distance",
                               "Distance Reading Count", "Surcharge Level",
                               "Peak to Mean Ratio", "Number of Samples", "Battery Voltage"]))

  ## ---> 3.) test properties

  if(size(dataraw, 2) != 12)
    error("Imput File has wrong number of columns.")
  end

  ## ---> 4.) add additional information (optional)

  #Define coordinate
  xcoor = 682558
  ycoor = 239404
  zcoor = ""

  ## ---> 5.) reformat data

  selCol = symbol("Date Time")
  time = Dates.DateTime(dataraw[selCol], "dd.mm.yyyy HH:MM")
  dataraw[selCol] = Dates.format(time, "yyyy-mm-dd HH:MM")

  dataForm = stack(dataraw, [2:12], selCol)
  dataForm = dataForm[:, [selCol, :variable, :value]]
  dataForm[4] = xcoor
  dataForm[5] = ycoor
  dataForm[6] = zcoor
  names!(dataForm, [:timestamp, :parameter, :value, :x, :y, :z])

  deleterows!(dataForm, find(isna(dataForm[:, symbol("value")])))

  ## ---> 6.) write file

  writetable(output_file, dataForm, separator = ';')

end

end

Python

# Example Python conversion script
# September  27, 2016 -- Alex Hunziker

# ---> 1.) load required packages (optional)
import os.path
import pandas

def convert(raw_file, output_file):

    # ---> 2.) read file

    if not os.path.isfile(raw_file):
        raise ValueError('Error: Input File does not exist.')

    raw_data = pandas.read_csv(raw_file, sep='\t', encoding="latin-1")

    # ---> 3.) test properties

    # check the column count before renaming, otherwise the assignment
    # below fails with a less informative error:
    if len(raw_data.columns) != 12:
        raise ValueError('Error: Input File has wrong number of columns.')

    colNames = ("Date Time", "Water Level", "Average Flow Velocity", "Flow", "Temperature",
                "Surface Flow Velocity", "Distance", "Distance Reading Count",
                "Surcharge Level", "Peak to Mean Ratio", "Number of Samples",
                "Battery Voltage")
    raw_data.columns = colNames

    # ---> 4.) add additional information (optional)

    # Define coordinate
    xcoor = 682558
    ycoor = 239404
    zcoor = ""

    # ---> 5.) reformat data

    time = pandas.to_datetime(raw_data['Date Time'], format="%d.%m.%Y %H:%M")
    raw_data['Date Time'] = time.apply(lambda x: x.strftime('%Y-%m-%d %H:%M'))

    data = pandas.melt(raw_data, id_vars=['Date Time'],
                       value_vars=list(raw_data.columns[1:12]))

    data.columns = ['Date Time', 'parameter', 'value']

    data = data.dropna()

    data['x'] = xcoor
    data['y'] = ycoor
    data['z'] = zcoor

    ## ---> 6.) write file

    data.to_csv(output_file, sep=";", index=False)

Matlab

  • The function must be named convert.
  • The file must be named convert.m.
%
% SWW-DWH: Example MatLab conversion script
%
% 19/12/2016 - Frank Blumensaat
% Example: convert('raw_data\data-001.raw','out.dat');
% -------------------------------------------------------

function convert(fNameIn,fNameOut)

% read full content of the file into 'data'
fid = fopen(fullfile(fNameIn), 'r');
dataRaw = textscan(fid, '%s %f %f %f %f %f %f %f %f %f %f %f', Inf, 'Delimiter','\t','TreatAsEmpty',...
    {'NA'},'HeaderLines',1);
fclose(fid);

% possible to include check if 12 columns and numeric val's in col2 - col12

fid = fopen(fullfile(fNameIn), 'r');
names = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s', 1,'Delimiter','\t','HeaderLines',0);
fclose(fid);

% parse the time stamp string into a MATLAB datenum
datTime = datenum(dataRaw{1,1}(:),'dd.mm.yyyy HH:MM');

% define coordinates
xcoor = ones(length(dataRaw{1}),1).*682558;
ycoor = ones(length(dataRaw{1}),1).*239404;
zcoor = zeros(length(dataRaw{1}),1);

% split data matrix acc. to parameter and remove NaNs
for j = 2:size(dataRaw,2)
    dataSplit(j-1).var = excise([datTime dataRaw{1,j} xcoor ycoor zcoor]);
end

% some parameter names are not conforming to parameters.yaml:
parametersRaw =  {'Level', 'Velocity', 'Surface Velocity', 'PMR', 'NOS', 'Power Supply'};
parametersUniform = {'Water Level', 'Average Flow Velocity', 'Surface Flow Velocity',...
                    'Peak to Mean Ratio', 'Number of Samples', 'Battery Voltage'};

fixNames = containers.Map(parametersRaw,parametersUniform);

% write processed data to a cell array
celldata = {};
clear celldataTemp
for k = 1:length(dataSplit)
    for i = 1:length(dataSplit(k).var)
        celldataTemp{i,1} = datestr(dataSplit(k).var(i,1),'yyyy-mm-dd HH:MM:SS'); % following the ISO 8601 data standard
        name = char(names{k+1});
        % our parameters.yaml does not have the units in (..), so we remove them:
        name = regexprep(name, '\(.*\)', '');
        % correct some names:
        if isKey(fixNames, name)
            name = fixNames(name);
        end
        celldataTemp{i,2} = name;
        celldataTemp{i,3} = dataSplit(k).var(i,2);
        celldataTemp{i,4} = dataSplit(k).var(i,3);
        celldataTemp{i,5} = dataSplit(k).var(i,4);
        celldataTemp{i,6} = '';
    end
    celldata = vertcat(celldata,celldataTemp);
    clear celldataTemp
end

%% write selected data to TXT file
fid = fopen(fullfile(fNameOut),'w');
fprintf(fid, '%s; %s; %s; %s; %s; %s \n', 'timestamp', 'parameter', 'value', 'x', 'y', 'z');
[nrows] = size(celldata);
for row = 1:nrows
    fprintf(fid,'%s; %s; %f; %d; %d; %d \n',celldata{row,:});
end
fclose(fid);
end

%% function to remove NaN values
function X = excise(X)
X(any(isnan(X)'),:) = [];
end