Instructions for data provider

The data provider is usually the scientist performing the measurements in the field.

System Architecture

Figure: Data flow model specification (../_images/DataFlowModel.svg)

The data provider must make sure that i) the raw data arrive at the right place in the landing zone, ii) a conversion script exists to standardize the raw data, and iii) metadata are provided.

General workflow

Working directly on the real landing zone would be dangerous. Therefore, all development and testing is done on a copy of the landing zone. The datapool provides a command to create development landing zones. A development landing zone can have any name, but let's call it dlz for now:

$ pool start-develop dlz

This creates a folder called dlz (a copy of the real landing zone) in the home directory. You can inspect the created landing zone with ls -l dlz.

The datapool provides various checks to ensure that the provided conversion scripts and metadata are consistent. The checks are run with:

$ pool check dlz

If everything is fine, modify the development landing zone (e.g. add a new sensor) according to the instructions given below. After the modifications, run the checks again:

$ pool check dlz

It is recommended to execute these checks after every small change. If the checks succeed, update the operational landing zone:

$ pool update-operational dlz

All future raw data should be delivered directly into the operational landing zone.

In the following sections, the different types of modifications/additions are explained.

Add Raw Data to Existing Source

Raw data files are written to the respective data/ folders in the operational landing zone as follows:

  1. A new file, for example data.incomplete, is created and data are written to this file.
  2. Once the file content is complete and the corresponding file handle is closed, the file is renamed to data-TIMESTAMP.raw.

Note that the file name must end with .raw! The actual format of TIMESTAMP is not fixed, but it must be a unique string that starts with a dash (-) and can be ordered in time. Encoding a full date and time string helps users and developers to inspect and find files, especially in the backup zone.

This procedure is called the write-rename pattern and prevents the conversion of incomplete data files. The risk of such a race condition depends on the size of the incoming data files and other factors and is probably very low. However, running a data pool over a long time span increases this risk and could result in missing data in the database.
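
The following Python sketch only illustrates the write-rename pattern; the folder, file name, and timestamp format are hypothetical choices, not prescribed by the datapool:

# Sketch of the write-rename pattern (hypothetical paths and timestamp format)
import os
from datetime import datetime

def deliver(lines, data_dir):
    # 1.) write everything to a temporary file that does not end with .raw
    tmp_path = os.path.join(data_dir, "data.incomplete")
    with open(tmp_path, "w") as f:
        for line in lines:
            f.write(line + "\n")
    # 2.) only rename after the file handle is closed; on the same file system
    #     the rename is atomic, so no half-written .raw file becomes visible
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    os.rename(tmp_path, os.path.join(data_dir, "data-" + timestamp + ".raw"))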

Add Site

In order to add a new measuring site, the information about this site has to be provided as a site.yaml file in a new folder for the site, within the sites folder of the landing zone. The information to be specified is:

  • Name: Name of the site
  • Description: Free Text describing the particularities of the site
  • Street, City and Coordinates (CH1903/LV03): Specifying where the site is located
  • Pictures (optional): Pictures relating to the site can be specified. Pictures are normally stored in the images folder of the specific site.

The structure of the file has to be the same as in the example below:

name: industry
description: Site at pump house at industry. Installed nearly on top of the pump. Detection of the pump is possible! When a pump must be removed, the sensor has to be taken out and reinstalled afterwards.
street: Undermülistrasse
city: 8320 Fehraltorf, Switzerland
coordinates:
    x: 47.394973
    y: 8.733811
    z:

# pictures are optional:
pictures:
    -
      path: images/installation.png
      description: Installation of Lora Ultrasonic Sensor 00769
      # date is optional:
      date: 2016/08/22 12:00:00
    -
      path: images/impact.jpg
      description: Impact Zone of Sensor 00769

Add or Modify Parameters

The file parameters.yaml is stored in the data folder and contains all the parameters. New parameters can be added here. The information to be included is:

  • Name: Name of the Parameter
  • Unit: Specifies the unit of the Parameter
  • Description: Additional description of the parameter. If no description is required, the field can remain empty.

The syntax has to match the following example (note the dash in the first line):

-
   name: Absorbance 202.50_nm
   unit: m-1
   description: absorbance at 202.50 nm wavelength
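
As an optional local sanity check before running pool check, a short Python script like the following could verify that parameters.yaml parses and that the parameter names are unique and complete. The file path and the use of the PyYAML package are assumptions for this sketch, not part of the datapool:

# Hypothetical sanity check for parameters.yaml (assumes PyYAML is installed)
import yaml

with open("data/parameters.yaml") as f:
    parameters = yaml.safe_load(f)

# every entry should at least provide a name and a unit
for entry in parameters:
    if not entry.get("name") or not entry.get("unit"):
        raise ValueError("Parameter entry without name or unit: {}".format(entry))

# parameter names must be unique, since standardized files reference them by name
names = [entry["name"] for entry in parameters]
duplicates = sorted(set(n for n in names if names.count(n) > 1))
if duplicates:
    raise ValueError("Duplicate parameter names: {}".format(duplicates))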

Add a new Source Type

TODO

Add Source Instance

todo

Conversion of raw data

The files arriving in the landing zone are called raw data. Every raw data file must be converted into a so-called standardized file by a conversion script. The format of the standardized files is defined below. Typically, every source instance needs an individually adapted conversion script.

Standardized file format

The standardized file format for the input data is a CSV file with either six or four columns. It must adhere to the following standards:

  • File format: CSV file, delimited by semicolons (;)
  • Date format: yyyy-mm-dd hh:mm:ss
  • Column names: The first row contains the column names. The first three are always: timestamp, parameter, value. Next, either the three columns x, y, z or a single column site must be given. The parameter must exist in parameters.yaml and have exactly the same name (see above).
  • value column: Must contain only numerical values. Missing values (NULL, NA, or similar) are not allowed.
  • The z-coordinate column may be empty.

Example standardized file format with coordinates

timestamp;parameter;value;x;y;z
2013-11-13 10:06:00;Water Level;148.02;682558;239404;
2013-11-13 10:08:00;Water Level;146.28;682558;239404;
2013-11-13 10:08:00;Average Flow Velocity;0.64;682558;239404;412

Example standardized file format with site

timestamp;parameter;value;site
2013-11-13 10:06:00;Water Level;148.02;zurich
2013-11-13 10:08:00;Water Level;146.28;zurich
2013-11-13 10:08:00;Average Flow Velocity;0.64;zurich
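
The following Python sketch shows one possible way to check a standardized file against the rules above before handing it over to the datapool. It assumes pandas is available; the function name and the example file name are hypothetical:

# Hypothetical check of a standardized file against the format rules (assumes pandas)
import pandas

def check_standardized(path):
    data = pandas.read_csv(path, sep=";")
    columns = list(data.columns)
    if columns[:3] != ["timestamp", "parameter", "value"]:
        raise ValueError("First three columns must be timestamp, parameter, value.")
    if columns[3:] not in (["x", "y", "z"], ["site"]):
        raise ValueError("Remaining columns must be x, y, z or a single site column.")
    # timestamps must follow yyyy-mm-dd hh:mm:ss
    pandas.to_datetime(data["timestamp"], format="%Y-%m-%d %H:%M:%S")
    # the value column must be numeric and must not contain missing values
    if data["value"].isnull().any():
        raise ValueError("Missing values in the value column are not allowed.")
    pandas.to_numeric(data["value"])

# example usage:
# check_standardized("data-2016-09-27.csv")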

Conversion script

The conversion script must define a function which reads the raw data and writes an output file (a standardized file). The first argument of this function is the path to the input raw data file, the second argument is the path to the resulting standardized file.

The following points should be considered when writing a conversion script:

  • Indicate corrupt input data by throwing an exception within the conversion script. An informative error message is helpful and will be logged.
  • If a conversion script writes to stdout (e.g. via normal print() commands), this output may not appear in the datapool log file and might thus be overlooked.
  • All required third-party modules, packages, or libraries must be installed globally. Do not try to install them within a script.

The following code snippets show what a conversion script could look like in different languages.

R

  • The file must be named conversion.r.
  • The function must be named convert.
# Example R conversion script
# September  27, 2016 -- Alex Hunziker

library(reshape2)

convert <- function(raw_file, output_file){

   data.raw <- utils::read.table(raw_file, sep="\t", skip=1, header=F)
   names(data.raw) <- c("Date Time", "Water Level", "Average Flow Velocity", "Flow",
                        "Temperature", "Surface Flow Velocity", "Distance",
                        "Distance Reading Count",  "Surcharge Level", "Peak to Mean Ratio",
                        "Number of Samples", "Battery Voltage")

   if(ncol(data.raw) !=12 )
       stop(paste("Error: Input File has", ncol(data.raw),
                  "columns, instead of the expected 12 columns."))

   if(!all(sapply(data.raw[2:ncol(data.raw)], is.numeric)==TRUE))
       stop("Error: Non-numeric input where numeric values were expected.")

   # define coordinate
   xcoor <- 682558
   ycoor <- 239404
   zcoor <- ""

   ## reformat data

   time <- strptime(data.raw$"Date Time", "%d.%m.%Y %H:%M")
   data.raw$"Date Time" <- format(time, "%Y-%m-%d %H:%M:%S")

   data.form <- reshape2::melt(data.raw, id.vars = c("Date Time"))

   colnames(data.form) <- c("timestamp", "parameter", "value")
   data.form$x <- xcoor
   data.form$y <- ycoor
   data.form$z <- zcoor

   # remove NA values
   data.form <- stats::na.omit(data.form)

   utils::write.table(data.form, file=output_file, row.names=FALSE, col.names=TRUE,
                      quote=FALSE, sep=";")

}

Julia

  • The function must be named convert.
  • The name of the julia file and the declared module must be the same (up to the .jl file extension). So the file containing the module conversion_lake_zurich must be saved as conversion_lake_zurich.jl.
  • Furthermore, the module and file names must be unique within the landing zone.
# Example Julia conversion script
# September  27, 2016 -- Alex Hunziker

module conversion_FloDar_Fehraltorf_2

# ---> 1.) load required package (optional)

using DataFrames

function convert(raw_file, output_file)

  # ---> 2.) read file

  if(!isfile(raw_file))
    error("Error: raw_file does not exist.")
  end

  # the header line contains non-utf8 encoded characters, so we skip this:
  dataraw = DataFrame(readtable(raw_file, separator = '\t', skipstart=1, header=false))

  names!(dataraw, map(symbol, ["Date Time", "Water Level", "Average Flow Velocity", "Flow",
                               "Temperature", "Surface Flow Velocity", "Distance",
                               "Distance Reading Count", "Surcharge Level",
                               "Peak to Mean Ratio", "Number of Samples", "Battery Voltage"]))

  ## ---> 3.) test properties

  if(size(dataraw, 2) != 12)
    error("Imput File has wrong number of columns.")
  end

  ## ---> 4.) add additional information (optional)

  #Define coordinate
  xcoor = 682558
  ycoor = 239404
  zcoor = ""

  ## ---> 5.) reformat data

  selCol = symbol("Date Time")
  time = Dates.DateTime(dataraw[selCol], "dd.mm.yyyy HH:MM")
  dataraw[selCol] = Dates.format(time, "yyyy-mm-dd HH:MM:SS")

  dataForm = stack(dataraw, [2:12], selCol)
  dataForm = dataForm[:, [selCol, :variable, :value]]
  dataForm[4] = xcoor
  dataForm[5] = ycoor
  dataForm[6] = zcoor
  names!(dataForm, [:timestamp, :parameter, :value, :x, :y, :z])

  deleterows!(dataForm, find(isna(dataForm[:, symbol("value")])))

  ## ---> 6.) write file

  writetable(output_file, dataForm, separator = ';')

end

end

Python

# Example Python conversion script
# September  27, 2016 -- Alex Hunziker

# ---> 1.) load required packages (optional)
import os.path
import pandas

def convert(raw_file, output_file):

    # ---> 2.) read file

    if not os.path.isfile(raw_file):
        raise ValueError('Error: Input File does not exist.')

    raw_data = pandas.read_csv(raw_file, sep='\t', encoding="latin-1")
    colNames = ("Date Time", "Water Level", "Average Flow Velocity", "Flow", "Temperature",
                "Surface Flow Velocity", "Distance", "Distance Reading Count",
                "Surcharge Level", "Peak to Mean Ratio", "Number of Samples",
                "Battery Voltage")
    raw_data.columns = colNames

    # ---> 3.) test properties

    if len(raw_data.columns) != 12:
        raise ValueError('Error: Input File has wrong number of columns.')

    # ---> 4.) add additional information (optional)

    # Define coordinate
    xcoor = 682558
    ycoor = 239404
    zcoor = ""

    # ---> 5.) reformat data

    time = pandas.to_datetime(raw_data['Date Time'], format="%d.%m.%Y %H:%M")
    raw_data['Date Time'] = time.apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))

    data = pandas.melt(raw_data, id_vars=['Date Time'],
                       value_vars=list(raw_data.columns[1:12]))

    data.columns = ['timestamp', 'parameter', 'value']

    data = data.dropna()

    data['x'] = xcoor
    data['y'] = ycoor
    data['z'] = zcoor

    ## ---> 6.) write file

    data.to_csv(output_file, sep=";", index=False)

Matlab

  • The function must be named convert.
  • The file must be named convert.m.
%
% SWW-DWH: Example MatLab conversion script
%
% 19/12/2016 - Frank Blumensaat
% Example: convert('raw_data\data-001.raw','out.dat');
% -------------------------------------------------------

function convert(fNameIn,fNameOut)

% read full content of the file into 'data'
fid = fopen(fullfile(fNameIn), 'r');
dataRaw = textscan(fid, '%s %f %f %f %f %f %f %f %f %f %f %f', Inf, 'Delimiter','\t','TreatAsEmpty',...
    {'NA'},'HeaderLines',1);
fclose(fid);

% possible to include check if 12 columns and numeric val's in col2 - col12

fid = fopen(fullfile(fNameIn), 'r');
names = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s', 1,'Delimiter','\t','HeaderLines',0);
fclose(fid);

% % parse string of TRANSFER time (time stamp) into ML number
datTime = datenum(dataRaw{1,1}(:),'dd.mm.yyyy HH:MM');

% define coordinates
xcoor = ones(length(dataRaw{1}),1).*682558;
ycoor = ones(length(dataRaw{1}),1).*239404;
zcoor = zeros(length(dataRaw{1}),1);

% split data matrix acc. to parameter and remove NaNs
for j = 2:size(dataRaw,2)
    dataSplit(j-1).var = excise([datTime dataRaw{1,j} xcoor ycoor zcoor]);
end

% some parameter names are not conforming to parameters.yaml:
parametersRaw =  {'Level', 'Velocity', 'Surface Velocity', 'PMR', 'NOS', 'Power Supply'};
parametersUniform = {'Water Level', 'Average Flow Velocity', 'Surface Flow Velocity',...
                    'Peak to Mean Ratio', 'Number of Samples', 'Battery Voltage'};

fixNames = containers.Map(parametersRaw,parametersUniform);

% write processed data to a cell array
celldata = {};
clear celldataTemp
for k = 1:length(dataSplit)
    for i = 1:length(dataSplit(k).var)
        celldataTemp{i,1} = datestr(dataSplit(k).var(i,1),'yyyy-mm-dd HH:MM:SS'); % following the ISO 8601 date standard
        name = char(names{k+1});
        % our parameters.yaml does not have the units in (..), so we remove them:
        name = regexprep(name, '\(.*\)', '');
        % correct some names:
        if isKey(fixNames, name)
            name = fixNames(name);
        end
        celldataTemp{i,2} = name;
        celldataTemp{i,3} = dataSplit(k).var(i,2);
        celldataTemp{i,4} = dataSplit(k).var(i,3);
        celldataTemp{i,5} = dataSplit(k).var(i,4);
        celldataTemp{i,6} = '';
    end
    celldata = vertcat(celldata,celldataTemp);
    clear celldataTemp
end

%% write selected data to TXT file
fid = fopen(fullfile(fNameOut),'w');
fprintf(fid, '%s; %s; %s; %s; %s; %s \n', 'timestamp', 'parameter', 'value', 'x', 'y', 'z');
[nrows] = size(celldata);
for row = 1:nrows
    fprintf(fid,'%s; %s; %f; %d; %d; %d \n',celldata{row,:});
end
fclose(fid);
end

%% function to remove NaN values
function X = excise(X)
X(any(isnan(X)'),:) = [];
end