Instructions for data users

Data users are scientists analysing data from the water hub data pool.

All data is store in a PostgrSQL database so that arbitrary queries can be performed. The figure below shows the database layout.

_images/DataModel.svg

Connect to water hubgdata pool

To connect with the database you need the following information from your admin:

  • The host, e.g. “a.server.com”
  • The port of the database, e.g. 5432
  • The name of the database, e.g. “whdp”
  • A database user name, e.g “data_user”
  • The database password

You can connect directly to the database via psql. However, it is more convenient to load the data required directly in the environment used for further analysis:

R

With the RPostgreSQL package data can be loaded directly into R.

However, for Eawag internal usage the DatapoolR package should be preferred. It connects automatically to the database and provides convenient helper functions.

Python

Different options exist, psycopg2 is widly used.

Example SQL queries

The SQL language may look cumbersome at first. However, it gives a lot of flexibility and allows to express even very complex queries. This SQL tutorial is a helpful reference.

Also note that the DatapoolR package provides functions to simplify common queries.

List all data sources:

SELECT srctype.name, src.name, src.serial, srctype.description, src.description
FROM source_type AS srctype, source AS src
WHERE src.source_type_id = srctype.source_type_id;

List all sites and check how many images a site has:

SELECT name, coord_x, coord_y, coord_z, street,
  postcode, COUNT(picture.filename), site.description
FROM site
  LEFT JOIN picture ON site.site_id=picture.site_id
GROUP BY site.site_id;

Get all signals measured at site site_A within a given time interval:

SELECT signal.timestamp, value, unit, parameter.name, source_type.name, source.name
FROM signal
  INNER JOIN site ON signal.site_id = site.site_id
  INNER JOIN parameter ON signal.parameter_id = parameter.parameter_id
  INNER JOIN source ON signal.source_id = source.source_id
  INNER JOIN source_type ON source.source_type_id = source_type.source_type_id
WHERE site.name = site_A AND
  ?tmin::timestamp <= "2017-01-01 00:12:00"::timestamp AND
  signal.timestamp <= "2017-01-01 00:18:00"::timestamp;