this Repo is copied to new location /Podnebnik/website/data and has been archieved.
Note: this readme is copied to new location /Podnebnik/website/data. This repo will be archieved soon
This is a collection of open data sources related to the climate change. We use the Frictionless Data Framework to organize and describe the data.
Currently we provide the following data packages:
For details about the provided data, please consult the datapackage.yaml files in the individual data package folders.
A datapackage is a combination of data resources (.csv data files) and a datapackage descriptor file (.yaml) containing the metadata.
Create a fork of the https://github.com/podnebnik/data repository and follow the instructions below to create your datapackage. You can also look at the existing datapackages for reference.
Use the following template for your datapackage folder structure that you should place in the root of the https://github.com/podnebnik/data repository:
datasets/
- emissions/
- datasets/
- sources/
emissions.xlsx
pipeline.py
emissions.csv
emissions.energy.csv
emissions.aviation.csv
emissions.agriculture.csv
datapackage.yaml
- give your datapackage a unique and descriptive name (e.g. "emissions") and name the folder in root.
- within the folder create a
datasets/sources/subfolder for the original data files (if they exist) and any for code used to transform the data into.csvfiles. - place the
.csvfiles into thedatasets/subfolder (see below for more details on these files) - prepare the metadata
.yamlfile for your datapackage (see below for detailed instructions on how to do that).
Data files should be in .csv format:
- single row header
- comma separated fields
- one row per record of equal length
The file names should be:
- descriptive and understandable, avoid unfamiliar abbreviations
- if you have several files in your datapackage, they should all have a common and unique prefix - usually the same one as the name of your datapackage (see the example above).
The variable names should be:
- descriptive and understandable, avoid unfamiliar abbreviations
- avoid redundancy: if a descriptor is in the filename, there is no need to repeat it in the variable names
- do not use spaces or any special characters except underscores "
_" - use a double underscore "
__" to delineate hierarchical levels e.g.fuel_combustion__transport
e.g. if the file name is emissions.agriculture.csv the following applies for variable names:
emissions.agriculture.manure.management- not OKemissions_agriculture__manure_management- not OKmanure.management- not OKmanure_management- OK
Once the data files are all ready, you will
- use the
pythonpackagefrictionlessto infer the basic metadata directly from the files into a.yamlfile - manually add the information that cannot be inferred into the
.yamlfile just created.
The final .yaml file containing the metadata should be stored at the same level as the datasets/ subfolder in your datapackage.
Make sure you have python installed on your system, then install the frictionless package:
pip install frictionless
Alternatively, we also provide a Pipfile to install the frictionless package, which will install the required modules in a new virtual environment (see https://pipenv.pypa.io/ for details):
pipenv install # create the python virtual environment and install the necessary modules
pipenv shell # activate the newly created python virtual environment
The frictionless command describe will automatically create the basic metadata file.
From your datapackage folder you can describe a set of .csv files with the help of the * wildcard operator like in the following example.
frictionless describe datasets/emissions*.csv --yaml > datapackage.yaml
This creates a datapackage.yaml file inferring the metadata for all files that follow the emissions*.csv pattern in the data\ folder.
Open the datapackage.yaml file and amend it to add
- package-level metadata i.e. information that applies to all the
.csvfiles in your datapackage - resounce-level metadata i.e. information that applies to individual
.csvfiles (which are called resources).
Pay attention to indentation! A newly created datapackage.yaml file will only have two fields at the top level: profile and resources. You should attempt the following fields (but if any of this metadata does not apply equally to all of the files in your datapackage, you should instead add them to the individual resources instead!):
name: for your short datapackage name, this should be the same as the folder name (e.g. emissions)title: should be a longer name of your datapackage (e.g. Historical and projected CO2 equiv. emissions)description: enter a longer description of your datapackagekeywords: enter relevant keywords in english as an array enclosed in square brackets (e.g. [emissions, agriculure])contributors: enter a list of authors (please check the specification for the list of available fields)geography: enter the geographic area the data refer to (e.g. Europe)schedule: enter the time resolution for the data e.g. annual, monthly..sources: should follow this template:
sources:
- title: # name of data source - mandatory field!
path: # path to file in repo if exists
url: # url to original data source if possible
author: # organisation or person who is the owner of the data
code: # path to code in repo used to transform data into csv files if exists
date_accessed: # date when data was extracted in ISO format (e.g. 2021-07-12)
licenses:unless required otherwise by your data source, use the following:
licenses:
- name: ODbL-1.0
title: Open Data Commons Open Database License 1.0
path: http://www.opendefinition.org/licenses/odc-odbl
For each individual file (i.e. resource) check the existing metadata and add or change the following (if appropriate):
name: you should normally leave the name as it was inferred (the name of the file). Otherwise be careful because no spaces are allowed in the name! (e.g.historical.emissions.agriculture)title: should be a longer name describing your data file (e.g.Historical emissions from agriculture)
If there is not a common data source for the whole package and individual files have their separate sources you must add them here instead:
sources:
- title: # name of data source - mandatory field!
path: # path to file in repo if exists
url: # url to original data source if possible
title: # name of data source
author: # organisation or person who is the owner of the data
code: # path to code in repo used to transform data into csv files if exists
date_accessed: # date when data was extracted in ISO format (e.g. 2021-07-12)
Finally, the fields in each resource: they already have the resource.schema.fields.name and resource.schema.fields.type values inferred.
name: do not change this value, it should be identical to the one in the.csvfile.type: you should check the types are correct and change them if required. Usually this will only mean changing the values todateoryearif appropriate.title: add a descriptive title e.g.luluc->Land use and land use changeunit: if appropriate, add the unit of the variable- you can also add a
formatfield. See here for valid types and formats. - You can also add constraints for the fields, as well as missing value definitions and primary and foreign keys if necessary. See this for more options.
Once you're finished editing the datapackage.yaml file, make a copy of the file and name it datapakcage.si.yaml Keeping the keys in the original English, translate the values into Slovenian for the following fields: name, title, description, keywords, resources.name, resources.title, resources.sources.title, resources.sources.title, resources.schema.fields.title and resource.schema.fields.unit as appropriate.
Do not translate any of the values for the name keys!
Once you're happy with the files, the folder structure and the datapackage.yaml metadata file, create a (draft) pull request tagging @joahim and @majazaloznik as reviewers.
Frictionless Data Framework provides a command line tool to help you describe, extract and validate your data. The easiest way to install the framework is to use the pipenv and run:
pipenv installYou can then run the tool with:
pipenv run frictionlessFor example, to validate the data package, run:
pipenv run frictionless validate emissions/datapackage.yamlRun datasette
docker run --rm -p 8001:8001 ghcr.io/podnebnik/data:latest
or locally
datasette serve build/