Each folder in this directory contains all the descriptions for building a data lake that can enable studies on a specific infectious disease. Each folder has four subfolders: Data Collection, Data Curation, Data Description, and Data ETL.
- Data Collection: contains the scripts for downloading data from open sources and updating it when new versions become available at the original source.
- Data Curation: contains the scripts for data harmonisation and cleansing for each data set. The scripts may change over time in response to changes detected after the records are updated.
- Data Description: contains the code for basic data analysis and data validation.
- Data ETL: contains the code to format data for modelling and visualization.
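
The four stages above can be pictured with a minimal, standard-library-only sketch. It is not the code shipped in the subfolders: the sample data, the column names (`date`, `region`, `cases`) and the function names `collect`, `curate`, `describe` and `to_model_input` are all hypothetical, chosen only to illustrate what each stage is responsible for.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical raw extract: values and column names are invented for illustration.
RAW_CSV = """date,region,cases
2020-03-01, bahia ,12
01/03/2020,Bahia,12
2020-03-02,BAHIA,
2020-03-02,Ceara,7
"""


def collect(text):
    """Data Collection stand-in: parse a CSV that a download script would have fetched."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_date(value):
    """Accept the two date layouts present in the sample and return a date object."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("Unrecognised date format: {!r}".format(value))


def curate(rows):
    """Data Curation: harmonise dates and region names, drop empty counts and duplicates."""
    seen = set()
    clean = []
    for row in rows:
        if not row["cases"].strip():
            continue  # missing count: skipped in this sketch
        record = (parse_date(row["date"]), row["region"].strip().title(), int(row["cases"]))
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def describe(records):
    """Data Description: a basic validation check plus summary figures."""
    counts = [cases for _, _, cases in records]
    if any(cases < 0 for cases in counts):
        raise ValueError("Negative case count found")
    return {
        "rows": len(records),
        "total_cases": sum(counts),
        "regions": sorted({region for _, region, _ in records}),
    }


def to_model_input(records):
    """Data ETL: one date-ordered series of (ISO date, cases) per region."""
    series = defaultdict(list)
    for date, region, cases in sorted(records):
        series[region].append((date.isoformat(), cases))
    return dict(series)


records = curate(collect(RAW_CSV))
print(describe(records))
print(to_model_input(records))
```

Keeping each stage as a separate function mirrors the folder layout, so a change in a source format should only require touching the curation step.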
The library is currently in production, so the easiest way to use it is to clone our repository or copy the functions available in this directory.
Models were implemented using Python > 3.5 and depend on libraries such as Pandas, SciPy, NumPy, Matplotlib, etc. For the full list of dependencies and their versions, check the requirements.txt file inside each folder.
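
As a rough illustration of checking an environment against these requirements, the sketch below tests the interpreter version and reads a requirements-style listing. It makes several assumptions: "Python > 3.5" is read as strictly newer than 3.5, the helper names `python_is_supported` and `parse_requirements` are hypothetical, the sample pins are invented rather than taken from this repository, and the parser handles only simple `name` plus version-specifier lines, not the full requirements file syntax.

```python
import re
import sys

# A package name followed by an optional version specifier (simplified).
REQ_LINE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def python_is_supported(version=None, minimum=(3, 5)):
    """Return True when the (major, minor) version is strictly newer than `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) > minimum


def parse_requirements(text):
    """Map lower-cased package names to their version specifier ('' when unpinned)."""
    requirements = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = REQ_LINE.match(line)
        if match:
            requirements[match.group(1).lower()] = match.group(2).strip()
    return requirements


# Invented example content, not the repository's actual requirements.txt.
SAMPLE = """pandas==1.3.5
numpy>=1.21  # numerical arrays

scipy
"""

print(python_is_supported())
print(parse_requirements(SAMPLE))
```

In practice the dependencies themselves are usually installed with `pip install -r requirements.txt`, run from inside the folder of interest.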
Platform for Analytical Models in Epidemiology. (2022). GitHub directory: https://github.com/PAMepi/PAMepi_scripts_datalake.git. PAMepi/PAMepi_scripts_datalake: v1.0.0 (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.6384641
Note that each folder includes a DOI linked to the dataset processed by the team, which you can cite in your work.
This study was financed by:
* Bill and Melinda Gates Foundation and Minderoo Foundation HDR UK, through the Grand Challenges ICODA COVID-19 Data Science, with reference number 2021.0097
* Fiocruz Innovation Promotion Program - Innovative ideas and products - COVID-19, orders and strategies INOVA-FIOCRUZ, with reference number VPPIS-005-FIO-20-2-40.
[1] Platform for Analytical Models in Epidemiology - PAMEpi (2020).
[2] Platform for Analytical Models in Epidemiology - PAMEpi-Covid-19: Data (2020).
