I am developing the microData package to search, browse, and extract
metadata from microdata provided by the World Bank (WB), the Food and
Agriculture Organization (FAO), the International Household Survey Network
(IHSN), the United Nations High Commissioner for Refugees (UNHCR), and the
International Labour Organization (ILO) via the NADA API. Any researcher who
has used microdata from these organizations knows how difficult and
time-consuming it is to understand these data and import them, with their
variables, into R. If you use microdata or plan to, this R package could be a
lifesaver for you.
The purpose of microData is to simplify the process of extracting
complex metadata from data provided by various organizations, thereby
making data preparation more efficient. At the moment, it supports five
international organizations: the World Bank, FAO, UNHCR, IHSN,
and ILO. It can search, filter, and extract catalog entries, much as you
can on the web, but it cannot download the data
files themselves. To my knowledge, there is currently no
documentation for downloading data through the API. I suspect this is due to
data licensing, because few datasets are accessible through the API.

The package also helps you obtain the
names of variables from a specific survey, as well as their labels. It
lets you select only the variables you are interested in and
rename them, while assigning variable descriptions as label attributes,
so you can set custom names and labels for the dataset. Labels play a
crucial role when exporting tables and graphs, as they save you from
typing long names in manuscripts manually. This package aims to
remove all these difficulties.
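To make the role of labels concrete, here is a minimal base-R sketch that is independent of microData. The toy data frame, its labels, and the helper `get_labels()` are all made up for illustration; the only assumption is that a description is stored in a column's `"label"` attribute and can be read back when a table is printed.

```r
# Toy data frame; column names and labels are invented for illustration.
df <- data.frame(v1 = c(44, 48, 43), v4 = c(6395, 7402, 5497))

# Attach a description to each column as a "label" attribute.
attr(df$v1, "label") <- "Age of respondent"
attr(df$v4, "label") <- "Monthly salary ($)"

# Hypothetical helper (not part of microData): return each column's label,
# falling back to the column name when no label is set.
get_labels <- function(data) {
  vapply(names(data), function(nm) {
    lab <- attr(data[[nm]], "label", exact = TRUE)
    if (is.null(lab)) nm else lab
  }, character(1))
}

# Use the labels instead of the short names when printing a summary.
means <- colMeans(df)
names(means) <- unname(get_labels(df))
print(means)
```

The short names stay convenient for coding, while the labels supply readable headers only at the point of presentation.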
Warning: This package is still under development, so I don’t recommend using it in reproducible code yet; its functions may change in the future.
You can install the development version of microData from GitHub with:
# install.packages("devtools")
devtools::install_github("GutUrago/microData")

All organizations supported by this package use the NADA API to publish
microdata, so they share similar terminology. A collection is simply
a group of related studies or datasets. To see all
available collections, use the collections() function.
Note: I used a customized gt table theme that I created in this blog.
library(microData)
collections(org = "wb") |>
  head() |>
  my_gt_theme()

| id | repo_id | title |
|---|---|---|
| 26 | afrobarometer | Afrobarometer |
| 2 | datafirst | DataFirst , University of Cape Town, South Africa |
| 22 | dime | Development Impact Evaluation (DIME) |
| 1 | microdata_rg | Development Research Microdata |
| 4 | enterprise_surveys | Enterprise Surveys |
| 30 | fao | FAO - Food and Agriculture Microdata Catalog |
This package offers the same search flexibility as the web catalog. For more, see
the documentation for search_catalog().
search_catalog(
  keyword = "food",
  org = "unhcr",
  from = 2015,
  to = 2024,
  country = "Ethiopia",
  sort_by = "year",
  sort_order = "desc",
  results = 10
)

There is also a handy function to check the latest publications of these datasets.
latest_entries(org = "wb", limit = 15)

You can use data_files() to see the data files included in the study.
Let’s look at one of the popular surveys in the WB catalog. We can also use the numeric ID
of the study, which is 359, instead of its name (see the next code chunk).
data_files(id = "TZA_1991_KHDS_v01_M", org = "wb") |>
  head() |>
  my_gt_theme()

| id | sid | file_id | file_name | description | case_count | var_count |
|---|---|---|---|---|---|---|
| 81328 | 359 | F1 | Wave1_HH_S_____HH | Miscellaneous | 981 | 163 |
| 81329 | 359 | F2 | Wave1_HH_S00B_OTH | Section verification | 18258 | 16 |
| 81330 | 359 | F3 | Wave1_HH_S1___IND | Household Roster | 5373 | 25 |
| 81331 | 359 | F4 | Wave1_HH_S2___KID | Children Residing Elsewhere | 3394 | 28 |
| 81332 | 359 | F5 | Wave1_HH_S3___IND | Parents | 5298 | 27 |
| 81333 | 359 | F6 | Wave1_HH_S4___BUS | Overview of Household Businesses | 334 | 7 |
How about the variables included in a data file? Of course, you can check them as well.
variables(id = 359, file_id = "F3") |>
  head() |>
  my_gt_theme()

| uid | sid | fid | vid | name | labl |
|---|---|---|---|---|---|
| 265957 | 359 | F3 | V180 | cluster | Cluster |
| 265958 | 359 | F3 | V181 | hh | Household Number |
| 265959 | 359 | F3 | V182 | id | Individual ID Code in HH |
| 265960 | 359 | F3 | V183 | wave | Wave |
| 265961 | 359 | F3 | V184 | passage | Passage |
| 265962 | 359 | F3 | V185 | sex | S1Q2: Sex |
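A listing like this can be narrowed down to the variables you care about with ordinary subsetting. The sketch below rebuilds three columns of the table above by hand as a stand-in and assumes the result behaves like a regular data frame; the object names `vars`, `by_name`, and `by_label` are illustrative only.

```r
# Stand-in for the variable listing above (three of its columns),
# assuming the result behaves like an ordinary data frame.
vars <- data.frame(
  vid  = c("V180", "V181", "V182", "V183", "V184", "V185"),
  name = c("cluster", "hh", "id", "wave", "passage", "sex"),
  labl = c("Cluster", "Household Number", "Individual ID Code in HH",
           "Wave", "Passage", "S1Q2: Sex"),
  stringsAsFactors = FALSE
)

# Keep variables by exact name ...
by_name <- vars[vars$name %in% c("hh", "id", "sex"), ]

# ... or by a case-insensitive pattern in the label.
by_label <- vars[grepl("household|sex", vars$labl, ignore.case = TRUE), ]

by_label[, c("name", "labl")]
```

Matching on labels is often more practical than matching on names, since the names in survey files rarely describe the content.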
Variables in microdata are often given names that reflect only the question order and say nothing about their content, like this:
| id | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| 1 | 44 | male | master | 6395.007 |
| 2 | 48 | female | phd | 7402.144 |
| 3 | 43 | female | master | 5496.753 |
| 4 | 32 | female | phd | 4200.946 |
| 5 | 39 | male | master | 5391.046 |
| 6 | 47 | female | phd | 7186.892 |
You can then prepare another table that contains the metadata, like this. It will be explained in detail in vignettes later.
| var_id | var_name | label |
|---|---|---|
| id | individual_id | Respondent ID |
| v1 | age | Age of respondent |
| v2 | sex | Sex of respondent |
| v3 | education | Educational level |
| v4 | salary | Monthly salay ($) |
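For intuition, here is roughly what renaming and labelling from such a lookup table amounts to when done by hand in base R. This is a sketch under stated assumptions, not the package's implementation: `mdt` and `mtdt` are small stand-ins built from the first rows of the two tables above, and `rename_and_label()` is a hypothetical helper.

```r
# Stand-ins for the data and metadata tables above (first rows only).
mdt <- data.frame(
  id = 1:3,
  v1 = c(44L, 48L, 43L),
  v2 = c("male", "female", "female"),
  stringsAsFactors = FALSE
)
mtdt <- data.frame(
  var_id   = c("id", "v1", "v2"),
  var_name = c("individual_id", "age", "sex"),
  label    = c("Respondent ID", "Age of respondent", "Sex of respondent"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, NOT the package's implementation.
rename_and_label <- function(data, meta) {
  # Keep only the columns described in the metadata, in metadata order.
  keep <- meta[meta$var_id %in% names(data), ]
  out <- data[, keep$var_id, drop = FALSE]
  # Attach each description as a "label" attribute, then rename.
  for (i in seq_len(nrow(keep))) {
    attr(out[[i]], "label") <- keep$label[i]
  }
  names(out) <- keep$var_name
  out
}

labelled <- rename_and_label(mdt, mtdt)
str(labelled)
```

Writing this loop for every survey file gets tedious quickly, which is the chore the package is meant to take over.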
You can use the set_attributes() function to rename these variables and set
their labels.
my_data <- set_attributes(
  mdt,
  mtdt,
  old_name = var_id,
  new_name = var_name,
  label = label
)
head(my_data) |> my_gt_theme()

| individual_id | age | sex | education | salary |
|---|---|---|---|---|
| 1 | 44 | male | master | 6395.007 |
| 2 | 48 | female | phd | 7402.144 |
| 3 | 43 | female | master | 5496.753 |
| 4 | 32 | female | phd | 4200.946 |
| 5 | 39 | male | master | 5391.046 |
| 6 | 47 | female | phd | 7186.892 |
Labels are also assigned.
str(my_data)
#> 'data.frame': 100 obs. of 5 variables:
#> $ individual_id: int 1 2 3 4 5 6 7 8 9 10 ...
#> ..- attr(*, "label")= chr "Respondent ID"
#> $ age : int 44 48 43 32 39 47 40 34 49 43 ...
#> ..- attr(*, "label")= chr "Age of respondent"
#> $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 2 1 2 ...
#> ..- attr(*, "label")= chr "Sex of respondent"
#> $ education : Factor w/ 3 levels "bachelor","master",..: 2 3 2 3 2 3 1 1 1 1 ...
#> ..- attr(*, "label")= chr "Educational level"
#> $ salary : num 6395 7402 5497 4201 5391 ...
#> ..- attr(*, "label")= chr "Monthly salay ($)"

More coming soon!
