DMR Integrity: DMR Similarity Investigation

Evaluating the effectiveness of self-reporting and third-party reporting under the Clean Water Act.

Data

Filtered and processed data that can be used to recreate our analysis are stored in this Dropbox. This includes our filtered DMR matrices and other data needed to run the regressions.

Interim and processed data created by the pipeline are stored in the output_directory (specified in the config file), which is created locally when the pipeline is run.

Pipeline Structure

The pipeline has four primary steps. The data we make public can be used to run the last two steps:

  1. Pre-processing for the first- and third-party coding, in which NetDMR users are coded as either first or third parties.

  2. Pre-processing DMR data and creating the 'DMR matrix' for every state in our sample, which combines the NetDMR user data and parameter-level DMR data into a single flat file, with one row per DMR.

  3. For each state, sampling a given number of distinct pairs of DMRs and computing each pair's similarity for five different groups of DMRs: pairs of DMRs from the same permit; pairs from the same permit and same lab; pairs across different permits for each lab; pairs for a given permit with any other permit; and pairs for third parties and all of their permits.

  4. Analyzing differences between first- and third-party submissions, and identifying outlier third parties with unusual similarity-score distributions. We do this by calculating the rate of duplications (units: number of duplications per 1,000 permits) for all permits. We plot placebo distribution curves to determine the distribution of outliers. Lastly, for the outlier third parties, we compute the KL-divergence between a permittee's/lab's similarity distribution and the placebo distribution (see the sketch following this list).
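
To make step 4 concrete, below is a minimal sketch of the KL-divergence comparison, assuming similarity scores in [0, 1] are binned into a shared histogram; the function and variable names are illustrative, not the repo's actual implementation.

# Hypothetical sketch: KL(third party || placebo) over binned similarity scores.
import numpy as np
from scipy.stats import entropy

def kl_vs_placebo(party_scores, placebo_scores, n_bins=50):
    """KL-divergence of a third party's similarity distribution vs. the placebo."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    p, _ = np.histogram(party_scores, bins=bins)
    q, _ = np.histogram(placebo_scores, bins=bins)
    # Smooth empty bins so the divergence stays finite.
    p = p.astype(float) + 1e-9
    q = q.astype(float) + 1e-9
    return entropy(p, q)  # scipy normalizes p and q internally

A large divergence flags a third party whose similarity scores depart sharply from the placebo baseline.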

Pipeline Setup

Database logins

The pipeline currently requires a connection to a Postgres database.

You will need to set up a user with read-write privileges on the database you plan to use for the pipeline.

Enter these credentials into your own copy of the database connection config file:

cp configs/database_template.yaml configs/database.yaml
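
As a sanity check that the credentials work, here is a minimal Python sketch of connecting with psycopg2; the key names (host, port, dbname, user, password) are assumptions about the template's schema, so match them to whatever configs/database_template.yaml actually defines.

# Hypothetical connection test; key names in database.yaml are assumed.
import yaml
import psycopg2

with open("configs/database.yaml") as f:
    db = yaml.safe_load(f)

conn = psycopg2.connect(
    host=db["host"],
    port=db.get("port", 5432),
    dbname=db["dbname"],
    user=db["user"],
    password=db["password"],
)
print("connected")  # an exception above means the credentials are wrong
conn.close()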

Configuring the pipeline's settings

If you are connecting from your home PC (not currently supported), you will need to edit additional entries in the config.

Make a copy of the example configuration file to create your personal config.yml file:

cp configs/config_example.yml configs/config.yml

Modify the paths in configs/config.yml to align with your machine and set pipeline parameters. You can use the configuration file to tweak parameters (e.g., states/jurisdictions, minimum number of nonzero reported values required for DMR inclusion) as desired for different runs.
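
For orientation, a pipeline script might read these settings along the following lines; output_directory is a documented key per the Data section above, while the other key names here are illustrative guesses rather than the actual schema of config_example.yml.

# Hypothetical config load; only output_directory is a documented key.
import yaml

with open("configs/config.yml") as f:
    cfg = yaml.safe_load(f)

output_directory = cfg["output_directory"]       # created locally on first run
states = cfg.get("states", [])                   # jurisdictions to process (assumed key)
min_nonzero = cfg.get("min_nonzero_values", 1)   # DMR inclusion threshold (assumed key)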

Setting up environments

Prior to setup, install PostgreSQL, Python, and R:

brew install postgresql
brew install r
brew install python

For R setup, we need R version 4.3. For ARM Macs, install the pkg from https://mirror.las.iastate.edu/CRAN/bin/macosx/big-sur-arm64/base/R-4.3.3-arm64.pkg

We then use the environment manager rv; install it and sync the R packages as follows:

curl -sSL https://raw.githubusercontent.com/A2-ai/rv/refs/heads/main/scripts/install.sh | bash
rv sync

For the Python code, it is recommended to create a virtual environment and load these packages from the requirements file:

# create a virtual environment
python3 -m venv ./.venv
# activate it and install the required Python packages
source .venv/bin/activate
pip3 install -r requirements.txt

Downloading/preparing raw data:

Note: much of this data is not public; contact the authors with questions. To run the pre-processing part of the pipeline, the easiest way to get the data is to download the full directory and save it to the path specified in the config's shared_data_directory entry. At minimum, the pipeline requires the following subdirectories and files:

  • EPA_netdmr_metadata
  • labels_cleaned
  • Oct2023-ICIS-NPDES
  • intercoder_agreement
  • icis_permits.csv
  • icis_facilities.csv
  • npdes_sics.csv
  • gini_calculation.csv
  • excluded_params_high_null_limit_prop.csv
  • filter-limit-unit-desc.csv
  • parameter_code_lookupcsv.csv
  • statistical_base_code_descriptions.csv
  • dmr_integrity_summary_ext_20200908.csv
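
Before launching, a quick pre-flight check along these lines can confirm the inputs are in place; the snippet and the abbreviated file list are illustrative (extend required to the full list above).

# Hypothetical pre-flight check for the shared data directory.
from pathlib import Path
import yaml

with open("configs/config.yml") as f:
    cfg = yaml.safe_load(f)

shared = Path(cfg["shared_data_directory"])
required = [
    "EPA_netdmr_metadata",
    "labels_cleaned",
    "icis_permits.csv",
    "icis_facilities.csv",
]  # extend with the remaining entries above
missing = [name for name in required if not (shared / name).exists()]
if missing:
    raise FileNotFoundError(f"missing inputs in {shared}: {missing}")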

Running the Pipeline

Run the full pipeline

Calling this shell script will run the entire pipeline: preprocessing, duplicate creation, and figure plotting.

sh scripts/run_full_pipeline configs/config.yml

Pre-processing code:

For this repo, third_party_dmr.Rproj is the project file to use in RStudio.

Once configured appropriately using configs/config.yml, the first portion of the pipeline (written in R) can be run via scripts/pre_process.R. This script sources the other scripts in steps 0 and 1, and it can be run within RStudio within the third_party_dmr project. Alternatively, it can be run from the command line; from within the third_party_dmr directory, run the following:

Rscript scripts/pre_process.R configs/config.yml

Once you have the results in the output_directory (specified in config.yml), run the script for step 2 of the pipeline.

Creating duplicates and tables

The second part of the pipeline creates SQL tables and runs the combination/distance calculations using them. Running the following bash script will accomplish this:

sh scripts/run_duplicate_pipeline.sh configs/config.yml

This part of the pipeline takes some time (a few hours for all states) and can be quite memory- and CPU-intensive (when computing the Levenshtein distances). For this reason, I usually remote into lc-r to run this part of the pipeline rather than using my personal computer.
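
To make the cost concrete, here is a minimal sketch of the kind of pairwise comparison involved, assuming each DMR is flattened to a string of its reported values and compared with a normalized Levenshtein ratio; the function name and the python-Levenshtein dependency are illustrative, not the repo's actual implementation.

# Hypothetical Levenshtein-based similarity between two flattened DMRs.
# Requires: pip install python-Levenshtein
import Levenshtein

def dmr_similarity(dmr_a: str, dmr_b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical reports."""
    return Levenshtein.ratio(dmr_a, dmr_b)

Comparing every distinct pair within a group scales quadratically with the number of DMRs, which is why a full state's run is memory- and CPU-heavy.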

Creating tables and figures

Finally, the pipeline will create the tables and figures present in the draft. This can be accomplished by running the bash script:

sh scripts/create_tables_and_figures.sh configs/config.yml

Running this will require some extra tables in the database:

  • uncertain_permits_table: contains the permits in the uncertain_permits.csv made earlier in the pipeline, for filtering purposes
  • npdes_sic_codes: contains the NPDES permit SIC codes from ICIS
  • icis_permits: permit data from ICIS
  • icis_facilities: facility data from ICIS
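
If you need to load these yourself, one hypothetical route is pandas plus SQLAlchemy, pointing at the CSVs listed in the raw-data section; the connection-string fields reuse the assumed database.yaml keys from above.

# Hypothetical loader for one of the extra tables.
import pandas as pd
import yaml
from sqlalchemy import create_engine

with open("configs/database.yaml") as f:
    db = yaml.safe_load(f)

engine = create_engine(
    f"postgresql://{db['user']}:{db['password']}@{db['host']}/{db['dbname']}"
)
pd.read_csv("icis_permits.csv").to_sql(
    "icis_permits", engine, if_exists="replace", index=False
)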
