Evaluating the effectiveness of self-reporting and third-party reporting under the Clean Water Act.
Filtered and processed data that can be used to recreate our analysis are stored in this Dropbox. This includes our filtered DMR matrices and other data needed to complete the regressions.
Interim and processed data created by the pipeline are stored in the output_directory (specified in the config file), which is created locally when the pipeline is run.
The pipeline has four primary steps. The data we make public can be used to run the last two steps:
- Pre-processing for the first- and third-party coding, in which NetDMR users are coded as either first or third parties.
- Pre-processing the DMR data and creating the 'DMR matrix' for every state in our sample, which combines the NetDMR user data and parameter-level DMR data into a single flat file, with one row per DMR.
- For each state, drawing a given number of distinct pairs of DMRs and computing each pair's similarity, for five different groups of DMRs: pairs of DMRs from the same permit, pairs of DMRs from the same permit and same lab, pairs of DMRs across different permits for each lab, pairs of DMRs for a given permit with any other permit, and pairs of DMRs for third parties across all of their permits. (A sketch of the pair-similarity idea follows this list.)
- Analyzing differences between first- and third-party submissions, as well as identifying outlier third parties with unusual similarity-score distributions. We do this by calculating the rate of duplications (units: duplications per 1,000 permits) for all permits. We plot placebo distribution curves to establish the baseline against which outliers are identified. Lastly, for the outlier third parties, we compute the KL divergence between a permittee's/lab's similarity distribution and the placebo distribution. (A sketch of the KL-divergence calculation also follows this list.)
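To make the pair-similarity step concrete, here is a minimal Python sketch. The pipeline itself computes Levenshtein distances inside Postgres; the function names, the use of difflib as a stand-in similarity measure, and the toy rows below are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch: sample distinct DMR pairs and score each pair's similarity.
# Assumptions (not from the repo): each DMR is flattened to one string of
# reported values, and difflib stands in for the Levenshtein distance the
# pipeline computes in Postgres.
import itertools
import random
from difflib import SequenceMatcher

def pair_similarity(dmr_a: str, dmr_b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two DMR rows are identical."""
    return SequenceMatcher(None, dmr_a, dmr_b).ratio()

def sample_pair_similarities(dmrs: list[str], n_pairs: int, seed: int = 0) -> list[float]:
    """Draw up to n_pairs distinct DMR pairs and score each one."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(len(dmrs)), 2))
    chosen = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    return [pair_similarity(dmrs[i], dmrs[j]) for i, j in chosen]

# Toy example: three flattened DMR rows from one permit.
dmrs = ["1.2|0.4|<0.1|7.8", "1.2|0.4|<0.1|7.8", "0.9|0.5|<0.1|6.1"]
print(sample_pair_similarities(dmrs, n_pairs=3))
```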
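The KL-divergence step can be sketched the same way. The binning, the smoothing constant, and the simulated data below are our illustrative assumptions; the repo's actual implementation may differ.

```python
# Minimal sketch: KL divergence between a lab's similarity-score
# distribution and the placebo distribution. Bin count, smoothing,
# and the simulated inputs are illustrative assumptions.
import numpy as np

def kl_divergence(scores: np.ndarray, placebo: np.ndarray,
                  bins: int = 20, eps: float = 1e-9) -> float:
    """KL(P || Q), where P is the lab's histogram and Q is the placebo's."""
    edges = np.linspace(0.0, 1.0, bins + 1)   # similarity scores live in [0, 1]
    p, _ = np.histogram(scores, bins=edges)
    q, _ = np.histogram(placebo, bins=edges)
    p = (p + eps) / (p + eps).sum()           # smooth empty bins, then normalize
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
placebo = rng.beta(2, 5, size=10_000)             # stand-in placebo scores
suspect = np.clip(placebo[:500] + 0.3, 0.0, 1.0)  # shifted, higher-similarity lab
print(kl_divergence(suspect, placebo))            # a large value flags an outlier
```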
The pipeline currently requires a connection to a Postgres database to run.
You will need to set up a user with read-write privileges on the database you are using for the pipeline.
Enter these credentials into your own copy of the database connection config file:
```
cp configs/database_template.yaml configs/database.yaml
```

You will need to edit other entries in here if you are connecting from your home PC (not currently supported).
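For orientation, a hypothetical configs/database.yaml might look like the following; the real key names come from database_template.yaml, so treat every key and value below as a placeholder:

```yaml
# Hypothetical example; the actual keys are defined in configs/database_template.yaml.
host: localhost
port: 5432
dbname: third_party_dmr   # placeholder database name
user: dmr_rw              # the read-write user you created above
password: "<your-password>"
```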
Make a copy of the example configuration file to create your personal config.yml file:
```
cp configs/config_example.yml configs/config.yml
```

Modify the paths in configs/config.yml to align with your machine and set pipeline parameters. You can use the configuration file to tweak parameters (e.g., states/jurisdictions, minimum number of nonzero reported values required for DMR inclusion) as desired for different runs.
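As a hedged sketch of what configs/config.yml might contain: only output_directory and shared_data_directory are named elsewhere in this README; the other key names below are assumptions standing in for the parameters described above, so copy config_example.yml for the real ones.

```yaml
# Hypothetical sketch; copy configs/config_example.yml for the real key names.
output_directory: /path/to/output            # created locally on first run
shared_data_directory: /path/to/shared_data  # downloaded data lives here
states: ["OH", "KY"]                         # states/jurisdictions to process (assumed key name)
min_nonzero_values: 5                        # minimum nonzero reported values per DMR (assumed key name)
```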
Prior to setup, install PostgreSQL, Python, and R:
```
brew install postgresql
brew install r
brew install python
```
For R setup, we need the correct R version, which is 4.3. For ARM Macs, install the pkg from https://mirror.las.iastate.edu/CRAN/bin/macosx/big-sur-arm64/base/R-4.3.3-arm64.pkg.
We then use the environment manager rv to install R packages; set it up and sync as follows:
```
curl -sSL https://raw.githubusercontent.com/A2-ai/rv/refs/heads/main/scripts/install.sh | bash
rv sync
```

For the Python code, it is recommended to create a virtual environment and install the packages from the requirements file:
```
# create a virtual environment
python3 -m venv ./.venv
source .venv/bin/activate
pip3 install -r requirements.txt
```

Note that much of the data described below is not made public; contact the authors with questions.
To run the pre-processing part of the pipeline, the easiest way to get the data is to download the full directory and save it to the path specified in the config's shared_data_directory entry. To run the pipeline you will need, at minimum, the following subdirectories and files (a sketch for verifying them follows the list):
- EPA_netdmr_metadata
- labels_cleaned
- Oct2023-ICIS-NPDES
- intercoder_agreement
- icis_permits.csv
- icis_facilities.csv
- npdes_sics.csv
- gini_calculation.csv
- excluded_params_high_null_limit_prop.csv
- filter-limit-unit-desc.csv
- parameter_code_lookupcsv.csv
- statistical_base_code_descriptions.csv
- dmr_integrity_summary_ext_20200908.csv
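Before running anything, a quick sanity check like the following (our own convenience sketch, not part of the repo) can confirm the download landed where config.yml expects it; only the shared_data_directory key is read from the config:

```python
# Convenience sketch (not part of the repo): verify the required inputs exist
# under the config's shared_data_directory before running the pipeline.
import sys
from pathlib import Path
import yaml  # PyYAML, already used for the config files

REQUIRED = [
    "EPA_netdmr_metadata", "labels_cleaned", "Oct2023-ICIS-NPDES",
    "intercoder_agreement", "icis_permits.csv", "icis_facilities.csv",
    "npdes_sics.csv", "gini_calculation.csv",
    "excluded_params_high_null_limit_prop.csv", "filter-limit-unit-desc.csv",
    "parameter_code_lookupcsv.csv", "statistical_base_code_descriptions.csv",
    "dmr_integrity_summary_ext_20200908.csv",
]

config = yaml.safe_load(Path("configs/config.yml").read_text())
root = Path(config["shared_data_directory"])
missing = [name for name in REQUIRED if not (root / name).exists()]
if missing:
    sys.exit(f"Missing from {root}: {', '.join(missing)}")
print("All required inputs found.")
```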
Calling this shell script will run the entire pipeline: preprocessing, duplicate creation, and figure plotting.
```
sh scripts/run_full_pipeline configs/config.yml
```

For this repo, third_party_dmr.Rproj is the project file to use in RStudio.
Once configured appropriately using configs/config.yml, the first portion of the pipeline (written in R) can be run via scripts/pre_process.R. This script sources the other scripts in steps 0 and 1, and it can be run inside the third_party_dmr project in RStudio. Alternatively, it can be run from the command line; from within the third_party_dmr directory, run the following:
```
Rscript scripts/pre_process.R configs/config.yml
```

Once you have the results in the output_directory (specified in config.yml), you need to run the script for step 2 of the pipeline.
The second part of the pipeline will create SQL tables and run the combination/distance calculations using these tables. Running the following bash script will accomplish this:
```
sh scripts/run_duplicate_pipeline.sh configs/config.yml
```

This part of the pipeline takes some time (a few hours for all states) and can be quite memory- and CPU-intensive (when computing the Levenshtein distances). For this reason, I usually remote into lc-r and use that to run this part of the pipeline, rather than my personal computer.
Finally, the pipeline will create the tables and figures present in the draft. This can be accomplished by running the bash script:
```
sh scripts/create_tables_and_figures.sh configs/config.yml
```

Running this will require some extra tables in the database (a loading sketch follows the list):
- uncertain_permits_table: contains the permits in the uncertain_permits.csv made earlier in the pipeline, used for filtering
- npdes_sic_codes: contains the NPDES permit SIC codes from ICIS
- icis_permits: permit data from ICIS
- icis_facilities: facility data from ICIS
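If any of these tables are missing, something along these lines can bulk-load one from its CSV. This is a hedged sketch, assuming the table already exists with columns matching the CSV header and that configs/database.yaml uses the placeholder keys shown earlier in this README:

```python
# Hedged sketch: bulk-load one of the extra tables from its CSV.
# Assumes the table already exists with columns matching the CSV header
# and that configs/database.yaml uses the keys shown earlier.
import psycopg2
import yaml

cfg = yaml.safe_load(open("configs/database.yaml"))
conn = psycopg2.connect(host=cfg["host"], port=cfg["port"],
                        dbname=cfg["dbname"], user=cfg["user"],
                        password=cfg["password"])
# The connection context manager commits on success; COPY skips the header row.
with conn, conn.cursor() as cur, open("icis_permits.csv") as f:
    cur.copy_expert(
        "COPY icis_permits FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.close()
```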