Brian Everett Peterson edited this page Jun 1, 2014 · 12 revisions

Introduction

Many files together make up the structure of this project, and as a recent project refactoring showed (#342), if any of them breaks, the scraper may not run, the database may not be updated, or the API may go down, among other things.

The goal of this document is to get all of this critical project infrastructure out in the open. The author hopes it will also help those new to the project come to an understanding of it more rapidly, as he once had to do.

This document also begins to fill a related gap in our documentation: our deployment setup and commands. The implementation of our deployment commands reflects how we have decided to interact with our project and its structure.

Logical parts and entities

There are chiefly two logical parts to this project. For the moment, we will look only at the v1 version of the project, coming back to recent changes and design decisions later. The logical parts of v1 are:

  • the scraper;
  • and the API

With this small summary of the project's scope in mind, an initial understanding of the way this application is wired together would be as follows.

Each day, the scraper, running from our production server, accesses the Cook County Jail Sheriff's Inmate Locator website. By some means, it gets the data it collects from there and updates our production database, also on the server. Meanwhile, an API, also running on the server, accesses the database and by some means makes this data available on our public website.

As you can see, logical structure notwithstanding, all the code that is executed as a part of this project is executed on our server. The only other entities that require mention are our public website--which is simply another way of talking about the server--and the website of the Cook County Sheriff's Office. We shall see that there are at least a few other entities of note, chiefly: clients requesting our data through the API, and our developers deploying new code to the server.

Putting things together

Now that we have set out the scope of this project and its parts, we can start to fill in some of the details on how the parts come together. As I tried to show in the previous section, the flow of information is basically from the Sheriff's Website outward, and the information enters into our control through the Scraper. As such, we will start there in exploring how things work.

Scraper

So then, the scraper runs each day at 8 a.m. Why? What tells it to do so? An entry in the server's crontab. Specifically, cron looks for scripts/cron.sh in our project root on the server, and pipes the resulting output to a log file called cookcountyjail-scraper.log in the logs/ directory of the server's home.
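For a rough idea of the shape of that entry, a crontab line matching this description might look like the following. This is a sketch only: the 8 a.m. schedule comes from the text above, but the project path is a placeholder, and the real entry lives in the server's crontab.

```
# Hypothetical crontab entry -- verify against the server's actual crontab.
# Runs cron.sh at 8:00 every day, appending all output to the scraper log.
0 8 * * * /path/to/cookcountyjail/scripts/cron.sh >> /home/deploy/logs/cookcountyjail-scraper.log 2>&1
```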

cron.sh in turn calls scraper.sh, in the same directory. It also creates a log file called v1-cookcountyjail-scraper-YY-MM-DD.log. cron.sh also starts a second "v2 scraper", but we're leaving that for later.

Now the most important thing that scraper.sh does is of course to start the scraper. However, it is actually a long script responsible for most of the administrative tasks that have to be done on the server. It just so happens that most of these are done before or after the scraper runs. For a full list of administrative tasks, see this work-in-progress list, or better yet read scraper.sh yourself.

So what does running the scraper consist of? For now, it is a Django management command. In practice, this means that calling it goes through the Django countyapi application, and that python manage.py ng_scraper will get the scraper started. For reference, the code responsible for doing the scraping is hidden deep within our application: countyapi/management/commands/ng_scraper.py.

The ng_scraper command calls code in the /scraper/ directory. I said that the scraper updates the production database. The scraper's code is complex in its own right, and is divided into parts with different responsibilities. The part that handles direct interaction with the Django ORM is mostly in scraper/inmate.py, with a little more in the charges.py, housing_location_info.py, and court_location_info.py files. For more details, you will need to read the scraper's documentation, and perhaps start reading the code that makes up the scraper. If you end up doing this, use the wiki as a guide, but know that the database-specific calls are all made in the files mentioned above.
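At its core, "updating the production database" is an update-or-create pattern: look up each inmate record, insert it if it is new, and refresh it otherwise. The real code does this through the Django ORM in scraper/inmate.py; the following is only a stdlib sketch of the same pattern against a throwaway SQLite database, with invented table and column names.

```python
# Illustrative only: the real scraper uses the Django ORM, not raw SQL.
# Table and column names here are assumptions for the sake of the example.
import sqlite3

def update_or_create_inmate(conn, jail_id, booking_date):
    """Insert the inmate if new, otherwise refresh the existing row."""
    row = conn.execute(
        "SELECT id FROM inmate WHERE jail_id = ?", (jail_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO inmate (jail_id, booking_date) VALUES (?, ?)",
            (jail_id, booking_date),
        )
        return "created"
    conn.execute(
        "UPDATE inmate SET booking_date = ? WHERE jail_id = ?",
        (booking_date, jail_id),
    )
    return "updated"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inmate (id INTEGER PRIMARY KEY, jail_id TEXT, booking_date TEXT)"
)
print(update_or_create_inmate(conn, "2014-0401001", "2014-04-01"))  # created
print(update_or_create_inmate(conn, "2014-0401001", "2014-04-02"))  # updated
```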

API

The nature of a server is that it is usually running at all times. Similarly, our API, which runs on the server and actually makes available the data the scraper has stored in the database, is generally always running. The API is thus configured to run when the server starts. This is achieved with the config/upstart.conf file -- a configuration file for Ubuntu's upstart init system. upstart.conf calls gunicorn.sh from the scripts directory, and registers a system service called cookcountyjail. Because we register a service, running service cookcountyjail status in a terminal on the server (or any of the other service commands) will report the status of the Gunicorn application server.
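For readers unfamiliar with upstart, a job file of this kind typically has the following shape. This is a generic sketch, not the contents of our config/upstart.conf -- the paths are placeholders.

```
# Hypothetical upstart job sketch -- see config/upstart.conf for the real file.
description "cookcountyjail Gunicorn application server"

start on runlevel [2345]   # start at boot
stop on runlevel [016]     # stop at halt/reboot
respawn                    # restart the server if it dies

exec /path/to/cookcountyjail/scripts/gunicorn.sh
```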

Now gunicorn.sh actually launches our Django app under the Gunicorn HTTP server, using the application object defined in countyapi/wsgi.py. It runs the application on port 8000. Since the standard web port is 80, this is not yet publicly accessible; the reason we launch on port 8000 is so that we can run the app behind a proxy server. Nginx is our proxy server, and two configuration files are defined for it: config/nginx.conf.master, and nginx-v1.conf in the same directory. nginx-v1.conf tells Nginx to forward requests to the port Gunicorn is serving on. Note that, like the cookcountyjail service, Nginx starts when the server does, though we do not have a special configuration file defined for that.
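This is the standard reverse-proxy pattern: Nginx listens on the public port 80 and relays API traffic to Gunicorn on 8000. A generic sketch of such a server block follows -- the real settings live in config/nginx.conf.master and nginx-v1.conf, and may differ.

```
# Hypothetical Nginx server block -- illustrative only.
server {
    listen 80;

    location /api/ {
        proxy_pass http://127.0.0.1:8000;   # hand off to Gunicorn
        proxy_set_header Host $host;         # preserve the original host header
    }
}
```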

So then, putting it all together, when someone navigates to our API to access data, their request is routed through Nginx. If it has the correct URL (/api/1.0/?format=json, for example), it goes through to the Gunicorn application server, which passes the request off to the Django app that runs our API.

Inside the Django app, countyapi/urls.py is our root URL configuration. It listens for URLs beginning with /api/, and routes these requests to the Django-Tastypie API, which is a model-based resource definition. The API itself is configured in countyapi/api.py, and gets its model definitions from countyapi/models.py. These models are our interface with the Django ORM, which handles actually accessing the production database, and thereby determines what the API serves.
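To make "model-based resource definition" concrete, a Tastypie resource is typically declared like the sketch below. This is a hypothetical fragment, not the contents of countyapi/api.py: the model name is an assumption, and the code only runs inside a configured Django project with Tastypie installed.

```python
# Hypothetical sketch -- see countyapi/api.py and countyapi/urls.py for the
# real definitions. Not runnable outside a configured Django project.
from tastypie.resources import ModelResource
from countyapi.models import CountyInmate  # model name is an assumption

class CountyInmateResource(ModelResource):
    class Meta:
        queryset = CountyInmate.objects.all()  # the ORM supplies the data
        allowed_methods = ['get']              # a read-only API
```

The root URL configuration then mounts resources like this one under the /api/ prefix, which is how a request for /api/1.0/?format=json ends up answered from the database.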

Deployment

One thing this description of our project leaves open is how exactly all these files get used. Many of them are part of a Django application, which knows how to handle itself. But some of the configuration files are simply kept up to date in our GitHub repo, a process which will not by itself change the way the server is running. The code responsible for getting all these little bits of glue where they need to be on our server is the deployment command(s), defined in fabfile.py using Fabric (a command-line tool for application deployment).
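For a sense of what such a command looks like, a Fabric 1.x task has roughly the shape below. This is a hypothetical sketch, not our actual fabfile.py: the host, paths, and steps are placeholders, and it only runs with Fabric installed.

```python
# Hypothetical Fabric 1.x task sketch -- the real tasks live in fabfile.py.
from fabric.api import cd, env, run, sudo

env.hosts = ['our-production-server']  # placeholder host

def deploy():
    """Pull the latest code and copy config files into place."""
    with cd('/path/to/cookcountyjail'):          # placeholder project path
        run('git pull')                          # update the code on the server
        sudo('cp config/nginx-v1.conf /etc/nginx/sites-enabled/')
        sudo('service nginx reload')             # pick up the new config
```

A task like this would be invoked from a developer's machine as fab deploy, which is what finally moves the configuration files out of the repo and into the places the server reads them from.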

[Talk about deployment, where the apps are on the server, and how all these files get where they are needed]

Our current deployment mechanisms fall short in a few places. These include: defining what is in the server's crontab, defining where the static files are on the server...

Version 2.0

The 2nd version of our API and project, currently in development, re-uses or mirrors a lot of the critical infrastructure we have already gone over. Briefly, then:

[Talk about multiple apps in 2.0]

Future

[Talk about proposed new infrastructure, like POSTing to the v2 API]
