crime_data_api
==============

This project is the back end for the Crime Data Explorer
([18f/crime-data-explorer](https://github.com/18f/crime-data-explorer)).
The Crime Data Explorer is a website that allows law enforcement and
the general public to more easily access uniform crime data. The FBI
collects this data from state and local law enforcement agencies and
publishes it in the form of annual reports. This API allows users to
look at multiple years of data as well as other information only
provided as a master file from the FBI.

This project is a Python Flask app that uses SQLAlchemy to wrap a
Postgres database.

The Sample Database
-------------------

The full database of UCR crime data is too large to include in this
repository or as a download: it runs to many hundreds of gigabytes and
comprises many tables that are linked together by primary and foreign
key constraints. Instead, this project includes a small excerpt of the
database that mainly comprises data for Rhode Island in 2014. This
slice means that API endpoints will return some data and that we can
run our tests, but you will have to load the full database to see more
data.

See the section "Using The Full Database" below for information about
loading the full database and the tasks we run after loading to add
additional data and derive new tables.

Note: if you have a full database running somewhere on cloud.gov or
the like, it's possible to set up an SSH tunnel to use that database
instead. It's recommended you don't do this for development, though.

Quickstart
----------

First, let's set some environment variables that will be useful. You
might want to define these within a `.env` file that can be used by
`autoenv`:

``` sh
export FLASK_APP=/path/to/autoapp.py
export FLASK_DEBUG=1
export CRIME_DATA_API_DB_URL="postgres:///crime_data_api_dev"
export APP_NAME=crime-data
```

You can run the following commands to bootstrap your environment and
load the testing database into Postgres:

``` sh
git clone https://github.com/18F/crime-data-api
cd crime-data-api
pip install -r requirements/dev.txt
createdb crime_data_api_dev
psql crime_data_api_dev < dba/crime_data_api_test.sql
flask run
```

You can then find the Swagger UI API documentation page at
[http://127.0.0.1:5000/](http://127.0.0.1:5000/). To test out the API,
make a request to
[http://127.0.0.1:5000/agencies](http://127.0.0.1:5000/agencies).

Deployment
----------

The app is continuously deployed to [cloud.gov](https://www.cloud.gov)
with [CircleCI](https://circleci.com/gh/18F/crime-data-api) on every
commit to `master`.

Tagged releases must be deployed manually using the
`manifests/demo.yml` manifest file. You can use the following steps
once you have authenticated with cloud.gov:

1. `git checkout <version tag>`, e.g. `git checkout v2017-04-12`
2. `cf push -f manifests/demo.yml`

In production environments, make sure the `FLASK_DEBUG` environment
variable is unset or is set to `0`, so that `ProdConfig` is used.

Shell
-----

To open the interactive shell, run:

``` sh
flask shell
```

By default, you will have access to the Flask `app`.

Running Tests
-------------

Tests are written in the [py.test](https://docs.pytest.org/en/latest/)
testing framework, with additional support from
[webtest](http://docs.pylonsproject.org/projects/webtest/en/latest/)
for functional testing of the API.
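To run the suite locally, invoking py.test from the project root
should suffice. This is a sketch, assuming the dev requirements above
are installed; the single-module path shown is illustrative:

``` sh
# run the full test suite from the project root
pytest

# or target a single test module while developing (path is hypothetical)
pytest tests/test_endpoints.py
```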
As part of its run, py.test will also output coverage statistics for
the tests.

Swagger
-------

To make it easier to see what the APIs can do, this project includes
endpoints that return [Swagger](http://swagger.io/) documentation for
the API. These documents are not yet complete or consistent and can
not yet be used to automatically generate API clients or to validate
API responses. But they do let you test out the API and see roughly
what the valid behavior is.

Unlike some other projects that dynamically generate their Swagger
from decorators applied to methods, we are using a static
`swagger.json` file for this project. This means that developers will
have to update the swagger file for any new methods or changes to the
responses of existing methods. This is admittedly not immediately
convenient, but it lets us treat the `swagger.json` file as a contract
of what the API responses should be and use it in functional tests to
ensure our responses conform to what is described.

We are using the [flex](https://github.com/pipermerriam/flex) Swagger
validator in functional tests to verify that responses match the
Swagger document. To add such a check, write your functional test like
the following:

``` python
from flex.core import validate_api_call


class TestCountsEndpoint:
    def test_counts_matches_swagger(self, testapp, swagger):
        res = testapp.get('/counts')
        validate_api_call(swagger, raw_request=res.request, raw_response=res)
```

The `swagger` fixture in py.test loads
`crime_data/static/swagger.json` into a Flex object so we can use it
for validating that requests and responses match what is in the
Swagger schema.
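A minimal sketch of what that fixture might look like in a
`conftest.py`, assuming flex's `load` helper (the fixture actually
defined in this repo may differ in name or scope):

``` python
# conftest.py (sketch): parse the static Swagger document once per session
import pytest
from flex.core import load


@pytest.fixture(scope='session')
def swagger():
    # Returns a Flex schema object usable with validate_api_call
    return load('crime_data/static/swagger.json')
```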
Using The Full Database
-----------------------

If you have loaded a full database from the FBI into production, there
are still some other steps you must run to use the data with the CDE.
This is because the CDE does not use many of the tables within the UCR
database directly, but instead has its own tables that are derived
from and supplement the default database. There are several general
types of such tables used by the CDE API:

1. Additional data tables load some data that is not provided within
   the UCR database itself. For instance, estimated crime counts for
   the nation and states, crime figures for the territories, and
   corrected cargo theft incidents are all examples of external tables
   that are also loaded into the database.
2. Denormalized tables pull together data from multiple tables into a
   single record. The UCR database is in a highly normalized format
   that mitigates redundancy to reduce error, but it requires the
   database to join across dozens of tables to answer any API call,
   which is not performant.
3. Aggregate tables collect the monthly statistics in Return A and
   related reports from agencies and sum them at the annual level.
   Similarly, we create annual counts from related monthly reports
   like arson as well as a few incident-level reports like hate crimes
   or cargo theft.
4. NIBRS counts are a special type of aggregate table that assembles
   ETL rollups of crime counts for specific dimensions within NIBRS
   incidents and related tables. For instance, these tables can tell
   you the annual counts by race code or sex of all victims of
   aggravated assault for a given agency, or the general locations
   where crimes occur in a state. Because they are both specialized
   and potentially large, there are many distinct tables for this
   information and different API endpoints for accessing them.

For a more detailed view of what tables are added to the stock UCR
database, see the related document CDE_TABLES.md.

The important thing here is how to load them. We have a script called
`after_load.sh` that can be run to generate the necessary CDE tables.
It's meant to be idempotent and can be run multiple times safely. Some
tables may take a long time to build. Note that the script will output
a fair amount of debugging text under normal operation.

To run the after_load tasks, first you need to set an environment
variable with the URL of the production database. The easiest way to
do this is to set up a tunnel using the
[cf-service-connect](https://github.com/18F/cf-service-connect)
plugin, which you must install first. You can then run it with the
name of a running app and the name of its database service to get a
list of connection values:

``` sh
cf connect-to-service --no-client crime-data-api crime-data-upload-db
export CRIME_DATA_API_DB_URL=postgres://USERNAME:PASSWORD@localhost:PORT/DBNAME
```

Then, in another terminal, run `psql $CRIME_DATA_API_DB_URL` to
connect to the database through the tunnel. Make sure, of course, that
you aren't applying any of these steps to a production database
serving API requests to the public, since these operations will
certainly cause some rolling errors and downtime (it's best to work
with a snapshot).

To augment the default UCR database with CDE tables, run the following
steps:

``` sh
cd dba/after_load/ext
./load_external.sh
cd ../..
./after_load.sh
```

Note that while the load scripts are generally idempotent, the
external data load renames the existing cargo theft tables to preserve
them before loading revised data. This may produce an error if you
attempt to run it again.

Also, please note that `after_load.sh` runs a number of queries that
might be quite slow. It might be better to run the commands inside it
individually if you have a lower-powered database instance.

Updating Police Agency Names
----------------------------

The master list of edited police agency names is located in
`dba/after_load/ext/agency_name_edits.csv`. To update the names of
agencies, make changes to the `edited_agency_name` column in the CSV.
Then do the following with an open tunnel to the production database:

``` sh
cd dba/after_load/ext
psql $CRIME_DATA_API_DB_URL < update_agency_names.sql
```

This will load the agencies CSV from your machine and then update the
few tables where the agency name is displayed. Don't be alarmed if the
same number of rows is not updated in every table, since some agencies
may not be in various derived tables or may be present multiple times
in others.

Don't forget to check in your changes to `agency_name_edits.csv` and
file a pull request when you are done.

Security Scans
--------------

This repository uses the [bandit](https://github.com/openstack/bandit)
tool to run automated static analysis of the project code for
potential vulnerabilities. These scans run automatically as part of
continuous integration to identify potential vulnerabilities when they
are introduced in pull requests.

You can run bandit locally with the following command:

``` shell
bandit -r .
```

In some cases, bandit will identify false positives: code that looks
like it could be a security vulnerability but that will likely never
be triggered in a production environment. To disable reporting of
these findings, you can append a `#nosec` comment on the line of code
where the vulnerability was identified, as in the example below.
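For example, a suppression might look like this (a minimal,
hypothetical snippet; the trailing `# nosec` comment is the part
bandit actually recognizes):

``` python
import subprocess

# bandit would normally flag the use of shell=True here; the trailing
# comment suppresses the finding on this line only
subprocess.call("ls -l", shell=True)  # nosec
```

Use this sparingly, and only after confirming that the flagged line is
genuinely safe in practice.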