# Documentation
This document is for people who are trying to stand up an instance of Hamlet on localhost in order to write code. It assumes you are generally familiar with setting up development environments (for instance, that you can install Python dependencies and stand up local Postgres).

## Standing up Hamlet
You will need:
* git
* pipenv (https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv)
* the Heroku CLI
* postgres

Then follow these steps:

* Get the code and dependencies
  * `git clone https://github.com/MITLibraries/hamlet.git`
  * `cd hamlet`
  * `pipenv install`
    * If you want to do neural net training or OCR source files, there are additional non-pip dependencies; see below.
  * `pipenv shell`
* Set up your postgres database
  * Create a database
    * The name of this database should be `hamlet`, or else set an environment variable `DJANGO_DB` with its name
  * Create a database user
    * The name of this user should be `hamlet`, or else set an environment variable `DJANGO_DB_USER` with its name
  * Grant all privileges on your database to your user
  * Set an environment variable `DJANGO_DB_PASSWORD` with your database user's password
  * `python manage.py migrate`
  * Ask Andy or Andromeda for a data dump to populate the db.
* Set an environment variable `DJANGO_SETTINGS_MODULE=hamlet.settings.local`
* `python manage.py createsuperuser` and follow the prompts - this will let you log in at `/admin`
* `python manage.py collectstatic --noinput`
* `heroku local`

## Tests
Run tests with `python manage.py test --settings=hamlet.settings.test`.

This ensures that they use the test neural net. The primary keys of objects in
the test file are written around the assumption that they will be present in
both the test net and the fixtures.

You can generate additional fixtures with statements like `python manage.py dumpdata theses.Person --pks=63970,29903 > hamlet/theses/fixtures/authors.json`. Because this overwrites the file, make sure to include the pks of all objects already in the fixtures. Alternatively, write the output to a separate file and then merge it with the existing one; you can't just append, because the resulting JSON syntax will be wrong. Also make sure that the theses you use are in fact present in the test neural net.
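
Since two JSON arrays cannot simply be concatenated, merging a newly dumped fixture into an existing one means loading both lists and writing out one combined list. The following is a minimal sketch, assuming both files are standard Django JSON fixtures (a list of objects that each have `model` and `pk` keys). The function names `merge_fixtures` and `merge_fixture_files` are hypothetical and are not part of Hamlet.

```python
import json


def merge_fixtures(existing, new):
    """Combine two Django fixture lists, keyed on (model, pk).

    Entries in `new` replace entries in `existing` with the same model and pk.
    """
    merged = {}
    for entry in existing + new:
        merged[(entry["model"], entry["pk"])] = entry
    # Sort so the output is stable between runs.
    return sorted(merged.values(), key=lambda e: (e["model"], e["pk"]))


def merge_fixture_files(existing_path, new_path, out_path):
    """Read two fixture files and write their merged contents to out_path."""
    with open(existing_path) as f:
        existing = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    with open(out_path, "w") as f:
        json.dump(merge_fixtures(existing, new), f, indent=2)
```

For example, after dumping new objects to a scratch file (here the hypothetical `new_authors.json`), `merge_fixture_files('hamlet/theses/fixtures/authors.json', 'new_authors.json', 'hamlet/theses/fixtures/authors.json')` would fold them into the existing fixture.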

If you are seeing unpredictable test failures (e.g. tests that succeed in
isolation but fail in a suite), make sure you're using the right settings file.

## System configuration

### Development dependencies: pipenv
For the most part, dependencies are installed via pipenv. There's a `.env` file (kept out of version control) for use by `pipenv shell`. It specifies:
* `DJANGO_SETTINGS_MODULE`
  * `DJANGO_SETTINGS_MODULE='hamlet.settings.local'` (for using `heroku local`)
  * `DJANGO_SETTINGS_MODULE='hamlet.settings.base'` (for `python manage.py runserver`)
* `DJANGO_DB_PASSWORD='(your password)'`
* `DJANGO_DEBUG_IS_TRUE='True'` (if you want)
* `DSPACE_OAI_IDENTIFIER`
* `DSPACE_OAI_URI`

The latter two are only relevant if you plan to be downloading files or metadata from DSpace. They can be omitted or given dummy values otherwise.

### Additional non-pipenv dependencies
Some dependencies require extra help. However, you only need to bother if you are running the functions that rely on those dependencies.

* In order to run the Django app, including the test suite:
  * python-magic needs libmagic (`brew install libmagic` on OSX).
  * captcha says it needs `apt-get -y install libz-dev libjpeg-dev libfreetype6-dev python-dev` or similar. You can't yum install them on AWS, but the captcha works anyway, so maybe it's lying.

* In order to run OCR on thesis PDFs:
  * tika requires Java

* In order to train neural nets:
  * nltk may require installing corpora through the python shell
  * gensim wants a C compiler (it can run without one but will be 70x slower; a single neural net training run can take literally days in this case)

### Environment Variables
You may set the following environment variables to configure your database:
* `DJANGO_DB_ENGINE` (default `django.db.backends.postgresql`)
* `DJANGO_DB` (default `hamlet`)
* `DJANGO_DB_USER` (default `hamlet`)
* `DJANGO_DB_PASSWORD` (no default)
* `DJANGO_DB_HOST` (default `localhost`)

If you are running with the Postgres default, be sure to stand up Postgres; create the named database and user; and grant privileges on your database to your user.
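
To illustrate how these variables and their defaults fit together, here is a sketch of a Django `DATABASES` setting built from them. This is an assumption about the general shape of such settings, not a copy of Hamlet's settings module, and the helper name `database_settings` is hypothetical.

```python
import os


def database_settings(environ=os.environ):
    """Build a Django DATABASES dict from the variables listed above."""
    return {
        "default": {
            "ENGINE": environ.get("DJANGO_DB_ENGINE", "django.db.backends.postgresql"),
            "NAME": environ.get("DJANGO_DB", "hamlet"),
            "USER": environ.get("DJANGO_DB_USER", "hamlet"),
            # No default: the value is None if the variable is unset.
            "PASSWORD": environ.get("DJANGO_DB_PASSWORD"),
            "HOST": environ.get("DJANGO_DB_HOST", "localhost"),
        }
    }
```

With no variables set, this yields a Postgres connection to a database named `hamlet` on `localhost` as user `hamlet`, which is why only `DJANGO_DB_PASSWORD` strictly needs to be set in the default setup.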

## Static assets
If you need to edit styles, edit files in `hamlet/static/sass/apps/`. Don't edit CSS directly - these changes will be blown away during asset precompilation.

### for `python manage.py runserver`
* run `python manage.py collectstatic`
* use `hamlet.settings.base`

### for `heroku local`
* run `python manage.py compress`
* then run `python manage.py collectstatic`
* use `hamlet.settings.local`

### for AWS
The static asset pipeline runs automatically; see `.ebextensions/02_python.config`.

## Working with neural nets
Hamlet needs a neural net in order to operate. By default it uses the files in `hamlet/testmodels`, but this is configurable.

### Using neural net files
`hamlet.model` is a copy of `all_theses_no_split_w4_s52.model`, a model trained with a window size of 4 and a step of 52. It is kept out of version control because it is too big.

`hamlet/testmodels/` contains some smaller models that are not suitable for production but are usable for testing. They are small enough to be pushed to GitHub (although it will complain), and hence can be used on Travis. You can configure your local settings to point at these files and that will suffice for development.

These models don't represent the entire MIT thesis collection (that's what lets them be smaller), so don't be surprised if documents of interest are not present.

`hamlet.settings.local` defaults to using the test model, since it is checked
into version control. If you have a different model you want to use, set `DJANGO_MODEL_PATH=/full/path/to/model` in `.env`.
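
The override described above can be pictured as a simple fallback. This is only a sketch of the idea, not Hamlet's actual settings code: the helper name `resolve_model_path` is hypothetical, and the default path is passed in rather than assumed.

```python
import os


def resolve_model_path(default_path, environ=os.environ):
    """Return DJANGO_MODEL_PATH if it is set and non-empty, else default_path."""
    return environ.get("DJANGO_MODEL_PATH") or default_path
```

An empty `DJANGO_MODEL_PATH` is treated the same as an unset one here, so a blank line in `.env` falls back to the test model.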

### Checking that a document is in a given neural net

* Make sure your settings file points to the desired `MODEL_FILE`
* `python manage.py shell`

```
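from django.conf import settings
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(settings.MODEL_FILE)
# Replace 0 with the integer identifier of your thesis (Thesis.identifier).
thesis_identifier = 0
identifier = '1721.1-%d.txt' % thesis_identifier
# Evaluates to True if the thesis is in the neural net.
identifier in model.docvecs.doctags.keys()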
```

If you don't have a target thesis object but you need one you know is in the neural net, look at the output of `model.docvecs.doctags.keys()`. This is a list of filenames of text files from DSpace; they are all of the format `1721.1-NUMBER.txt`, where `NUMBER` is the identifier of the thesis. You can look up `Thesis` objects in your database by this identifier (which is `Thesis.identifier`, not the primary key).
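
To go from a doctag to a `Thesis` lookup, the numeric identifier has to be pulled out of the filename. Here is a small sketch, assuming the `1721.1-NUMBER.txt` format described above; `identifier_from_doctag` is a hypothetical helper, not part of Hamlet.

```python
import re

# Doctags look like '1721.1-NUMBER.txt', per the format described above.
DOCTAG_PATTERN = re.compile(r"^1721\.1-(\d+)\.txt$")


def identifier_from_doctag(doctag):
    """Return the thesis identifier embedded in a doctag, or None if it does not match."""
    match = DOCTAG_PATTERN.match(doctag)
    return int(match.group(1)) if match else None
```

In the Django shell, something like `Thesis.objects.get(identifier=identifier_from_doctag(tag))` should then fetch the matching object, assuming `Thesis` has been imported from the theses app and that `identifier` is stored as an integer.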

## Docker
You can start up a running instance locally using Docker Compose:

```
$ docker-compose up
```

By default, this will use the test model included in the codebase. If you'd like to use a different model, you will need to add it to a hamlet docker volume. This volume is mounted as `/data` in the container. You'll also need to set the `DJANGO_MODEL_PATH` env var to point to the model within that docker volume. Assuming your model files are in the top directory of the project:

```
$ docker run --name hamlet-data --mount type=volume,src=hamlet,target=/data busybox true
$ for f in hamlet.model*; do docker cp "$f" hamlet-data:/data; done
$ docker rm hamlet-data
$ export DJANGO_MODEL_PATH=/data/hamlet.model
$ docker-compose up
```