README.md
[![Build Status](https://travis-ci.org/dpn-admin/dpn-sync.svg?branch=master)](https://travis-ci.org/dpn-admin/dpn-sync) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/cfd723a84b1f493596f0374ec6fa790e)](https://www.codacy.com/app/DPN/dpn-sync?utm_source=github.com&utm_medium=referral&utm_content=dpn-admin/dpn-sync&utm_campaign=Badge_Grade) [![Codacy Badge](https://api.codacy.com/project/badge/Coverage/cfd723a84b1f493596f0374ec6fa790e)](https://www.codacy.com/app/DPN/dpn-sync?utm_source=github.com&utm_medium=referral&utm_content=dpn-admin/dpn-sync&utm_campaign=Badge_Coverage) [![Inline docs](http://inch-ci.org/github/dpn-admin/dpn-sync.svg?branch=master)](http://inch-ci.org/github/dpn-admin/dpn-sync)
# DPN Synchronization
An application for synchronizing DPN registry data from remote nodes, using the
[Sidekiq](https://github.com/mperham/sidekiq) background jobs framework.
## Components
- the DPN nodes are defined in `config/settings.yml`
- the settings are handled by `DPN::Workers`
- a set of DPN nodes is loaded by `DPN::Workers.nodes`
- a set of DPN nodes is modeled by the `DPN::Workers::Nodes` class
- it requires a `local_namespace` to identify a `local_node`
- it makes an important distinction between a `local_node` and `remote_nodes`
- it has methods to `sync` data from `remote_nodes` into the `local_node`
- the `DPN::Workers::SyncWorker` is a `Sidekiq::Worker`
- subclasses of `DPN::Workers::Sync` implement `#sync`
- they use `DPN::Workers::JobData` for tracking success
- a node is modeled by the `DPN::Workers::Node` class
- it requires a node namespace, API URL, and authentication token
- it contains a `DPN::Client` to access a node's HTTP API
- the client is from https://github.com/dpn-admin/dpn-client
- the HTTP API is https://github.com/dpn-admin/DPN-REST-Wiki
- it is implemented in https://github.com/dpn-admin/dpn-server
## Requirements
- [Sidekiq](https://github.com/mperham/sidekiq) requires the [Redis](http://redis.io/) document store.
- Access to at least two DPN nodes that host a DPN HTTP-REST-API
- https://github.com/dpn-admin/DPN-REST-Wiki
- https://github.com/dpn-admin/dpn-server
## Getting Started
```sh
git clone git@github.com:dpn-admin/dpn-sync.git
cd dpn-sync
bundle install
# Start the Sidekiq daemon to run background jobs; some
# jobs are managed by sidekiq-cron, see config/schedule.yml
bundle exec rake sidekiq:service:start
# Start the Sidekiq dashboard at http://localhost:9292/
bundle exec rackup
# Explore the dashboard web pages and then
# Cnt-C to stop and then
bundle exec rake sidekiq:service:stop
```
## Configuration
- Configuration files are in:
- `config/*.yml`
- `config/settings/*.yml`
- `config/initializers/*.rb`
- additional configuration details may be in the project wiki pages
- https://github.com/dpn-admin/dpn-sync/wiki
The `config` gem provides several layers of specificity for settings, see
https://github.com/railsconfig/config#accessing-the-settings-object
### Configuring Nodes
The most important values in `Settings` are the `nodes` definitions and the
`local_namespace` that should belong to one of the `nodes`. These values
should be derived from the `Node` table of the `dpn-server` project.
From the `rails c` console of the `dpn-server` project, the nodes data
can be dumped using:
```ruby
require 'yaml'
yml = Node.all.map do |n|
{
namespace: n.namespace,
api_root: n.api_root,
auth_credential: n.auth_credential
}
end.to_yaml
puts yml
```
Note that the `auth_credential` values are private and should be kept secret.
The node information can be retrieved from the HTTP-REST-API. The response
will include many details, including those required, but not the
`auth_credential` values. For example, when the `dpn-server` cluster
is running locally, it can be retrieved using:
```sh
curl -k -H "Authorization: Token token=aptrust_token" -L http://127.0.0.1:3001/api-v1/node/
```
An abridged response looks like:
```json
{
"count": 5,
"next": null,
"previous": null,
"results": [{
"name": "APTrust",
"namespace": "aptrust",
"api_root": "http://127.0.0.1:3001"
}, {
"name": "Chronopolis",
"namespace": "chron",
"api_root": "http://127.0.0.1:3002"
}, {
"name": "Hathi Trust",
"namespace": "hathi",
"api_root": "http://127.0.0.1:3003"
}, {
"name": "Stanford Digital Repository",
"namespace": "sdr",
"api_root": "http://127.0.0.1:3004"
}, {
"name": "Texas Digital Repository",
"namespace": "tdr",
"api_root": "http://127.0.0.1:3005"
}]
}
```
### Configuring Test Cluster
When running in development, the `dpn-server` project can run a test cluster and the nodes settings can be set to work with that cluster; the default values in `config/settings.yml` should work with this cluster. See
- https://github.com/dpn-admin/dpn-server/blob/master/Cluster.md
- https://github.com/dpn-admin/dpn-server/blob/master/script/setup_cluster.rb
- https://github.com/dpn-admin/dpn-server/blob/master/script/run_cluster.rb
### Environment Variables
- Environment variables can be set in various places, with the following order
of importance:
- On deployed apps, running under Apache/Passenger:
- see `/etc/httpd/conf.d/z*`
- The content of the config files is managed by puppet
- Command line values, e.g. `RACK_ENV=production bundle exec rackup`
## Deployment
Capistrano is configured to run all the deployments. See `cap -T` for all the options. There are private configuration files in the [DLSS shared-configs](https://github.com/sul-dlss/shared_configs). The following files should be in the `shared_configs`, in a branch like `dpn-*-sync`. The generic `settings.yml` should contain config parameters that are independent of the deployment `{environment}.yml` (like `development.yml` or `production.yml`), whereas the `settings/{environment}.yml` should contain `nodes` or other details that are specific to the deployment network.
```
config/
├── redis.yml
├── settings
│ └── {environment}.yml
├── settings.yml
└── sidekiq_schedule.yml
```
Capistrano can start and stop the `Sidekiq` service. The tasks include:
```
cap sidekiq:quiet # Quiet sidekiq (stop processing new tasks)
cap sidekiq:respawn # Respawn missing sidekiq processes
cap sidekiq:restart # Restart sidekiq
cap sidekiq:rolling_restart # Rolling-restart sidekiq
cap sidekiq:start # Start sidekiq
cap sidekiq:stop # Stop sidekiq
```
## Rake
There are rake tasks for starting `dpn-sync` jobs and inspecting the `Sidekiq` API. All the tasks can be listed using `bundle exec rake -T`, e.g.
```
rake dpn:sync:bags # DPN - queue a job to fetch bag meta-data from remote nodes
rake dpn:sync:members # DPN - queue a job to fetch member meta-data from remote nodes
rake dpn:sync:nodes # DPN - queue a job to fetch node meta-data from remote nodes
rake dpn:sync:replications # DPN - queue a job to fetch replication request meta-data from remote nodes
...
rake sidekiq:default_queue:clear # Sidekiq - clear the default queue
rake sidekiq:default_queue:entries # Sidekiq - default queue entries
rake sidekiq:stats:all # Sidekiq - statistics - all
rake sidekiq:stats:history[days] # Sidekiq - statistics - history[days]
rake sidekiq:stats:reset # Sidekiq - statistics - reset
...
```
## Development
- To get a console: `bundle exec rackup -d`
- if anything goes wrong, look at `log/rack_debug.log`
- if the `dpn-server` cluster is running, the following works:
```ruby
DPN::Workers.nodes.map(&:alive?)
#=> [true, true, true, true, true]
```
- To see and test jobs:
- `bundle exec sidekiq -C ./config/sidekiq.yml -r ./config/initializers/sidekiq.rb`
- in another shell, run `bundle exec rackup`
- use a browser to open `http://localhost:9292`
- use the `/test` page to check messages are processed by a worker
- use the `/sidekiq` dashboard