ubpb/metacrunch-elasticsearch

View on GitHub
Readme.md

Summary

Maintainability
Test Coverage
metacrunch-elasticsearch
========================

[![Gem Version](https://badge.fury.io/rb/metacrunch-elasticsearch.svg)](http://badge.fury.io/rb/metacrunch-elasticsearch)
[![Code Climate](https://codeclimate.com/github/ubpb/metacrunch-elasticsearch/badges/gpa.svg)](https://codeclimate.com/github/ubpb/metacrunch-elasticsearch)
[![Test Coverage](https://codeclimate.com/github/ubpb/metacrunch-elasticsearch/badges/coverage.svg)](https://codeclimate.com/github/ubpb/metacrunch-elasticsearch/coverage)
[![CircleCI](https://circleci.com/gh/ubpb/metacrunch-elasticsearch.svg?style=svg)](https://circleci.com/gh/ubpb/metacrunch-elasticsearch)

This is the official [Elasticsearch](https://www.elastic.co) package for the [metacrunch ETL toolkit](https://github.com/ubpb/metacrunch).

*NOTE: metacrunch-elasticsearch 5.x requires Elasticsearch 7.x. For older versions of Elasticsearch try metacrunch-elasticsearch 4.x*

Installation
------------

Include the gem in your `Gemfile`

```ruby
gem "metacrunch-elasticsearch", "~> 5.0.0"
```

and run `$ bundle install` to install it.

Or install it manually

```
$ gem install metacrunch-elasticsearch
```

Usage
-----

*Note: For working examples on how to use this package check out our [demo repository](https://github.com/ubpb/metacrunch-demo).*

### `Metacrunch::Elasticsearch::Source`

This class provides a metacrunch `source` implementation that can be used to read data from Elasticsearch into a metacrunch job.

```ruby
# my_job.metacrunch

# Create a Elasticsearch connection 
elasticsearch = Elasticsearch::Client.new(...)

# Set the source
source Metacrunch::Elasticsearch::Source.new(elasticsearch, OPTIONS)
```

**Options**

* `:search_options`: A hash with search options (including your query) as described [here](https://github.com/elastic/elasticsearch-ruby/blob/master/elasticsearch-api/lib/elasticsearch/api/actions/search.rb). We have set some meaningful defaults though: `size: 100`, `scroll: 1m`, `sort: ["_doc"]`. Depending on your use-case it may be needed to modify `:size` and `:scroll` for optimal performance.
* `:total_hits_callback`: You can set a `Proc` that gets called with the total number of hits your query will match. Use can use this callback to setup a progress bar for example. Defaults to `nil`.


### `Metacrunch::Elasticsearch::Destination`

This class provides a metacrunch `destination` implementation that can be used to write data from a metacrunch job to Elasticsearch.

The data that gets passed to the destination, must be in a proper format. You can use a transformation to transform your data before it reaches the destination.

As `Metacrunch::Elasticsearch::Destination` utilizes the Elasticsearch bulk API, the expected format must match one of the available options for the `body`parameter described [here](https://github.com/elastic/elasticsearch-ruby/blob/master/elasticsearch-api/lib/elasticsearch/api/actions/bulk.rb). Please note that you can use the bulk API not only to index records. You can update or delete records as well.

```ruby
# my_job.metacrunch

# Transform data into a format that the destination can understand.
# In this example `data` is some hash.
transformation ->(data) do
  {
    index: {
      _index: "my-index",
      _id: data.delete(:id),
      data: data
    }
  }
end
```

It is not efficient to call Elasticsearch for every single record. Therefore we can use a transformation with a buffer, to create bulks of records. In this example we use a buffer size of 10. In production environments and depending on your data, larger buffers may be useful.

```ruby
# my_job.metacrunch

transformation ->(data) { data }, buffer: 10
```

If these transformations are in place you can now use the `Metacrunch::Elasticsearch::Destination` class as a destination.

```ruby
# my_job.metacrunch

# Write data into elasticsearch
destination Metacrunch::Elasticsearch::Destination.new(elasticsearch [, OPTIONS])
```

**Options**

* `:result_callback`: You can set a `Proc` that gets called with the result from the bulk operation. Defaults to `nil`.
* `:bulk_options`: A hash of options for the Eleasticsearch bulk API as described [here](https://github.com/elastic/elasticsearch-ruby/blob/master/elasticsearch-api/lib/elasticsearch/api/actions/bulk.rb). Setting `body` here will be ignored. Defaults to `{}`.

License
-------

metacrunch-elasticsearch is available at [github](https://github.com/ubpb/metacrunch-elasticsearch) under [MIT license](https://github.com/ubpb/metacrunch-elasticsearch/blob/master/License.txt).