[![Build Status](https://travis-ci.org/treasure-data/embulk-input-mixpanel.svg?branch=master)](https://travis-ci.org/treasure-data/embulk-input-mixpanel)
[![Code Climate](https://codeclimate.com/github/treasure-data/embulk-input-mixpanel/badges/gpa.svg)](https://codeclimate.com/github/treasure-data/embulk-input-mixpanel)
[![Test Coverage](https://codeclimate.com/github/treasure-data/embulk-input-mixpanel/badges/coverage.svg)](https://codeclimate.com/github/treasure-data/embulk-input-mixpanel/coverage)

# Mixpanel input plugin for Embulk

embulk-input-mixpanel is the Embulk input plugin for [Mixpanel](https://mixpanel.com).

## Overview

Requires Embulk version >= 0.8.6 (since v0.4.0).

* **Plugin type**: input
* **Resume supported**: no
* **Cleanup supported**: no
* **Guess supported**: yes

## Setup

### How to get API configuration

This plugin uses the API key and API secret of the target project. Before you write your config.yml, get the API key and API secret from the Mixpanel website.

To find them, log in to the Mixpanel website and click "Account" in the header. In the "Projects" panel, you can see the "API Key" and "API Secret" for each project.

### How to get project's timezone

This plugin uses the project's timezone to adjust timestamps to UTC.

To find it, log in to the Mixpanel website and click the gear icon at the lower left. The dialog that opens shows the timezone in the "Timezone" column of the "Management" tab.

### Configuration

- **api_secret**: project API Secret (string, required)
- **export_endpoint**: the Data Export API's endpoint (string, default to "http://data.mixpanel.com/api/2.0/export")
- **jql_endpoint**: the JQL API's endpoint (string, default to "https://mixpanel.com/api/2.0/jql/")
- **jql_mode**: Whether to use the JQL endpoint instead of the export endpoint (boolean, default to false)
- **jql_script**: JQL script sent to the JQL endpoint (string)
- **timezone**: project timezone (string, required)
- **from_date**: Start date of the export (string, optional, default: today - 2)
  - NOTE: The Mixpanel API supports exporting data for a range that starts at least 2 days ago and ends no later than the previous day.
- **fetch_days**: Number of days to export (integer, optional, default: from_date - (today - 1))
  - NOTE: Mixpanel does not support `from_date` > today - 2.
- **incremental**: Whether to run in incremental mode (boolean, optional, default: true)
- **incremental_column**: Column added to the `where` query as a constraint on incremental time. Only records whose `incremental_column` timestamp is greater than the previous `latest_fetched_time` are returned (string, optional, default: time)
- **back_fill_time**: Amount of time subtracted from `from_date` to calculate the final `from_date` used for the API request. This compensates for Mixpanel caching data on user devices before sending it to the Mixpanel server (integer, optional, default: 5)
  - NOTE: Only takes effect when `incremental` is true and `incremental_column` is specified.
- **incremental_column_upper_limit_delay_in_seconds**: When querying with an incremental column, the plugin caps the upper limit of the incremental column query at the job start time, to avoid problems with data committed while the job is running (e.g. `where mp_processing_time <= job_start_time`). The upper limit is calculated as the job start time minus this parameter, which supports cases where Mixpanel's processing is delayed (integer, optional, default: 0)
- **fetch_unknown_columns** (deprecated): Whether the plugin fetches unknown columns, i.e. columns not listed in the config (boolean, optional, default: false)
  - NOTE: If true, an `unknown_columns` column is created and populated with the unknown columns' data.
- **fetch_custom_properties**: Whether to collect all custom properties into the `custom_properties` key. "Custom properties" are properties not described in the Mixpanel documentation [1](https://mixpanel.com/help/questions/articles/special-or-reserved-properties), [2](https://mixpanel.com/help/questions/articles/what-properties-do-mixpanels-libraries-store-by-default). (boolean, optional, default: true)
  - NOTE: You cannot set both `fetch_unknown_columns` and `fetch_custom_properties` to `true`.
- **event**: The event or events to filter data (array, optional, default: nil)
- **where**: Expression to filter data (c.f. https://mixpanel.com/docs/api-documentation/data-export-api#segmentation-expressions) (string, optional, default: nil)
- **bucket**: The data bucket to filter data (string, optional, default: nil)
- **retry_initial_wait_sec**: Initial wait in seconds for exponential backoff (integer, default: 1)
- **retry_limit**: Maximum number of retries (integer, default: 5)
- **allow_partial_import**: Allow the plugin to skip imports that fail (boolean, default: true)
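
The incremental options above interact with each other, so a combined sketch may help. The following config is illustrative only: the option names come from the list above, but the values (in particular the 600-second delay and the choice of `mp_processing_time` as the incremental column) are assumptions, not recommendations.

```yaml
in:
  type: mixpanel
  api_secret: "API_SECRET"
  timezone: "US/Pacific"
  from_date: "2015-07-19"
  # Incremental mode is on by default; stated explicitly here for clarity.
  incremental: true
  # Column used in the option description above; whether it suits
  # your project is an assumption.
  incremental_column: mp_processing_time
  # Default value. Only takes effect because incremental is true
  # and incremental_column is specified.
  back_fill_time: 5
  # Assumed value: upper limit becomes job start time minus 600 seconds.
  incremental_column_upper_limit_delay_in_seconds: 600
  # Defaults for retrying with exponential backoff.
  retry_initial_wait_sec: 1
  retry_limit: 5
```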

### `fetch_unknown_columns` and `fetch_custom_properties`

Suppose you have the following data and the config.yml shown below.

| event | $city   | $custom | $foobar |
| ----- | ------- | ------- | ------- |
| ev    | Tokyo   | custom  | foobar  |

(NOTE: `$city` is a [reserved key](https://mixpanel.com/help/questions/articles/what-properties-do-mixpanels-libraries-store-by-default), `$custom` and `$foobar` are not)

```yaml
in:
  type: mixpanel
  api_secret: "API_SECRET"
  timezone: "US/Pacific"
  from_date: "2015-07-19"
  fetch_days: 5
  columns:
    - {name: event, type: string}
    - {name: $custom, type: string}
```


`fetch_unknown_columns: true` produces:

| event | $custom | unknown_columns (json) |
| ----- | ------- | ----------------- |
| ev    | custom  | `{"$city":"Tokyo", "$foobar": "foobar"}` |

`fetch_custom_properties: true` produces:

| event | $custom | custom_properties (json) |
| ----- | ------- | ----------------- |
| ev    | custom  | `{"$foobar": "foobar"}` |


`fetch_unknown_columns` treats `$city` and `$foobar` as `unknown_columns` because they are not listed in config.yml.

`fetch_custom_properties` treats `$foobar` as `custom_properties`. `$custom` is also a custom property, but it is excluded because it is listed in config.yml.
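
Because `fetch_custom_properties` defaults to `true` and the two options cannot both be `true`, selecting the `unknown_columns` behavior presumably requires turning the other option off explicitly. A minimal sketch, reusing the placeholder values from the config above:

```yaml
in:
  type: mixpanel
  api_secret: "API_SECRET"
  timezone: "US/Pacific"
  from_date: "2015-07-19"
  fetch_days: 5
  # Disable the default so the two options are not both true.
  fetch_custom_properties: false
  # Deprecated option; collects unlisted columns into unknown_columns.
  fetch_unknown_columns: true
  columns:
    - {name: event, type: string}
    - {name: $custom, type: string}
```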

## Example

```yaml
in:
  type: mixpanel
  api_secret: "API_SECRET"
  timezone: "US/Pacific"
  from_date: "2015-07-19"
  fetch_days: 5
```
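
For JQL mode, a config might look like the sketch below. This is an assumption-laden illustration: the script body and its date range are placeholders, and this README does not state which other options (such as `from_date`) are still required when `jql_mode` is enabled.

```yaml
in:
  type: mixpanel
  api_secret: "API_SECRET"
  timezone: "US/Pacific"
  # Send jql_script to the JQL endpoint instead of using the export endpoint.
  jql_mode: true
  # Placeholder script; replace the date range with your own.
  jql_script: |
    function main() {
      return Events({
        from_date: "2015-07-19",
        to_date: "2015-07-23"
      });
    }
```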

## Run test

```
$ rake
```