whylabs/whylogs-python

View on GitHub
python/examples/datasets/employee.ipynb

Summary

Maintainability
Test Coverage
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> \n",
    ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=employee)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=employee) to leverage the power of whylogs and WhyLabs together!*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Employee Dataset - Usage Example"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/datasets/employee.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This an example demonstrating the usage of the Employee Dataset.\n",
    "\n",
    "For more information about the dataset itself, check the documentation on :\n",
    "https://whylogs.readthedocs.io/en/latest/datasets/employee.html"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing the datasets module"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: you may need to restart the kernel to use updated packages.\n",
    "%pip install 'whylogs[datasets]'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the Dataset\n",
    "\n",
    "You can load the dataset of your choice by calling it from the `datasets` module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.datasets import Employee\n",
    "\n",
    "dataset = Employee(version=\"base\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If no `version` parameter is passed, the default version is `base`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will create a folder in the current directory named `whylogs_data` with the csv files for the Employee Dataset. If the files already exist, the module will not redownload the files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discovering Information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To know what are the available versions for a given dataset, you can call:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('base',)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Employee.describe_versions()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get access to overall description of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Employee Dataset\n",
      "================\n",
      "\n",
      "The employee dataset contains annual salary information for employees of an american County. It contains features related to each employee, such as employee's department, gender, salary, and hiring date.\n",
      "\n",
      "The original data was sourced from the `employee_salaries` OpenML dataset, and can be found here: https://www.openml.org/d/42125. From the source data additional transformations were made, such as: data cleaning, feature creation and feature engineering.\n",
      "\n",
      "License:\n",
      "CC0: Public Domain\n",
      "\n",
      "Usage\n",
      "-----\n",
      "\n",
      "You can follow this guide to see how to use the ecommerce dataset:\n",
      "\n",
      ".. toctree::\n",
      "    :maxdepth: 1\n",
      "\n",
      "    ../examples/datasets/employee\n",
      "\n",
      "Versions and Data Partitions\n",
      "----------------------------\n",
      "\n",
      "Currently the dataset contains one version: **base**. This dataset has no particular tasks defined, as it is aimed to explore data quality issues that are not necessarily related to ML.\n",
      "The **base** version contains two partitions: **Baseline** and **Production**\n",
      "\n",
      "base\n"
     ]
    }
   ],
   "source": [
    "print(Employee.describe()[:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "note: the output was truncated to first 1000 characters as `describe()` will print a rather lengthy description."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Baseline Data\n",
    "\n",
    "You can access data from two different partitions: the baseline dataset and production dataset.\n",
    "\n",
    "The baseline can be accessed as a whole, whereas the production dataset can be accessed in periodic batches, defined by the user.\n",
    "\n",
    "To get a `baseline` object, just call `dataset.get_baseline()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.datasets import Employee\n",
    "\n",
    "dataset = Employee()\n",
    "\n",
    "baseline = dataset.get_baseline()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`baseline` will contain different attributes - one timestamp and five dataframes.\n",
    "\n",
    "- timestamp: the batch's timestamp (at the start)\n",
    "- data: the complete dataframe\n",
    "- features: input features\n",
    "- target: output feature(s)\n",
    "- prediction: output prediction and, possibly, features such as uncertainty, confidence, probability\n",
    "- extra: metadata features that are not of any of the previous categories, but still contain relevant information about the data.\n",
    "\n",
    "The Employee dataset is a non-ml dataset, so the `prediction` and `target` dataframes will be empty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "datetime.datetime(2023, 2, 16, 0, 0, tzinfo=datetime.timezone.utc)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "baseline.timestamp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee_id</th>\n",
       "      <th>gender</th>\n",
       "      <th>overtime_pay</th>\n",
       "      <th>department</th>\n",
       "      <th>position_title</th>\n",
       "      <th>date_first_hired</th>\n",
       "      <th>year_first_hired</th>\n",
       "      <th>salary</th>\n",
       "      <th>full_time</th>\n",
       "      <th>part_time</th>\n",
       "      <th>sector</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>8894</td>\n",
       "      <td>M</td>\n",
       "      <td>9136.78</td>\n",
       "      <td>POL</td>\n",
       "      <td>Police Sergeant</td>\n",
       "      <td>07/21/2003</td>\n",
       "      <td>2003</td>\n",
       "      <td>103506.00</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>6920</td>\n",
       "      <td>M</td>\n",
       "      <td>0.00</td>\n",
       "      <td>FRS</td>\n",
       "      <td>Firefighter/Rescuer III</td>\n",
       "      <td>12/12/2016</td>\n",
       "      <td>2016</td>\n",
       "      <td>45261.00</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>2265</td>\n",
       "      <td>F</td>\n",
       "      <td>0.00</td>\n",
       "      <td>LIB</td>\n",
       "      <td>Library Associate</td>\n",
       "      <td>06/27/1997</td>\n",
       "      <td>1997</td>\n",
       "      <td>25167.75</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Sector 4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>8790</td>\n",
       "      <td>M</td>\n",
       "      <td>0.00</td>\n",
       "      <td>OHR</td>\n",
       "      <td>Labor Relations Advisor</td>\n",
       "      <td>10/28/2001</td>\n",
       "      <td>2001</td>\n",
       "      <td>112899.00</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>7728</td>\n",
       "      <td>M</td>\n",
       "      <td>12516.95</td>\n",
       "      <td>DOT</td>\n",
       "      <td>Bus Operator</td>\n",
       "      <td>11/10/2014</td>\n",
       "      <td>2014</td>\n",
       "      <td>42053.42</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           employee_id gender  overtime_pay department  \\\n",
       "date                                                                     \n",
       "2023-02-16 00:00:00+00:00         8894      M       9136.78        POL   \n",
       "2023-02-16 00:00:00+00:00         6920      M          0.00        FRS   \n",
       "2023-02-16 00:00:00+00:00         2265      F          0.00        LIB   \n",
       "2023-02-16 00:00:00+00:00         8790      M          0.00        OHR   \n",
       "2023-02-16 00:00:00+00:00         7728      M      12516.95        DOT   \n",
       "\n",
       "                                    position_title date_first_hired  \\\n",
       "date                                                                  \n",
       "2023-02-16 00:00:00+00:00          Police Sergeant       07/21/2003   \n",
       "2023-02-16 00:00:00+00:00  Firefighter/Rescuer III       12/12/2016   \n",
       "2023-02-16 00:00:00+00:00        Library Associate       06/27/1997   \n",
       "2023-02-16 00:00:00+00:00  Labor Relations Advisor       10/28/2001   \n",
       "2023-02-16 00:00:00+00:00             Bus Operator       11/10/2014   \n",
       "\n",
       "                           year_first_hired     salary  full_time  part_time  \\\n",
       "date                                                                           \n",
       "2023-02-16 00:00:00+00:00              2003  103506.00          1          0   \n",
       "2023-02-16 00:00:00+00:00              2016   45261.00          1          0   \n",
       "2023-02-16 00:00:00+00:00              1997   25167.75          0          1   \n",
       "2023-02-16 00:00:00+00:00              2001  112899.00          1          0   \n",
       "2023-02-16 00:00:00+00:00              2014   42053.42          1          0   \n",
       "\n",
       "                             sector  \n",
       "date                                 \n",
       "2023-02-16 00:00:00+00:00  Sector 3  \n",
       "2023-02-16 00:00:00+00:00  Sector 1  \n",
       "2023-02-16 00:00:00+00:00  Sector 4  \n",
       "2023-02-16 00:00:00+00:00  Sector 3  \n",
       "2023-02-16 00:00:00+00:00  Sector 4  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "baseline.features.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Parameters\n",
    "\n",
    "With `set_parameters`, you can specify the timestamps for both baseline and production datasets, as well as the production interval.\n",
    "\n",
    "By default, the timestamp is set as:\n",
    "- Current date for baseline dataset\n",
    "- Tomorrow's date for production dataset\n",
    "\n",
    "These timestamps can be defined by the user to any given day, including the dataset's original date.\n",
    "\n",
    "The `production_interval` defines the interval for each batch: '1d' means that we will have daily batches, while '7d' would mean weekly batches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To set the timestamps to the original dataset's date, set `original` to true, like below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Currently, the production interval takes a str in the format \"Xd\", where X is an integer between 1-30\n",
    "dataset.set_parameters(production_interval=\"1d\", original=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "datetime.datetime(2023, 1, 16, 0, 0, tzinfo=datetime.timezone.utc)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "baseline = dataset.get_baseline()\n",
    "baseline.timestamp"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can set timestamp by using the `baseline_timestamp` and `production_start_timestamp`, and the production interval like below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime, timezone\n",
    "now = datetime.now(timezone.utc)\n",
    "dataset.set_parameters(baseline_timestamp=now, production_start_timestamp=now, production_interval=\"1d\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Note that we are passing the datetime converted to the UTC timezone. If a naive datetime is passed (no information on timezones), local time zone will be assumed. The local timestamp, however, will be converted to the proper datetime in UTC timezone. Passing a naive datetime will trigger a warning, letting you know of this behavior."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that if both `original` and a timestamp (baseline or production) is passed simultaneously, the defined timestamp will be overwritten by the original dataset timestamp."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Inference Data #1 - By Date"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can get production data in two different ways. The first is to specify the exact date you want, which will return a single batch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch = dataset.get_production_data(target_date=now)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can access the attributes just as showed before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "datetime.datetime(2023, 2, 16, 0, 0, tzinfo=datetime.timezone.utc)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch.timestamp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>employee_id</th>\n",
       "      <th>gender</th>\n",
       "      <th>overtime_pay</th>\n",
       "      <th>department</th>\n",
       "      <th>assignment_category</th>\n",
       "      <th>position_title</th>\n",
       "      <th>date_first_hired</th>\n",
       "      <th>year_first_hired</th>\n",
       "      <th>salary</th>\n",
       "      <th>full_time</th>\n",
       "      <th>part_time</th>\n",
       "      <th>sector</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>6309</td>\n",
       "      <td>F</td>\n",
       "      <td>0.00</td>\n",
       "      <td>HHS</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Administrative Specialist I</td>\n",
       "      <td>02/08/2016</td>\n",
       "      <td>2016</td>\n",
       "      <td>59276.91</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>4078</td>\n",
       "      <td>M</td>\n",
       "      <td>19677.72</td>\n",
       "      <td>POL</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Police Officer III</td>\n",
       "      <td>06/25/1990</td>\n",
       "      <td>1990</td>\n",
       "      <td>92756.70</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>2445</td>\n",
       "      <td>F</td>\n",
       "      <td>0.00</td>\n",
       "      <td>DEP</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Planning Specialist III</td>\n",
       "      <td>06/30/2014</td>\n",
       "      <td>2014</td>\n",
       "      <td>80499.91</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>2548</td>\n",
       "      <td>F</td>\n",
       "      <td>0.00</td>\n",
       "      <td>REC</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Recreation Specialist</td>\n",
       "      <td>03/24/2014</td>\n",
       "      <td>2014</td>\n",
       "      <td>69842.16</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>5949</td>\n",
       "      <td>M</td>\n",
       "      <td>45267.21</td>\n",
       "      <td>DGS</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Property Manager II</td>\n",
       "      <td>05/07/1990</td>\n",
       "      <td>1990</td>\n",
       "      <td>99870.24</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>8594</td>\n",
       "      <td>F</td>\n",
       "      <td>0.00</td>\n",
       "      <td>CCL</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Confidential Aide</td>\n",
       "      <td>05/05/2003</td>\n",
       "      <td>2003</td>\n",
       "      <td>146664.49</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>3479</td>\n",
       "      <td>M</td>\n",
       "      <td>17711.08</td>\n",
       "      <td>FRS</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Firefighter/Rescuer III</td>\n",
       "      <td>02/27/2012</td>\n",
       "      <td>2012</td>\n",
       "      <td>60618.00</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>6067</td>\n",
       "      <td>F</td>\n",
       "      <td>0.00</td>\n",
       "      <td>HHS</td>\n",
       "      <td>Parttime-Regular</td>\n",
       "      <td>School Health Room Technician I</td>\n",
       "      <td>08/06/2012</td>\n",
       "      <td>2012</td>\n",
       "      <td>36797.13</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Sector 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>5788</td>\n",
       "      <td>M</td>\n",
       "      <td>9526.23</td>\n",
       "      <td>DLC</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Liquor Store Clerk II</td>\n",
       "      <td>04/04/2000</td>\n",
       "      <td>2000</td>\n",
       "      <td>57760.61</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-02-16 00:00:00+00:00</th>\n",
       "      <td>4375</td>\n",
       "      <td>M</td>\n",
       "      <td>1020.28</td>\n",
       "      <td>DOT</td>\n",
       "      <td>Fulltime-Regular</td>\n",
       "      <td>Motor Pool Attendant</td>\n",
       "      <td>05/27/2008</td>\n",
       "      <td>2008</td>\n",
       "      <td>36493.52</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Sector 4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>916 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                           employee_id gender  overtime_pay department  \\\n",
       "date                                                                     \n",
       "2023-02-16 00:00:00+00:00         6309      F          0.00        HHS   \n",
       "2023-02-16 00:00:00+00:00         4078      M      19677.72        POL   \n",
       "2023-02-16 00:00:00+00:00         2445      F          0.00        DEP   \n",
       "2023-02-16 00:00:00+00:00         2548      F          0.00        REC   \n",
       "2023-02-16 00:00:00+00:00         5949      M      45267.21        DGS   \n",
       "...                                ...    ...           ...        ...   \n",
       "2023-02-16 00:00:00+00:00         8594      F          0.00        CCL   \n",
       "2023-02-16 00:00:00+00:00         3479      M      17711.08        FRS   \n",
       "2023-02-16 00:00:00+00:00         6067      F          0.00        HHS   \n",
       "2023-02-16 00:00:00+00:00         5788      M       9526.23        DLC   \n",
       "2023-02-16 00:00:00+00:00         4375      M       1020.28        DOT   \n",
       "\n",
       "                          assignment_category  \\\n",
       "date                                            \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "...                                       ...   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Parttime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "2023-02-16 00:00:00+00:00    Fulltime-Regular   \n",
       "\n",
       "                                            position_title date_first_hired  \\\n",
       "date                                                                          \n",
       "2023-02-16 00:00:00+00:00      Administrative Specialist I       02/08/2016   \n",
       "2023-02-16 00:00:00+00:00               Police Officer III       06/25/1990   \n",
       "2023-02-16 00:00:00+00:00          Planning Specialist III       06/30/2014   \n",
       "2023-02-16 00:00:00+00:00            Recreation Specialist       03/24/2014   \n",
       "2023-02-16 00:00:00+00:00              Property Manager II       05/07/1990   \n",
       "...                                                    ...              ...   \n",
       "2023-02-16 00:00:00+00:00                Confidential Aide       05/05/2003   \n",
       "2023-02-16 00:00:00+00:00          Firefighter/Rescuer III       02/27/2012   \n",
       "2023-02-16 00:00:00+00:00  School Health Room Technician I       08/06/2012   \n",
       "2023-02-16 00:00:00+00:00            Liquor Store Clerk II       04/04/2000   \n",
       "2023-02-16 00:00:00+00:00             Motor Pool Attendant       05/27/2008   \n",
       "\n",
       "                           year_first_hired     salary  full_time  part_time  \\\n",
       "date                                                                           \n",
       "2023-02-16 00:00:00+00:00              2016   59276.91          1          0   \n",
       "2023-02-16 00:00:00+00:00              1990   92756.70          1          0   \n",
       "2023-02-16 00:00:00+00:00              2014   80499.91          1          0   \n",
       "2023-02-16 00:00:00+00:00              2014   69842.16          1          0   \n",
       "2023-02-16 00:00:00+00:00              1990   99870.24          1          0   \n",
       "...                                     ...        ...        ...        ...   \n",
       "2023-02-16 00:00:00+00:00              2003  146664.49          1          0   \n",
       "2023-02-16 00:00:00+00:00              2012   60618.00          1          0   \n",
       "2023-02-16 00:00:00+00:00              2012   36797.13          0          1   \n",
       "2023-02-16 00:00:00+00:00              2000   57760.61          1          0   \n",
       "2023-02-16 00:00:00+00:00              2008   36493.52          1          0   \n",
       "\n",
       "                             sector  \n",
       "date                                 \n",
       "2023-02-16 00:00:00+00:00  Sector 1  \n",
       "2023-02-16 00:00:00+00:00  Sector 3  \n",
       "2023-02-16 00:00:00+00:00  Sector 4  \n",
       "2023-02-16 00:00:00+00:00  Sector 2  \n",
       "2023-02-16 00:00:00+00:00  Sector 3  \n",
       "...                             ...  \n",
       "2023-02-16 00:00:00+00:00  Sector 4  \n",
       "2023-02-16 00:00:00+00:00  Sector 1  \n",
       "2023-02-16 00:00:00+00:00  Sector 1  \n",
       "2023-02-16 00:00:00+00:00  Sector 2  \n",
       "2023-02-16 00:00:00+00:00  Sector 4  \n",
       "\n",
       "[916 rows x 12 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Inference Data #2 - By Number of Batches\n",
    "\n",
    "The second way is to specify the number of batches you want and also the date for the first batch.\n",
    "\n",
    "You can then iterate over the returned object to get the batches. You can then use the batch any way you want. Here's an example that retrieves daily batches for a period of 5 days and logs each one with __whylogs__, saving the binary profiles to disk:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "logging batch of size 916 for 2023-02-16 00:00:00+00:00\n",
      "logging batch of size 818 for 2023-02-17 00:00:00+00:00\n",
      "logging batch of size 891 for 2023-02-18 00:00:00+00:00\n",
      "logging batch of size 935 for 2023-02-19 00:00:00+00:00\n",
      "logging batch of size 854 for 2023-02-20 00:00:00+00:00\n"
     ]
    }
   ],
   "source": [
    "import whylogs as why\n",
    "batches = dataset.get_production_data(number_batches=5)\n",
    "\n",
    "for batch in batches:\n",
    "  print(\"logging batch of size {} for {}\".format(len(batch.data),batch.timestamp))\n",
    "  profile = why.log(batch.data).profile()\n",
    "  profile.set_dataset_timestamp(batch.timestamp)\n",
    "  profile.view().write(\"batch_{}\".format(batch.timestamp))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}