whylabs/whylogs-python

python/examples/advanced/Profile_Store/LocalStore/LocalStore_with_Constraints.ipynb

{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "DPe5QBtcib8V",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "# Local Profile Store with Constraints\n",
        "\n",
        "Hey there! In this example we will understand how to setup the `LocalStore` and use it to track changes to our incoming data. It is an implementation of the `ProfileStore` that will manage listing, reading and writing whylogs' Dataset Profiles locally.\n",
        "\n",
        "## Installing whylogs\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: whylogs in /home/jamie/projects/v1/whylogs/python/.venv/lib/python3.8/site-packages (1.1.32)\n",
            "Requirement already satisfied: protobuf>=3.19.4 in /home/jamie/projects/v1/whylogs/python/.venv/lib/python3.8/site-packages (from whylogs) (4.22.1)\n",
            "Requirement already satisfied: typing-extensions>=3.10 in /home/jamie/projects/v1/whylogs/python/.venv/lib/python3.8/site-packages (from whylogs) (4.5.0)\n",
            "Requirement already satisfied: whylogs-sketching>=3.4.1.dev3 in /home/jamie/projects/v1/whylogs/python/.venv/lib/python3.8/site-packages (from whylogs) (3.4.1.dev3)\n",
            "Note: you may need to restart the kernel to use updated packages.\n"
          ]
        }
      ],
      "source": [
        "%pip install whylogs"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Setting up\n",
        "The first thing you'll need to do to start using the `LocalStore` is instantiate the object and check if any profiles were written with the `list` method."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "id": "6aTdd-q9ib8Z",
        "outputId": "4628fb91-f4ce-487c-b6da-c0daafa0acce",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['base_model_name']"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from whylogs.api.store import LocalStore\n",
        "\n",
        "store = LocalStore()\n",
        "store.list()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KJ02JYHRib8d",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "And since we have an empty list returned, it means that we haven't used the profile store in this location so far. But we can already check that a new directory called `profile_store` was created on our working directory:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "id": "iNFlI8awib8d",
        "outputId": "b735310a-de1e-432c-e802-05425b7fec55",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import os \n",
        "\"profile_store\" in os.listdir(os.getcwd())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uCL2jzlwib8e",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## Logging profiles\n",
        "\n",
        "Now that we have our `LocalStore` configured, let's write some data to it. In order to emulate a real use-case but also maintain this notebook less complex, we will instantiate a rolling logger instance and run it for 2 minutes. The interval in which we choose to roll, the logger will rotate and persist a merged profile to the `LocalStore`. And then we will ingest the same pandas DataFrame in order to emulate multiple log calls not in sync with the rotation schedule. This tries to bring to light a real streaming case, where there is a long-living logging application that receives multiple requests and rotates the profiles to the LocalStore with a certain time-range."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "id": "TSAPmEyqib8e",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "import pandas as pd \n",
        "\n",
        "df = pd.DataFrame({\"column_1\": [1,2,3,45], \"column_2\": [1,2,2,None], \"column_3\": [\"strings\", \"more\", \"strings\", \"\"]})"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "id": "UsD2XQbKib8e",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "import time\n",
        "import whylogs as why\n",
        "\n",
        "with why.logger(mode=\"rolling\", interval=1, when=\"M\", base_name=\"base_model_name\") as logger:\n",
        "    logger.append_store(store=store)\n",
        "\n",
        "    for _ in range(60):\n",
        "        logger.log(df)\n",
        "        time.sleep(2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_GSEE5Qiib8f",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "And then you should see new profiles created on your LocalStore. Let's investigate if that is actually the case:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "id": "IFxqGCLpib8f",
        "outputId": "9a944577-a182-4d97-d811-4648a1d94dde",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['profile_2023-03-25_2:13:43_b25f04c8-a785-4392-a0f9-21bd69e3f32c.bin',\n",
              " 'profile_2023-03-25_2:11:41_40dbc665-6f09-472b-ac4e-3443112c6be2.bin',\n",
              " 'profile_2023-03-25_2:13:0_a4bac770-627c-4399-a396-0fbfa11cda76.bin',\n",
              " 'profile_2023-03-25_2:25:0_93250835-7f87-4942-a0eb-ee382c3e8d74.bin',\n",
              " 'profile_2023-03-25_2:11:0_a490462d-defa-40b2-a934-84175dcb4b37.bin',\n",
              " 'profile_2023-03-25_2:25:51_721b2ad7-f390-4981-9e74-67295ccbe7c9.bin']"
            ]
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dataset_id = store.list()[0]\n",
        "os.listdir(f\"profile_store/{dataset_id}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TAJurTBlib8f",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## Read profiles from the store\n",
        "\n",
        "Another step in learning how to use the `LocalStore` is the ability to fetch back profiles. You can either do that by passing in a `DatasetIdQuery`, which will fetch all existing profiles within that dataset_id, or a `DateQuery`, that will get all written profiles for a specific datetime range."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "id": "SxZilMQYib8g",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from whylogs.api.store import DateQuery, DatasetIdQuery\n",
        "\n",
        "name_query = DatasetIdQuery(dataset_id=dataset_id)\n",
        "\n",
        "profile_view = store.get(query=name_query)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "id": "UfGbEDS_ib8g",
        "outputId": "5040bb7f-4ac0-4370-91a4-bbc7f28cc91f",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cardinality/est</th>\n",
              "      <th>cardinality/lower_1</th>\n",
              "      <th>cardinality/upper_1</th>\n",
              "      <th>counts/inf</th>\n",
              "      <th>counts/n</th>\n",
              "      <th>counts/nan</th>\n",
              "      <th>counts/null</th>\n",
              "      <th>distribution/max</th>\n",
              "      <th>distribution/mean</th>\n",
              "      <th>distribution/median</th>\n",
              "      <th>...</th>\n",
              "      <th>frequent_items/frequent_strings</th>\n",
              "      <th>ints/max</th>\n",
              "      <th>ints/min</th>\n",
              "      <th>type</th>\n",
              "      <th>types/boolean</th>\n",
              "      <th>types/fractional</th>\n",
              "      <th>types/integral</th>\n",
              "      <th>types/object</th>\n",
              "      <th>types/string</th>\n",
              "      <th>types/tensor</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>column_1</th>\n",
              "      <td>4.0</td>\n",
              "      <td>4.0</td>\n",
              "      <td>4.00020</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>45.0</td>\n",
              "      <td>12.750000</td>\n",
              "      <td>3.0</td>\n",
              "      <td>...</td>\n",
              "      <td>[FrequentItem(value='1', est=181, upper=181, l...</td>\n",
              "      <td>45.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column_2</th>\n",
              "      <td>2.0</td>\n",
              "      <td>2.0</td>\n",
              "      <td>2.00010</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>181</td>\n",
              "      <td>181</td>\n",
              "      <td>2.0</td>\n",
              "      <td>1.666667</td>\n",
              "      <td>2.0</td>\n",
              "      <td>...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>543</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column_3</th>\n",
              "      <td>3.0</td>\n",
              "      <td>3.0</td>\n",
              "      <td>3.00015</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>NaN</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>...</td>\n",
              "      <td>[FrequentItem(value='strings', est=361, upper=...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>3 rows × 31 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "          cardinality/est  cardinality/lower_1  cardinality/upper_1  \\\n",
              "column                                                                \n",
              "column_1              4.0                  4.0              4.00020   \n",
              "column_2              2.0                  2.0              2.00010   \n",
              "column_3              3.0                  3.0              3.00015   \n",
              "\n",
              "          counts/inf  counts/n  counts/nan  counts/null  distribution/max  \\\n",
              "column                                                                      \n",
              "column_1           0       724           0            0              45.0   \n",
              "column_2           0       724         181          181               2.0   \n",
              "column_3           0       724           0            0               NaN   \n",
              "\n",
              "          distribution/mean  distribution/median  ...  \\\n",
              "column                                            ...   \n",
              "column_1          12.750000                  3.0  ...   \n",
              "column_2           1.666667                  2.0  ...   \n",
              "column_3           0.000000                  NaN  ...   \n",
              "\n",
              "                            frequent_items/frequent_strings  ints/max  \\\n",
              "column                                                                  \n",
              "column_1  [FrequentItem(value='1', est=181, upper=181, l...      45.0   \n",
              "column_2                                                NaN       NaN   \n",
              "column_3  [FrequentItem(value='strings', est=361, upper=...       NaN   \n",
              "\n",
              "          ints/min                type  types/boolean  types/fractional  \\\n",
              "column                                                                    \n",
              "column_1       1.0  SummaryType.COLUMN              0                 0   \n",
              "column_2       NaN  SummaryType.COLUMN              0               543   \n",
              "column_3       NaN  SummaryType.COLUMN              0                 0   \n",
              "\n",
              "          types/integral  types/object  types/string  types/tensor  \n",
              "column                                                              \n",
              "column_1             724             0             0             0  \n",
              "column_2               0             0             0             0  \n",
              "column_3               0             0           724             0  \n",
              "\n",
              "[3 rows x 31 columns]"
            ]
          },
          "execution_count": 26,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "profile_view.to_pandas()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yJLRM6gZib8h",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "The second approach is to get from a certain date range. Since we have written only two profiles for the same minute, we will end up with the same result from before. The nice thing about this is that it will allow users to fetch profiles for a moving window of reference, as we will demonstrate below:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "id": "HwKyZ6ICib8h",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from datetime import datetime, timedelta, timezone\n",
        "\n",
        "\n",
        "date_query = DateQuery(\n",
        "    dataset_id=dataset_id,\n",
        "    start_date=datetime.now(timezone.utc) - timedelta(days=7),\n",
        "    end_date=datetime.now(timezone.utc)\n",
        ")\n",
        "\n",
        "timed_profile_view = store.get(query=date_query)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "id": "vyjXqlM5ib8h",
        "outputId": "519c5a92-2410-4337-9ff1-3ee7730dc83e",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cardinality/est</th>\n",
              "      <th>cardinality/lower_1</th>\n",
              "      <th>cardinality/upper_1</th>\n",
              "      <th>counts/inf</th>\n",
              "      <th>counts/n</th>\n",
              "      <th>counts/nan</th>\n",
              "      <th>counts/null</th>\n",
              "      <th>distribution/max</th>\n",
              "      <th>distribution/mean</th>\n",
              "      <th>distribution/median</th>\n",
              "      <th>...</th>\n",
              "      <th>frequent_items/frequent_strings</th>\n",
              "      <th>ints/max</th>\n",
              "      <th>ints/min</th>\n",
              "      <th>type</th>\n",
              "      <th>types/boolean</th>\n",
              "      <th>types/fractional</th>\n",
              "      <th>types/integral</th>\n",
              "      <th>types/object</th>\n",
              "      <th>types/string</th>\n",
              "      <th>types/tensor</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>column_1</th>\n",
              "      <td>4.0</td>\n",
              "      <td>4.0</td>\n",
              "      <td>4.00020</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>45.0</td>\n",
              "      <td>12.750000</td>\n",
              "      <td>3.0</td>\n",
              "      <td>...</td>\n",
              "      <td>[FrequentItem(value='1', est=181, upper=181, l...</td>\n",
              "      <td>45.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column_2</th>\n",
              "      <td>2.0</td>\n",
              "      <td>2.0</td>\n",
              "      <td>2.00010</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>181</td>\n",
              "      <td>181</td>\n",
              "      <td>2.0</td>\n",
              "      <td>1.666667</td>\n",
              "      <td>2.0</td>\n",
              "      <td>...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>543</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column_3</th>\n",
              "      <td>3.0</td>\n",
              "      <td>3.0</td>\n",
              "      <td>3.00015</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>NaN</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>...</td>\n",
              "      <td>[FrequentItem(value='strings', est=361, upper=...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>724</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>3 rows × 31 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "          cardinality/est  cardinality/lower_1  cardinality/upper_1  \\\n",
              "column                                                                \n",
              "column_1              4.0                  4.0              4.00020   \n",
              "column_2              2.0                  2.0              2.00010   \n",
              "column_3              3.0                  3.0              3.00015   \n",
              "\n",
              "          counts/inf  counts/n  counts/nan  counts/null  distribution/max  \\\n",
              "column                                                                      \n",
              "column_1           0       724           0            0              45.0   \n",
              "column_2           0       724         181          181               2.0   \n",
              "column_3           0       724           0            0               NaN   \n",
              "\n",
              "          distribution/mean  distribution/median  ...  \\\n",
              "column                                            ...   \n",
              "column_1          12.750000                  3.0  ...   \n",
              "column_2           1.666667                  2.0  ...   \n",
              "column_3           0.000000                  NaN  ...   \n",
              "\n",
              "                            frequent_items/frequent_strings  ints/max  \\\n",
              "column                                                                  \n",
              "column_1  [FrequentItem(value='1', est=181, upper=181, l...      45.0   \n",
              "column_2                                                NaN       NaN   \n",
              "column_3  [FrequentItem(value='strings', est=361, upper=...       NaN   \n",
              "\n",
              "          ints/min                type  types/boolean  types/fractional  \\\n",
              "column                                                                    \n",
              "column_1       1.0  SummaryType.COLUMN              0                 0   \n",
              "column_2       NaN  SummaryType.COLUMN              0               543   \n",
              "column_3       NaN  SummaryType.COLUMN              0                 0   \n",
              "\n",
              "          types/integral  types/object  types/string  types/tensor  \n",
              "column                                                              \n",
              "column_1             724             0             0             0  \n",
              "column_2               0             0             0             0  \n",
              "column_3               0             0           724             0  \n",
              "\n",
              "[3 rows x 31 columns]"
            ]
          },
          "execution_count": 28,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "timed_profile_view.to_pandas()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1Vpx-uI4ib8i",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "So everytime you run `get` with this specified `DateQuery`, you will get 7 previous days worth of data. And this can be useful, for instance, to compare a reference profile against a newly logged one. Let's see how to do that on the next section. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rVTWrDuOib8i",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        ">**IMPORTANT**: Please note that even if we pass a milisecond granular datetime in the `DateQuery` range, `store.get` will always search for profiles on a daily basis. We decided to do that to simplify the API usage as well as having a more statistically significant merged profile view when reading. If this does not fit your current needs for the `LocalStore`, please submit an issue on our Github repo and also feel free to ask others on our [community Slack](http://join.slack.whylabs.ai/).  "
      ]
    },
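    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick, hedged illustration of the note above: the query below differs from our earlier 7-day query only in its start time (a few hours back instead of a week), yet because `store.get` resolves dates on a daily basis it should return the same merged view, as long as all of the profiles above were written on the calendar day this cell runs. The `few_hours_query` and `few_hours_view` names are purely illustrative."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from datetime import datetime, timedelta, timezone\n",
        "\n",
        "# Start only a few hours back; the daily lookup still covers all of today's profiles.\n",
        "few_hours_query = DateQuery(\n",
        "    dataset_id=dataset_id,\n",
        "    start_date=datetime.now(timezone.utc) - timedelta(hours=6),\n",
        "    end_date=datetime.now(timezone.utc),\n",
        ")\n",
        "few_hours_view = store.get(query=few_hours_query)\n",
        "\n",
        "# Row counts should match the 7-day view fetched earlier if everything was logged today.\n",
        "print(few_hours_view.to_pandas()[\"counts/n\"])\n",
        "print(timed_profile_view.to_pandas()[\"counts/n\"])"
      ]
    },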
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OpOrUXjoib8i",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## Validating profiles with the Local Store\n",
        "\n",
        "Now let's use this new functionality to validate incoming profiles! This will be useful to trigger some actions when receiving incoming data, for example. In order to do that, we will need a set of fixed rules for the Validator, as well as a well-defined set of Constraints, so we can do comparisons while profiling and also after profiling. So let's get to it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "id": "spR6tgQrib8j",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from whylogs.core.relations import  Not, Predicate\n",
        "\n",
        "X = Predicate()\n",
        "\n",
        "name_condition = {\"is_not_value\": Not(X.equals(\"John\"))}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y-IfXOKVib8j",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "After defining the conditions that we wish to validate, we need to set the callback that will be triggered when this condition is met. We will simply print something to the screen for this example, but in a real usage scenario, you could possibly stop your processes and trigger an alert to your central communications channel, for example :) "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {
        "id": "Zu9MY4K1ib8j",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from typing import Any\n",
        "\n",
        "def do_something_important(validator_name: str, condition_name: str, value: Any):\n",
        "    print(f\"Validator {validator_name} failed! Condition name {condition_name} failed for value {value}\")\n"
      ]
    },
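    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a purely illustrative sketch of the stricter behaviour mentioned above, an action can just as well raise an exception to halt processing, or forward the failure to an alerting system. The `halt_on_failure` function below is a hypothetical example (not part of whylogs); it only needs to accept the same three arguments as `do_something_important`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def halt_on_failure(validator_name: str, condition_name: str, value: Any):\n",
        "    # Hypothetical stricter action: stop the pipeline instead of just printing.\n",
        "    # In practice you might also post the message to a webhook or chat channel here.\n",
        "    raise ValueError(\n",
        "        f\"Validator {validator_name}: condition {condition_name} failed for value {value}\"\n",
        "    )"
      ]
    },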
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ur8N6jH8ib8k",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "Lastly, we need to create the validator, that will take in the condition that we set along with the callback function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {
        "id": "SZMgkbR0ib8k",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from whylogs.core.validators import ConditionValidator\n",
        "\n",
        "\n",
        "name_validator = ConditionValidator(\n",
        "    name=\"no_one_named_john\",\n",
        "    conditions=name_condition,\n",
        "    actions=[do_something_important],\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lCgTBfWjib8k",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "And now we will map the condition to specific columns: "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {
        "id": "e9s92QBuib8k",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "validators = {\n",
        "    \"column_3\": [name_validator]\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "D6fW1-0fib8l",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "Finally, we can again log incoming data. For this example, we will log data for approximately 2 minutes, which will dump 2 new profiles to our `LocalStore`. Then, we will introduce data that won't match the validator condition, and we will see the callback being executed while logging! Please note that both DataFrames follow the **same schema**, only invalid data is being brought to the second one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {
        "id": "BrslYrRXib8l",
        "outputId": "1f2d3b2e-5f15-438d-b61e-823261bfb16e",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Validator no_one_named_john failed! Condition name is_not_value failed for value John\n"
          ]
        }
      ],
      "source": [
        "from whylogs.core.schema import DatasetSchema\n",
        "import whylogs as why\n",
        "\n",
        "schema = DatasetSchema(validators=validators)\n",
        "with why.logger(schema=schema, mode=\"rolling\", base_name=\"base_model_name\", interval=1, when=\"M\") as logger:\n",
        "    logger.append_store(store=store)\n",
        "\n",
        "    for _ in range(60):\n",
        "        logger.log(df)\n",
        "        time.sleep(2)\n",
        "\n",
        "    new_df = pd.DataFrame({\"column_1\": [1,2,3,45], \"column_2\": [1,2,2,None], \"column_3\": [\"John\", \"more\", \"strings\", \"\"]})\n",
        "\n",
        "    logger.log(new_df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "T4fULFVMib8m",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "And as we can see, the validation failed, since we introduced a DataFrame with an invalid data point! Let's check how does the stored profile look like:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {
        "id": "3Fe__X4gib8m",
        "outputId": "3d12d7c5-ef5e-49bd-dc4d-4bb848f71ec0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cardinality/est</th>\n",
              "      <th>cardinality/lower_1</th>\n",
              "      <th>cardinality/upper_1</th>\n",
              "      <th>counts/inf</th>\n",
              "      <th>counts/n</th>\n",
              "      <th>counts/nan</th>\n",
              "      <th>counts/null</th>\n",
              "      <th>distribution/max</th>\n",
              "      <th>distribution/mean</th>\n",
              "      <th>distribution/median</th>\n",
              "      <th>...</th>\n",
              "      <th>frequent_items/frequent_strings</th>\n",
              "      <th>ints/max</th>\n",
              "      <th>ints/min</th>\n",
              "      <th>type</th>\n",
              "      <th>types/boolean</th>\n",
              "      <th>types/fractional</th>\n",
              "      <th>types/integral</th>\n",
              "      <th>types/object</th>\n",
              "      <th>types/string</th>\n",
              "      <th>types/tensor</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>column_1</th>\n",
              "      <td>4.0</td>\n",
              "      <td>4.0</td>\n",
              "      <td>4.00020</td>\n",
              "      <td>0</td>\n",
              "      <td>968</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>45.0</td>\n",
              "      <td>12.750000</td>\n",
              "      <td>3.0</td>\n",
              "      <td>...</td>\n",
              "      <td>[FrequentItem(value='1', est=242, upper=242, l...</td>\n",
              "      <td>45.0</td>\n",
              "      <td>1.0</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>968</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column_2</th>\n",
              "      <td>2.0</td>\n",
              "      <td>2.0</td>\n",
              "      <td>2.00010</td>\n",
              "      <td>0</td>\n",
              "      <td>968</td>\n",
              "      <td>242</td>\n",
              "      <td>242</td>\n",
              "      <td>2.0</td>\n",
              "      <td>1.666667</td>\n",
              "      <td>2.0</td>\n",
              "      <td>...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>726</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column_3</th>\n",
              "      <td>3.0</td>\n",
              "      <td>3.0</td>\n",
              "      <td>3.00015</td>\n",
              "      <td>0</td>\n",
              "      <td>968</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>NaN</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>...</td>\n",
              "      <td>[FrequentItem(value='strings', est=482, upper=...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>968</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>3 rows × 31 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "          cardinality/est  cardinality/lower_1  cardinality/upper_1  \\\n",
              "column                                                                \n",
              "column_1              4.0                  4.0              4.00020   \n",
              "column_2              2.0                  2.0              2.00010   \n",
              "column_3              3.0                  3.0              3.00015   \n",
              "\n",
              "          counts/inf  counts/n  counts/nan  counts/null  distribution/max  \\\n",
              "column                                                                      \n",
              "column_1           0       968           0            0              45.0   \n",
              "column_2           0       968         242          242               2.0   \n",
              "column_3           0       968           0            0               NaN   \n",
              "\n",
              "          distribution/mean  distribution/median  ...  \\\n",
              "column                                            ...   \n",
              "column_1          12.750000                  3.0  ...   \n",
              "column_2           1.666667                  2.0  ...   \n",
              "column_3           0.000000                  NaN  ...   \n",
              "\n",
              "                            frequent_items/frequent_strings  ints/max  \\\n",
              "column                                                                  \n",
              "column_1  [FrequentItem(value='1', est=242, upper=242, l...      45.0   \n",
              "column_2                                                NaN       NaN   \n",
              "column_3  [FrequentItem(value='strings', est=482, upper=...       NaN   \n",
              "\n",
              "          ints/min                type  types/boolean  types/fractional  \\\n",
              "column                                                                    \n",
              "column_1       1.0  SummaryType.COLUMN              0                 0   \n",
              "column_2       NaN  SummaryType.COLUMN              0               726   \n",
              "column_3       NaN  SummaryType.COLUMN              0                 0   \n",
              "\n",
              "          types/integral  types/object  types/string  types/tensor  \n",
              "column                                                              \n",
              "column_1             968             0             0             0  \n",
              "column_2               0             0             0             0  \n",
              "column_3               0             0           968             0  \n",
              "\n",
              "[3 rows x 31 columns]"
            ]
          },
          "execution_count": 34,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "name_query = DatasetIdQuery(dataset_id=\"base_model_name\")\n",
        "profile_view = store.get(query=name_query)\n",
        "profile_view.to_pandas()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4ljkzZipib8m",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "## Comparing profiles with the Profile Store\n",
        "\n",
        "Last thing we wanted to demonstrate is the ability to fetch ever-moving reference profiles from the `LocalStore` and use them to compare to recently profile data. We will use whylogs' Constraints module along with the `LocalStore` and we will see how users might benefit from it in the future. In order to do that, we will make two queries to the store, one of them will be our reference, and the other one will aggregate only today's worth of data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {
        "id": "16CByOp8ib8m",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from datetime import timezone\n",
        "\n",
        "\n",
        "store = LocalStore()\n",
        "\n",
        "today_query = DateQuery(start_date=datetime.now(timezone.utc), dataset_id=\"base_model_name\")\n",
        "reference_query = DateQuery(start_date=datetime.now(timezone.utc) - timedelta(days=7), end_date=datetime.now(timezone.utc), dataset_id=\"base_model_name\")\n",
        "\n",
        "today_profile = store.get(query=today_query)\n",
        "reference_profile = store.get(query=reference_query)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M3TAR0P1ib8n",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "With both profiles read, now what we need to do is define our constraints suite. For demonstration purposes, we will check if the column values are **not** greater than the average of the reference. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 36,
      "metadata": {
        "id": "L1PFJT0Wib8o",
        "outputId": "a696d0c6-8f72-4f15-c36a-c95538b90397",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "False\n",
            "[ReportResult(name='null percentage of column_2 lower than 0.4', passed=1, failed=0, summary=None), ReportResult(name='column_1 greater than number 12.749999999999998', passed=0, failed=1, summary=None)]\n"
          ]
        }
      ],
      "source": [
        "from whylogs.core.constraints import ConstraintsBuilder\n",
        "from whylogs.core.constraints.factories import greater_than_number, null_percentage_below_number\n",
        "\n",
        "reference_mean = reference_profile.get_column(\"column_1\").get_metric(\"distribution\").avg\n",
        "\n",
        "builder = ConstraintsBuilder(dataset_profile_view=today_profile)\n",
        "builder.add_constraint(null_percentage_below_number(column_name=\"column_2\", number=0.4))\n",
        "builder.add_constraint(greater_than_number(column_name=\"column_1\", number=reference_mean))\n",
        "\n",
        "constraints = builder.build()\n",
        "print(constraints.validate())\n",
        "print(constraints.generate_constraints_report())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "_rcsE2fkib8o",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "So we have just validated our recently read profile against the reference, which has 7 days worth of profiles. We can also mix the reference profile checks with other ones just for the newest. \n",
        "Hopefully this short demonstration notebook can bring some of the features that the `LocalStore` brings to help you make the most out of whylogs and make your data and ML pipelines more robust and responsible. To learn more, check out our [other examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {
        "id": "5JnYKq5Wib8p",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "# cleaning up\n",
        "import shutil\n",
        "shutil.rmtree(\"./profile_store\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3.9.13 ('.venv': poetry)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.10"
    },
    "orig_nbformat": 4,
    "vscode": {
      "interpreter": {
        "hash": "0f484380554f045e8316d9ef136659363ef199c84a7347221e49b73e46486d36"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}