python/examples/basic/Inspecting_Profiles.ipynb from whylabs/whylogs-python

python/examples/basic/Inspecting_Profiles.ipynb
Summary

Maintainability

Test Coverage

Issues
{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "bbfe7789",
      "metadata": {},
      "source": [
        ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> \n",
        ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Inspecting_Profiles)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Inspecting_Profiles) to leverage the power of whylogs and WhyLabs together!*"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "wgBeKz4TmtP7",
      "metadata": {
        "id": "wgBeKz4TmtP7"
      },
      "source": [
        "# Inspecting Profiles\n",
        "\n",
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Inspecting_Profiles.ipynb)\n",
        "\n",
        "In this notebook, we'll show how you can use whylog's Profile Viewer (`profile.view()`) to find useful statistics in a dataset. \n",
        "\n",
        "This includes:\n",
        "\n",
        "- Counters, such as number of samples and null values\n",
        "- Inferred types, such as integral, fractional, boolean, and strings\n",
        "- Estimated cardinality\n",
        "- Frequent items\n",
        "- Distribution metrics: min, max, mean, median, standard deviation, and quantile values\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "eShCq4LYGae9",
      "metadata": {
        "id": "eShCq4LYGae9"
      },
      "source": [
        "## Setup\n",
        "We'll need the `whylogs` and `pandas` libraries for this example.\n",
        "\n",
        "We'll also populate a dataframe with some data to inspect.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "ad907ce3-0c3b-49e4-86f1-eae9de934f7b",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ad907ce3-0c3b-49e4-86f1-eae9de934f7b",
        "jupyter": {
          "outputs_hidden": true
        },
        "outputId": "36cb94da-cb73-43d6-b26f-5e2360fe71f0",
        "tags": []
      },
      "outputs": [],
      "source": [
        "# install whylogs & pandas if needed\n",
        "# Note: you may need to restart the kernel to use updated packages.\n",
        "%pip install whylogs\n",
        "%pip install pandas "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "8369d3a8-9bf2-4043-a45a-13838498f211",
      "metadata": {
        "id": "8369d3a8-9bf2-4043-a45a-13838498f211"
      },
      "outputs": [],
      "source": [
        "# import whylogs and pandas\n",
        "import whylogs as why\n",
        "import pandas as pd\n",
        "\n",
        "# Set to show all columns in dataframe\n",
        "pd.set_option(\"display.max_columns\", None)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "WdF4F9FugqHq",
      "metadata": {
        "id": "WdF4F9FugqHq"
      },
      "outputs": [],
      "source": [
        "# create a simple test dataset\n",
        "data = {\n",
        "    \"animal\": [\"lion\", \"shark\", \"cat\", \"bear\", \"jellyfish\", \"kangaroo\",\n",
        "                                      \"jellyfish\", \"jellyfish\", \"fish\"],\n",
        "    \"legs\": [4, 0, 4, 4.0, None, 2, None, None, \"fins\"],\n",
        "    \"weight\": [14.3, 11.8, 4.3, 30.1,2.0,120.0,2.7,2.2, 1.2],\n",
        "}\n",
        "\n",
        "# Create dataframe with test dataset\n",
        "df = pd.DataFrame(data)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0nzsw8mHdzO6",
      "metadata": {
        "id": "0nzsw8mHdzO6"
      },
      "source": [
        "## Log data with whylogs, create a profile, and view statistics:\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "OHDz8SmCgqE6",
      "metadata": {
        "id": "OHDz8SmCgqE6"
      },
      "outputs": [],
      "source": [
        "# Log data with whylogs & create profile\n",
        "results = why.log(pandas=df)\n",
        "profile = results.profile()\n",
        "\n",
        "# Create profile view dataframe\n",
        "prof_view = profile.view()\n",
        "prof_df = prof_view.to_pandas()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "e6CXme06hook",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 274
        },
        "id": "e6CXme06hook",
        "outputId": "a5a61521-a39e-4daa-f386-bdda1252bf59"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>counts/n</th>\n",
              "      <th>counts/null</th>\n",
              "      <th>types/integral</th>\n",
              "      <th>types/fractional</th>\n",
              "      <th>types/boolean</th>\n",
              "      <th>types/string</th>\n",
              "      <th>types/object</th>\n",
              "      <th>cardinality/est</th>\n",
              "      <th>cardinality/upper_1</th>\n",
              "      <th>cardinality/lower_1</th>\n",
              "      <th>frequent_items/frequent_strings</th>\n",
              "      <th>type</th>\n",
              "      <th>distribution/mean</th>\n",
              "      <th>distribution/stddev</th>\n",
              "      <th>distribution/n</th>\n",
              "      <th>distribution/max</th>\n",
              "      <th>distribution/min</th>\n",
              "      <th>distribution/q_10</th>\n",
              "      <th>distribution/q_25</th>\n",
              "      <th>distribution/median</th>\n",
              "      <th>distribution/q_75</th>\n",
              "      <th>distribution/q_90</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>column</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>legs</th>\n",
              "      <td>9</td>\n",
              "      <td>3</td>\n",
              "      <td>4</td>\n",
              "      <td>1</td>\n",
              "      <td>0</td>\n",
              "      <td>1</td>\n",
              "      <td>0</td>\n",
              "      <td>4.0</td>\n",
              "      <td>4.00020</td>\n",
              "      <td>4.0</td>\n",
              "      <td>[FrequentItem(value='4.000000', est=3, upper=3...</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>weight</th>\n",
              "      <td>9</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>9</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>9.0</td>\n",
              "      <td>9.00045</td>\n",
              "      <td>9.0</td>\n",
              "      <td>NaN</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>20.955556</td>\n",
              "      <td>38.29749</td>\n",
              "      <td>9.0</td>\n",
              "      <td>120.0</td>\n",
              "      <td>1.2</td>\n",
              "      <td>1.2</td>\n",
              "      <td>2.2</td>\n",
              "      <td>4.3</td>\n",
              "      <td>14.3</td>\n",
              "      <td>120.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>animal</th>\n",
              "      <td>9</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>9</td>\n",
              "      <td>0</td>\n",
              "      <td>7.0</td>\n",
              "      <td>7.00035</td>\n",
              "      <td>7.0</td>\n",
              "      <td>[FrequentItem(value='jellyfish', est=3, upper=...</td>\n",
              "      <td>SummaryType.COLUMN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "        counts/n  counts/null  types/integral  types/fractional  \\\n",
              "column                                                            \n",
              "legs           9            3               4                 1   \n",
              "weight         9            0               0                 9   \n",
              "animal         9            0               0                 0   \n",
              "\n",
              "        types/boolean  types/string  types/object  cardinality/est  \\\n",
              "column                                                               \n",
              "legs                0             1             0              4.0   \n",
              "weight              0             0             0              9.0   \n",
              "animal              0             9             0              7.0   \n",
              "\n",
              "        cardinality/upper_1  cardinality/lower_1  \\\n",
              "column                                             \n",
              "legs                4.00020                  4.0   \n",
              "weight              9.00045                  9.0   \n",
              "animal              7.00035                  7.0   \n",
              "\n",
              "                          frequent_items/frequent_strings                type  \\\n",
              "column                                                                          \n",
              "legs    [FrequentItem(value='4.000000', est=3, upper=3...  SummaryType.COLUMN   \n",
              "weight                                                NaN  SummaryType.COLUMN   \n",
              "animal  [FrequentItem(value='jellyfish', est=3, upper=...  SummaryType.COLUMN   \n",
              "\n",
              "        distribution/mean  distribution/stddev  distribution/n  \\\n",
              "column                                                           \n",
              "legs                  NaN                  NaN             NaN   \n",
              "weight          20.955556             38.29749             9.0   \n",
              "animal                NaN                  NaN             NaN   \n",
              "\n",
              "        distribution/max  distribution/min  distribution/q_10  \\\n",
              "column                                                          \n",
              "legs                 NaN               NaN                NaN   \n",
              "weight             120.0               1.2                1.2   \n",
              "animal               NaN               NaN                NaN   \n",
              "\n",
              "        distribution/q_25  distribution/median  distribution/q_75  \\\n",
              "column                                                              \n",
              "legs                  NaN                  NaN                NaN   \n",
              "weight                2.2                  4.3               14.3   \n",
              "animal                NaN                  NaN                NaN   \n",
              "\n",
              "        distribution/q_90  \n",
              "column                     \n",
              "legs                  NaN  \n",
              "weight              120.0  \n",
              "animal                NaN  "
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# View Profile dataframe for dataset statistics\n",
        "prof_df"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c5b53612",
      "metadata": {},
      "source": [
        "The number of rows of our dataframe will be equal to the number of columns in the logged data. Each column of the statistics' dataframe contains a specific dimension of a given **Metric**."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "gJFHHsqSG07U",
      "metadata": {
        "id": "gJFHHsqSG07U"
      },
      "source": [
        "Taking a quick look at the generated statistics:\n",
        "\n",
        "#### animal\n",
        "The animal row shows there are `9` entries (counts/n). All the data types are strings. Cardinality estimates that `7` different animal types are in the dataset. Frequent items show `jellyfish` appearing the most.\n",
        "\n",
        "#### weight\n",
        "Our weight data contains `9` entries. All of them are `fractional` values. Cardinality shows that all `9` values are estimated to be unique. Since all entries were numerical the distribution statistics are generated.\n",
        "\n",
        "#### legs\n",
        "We can see that there are `9` entries for leg values, but they're several different data types. `3 null`, `4 integrals`, `1 float`, and `1 string`. Cardinality estimates `5` unique values. The most frequent number of legs that appear in the dataset is `4`.\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "nIS_XHZFRuXb",
      "metadata": {
        "id": "nIS_XHZFRuXb"
      },
      "source": [
        "### Selecting a single value\n",
        "A single cell can be selected to see full results if needed."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 64,
      "id": "SFNxGh7K-mRs",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "SFNxGh7K-mRs",
        "outputId": "18e61708-6806-42dc-e4fb-261f084e7a6e"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[FrequentItem(value='jellyfish', est=3, upper=3, lower=3),\n",
              " FrequentItem(value='cat', est=1, upper=1, lower=1),\n",
              " FrequentItem(value='lion', est=1, upper=1, lower=1),\n",
              " FrequentItem(value='fish', est=1, upper=1, lower=1),\n",
              " FrequentItem(value='shark', est=1, upper=1, lower=1),\n",
              " FrequentItem(value='kangaroo', est=1, upper=1, lower=1),\n",
              " FrequentItem(value='bear', est=1, upper=1, lower=1)]"
            ]
          },
          "execution_count": 64,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# Select a single statistic by feature and row\n",
        "prof_df['frequent_items/frequent_strings']['animal']"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "mTCLx4X6iYYC",
      "metadata": {
        "id": "mTCLx4X6iYYC"
      },
      "source": [
        "## Understanding The whylogs Profile Statistics\n",
        "\n",
        "By default whylogs will automatically generate these metrics based on data types.  \n",
        "\n",
        "The standard metrics available in whylogs are grouped in namespaces. They are:\n",
        "\n",
        "### Counts & Inferred Data Types:\n",
        "Counts and inferred data types track how many entries exist and what type data they contain.\n",
        "\n",
        "- `counts/n` - the total number of entries in a feature\n",
        "- `counts/null` the number of null values\n",
        "- `types/integral` - the number of values consisting of an integral (whole number)\n",
        "- `types/fractional` - the number of values consisting of a fractional value (float) \n",
        "- `types/boolean` - the number of values consisting of a boolean\n",
        "- `types/string` - the number of values consisting of a string\n",
        "- `types/object` - the number of values consisting of an object. If the data is not of any of the previous types, it will be assumed as an object\n",
        "\n",
        "### Cardinality\n",
        "Cardinality tracks an approximate unique value for each feature\n",
        "\n",
        "- `cardinality/est` - the estimated unique values for each feature\n",
        "- `cardinality/upper_1` - upper bound for the cardinality estimation. The actual cardinality will always be below this number.\n",
        "- `cardinality/lower_1` - lower bound for the cardinality estimation. The actual cardinality will always be above this number.\n",
        "       \n",
        "### Frequent Items:\n",
        "Frequent items track which items show up the most. \n",
        "\n",
        "- `frequent_items/frequent_strings` - the most frequent items\n",
        "\n",
        "### Distribution: \n",
        "Distribution statistics are generated when a feature contains numerical data. \n",
        "\n",
        "- `distribution/mean` - the calculated mean of the feature data\n",
        "- `distribution/stddev` - the calculated standard deviation of the feature data\n",
        "- `distribution/n` - the number of rows belonging to the feature\n",
        "- `distribution/max` - the highest (max) value in the feature \n",
        "- `distribution/min` - the smallest (min) value in the feature\n",
        "- `distribution/median` - the median value of the feature data\n",
        "- `distribution/q_xx` - the xx-th quantile value of the data's distribution  \n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "431d3d55",
      "metadata": {},
      "source": [
        "## Data Types and Metrics\n",
        "\n",
        "whylogs maps different data types, like numpy arrays, list, integers, etc. to specific whylogs data types. The three most important whylogs data types are:\n",
        "\n",
        "- Integral\n",
        "- Fractional\n",
        "- String\n",
        "\n",
        "By default, whylogs will track the following metrics according to the column's inferred data type:\n",
        "\n",
        "- Integral:\n",
        "    - `counts`\n",
        "    - `types`\n",
        "    - `distribution`\n",
        "    - `ints`\n",
        "    - `cardinality`\n",
        "    - `frequent_items`\n",
        "- Fractional:\n",
        "    - `counts`\n",
        "    - `types`\n",
        "    - `cardinality`\n",
        "    - `distribution`\n",
        "- String:\n",
        "    - `counts`\n",
        "    - `types`\n",
        "    - `cardinality`\n",
        "    - `frequent_items`\n",
        "\n",
        "If you want to know how you can customize this configuration, selecting the metrics according to the data type or column name, please go to the [Schema Configuration example](./Schema_Configuration.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "_ZGuhJQBckGO",
      "metadata": {
        "id": "_ZGuhJQBckGO"
      },
      "source": [
        "That's it!\n",
        "If you want to know more about whylogs, check our [documentation](https://whylogs.readthedocs.io/en/latest/)."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "v1 Inspecting w/ whylogs",
      "provenance": []
    },
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.10"
    },
    "vscode": {
      "interpreter": {
        "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}