whylabs/whylogs-python

python/examples/advanced/Segments.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> \n",
    ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Segments) to leverage the power of whylogs and WhyLabs together!*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Intro to Segmentation"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Segments.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes, certain subgroups of data can behave very differently from the overall dataset. When monitoring the health of a dataset, it’s often helpful to have visibility at the sub-group level to better understand how these subgroups are contributing to trends in the overall dataset. whylogs supports data segmentation for this purpose.\n",
    "\n",
    "Data segmentation is done at the point of profiling a dataset.\n",
    "\n",
    "Segmentation can be done by a single feature or by multiple features simultaneously. For example, you could have different profiles according to the gender of your dataset (\"M\" or \"F\"), and also for different combinations of, let's say, Gender and City Code. You can also further filter the segments for specific partitions you are interested in - let's say, Gender \"M\" with age above 18.\n",
    "\n",
    "In this example, we will show you a number of ways you can segment your data, and also how you can write these profiles to different locations."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Content\n",
    "\n",
    "1. Segmenting on a single column\n",
    "2. Segmenting on multiple columns\n",
    "3. Filtering Segments\n",
    "4. Writing Segmented Results to Disk\n",
    "5. Sending Segmented Results to WhyLabs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing whylogs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you don't have it installed already, install whylogs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: you may need to restart the kernel to use updated packages.\n",
    "%pip install whylogs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the Data & Defining the Segments"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first download the data we'll be working with.\n",
    "\n",
    "This dataset contains transaction information for an online grocery store, such as:\n",
    "\n",
    "- product description\n",
    "- category\n",
    "- user rating\n",
    "- market price\n",
    "- number of items sold last week"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>product</th>\n",
       "      <th>category</th>\n",
       "      <th>rating</th>\n",
       "      <th>market_price</th>\n",
       "      <th>sales_last_week</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2022-08-09 00:00:00+00:00</td>\n",
       "      <td>Wood - Centre Filled Bar Infused With Dark Mou...</td>\n",
       "      <td>Snacks and Branded Foods</td>\n",
       "      <td>4</td>\n",
       "      <td>350.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2022-08-09 00:00:00+00:00</td>\n",
       "      <td>Toasted Almonds</td>\n",
       "      <td>Gourmet and World Food</td>\n",
       "      <td>3</td>\n",
       "      <td>399.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2022-08-09 00:00:00+00:00</td>\n",
       "      <td>Instant Thai Noodles - Hot &amp; Spicy Tomyum</td>\n",
       "      <td>Gourmet and World Food</td>\n",
       "      <td>3</td>\n",
       "      <td>95.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2022-08-09 00:00:00+00:00</td>\n",
       "      <td>Thokku - Vathakozhambu</td>\n",
       "      <td>Snacks and Branded Foods</td>\n",
       "      <td>4</td>\n",
       "      <td>336.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2022-08-09 00:00:00+00:00</td>\n",
       "      <td>Beetroot Powder</td>\n",
       "      <td>Gourmet and World Food</td>\n",
       "      <td>3</td>\n",
       "      <td>150.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        date  \\\n",
       "0  2022-08-09 00:00:00+00:00   \n",
       "1  2022-08-09 00:00:00+00:00   \n",
       "2  2022-08-09 00:00:00+00:00   \n",
       "3  2022-08-09 00:00:00+00:00   \n",
       "4  2022-08-09 00:00:00+00:00   \n",
       "\n",
       "                                             product  \\\n",
       "0  Wood - Centre Filled Bar Infused With Dark Mou...   \n",
       "1                                    Toasted Almonds   \n",
       "2          Instant Thai Noodles - Hot & Spicy Tomyum   \n",
       "3                             Thokku - Vathakozhambu   \n",
       "4                                    Beetroot Powder   \n",
       "\n",
       "                   category  rating  market_price  sales_last_week  \n",
       "0  Snacks and Branded Foods       4         350.0                1  \n",
       "1    Gourmet and World Food       3         399.0                1  \n",
       "2    Gourmet and World Food       3          95.0                1  \n",
       "3  Snacks and Branded Foods       4         336.0                1  \n",
       "4    Gourmet and World Food       3         150.0                1  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "url = \"https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Ecommerce/baseline_dataset_base.csv\"\n",
    "df = pd.read_csv(url)[[\"date\",\"product\",\"category\", \"rating\", \"market_price\",\"sales_last_week\"]]\n",
    "df['rating'] = df['rating'].astype(int)\n",
    "\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a class=\"anchor\" id=\"single\"></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Segmenting on a Single Column"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like the `category` feature is a good one to segment on. Let's see how many categories there are for the complete dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Beauty and Hygiene            9793\n",
       "Gourmet and World Food        6201\n",
       "Kitchen, Garden and Pets      4493\n",
       "Snacks and Branded Foods      3826\n",
       "Cleaning and Household        3446\n",
       "Foodgrains, Oil and Masala    3059\n",
       "Beverages                     1034\n",
       "Bakery, Cakes and Dairy        979\n",
       "Fruits and Vegetables          749\n",
       "Baby Care                      707\n",
       "Eggs, Meat and Fish            456\n",
       "Name: category, dtype: int64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['category'].value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 11 categories.\n",
    "\n",
    "We might be interested in having access to metrics specific to each category, so let's generate segmented profiles for each category."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.core.segmentation_partition import segment_on_column\n",
    "\n",
    "column_segments = segment_on_column(\"category\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'category': SegmentationPartition(name='category', mapper=ColumnMapperFunction(col_names=['category'], map=None, field_projector=<whylogs.core.projectors.FieldProjector object at 0x7fe82f5c9b80>, id='31aee7544d31ada00c3bb3d94ca2e0595c42a1f21c266da65e132168914ed61fe8b1b8c99aaa1a0c5cf5e2dfbd621aa3f9700bef1f6e85f4de4ca6364f149592'), id='8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8', filter=None)}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "column_segments"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`column_segments` is a dictionary for different SegmentationPartition, with informations such as id and additional logic. For the moment, all we're interested in is that we can pass it to our `DatasetSchema` in order to generate segmented profiles while logging: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import whylogs as why\n",
    "from whylogs.core.schema import DatasetSchema\n",
    "\n",
    "results = why.log(df, schema=DatasetSchema(segments=column_segments))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we had 11 different categories, we can expect the `results` to have 11 segments. Let's make sure that is the case:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After profiling the result set has: 11 segments\n"
     ]
    }
   ],
   "source": [
    "print(f\"After profiling the result set has: {results.count} segments\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great.\n",
    "\n",
    "Now, let's visualize the metrics for a single segment (the first one).\n",
    "\n",
    "Results can have multiple partitions, and each partition can have multiple segments. Segments within a partition are non-overlapping. Segments across partitions, however, might overlap. \n",
    "\n",
    "In this example, we have only one partition with 11 non-overlapping segments. Let's fetch the available segments:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's visualize the metrics for the first segment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Profile view for segment ('Baby Care',)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cardinality/est</th>\n",
       "      <th>cardinality/lower_1</th>\n",
       "      <th>cardinality/upper_1</th>\n",
       "      <th>counts/n</th>\n",
       "      <th>counts/null</th>\n",
       "      <th>distribution/max</th>\n",
       "      <th>distribution/mean</th>\n",
       "      <th>distribution/median</th>\n",
       "      <th>distribution/min</th>\n",
       "      <th>distribution/n</th>\n",
       "      <th>...</th>\n",
       "      <th>distribution/stddev</th>\n",
       "      <th>frequent_items/frequent_strings</th>\n",
       "      <th>type</th>\n",
       "      <th>types/boolean</th>\n",
       "      <th>types/fractional</th>\n",
       "      <th>types/integral</th>\n",
       "      <th>types/object</th>\n",
       "      <th>types/string</th>\n",
       "      <th>ints/max</th>\n",
       "      <th>ints/min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>column</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000050</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='Baby Care', est=707, uppe...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>707</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>date</th>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.0</td>\n",
       "      <td>8.000400</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='2022-08-15 00:00:00+00:00...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>707</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>market_price</th>\n",
       "      <td>57.000008</td>\n",
       "      <td>57.0</td>\n",
       "      <td>57.002854</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>2799.0</td>\n",
       "      <td>621.190948</td>\n",
       "      <td>299.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>707</td>\n",
       "      <td>...</td>\n",
       "      <td>713.745256</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>product</th>\n",
       "      <td>69.000012</td>\n",
       "      <td>69.0</td>\n",
       "      <td>69.003457</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='Baby Powder', est=21, upp...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>707</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rating</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.000150</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.823197</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>707</td>\n",
       "      <td>...</td>\n",
       "      <td>0.500566</td>\n",
       "      <td>[FrequentItem(value='4', est=508, upper=508, l...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sales_last_week</th>\n",
       "      <td>5.000000</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.000250</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>1.391796</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>707</td>\n",
       "      <td>...</td>\n",
       "      <td>1.003162</td>\n",
       "      <td>[FrequentItem(value='1', est=557, upper=557, l...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>707</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6 rows × 28 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 cardinality/est  cardinality/lower_1  cardinality/upper_1  \\\n",
       "column                                                                       \n",
       "category                1.000000                  1.0             1.000050   \n",
       "date                    8.000000                  8.0             8.000400   \n",
       "market_price           57.000008                 57.0            57.002854   \n",
       "product                69.000012                 69.0            69.003457   \n",
       "rating                  3.000000                  3.0             3.000150   \n",
       "sales_last_week         5.000000                  5.0             5.000250   \n",
       "\n",
       "                 counts/n  counts/null  distribution/max  distribution/mean  \\\n",
       "column                                                                        \n",
       "category              707            0               NaN           0.000000   \n",
       "date                  707            0               NaN           0.000000   \n",
       "market_price          707            0            2799.0         621.190948   \n",
       "product               707            0               NaN           0.000000   \n",
       "rating                707            0               5.0           3.823197   \n",
       "sales_last_week       707            0               6.0           1.391796   \n",
       "\n",
       "                 distribution/median  distribution/min  distribution/n  ...  \\\n",
       "column                                                                  ...   \n",
       "category                         NaN               NaN               0  ...   \n",
       "date                             NaN               NaN               0  ...   \n",
       "market_price                   299.0              50.0             707  ...   \n",
       "product                          NaN               NaN               0  ...   \n",
       "rating                           4.0               3.0             707  ...   \n",
       "sales_last_week                  1.0               1.0             707  ...   \n",
       "\n",
       "                 distribution/stddev  \\\n",
       "column                                 \n",
       "category                    0.000000   \n",
       "date                        0.000000   \n",
       "market_price              713.745256   \n",
       "product                     0.000000   \n",
       "rating                      0.500566   \n",
       "sales_last_week             1.003162   \n",
       "\n",
       "                                   frequent_items/frequent_strings  \\\n",
       "column                                                               \n",
       "category         [FrequentItem(value='Baby Care', est=707, uppe...   \n",
       "date             [FrequentItem(value='2022-08-15 00:00:00+00:00...   \n",
       "market_price                                                   NaN   \n",
       "product          [FrequentItem(value='Baby Powder', est=21, upp...   \n",
       "rating           [FrequentItem(value='4', est=508, upper=508, l...   \n",
       "sales_last_week  [FrequentItem(value='1', est=557, upper=557, l...   \n",
       "\n",
       "                               type  types/boolean  types/fractional  \\\n",
       "column                                                                 \n",
       "category         SummaryType.COLUMN              0                 0   \n",
       "date             SummaryType.COLUMN              0                 0   \n",
       "market_price     SummaryType.COLUMN              0               707   \n",
       "product          SummaryType.COLUMN              0                 0   \n",
       "rating           SummaryType.COLUMN              0                 0   \n",
       "sales_last_week  SummaryType.COLUMN              0                 0   \n",
       "\n",
       "                 types/integral  types/object  types/string  ints/max ints/min  \n",
       "column                                                                          \n",
       "category                      0             0           707       NaN      NaN  \n",
       "date                          0             0           707       NaN      NaN  \n",
       "market_price                  0             0             0       NaN      NaN  \n",
       "product                       0             0           707       NaN      NaN  \n",
       "rating                      707             0             0       5.0      3.0  \n",
       "sales_last_week             707             0             0       6.0      1.0  \n",
       "\n",
       "[6 rows x 28 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_segment = results.segments()[0]\n",
    "segmented_profile = results.profile(first_segment)\n",
    "\n",
    "print(\"Profile view for segment {}\".format(first_segment.key))\n",
    "segmented_profile.view().to_pandas()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the first segment is for product transactions of the `Baby Care` category, and we have 707 rows for that particular segment."
   ]
  },
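  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To look at every segment rather than just the first one, we can loop over `results.segments()`. The following is a minimal sketch that relies only on the calls already used above (`segments()`, `profile()`, `view()`, `to_pandas()`). It runs on a small made-up frame (`toy_df` and its values are hypothetical) so that it does not overwrite `df` or `results`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import whylogs as why\n",
    "from whylogs.core.schema import DatasetSchema\n",
    "from whylogs.core.segmentation_partition import segment_on_column\n",
    "\n",
    "# Toy data with hypothetical values, used only for illustration.\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        \"category\": [\"Beverages\", \"Beverages\", \"Baby Care\"],\n",
    "        \"market_price\": [120.0, 95.0, 299.0],\n",
    "    }\n",
    ")\n",
    "toy_results = why.log(toy_df, schema=DatasetSchema(segments=segment_on_column(\"category\")))\n",
    "\n",
    "# Collect the row count of each segment from the \"counts/n\" summary column.\n",
    "rows_per_segment = {}\n",
    "for segment in toy_results.segments():\n",
    "    summary = toy_results.profile(segment).view().to_pandas()\n",
    "    rows_per_segment[segment.key] = int(summary.loc[\"category\", \"counts/n\"])\n",
    "\n",
    "print(rows_per_segment)"
   ]
  },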
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Segmenting on More than one Column\n",
    "\n",
    "<a id='multi-column'></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We might also be interested in segmenting based on more than one segment.\n",
    "\n",
    "Let's say we are interested in generating profiles for every combination of `category` and `rating`. That way, we can inspect the metrics for, let's say, for `Beverages` with `rating` of 5. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4    15699\n",
       "3    15340\n",
       "5     1901\n",
       "2     1222\n",
       "1      581\n",
       "Name: rating, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['rating'].value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time, we'll use `SegmentationPartition` to create the partition:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.core.segmentation_partition import (\n",
    "    ColumnMapperFunction,\n",
    "    SegmentationPartition,\n",
    ")\n",
    "\n",
    "segmentation_partition = SegmentationPartition(\n",
    "    name=\"category,rating\", mapper=ColumnMapperFunction(col_names=[\"category\", \"rating\"])\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create our dictionary with the only partition we have:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After profiling the result set has: 46 segments\n"
     ]
    }
   ],
   "source": [
    "multi_column_segments = {segmentation_partition.name: segmentation_partition}\n",
    "results = why.log(df, schema=DatasetSchema(segments=multi_column_segments))\n",
    "\n",
    "print(f\"After profiling the result set has: {results.count} segments\")\n"
   ]
  },
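  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that 46 is fewer than the 11 × 5 = 55 combinations that 11 categories and 5 ratings could form. A plausible reading, offered here as an assumption rather than a statement about whylogs internals, is that a segment is only created for value combinations that actually occur in the data. The pandas-only sketch below illustrates the idea on a small made-up frame (`toy_df` and its values are hypothetical, not taken from the grocery dataset). Running the same `groupby` on `df` can serve as a rough cross-check of the count reported by `results.count`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy data with hypothetical values, used only for illustration.\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        \"category\": [\"Beverages\", \"Beverages\", \"Baby Care\", \"Baby Care\", \"Baby Care\"],\n",
    "        \"rating\": [5, 5, 3, 4, 3],\n",
    "    }\n",
    ")\n",
    "\n",
    "# One group per distinct (category, rating) pair that actually occurs in the rows.\n",
    "observed_combinations = toy_df.groupby([\"category\", \"rating\"]).size()\n",
    "\n",
    "# Upper bound if every category appeared together with every rating.\n",
    "possible_combinations = toy_df[\"category\"].nunique() * toy_df[\"rating\"].nunique()\n",
    "\n",
    "print(f\"Observed combinations: {len(observed_combinations)}\")\n",
    "print(f\"Possible combinations: {possible_combinations}\")"
   ]
  },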
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, let's check the first segment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Profile view for segment ('Baby Care', '3')\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cardinality/est</th>\n",
       "      <th>cardinality/lower_1</th>\n",
       "      <th>cardinality/upper_1</th>\n",
       "      <th>counts/n</th>\n",
       "      <th>counts/null</th>\n",
       "      <th>distribution/max</th>\n",
       "      <th>distribution/mean</th>\n",
       "      <th>distribution/median</th>\n",
       "      <th>distribution/min</th>\n",
       "      <th>distribution/n</th>\n",
       "      <th>...</th>\n",
       "      <th>distribution/stddev</th>\n",
       "      <th>frequent_items/frequent_strings</th>\n",
       "      <th>type</th>\n",
       "      <th>types/boolean</th>\n",
       "      <th>types/fractional</th>\n",
       "      <th>types/integral</th>\n",
       "      <th>types/object</th>\n",
       "      <th>types/string</th>\n",
       "      <th>ints/max</th>\n",
       "      <th>ints/min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>column</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000050</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='Baby Care', est=162, uppe...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>162</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>date</th>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.0</td>\n",
       "      <td>8.000400</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='2022-08-15 00:00:00+00:00...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>162</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>market_price</th>\n",
       "      <td>15.000001</td>\n",
       "      <td>15.0</td>\n",
       "      <td>15.000749</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>2799.0</td>\n",
       "      <td>649.987654</td>\n",
       "      <td>265.0</td>\n",
       "      <td>149.0</td>\n",
       "      <td>162</td>\n",
       "      <td>...</td>\n",
       "      <td>889.494280</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>product</th>\n",
       "      <td>16.000001</td>\n",
       "      <td>16.0</td>\n",
       "      <td>16.000799</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='Baby Sipper With Pop-up S...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>162</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rating</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000050</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>162</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='3', est=162, upper=162, l...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sales_last_week</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.000150</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.271605</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>162</td>\n",
       "      <td>...</td>\n",
       "      <td>0.705125</td>\n",
       "      <td>[FrequentItem(value='1', est=134, upper=134, l...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>162</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6 rows × 28 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 cardinality/est  cardinality/lower_1  cardinality/upper_1  \\\n",
       "column                                                                       \n",
       "category                1.000000                  1.0             1.000050   \n",
       "date                    8.000000                  8.0             8.000400   \n",
       "market_price           15.000001                 15.0            15.000749   \n",
       "product                16.000001                 16.0            16.000799   \n",
       "rating                  1.000000                  1.0             1.000050   \n",
       "sales_last_week         3.000000                  3.0             3.000150   \n",
       "\n",
       "                 counts/n  counts/null  distribution/max  distribution/mean  \\\n",
       "column                                                                        \n",
       "category              162            0               NaN           0.000000   \n",
       "date                  162            0               NaN           0.000000   \n",
       "market_price          162            0            2799.0         649.987654   \n",
       "product               162            0               NaN           0.000000   \n",
       "rating                162            0               3.0           3.000000   \n",
       "sales_last_week       162            0               4.0           1.271605   \n",
       "\n",
       "                 distribution/median  distribution/min  distribution/n  ...  \\\n",
       "column                                                                  ...   \n",
       "category                         NaN               NaN               0  ...   \n",
       "date                             NaN               NaN               0  ...   \n",
       "market_price                   265.0             149.0             162  ...   \n",
       "product                          NaN               NaN               0  ...   \n",
       "rating                           3.0               3.0             162  ...   \n",
       "sales_last_week                  1.0               1.0             162  ...   \n",
       "\n",
       "                 distribution/stddev  \\\n",
       "column                                 \n",
       "category                    0.000000   \n",
       "date                        0.000000   \n",
       "market_price              889.494280   \n",
       "product                     0.000000   \n",
       "rating                      0.000000   \n",
       "sales_last_week             0.705125   \n",
       "\n",
       "                                   frequent_items/frequent_strings  \\\n",
       "column                                                               \n",
       "category         [FrequentItem(value='Baby Care', est=162, uppe...   \n",
       "date             [FrequentItem(value='2022-08-15 00:00:00+00:00...   \n",
       "market_price                                                   NaN   \n",
       "product          [FrequentItem(value='Baby Sipper With Pop-up S...   \n",
       "rating           [FrequentItem(value='3', est=162, upper=162, l...   \n",
       "sales_last_week  [FrequentItem(value='1', est=134, upper=134, l...   \n",
       "\n",
       "                               type  types/boolean  types/fractional  \\\n",
       "column                                                                 \n",
       "category         SummaryType.COLUMN              0                 0   \n",
       "date             SummaryType.COLUMN              0                 0   \n",
       "market_price     SummaryType.COLUMN              0               162   \n",
       "product          SummaryType.COLUMN              0                 0   \n",
       "rating           SummaryType.COLUMN              0                 0   \n",
       "sales_last_week  SummaryType.COLUMN              0                 0   \n",
       "\n",
       "                 types/integral  types/object  types/string  ints/max ints/min  \n",
       "column                                                                          \n",
       "category                      0             0           162       NaN      NaN  \n",
       "date                          0             0           162       NaN      NaN  \n",
       "market_price                  0             0             0       NaN      NaN  \n",
       "product                       0             0           162       NaN      NaN  \n",
       "rating                      162             0             0       3.0      3.0  \n",
       "sales_last_week             162             0             0       4.0      1.0  \n",
       "\n",
       "[6 rows x 28 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "partition = results.partitions[0]\n",
    "segments = results.segments_in_partition(partition)\n",
    "\n",
    "first_segment = next(iter(segments))\n",
    "segmented_profile = results.profile(first_segment)\n",
    "\n",
    "print(\"Profile view for segment {}\".format(first_segment.key))\n",
    "segmented_profile.view().to_pandas()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first segment is now for transactions of __Baby Care__ category with rating of __3__. There are 162 records for this specific segment."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering the Segments"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can further select data in a partition by using a `SegmentFilter`.\n",
    "\n",
    "Let's say you are interested only in the `Baby Care` category. Instead of generating all 11 segmented features, you can specify a __SegmentFilter__ to get only one segment.\n",
    "\n",
    "We can do so by specifying a __filter function__ to the __filter__ property of the Partition:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from whylogs.core.segmentation_partition import segment_on_column\n",
    "from whylogs.core.segmentation_partition import SegmentFilter\n",
    "\n",
    "column_segments = segment_on_column(\"category\")\n",
    "\n",
    "column_segments['category'].filter = SegmentFilter(filter_function=lambda df: df.category=='Baby Care')\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're passing a simple lambda function here, but you can define more complex scenarios by passing any Callable to it."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we just repeat the logging process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After profiling the result set has: 1 segments\n"
     ]
    }
   ],
   "source": [
    "import whylogs as why\n",
    "from whylogs.core.schema import DatasetSchema\n",
    "\n",
    "results = why.log(df, schema=DatasetSchema(segments=column_segments))\n",
    "\n",
    "print(f\"After profiling the result set has: {results.count} segments\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that now we have only 1 segment."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering on other columns"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You don't need to filter on the same category you're segmenting on. In fact, you can use multiple columns to get very specific slices of interest for your data.\n",
    "\n",
    "Unlike segmenting on multiple columns, with filtering you don't need to get the segments for the complete cartesian product of your rules. This avoids combinatorial explosions for cases when you are interested in a very specific slice of your data, and are not particularly interested in all possible group combinations.\n",
    "\n",
    "Let's say high-quality, high-cost products are key to a certain promotion you want to release. You can create segments based on `category`, just as before, and can further filter it to track only data for your defined rule.\n",
    "\n",
    "The only difference between this case and the previous one is the lambda function provided, but for reproducibility let's repeat the whole code again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Profile view for segment ('Baby Care',)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cardinality/est</th>\n",
       "      <th>cardinality/lower_1</th>\n",
       "      <th>cardinality/upper_1</th>\n",
       "      <th>counts/n</th>\n",
       "      <th>counts/null</th>\n",
       "      <th>distribution/max</th>\n",
       "      <th>distribution/mean</th>\n",
       "      <th>distribution/median</th>\n",
       "      <th>distribution/min</th>\n",
       "      <th>distribution/n</th>\n",
       "      <th>...</th>\n",
       "      <th>distribution/stddev</th>\n",
       "      <th>frequent_items/frequent_strings</th>\n",
       "      <th>type</th>\n",
       "      <th>types/boolean</th>\n",
       "      <th>types/fractional</th>\n",
       "      <th>types/integral</th>\n",
       "      <th>types/object</th>\n",
       "      <th>types/string</th>\n",
       "      <th>ints/max</th>\n",
       "      <th>ints/min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>column</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000050</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='Baby Care', est=389, uppe...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>389</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>date</th>\n",
       "      <td>8.000000</td>\n",
       "      <td>8.0</td>\n",
       "      <td>8.000400</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='2022-08-12 00:00:00+00:00...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>389</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>market_price</th>\n",
       "      <td>32.000002</td>\n",
       "      <td>32.0</td>\n",
       "      <td>32.001600</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>2638.0</td>\n",
       "      <td>809.352185</td>\n",
       "      <td>495.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>389</td>\n",
       "      <td>...</td>\n",
       "      <td>679.345870</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>product</th>\n",
       "      <td>38.000003</td>\n",
       "      <td>38.0</td>\n",
       "      <td>38.001901</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>[FrequentItem(value='Baby Powder', est=21, upp...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>389</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rating</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.000100</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.071979</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>389</td>\n",
       "      <td>...</td>\n",
       "      <td>0.258787</td>\n",
       "      <td>[FrequentItem(value='4', est=361, upper=361, l...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sales_last_week</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.000200</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>1.483290</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>389</td>\n",
       "      <td>...</td>\n",
       "      <td>1.170009</td>\n",
       "      <td>[FrequentItem(value='1', est=292, upper=292, l...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6 rows × 28 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                 cardinality/est  cardinality/lower_1  cardinality/upper_1  \\\n",
       "column                                                                       \n",
       "category                1.000000                  1.0             1.000050   \n",
       "date                    8.000000                  8.0             8.000400   \n",
       "market_price           32.000002                 32.0            32.001600   \n",
       "product                38.000003                 38.0            38.001901   \n",
       "rating                  2.000000                  2.0             2.000100   \n",
       "sales_last_week         4.000000                  4.0             4.000200   \n",
       "\n",
       "                 counts/n  counts/null  distribution/max  distribution/mean  \\\n",
       "column                                                                        \n",
       "category              389            0               NaN           0.000000   \n",
       "date                  389            0               NaN           0.000000   \n",
       "market_price          389            0            2638.0         809.352185   \n",
       "product               389            0               NaN           0.000000   \n",
       "rating                389            0               5.0           4.071979   \n",
       "sales_last_week       389            0               6.0           1.483290   \n",
       "\n",
       "                 distribution/median  distribution/min  distribution/n  ...  \\\n",
       "column                                                                  ...   \n",
       "category                         NaN               NaN               0  ...   \n",
       "date                             NaN               NaN               0  ...   \n",
       "market_price                   495.0             215.0             389  ...   \n",
       "product                          NaN               NaN               0  ...   \n",
       "rating                           4.0               4.0             389  ...   \n",
       "sales_last_week                  1.0               1.0             389  ...   \n",
       "\n",
       "                 distribution/stddev  \\\n",
       "column                                 \n",
       "category                    0.000000   \n",
       "date                        0.000000   \n",
       "market_price              679.345870   \n",
       "product                     0.000000   \n",
       "rating                      0.258787   \n",
       "sales_last_week             1.170009   \n",
       "\n",
       "                                   frequent_items/frequent_strings  \\\n",
       "column                                                               \n",
       "category         [FrequentItem(value='Baby Care', est=389, uppe...   \n",
       "date             [FrequentItem(value='2022-08-12 00:00:00+00:00...   \n",
       "market_price                                                   NaN   \n",
       "product          [FrequentItem(value='Baby Powder', est=21, upp...   \n",
       "rating           [FrequentItem(value='4', est=361, upper=361, l...   \n",
       "sales_last_week  [FrequentItem(value='1', est=292, upper=292, l...   \n",
       "\n",
       "                               type  types/boolean  types/fractional  \\\n",
       "column                                                                 \n",
       "category         SummaryType.COLUMN              0                 0   \n",
       "date             SummaryType.COLUMN              0                 0   \n",
       "market_price     SummaryType.COLUMN              0               389   \n",
       "product          SummaryType.COLUMN              0                 0   \n",
       "rating           SummaryType.COLUMN              0                 0   \n",
       "sales_last_week  SummaryType.COLUMN              0                 0   \n",
       "\n",
       "                 types/integral  types/object  types/string  ints/max ints/min  \n",
       "column                                                                          \n",
       "category                      0             0           389       NaN      NaN  \n",
       "date                          0             0           389       NaN      NaN  \n",
       "market_price                  0             0             0       NaN      NaN  \n",
       "product                       0             0           389       NaN      NaN  \n",
       "rating                      389             0             0       5.0      4.0  \n",
       "sales_last_week             389             0             0       6.0      1.0  \n",
       "\n",
       "[6 rows x 28 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from whylogs.core.segmentation_partition import segment_on_column\n",
    "from whylogs.core.segmentation_partition import SegmentFilter\n",
    "import whylogs as why\n",
    "from whylogs.core.schema import DatasetSchema\n",
    "\n",
    "column_segments = segment_on_column(\"category\")\n",
    "column_segments['category'].filter = SegmentFilter(filter_function=lambda df: (df.market_price>200) & (df.rating > 3))\n",
    "\n",
    "results = why.log(df, schema=DatasetSchema(segments=column_segments))\n",
    "\n",
    "partition = results.partitions[0]\n",
    "segments = results.segments_in_partition(partition)\n",
    "\n",
    "first_segment = next(iter(segments))\n",
    "segmented_profile = results.profile(first_segment)\n",
    "\n",
    "print(\"Profile view for segment {}\".format(first_segment.key))\n",
    "segmented_profile.view().to_pandas()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that we now have a count of 389, whereas our first example had a count of 707. That's because now we're filtering the data to track only points that match our rule for high-quality, high-cost products."
   ]
  },
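  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the combinatorial argument above concrete, here is a minimal stand-alone sketch in plain Python (it does not call whylogs). It compares an upper bound on the number of segments when segmenting on two columns with the bound when segmenting on `category` alone and filtering rows. The 11 categories match this dataset; the placeholder category names and the 1-5 rating scale are assumptions made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import product\n",
    "\n",
    "# 11 categories, as in this dataset; the names are placeholders.\n",
    "categories = [f\"category_{i}\" for i in range(11)]\n",
    "# Assumed rating scale, for illustration only.\n",
    "ratings = [1, 2, 3, 4, 5]\n",
    "\n",
    "# Segmenting on both columns: at most one segment per (category, rating) pair.\n",
    "multi_column_segments = list(product(categories, ratings))\n",
    "\n",
    "# Segmenting on category and filtering rows: at most one segment per category.\n",
    "filtered_segments = [(category,) for category in categories]\n",
    "\n",
    "print(f\"Two-column segmentation: up to {len(multi_column_segments)} segments\")\n",
    "print(f\"Category segmentation with a row filter: up to {len(filtered_segments)} segments\")"
   ]
  },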
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing the Segments to Disk"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have the segmented results, you can use the results' `writer` method to write it to disk, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "directory = \"segmented_profiles\"\n",
    "if not os.path.exists(directory):\n",
    "    os.makedirs(directory)\n",
    "\n",
    "\n",
    "results.writer().option(base_dir=directory).write()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will write 11 binary profiles to the specified folder. Let's check with `listdir`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['profile_2022-09-13 13:47:12.595280_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Baby Care.bin',\n",
       " 'profile_2022-09-13 13:47:12.606867_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Bakery, Cakes and Dairy.bin',\n",
       " 'profile_2022-09-13 13:47:12.613083_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Beauty and Hygiene.bin',\n",
       " 'profile_2022-09-13 13:47:12.643941_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Beverages.bin',\n",
       " 'profile_2022-09-13 13:47:12.650850_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Cleaning and Household.bin',\n",
       " 'profile_2022-09-13 13:47:12.661408_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Eggs, Meat and Fish.bin',\n",
       " 'profile_2022-09-13 13:47:12.668325_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Foodgrains, Oil and Masala.bin',\n",
       " 'profile_2022-09-13 13:47:12.678308_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Fruits and Vegetables.bin',\n",
       " 'profile_2022-09-13 13:47:12.742280_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Gourmet and World Food.bin',\n",
       " 'profile_2022-09-13 13:47:12.786080_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Kitchen, Garden and Pets.bin',\n",
       " 'profile_2022-09-13 13:47:12.804480_8ff3ae39c46814563082fd6b3a9c0cfa922a8ef8dee5e685502485ed6482e4dcf7ecc3e4f7def5451c476c5b87485c1e0d9684c7ccf0f2cf3e2ba6106ec674d8_Snacks and Branded Foods.bin']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "os.listdir(directory)"
   ]
  },
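  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file names above appear to follow the pattern `profile_<timestamp>_<hash>_<segment>.bin`. Assuming that pattern holds (it is inferred from this listing, not from a documented whylogs guarantee) and that segment values contain no underscore, a small sketch like the following recovers the segment name from a file name. The `segment_from_filename` helper is hypothetical, and the hash in the sample name is shortened to a placeholder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "\n",
    "def segment_from_filename(filename):\n",
    "    \"\"\"Return the text after the last underscore, without the extension.\"\"\"\n",
    "    stem, _extension = os.path.splitext(filename)\n",
    "    return stem.rsplit(\"_\", 1)[-1]\n",
    "\n",
    "\n",
    "# Sample name modeled on the listing above; \"abc123\" stands in for the real hash.\n",
    "sample = \"profile_2022-09-13 13:47:12.595280_abc123_Baby Care.bin\"\n",
    "print(segment_from_filename(sample))"
   ]
  },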
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sending Segmented Profiles to WhyLabs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the whylogs Writer, you can write your profiles to different locations. If you have a WhyLabs account, you can easily send your segmented profiles to be monitored in your dashboard.\n",
    "\n",
    "We will show briefly how to do it in this example. If you want more details, please check the [WhyLabs Writer Example](../integrations/writers/Writing_to_WhyLabs.ipynb) (also available [in our documentation](https://whylogs.readthedocs.io/en/latest/examples/integrations/writers/Writing_to_WhyLabs.html))."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Provided you already have the required information and keys, let's first set our environment variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter your WhyLabs Org ID\n",
      "Enter your WhyLabs Dataset ID\n",
      "Enter your WhyLabs API key\n",
      "Using API Key ID:  ygG04qE3gQ\n"
     ]
    }
   ],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "# set your org-id here - should be something like \"org-xxxx\"\n",
    "print(\"Enter your WhyLabs Org ID\") \n",
    "os.environ[\"WHYLABS_DEFAULT_ORG_ID\"] = input()\n",
    "\n",
    "# set your datased_id (or model_id) here - should be something like \"model-xxxx\"\n",
    "print(\"Enter your WhyLabs Dataset ID\")\n",
    "os.environ[\"WHYLABS_DEFAULT_DATASET_ID\"] = input()\n",
    "\n",
    "\n",
    "# set your API key here\n",
    "print(\"Enter your WhyLabs API key\")\n",
    "os.environ[\"WHYLABS_API_KEY\"] = getpass.getpass()\n",
    "print(\"Using API Key ID: \", os.environ[\"WHYLABS_API_KEY\"][0:10])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, it's as simple as calling `writer(\"whylabs\")`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results.writer(\"whylabs\").write()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should be able to see your segments at your dashboard at the __segments__ tab:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![alt text](images/whylabs_segments.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.10 ('.venv': poetry)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "d39f874c9b8a97550ecbd783714b95e79c9b905449b34f44c40e3bf053b54b41"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}