python/examples/basic/Getting_Started.ipynb from whylabs/whylogs-python

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> \n",
    ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started) to leverage the power of whylogs and WhyLabs together!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Getting_Started.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "whylogs provides a standard to log any kind of data.\n",
    "\n",
    "This notebook shows how to log data with whylogs, generating statistical summaries called *profiles*. These profiles can be used in a number of ways, such as:\n",
    "\n",
    "* Data Visualization\n",
    "* Data Validation\n",
    "* Tracking changes in your datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we'll explore the basics of logging data with whylogs:\n",
    "\n",
    "- Installing whylogs\n",
    "- Profiling data\n",
    "- Interacting with the profile\n",
    "- Writing/Reading profiles to/from disk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing whylogs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: you may need to restart the kernel to use updated packages.\n",
    "%pip install whylogs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Minimal requirements:\n",
    "\n",
    "- Python 3.7 through 3.10\n",
    "- Windows, Linux x86_64, and macOS 10+"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading a Pandas DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before showing how we can log data, we first need the data itself. Let's create a simple Pandas DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "data = {\n",
    "    \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n",
    "    \"legs\": [4, 2, 0, 4],\n",
    "    \"weight\": [4.3, 1.8, 1.3, 4.1],\n",
    "}\n",
    "\n",
    "df = pd.DataFrame(data)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Profiling with whylogs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To obtain a profile of your data, you can simply use whylogs' `log` call, and navigate through the result to a specific profile with `profile()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import whylogs as why\n",
    "\n",
    "results = why.log(df)\n",
    "profile = results.profile()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyzing Profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you're done logging the data, you can generate a `Profile View` and inspect it as a pandas DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cardinality/est</th>\n",
       "      <th>cardinality/lower_1</th>\n",
       "      <th>cardinality/upper_1</th>\n",
       "      <th>counts/inf</th>\n",
       "      <th>counts/n</th>\n",
       "      <th>counts/nan</th>\n",
       "      <th>counts/null</th>\n",
       "      <th>distribution/max</th>\n",
       "      <th>distribution/mean</th>\n",
       "      <th>distribution/median</th>\n",
       "      <th>...</th>\n",
       "      <th>frequent_items/frequent_strings</th>\n",
       "      <th>type</th>\n",
       "      <th>types/boolean</th>\n",
       "      <th>types/fractional</th>\n",
       "      <th>types/integral</th>\n",
       "      <th>types/object</th>\n",
       "      <th>types/string</th>\n",
       "      <th>types/tensor</th>\n",
       "      <th>ints/max</th>\n",
       "      <th>ints/min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>column</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>animal</th>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.00015</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>[FrequentItem(value='cat', est=2, upper=2, low...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>legs</th>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.00015</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2.500</td>\n",
       "      <td>4.0</td>\n",
       "      <td>...</td>\n",
       "      <td>[FrequentItem(value='4', est=2, upper=2, lower...</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>weight</th>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.00020</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4.3</td>\n",
       "      <td>2.875</td>\n",
       "      <td>4.1</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SummaryType.COLUMN</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 31 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        cardinality/est  cardinality/lower_1  cardinality/upper_1  counts/inf  \\\n",
       "column                                                                          \n",
       "animal              3.0                  3.0              3.00015           0   \n",
       "legs                3.0                  3.0              3.00015           0   \n",
       "weight              4.0                  4.0              4.00020           0   \n",
       "\n",
       "        counts/n  counts/nan  counts/null  distribution/max  \\\n",
       "column                                                        \n",
       "animal         4           0            0               NaN   \n",
       "legs           4           0            0               4.0   \n",
       "weight         4           0            0               4.3   \n",
       "\n",
       "        distribution/mean  distribution/median  ...  \\\n",
       "column                                          ...   \n",
       "animal              0.000                  NaN  ...   \n",
       "legs                2.500                  4.0  ...   \n",
       "weight              2.875                  4.1  ...   \n",
       "\n",
       "                          frequent_items/frequent_strings                type  \\\n",
       "column                                                                          \n",
       "animal  [FrequentItem(value='cat', est=2, upper=2, low...  SummaryType.COLUMN   \n",
       "legs    [FrequentItem(value='4', est=2, upper=2, lower...  SummaryType.COLUMN   \n",
       "weight                                                NaN  SummaryType.COLUMN   \n",
       "\n",
       "        types/boolean  types/fractional  types/integral  types/object  \\\n",
       "column                                                                  \n",
       "animal              0                 0               0             0   \n",
       "legs                0                 0               4             0   \n",
       "weight              0                 4               0             0   \n",
       "\n",
       "        types/string  types/tensor  ints/max  ints/min  \n",
       "column                                                  \n",
       "animal             4             0       NaN       NaN  \n",
       "legs               0             0       4.0       0.0  \n",
       "weight             0             0       NaN       NaN  \n",
       "\n",
       "[3 rows x 31 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prof_view = profile.view()\n",
    "prof_df = prof_view.to_pandas()\n",
    "\n",
    "prof_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will provide you with valuable statistics on a column (feature) basis, such as:\n",
    "\n",
    "- Counters, such as number of samples and null values\n",
    "- Inferred types, such as integral, fractional and boolean\n",
    "- Estimated Cardinality\n",
    "- Frequent Items\n",
    "- Distribution Metrics: min, max, median, quantile values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing to Disk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also store your profile on disk for further inspection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "why.write(profile, \"profile.bin\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will create a binary profile file on your local filesystem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading from Disk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can read the profile back into memory with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_prof = why.read(\"profile.bin\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Note: `write` expects a profile as a parameter, while `read` returns a `Profile View`. This means you can use the loaded profile for visualization and merging, but not for further tracking and updates."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's Next?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's a lot you can do with the profiles you just created. Keep getting your hands dirty with the following examples!"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Basic\n",
    "    - [Visualizing Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Notebook_Profile_Visualizer.html) - Compare profiles to detect distribution shifts, visualize histograms and bar charts, and explore your data\n",
    "    - [Logging Data](https://whylogs.readthedocs.io/en/stable/examples/basic/Logging_Different_Data.html) - See the different ways you can log your data with whylogs\n",
    "    - [Inspecting Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Inspecting_Profiles.html) - A deeper dive on the metrics generated by whylogs\n",
    "    - [Schema Configuration for Tracking Metrics](https://whylogs.readthedocs.io/en/stable/examples/basic/Schema_Configuration.html) - Configure tracking metrics according to data type or column features\n",
    "    - [Data Constraints](https://whylogs.readthedocs.io/en/stable/examples/advanced/Metric_Constraints.html) - Set constraints on your data to ensure its quality\n",
    "    - [Merging Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Merging_Profiles.html) - Merge your profiles logged across different computing instances, time periods or data segments\n",
    "- Integrations\n",
    "    - [WhyLabs](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_to_WhyLabs.html) - Monitor your profiles continuously with the WhyLabs Observability Platform\n",
    "    - [PySpark](https://whylogs.readthedocs.io/en/stable/examples/integrations/Pyspark_Profiling.html) - Use whylogs with PySpark\n",
    "    - [Writing Profiles](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_Profiles.html) - See different ways and locations to output your profiles\n",
    "    - [Flask](https://whylogs.readthedocs.io/en/stable/examples/integrations/flask_streaming/flask_with_whylogs.html) - See how you can create a Flask app with whylogs and WhyLabs integration\n",
    "    - [Feature Stores](https://whylogs.readthedocs.io/en/stable/examples/integrations/Feature_Stores_and_whylogs.html) - Learn how to log features from your Feature Store with feast and whylogs\n",
    "    - [BigQuery](https://whylogs.readthedocs.io/en/stable/examples/integrations/BigQuery_Example.html) - Profile data queried from a Google BigQuery table\n",
    "    - [MLflow](https://whylogs.readthedocs.io/en/stable/examples/integrations/Mlflow_Logging.html) - Log your whylogs profiles to an MLflow environment\n",
    "\n",
    "Or go to the [examples page](https://whylogs.readthedocs.io/en/stable/examples.html) for the complete list of examples!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
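
To make the per-column statistics from the "Analyzing Profiles" section more concrete, here is a minimal plain-Python sketch that computes comparable values for the same toy data. It is an illustration only, not how whylogs works internally: the helper name `summarize_column` is hypothetical, the dictionary keys merely mimic the column names in the notebook output, and every value is computed exactly, whereas the notebook output reports cardinality as an estimate with lower and upper bounds.

```python
from statistics import mean, median


def summarize_column(values):
    """Exact per-column summary (hypothetical helper, not part of whylogs)."""
    non_null = [v for v in values if v is not None]

    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool)

    summary = {
        "counts/n": len(values),
        "counts/null": len(values) - len(non_null),
        "types/integral": sum(is_int(v) for v in non_null),
        "types/fractional": sum(isinstance(v, float) for v in non_null),
        "types/string": sum(isinstance(v, str) for v in non_null),
        # Exact distinct count; the notebook output shows an estimate instead.
        "cardinality": len(set(non_null)),
    }

    # Distribution metrics only make sense for numeric values.
    numeric = [v for v in non_null if is_int(v) or isinstance(v, float)]
    if numeric:
        summary["distribution/min"] = min(numeric)
        summary["distribution/max"] = max(numeric)
        summary["distribution/mean"] = mean(numeric)
        summary["distribution/median"] = median(numeric)
    return summary


# Same toy data as the DataFrame created in the notebook.
data = {
    "animal": ["cat", "hawk", "snake", "cat"],
    "legs": [4, 2, 0, 4],
    "weight": [4.3, 1.8, 1.3, 4.1],
}

summaries = {name: summarize_column(column) for name, column in data.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Counts, type counts, cardinality, max and mean agree with the table in the notebook output for this small dataset. The exact median of `legs` is 3.0 here, while the notebook output shows 4.0, so the profile's median is probably best read as an approximate quantile rather than an exact statistic.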