python/examples/basic/Getting_Started.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> \n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Getting_Started.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"whylogs provides a standard to log any kind of data.\n",
"\n",
"Logging data with whylogs generates statistical summaries called *profiles*. These profiles can be used in a number of ways, such as:\n",
"\n",
"* Data Visualization\n",
"* Data Validation\n",
"* Tracking changes in your datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we'll explore the basics of logging data with whylogs:\n",
"\n",
"- Installing whylogs\n",
"- Profiling data\n",
"- Interacting with the profile\n",
"- Writing/Reading profiles to/from disk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"whylogs is made available as a Python package. You can get the latest version from PyPI with `pip install whylogs`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Minimum requirements:\n",
"\n",
"- Python 3.7 to 3.10\n",
"- Windows, Linux x86_64, or macOS 10+"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading a Pandas DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before showing how we can log data, we first need the data itself. Let's create a simple Pandas DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"data = {\n",
" \"animal\": [\"cat\", \"hawk\", \"snake\", \"cat\"],\n",
" \"legs\": [4, 2, 0, 4],\n",
" \"weight\": [4.3, 1.8, 1.3, 4.1],\n",
"}\n",
"\n",
"df = pd.DataFrame(data)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profiling with whylogs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To profile your data, call whylogs' `log` function, then retrieve the profile from the returned result with `profile()`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import whylogs as why\n",
"\n",
"results = why.log(df)\n",
"profile = results.profile()"
]
},
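{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the same call can also be written with explicit keyword arguments. This assumes the `pandas` and `row` keyword arguments of `why.log` as found in whylogs v1, and reuses the `df` created earlier in this notebook. The names `results_from_df`, `results_from_row`, and `row_profile` are only illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import whylogs as why\n",
"\n",
"# Assumption: `df` is the DataFrame created earlier in this notebook.\n",
"# Passing it by keyword is expected to behave like why.log(df).\n",
"results_from_df = why.log(pandas=df)\n",
"\n",
"# Assumption: a single record can be logged as a dictionary via `row`.\n",
"results_from_row = why.log(row={\"animal\": \"cat\", \"legs\": 4, \"weight\": 4.3})\n",
"row_profile = results_from_row.profile()"
]
},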
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzing Profiles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you're done logging the data, you can generate a `Profile View` and inspect it as a Pandas DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cardinality/est</th>\n",
" <th>cardinality/lower_1</th>\n",
" <th>cardinality/upper_1</th>\n",
" <th>counts/inf</th>\n",
" <th>counts/n</th>\n",
" <th>counts/nan</th>\n",
" <th>counts/null</th>\n",
" <th>distribution/max</th>\n",
" <th>distribution/mean</th>\n",
" <th>distribution/median</th>\n",
" <th>...</th>\n",
" <th>frequent_items/frequent_strings</th>\n",
" <th>type</th>\n",
" <th>types/boolean</th>\n",
" <th>types/fractional</th>\n",
" <th>types/integral</th>\n",
" <th>types/object</th>\n",
" <th>types/string</th>\n",
" <th>types/tensor</th>\n",
" <th>ints/max</th>\n",
" <th>ints/min</th>\n",
" </tr>\n",
" <tr>\n",
" <th>column</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>animal</th>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>3.00015</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>0.000</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>[FrequentItem(value='cat', est=2, upper=2, low...</td>\n",
" <td>SummaryType.COLUMN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>legs</th>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>3.00015</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4.0</td>\n",
" <td>2.500</td>\n",
" <td>4.0</td>\n",
" <td>...</td>\n",
" <td>[FrequentItem(value='4', est=2, upper=2, lower...</td>\n",
" <td>SummaryType.COLUMN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>weight</th>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>4.00020</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4.3</td>\n",
" <td>2.875</td>\n",
" <td>4.1</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>SummaryType.COLUMN</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 31 columns</p>\n",
"</div>"
],
"text/plain": [
" cardinality/est cardinality/lower_1 cardinality/upper_1 counts/inf \\\n",
"column \n",
"animal 3.0 3.0 3.00015 0 \n",
"legs 3.0 3.0 3.00015 0 \n",
"weight 4.0 4.0 4.00020 0 \n",
"\n",
" counts/n counts/nan counts/null distribution/max \\\n",
"column \n",
"animal 4 0 0 NaN \n",
"legs 4 0 0 4.0 \n",
"weight 4 0 0 4.3 \n",
"\n",
" distribution/mean distribution/median ... \\\n",
"column ... \n",
"animal 0.000 NaN ... \n",
"legs 2.500 4.0 ... \n",
"weight 2.875 4.1 ... \n",
"\n",
" frequent_items/frequent_strings type \\\n",
"column \n",
"animal [FrequentItem(value='cat', est=2, upper=2, low... SummaryType.COLUMN \n",
"legs [FrequentItem(value='4', est=2, upper=2, lower... SummaryType.COLUMN \n",
"weight NaN SummaryType.COLUMN \n",
"\n",
" types/boolean types/fractional types/integral types/object \\\n",
"column \n",
"animal 0 0 0 0 \n",
"legs 0 0 4 0 \n",
"weight 0 4 0 0 \n",
"\n",
" types/string types/tensor ints/max ints/min \n",
"column \n",
"animal 4 0 NaN NaN \n",
"legs 0 0 4.0 0.0 \n",
"weight 0 0 NaN NaN \n",
"\n",
"[3 rows x 31 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prof_view = profile.view()\n",
"prof_df = prof_view.to_pandas()\n",
"\n",
"prof_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting DataFrame contains statistics for each column (feature), such as:\n",
"\n",
"- Counters, such as number of samples and null values\n",
"- Inferred types, such as integral, fractional and boolean\n",
"- Estimated Cardinality\n",
"- Frequent Items\n",
"- Distribution Metrics: min, max, median, and quantile values"
]
},
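{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the profile view is converted to a regular pandas DataFrame, standard pandas indexing can be used to focus on a few metrics. The sketch below assumes `prof_df` from the cell above and uses only column names visible in that output. The name `selected_metrics` is illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Metric columns taken from the profile DataFrame shown above.\n",
"selected_metrics = [\n",
"    \"counts/n\",\n",
"    \"counts/null\",\n",
"    \"cardinality/est\",\n",
"    \"distribution/max\",\n",
"    \"distribution/mean\",\n",
"]\n",
"\n",
"# One row per logged column (animal, legs, weight), restricted to the chosen metrics.\n",
"prof_df[selected_metrics]"
]
},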
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing to Disk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also store your profile on disk for later inspection:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"why.write(profile, \"profile.bin\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create a binary profile file on your local filesystem."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading from Disk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can read the profile back into memory with:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"n_prof = why.read(\"profile.bin\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note: `write` expects a profile as a parameter, while `read` returns a `Profile View`. This means you can use the loaded profile for visualization and merging, but not for further tracking or updates."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's Next?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a lot you can do with the profiles you just created. Keep getting your hands dirty with the following examples!"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"- Basic\n",
" - [Visualizing Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Notebook_Profile_Visualizer.html) - Compare profiles to detect distribution shifts, visualize histograms and bar charts and explore your data\n",
" - [Logging Data](https://whylogs.readthedocs.io/en/stable/examples/basic/Logging_Different_Data.html) - See the different ways you can log your data with whylogs\n",
" - [Inspecting Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Inspecting_Profiles.html) - A deeper dive on the metrics generated by whylogs\n",
" - [Schema Configuration for Tracking Metrics](https://whylogs.readthedocs.io/en/stable/examples/basic/Schema_Configuration.html) - Configure tracking metrics according to data type or column features\n",
" - [Data Constraints](https://whylogs.readthedocs.io/en/stable/examples/advanced/Metric_Constraints.html) - Set constraints on your data to ensure its quality\n",
" - [Merging Profiles](https://whylogs.readthedocs.io/en/stable/examples/basic/Merging_Profiles.html) - Merge your profiles logged across different computing instances, time periods or data segments\n",
"- Integrations\n",
" - [WhyLabs](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_to_WhyLabs.html) - Monitor your profiles continuously with the WhyLabs Observability Platform\n",
" - [PySpark](https://whylogs.readthedocs.io/en/stable/examples/integrations/Pyspark_Profiling.html) - Use whylogs with PySpark\n",
" - [Writing Profiles](https://whylogs.readthedocs.io/en/stable/examples/integrations/writers/Writing_Profiles.html) - See different ways and locations to output your profiles\n",
" - [Flask](https://whylogs.readthedocs.io/en/stable/examples/integrations/flask_streaming/flask_with_whylogs.html) - See how you can create a Flask app with whylogs and WhyLabs integration\n",
" - [Feature Stores](https://whylogs.readthedocs.io/en/stable/examples/integrations/Feature_Stores_and_whylogs.html) - Learn how to log features from your Feature Store with feast and whylogs\n",
" - [BigQuery](https://whylogs.readthedocs.io/en/stable/examples/integrations/BigQuery_Example.html) - Profile data queried from a Google BigQuery table\n",
" - [MLflow](https://whylogs.readthedocs.io/en/stable/examples/integrations/Mlflow_Logging.html) - Log your whylogs profiles to an MLflow environment\n",
"\n",
"Or go to the [examples page](https://whylogs.readthedocs.io/en/stable/examples.html) for the complete list of examples!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "5dd5901cadfd4b29c2aaf95ecd29c0c3b10829ad94dcfe59437dbee391154aea"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}