python/examples/experimental/Logging_with_Debug_Events.ipynb from whylabs/whylogs-python

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview of `debug_events`\n",
    "\n",
    "The new `debug_event` parameter can be passed to `why.log` to attach debug information that is stored as JSON in WhyLabs and can be correlated with whylogs profiles through a common `trace_id` value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test Data\n",
    "Consider a scenario where you want to record additional information about your dataset, and you also have private data stored separately in your own systems, such as a database or filesystem. The raw private data should not be stored in the dataset profile, but you still need a way to trace back to a specific record when debugging alerts or constraint failures related to that data. With debug events you can store a small, arbitrary JSON payload alongside dataset profiles, and correlate it with profiles or other external data through a shared, user-supplied `trace_id`. In practice this should be something like a UUID, URL, or other unique string that you can later use to look up information stored in your environment.\n",
    "\n",
    "The following toy example profiles a single test message with whylogs and attaches a debug event to it. You can additionally supply segment key-value pairs to help partition the debug data, and tags (for example, your environment name) to make searching and filtering easier when debugging."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: you may need to restart the kernel to use updated packages.\n",
    "%pip install 'whylogs[embeddings,viz,image]'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running example with explicitly specified trace_id: d992c9fa-ac0d-4b12-8da8-fb4a534f984c\n"
     ]
    }
   ],
   "source": [
    "from uuid import uuid4\n",
    "import random\n",
    "import whylogs as why\n",
    "from whylogs.core.segmentation_partition import segment_on_column\n",
    "from whylogs.core.schema import DatasetSchema\n",
    "\n",
    "test_message = {\"col1\": \"Green\", \"col2\": 0.007}\n",
    "\n",
    "trace_id = str(uuid4())\n",
    "print(f\"Running example with explicitly specified trace_id: {trace_id}\")\n",
    "\n",
    "test_debug_event = {\n",
    "    \"custom_field\": 1,\n",
    "    \"custom_nested_field\": {\n",
    "        \"subA\": random.random(),\n",
    "        \"subB\": random.random() + random.random()\n",
    "    },\n",
    "    \"debug_notes\": \"Sometimes you might want to record a longer string and not lose the full value such as this verbose example.\"\n",
    "}\n",
    "\n",
    "debug_tags = [\"dev\", \"demo_test\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up the API key, dataset_id, and org_id before logging debug events!\n",
    "The `debug_event` parameter can be passed to `why.log` along with the message to be profiled. When a `debug_event` dictionary is supplied, that data is sent to WhyLabs, so the environment variables below must be defined to specify where it should be logged.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "# Replace the empty strings on the right-hand side of the os.environ lines with your information,\n",
    "# or, if these env variables are already defined, comment out the next three lines.\n",
    "os.environ[\"WHYLABS_DEFAULT_DATASET_ID\"] = \"\"\n",
    "os.environ[\"WHYLABS_DEFAULT_ORG_ID\"] = \"\"\n",
    "os.environ[\"WHYLABS_API_KEY\"] = \"\"\n",
    "\n",
    "# The call to why.log returns the results for test_message only,\n",
    "# so this profile is a single-message profile. The WhyLabs-side configuration\n",
    "# determines whether this profile is preserved for individual profile download.\n",
    "results = why.log(\n",
    "    test_message,\n",
    "    trace_id=trace_id,\n",
    "    tags=debug_tags,\n",
    "    debug_event=test_debug_event\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First the debug event is written\n",
    "Next, we still need to write the results, which contain the statistical profile (summarization) of `test_message`. A pandas DataFrame can also be passed as the first parameter to `why.log`.\n",
    "\n",
    "Let's look at what the results contain before uploading them to WhyLabs with a write call. Note that the `trace_id` we specified earlier matches the `whylabs.traceId` field in the metadata. We can later use this value to query for both this profile and the debug event."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'whylabs.traceId': 'd992c9fa-ac0d-4b12-8da8-fb4a534f984c',\n",
       " 'whylogs.creationTimestamp': '1695414037772',\n",
       " 'whylogs.user.tags': '[\"demo_test\", \"dev\"]'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that, by design, the metadata above is not the full `debug_event`. The profile results metadata holds only the data needed to correlate this profile with debug events, if there are any. All metadata keys and values must be strings; non-string values such as timestamps are converted to strings.\n",
    "\n",
    "Also note that the optional `segment_key_values` parameter is stored in a different part of the results so that it can be used to partition the data on the platform side."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write results to WhyLabs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(True, 'log-6tpJJelNhOgPdJsp')]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.writer(\"whylabs\").write()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}