examples/docs/parameter_optimization.ipynb from zincware/ZnTrack

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parameter Optimization with Optuna\n",
    "\n",
    "In this example, we train a regression model (RandomForest or LinearSVR) and optimize its hyperparameters using [Optuna](https://optuna.readthedocs.io/en/stable/).\n",
    "It is adapted from the Optuna [Basic Concept example](https://optuna.readthedocs.io/en/stable/#basic-concepts).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initialized empty Git repository in /tmp/tmpp4i3ht48/.git/\n",
      "Initialized DVC repository.\n",
      "\n",
      "You can now commit the changes to git.\n",
      "\n",
      "+---------------------------------------------------------------------+\n",
      "|                                                                     |\n",
      "|        DVC has enabled anonymous aggregate usage analytics.         |\n",
      "|     Read the analytics documentation (and how to opt-out) here:     |\n",
      "|             <https://dvc.org/doc/user-guide/analytics>              |\n",
      "|                                                                     |\n",
      "+---------------------------------------------------------------------+\n",
      "\n",
      "What's next?\n",
      "------------\n",
      "- Check out the documentation: <https://dvc.org/doc>\n",
      "- Get help and share ideas: <https://dvc.org/chat>\n",
      "- Star us on GitHub: <https://github.com/iterative/dvc>\n"
     ]
    }
   ],
   "source": [
    "# Setup temporary directory and initialize git and dvc\n",
    "from zntrack import config\n",
    "\n",
    "config.nb_name = \"parameter_optimization.ipynb\"\n",
    "\n",
    "from zntrack.utils import cwd_temp_dir\n",
    "\n",
    "temp_dir = cwd_temp_dir()\n",
    "\n",
    "!git init\n",
    "!dvc init"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Workflow\n",
    "Our Workflow consists of multiple steps:\n",
    "- Download the dataset\n",
    "- Split into train / test data\n",
    "- Train a RandomForest model on the train data\n",
    "- Evaluate the model on the test data\n",
    "\n",
    "We want to compare two different models, RandomForest and LinearSVR, each with its own hyperparameters.\n",
    "The `Evaluate` Node computes the mean squared error (MSE) on the test data, which serves as the objective value for Optuna.\n",
    "We will use DVC [Experiments](https://dvc.org/doc/start/experiments) to track each run.\n",
    "In combination with Optuna, this allows us not only to optimize the parameters but also easily store and access the trained models afterwards.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![](https://mermaid.ink/img/pako:eNp1j7sOgkAQRX-FTC0FYEVhhYmNFXSuxQQG2GQfZJnVGMK_OzFKaKxmcu6981ig9R1BCb3xz3bEwElTKafYCc5uCs4PNBGZFNx_OBd88XHWbqiQsSbeiYWIV6lmx47CmoDaNTRzPRm9D-RpevqYtglfkG3xf6BQDg5gKVjUnfywKJckCngkK_eW0nbUYzSyTrlVrBjZ1y_XQskh0gHi1MlrlcYhoIWyRzPT-gaiDmCv?type=png)](https://mermaid.live/edit#pako:eNp1j7sOgkAQRX-FTC0FYEVhhYmNFXSuxQQG2GQfZJnVGMK_OzFKaKxmcu6981ig9R1BCb3xz3bEwElTKafYCc5uCs4PNBGZFNx_OBd88XHWbqiQsSbeiYWIV6lmx47CmoDaNTRzPRm9D-RpevqYtglfkG3xf6BQDg5gKVjUnfywKJckCngkK_eW0nbUYzSyTrlVrBjZ1y_XQskh0gHi1MlrlcYhoIWyRzPT-gaiDmCv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import optuna, sklearn, zntrack\n",
    "import sklearn.datasets\n",
    "import sklearn.ensemble\n",
    "import sklearn.model_selection\n",
    "import sklearn.metrics\n",
    "import sklearn.svm\n",
    "\n",
    "\n",
    "class HousingDataSet(zntrack.Node):\n",
    "    \"\"\"Download and prepare the California housing dataset.\"\"\"\n",
    "\n",
    "    data = zntrack.dvc.outs(\"scikit_learn_data\")\n",
    "\n",
    "    def run(self) -> None:\n",
    "        _ = sklearn.datasets.fetch_california_housing(\n",
    "            data_home=self.data, return_X_y=True\n",
    "        )\n",
    "\n",
    "    @property\n",
    "    def labels(self) -> dict:\n",
    "        _, labels = sklearn.datasets.fetch_california_housing(\n",
    "            data_home=self.data, return_X_y=True\n",
    "        )\n",
    "        return labels\n",
    "\n",
    "    @property\n",
    "    def features(self) -> dict:\n",
    "        features, _ = sklearn.datasets.fetch_california_housing(\n",
    "            data_home=self.data, return_X_y=True\n",
    "        )\n",
    "        return features\n",
    "\n",
    "\n",
    "class TrainTestSplit(zntrack.Node):\n",
    "    \"\"\"Split the dataset into train and test sets.\"\"\"\n",
    "\n",
    "    labels = zntrack.zn.deps()\n",
    "    features = zntrack.zn.deps()\n",
    "    seed = zntrack.zn.params(1234)\n",
    "\n",
    "    train_features = zntrack.zn.outs()\n",
    "    test_features = zntrack.zn.outs()\n",
    "    train_labels = zntrack.zn.outs()\n",
    "    test_labels = zntrack.zn.outs()\n",
    "\n",
    "    def run(self) -> None:\n",
    "        self.train_features, self.test_features, self.train_labels, self.test_labels = (\n",
    "            sklearn.model_selection.train_test_split(\n",
    "                self.features, self.labels, test_size=0.2, random_state=self.seed\n",
    "            )\n",
    "        )\n",
    "\n",
    "\n",
    "class RandomForest(zntrack.Node):\n",
    "    \"\"\"Train a random forest model.\"\"\"\n",
    "\n",
    "    train_features = zntrack.zn.deps()\n",
    "    train_labels = zntrack.zn.deps()\n",
    "    seed = zntrack.zn.params(1234)\n",
    "    max_depth = zntrack.zn.params()\n",
    "\n",
    "    model = zntrack.zn.outs()\n",
    "\n",
    "    def run(self) -> None:\n",
    "        self.model = sklearn.ensemble.RandomForestRegressor(\n",
    "            random_state=self.seed, max_depth=self.max_depth\n",
    "        )\n",
    "        self.model.fit(self.train_features, self.train_labels)\n",
    "\n",
    "\n",
    "class LinearSVR(zntrack.Node):\n",
    "    \"\"\"Train a SVR model.\"\"\"\n",
    "\n",
    "    train_features = zntrack.zn.deps()\n",
    "    train_labels = zntrack.zn.deps()\n",
    "    C = zntrack.zn.params()\n",
    "\n",
    "    model = zntrack.zn.outs()\n",
    "\n",
    "    def run(self) -> None:\n",
    "        self.model = sklearn.svm.LinearSVR(C=self.C)\n",
    "        self.model.fit(self.train_features, self.train_labels)\n",
    "\n",
    "\n",
    "class Evaluate(zntrack.Node):\n",
    "    \"\"\"Evaluate the model on a test set.\"\"\"\n",
    "\n",
    "    model = zntrack.zn.deps()\n",
    "    test_features = zntrack.zn.deps()\n",
    "    test_labels = zntrack.zn.deps()\n",
    "\n",
    "    score = zntrack.zn.metrics()\n",
    "\n",
    "    def run(self) -> None:\n",
    "        prediction = self.model.predict(self.test_features)\n",
    "        self.score = sklearn.metrics.mean_squared_error(self.test_labels, prediction)"
   ]
  },
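  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Evaluate` node above stores the value returned by `sklearn.metrics.mean_squared_error`, which with its default arguments is the mean squared error (MSE).\n",
    "If a root mean squared error (RMSE) is preferred, it is the square root of that value.\n",
    "The following is a minimal, dependency-free sketch of the relationship; the helper names `mse` and `rmse` are illustrative and not part of ZnTrack or scikit-learn.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def mse(y_true, y_pred):\n",
    "    # mean of the squared residuals\n",
    "    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)\n",
    "\n",
    "\n",
    "def rmse(y_true, y_pred):\n",
    "    # RMSE is the square root of the MSE and has the same unit as the labels\n",
    "    return math.sqrt(mse(y_true, y_pred))\n",
    "\n",
    "\n",
    "print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))\n",
    "```\n",
    "\n",
    "Taking the square root preserves the ordering of the scores, so the choice does not change which trial Optuna ranks first; it only changes the scale of the reported value."
   ]
  },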
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the `zntrack.Project` to create our workflow as usual.\n",
    "To use DVC Experiments, we need to create an initial commit.\n",
    "Therefore, we run the project directly and make an initial git commit afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name HousingDataSet --force ...'\n",
      "Jupyter support is an experimental feature! Please save your notebook before running this command!\n",
      "Submit issues to https://github.com/zincware/ZnTrack.\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name TrainTestSplit --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name model --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name Evaluate --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'repro'\n"
     ]
    }
   ],
   "source": [
    "with zntrack.Project() as project:\n",
    "    data = HousingDataSet()\n",
    "    split = TrainTestSplit(labels=data.labels, features=data.features)\n",
    "    model = RandomForest(\n",
    "        train_features=split.train_features,\n",
    "        train_labels=split.train_labels,\n",
    "        max_depth=2,\n",
    "        name=\"model\",\n",
    "    )\n",
    "    evaluate = Evaluate(\n",
    "        model=model.model,\n",
    "        test_features=split.test_features,\n",
    "        test_labels=split.test_labels,\n",
    "    )\n",
    "\n",
    "project.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "data": {
      "text/plain": [
       "NodeStatus(loaded=True, results=<NodeStatusResults.AVAILABLE: 5>, remote=None, rev=None)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RandomForest.from_rev(name=\"model\").state"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[main (root-commit) 6b7996b] initial commit\n",
      " 24 files changed, 1580 insertions(+)\n",
      " create mode 100644 .dvc/.gitignore\n",
      " create mode 100644 .dvc/config\n",
      " create mode 100644 .dvcignore\n",
      " create mode 100644 .gitignore\n",
      " create mode 100644 dvc.lock\n",
      " create mode 100644 dvc.yaml\n",
      " create mode 100644 nodes/Evaluate/node-meta.json\n",
      " create mode 100644 nodes/Evaluate/score.json\n",
      " create mode 100644 nodes/HousingDataSet/node-meta.json\n",
      " create mode 100644 nodes/TrainTestSplit/.gitignore\n",
      " create mode 100644 nodes/TrainTestSplit/node-meta.json\n",
      " create mode 100644 nodes/model/.gitignore\n",
      " create mode 100644 nodes/model/node-meta.json\n",
      " create mode 100644 parameter_optimization.ipynb\n",
      " create mode 100644 params.yaml\n",
      " create mode 100644 src/Evaluate.py\n",
      " create mode 100644 src/HousingDataSet.py\n",
      " create mode 100644 src/RandomForest.py\n",
      " create mode 100644 src/TrainTestSplit.py\n",
      " create mode 100644 src/__pycache__/Evaluate.cpython-310.pyc\n",
      " create mode 100644 src/__pycache__/HousingDataSet.cpython-310.pyc\n",
      " create mode 100644 src/__pycache__/RandomForest.cpython-310.pyc\n",
      " create mode 100644 src/__pycache__/TrainTestSplit.cpython-310.pyc\n",
      " create mode 100644 zntrack.json\n"
     ]
    }
   ],
   "source": [
    "!git add .\n",
    "\n",
    "!git commit -m \"initial commit\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimize\n",
    "\n",
    "For Optuna, we need to define an objective function to optimize.\n",
    "We use the `project.create_experiment` API from ZnTrack to change the model and its parameters, and return the score from the `Evaluate` stage as the final metric.\n",
    "To identify the experiments later, we name them after the `trial.number` from Optuna."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2023-07-26 15:58:10,744] A new study created in memory with name: no-name-85a8203d-fed8-45ba-99ba-3adcda3a06c0\n",
      "Running DVC command: 'stage add --name HousingDataSet --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name TrainTestSplit --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name model --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name Evaluate --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'exp apply exp-0'\n",
      "\u0000[I 2023-07-26 15:58:22,034] Trial 0 finished with value: 0.8952389211454506 and parameters: {'classifier': 'SVR', 'svr_c': 1.9996547699912692e-05}. Best is trial 0 with value: 0.8952389211454506.\n",
      "Running DVC command: 'stage add --name HousingDataSet --force ...'\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name TrainTestSplit --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name model --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name Evaluate --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'exp apply exp-1'\n",
      "\u0000[I 2023-07-26 15:58:42,694] Trial 1 finished with value: 0.2627596918267919 and parameters: {'classifier': 'RandomForest', 'max_depth': 27}. Best is trial 0 with value: 0.8952389211454506.\n",
      "Running DVC command: 'stage add --name HousingDataSet --force ...'\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name TrainTestSplit --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name model --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'stage add --name Evaluate --force ...'\n",
      "\u0000"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Running DVC command: 'exp apply exp-2'\n",
      "\u0000[I 2023-07-26 15:58:54,242] Trial 2 finished with value: 1.5650217116478067 and parameters: {'classifier': 'SVR', 'svr_c': 171896877.50579312}. Best is trial 2 with value: 1.5650217116478067.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u0000"
     ]
    }
   ],
   "source": [
    "def objective(trial):\n",
    "    with project.create_experiment(queue=False, name=f\"exp-{trial.number}\") as exp:\n",
    "        regressor_name = trial.suggest_categorical(\"classifier\", [\"SVR\", \"RandomForest\"])\n",
    "\n",
    "        # we need to replace the existing model on the graph with a new model.\n",
    "\n",
    "        project.remove(\"model\")\n",
    "\n",
    "        if regressor_name == \"SVR\":\n",
    "            svr_c = trial.suggest_float(\"svr_c\", 1e-10, 1e10, log=True)\n",
    "            model = LinearSVR(\n",
    "                train_features=split.train_features,\n",
    "                train_labels=split.train_labels,\n",
    "                C=svr_c,\n",
    "                name=\"model\",\n",
    "            )\n",
    "        else:\n",
    "            max_depth = trial.suggest_int(\"max_depth\", 2, 32)\n",
    "            model = RandomForest(\n",
    "                train_features=split.train_features,\n",
    "                train_labels=split.train_labels,\n",
    "                max_depth=max_depth,\n",
    "                name=\"model\",\n",
    "            )\n",
    "\n",
    "        # need to let the evaluate node know which model to evaluate\n",
    "        evaluate.model = model.model\n",
    "\n",
    "    return exp[evaluate].score\n",
    "\n",
    "\n",
    "study = optuna.create_study(\n",
    "    direction=\"maximize\", sampler=optuna.samplers.TPESampler(seed=314)\n",
    ")\n",
    "study.optimize(objective, n_trials=3)"
   ]
  },
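  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `direction` argument of `optuna.create_study` determines whether `study.best_trial` is the trial with the smallest or the largest objective value.\n",
    "As a minimal sketch that is independent of the ZnTrack project above, the following toy study minimizes a quadratic; the function `toy_objective` and the search range are illustrative assumptions.\n",
    "\n",
    "```python\n",
    "import optuna\n",
    "\n",
    "\n",
    "def toy_objective(trial):\n",
    "    # hypothetical objective with its minimum at x = 2\n",
    "    x = trial.suggest_float('x', -10.0, 10.0)\n",
    "    return (x - 2.0) ** 2\n",
    "\n",
    "\n",
    "toy_study = optuna.create_study(\n",
    "    direction='minimize', sampler=optuna.samplers.TPESampler(seed=314)\n",
    ")\n",
    "toy_study.optimize(toy_objective, n_trials=20)\n",
    "\n",
    "# with direction='minimize', the best value is the smallest value over all trials\n",
    "assert toy_study.best_value == min(t.value for t in toy_study.trials)\n",
    "```\n",
    "\n",
    "With `direction='maximize'`, the same check would hold with `max` instead of `min`."
   ]
  },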
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate\n",
    "\n",
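    "We can now inspect the best parameters via `study.best_params`.\n",
    "Note that the study above was created with `direction=\"maximize\"`, so the trial reported as best is the one with the highest error; for an error metric, `direction=\"minimize\"` is usually the appropriate choice.\n",
    "Additionally, because we used DVC experiments, we can directly access the experiment with the best parameters by the name we gave it."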
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'classifier': 'SVR', 'svr_c': 171896877.50579312}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "study.best_params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['exp-2', 'exp-1', 'exp-0'])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "project.experiments.keys()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can load the Node either via the experiment or by its name using `zntrack.from_rev()`.\n",
    "The node should not be loaded via `model.load()`, because the `model` instance could be a `RandomForest` while the best model is a `LinearSVR`, or *vice versa*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "exp = project.experiments[f\"exp-{study.best_trial.number}\"]\n",
    "best_model = exp[\"model\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'exp-2'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f\"exp-{study.best_trial.number}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best score: 1.565 compared to initial score: 0.750\n"
     ]
    }
   ],
   "source": [
    "# we load split data into memory to compute the score.\n",
    "split.load()\n",
    "\n",
    "best_score = evaluate.from_rev(rev=f\"exp-{study.best_trial.number}\").score\n",
    "initial_score = evaluate.from_rev(rev=\"HEAD\").score\n",
    "print(f\"Best score: {best_score:.3f} compared to initial score: {initial_score:.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp_dir.cleanup()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "zntrack",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}