{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "!pip install autokeras"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "\n",
    "import keras\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "\n",
    "import autokeras as ak"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Load Images from Disk\n",
    "If the data is too large to put in memory all at once, we can load it batch by\n",
    "batch into memory from disk with tf.data.Dataset.  This\n",
    "[function](\n",
    "https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/\n",
    "image_dataset_from_directory)\n",
    "can help you build such a tf.data.Dataset for image data.\n",
    "\n",
    "First, we download the data and extract the files.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "dataset_url = \"https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz\"  # noqa: E501\n",
    "local_file_path = keras.utils.get_file(\n",
    "    origin=dataset_url, fname=\"image_data\", extract=True\n",
    ")\n",
    "# The file is extracted in the same directory as the downloaded file.\n",
    "local_dir_path = os.path.dirname(local_file_path)\n",
    "# After check mannually, we know the extracted data is in 'flower_photos'.\n",
    "data_dir = os.path.join(local_dir_path, \"flower_photos\")\n",
    "print(data_dir)"
   ]
  },
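  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "As a quick sanity check, we can list the subdirectories of `data_dir`;\n",
    "each one holds the images of a single class.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# List the class folders to confirm the archive extracted as expected.\n",
    "for name in sorted(os.listdir(data_dir)):\n",
    "    if os.path.isdir(os.path.join(data_dir, name)):\n",
    "        print(name)"
   ]
  },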
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The directory should look like this. Each folder contains the images in the\n",
    "same class.\n",
    "\n",
    "```\n",
    "flowers_photos/\n",
    "  daisy/\n",
    "  dandelion/\n",
    "  roses/\n",
    "  sunflowers/\n",
    "  tulips/\n",
    "```\n",
    "\n",
    "We can split the data into training and testing as we load them.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "batch_size = 2\n",
    "img_height = 180\n",
    "img_width = 180\n",
    "\n",
    "train_data = ak.image_dataset_from_directory(\n",
    "    data_dir,\n",
    "    # Use 20% data as testing data.\n",
    "    validation_split=0.2,\n",
    "    subset=\"training\",\n",
    "    # Set seed to ensure the same split when loading testing data.\n",
    "    seed=123,\n",
    "    image_size=(img_height, img_width),\n",
    "    batch_size=batch_size,\n",
    ")\n",
    "\n",
    "test_data = ak.image_dataset_from_directory(\n",
    "    data_dir,\n",
    "    validation_split=0.2,\n",
    "    subset=\"validation\",\n",
    "    seed=123,\n",
    "    image_size=(img_height, img_width),\n",
    "    batch_size=batch_size,\n",
    ")"
   ]
  },
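  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Before training, it helps to peek at one batch to confirm the shapes.\n",
    "The loader returns a tf.data.Dataset, so we can take a single batch and\n",
    "print its shapes and dtypes. This is just a quick sketch; the exact\n",
    "label encoding depends on the loader.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Inspect a single (images, labels) batch from the dataset.\n",
    "for images, labels in train_data.take(1):\n",
    "    print(\"images:\", images.shape, images.dtype)\n",
    "    print(\"labels:\", labels.shape, labels.dtype)"
   ]
  },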
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Then we just do one quick demo of AutoKeras to make sure the dataset works.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "clf = ak.ImageClassifier(overwrite=True, max_trials=1)\n",
    "clf.fit(train_data.take(100), epochs=1)\n",
    "print(clf.evaluate(test_data.take(2)))"
   ]
  },
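  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Once the short run finishes, the best model found can be exported as a\n",
    "plain Keras model for inspection or reuse. A minimal sketch using\n",
    "clf.export_model:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Export the best model found by the search as a Keras model.\n",
    "model = clf.export_model()\n",
    "model.summary()"
   ]
  },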
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Load Texts from Disk\n",
    "You can also load text datasets in the same way.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "dataset_url = \"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n",
    "\n",
    "local_file_path = keras.utils.get_file(\n",
    "    fname=\"text_data\",\n",
    "    origin=dataset_url,\n",
    "    extract=True,\n",
    ")\n",
    "# The file is extracted in the same directory as the downloaded file.\n",
    "local_dir_path = os.path.dirname(local_file_path)\n",
    "# After check mannually, we know the extracted data is in 'aclImdb'.\n",
    "data_dir = os.path.join(local_dir_path, \"aclImdb\")\n",
    "# Remove the unused data folder.\n",
    "\n",
    "shutil.rmtree(os.path.join(data_dir, \"train/unsup\"))"
   ]
  },
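  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "We can list what remains under the train folder to confirm that the\n",
    "unsup folder is gone.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Show the remaining entries; 'unsup' should no longer be listed.\n",
    "print(sorted(os.listdir(os.path.join(data_dir, \"train\"))))"
   ]
  },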
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "For this dataset, the data is already split into train and test.\n",
    "We just load them separately.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "print(data_dir)\n",
    "train_data = ak.text_dataset_from_directory(\n",
    "    os.path.join(data_dir, \"train\"), batch_size=batch_size\n",
    ")\n",
    "\n",
    "test_data = ak.text_dataset_from_directory(\n",
    "    os.path.join(data_dir, \"test\"), shuffle=False, batch_size=batch_size\n",
    ")\n",
    "\n",
    "clf = ak.TextClassifier(overwrite=True, max_trials=1)\n",
    "clf.fit(train_data.take(2), epochs=1)\n",
    "print(clf.evaluate(test_data.take(2)))"
   ]
  },
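  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "As with the image data, we can peek at one batch to see what the text\n",
    "loader yields: a batch of raw strings and a batch of labels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Inspect a single (texts, labels) batch from the dataset.\n",
    "for texts, labels in train_data.take(1):\n",
    "    print(\"texts:\", texts.shape, texts.dtype)\n",
    "    print(\"first sample:\", texts[0].numpy()[:80])\n",
    "    print(\"labels:\", labels.shape, labels.dtype)"
   ]
  },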
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Load Data with Python Generators\n",
    "If you want to use generators, you can refer to the following code.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "N_BATCHES = 2\n",
    "BATCH_SIZE = 10\n",
    "\n",
    "\n",
    "def get_data_generator(n_batches, batch_size):\n",
    "    \"\"\"Get a generator returning n_batches random data.\"\"\"\n",
    "\n",
    "    def data_generator():\n",
    "        for _ in range(n_batches * batch_size):\n",
    "            x = np.random.randn(32, 32, 3)\n",
    "            y = x.sum() / 32 * 32 * 3 > 0.5\n",
    "            yield x, y\n",
    "\n",
    "    return data_generator\n",
    "\n",
    "\n",
    "dataset = tf.data.Dataset.from_generator(\n",
    "    get_data_generator(N_BATCHES, BATCH_SIZE),\n",
    "    output_types=(tf.float32, tf.float32),\n",
    "    output_shapes=((32, 32, 3), tuple()),\n",
    ").batch(BATCH_SIZE)\n",
    "\n",
    "clf = ak.ImageClassifier(overwrite=True, max_trials=1, seed=5)\n",
    "clf.fit(x=dataset, validation_data=dataset, batch_size=BATCH_SIZE)\n",
    "print(clf.evaluate(dataset))"
   ]
  },
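  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Note that output_types and output_shapes are deprecated in newer\n",
    "TensorFlow releases in favor of output_signature. The same dataset can\n",
    "be built as follows, assuming TensorFlow 2.4 or later.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Equivalent construction with output_signature (TF >= 2.4).\n",
    "dataset = tf.data.Dataset.from_generator(\n",
    "    get_data_generator(N_BATCHES, BATCH_SIZE),\n",
    "    output_signature=(\n",
    "        tf.TensorSpec(shape=(32, 32, 3), dtype=tf.float32),\n",
    "        tf.TensorSpec(shape=(), dtype=tf.float32),\n",
    "    ),\n",
    ").batch(BATCH_SIZE)"
   ]
  },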
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Reference\n",
    "[image_dataset_from_directory](utils/#image_dataset_from_directory-function)\n",
    "[text_dataset_from_directory](utils/#text_dataset_from_directory-function)\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "load",
   "private_outputs": false,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}