{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "!pip install autokeras"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import keras\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from sklearn.datasets import load_files\n",
    "\n",
    "import autokeras as ak"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## A Simple Example\n",
    "The first step is to prepare your data. Here we use the [IMDB\n",
    "dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification)\n",
    "as an example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "dataset = keras.utils.get_file(\n",
    "    fname=\"aclImdb.tar.gz\",\n",
    "    origin=\"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\",\n",
    "    extract=True,\n",
    ")\n",
    "\n",
    "# set path to dataset\n",
    "IMDB_DATADIR = os.path.join(os.path.dirname(dataset), \"aclImdb\")\n",
    "\n",
    "classes = [\"pos\", \"neg\"]\n",
    "train_data = load_files(\n",
    "    os.path.join(IMDB_DATADIR, \"train\"), shuffle=True, categories=classes\n",
    ")\n",
    "test_data = load_files(\n",
    "    os.path.join(IMDB_DATADIR, \"test\"), shuffle=False, categories=classes\n",
    ")\n",
    "\n",
    "x_train = np.array(train_data.data)[:100]\n",
    "y_train = np.array(train_data.target)[:100]\n",
    "x_test = np.array(test_data.data)[:100]\n",
    "y_test = np.array(test_data.target)[:100]\n",
    "\n",
    "print(x_train.shape)  # (25000,)\n",
    "print(y_train.shape)  # (25000, 1)\n",
    "print(x_train[0][:50])  # this film was just brilliant casting"
   ]
  },
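  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The labels returned by `load_files` are integers that index into\n",
    "`train_data.target_names`. As a quick sanity check, you can print the\n",
    "mapping and the first few labels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Inspect how the string classes map to the integer labels.\n",
    "print(train_data.target_names)  # e.g. ['neg', 'pos']\n",
    "print(y_train[:10])  # one integer label per review"
   ]
  },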
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The second step is to run the [TextClassifier](/text_classifier).  As a quick\n",
    "demo, we set epochs to 2.  You can also leave the epochs unspecified for an\n",
    "adaptive number of epochs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Initialize the text classifier.\n",
    "clf = ak.TextClassifier(\n",
    "    overwrite=True, max_trials=1\n",
    ")  # It only tries 1 model as a quick demo.\n",
    "# Feed the text classifier with training data.\n",
    "clf.fit(x_train, y_train, epochs=1, batch_size=2)\n",
    "# Predict with the best model.\n",
    "predicted_y = clf.predict(x_test)\n",
    "# Evaluate the best model with testing data.\n",
    "print(clf.evaluate(x_test, y_test))"
   ]
  },
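  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "If you would rather not fix the number of epochs, the sketch below simply\n",
    "omits the `epochs` argument so that AutoKeras picks an adaptive number of\n",
    "epochs, as mentioned above. Note that this may train noticeably longer than\n",
    "the fixed-epoch demo.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Leave `epochs` unspecified for an adaptive number of epochs.\n",
    "clf.fit(x_train, y_train, batch_size=2)"
   ]
  },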
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Validation Data\n",
    "By default, AutoKeras use the last 20% of training data as validation data.  As\n",
    "shown in the example below, you can use `validation_split` to specify the\n",
    "percentage.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "clf.fit(\n",
    "    x_train,\n",
    "    y_train,\n",
    "    # Split the training data and use the last 15% as validation data.\n",
    "    validation_split=0.15,\n",
    "    epochs=1,\n",
    "    batch_size=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "You can also use your own validation set instead of splitting it from the\n",
    "training data with `validation_data`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "split = 5\n",
    "x_val = x_train[split:]\n",
    "y_val = y_train[split:]\n",
    "x_train = x_train[:split]\n",
    "y_train = y_train[:split]\n",
    "clf.fit(\n",
    "    x_train,\n",
    "    y_train,\n",
    "    epochs=1,\n",
    "    # Use your own validation set.\n",
    "    validation_data=(x_val, y_val),\n",
    "    batch_size=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Customized Search Space\n",
    "For advanced users, you may customize your search space by using\n",
    "[AutoModel](/auto_model/#automodel-class) instead of\n",
    "[TextClassifier](/text_classifier). You can configure the\n",
    "[TextBlock](/block/#textblock-class) for some high-level configurations. You can\n",
    "also do not specify these arguments, which would leave the different choices to\n",
    "be tuned automatically.  See the following example for detail.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "input_node = ak.TextInput()\n",
    "output_node = ak.TextBlock()(input_node)\n",
    "output_node = ak.ClassificationHead()(output_node)\n",
    "clf = ak.AutoModel(\n",
    "    inputs=input_node, outputs=output_node, overwrite=True, max_trials=1\n",
    ")\n",
    "clf.fit(x_train, y_train, epochs=1, batch_size=2)"
   ]
  },
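  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "If you want to go one level deeper, you can also spell out the text-specific\n",
    "blocks yourself instead of using [TextBlock](/block/#textblock-class). The\n",
    "sketch below is one possible pipeline, assuming your AutoKeras version\n",
    "provides the `TextToIntSequence`, `Embedding`, and\n",
    "[ConvBlock](/block/#convblock-class) blocks; the exact block names and\n",
    "arguments may differ between releases.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "input_node = ak.TextInput()\n",
    "# Convert the raw strings to integer sequences, embed them, and run a\n",
    "# separable convolution before the classification head.\n",
    "output_node = ak.TextToIntSequence()(input_node)\n",
    "output_node = ak.Embedding()(output_node)\n",
    "output_node = ak.ConvBlock(separable=True)(output_node)\n",
    "output_node = ak.ClassificationHead()(output_node)\n",
    "clf = ak.AutoModel(\n",
    "    inputs=input_node, outputs=output_node, overwrite=True, max_trials=1\n",
    ")\n",
    "clf.fit(x_train, y_train, epochs=1, batch_size=2)"
   ]
  },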
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Data Format\n",
    "The AutoKeras TextClassifier is quite flexible for the data format.\n",
    "\n",
    "For the text, the input data should be one-dimensional For the classification\n",
    "labels, AutoKeras accepts both plain labels, i.e. strings or integers, and\n",
    "one-hot encoded encoded labels, i.e. vectors of 0s and 1s.\n",
    "\n",
    "We also support using [tf.data.Dataset](\n",
    "https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable)\n",
    "format for the training data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "train_set = tf.data.Dataset.from_tensor_slices(((x_train,), (y_train,))).batch(\n",
    "    2\n",
    ")\n",
    "test_set = tf.data.Dataset.from_tensor_slices(((x_test,), (y_test,))).batch(2)\n",
    "\n",
    "clf = ak.TextClassifier(overwrite=True, max_trials=1)\n",
    "# Feed the tensorflow Dataset to the classifier.\n",
    "clf.fit(train_set.take(2), epochs=1)\n",
    "# Predict with the best model.\n",
    "predicted_y = clf.predict(test_set.take(2))\n",
    "# Evaluate the best model with testing data.\n",
    "print(clf.evaluate(test_set.take(2)))"
   ]
  },
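  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "As noted above, the classification labels can also be one-hot encoded. The\n",
    "sketch below encodes the integer labels with `keras.utils.to_categorical`\n",
    "before fitting.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# One-hot encode the integer labels (two classes: pos and neg).\n",
    "y_train_one_hot = keras.utils.to_categorical(y_train, num_classes=2)\n",
    "y_test_one_hot = keras.utils.to_categorical(y_test, num_classes=2)\n",
    "\n",
    "clf = ak.TextClassifier(overwrite=True, max_trials=1)\n",
    "# AutoKeras accepts one-hot encoded labels as well as plain labels.\n",
    "clf.fit(x_train, y_train_one_hot, epochs=1, batch_size=2)\n",
    "print(clf.evaluate(x_test, y_test_one_hot))"
   ]
  },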
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Reference\n",
    "[TextClassifier](/text_classifier),\n",
    "[AutoModel](/auto_model/#automodel-class),\n",
    "[ConvBlock](/block/#convblock-class),\n",
    "[TextInput](/node/#textinput-class),\n",
    "[ClassificationHead](/block/#classificationhead-class).\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "text_classification",
   "private_outputs": false,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}