docs/ipynb/text_classification.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"!pip install autokeras"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"import keras\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"from sklearn.datasets import load_files\n",
"\n",
"import autokeras as ak"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"## A Simple Example\n",
"The first step is to prepare your data. Here we use the [IMDB\n",
"dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification)\n",
"as an example.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"dataset = keras.utils.get_file(\n",
"    fname=\"aclImdb.tar.gz\",\n",
"    origin=\"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\",\n",
"    extract=True,\n",
")\n",
"\n",
"# set path to dataset\n",
"IMDB_DATADIR = os.path.join(os.path.dirname(dataset), \"aclImdb\")\n",
"\n",
"classes = [\"pos\", \"neg\"]\n",
"train_data = load_files(\n",
"    os.path.join(IMDB_DATADIR, \"train\"), shuffle=True, categories=classes\n",
")\n",
"test_data = load_files(\n",
"    os.path.join(IMDB_DATADIR, \"test\"), shuffle=False, categories=classes\n",
")\n",
"\n",
"# Keep only the first 100 samples of each split so the demo runs quickly.\n",
"x_train = np.array(train_data.data)[:100]\n",
"y_train = np.array(train_data.target)[:100]\n",
"x_test = np.array(test_data.data)[:100]\n",
"y_test = np.array(test_data.target)[:100]\n",
"\n",
"print(x_train.shape)  # (100,)\n",
"print(y_train.shape)  # (100,)\n",
"print(x_train[0][:50])  # First 50 bytes of the first review"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"The second step is to run the [TextClassifier](/text_classifier). As a quick\n",
"demo, we set epochs to 1. You can also leave the epochs unspecified for an\n",
"adaptive number of epochs.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"# Initialize the text classifier.\n",
"clf = ak.TextClassifier(\n",
"    overwrite=True, max_trials=1\n",
")  # It only tries 1 model as a quick demo.\n",
"# Feed the text classifier with training data.\n",
"clf.fit(x_train, y_train, epochs=1, batch_size=2)\n",
"# Predict with the best model.\n",
"predicted_y = clf.predict(x_test)\n",
"# Evaluate the best model with testing data.\n",
"print(clf.evaluate(x_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"## Validation Data\n",
"By default, AutoKeras uses the last 20% of training data as validation data. As\n",
"shown in the example below, you can use `validation_split` to specify the\n",
"percentage.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"clf.fit(\n",
"    x_train,\n",
"    y_train,\n",
"    # Split the training data and use the last 15% as validation data.\n",
"    validation_split=0.15,\n",
"    epochs=1,\n",
"    batch_size=2,\n",
")"
]
},
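{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"To make the idea of using the last fraction of the data concrete, the sketch\n",
"below mimics such a tail split with plain NumPy slicing on toy data. It is only\n",
"an illustration, assuming the split is taken from the end of the arrays without\n",
"shuffling. `tail_split` and the `toy_*` names are hypothetical and not part of\n",
"the AutoKeras API.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def tail_split(x, y, validation_split):\n",
"    # Hypothetical helper: keep the leading samples for training and the\n",
"    # last `validation_split` fraction for validation.\n",
"    n_val = int(len(x) * validation_split)\n",
"    n_train = len(x) - n_val\n",
"    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])\n",
"\n",
"\n",
"toy_x = np.array([\"review %d\" % i for i in range(20)])\n",
"toy_y = np.arange(20) % 2\n",
"(toy_x_tr, toy_y_tr), (toy_x_val, toy_y_val) = tail_split(toy_x, toy_y, 0.25)\n",
"print(len(toy_x_tr), len(toy_x_val))  # 15 5"
]
},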
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"You can also use your own validation set instead of splitting it from the\n",
"training data with `validation_data`.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"# Keep the first `split` samples for training and hold out the rest.\n",
"split = 5\n",
"x_val = x_train[split:]\n",
"y_val = y_train[split:]\n",
"x_train = x_train[:split]\n",
"y_train = y_train[:split]\n",
"clf.fit(\n",
"    x_train,\n",
"    y_train,\n",
"    epochs=1,\n",
"    # Use your own validation set.\n",
"    validation_data=(x_val, y_val),\n",
"    batch_size=2,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"## Customized Search Space\n",
"Advanced users can customize the search space by using\n",
"[AutoModel](/auto_model/#automodel-class) instead of\n",
"[TextClassifier](/text_classifier). You can configure the\n",
"[TextBlock](/block/#textblock-class) for some high-level configurations. You can\n",
"also leave these arguments unspecified, in which case the different choices are\n",
"tuned automatically. See the following example for details.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"input_node = ak.TextInput()\n",
"output_node = ak.TextBlock()(input_node)\n",
"output_node = ak.ClassificationHead()(output_node)\n",
"clf = ak.AutoModel(\n",
"    inputs=input_node, outputs=output_node, overwrite=True, max_trials=1\n",
")\n",
"clf.fit(x_train, y_train, epochs=1, batch_size=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"## Data Format\n",
"The AutoKeras TextClassifier is quite flexible about the data format.\n",
"\n",
"For the text, the input data should be one-dimensional. For the classification\n",
"labels, AutoKeras accepts both plain labels, i.e. strings or integers, and\n",
"one-hot encoded labels, i.e. vectors of 0s and 1s.\n",
"\n",
"We also support using [tf.data.Dataset](\n",
"https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable)\n",
"format for the training data.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"train_set = tf.data.Dataset.from_tensor_slices(((x_train,), (y_train,))).batch(\n",
"    2\n",
")\n",
"test_set = tf.data.Dataset.from_tensor_slices(((x_test,), (y_test,))).batch(2)\n",
"\n",
"clf = ak.TextClassifier(overwrite=True, max_trials=1)\n",
"# Feed the TensorFlow Dataset to the classifier.\n",
"clf.fit(train_set.take(2), epochs=1)\n",
"# Predict with the best model.\n",
"predicted_y = clf.predict(test_set.take(2))\n",
"# Evaluate the best model with testing data.\n",
"print(clf.evaluate(test_set.take(2)))"
]
},
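{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"The label formats mentioned in the Data Format section can be converted into\n",
"one another with plain NumPy. The sketch below is only an illustration on toy\n",
"labels and does not show AutoKeras internals. The `toy_*` names are made up\n",
"for this example. It maps string labels to integer ids and then to one-hot\n",
"vectors.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab_type": "code"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Plain string labels.\n",
"toy_labels = np.array([\"pos\", \"neg\", \"neg\", \"pos\"])\n",
"# Map each distinct label to an integer id (sorted order: neg=0, pos=1).\n",
"toy_classes, toy_ids = np.unique(toy_labels, return_inverse=True)\n",
"# One-hot encode: row i has a 1 in column toy_ids[i].\n",
"toy_one_hot = np.eye(len(toy_classes), dtype=int)[toy_ids]\n",
"\n",
"print(toy_ids)  # [1 0 0 1]\n",
"print(toy_one_hot.shape)  # (4, 2)"
]
},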
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"## Reference\n",
"[TextClassifier](/text_classifier),\n",
"[AutoModel](/auto_model/#automodel-class),\n",
"[TextBlock](/block/#textblock-class),\n",
"[TextInput](/node/#textinput-class),\n",
"[ClassificationHead](/block/#classificationhead-class).\n"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "text_classification",
"private_outputs": false,
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}