{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# Resource usage of the StellarGraph class\n",
"\n",
"> This notebook records the time and memory (both peak and long-term) required to construct a StellarGraph object for several datasets."
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {
"nbsphinx": "hidden",
"tags": [
"CloudRunner"
]
},
"source": [
"<table><tr><td>Run the latest release of this notebook:</td><td><a href=\"https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/zzz-internal-developers/graph-resource-usage.ipynb\" alt=\"Open In Binder\" target=\"_parent\"><img src=\"https://mybinder.org/badge_logo.svg\"/></a></td><td><a href=\"https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/zzz-internal-developers/graph-resource-usage.ipynb\" alt=\"Open In Colab\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a></td></tr></table>"
]
},
{
"cell_type": "markdown",
"id": "2",
"metadata": {},
"source": [
"This notebook is aimed at helping contributors to the StellarGraph library itself understand how their changes affect the resource usage of the `StellarGraph` object.\n",
"\n",
"Various measures of resource usage for several \"real world\" graphs of various sizes are recorded:\n",
"\n",
"- time for construction\n",
"- memory usage of the final `StellarGraph` object\n",
"- peak memory usage during `StellarGraph` construction (both absolute, and additional compared to the raw input data)\n",
"\n",
"These are recorded both with explicit nodes (and node features if they exist), and implicit/inferred nodes.\n",
"\n",
"The memory usage is recorded end-to-end. That is, the recording starts from data on disk and continues until the `StellarGraph` object has been constructed and other data has been cleaned up. This is important for accurately recording the total memory usage, as NumPy arrays can often share data with existing arrays in memory and so retroactive or partial (starting from data in memory) analysis can miss significant amounts of data. The parsing code in `stellargraph.datasets` doesn't allow determining the memory usage of the intermediate nodes and edges input to the `StellarGraph` constructor, and so cannot be used here."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3",
"metadata": {
"nbsphinx": "hidden",
"tags": [
"CloudRunner"
]
},
"outputs": [],
"source": [
"# install StellarGraph if running on Google Colab\n",
"import sys\n",
"if 'google.colab' in sys.modules:\n",
" %pip install -q stellargraph[demos]==1.3.0b"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4",
"metadata": {
"nbsphinx": "hidden",
"tags": [
"VersionCheck"
]
},
"outputs": [],
"source": [
"# verify that we're using the correct version of StellarGraph for this notebook\n",
"import stellargraph as sg\n",
"\n",
"try:\n",
" sg.utils.validate_notebook_version(\"1.3.0b\")\n",
"except AttributeError:\n",
" raise ValueError(\n",
" f\"This notebook requires StellarGraph version 1.3.0b, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>.\"\n",
" ) from None"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5",
"metadata": {},
"outputs": [],
"source": [
"import stellargraph as sg\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import gc\n",
"import json\n",
"import os\n",
"import timeit\n",
"import tempfile\n",
"import tracemalloc"
]
},
{
"cell_type": "markdown",
"id": "6",
"metadata": {},
"source": [
"## Optional reddit data\n",
"\n",
"The original GraphSAGE paper evaluated on a reddit dataset, available at <http://snap.stanford.edu/graphsage/#datasets>. This dataset is large (1.3GB compressed) and so there is no automatic download support for it. The following `reddit_path` variable controls whether and how the reddit dataset is included:\n",
"\n",
"- to ignore the dataset: set the variable to `None`\n",
"- to include the dataset: download the dataset zip, decompress it, and set the variable to the decompressed directory"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7",
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"reddit_path = os.path.expanduser(\"~/data/reddit\")"
]
},
{
"cell_type": "markdown",
"id": "8",
"metadata": {},
"source": [
"## Datasets"
]
},
{
"cell_type": "markdown",
"id": "9",
"metadata": {},
"source": [
"### Cora"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
"cora = sg.datasets.Cora()\n",
"cora.download()\n",
"\n",
"cora_cites_path = os.path.join(cora.data_directory, \"cora.cites\")\n",
"cora_content_path = os.path.join(cora.data_directory, \"cora.content\")\n",
"cora_dtypes = {0: int, **{i: np.float32 for i in range(1, 1433 + 1)}}\n",
"\n",
"\n",
"def cora_pandas_parts(include_nodes):\n",
"    if include_nodes:\n",
"        nodes = pd.read_csv(\n",
"            cora_content_path,\n",
"            header=None,\n",
"            sep=\"\\t\",\n",
"            index_col=0,\n",
"            usecols=range(0, 1433 + 1),\n",
"            dtype=cora_dtypes,\n",
"            na_filter=False,\n",
"        )\n",
"    else:\n",
"        nodes = None\n",
"    edges = pd.read_csv(\n",
"        cora_cites_path,\n",
"        header=None,\n",
"        sep=\"\\t\",\n",
"        names=[\"source\", \"target\"],\n",
"        dtype=int,\n",
"        na_filter=False,\n",
"    )\n",
"    return nodes, edges, {}\n",
"\n",
"\n",
"def cora_indexed_array_parts(include_nodes):\n",
"    nodes, edges, args = cora_pandas_parts(include_nodes)\n",
"    if nodes is not None:\n",
"        nodes = sg.IndexedArray(nodes.to_numpy(), index=nodes.index)\n",
"    return nodes, edges, args"
]
},
{
"cell_type": "markdown",
"id": "11",
"metadata": {},
"source": [
"### BlogCatalog3"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "12",
"metadata": {},
"outputs": [],
"source": [
"blogcatalog3 = sg.datasets.BlogCatalog3()\n",
"blogcatalog3.download()\n",
"\n",
"blogcatalog3_edges = os.path.join(blogcatalog3.data_directory, \"edges.csv\")\n",
"blogcatalog3_group_edges = os.path.join(blogcatalog3.data_directory, \"group-edges.csv\")\n",
"blogcatalog3_groups = os.path.join(blogcatalog3.data_directory, \"groups.csv\")\n",
"blogcatalog3_nodes = os.path.join(blogcatalog3.data_directory, \"nodes.csv\")\n",
"\n",
"\n",
"def blogcatalog3_parts(include_nodes):\n",
"    if include_nodes:\n",
"        raw_nodes = pd.read_csv(blogcatalog3_nodes, header=None)[0]\n",
"        raw_groups = pd.read_csv(blogcatalog3_groups, header=None)[0]\n",
"        nodes = {\n",
"            \"user\": pd.DataFrame(index=raw_nodes),\n",
"            \"group\": pd.DataFrame(index=-raw_groups),\n",
"        }\n",
"    else:\n",
"        nodes = None\n",
"\n",
"    edges = pd.read_csv(blogcatalog3_edges, header=None, names=[\"source\", \"target\"])\n",
"\n",
"    group_edges = pd.read_csv(\n",
"        blogcatalog3_group_edges, header=None, names=[\"source\", \"target\"]\n",
"    )\n",
"    group_edges[\"target\"] *= -1\n",
"    start = len(edges)\n",
"    group_edges.index = range(start, start + len(group_edges))\n",
"\n",
"    edges = {\"friend\": edges, \"belongs\": group_edges}\n",
"    return nodes, edges, {}"
]
},
{
"cell_type": "markdown",
"id": "13",
"metadata": {},
"source": [
"### FB15k"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "14",
"metadata": {},
"outputs": [],
"source": [
"fb15k = sg.datasets.FB15k()\n",
"fb15k.download()\n",
"fb15k_files = [\n",
"    os.path.join(fb15k.data_directory, f\"freebase_mtr100_mte100-{x}.txt\")\n",
"    for x in [\"train\", \"test\", \"valid\"]\n",
"]\n",
"\n",
"\n",
"def fb15k_parts(include_nodes, usecols=None):\n",
"    loaded = [\n",
"        pd.read_csv(\n",
"            name,\n",
"            header=None,\n",
"            names=[\"source\", \"label\", \"target\"],\n",
"            sep=\"\\t\",\n",
"            dtype=str,\n",
"            na_filter=False,\n",
"            usecols=usecols,\n",
"        )\n",
"        for name in fb15k_files\n",
"    ]\n",
"    edges = pd.concat(loaded, ignore_index=True)\n",
"\n",
"    if include_nodes:\n",
"        # infer the set of nodes manually, in a memory-minimal way\n",
"        raw_nodes = set(edges.source)\n",
"        raw_nodes.update(edges.target)\n",
"        nodes = pd.DataFrame(index=raw_nodes)\n",
"    else:\n",
"        nodes = None\n",
"\n",
"    return nodes, edges, {\"edge_type_column\": \"label\"}\n",
"\n",
"\n",
"def fb15k_no_edge_types_parts(include_nodes):\n",
"    nodes, edges, _ = fb15k_parts(include_nodes, usecols=[\"source\", \"target\"])\n",
"    return nodes, edges, {}"
]
},
{
"cell_type": "markdown",
"id": "15",
"metadata": {},
"source": [
"### reddit\n",
"\n",
"As discussed above, the reddit dataset is large and optional. It is also slow to parse, as the graph structure is a huge JSON file. Thus, we prepare the dataset by converting that JSON file into a NumPy edge list array, of shape `(num_edges, 2)`. This is significantly faster to load from disk."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "16",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 15.9 s, sys: 1.97 s, total: 17.8 s\n",
"Wall time: 17.9 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# if requested, prepare the reddit dataset by saving the slow-to-read JSON to a temporary .npy file\n",
"if reddit_path is not None:\n",
"    reddit_graph_path = os.path.join(reddit_path, \"reddit-G.json\")\n",
"    reddit_feats_path = os.path.join(reddit_path, \"reddit-feats.npy\")\n",
"\n",
"    with open(reddit_graph_path) as f:\n",
"        reddit_g = json.load(f)\n",
"    reddit_numpy_edges = np.array([[x[\"source\"], x[\"target\"]] for x in reddit_g[\"links\"]])\n",
"\n",
"    reddit_edges_file = tempfile.NamedTemporaryFile(suffix=\".npy\")\n",
"    np.save(reddit_edges_file, reddit_numpy_edges)\n",
"    # flush buffered writes so the file can be re-read by name later\n",
"    reddit_edges_file.flush()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "17",
"metadata": {},
"outputs": [],
"source": [
"def reddit_numpy_parts(include_nodes):\n",
"    if include_nodes:\n",
"        nodes = np.load(reddit_feats_path).astype(np.float32)\n",
"    else:\n",
"        nodes = None\n",
"\n",
"    raw_edges = np.load(reddit_edges_file.name)\n",
"    edges = pd.DataFrame(raw_edges, columns=[\"source\", \"target\"])\n",
"    return nodes, edges, {}\n",
"\n",
"\n",
"def reddit_pandas_parts(include_nodes):\n",
"    nodes, edges, args = reddit_numpy_parts(include_nodes)\n",
"    if nodes is not None:\n",
"        nodes = pd.DataFrame(nodes)\n",
"\n",
"    return nodes, edges, args"
]
},
{
"cell_type": "markdown",
"id": "18",
"metadata": {},
"source": [
"### Collected"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "19",
"metadata": {},
"outputs": [],
"source": [
"datasets = {\n",
"    \"Cora (Pandas)\": cora_pandas_parts,\n",
"    \"Cora (IndexedArray)\": cora_indexed_array_parts,\n",
"    \"BlogCatalog3\": blogcatalog3_parts,\n",
"    \"FB15k (no edge types)\": fb15k_no_edge_types_parts,\n",
"    \"FB15k\": fb15k_parts,\n",
"}\n",
"if reddit_path is not None:\n",
"    datasets[\"reddit (Pandas)\"] = reddit_pandas_parts\n",
"    datasets[\"reddit (NumPy)\"] = reddit_numpy_parts"
]
},
{
"cell_type": "markdown",
"id": "20",
"metadata": {},
"source": [
"## Measurement"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "21",
"metadata": {},
"outputs": [],
"source": [
"def mem_snapshot_diff(after, before):\n",
"    \"\"\"Total memory difference, in bytes, between two tracemalloc.Snapshot objects\"\"\"\n",
"    return sum(elem.size_diff for elem in after.compare_to(before, \"lineno\"))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "22",
"metadata": {},
"outputs": [],
"source": [
"# names of columns computed by the measurement code\n",
"def measurement_columns(title):\n",
"    names = [\n",
"        \"time\",\n",
"        \"memory (graph)\",\n",
"        \"memory (graph, not shared with data)\",\n",
"        \"peak memory (graph)\",\n",
"        \"peak memory (graph, ignoring data)\",\n",
"        \"memory (data)\",\n",
"        \"peak memory (data)\",\n",
"    ]\n",
"    return [(title, x) for x in names]\n",
"\n",
"\n",
"columns = pd.MultiIndex.from_tuples(\n",
"    [\n",
"        (\"graph\", \"nodes\"),\n",
"        (\"graph\", \"node feat size\"),\n",
"        (\"graph\", \"edges\"),\n",
"        *measurement_columns(\"explicit nodes\"),\n",
"        *measurement_columns(\"inferred nodes (no features)\"),\n",
"    ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "23",
"metadata": {},
"outputs": [],
"source": [
"def measure_time(f, include_nodes):\n",
"    nodes, edges, args = f(include_nodes)\n",
"    start = timeit.default_timer()\n",
"    sg.StellarGraph(nodes, edges, **args)\n",
"    end = timeit.default_timer()\n",
"    return end - start"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "24",
"metadata": {},
"outputs": [],
"source": [
"def measure_memory(f, include_nodes):\n",
"    \"\"\"\n",
"    Measure exactly what it takes to load the data.\n",
"\n",
"    - the size of the original edge data (as a baseline)\n",
"    - the size of the final graph\n",
"    - the peak memory use of both\n",
"\n",
"    This uses a similar technique to the 'allocation_benchmark' fixture in tests/test_utils/alloc.py.\n",
"    \"\"\"\n",
"    gc.collect()\n",
"    # ensure we're measuring the worst-case peak, when no GC happens\n",
"    gc.disable()\n",
"\n",
"    tracemalloc.start()\n",
"    snapshot_start = tracemalloc.take_snapshot()\n",
"\n",
"    nodes, edges, args = f(include_nodes)\n",
"\n",
"    gc.collect()\n",
"    _, data_memory_peak = tracemalloc.get_traced_memory()\n",
"    snapshot_data = tracemalloc.take_snapshot()\n",
"\n",
"    if include_nodes:\n",
"        assert nodes is not None, f\n",
"        sg_g = sg.StellarGraph(nodes, edges, **args)\n",
"    else:\n",
"        assert nodes is None, f\n",
"        sg_g = sg.StellarGraph(edges=edges, **args)\n",
"\n",
"    gc.collect()\n",
"    snapshot_graph = tracemalloc.take_snapshot()\n",
"\n",
"    # clean up the input data and anything else leftover, so that the snapshot\n",
"    # includes only the long-lasting data: the StellarGraph.\n",
"    del edges\n",
"    del nodes\n",
"    del args\n",
"    gc.collect()\n",
"\n",
"    _, graph_memory_peak = tracemalloc.get_traced_memory()\n",
"    snapshot_end = tracemalloc.take_snapshot()\n",
"    tracemalloc.stop()\n",
"\n",
"    gc.enable()\n",
"\n",
"    data_memory = mem_snapshot_diff(snapshot_data, snapshot_start)\n",
"    graph_memory = mem_snapshot_diff(snapshot_end, snapshot_start)\n",
"    graph_over_data_memory = mem_snapshot_diff(snapshot_graph, snapshot_data)\n",
"\n",
"    return (\n",
"        sg_g,\n",
"        graph_memory,\n",
"        graph_over_data_memory,\n",
"        graph_memory_peak,\n",
"        graph_memory_peak - data_memory,\n",
"        data_memory,\n",
"        data_memory_peak,\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "25",
"metadata": {},
"outputs": [],
"source": [
"def measure(f):\n",
"    time_nodes = measure_time(f, include_nodes=True)\n",
"    time_no_nodes = measure_time(f, include_nodes=False)\n",
"\n",
"    sg_g, *mem_nodes = measure_memory(f, include_nodes=True)\n",
"    _, *mem_no_nodes = measure_memory(f, include_nodes=False)\n",
"\n",
"    feat_sizes = sg_g.node_feature_sizes()\n",
"    try:\n",
"        feat_sizes = feat_sizes[sg_g.unique_node_type()]\n",
"    except ValueError:\n",
"        pass\n",
"\n",
"    return [\n",
"        sg_g.number_of_nodes(),\n",
"        feat_sizes,\n",
"        sg_g.number_of_edges(),\n",
"        time_nodes,\n",
"        *mem_nodes,\n",
"        time_no_nodes,\n",
"        *mem_no_nodes,\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "26",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 28 s, sys: 7.04 s, total: 35 s\n",
"Wall time: 35.1 s\n"
]
}
],
"source": [
"%%time\n",
"recorded = [measure(f) for f in datasets.values()]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "27",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"3\" halign=\"left\">graph</th>\n",
" <th colspan=\"7\" halign=\"left\">explicit nodes</th>\n",
" <th colspan=\"7\" halign=\"left\">inferred nodes (no features)</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>nodes</th>\n",
" <th>node feat size</th>\n",
" <th>edges</th>\n",
" <th>time</th>\n",
" <th>memory (graph)</th>\n",
" <th>memory (graph, not shared with data)</th>\n",
" <th>peak memory (graph)</th>\n",
" <th>peak memory (graph, ignoring data)</th>\n",
" <th>memory (data)</th>\n",
" <th>peak memory (data)</th>\n",
" <th>time</th>\n",
" <th>memory (graph)</th>\n",
" <th>memory (graph, not shared with data)</th>\n",
" <th>peak memory (graph)</th>\n",
" <th>peak memory (graph, ignoring data)</th>\n",
" <th>memory (data)</th>\n",
" <th>peak memory (data)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Cora (Pandas)</th>\n",
" <td>2708</td>\n",
" <td>1433</td>\n",
" <td>5429</td>\n",
" <td>0.025028</td>\n",
" <td>15586530</td>\n",
" <td>15564897</td>\n",
" <td>46764625</td>\n",
" <td>31079400</td>\n",
" <td>15685225</td>\n",
" <td>31995857</td>\n",
" <td>0.002037</td>\n",
" <td>60994</td>\n",
" <td>63025</td>\n",
" <td>251118</td>\n",
" <td>160985</td>\n",
" <td>90133</td>\n",
" <td>197529</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cora (IndexedArray)</th>\n",
" <td>2708</td>\n",
" <td>1433</td>\n",
" <td>5429</td>\n",
" <td>0.001163</td>\n",
" <td>15585170</td>\n",
" <td>40633</td>\n",
" <td>31993945</td>\n",
" <td>16356516</td>\n",
" <td>15637429</td>\n",
" <td>31993945</td>\n",
" <td>0.001545</td>\n",
" <td>61018</td>\n",
" <td>63049</td>\n",
" <td>251118</td>\n",
" <td>160985</td>\n",
" <td>90133</td>\n",
" <td>197529</td>\n",
" </tr>\n",
" <tr>\n",
" <th>BlogCatalog3</th>\n",
" <td>10351</td>\n",
" <td>{'group': 0, 'user': 0}</td>\n",
" <td>348459</td>\n",
" <td>0.020382</td>\n",
" <td>4635099</td>\n",
" <td>7428226</td>\n",
" <td>14146092</td>\n",
" <td>8477331</td>\n",
" <td>5668761</td>\n",
" <td>10805413</td>\n",
" <td>0.027501</td>\n",
" <td>4633843</td>\n",
" <td>7427186</td>\n",
" <td>14061652</td>\n",
" <td>8479763</td>\n",
" <td>5581889</td>\n",
" <td>10711633</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FB15k (no edge types)</th>\n",
" <td>14951</td>\n",
" <td>0</td>\n",
" <td>592213</td>\n",
" <td>0.098957</td>\n",
" <td>3970442</td>\n",
" <td>2985020</td>\n",
" <td>25830730</td>\n",
" <td>10151739</td>\n",
" <td>15678991</td>\n",
" <td>25830730</td>\n",
" <td>0.184846</td>\n",
" <td>3969362</td>\n",
" <td>3107220</td>\n",
" <td>34644016</td>\n",
" <td>19090289</td>\n",
" <td>15553727</td>\n",
" <td>25049683</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FB15k</th>\n",
" <td>14951</td>\n",
" <td>0</td>\n",
" <td>592213</td>\n",
" <td>0.610353</td>\n",
" <td>9793950</td>\n",
" <td>13398273</td>\n",
" <td>57650243</td>\n",
" <td>36747424</td>\n",
" <td>20902819</td>\n",
" <td>35792614</td>\n",
" <td>0.700297</td>\n",
" <td>9794126</td>\n",
" <td>13521649</td>\n",
" <td>57650811</td>\n",
" <td>36873168</td>\n",
" <td>20777643</td>\n",
" <td>35011663</td>\n",
" </tr>\n",
" <tr>\n",
" <th>reddit (Pandas)</th>\n",
" <td>232965</td>\n",
" <td>602</td>\n",
" <td>11606919</td>\n",
" <td>3.130784</td>\n",
" <td>665684661</td>\n",
" <td>665691152</td>\n",
" <td>1868696353</td>\n",
" <td>1121990320</td>\n",
" <td>746706033</td>\n",
" <td>1682947017</td>\n",
" <td>0.483119</td>\n",
" <td>106555123</td>\n",
" <td>106556406</td>\n",
" <td>375628530</td>\n",
" <td>189913865</td>\n",
" <td>185714665</td>\n",
" <td>185723196</td>\n",
" </tr>\n",
" <tr>\n",
" <th>reddit (NumPy)</th>\n",
" <td>232965</td>\n",
" <td>602</td>\n",
" <td>11606919</td>\n",
" <td>0.545932</td>\n",
" <td>665684061</td>\n",
" <td>104705536</td>\n",
" <td>1682947017</td>\n",
" <td>936252548</td>\n",
" <td>746694469</td>\n",
" <td>1682947017</td>\n",
" <td>0.475468</td>\n",
" <td>106555123</td>\n",
" <td>106556406</td>\n",
" <td>375628530</td>\n",
" <td>189913865</td>\n",
" <td>185714665</td>\n",
" <td>185723196</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" graph \\\n",
" nodes node feat size edges \n",
"Cora (Pandas) 2708 1433 5429 \n",
"Cora (IndexedArray) 2708 1433 5429 \n",
"BlogCatalog3 10351 {'group': 0, 'user': 0} 348459 \n",
"FB15k (no edge types) 14951 0 592213 \n",
"FB15k 14951 0 592213 \n",
"reddit (Pandas) 232965 602 11606919 \n",
"reddit (NumPy) 232965 602 11606919 \n",
"\n",
" explicit nodes \\\n",
" time memory (graph) \n",
"Cora (Pandas) 0.025028 15586530 \n",
"Cora (IndexedArray) 0.001163 15585170 \n",
"BlogCatalog3 0.020382 4635099 \n",
"FB15k (no edge types) 0.098957 3970442 \n",
"FB15k 0.610353 9793950 \n",
"reddit (Pandas) 3.130784 665684661 \n",
"reddit (NumPy) 0.545932 665684061 \n",
"\n",
" \\\n",
" memory (graph, not shared with data) \n",
"Cora (Pandas) 15564897 \n",
"Cora (IndexedArray) 40633 \n",
"BlogCatalog3 7428226 \n",
"FB15k (no edge types) 2985020 \n",
"FB15k 13398273 \n",
"reddit (Pandas) 665691152 \n",
"reddit (NumPy) 104705536 \n",
"\n",
" \\\n",
" peak memory (graph) peak memory (graph, ignoring data) \n",
"Cora (Pandas) 46764625 31079400 \n",
"Cora (IndexedArray) 31993945 16356516 \n",
"BlogCatalog3 14146092 8477331 \n",
"FB15k (no edge types) 25830730 10151739 \n",
"FB15k 57650243 36747424 \n",
"reddit (Pandas) 1868696353 1121990320 \n",
"reddit (NumPy) 1682947017 936252548 \n",
"\n",
" \\\n",
" memory (data) peak memory (data) \n",
"Cora (Pandas) 15685225 31995857 \n",
"Cora (IndexedArray) 15637429 31993945 \n",
"BlogCatalog3 5668761 10805413 \n",
"FB15k (no edge types) 15678991 25830730 \n",
"FB15k 20902819 35792614 \n",
"reddit (Pandas) 746706033 1682947017 \n",
"reddit (NumPy) 746694469 1682947017 \n",
"\n",
" inferred nodes (no features) \\\n",
" time memory (graph) \n",
"Cora (Pandas) 0.002037 60994 \n",
"Cora (IndexedArray) 0.001545 61018 \n",
"BlogCatalog3 0.027501 4633843 \n",
"FB15k (no edge types) 0.184846 3969362 \n",
"FB15k 0.700297 9794126 \n",
"reddit (Pandas) 0.483119 106555123 \n",
"reddit (NumPy) 0.475468 106555123 \n",
"\n",
" \\\n",
" memory (graph, not shared with data) \n",
"Cora (Pandas) 63025 \n",
"Cora (IndexedArray) 63049 \n",
"BlogCatalog3 7427186 \n",
"FB15k (no edge types) 3107220 \n",
"FB15k 13521649 \n",
"reddit (Pandas) 106556406 \n",
"reddit (NumPy) 106556406 \n",
"\n",
" \\\n",
" peak memory (graph) peak memory (graph, ignoring data) \n",
"Cora (Pandas) 251118 160985 \n",
"Cora (IndexedArray) 251118 160985 \n",
"BlogCatalog3 14061652 8479763 \n",
"FB15k (no edge types) 34644016 19090289 \n",
"FB15k 57650811 36873168 \n",
"reddit (Pandas) 375628530 189913865 \n",
"reddit (NumPy) 375628530 189913865 \n",
"\n",
" \n",
" memory (data) peak memory (data) \n",
"Cora (Pandas) 90133 197529 \n",
"Cora (IndexedArray) 90133 197529 \n",
"BlogCatalog3 5581889 10711633 \n",
"FB15k (no edge types) 15553727 25049683 \n",
"FB15k 20777643 35011663 \n",
"reddit (Pandas) 185714665 185723196 \n",
"reddit (NumPy) 185714665 185723196 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw = pd.DataFrame(recorded, columns=columns, index=datasets.keys())\n",
"raw"
]
},
{
"cell_type": "markdown",
"id": "28",
"metadata": {},
"source": [
"## Pretty results\n",
"\n",
"This shows the results in a prettier way, such as memory in MB instead of bytes."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "29",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"3\" halign=\"left\">graph</th>\n",
" <th colspan=\"7\" halign=\"left\">explicit nodes</th>\n",
" <th colspan=\"7\" halign=\"left\">inferred nodes (no features)</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>nodes</th>\n",
" <th>node feat size</th>\n",
" <th>edges</th>\n",
" <th>time</th>\n",
" <th>memory (graph)</th>\n",
" <th>memory (graph, not shared with data)</th>\n",
" <th>peak memory (graph)</th>\n",
" <th>peak memory (graph, ignoring data)</th>\n",
" <th>memory (data)</th>\n",
" <th>peak memory (data)</th>\n",
" <th>time</th>\n",
" <th>memory (graph)</th>\n",
" <th>memory (graph, not shared with data)</th>\n",
" <th>peak memory (graph)</th>\n",
" <th>peak memory (graph, ignoring data)</th>\n",
" <th>memory (data)</th>\n",
" <th>peak memory (data)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Cora (Pandas)</th>\n",
" <td>2708</td>\n",
" <td>1433</td>\n",
" <td>5429</td>\n",
" <td>0.025028</td>\n",
" <td>15.587</td>\n",
" <td>15.565</td>\n",
" <td>46.765</td>\n",
" <td>31.079</td>\n",
" <td>15.685</td>\n",
" <td>31.996</td>\n",
" <td>0.002037</td>\n",
" <td>0.061</td>\n",
" <td>0.063</td>\n",
" <td>0.251</td>\n",
" <td>0.161</td>\n",
" <td>0.090</td>\n",
" <td>0.198</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cora (IndexedArray)</th>\n",
" <td>2708</td>\n",
" <td>1433</td>\n",
" <td>5429</td>\n",
" <td>0.001163</td>\n",
" <td>15.585</td>\n",
" <td>0.041</td>\n",
" <td>31.994</td>\n",
" <td>16.357</td>\n",
" <td>15.637</td>\n",
" <td>31.994</td>\n",
" <td>0.001545</td>\n",
" <td>0.061</td>\n",
" <td>0.063</td>\n",
" <td>0.251</td>\n",
" <td>0.161</td>\n",
" <td>0.090</td>\n",
" <td>0.198</td>\n",
" </tr>\n",
" <tr>\n",
" <th>BlogCatalog3</th>\n",
" <td>10351</td>\n",
" <td>{'group': 0, 'user': 0}</td>\n",
" <td>348459</td>\n",
" <td>0.020382</td>\n",
" <td>4.635</td>\n",
" <td>7.428</td>\n",
" <td>14.146</td>\n",
" <td>8.477</td>\n",
" <td>5.669</td>\n",
" <td>10.805</td>\n",
" <td>0.027501</td>\n",
" <td>4.634</td>\n",
" <td>7.427</td>\n",
" <td>14.062</td>\n",
" <td>8.480</td>\n",
" <td>5.582</td>\n",
" <td>10.712</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FB15k (no edge types)</th>\n",
" <td>14951</td>\n",
" <td>0</td>\n",
" <td>592213</td>\n",
" <td>0.098957</td>\n",
" <td>3.970</td>\n",
" <td>2.985</td>\n",
" <td>25.831</td>\n",
" <td>10.152</td>\n",
" <td>15.679</td>\n",
" <td>25.831</td>\n",
" <td>0.184846</td>\n",
" <td>3.969</td>\n",
" <td>3.107</td>\n",
" <td>34.644</td>\n",
" <td>19.090</td>\n",
" <td>15.554</td>\n",
" <td>25.050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FB15k</th>\n",
" <td>14951</td>\n",
" <td>0</td>\n",
" <td>592213</td>\n",
" <td>0.610353</td>\n",
" <td>9.794</td>\n",
" <td>13.398</td>\n",
" <td>57.650</td>\n",
" <td>36.747</td>\n",
" <td>20.903</td>\n",
" <td>35.793</td>\n",
" <td>0.700297</td>\n",
" <td>9.794</td>\n",
" <td>13.522</td>\n",
" <td>57.651</td>\n",
" <td>36.873</td>\n",
" <td>20.778</td>\n",
" <td>35.012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>reddit (Pandas)</th>\n",
" <td>232965</td>\n",
" <td>602</td>\n",
" <td>11606919</td>\n",
" <td>3.130784</td>\n",
" <td>665.685</td>\n",
" <td>665.691</td>\n",
" <td>1868.696</td>\n",
" <td>1121.990</td>\n",
" <td>746.706</td>\n",
" <td>1682.947</td>\n",
" <td>0.483119</td>\n",
" <td>106.555</td>\n",
" <td>106.556</td>\n",
" <td>375.629</td>\n",
" <td>189.914</td>\n",
" <td>185.715</td>\n",
" <td>185.723</td>\n",
" </tr>\n",
" <tr>\n",
" <th>reddit (NumPy)</th>\n",
" <td>232965</td>\n",
" <td>602</td>\n",
" <td>11606919</td>\n",
" <td>0.545932</td>\n",
" <td>665.684</td>\n",
" <td>104.706</td>\n",
" <td>1682.947</td>\n",
" <td>936.253</td>\n",
" <td>746.694</td>\n",
" <td>1682.947</td>\n",
" <td>0.475468</td>\n",
" <td>106.555</td>\n",
" <td>106.556</td>\n",
" <td>375.629</td>\n",
" <td>189.914</td>\n",
" <td>185.715</td>\n",
" <td>185.723</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" graph \\\n",
" nodes node feat size edges \n",
"Cora (Pandas) 2708 1433 5429 \n",
"Cora (IndexedArray) 2708 1433 5429 \n",
"BlogCatalog3 10351 {'group': 0, 'user': 0} 348459 \n",
"FB15k (no edge types) 14951 0 592213 \n",
"FB15k 14951 0 592213 \n",
"reddit (Pandas) 232965 602 11606919 \n",
"reddit (NumPy) 232965 602 11606919 \n",
"\n",
" explicit nodes \\\n",
" time memory (graph) \n",
"Cora (Pandas) 0.025028 15.587 \n",
"Cora (IndexedArray) 0.001163 15.585 \n",
"BlogCatalog3 0.020382 4.635 \n",
"FB15k (no edge types) 0.098957 3.970 \n",
"FB15k 0.610353 9.794 \n",
"reddit (Pandas) 3.130784 665.685 \n",
"reddit (NumPy) 0.545932 665.684 \n",
"\n",
" \\\n",
" memory (graph, not shared with data) \n",
"Cora (Pandas) 15.565 \n",
"Cora (IndexedArray) 0.041 \n",
"BlogCatalog3 7.428 \n",
"FB15k (no edge types) 2.985 \n",
"FB15k 13.398 \n",
"reddit (Pandas) 665.691 \n",
"reddit (NumPy) 104.706 \n",
"\n",
" \\\n",
" peak memory (graph) peak memory (graph, ignoring data) \n",
"Cora (Pandas) 46.765 31.079 \n",
"Cora (IndexedArray) 31.994 16.357 \n",
"BlogCatalog3 14.146 8.477 \n",
"FB15k (no edge types) 25.831 10.152 \n",
"FB15k 57.650 36.747 \n",
"reddit (Pandas) 1868.696 1121.990 \n",
"reddit (NumPy) 1682.947 936.253 \n",
"\n",
" \\\n",
" memory (data) peak memory (data) \n",
"Cora (Pandas) 15.685 31.996 \n",
"Cora (IndexedArray) 15.637 31.994 \n",
"BlogCatalog3 5.669 10.805 \n",
"FB15k (no edge types) 15.679 25.831 \n",
"FB15k 20.903 35.793 \n",
"reddit (Pandas) 746.706 1682.947 \n",
"reddit (NumPy) 746.694 1682.947 \n",
"\n",
" inferred nodes (no features) \\\n",
" time memory (graph) \n",
"Cora (Pandas) 0.002037 0.061 \n",
"Cora (IndexedArray) 0.001545 0.061 \n",
"BlogCatalog3 0.027501 4.634 \n",
"FB15k (no edge types) 0.184846 3.969 \n",
"FB15k 0.700297 9.794 \n",
"reddit (Pandas) 0.483119 106.555 \n",
"reddit (NumPy) 0.475468 106.555 \n",
"\n",
" \\\n",
" memory (graph, not shared with data) \n",
"Cora (Pandas) 0.063 \n",
"Cora (IndexedArray) 0.063 \n",
"BlogCatalog3 7.427 \n",
"FB15k (no edge types) 3.107 \n",
"FB15k 13.522 \n",
"reddit (Pandas) 106.556 \n",
"reddit (NumPy) 106.556 \n",
"\n",
" \\\n",
" peak memory (graph) peak memory (graph, ignoring data) \n",
"Cora (Pandas) 0.251 0.161 \n",
"Cora (IndexedArray) 0.251 0.161 \n",
"BlogCatalog3 14.062 8.480 \n",
"FB15k (no edge types) 34.644 19.090 \n",
"FB15k 57.651 36.873 \n",
"reddit (Pandas) 375.629 189.914 \n",
"reddit (NumPy) 375.629 189.914 \n",
"\n",
" \n",
" memory (data) peak memory (data) \n",
"Cora (Pandas) 0.090 0.198 \n",
"Cora (IndexedArray) 0.090 0.198 \n",
"BlogCatalog3 5.582 10.712 \n",
"FB15k (no edge types) 15.554 25.050 \n",
"FB15k 20.778 35.012 \n",
"reddit (Pandas) 185.715 185.723 \n",
"reddit (NumPy) 185.715 185.723 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mem_columns = raw.columns[[\"memory\" in x[1] for x in raw.columns]]\n",
"\n",
"memory_mb = raw.copy()\n",
"memory_mb[mem_columns] = (memory_mb[mem_columns] / 10 ** 6).round(3)\n",
"memory_mb"
]
},
{
"cell_type": "markdown",
"id": "30",
"metadata": {
"nbsphinx": "hidden",
"tags": [
"CloudRunner"
]
},
"source": [
"<table><tr><td>Run the latest release of this notebook:</td><td><a href=\"https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/zzz-internal-developers/graph-resource-usage.ipynb\" alt=\"Open In Binder\" target=\"_parent\"><img src=\"https://mybinder.org/badge_logo.svg\"/></a></td><td><a href=\"https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/zzz-internal-developers/graph-resource-usage.ipynb\" alt=\"Open In Colab\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a></td></tr></table>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}