KarrLab/bpforms

View on GitHub
examples/2. Extended tutorial.ipynb

Summary

Maintainability
Test Coverage
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`BpForms` is a toolkit for unambiguously describing the primary sequence of biopolymers such as DNA, RNA, and proteins, including modified DNA, RNA, and proteins. BpForms represents biopolymers as monomeric forms linked.\n",
    "This tutorial illustrates how to use the `BpForms` Python API. Please see the [documentation](https://docs.karrlab.org/bpforms/) for more information about the `BpForms` grammar and more instructions for using the `BpForms` website, JSON REST API, and command line interface."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import `BpForms`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import bpforms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create instances of `BpForm`\n",
    "Use the [`BpForm` grammar](https://docs.karrlab.org/bpforms) and the `BpForm.from_str` method to create an instance of `BpForm` to represent a form of a biopolymer.\n",
    "\n",
    "`BpForms` includes six predefined alphabet and six corresponding pre-defined subclasses of `BpForm`:\n",
    "* Canonical DNA nucleotide monophosphates (`bpforms.canonical_dna_alphabet`, `bpforms.CanonicalDnaForm`): canonical nucleotide monophosphates\n",
    "* Canonical RNA nucleotide monophosphates (`bpforms.canonical_rna_alphabet`, `bpforms.CanonicalRnaForm`): canonical nucleotide monophosphates\n",
    "* Canonical protein amino acids (`bpforms.canonical_protein_alphabet`, `bpforms.CanonicalProteinForm`): canonical protein amino acids\n",
    "* DNA nucleotide monophosphates (`bpforms.dna_alphabet`, `bpforms.DnaForm`): canonical nucleotide monophosphates plus non-canonical nucleotide monophosphates based on [DNAmod](https://dnamod.hoffmanlab.org)\n",
    "* RNA nucleotide monophosphates (`bpforms.rna_alphabet`, `bpforms.RnaForm`): canonical nucleotide monophosphates plus non-canonical nucleotide monophosphates based on [MODOMICS](http://modomics.genesilico.pl/modifications/) and the [RNA Modification Database](https://mods.rna.albany.edu/mods/)\n",
    "* Protein amino acids (`bpforms.protein_alphabet`, `bpforms.ProteinForm`): canonical protein amino acids plus non-canonical protein amino acids based on the [PDB Chemical Component Dictionary](http://www.wwpdb.org/data/ccd) and [RESID](https://pir.georgetown.edu/resid/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create `BpForm`s composed canonical monomeric forms\n",
    "Monomeric forms defined in the alphabets can be referenced by their single character codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_form = bpforms.DnaForm().from_str('ACGT')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create `BpForm`s that includes non-canonical monomeric forms\n",
    "Some of the non-canonical monomeric forms in the alphabets are represented by multiple characters. Their character codes must be delimited by curly brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_form = bpforms.DnaForm().from_str('A{m2C}GT')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create `BpForm`s that include monomeric forms that are not defined in the alphabet\n",
    "Additional monomeric forms can be described in square brackets using one or more attributes separated by vertical pipes (\"|\"):\n",
    "* `id`\n",
    "* `name`\n",
    "* `synonym`\n",
    "* `identifier`\n",
    "* `structure`: canonical SMILES-encoded string that represents the structure of the monomeric form\n",
    "* `r-bond-atom`: list of atoms involved in bond(s) with monomeric forms to the left\n",
    "* `r-displaced-atom`: list of atoms displaced by the formation of bond(s) with monomeric forms to the left\n",
    "* `l-bond-atom`: list of atoms involved in bond(s) with monomeric forms to the right\n",
    "* `l-displaced-atom`: list of atoms displaced by the formation of bond(s) with monomeric forms to the right\n",
    "* `delta-mass`: additional mass in Daltons beyond that described by the `structure` attribute; used to represent uncertainty in the structure of the monomeric form.\n",
    "* `delta-charge`: additional charge beyond that described by the `structure` attribute; used to represent uncertainty in the structure of the monomeric form.\n",
    "* `position`: represents uncertainty in the location of a non-canonical monomeric form.\n",
    "* `base-monomer`: represents the parent monomeric form of the monomeric form (e.g. the parent of m2A is A)\n",
    "* `comments`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_form = bpforms.DnaForm().from_str(\n",
    "    '''A[\n",
    "         id: \"m2C\" \n",
    "         | name: \"2-O-methylcytosine\"\n",
    "         | synonym: \"4-amino-2-methoxypyrimidine\"\n",
    "         | synonym: \"o-2-methylcytosine\"\n",
    "         | identifier: \"CHEBI:70854\" @ \"chebi\"\n",
    "         | structure: \"COC1=NC(=CCN1C1CC(C(O1)COP(=O)([O-])[O-])O)N\"\n",
    "         | r-bond-atom: O20\n",
    "         | l-bond-atom: P16\n",
    "         | r-displaced-atom: H20\n",
    "         | l-displaced-atom: O19-1\n",
    "         | comments: \"Methylation of deoxycytidine\"\n",
    "         | base-monomer: \"C\"\n",
    "         | position: 2-3\n",
    "        ]GT'''.replace('\\n', '').replace(' ', ''))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get and set monomeric forms and slices of monomeric forms of biopolymers\n",
    "Individual residues and slices of residues can be get and set similar to lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bpforms.core.Monomer at 0x7f2d4d094908>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_form[2] = bpforms.dna_alphabet.monomers.A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<bpforms.core.Monomer at 0x7f2d4d094908>,\n",
       " <bpforms.core.Monomer at 0x7f2d4d03f198>]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form[2:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_form[2:4] = bpforms.DnaForm().from_str('TA')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get and set the bases of monomeric forms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{<bpforms.core.Monomer at 0x7f2d4d086940>}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form[1].base_monomers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate the major protonation state of biopolymer forms at specific pHs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-5"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "microspecies = dna_form.get_major_micro_species(8.)\n",
    "microspecies.GetTotalCharge()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-2"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "microspecies = dna_form.get_major_micro_species(3.)\n",
    "microspecies.GetTotalCharge()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculate physical properties of biopolymer forms\n",
    "`BpForms` can calculate the length, formula, molecular weight, and charge of the biopolymer forms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(dna_form)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C40H49N15O24P4'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(dna_form.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1247.808047992"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-5"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form.get_charge()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get IUPAC/IUBMB representations of biopolymer forms\n",
    "`BpForms` can generate IUPAC/IUBMB representations of forms. This is useful for understanding the semantic meaning of a form using tools such as BLAST. Where annotated, this method uses the `base_monomers` attribute to represent modified monomeric forms using the codes for their roots (e.g. m2A is represented as \"A\"). Monomeric forms without annotated bases are represented as \"N\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ACTA'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form.get_canonical_seq()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ACGT'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form_2 = bpforms.DnaForm().from_str(\n",
    "    '''A[\n",
    "         id: \"m2C\" \n",
    "         | name: \"2-O-methylcytosine\"\n",
    "         | synonym: \"4-amino-2-methoxypyrimidine\"\n",
    "         | synonym: \"o-2-methylcytosine\"\n",
    "         | identifier: \"CHEBI:70854\" @ \"chebi\"\n",
    "         | structure: \"COC1=NC(=CCN1C1CC(C(O1)COP(=O)([O-])[O-])O)N\"\n",
    "         | r-bond-atom: O20\n",
    "         | l-bond-atom: P16\n",
    "         | r-displaced-atom: H20\n",
    "         | l-displaced-atom: O19-1\n",
    "         | comments: \"Methylation of deoxycytidine\"\n",
    "         | base-monomer: \"C\"\n",
    "         | position: 2-3\n",
    "        ]GT'''.replace('\\n', '').replace(' ', ''))\n",
    "dna_form_2.get_canonical_seq()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ANGT'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form_2 = bpforms.DnaForm().from_str(\n",
    "    '''A[\n",
    "         id: \"m2C\" \n",
    "         | name: \"2-O-methylcytosine\"\n",
    "         | synonym: \"4-amino-2-methoxypyrimidine\"\n",
    "         | synonym: \"o-2-methylcytosine\"\n",
    "         | identifier: \"CHEBI:70854\" @ \"chebi\"\n",
    "         | structure: \"COC1=NC(=CCN1C1CC(C(O1)COP(=O)([O-])[O-])O)N\"\n",
    "         | r-bond-atom: O20\n",
    "         | l-bond-atom: P16\n",
    "         | r-displaced-atom: H20\n",
    "         | l-displaced-atom: O19-1\n",
    "         | comments: \"Methylation of deoxycytidine\"\n",
    "         | position: 2-3\n",
    "        ]GT'''.replace('\\n', '').replace(' ', ''))\n",
    "dna_form_2.get_canonical_seq()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Determine if biopolymers are equal\n",
    "Use the `is_equal` method to check if two biopolymers are equal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "dna_form_1 = bpforms.DnaForm().from_str('ACGT')\n",
    "dna_form_2 = bpforms.DnaForm().from_str('ACGT')\n",
    "dna_form_3 = bpforms.DnaForm().from_str('GCTC')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form_1.is_equal(dna_form_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dna_form_1.is_equal(dna_form_3)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}