KarrLab/bcforms

examples/2. Extended tutorial.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`BcForms` is a toolkit for concretely describing the primary structure of macromolecular complexes, including non-canonical monomeric forms and intra and inter-subunit crosslinks. `BcForms` includes a textual grammar for describing complexes and a Python library, a command line program, and a REST API for validating and manipulating complexes described in this grammar.\n",
    "`BcForms` represents complexes as sets of subunits, with their stoichiometries, and covalent crosslinks which link the subunits.\n",
    "DNA, RNA, and protein subunits can be represented using `BpForms`. Small molecule subunits can be represented using `openbabel.OBMol`, typically imported from SMILES or InChI.\n",
    "This Jupyter notebook illustrates how to use the `BcForms` Python library via some simple and complex examples. Please see the [documentation](https://docs.karrlab.org/bcforms/) for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import `BcForms` to represent complexes and other modules to represent subunits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import bcforms # represent complexes\n",
    "import bpforms # represent DNA, RNA, and protein subunits\n",
    "import openbabel # represent small molecule subunits and import their descriptions from SMILES, InChI, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple `BcForms` examples\n",
    "This section illustrates how to use the `BcForms` grammar with several simple examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Representing, validating, and calculating properties of a homodimer with no crosslinks\n",
    "#### Representing a homodimer with no crosslinks\n",
    "This example illustrates how to use `BcForms` to describe a homodimer of subunit `unit_1`, which is the peptide tri-alanine (AAA)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "bc_form_1 = bcforms.BcForm().from_str('2 * unit_1')\n",
    "bc_form_1.set_subunit_attribute('unit_1', 'structure', bpforms.ProteinForm().from_str('AAA'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Validate the `BcForm` representation of the homodimer\n",
    "The `BcForm.validate` method can be used to check that there are no errors in the `BcForms` representation of the homodimer.\n",
    "The `validate` method will return a list of any errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1.validate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Calculate physical properties and structure of the `Bcforms` object\n",
    "`BcForms` can also be used to calculate properties of complexes, such as their atom-bond structure, formula, molecular weight, and charge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C18H36N6O8'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_1.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "464.52000000000004"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1.get_charge()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C[C@H]([NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)O.C[C@H]([NH3+])C(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)O'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1.export()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Representing a complex when concrete structure is unknown\n",
    "Sometimes, the detailed structure of the subunits are not known. In that case, `BcForms` can still represent the complexes and hold the known information such as formula, molecular weight, and charges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "bc_form_1_u = bcforms.BcForm().from_str('2 * unit_u')\n",
    "bc_form_1_u.set_subunit_attribute('unit_u', 'formula', 'C9H18N3O4')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Calculate physical properties of the `Bcforms` object when structure is unknown\n",
    "When structure of subunits is not known, if formula is known, `BcForms` can calculate the formula and molecular weight of the complex. If only molecular weight or charge of subunits is known, then `BcForms` can only calculate the respective properties of the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C18H36N6O8'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_1_u.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "464.52000000000004"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1_u.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check if two biocomplexes are equal\n",
    "The `BcForm.is_equal` method can be used to check whether two complexes are semantically equal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1_t = bcforms.BcForm().from_str('unit_1 + unit_1')\n",
    "bc_form_1_t.set_subunit_attribute('unit_1', 'structure', bpforms.ProteinForm().from_str('AAA'))\n",
    "bc_form_1.is_equal(bc_form_1_t)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_1_f = bcforms.BcForm().from_str('3 * unit_1')\n",
    "bc_form_1_f.set_subunit_attribute('unit_1', 'structure', bpforms.ProteinForm().from_str('AAA'))\n",
    "bc_form_1.is_equal(bc_form_1_f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Representing, validating, and calculating properties of a homodimer with a disulfide bond with inline definition\n",
    "This example illustrates how to use `BcForms` to represent a homodimer of subunit `unit_2`, which is the single amino acid cysteine, with a disulfide bond between the cysteines. The crosslink can be either defined inline or described using our ontology of crosslinks. This example illustrates how to define a disulfide bond inline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "bc_form_2_a = bcforms.BcForm().from_str('2 * unit_2'\n",
    "                                      '| x-link: [ l-bond-atom: unit_2(1)-1S11 |'\n",
    "                                                 ' l-displaced-atom: unit_2(1)-1H11 |'\n",
    "                                                 ' r-bond-atom: unit_2(2)-1S11 |'\n",
    "                                                 ' r-displaced-atom: unit_2(2)-1H11 ]')\n",
    "bc_form_2_a.set_subunit_attribute('unit_2', 'structure', bpforms.ProteinForm().from_str('C'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Validate the `BcForms` representation of the homodimer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_a.validate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Calculate physical properties and structure of the `Bcforms` object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C6H14N2O4S2'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_2_a.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "242.308"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_a.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_a.get_charge()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OC(=O)[C@@H]([NH3+])CSSC[C@@H](C(=O)O)[NH3+]'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_a.export()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Representing, validating, and calculating properties of a homodimer with a disulfide bond with our ontology of crosslinks\n",
    "Alternatively, crosslinks can be described using our ontology of crosslinks. This enables more compact descriptions of complexes. This example illustrates how to use our ontology to describe the same complex. The list of crosslinks defined in the ontology is available at [https://www.bpforms.org/crosslink.html](https://www.bpforms.org/crosslink.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "bc_form_2_b = bcforms.BcForm().from_str('2 * unit_2'\n",
    "                                      '| x-link: [ type: disulfide |'\n",
    "                                                 ' l: unit_2(1)-1 |'\n",
    "                                                 ' r: unit_2(2)-1 ]')\n",
    "bc_form_2_b.set_subunit_attribute('unit_2', 'structure', bpforms.ProteinForm().from_str('C'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Validate the `BcForms` representation of the homodimer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_b.validate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Calculate physical properties and structure of the `Bcforms` object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C6H14N2O4S2'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_2_b.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "242.308"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_b.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_b.get_charge()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OC(=O)[C@@H]([NH3+])CSSC[C@@H](C(=O)O)[NH3+]'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_2_b.export()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples of real macromolecular complexes\n",
    "This section illustrates how to use `BcForms` to represent real protein complexes that are annotated in the PDB and UniProt."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Protein complexes with disulfide bonds\n",
    "Disulfide bonds are one of the best characterized types of intermolecular crosslinks. Disulfide bonds are covalent bonds between cysteine residues in proteins. For more information, see [UniProt](https://www.uniprot.org/help/disulfid).\n",
    "`BpForms` can be used to represent intrachain disulfide bonds, and `BcForms` can be used to represent interchain disulfide bonds.\n",
    "Here, we demonstrate how to use `BcForms` to represent disulfide bonds via three examples: a parallel homodimer, an anti-parallel homodimer, and a heterodimer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Parallel homodimer\n",
    "This example illustrates how to use `BcForms` to represent bone morphogenetic protein 2-A (bmp2-a, [P25703](https://www.uniprot.org/uniprot/P25703)) from *Xenopus laevis*, which forms a parallel homodimer with a disulfide link at C-362 after post-translational processing to remove all of the amino acids before 285th residue and after the 398th residue.\n",
    "##### Representing and calculating properties of the subunits\n",
    "First, we illustrate how to use `BpForms` to represent the post-translationally processed subunits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_chain(fasta, chain_idx):\n",
    "    chain_fasta = fasta[chain_idx[0] - 1:chain_idx[1]]\n",
    "    chain = bpforms.ProteinForm().from_str(chain_fasta)\n",
    "    return chain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "p25703_fasta = ('MVAGIHSLLLLLFYQVLLSGCTGLIPEEGKRKYTESGRSSPQQSQRVLNQFELRLLSMFG'\n",
    "                'LKRRPTPGKNVVIPPYMLDLYHLHLAQLAADEGTSAMDFQMERAASRANTVRSFHHEESM'\n",
    "                'EEIPESREKTIQRFFFNLSSIPNEELVTSAELRIFREQVQEPFESDSSKLHRINIYDIVK'\n",
    "                'PAAAASRGPVVRLLDTRLVHHNESKWESFDVTPAIARWIAHKQPNHGFVVEVNHLDNDKN'\n",
    "                'VPKKHVRISRSLTPDKDNWPQIRPLLVTFSHDGKGHALHKRQKRQARHKQRKRLKSSCRR'\n",
    "                'HPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSVNTNIPK'\n",
    "                'ACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR')\n",
    "p25703_chain_idx = (285, 398)\n",
    "p25703_chain = get_chain(p25703_fasta, p25703_chain_idx)\n",
    "assert len(p25703_chain) == p25703_chain_idx[1] - p25703_chain_idx[0] + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C570H895N164O154S9'"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(p25703_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12797.964"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p25703_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing and calculating properties of the complex\n",
    "Second, we illustrate how to use `BcForms` to represent the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert str(p25703_chain)[362 - p25703_chain_idx[0] + 1 - 1] == p25703_fasta[362-1]\n",
    "str_bmp2a = ('2 * p25703'\n",
    "             ' | x-link: [ l-bond-atom: p25703(1)-{}S11 |'\n",
    "                         ' r-bond-atom: p25703(2)-{}S11 |' \n",
    "                         ' l-displaced-atom: p25703(1)-{}H11 |'\n",
    "                         ' r-displaced-atom: p25703(2)-{}H11 ]'.format( \\\n",
    "                           362 - p25703_chain_idx[0] + 1, 362 - p25703_chain_idx[0] + 1, \\\n",
    "                           362 - p25703_chain_idx[0] + 1, 362 - p25703_chain_idx[0] + 1))\n",
    "bc_form_bmp2a = bcforms.BcForm().from_str(str_bmp2a)\n",
    "bc_form_bmp2a.set_subunit_attribute('p25703', 'structure', p25703_chain)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this `BcForms`, we can calculate properties of the complex such as its formula, molecular weight, and charge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C1140H1788N328O308S18'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_bmp2a.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "25593.911999999997"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_bmp2a.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown above, the formula of the complex is 2 hydrogen atoms less than 2 times the formula of the subunit, and the molecular weight is approximately 2 Da less than 2 times that of the subunit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Anti-parallel homodimer\n",
    "This following example illustrates how `BcForms` can be used to represent anti-parallel homodimer of disintegrin schistatin ([P83658](https://www.uniprot.org/uniprot/P83658)) from *Echis carinatus*, with the subunits linked by disulfide bonds between C-7 and C-12.\n",
    "##### Representing the subunits with `BpForms`\n",
    "First, we illustrate how `BpForms` can be used to represent the structure of the subunits and calculate properties of the subunits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "p83658_fasta = ('NSVHPCCDPVICEPREGEHCISGPCCENCYFLNSGTICKRARGDGNQDYCTGITPDCPRN'\n",
    "                'RYNV')\n",
    "p83658_chain_idx = (1, 64)\n",
    "p83658_chain = get_chain(p83658_fasta, p83658_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C290H456N91O89S10'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(p83658_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6961.986"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p83658_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Second, we illustrate how `BcForms` can be used to represent the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "str_dids = ('2 * p83658'\n",
    "            ' | x-link: [ l-bond-atom: p83658(1)-{}S11 |'\n",
    "                        ' r-bond-atom: p83658(2)-{}S11 |'\n",
    "                        ' l-displaced-atom: p83658(1)-{}H11 |'\n",
    "                        ' r-displaced-atom: p83658(2)-{}H11 ]'\n",
    "            ' | x-link: [ l-bond-atom: p83658(1)-{}S11 |'\n",
    "                        ' r-bond-atom: p83658(2)-{}S11 |'\n",
    "                        ' l-displaced-atom: p83658(1)-{}H11 |'\n",
    "                        ' r-displaced-atom: p83658(2)-{}H11 ]'.format( \\\n",
    "                          7 - p83658_chain_idx[0] + 1, 12 - p83658_chain_idx[0] + 1, \\\n",
    "                          7 - p83658_chain_idx[0] + 1, 12 - p83658_chain_idx[0] + 1, \\\n",
    "                          7 - p83658_chain_idx[0] + 1, 12 - p83658_chain_idx[0] + 1, \\\n",
    "                          7 - p83658_chain_idx[0] + 1, 12 - p83658_chain_idx[0] + 1))\n",
    "bc_form_dids = bcforms.BcForm().from_str(str_dids)\n",
    "bc_form_dids.set_subunit_attribute('p83658', 'structure', p83658_chain)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As illustrated below, `BcForms` can also be used to compute physical properties of the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C580H908N182O178S20'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_dids.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13919.94"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_dids.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown above, the formula of the complex is 4 hydrogens less than 2 times the formula of the subunit, and the molecular weight of the complex is approximately 4 Da less than 2 times that of the subunit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Heterodimer\n",
    "Here, we illustrate how `BcForms` can be used to represent snaclec botrocetin of *Bothrops jajaraca*, a heterodimer of subunits alpha ([P22029](https://www.uniprot.org/uniprot/P22029)) and beta ([P22030](https://www.uniprot.org/uniprot/P22030)) which are linked by a disulfide bond between C-80 of P22029 and C-75 of P22030.\n",
    "##### Representing the subunits with `BpForms`\n",
    "First, we illustrate how `BpForms` can be used to represent and calculate properties of the two subunits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "p22029_fasta = ('DCPSGWSSYEGNCYKFFQQKMNWADAERFCSEQAKGGHLVSIKIYSKEKDFVGDLVTKNI'\n",
    "                'QSSDLYAWIGLRVENKEKQCSSEWSDGSSVSYENVVERTVKKCFALEKDLGFVLWINLYC'\n",
    "                'AQKNPFVCKSPPP')\n",
    "p22029_chain_idx = (1, 133)\n",
    "p22029_chain = get_chain(p22029_fasta, p22029_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C682H1048N176O187S8'"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(p22029_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14961.410999999998"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p22029_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "p22030_fasta = ('DCPPDWSSYEGHCYRFFKEWMHWDDAEEFCTEQQTGAHLVSFQSKEEADFVRSLTSEMLK'\n",
    "                'GDVVWIGLSDVWNKCRFEWTDGMEFDYDDYYLIAEYECVASKPTNNKWWIIPCTRFKNFV'\n",
    "                'CEFQA')\n",
    "p22030_chain_idx = (1, 125)\n",
    "p22030_chain = get_chain(p22030_fasta, p22030_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C682H972N166O178S10'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(p22030_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14664.862000000001"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p22030_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Below, we illustrate how to use `BcForms` to represent the heterodimer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "str_sle = ('p22029 + p22030'\n",
    "           ' | x-link: [ l-bond-atom: p22029(1)-{}S11 |'\n",
    "                       ' r-bond-atom: p22030(1)-{}S11 |'\n",
    "                       ' l-displaced-atom: p22029(1)-{}H11 |'\n",
    "                       ' r-displaced-atom: p22030(1)-{}H11 ]'.format( \\\n",
    "                         80 - p22029_chain_idx[0] + 1, 75 - p22030_chain_idx[0] + 1, \\\n",
    "                         80 - p22029_chain_idx[0] + 1, 75 - p22030_chain_idx[0] + 1))\n",
    "bc_form_sle = bcforms.BcForm().from_str(str_sle)\n",
    "bc_form_sle.set_subunit_attribute('p22029', 'structure', p22029_chain)\n",
    "bc_form_sle.set_subunit_attribute('p22030', 'structure', p22030_chain)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As illustrated below, `BcForms` can also be used to compute the formula and molecular weight of the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C1364H2018N342O365S18'"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_sle.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "29624.256999999998"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_sle.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Protein complexes with other types of crosslinks\n",
    "DNA, RNA, and protein polymers can be covalent bound by a variety of other types of crosslinks in addition to disulfide bonds. See [UniProt](https://www.uniprot.org/help/crosslnk) for more information.\n",
    "The PDB, UniProt, and other database contain detailed information about crosslinks among DNA, RNA, proteins, and small molecules.\n",
    "Here, we illustrate how `BcForms` can be used to concretely represent complexes that contain several types of these crosslinks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sumoylation\n",
    "Sumoylation is a common interchain crosslink that binds small SUMO peptides to proteins to retain proteins in the nucleus.\n",
    "In this example, we illustrate how `BcForms` can be used to represent the sumoylation of human chromatin assembly factor 1 subunit A ([Q13111](https://www.uniprot.org/uniprot/Q13111)) by [P63165](https://www.uniprot.org/uniprot/P63165). The sumoylation takes place between K-182 of Q13111 and G-97 of P63165, forming a Glycyl lysine isopeptide bond ([AA0125](https://annotation.dbi.udel.edu/cgi-bin/resid?id=AA0125)).\n",
    "##### Representing the subunits with `BpForms`\n",
    "First, we illustrate how `BpForms` can be used to describe the subunits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "q13111_fasta = ('MLEELECGAPGARGAATAMDCKDRPAFPVKKLIQARLPFKRLNLVPKGKADDMSDDQGTS'\n",
    "                'VQSKSPDLEASLDTLENNCHVGSDIDFRPKLVNGKGPLDNFLRNRIETSIGQSTVIIDLT'\n",
    "                'EDSNEQPDSLVDHNKLNSEASPSREAINGQREDTGDQQGLLKAIQNDKLAFPGETLSDIP'\n",
    "                'CKTEEEGVGCGGAGRRGDSQECSPRSCPELTSGPRMCPRKEQDSWSEAGGILFKGKVPMV'\n",
    "                'VLQDILAVRPPQIKSLPATPQGKNMTPESEVLESFPEEDSVLSHSSLSSPSSTSSPEGPP'\n",
    "                'APPKQHSSTSPFPTSTPLRRITKKFVKGSTEKNKLRLQRDQERLGKQLKLRAEREEKEKL'\n",
    "                'KEEAKRAKEEAKKKKEEEKELKEKERREKREKDEKEKAEKQRLKEERRKERQEALEAKLE'\n",
    "                'EKRKKEEEKRLREEEKRIKAEKAEITRFFQKPKTPQAPKTLAGSCGKFAPFEIKEHMVLA'\n",
    "                'PRRRTAFHPDLCSQLDQLLQQQSGEFSFLKDLKGRQPLRSGPTHVSTRNADIFNSDVVIV'\n",
    "                'ERGKGDGVPERRKFGRMKLLQFCENHRPAYWGTWNKKTALIRARDPWAQDTKLLDYEVDS'\n",
    "                'DEEWEEEEPGESLSHSEGDDDDDMGEDEDEDDGFFVPHGYLSEDEGVTEECADPENHKVR'\n",
    "                'QKLKAKEWDEFLAKGKRFRVLQPVKIGCVWAADRDCAGDDLKVLQQFAACFLETLPAQEE'\n",
    "                'QTPKASKRERRDEQILAQLLPLLHGNVNGSKVIIREFQEHCRRGLLSNHTGSPRSPSTTY'\n",
    "                'LHTPTPSEDAAIPSKSRLKRLISENSVYEKRPDFRMCWYVHPQVLQSFQQEHLPVPCQWS'\n",
    "                'YVTSVPSAPKEDSGSVPSTGPSQGTPISLKRKSAGSMCITQFMKKRRHDGQIGAEDMDGF'\n",
    "                'QADTEEEEEEEGDCMIVDVPDAAEVQAPCGAASGAGGGVGVDTGKATLTASPLGAS')\n",
    "q13111_chain_idx = (1, 956)\n",
    "q13111_chain = get_chain(q13111_fasta, q13111_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C4613H7563N1356O1324S35'"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(q13111_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "104328.51500000001"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "q13111_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "p63165_fasta = ('MSDQEAKPSTEDLGDKKEGEYIKLKVIGQDSSEIHFKVKMTTHLKKLKESYCQRQGVPMN'\n",
    "                'SLRFLFEGQRIADNHTPKELGMEEEDVIEVYQEQTGGHSTV')\n",
    "p63165_chain_idx = (2, 97)\n",
    "p63165_chain = get_chain(p63165_fasta, p63165_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C478H778N131O139S4'"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(p63165_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10712.499999999998"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p63165_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Second, we illustrate how `BcForms` can be used to represent the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "str_sumo_a = ('q13111 + p63165'\n",
    "           ' | x-link: [ l-bond-atom: q13111-{}N1-1 |'\n",
    "                       ' r-bond-atom: p63165-{}C2 |'\n",
    "                       ' l-displaced-atom: q13111-{}H1+1 |'\n",
    "                       ' l-displaced-atom: q13111-{}H1 |' \n",
    "                       ' r-displaced-atom: p63165-{}O1 |'\n",
    "                       ' r-displaced-atom: p63165-{}H1 ]'.format( \\\n",
    "                         182 - q13111_chain_idx[0] + 1, 97 - p63165_chain_idx[0] + 1, \\\n",
    "                         182 - q13111_chain_idx[0] + 1, 182 - q13111_chain_idx[0] + 1, \\\n",
    "                         97 - p63165_chain_idx[0] + 1, 97 - p63165_chain_idx[0] + 1))\n",
    "bc_form_sumo_a = bcforms.BcForm().from_str(str_sumo_a)\n",
    "bc_form_sumo_a.set_subunit_attribute('q13111', 'structure', q13111_chain)\n",
    "bc_form_sumo_a.set_subunit_attribute('p63165', 'structure', p63165_chain)"
   ]
  },
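  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The crosslink positions above are given relative to each processed chain rather than to the full UniProt sequence, which is why the format arguments subtract the start of the chain and add one. As a minimal sketch of that conversion, the hypothetical helper `to_chain_position` below (not part of `BcForms` or `BpForms`) assumes that both the UniProt position and the chain bounds are 1-based and inclusive, as in the `*_chain_idx` tuples used in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_chain_position(uniprot_position, chain_idx):\n",
    "    # Hypothetical helper (not part of BcForms): convert a 1-based UniProt\n",
    "    # residue number into a 1-based position within the processed chain.\n",
    "    start, end = chain_idx\n",
    "    if not start <= uniprot_position <= end:\n",
    "        raise ValueError('residue {} lies outside chain {}-{}'.format(\n",
    "            uniprot_position, start, end))\n",
    "    return uniprot_position - start + 1\n",
    "\n",
    "# Lys182 of Q13111 (chain 1-956) and Gly97 of P63165 (chain 2-97)\n",
    "lys_position = to_chain_position(182, (1, 956))\n",
    "gly_position = to_chain_position(97, (2, 97))\n",
    "(lys_position, gly_position)"
   ]
  },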
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C5091H8338N1487O1462S39'"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_sumo_a.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "115021.99200000001"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_sumo_a.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with our ontology of crosslinks\n",
    "Alternatively, the glycyl lysine isopeptide bond can be described using our ontology of crosslinks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "str_sumo_b = ('p63165 + q13111'\n",
    "           ' | x-link: [ type: glycyl_lysine_isopeptide |'\n",
    "                       ' l: p63165-{} |'\n",
    "                       ' r: q13111-{} ]'.format( \\\n",
    "                         97 - p63165_chain_idx[0] + 1, 182 - q13111_chain_idx[0] + 1))\n",
    "bc_form_sumo_b = bcforms.BcForm().from_str(str_sumo_b)\n",
    "bc_form_sumo_b.set_subunit_attribute('q13111', 'structure', q13111_chain)\n",
    "bc_form_sumo_b.set_subunit_attribute('p63165', 'structure', p63165_chain)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C5091H8338N1487O1462S39'"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_sumo_b.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "115021.99200000001"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_sumo_b.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pupylation\n",
    "Proteins can be tagged for proteasomal degradation through the formation of crosslinks with Pup protein tags.\n",
    "This example illustrates how `BcForms` can be used to represent the pupylation of 10 kDa chaperonin ([P9WPE5](https://www.uniprot.org/uniprot/P9WPE5)) of *Mycobacterium tuberculosis* by prokaryotic ubiquitin-like protein Pup ([P9WHN5](https://www.uniprot.org/uniprot/P9WHN5)), involving an isoglutamyl lysine isopeptide bond ([AA0124](https://annotation.dbi.udel.edu/cgi-bin/resid?id=AA0124)).\n",
    "##### Representing the subunits with `BpForms`\n",
    "First, we illustrate how `BpForms` can be used to represent the subunits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "P9WHN5_fasta = ('MAQEQTKRGGGGGDDDDIAGSTAAGQERREKLTEETDDLLDEIDDVLEENAEDFVRAYVQ'\n",
    "                'KGGQ')\n",
    "P9WHN5_chain_idx = (1, 64)\n",
    "P9WHN5_chain = get_chain(P9WHN5_fasta, P9WHN5_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C285H463N85O96S'"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(P9WHN5_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6648.398"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P9WHN5_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "P9WPE5_fasta = ('MAKVNIKPLEDKILVQANEAETTTASGLVIPDTAKEKPQEGTVVAVGPGRWDEDGEKRIP'\n",
    "                'LDVAEGDTVIYSKYGGTEIKYNGEEYLILSARDVLAVVSK')\n",
    "P9WPE5_chain_idx = (2, 100)\n",
    "P9WPE5_chain = get_chain(P9WPE5_fasta, P9WPE5_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C473H780N123O138'"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(P9WPE5_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10398.166"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P9WPE5_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Second, we illustrate how `BcForms` can be used to represent and calculate properties of the complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The lysine at UniProt position 100 of P9WPE5 bonds to the glutamine at position 64 of P9WHN5\n",
    "str_pup = ('p9whn5 + p9wpe5'\n",
    "           ' | x-link: [ l-bond-atom: p9wpe5-{}N1-1 |'\n",
    "                       ' r-bond-atom: p9whn5-{}C2 |'\n",
    "                       ' l-displaced-atom: p9wpe5-{}H1+1 |'\n",
    "                       ' l-displaced-atom: p9wpe5-{}H1 |'\n",
    "                       ' r-displaced-atom: p9whn5-{}N1 |'\n",
    "                       ' r-displaced-atom: p9whn5-{}H1 |'\n",
    "                       ' r-displaced-atom: p9whn5-{}H1 ]'.format( \\\n",
    "                         100 - P9WPE5_chain_idx[0] + 1, 64 - P9WHN5_chain_idx[0] + 1, \\\n",
    "                         100 - P9WPE5_chain_idx[0] + 1, 100 - P9WPE5_chain_idx[0] + 1, \\\n",
    "                         64 - P9WHN5_chain_idx[0] + 1, 64 - P9WHN5_chain_idx[0] + 1, \\\n",
    "                         64 - P9WHN5_chain_idx[0] + 1))\n",
    "bc_form_pup = bcforms.BcForm().from_str(str_pup)\n",
    "bc_form_pup.set_subunit_attribute('p9whn5', 'structure', P9WHN5_chain)\n",
    "bc_form_pup.set_subunit_attribute('p9wpe5', 'structure', P9WPE5_chain)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C758H1239N207O234S'"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_pup.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "17028.52499999999"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_pup.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Protein-cofactor/coenzyme complexes\n",
    "Some enzymes require non-protein substances to function properly. These non-protein components of enzyme complexes can be either inorganic or organic. The inorganic components, such as metal ions, are often termed cofactors, while the organic components, such as certain vitamins, are often termed coenzymes. For more information, see [UniProt](https://www.uniprot.org/help/cofactor). `BcForms` can represent complexes that consist of protein and non-protein components."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Protein-cofactor complexes\n",
    "In this example, we demonstrate the `BcForms` representation of a simple protein-cofactor complex: mitochondrial ubiquinol oxidase 1a ([Q39219](https://www.uniprot.org/uniprot/Q39219)) of *Arabidopsis thaliana*. The protein can bind two iron ions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the subunits with `BpForms`\n",
    "First, we represent the protein subunit Q39219 using `BpForms`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "Q39219_fasta = ('MMITRGGAKAAKSLLVAAGPRLFSTVRTVSSHEALSASHILKPGVTSAWIWTRAPTIGGM'\n",
    "                'RFASTITLGEKTPMKEEDANQKKTENESTGGDAAGGNNKGDKGIASYWGVEPNKITKEDG'\n",
    "                'SEWKWNCFRPWETYKADITIDLKKHHVPTTFLDRIAYWTVKSLRWPTDLFFQRRYGCRAM'\n",
    "                'MLETVAAVPGMVGGMLLHCKSLRRFEQSGGWIKALLEEAENERMHLMTFMEVAKPKWYER'\n",
    "                'ALVITVQGVFFNAYFLGYLISPKFAHRMVGYLEEEAIHSYTEFLKELDKGNIENVPAPAI'\n",
    "                'AIDYWRLPADATLRDVVMVVRADEAHHRDVNHFASDIHYQGRELKEAPAPIGYH')\n",
    "Q39219_chain_idx = (63, 354)\n",
    "Q39219_chain = get_chain(Q39219_fasta, Q39219_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C1508H2342N408O389S13'"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(Q39219_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "32828.571"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Q39219_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Q39219_chain.get_charge()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Then, we can represent the complex with `BcForms`. The structure of the iron ion cofactor can be provided either as a SMILES-encoded string or as an `openbabel.OBMol`. Here, we use a SMILES-encoded string for simplicity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "str_procof = ('q39219 + 2 * fe')\n",
    "bc_form_procof = bcforms.BcForm().from_str(str_procof)\n",
    "bc_form_procof.set_subunit_attribute('q39219', 'structure', Q39219_chain)\n",
    "bc_form_procof.set_subunit_attribute('fe', 'structure', '[Fe+2]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C1508Fe2H2342N408O389S13'"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_procof.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "32940.261000000006"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_procof.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "42"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_procof.get_charge()"
   ]
  },
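  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because no crosslinks are declared in this complex, its charge and molecular weight should simply be the stoichiometry-weighted sums over the subunits. The sketch below checks this by hand using the values reported in the cells above. The dictionaries and the standard atomic weight of iron (55.845) are assumptions introduced for illustration and are not part of the `BcForms` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Protein values are copied from the outputs above; the iron values are\n",
    "# assumptions (charge from the SMILES '[Fe+2]', standard atomic weight 55.845).\n",
    "stoichiometry = {'q39219': 1, 'fe': 2}\n",
    "charge = {'q39219': 38, 'fe': 2}\n",
    "mol_wt = {'q39219': 32828.571, 'fe': 55.845}\n",
    "\n",
    "# Sum each property over the subunits, weighted by stoichiometry\n",
    "complex_charge = sum(n * charge[s] for s, n in stoichiometry.items())\n",
    "complex_mol_wt = sum(n * mol_wt[s] for s, n in stoichiometry.items())\n",
    "(complex_charge, round(complex_mol_wt, 3))"
   ]
  },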
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Protein-coenzyme complexes\n",
    "In this example, we use `BcForms` to represent human peroxisomal sarcosine oxidase ([Q9P0Z9](https://www.uniprot.org/uniprot/Q9P0Z9)). The protein is a monomer and binds one flavin adenine dinucleotide (FAD) molecule per subunit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the subunits with `BpForms`\n",
    "First, we represent the protein subunit Q9P0Z9 using `BpForms`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "Q9P0Z9_fasta = ('MAAQKDLWDAIVIGAGIQGCFTAYHLAKHRKRILLLEQFFLPHSRGSSHGQSRIIRKAYL'\n",
    "                'EDFYTRMMHECYQIWAQLEHEAGTQLHRQTGLLLLGMKENQELKTIQANLSRQRVEHQCL'\n",
    "                'SSEELKQRFPNIRLPRGEVGLLDNSGGVIYAYKALRALQDAIRQLGGIVRDGEKVVEINP'\n",
    "                'GLLVTVKTTSRSYQAKSLVITAGPWTNQLLRPLGIEMPLQTLRINVCYWREMVPGSYGVS'\n",
    "                'QAFPCFLWLGLCPHHIYGLPTGEYPGLMKVSYHHGNHADPEERDCPTARTDIGDVQILSS'\n",
    "                'FVRDHLPDLKPEPAVIESCMYTNTPDEQFILDRHPKYDNIVIGAGFSGHGFKLAPVVGKI'\n",
    "                'LYELSMKLTPSYDLAPFRISRFPSLGKAHL')\n",
    "Q9P0Z9_chain_idx = (1, 390)\n",
    "Q9P0Z9_chain = get_chain(Q9P0Z9_fasta, Q9P0Z9_chain_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C1981H3153N553O515S17'"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(Q9P0Z9_chain.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "43502.390999999996"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Q9P0Z9_chain.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Then, we can represent the complex with `BcForms`. Again, we use a SMILES-encoded string to represent the structure of FAD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "str_procoe = ('q9p0z9 + fad')\n",
    "bc_form_procoe = bcforms.BcForm().from_str(str_procoe)\n",
    "bc_form_procoe.set_subunit_attribute('q9p0z9', 'structure', Q9P0Z9_chain)\n",
    "bc_form_procoe.set_subunit_attribute('fad', 'structure', 'c12cc(C)c(C)cc1N=C3C(=O)NC(=O)N=C3N2C[C@H](O)[C@H](O)[C@H](O)COP(=O)(O)OP(=O)(O)OC[C@@H]4[C@@H](O)[C@@H](O)[C@@H](O4)n5cnc6c5ncnc6N')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C2008H3186N562O530P2S17'"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_procoe.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "44287.947523995994"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_procoe.get_mol_wt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nucleic acid macromolecules with interstrand crosslinks\n",
    "`BcForms` can concretely describe nucleic acid complexes as well as protein complexes. This is particularly important because interstrand crosslinks in nucleic acid complexes have significant biological consequences. For example, interstrand crosslinks in DNA can prevent proper transcription and replication, a mechanism that is widely used by chemotherapeutic drugs. For a thorough review of DNA interstrand crosslinks, see [Deans and West, 2011](https://doi.org/10.1038/nrc3088).\n",
    "Here, we demonstrate the `BcForms` representation of a DNA interstrand crosslink using a simple example: cisplatin crosslinking two guanines. Cisplatin is a clinically used crosslinking agent that induces interstrand crosslinks between two guanines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing subunits with `BpForms`\n",
    "First, we represent the two guanines in the complex using `BpForms`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "guanine = bpforms.DnaForm().from_str('G')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C10H12N5O7P'"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(guanine.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "345.20776199799997"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "guanine.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-2"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "guanine.get_charge()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OC1CC(OC1COP(=O)([O-])[O-])n1cnc2c1nc(N)[nH]c2=O'"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "guanine.export(format='smiles')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "We can represent the complex as the subunit composition and the crosslinks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
    "bc_form_dnaisc = bcforms.BcForm().from_str('2 * g + cisplatin'\n",
    "                                          '| x-link: [ l-bond-atom: g(1)-1N15'\n",
    "                                                    '| r-bond-atom: cisplatin-1Pt5'\n",
    "                                                    '| r-displaced-atom: cisplatin-1Cl6]'\n",
    "                                          '| x-link: [ l-bond-atom: g(2)-1N15'\n",
    "                                                    '| r-bond-atom: cisplatin-1Pt5'\n",
    "                                                    '| r-displaced-atom: cisplatin-1Cl7]')\n",
    "bc_form_dnaisc.set_subunit_attribute('g', 'structure', guanine)\n",
    "bc_form_dnaisc.set_subunit_attribute('cisplatin', 'structure', '[NH3+]-[Pt-2](Cl)(Cl)[NH3+]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OC1CC(OC1COP(=O)([O-])[O-])n1cn(c2c1nc(N)[nH]c2=O)[Pt-2](n1cn(C2CC(O)C(O2)COP(=O)([O-])[O-])c2c1c(=O)[nH]c(n2)N)([NH3+])[NH3+]'"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_dnaisc.export()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C20H30N12O14P2Pt'"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_dnaisc.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "919.5615239959998"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_dnaisc.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-4"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_dnaisc.get_charge()"
   ]
  },
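  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The formula reported above can be cross-checked by hand: it should equal two guanine nucleotides plus cisplatin, minus the two chlorine atoms displaced by the crosslinks. The sketch below does this bookkeeping with `collections.Counter`. The cisplatin composition is read off the SMILES string used above, and the helper `hill_formula` is hypothetical, not part of `BcForms`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Compositions taken from this notebook: guanine from its reported formula,\n",
    "# cisplatin read off the SMILES '[NH3+]-[Pt-2](Cl)(Cl)[NH3+]'.\n",
    "guanine_atoms = Counter(C=10, H=12, N=5, O=7, P=1)\n",
    "cisplatin_atoms = Counter(H=6, N=2, Pt=1, Cl=2)\n",
    "\n",
    "total = Counter()\n",
    "for atoms, count in ((guanine_atoms, 2), (cisplatin_atoms, 1)):\n",
    "    for element, n in atoms.items():\n",
    "        total[element] += count * n\n",
    "total.subtract({'Cl': 2})  # one Cl is displaced by each crosslink\n",
    "total = +total  # drop elements whose count fell to zero\n",
    "\n",
    "def hill_formula(atoms):\n",
    "    # Hypothetical helper: carbon, then hydrogen, then the rest alphabetically\n",
    "    order = [e for e in ('C', 'H') if e in atoms]\n",
    "    order += sorted(e for e in atoms if e not in ('C', 'H'))\n",
    "    return ''.join(e + (str(atoms[e]) if atoms[e] > 1 else '') for e in order)\n",
    "\n",
    "hill_formula(total)"
   ]
  },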
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Protein-nucleic acid macromolecules with interstrand crosslinks\n",
    "Protein-DNA crosslinks can also form in cells, and several mechanisms for their formation are known. For a review, see [Ji et al., 2016](https://doi.org/10.1021%2Facs.accounts.5b00056). Interestingly, some agents that induce DNA-DNA crosslinks, such as cisplatin and nitrogen mustard, can also induce protein-DNA crosslinks. `BcForms` can concretely represent these complexes.\n",
    "In this example, we show the `BcForms` representation of a simple crosslink between guanine and lysine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing subunits with `BpForms`\n",
    "First, we represent the guanine and the lysine in the complex using `BpForms`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "guanine = bpforms.DnaForm().from_str('G')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OC1CC(OC1COP(=O)([O-])[O-])n1cnc2c1nc(N)[nH]c2=O'"
      ]
     },
     "execution_count": 93,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "guanine.export(format='smiles')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "lysine = bpforms.ProteinForm().from_str('K')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'[NH3+]CCCC[C@@H](C(=O)O)[NH3+]'"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lysine.export(format='smiles')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Representing the complex with `BcForms`\n",
    "Then, we can represent the complex as the subunits and the crosslinks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "bc_form_prodna = bcforms.BcForm().from_str('guanine + cisplatin + lysine'\n",
    "                                          '| x-link: [ l-bond-atom: guanine-1N15'\n",
    "                                                    '| r-bond-atom: cisplatin-1Pt5'\n",
    "                                                    '| r-displaced-atom: cisplatin-1Cl6]'\n",
    "                                          '| x-link: [ l-bond-atom: lysine-1N1-1' \n",
    "                                                    '| r-bond-atom: cisplatin-1Pt5'\n",
    "                                                    '| l-displaced-atom: lysine-1H1+1'\n",
    "                                                    '| l-displaced-atom: lysine-1H1'\n",
    "                                                    '| r-displaced-atom: cisplatin-1Cl7]')\n",
    "bc_form_prodna.set_subunit_attribute('guanine', 'structure', guanine)\n",
    "bc_form_prodna.set_subunit_attribute('cisplatin', 'structure', '[NH3+]-[Pt-2](Cl)(Cl)[NH3+]')\n",
    "bc_form_prodna.set_subunit_attribute('lysine', 'structure', lysine)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'OC1CC(OC1COP(=O)([O-])[O-])n1cn(c2c1nc(N)[nH]c2=O)[Pt-2]([NH3+])([NH3+])NCCCC[C@@H](C(=O)O)[NH3+]'"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_prodna.export()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C16H32N9O9PPt'"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "str(bc_form_prodna.get_formula())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "720.5437619979998"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_prodna.get_mol_wt()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bc_form_prodna.get_charge()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}