# Python_SmallWorld_API

An unofficial Python3 module to query a SmallWorld chemical space search server.

[![Documentation Status](https://readthedocs.org/projects/python-smallworld-api/badge/?version=latest)](https://python-smallworld-api.readthedocs.io/en/latest/?badge=latest)
[![https img shields io pypi v smallworld api logo python](https://img.shields.io/pypi/v/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api)   [![https img shields io pypi pyversions smallworld api logo python](https://img.shields.io/pypi/pyversions/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api)   [![https img shields io pypi wheel smallworld api logo python](https://img.shields.io/pypi/wheel/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api)   [![https img shields io pypi format smallworld api logo python](https://img.shields.io/pypi/format/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api)   [![https img shields io pypi status smallworld api logo python](https://img.shields.io/pypi/status/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api)   [![https img shields io pypi dm smallworld api logo python](https://img.shields.io/pypi/dm/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api)   [![https img shields io codeclimate maintainability matteoferla Python_SmallWorld_API logo codeclimate](https://img.shields.io/codeclimate/maintainability/matteoferla/Python_SmallWorld_API?logo=codeclimate)](https://codeclimate.com/github/matteoferla/Python_SmallWorld_API)   [![https img shields io codeclimate issues matteoferla Python_SmallWorld_API logo codeclimate](https://img.shields.io/codeclimate/issues/matteoferla/Python_SmallWorld_API?logo=codeclimate)](https://codeclimate.com/github/matteoferla/Python_SmallWorld_API)   [![https img shields io codeclimate tech debt matteoferla Python_SmallWorld_API logo codeclimate](https://img.shields.io/codeclimate/tech-debt/matteoferla/Python_SmallWorld_API?logo=codeclimate)](https://codeclimate.com/github/matteoferla/Python_SmallWorld_API)   [![https img shields io github forks matteoferla Python_SmallWorld_API label Fork style social logo 
github](https://img.shields.io/github/forks/matteoferla/Python_SmallWorld_API?label=Fork&style=social&logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github stars matteoferla Python_SmallWorld_API style social logo github](https://img.shields.io/github/stars/matteoferla/Python_SmallWorld_API?style=social&logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github watchers matteoferla Python_SmallWorld_API label Watch style social logo github](https://img.shields.io/github/watchers/matteoferla/Python_SmallWorld_API?label=Watch&style=social&logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github last commit matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/last-commit/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github license matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/license/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API/raw/master/LICENCE)   [![https img shields io github release date matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/release-date/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github commit activity m matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/commit-activity/m/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github issues matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/issues/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)   [![https img shields io github issues closed matteoferla Python_SmallWorld_API logo 
github](https://img.shields.io/github/issues-closed/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)


### Disclaimer
> This is unofficial, so please do not abuse it or use it when you cannot legally use the site!

SmallWorld is a search engine for chemical space developed by John Mayfield and Roger Sayle at [NextMove Software](https://www.nextmovesoftware.com/).
John Irwin and Brian Shoichet at UCSF (the creators and maintainers of the [ZINC](https://zinc.docking.org/) database),
host a version of it at [sw.docking.org](https://sw.docking.org/search.html) along with another NextMove Software product,
[Arthor](https://arthor.docking.org/).

## Overview

SmallWorld allows one to search for compounds similar
to a given [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)
in one of many databases, which is a very complex feat.

A copy is hosted at [sw.docking.org](https://sw.docking.org/search.html) by John Irwin, Brian Shoichet and co.
This is a free service, but it is not intended for heavy use.
To change the endpoint, one can change the class attribute `SmallWorld.base_url` to a different URL.
The folk at NextMove Software deploy instances of it for paying customers, with full support.

The API points of the site are described in
[wiki.docking.org/index.php/How_to_use_SmallWorld_API](https://wiki.docking.org/index.php/How_to_use_SmallWorld_API).

This Python3 module allows one to search it.

For searches in Arthor, ZINC and EnamineStore see elsewhere.

## Install

    pip install -q smallworld-api

## Usage
The following searches for aspirin in Enamine's make-on-demand space, Enamine REAL. The latter does not contain aspirin itself,
as it is filtered by Lipinski's rule of five (aspirin is actually [a terrible placeholder drug](https://www.blopig.com/blog/2023/08/placeholder-compounds-distraction-vs-accuracy/)).

```python
from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd  # for typehinting below

from smallworld_api import SmallWorld

print(SmallWorld.base_url)  # 'https://sw.docking.org'
aspirin = 'O=C(C)Oc1ccccc1C(=O)O'
sw = SmallWorld()
results : pd.DataFrame = sw.search(aspirin, dist=5, db=sw.REAL_dataset)

from IPython.display import display
display(results)
```

The first two import lines are optional, as the code works without rdkit. If pandas gets imported before `PandasTools`
and `Chem` is not imported in `__main__`, then display issues happen,
which can be fixed with a `from rdkit.Chem.Draw import IPythonConsole`.

So it's up to you to remember to run:

```python
PandasTools.AddMoleculeColumnToFrame(results, 'smiles', 'molecule', includeFingerprints=True)
```

The argument `db` for `.search` is a string and is the name of the database. These do seem to change,
so they get updated during initialisation or with the call:

```python
dbs: pd.DataFrame = SmallWorld.retrieve_databases()  #: pd.DataFrame (.db_choices gets updated too)
```

The dynamic properties `.REAL_dataset` and `.ZINC_dataset` simply return the best value from the presets, which may have
become out of date (unless updated).
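
The idea of "best value from the presets" can be pictured as below. This is only a sketch of the concept and not the module's actual logic: the function `pick_dataset`, the preference order and the name `REAL_DB_22Q1` are hypothetical, while the other names come from the December 2021 table in the Choices section.

```python
from typing import Iterable, Sequence


def pick_dataset(available: Iterable[str], preferences: Sequence[str]) -> str:
    """Return the first preferred database name that is still listed (hypothetical helper)."""
    available = set(available)
    for name in preferences:
        if name in available:
            return name
    raise ValueError(f'None of {list(preferences)} is currently available')


# names as listed in December 2021; they may well have changed since
listed = ['REAL_Space_21Q3_2B(public)', 'ZINC20-ForSale-21Q3-1.4B', 'REAL_DB_20Q2']
# 'REAL_DB_22Q1' is a made-up newer release, used to show the fallback
chosen = pick_dataset(listed, preferences=['REAL_DB_22Q1', 'REAL_DB_20Q2'])
print(chosen)
```

Falling back through an ordered list means a stale preset fails loudly only when every candidate has disappeared from the server.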

## Query terms

The first argument passed to `.search` can be:

* a `str` (SMILES)
* a `Chem.Mol` (rdkit is an optional requirement though)
* a list-like (sequence) or a dict-like (mapping) of the above, where the index or key becomes the name in the output
  table.
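
How the accepted query types map to names can be sketched as follows. The function `name_queries` is hypothetical and written only for illustration; in particular, naming a lone query `0` is an assumption, and the module may handle the cases differently.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def name_queries(query: Any) -> dict:
    """Normalise a query into a {name: molecule} dictionary (hypothetical helper)."""
    if isinstance(query, str):  # a str is also a Sequence, so test it first
        return {0: query}
    if isinstance(query, Mapping):  # dict-like: the keys become the names
        return dict(query)
    if isinstance(query, Sequence):  # list-like: the positions become the names
        return dict(enumerate(query))
    return {0: query}  # anything else, e.g. a single Chem.Mol


print(name_queries(['CCO', 'CCN']))
print(name_queries({'ethanol': 'CCO', 'ethylamine': 'CCN'}))
```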

See the class attribute dictionary `SmallWorld.default_submission` for what the defaults are set to, which ought to be:

    {'dist': 8,
     'tdn': 6,
     'rdn': 6,
     'rup': 2,
     'ldn': 2,
     'lup': 2,
     'maj': 6,
     'min': 6,
     'sub': 6,
     'sdist': 12,
     'tup': 6,
     'scores': 'Atom Alignment,ECFP4,Daylight'}
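
Keyword arguments such as `dist=5` in the usage example presumably take precedence over these defaults. A minimal sketch of that kind of merging is given below; `build_params` is a hypothetical helper, and rejecting unknown keys is a design choice of this sketch and not necessarily what the module does.

```python
# copy of the defaults listed above
default_submission = {'dist': 8, 'tdn': 6, 'rdn': 6, 'rup': 2, 'ldn': 2, 'lup': 2,
                      'maj': 6, 'min': 6, 'sub': 6, 'sdist': 12, 'tup': 6,
                      'scores': 'Atom Alignment,ECFP4,Daylight'}


def build_params(**overrides) -> dict:
    """Merge user-supplied values over the defaults (hypothetical helper)."""
    unknown = set(overrides) - set(default_submission)
    if unknown:  # catch typos such as 'dits' early
        raise KeyError(f'Unrecognised parameters: {sorted(unknown)}')
    # later keys win, and the defaults themselves are left untouched
    return {**default_submission, **overrides}


params = build_params(dist=5)
print(params['dist'], params['sdist'])
```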

If one is sure that the correct dataset is being used and that any raised `NoMatchError` is due to the SMILES, then one can
add, for the last case, the argument `tolerate_tolerate_NoMatchError=True`, which makes such errors be ignored bar a warning.

## Debug

The instance is set up to help with debugging; namely, it has two attributes of interest:

* `sw.last_reply`, a `requests.Response` instance
* `sw.hit_list_id`, an integer representing the search (AKA. `hlid` in the server responses)

The errors raised are generally either `requests.HTTPError`
or `smallworld.NoMatchError`. The former is raised by a `requests.Response.raise_for_status` call and means there is a
status code that isn't 200, the latter is raised by one of the various checks in `sw.get_results()`.

For the former, i.e. errors where the server replies with an HTML-formatted error page (e.g. status code 404), one can
call `sw.show_reply_as_html()` in a Jupyter notebook to view it. Generally, if you get status code 500, it is best to try again
tomorrow: the server is having a hard time and the web interface is probably not working either.

For the latter, the content of `.last_reply` should be a JSON string, so the following should give a dictionary to inspect:

```python3
reply_data: dict = sw.last_reply.json()
```

A common issue is a change in the database names. If this happens, run the following and pick a different one
(ATM, the index of the dataframe is the name to use, but in 2021 it was the `name` column):

```python3
import pandas as pd  # for typehinting below
from IPython.display import display

from smallworld_api import SmallWorld

db_table: pd.DataFrame = SmallWorld.retrieve_databases()
display(db_table)
```

There will be a "ground control to major Tom" warning in the first query. This weird reply means that the stream has
finished, but not closed or something. Ignore it.

Also, as a shorthand, `mol: Chem.Mol = SmallWorld.check_smiles(aspirin)`
can be called to check if the molecule is fine.

For extreme debugging, open Chrome, go to [sw.docking.org](https://sw.docking.org/search.html)
and open the developer tools (F12). Then go to the Network tab and do a search, e.g. with `CC2=CC=C(CNC(=O)C1CCC1)C=N2`.
The list will fill up with all the figure requests, but `/search/submit` is the first one to look at
if the issue in the traceback is with the submit method, and `/search/view` if it is with the `get_results` method.
Then simply copy the URL query off the request and use it as parameters, or compare it with your own. For example:

```python3
import urllib.parse
from smallworld_api import SmallWorld

url_query = '👾=👾&👾=👾'
expected = dict(urllib.parse.parse_qsl(url_query))

class Debug(SmallWorld):
    
    def submit_query(self, params):
        # override the method to check the parameters
        print('Going to use:', params)
        print('Missing:', set(expected).difference(params))
        return super().submit_query(expected)

d = Debug()
d.search('CCO', db=d.REAL_dataset)
```
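
For the comparison itself, a small standalone helper along these lines may be handy. It is a sketch only: `diff_params` is a hypothetical function, and the query string below is made up for illustration, not copied from a real request.

```python
import urllib.parse


def diff_params(expected: dict, actual: dict) -> dict:
    """Summarise how two parameter dictionaries differ, comparing values as strings."""
    shared = set(expected) & set(actual)
    return {
        'missing': sorted(set(expected) - set(actual)),  # sent by the site, not by us
        'extra': sorted(set(actual) - set(expected)),    # sent by us, not by the site
        'changed': {key: (expected[key], str(actual[key]))
                    for key in sorted(shared)
                    if expected[key] != str(actual[key])},
    }


# a made-up query string standing in for one copied from the Network tab
url_query = 'dist=4&sdist=12&db=some_database&scores=ECFP4'
expected = dict(urllib.parse.parse_qsl(url_query))  # all values are strings
actual = {'dist': 8, 'sdist': 12, 'tup': 6}
report = diff_params(expected, actual)
print(report)
```

Values are compared as strings because `parse_qsl` returns strings, whereas the parameters built in Python may be integers.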

If a field has changed, raise an issue and I'll update the class, or open a pull request :pray:.

## Choices

The database choices can be seen in the preset list `SmallWorld.db_choices`. This list can also be refreshed via the
classmethod `SmallWorld.retrieve_databases()`.

Two databases, `REAL_Space_21Q3_2B(public)` and `REAL_DB_20Q2`, are Enamine REAL databases
(i.e. Enamine will make the compound on request). Previously, the
repository [enamine-real-search](https://github.com/xchem/enamine-real-search) was good for this, but unfortunately
Enamine changed their endpoints, so I wrote this to take its place!
Despite the smaller number of entries, `REAL_DB_20Q2` gives the most hits and is less likely to "Major Tom out".

Likewise, the attribute `SmallWorld.sf_choices` (type list) and
the classmethod `SmallWorld.retrieve_scorefun_options()` do the same for the scoring functions.
There are fewer values, namely `['Atom Alignment', 'SMARTS Alignment', 'ECFP4', 'Daylight']`, and these
are activated by default and will be visible as columns in the resulting dataframe from a search call.

Here is the full list of databases:

```python
import pandas as pd
from IPython.display import display

from smallworld_api import SmallWorld

choices: pd.DataFrame = SmallWorld.retrieve_databases()

display(choices)
```
Which will return (as of writing on the 9th Dec 2021):

|                                               | name                       |   numEntries |   numMapped |   numUnmapped |   numSkipped | status    |
|:----------------------------------------------|:---------------------------|-------------:|------------:|--------------:|-------------:|:----------|
| REAL_Space_21Q3_All_2B_public.smi.anon        | REAL_Space_21Q3_2B(public) |   1950356098 |  1935062471 |      15293627 |            0 | Available |
| ZINC-All-2020Q2-1.46B.anon                    | ZINC-All-20Q2-1.46B        |   1468554638 |  1467030947 |       1523691 |          231 | Available |
| ZINC-For-Sale-2020Q2-1.46B.anon               | ZINC-For-Sale-20Q2-1.46B   |   1464949146 |  1463519428 |       1429718 |           22 | Available |
| ZINC20-ForSale-21Q3.smi.anon                  | ZINC20-ForSale-21Q3-1.4B   |   1479284919 |  1440784765 |      38500154 |           29 | Available |
| Enamine_REAL_Public_July_2020_Q1-2_1.36B.anon | REAL_DB_20Q2               |   1361198468 |  1350462346 |      10736122 |            0 | Available |
| Wait-OK-2020Q2-1.2B.anon                      | Wait-OK-20Q2-1.2B          |   1174063221 |  1172785190 |       1278031 |            1 | Available |
| WuXi-20Q4.smi.anon                            | WuXi-20Q4-600M             |   2353582875 |   600762581 |    1752820294 |          284 | Available |
| MculeUltimate-20Q2.smi.anon                   | MculeUltimate_20Q2_126M    |    126471523 |   126471523 |             0 |            0 | Available |
| WuXi-2020Q2-120M.anon                         | WuXi-20Q2-120M             |    339132361 |   120400570 |     218731791 |            0 | Available |
| mcule_ultimate_200407_c8bxI4.anon             | Mcule_ultimate_20Q2-126M   |    126471523 |    45589462 |      80882061 |            0 | Available |
| BB-All-2020Q2-26.7M.anon                      | BB-All-20Q2-26.7M          |     26787985 |    26707241 |         80744 |           16 | Available |
| In-Stock-2020Q2-13.8M.anon                    | In-Stock-20Q2-13.8M        |     13842485 |    13829086 |         13399 |            1 | Available |
| ZINC20-InStock-21Q3.smi.anon                  | ZINC20-InStock-21Q3-11M    |     11122445 |    11103910 |         18535 |            5 | Available |
| BBall.smi.anon                                | BB-All-21Q4-3.3M           |      3319960 |     3319705 |           255 |            6 | Available |
| BBnow.smi.anon                                | BB-Now-21Q4-2M             |      2076639 |     2076464 |           175 |            6 | Available |
| BB-Now-2020Q2-1.6M.anon                       | BB-Now-20Q2-1.6M           |      1649789 |     1649386 |           403 |            4 | Available |
| BB_50.smi.anon                                | BB-50-21Q4-1.5M            |      1483551 |     1483434 |           117 |            2 | Available |
| BB_10.smi.anon                                | BB-10-21Q4-1.2M            |      1243321 |     1243241 |            80 |            0 | Available |
| BB_40.smi.anon                                | BB-40-21Q4-590K            |       589959 |      589911 |            48 |            4 | Available |
| interesting.smi.anon                          | ZINC-Interesting-20Q2-320K |       320845 |      320773 |            72 |            1 | Available |
| ZINC-Interesting-2020Q2-300K.anon             | ZINC-Interesting-20Q2-300K |       307854 |      300765 |          7089 |            1 | Available |
| TCNMP-2020Q2-31912.anon                       | TCNMP-20Q2-31912           |        37438 |       31912 |          5526 |            0 | Available |
| BB_30.smi.anon                                | BB-30-21Q4-3K              |         3129 |        3119 |            10 |            0 | Available |
| WorldDrugs-2020Q2-3004.anon                   | WorldDrugs-20Q2-3004       |         3004 |        3003 |             1 |            0 | Available |
| HMDB-2020Q2-584.anon                          | HMDB-20Q2-584              |          585 |         584 |             1 |            0 | Available |
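
The nominal size of a database can be misleading, as the `numMapped` column shows. The sketch below computes the mapped fraction for a few rows copied from the table above; reading `numMapped` as the number of entries that are actually searchable is an assumption here, not something the server documentation is quoted as saying.

```python
# (name, numEntries, numMapped), copied from the table above
rows = [
    ('REAL_Space_21Q3_2B(public)', 1950356098, 1935062471),
    ('REAL_DB_20Q2', 1361198468, 1350462346),
    ('WuXi-20Q4-600M', 2353582875, 600762581),
    ('Mcule_ultimate_20Q2-126M', 126471523, 45589462),
]

# fraction of the entries that were mapped
mapped_fraction = {name: mapped / entries for name, entries, mapped in rows}

# print from the least to the most completely mapped
for name, fraction in sorted(mapped_fraction.items(), key=lambda item: item[1]):
    print(f'{name:<28} {fraction:.1%}')
```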

## Names

There is a Python module called [smallworld](https://github.com/benmaier/smallworld),
which implements the small world algorithm.
It is unrelated to this module and is not an API to the [sw.docking.org](https://sw.docking.org/search.html) site.

The blog of the [sw.docking.org](https://sw.docking.org/search.html) site mentions a pysmallworld.
Google finds no mention of it, so I am guessing it is a future feature.
However, I need this functionality now,
as a publicly usable example workflow of [Fragmenstein](https://github.com/matteoferla/Fragmenstein).

Also, there is a great and wacky boardgame called [Small World](https://boardgamegeek.com/boardgame/40692/small-world),
with a curious/agonising dynamic which forces you to not be a collector.