README.md
# Python_SmallWorld_API
An unofficial Python3 module to query a SmallWorld chemical space search server.
[![Documentation Status](https://readthedocs.org/projects/python-smallworld-api/badge/?version=latest)](https://python-smallworld-api.readthedocs.io/en/latest/?badge=latest)
[![https img shields io pypi v smallworld api logo python](https://img.shields.io/pypi/v/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api) [![https img shields io pypi pyversions smallworld api logo python](https://img.shields.io/pypi/pyversions/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api) [![https img shields io pypi wheel smallworld api logo python](https://img.shields.io/pypi/wheel/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api) [![https img shields io pypi format smallworld api logo python](https://img.shields.io/pypi/format/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api) [![https img shields io pypi status smallworld api logo python](https://img.shields.io/pypi/status/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api) [![https img shields io pypi dm smallworld api logo python](https://img.shields.io/pypi/dm/smallworld--api?logo=python)](https://pypi.org/project/smallworld--api) [![https img shields io codeclimate maintainability matteoferla Python_SmallWorld_API logo codeclimate](https://img.shields.io/codeclimate/maintainability/matteoferla/Python_SmallWorld_API?logo=codeclimate)](https://codeclimate.com/github/matteoferla/Python_SmallWorld_API) [![https img shields io codeclimate issues matteoferla Python_SmallWorld_API logo codeclimate](https://img.shields.io/codeclimate/issues/matteoferla/Python_SmallWorld_API?logo=codeclimate)](https://codeclimate.com/github/matteoferla/Python_SmallWorld_API) [![https img shields io codeclimate tech debt matteoferla Python_SmallWorld_API logo codeclimate](https://img.shields.io/codeclimate/tech-debt/matteoferla/Python_SmallWorld_API?logo=codeclimate)](https://codeclimate.com/github/matteoferla/Python_SmallWorld_API) [![https img shields io github forks matteoferla Python_SmallWorld_API label Fork style social logo github](https://img.shields.io/github/forks/matteoferla/Python_SmallWorld_API?label=Fork&style=social&logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github stars matteoferla Python_SmallWorld_API style social logo github](https://img.shields.io/github/stars/matteoferla/Python_SmallWorld_API?style=social&logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github watchers matteoferla Python_SmallWorld_API label Watch style social logo github](https://img.shields.io/github/watchers/matteoferla/Python_SmallWorld_API?label=Watch&style=social&logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github last commit matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/last-commit/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github license matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/license/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API/raw/master/LICENCE) [![https img shields io github release date matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/release-date/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github commit activity m matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/commit-activity/m/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github issues matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/issues/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API) [![https img shields io github issues closed matteoferla Python_SmallWorld_API logo github](https://img.shields.io/github/issues-closed/matteoferla/Python_SmallWorld_API?logo=github)](https://github.com/matteoferla/Python_SmallWorld_API)
### Disclaimer
> This is Unofficial So please do not abuse it or use it when you cannot legally use the site!
SmallWorld is a search engine for chemical space developed by John Mayfield and Roger Sayle at [NextMove Software](https://www.nextmovesoftware.com/).
John Irwin and Brian Shoichet at UCSF (the creators and maintainers of the [ZINC](https://zinc.docking.org/) database),
host a version of it at [sw.docking.org](https://sw.docking.org/search.html) along with another NextMove Software product,
[Arthor](https://arthor.docking.org/).
## Overview
SmallWorld allows one to search for similar compounds
to a give [SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)
in one of many databases โa very complex feat.
A copy is hosted at [sw.docking.org](https://sw.docking.org/search.html) by John Irwin, Brian Shoichet and co.
This is a free service, but it is not intended for heavy use.
To change the endpoint, one can change the class attribute `SmallWorld.base_url` to a different URL.
The folk at NextMove Software deploy instances of it for paying customers โand with full support.
The API points of the site are described in
[wiki.docking.org/index.php/How_to_use_SmallWorld_API](https://wiki.docking.org/index.php/How_to_use_SmallWorld_API).
This Python3 module allows one to search it.
For searches in Arthor, Zinc and EnamineStore see elsewhere.
## Install
pip install -q smallworld-api
## Usage
The following searches for Aspirin in Enamine's make-on-demand space, Enamine REAL, which does not contain it
as the latter is filtered by Lipinski's rule of five (Aspirin is actually [a terrible placeholder drug](https://www.blopig.com/blog/2023/08/placeholder-compounds-distraction-vs-accuracy/))
```python
from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd # for typehinting below
from smallworld_api import SmallWorld
print(SmallWorld.base_url) # 'https://sw.docking.org'
aspirin = 'O=C(C)Oc1ccccc1C(=O)O'
sw = SmallWorld()
results : pd.DataFrame = sw.search(aspirin, dist=5, db=sw.REAL_dataset)
from IPython.display import display
display(results)
```
The first two import lines are optional as the code works without rdkit. If pandas gets imported before PandasTools
and Chem imported not in _main_ then display issues happen,
which can be fixed with a `from rdkit.Chem.Draw import IPythonConsole`.
So it's up to you to remember to run:
```python
PandasTools.AddMoleculeColumnToFrame(results, 'smiles', 'molecule', includeFingerprints=True)
```
The argument `db` for `.search` is a string and is the name of the database. These do seem to change,
so they get updated during initialisation or with the call:
```python
dbs: pd.DataFrame = SmallWorld.retrieve_databases() #: pd.DataFrame (.db_choices gets updated too)
```
The dynamic properties `.REAL_dataset` and `.ZINC_dataset` simply return the best value from the presets, which may have
become out of date (unless updated).
## Query terms
The first argument passed to `.search` can be:
* a `str` (SMILES)
* a `Chem.Mol` (rdkit is an optional requirement though)
* a list-like (sequence) or a dict-like (mapping) of the above, where the index or key becomes the name in the output
table.
See the class attribute dictionary `SmallWorld.default_submission` for what the defaults are set to, which ought to be:
{'dist': 8,
'tdn': 6,
'rdn': 6,
'rup': 2,
'ldn': 2,
'lup': 2,
'maj': 6,
'min': 6,
'sub': 6,
'sdist': 12,
'tup': 6,
'scores': 'Atom Alignment,ECFP4,Daylight'}
If one is sure that the correct dataset is being used and any raised `NoMatchError` is due to the SMILES, then once can
add for the last case the argument `tolerate_tolerate_NoMatchError=True`, which makes them ignored bar for a warning.
## Debug
The instantiation is set up so for debugging, namely it has two attributes of interest:
* `sw.last_reply`, a `requests.Response` instance
* `sw.hit_list_id` an integer representing the search (AKA. `hlid` in the server responses)
The errors raised are generally either `requests.HTTPError`
or `smallworld.NoMatchError`. The former is raised by a `requests.Response.raise_for_status` call and means there is a
status code that isn't 200, the latter is raised by one of the various checks in `sw.get_results()`.
For the former errors, i.e. those by a serverside-declared HTML-formatting error (eg. status code 404), if one is in a
Jupyter notebook one can do `sw.show_reply_as_html()`. Generally if you get status code 500, it is best to try again
tomorrow as the server is having a hard time and is probably not okay on the web.
For the latter, the result in `.last_reply` should be a JSON string, therefore should give something like this:
```python3
reply_data: dict = sw.last_reply.json()
```
A common issue is the change in database names, therefore do do and pick a different one
(ATM, the index of the dataframe is the name to use, but in 2021 it was the `name`)
```python3
from IPython.display import display
from smallworld_api import SmallWorld
db_table : pd.DataFrame = SmallWorld.retrieve_databases()
display( db_table )
```
There will be a "ground control to major Tom" warning in the first query. This weird reply means that the stream has
finished, but not closed or something. Ignore it.
Also, as a shorthand, `mol: Chem.Mol = SmallWorld.check_smiles(aspirin)`
can be called to check if the molecules is fine.
For extreme debugging, open Chrome and go to [sw.docking.org](https://sw.docking.org/search.html)
and open the developer tools (F12). Then go to the Network tab and do a search, eg. with `CC2=CC=C(CNC(=O)C1CCC1)C=N2`,
this will be populated by all the figure requests, but `/search/submit` will be the first one to look at
if the issue is with the submit method in the trace, `/search/view` if it's with the `get_results` method.
Then simply copy the url query off the request and use it as parameters or compare them etc. For example:
```python3
import urllib.parse
from smallworld_api import SmallWorld
url_query = '๐พ=๐พ&๐พ=๐พ'
expected = dict(urllib.parse.parse_qsl(url_query))
class Debug(SmallWorld):
def submit_query(self, params):
# override the method to check the parameters
print('Going to use:', params)
print('Missing:', set(expected).difference(params))
return super().submit_query(expected)
d = Debug()
d.search('CCO', db=d.REAL_dataset)
```
If it's a field that change, raise an issue and I'll update the class or do a pull request :pray:.
## Choices
The database choices can seen with the preset list `SmallWorld.db_choices`. But also this can be recached via the
classmethod `SmallWorld.retrieve_databases()`.
Two databases, `REAL_Space_21Q3_2B(public)` and `REAL_DB_20Q2`, are Enamine REAL databases
(aka. Enamine will make the compound on request). Previously, the
repository, [enamine-real-search](https://github.com/xchem/enamine-real-search) was good for this, but unfortunately
Enamine changed their endpoints. So I wrote this to take its place!
Despite the smaller number of entries, `REAL_DB_20Q2` gives the most hits and is less likely to "Major Tom out".
Likewise, the attribute `SmallWorld.sf_choices` (type list) and
the classmethod `SmallWorld.retrieve_scorefun_options()` do the same.
The values are less and are: `['Atom Alignment', 'SMARTS Alignment', 'ECFP4', 'Daylight']`, but these
are activated by default and will be visible as columns in the resulting dataframe from a search call.
Here is the full list of databases:
```python
import pandas as pd
choices: pd.DataFrame = SmallWorld.retrieve_databases()
display(choices)
```
Which will return (as of writing on the 9th Dec 2021):
| | name | numEntries | numMapped | numUnmapped | numSkipped | status |
|:----------------------------------------------|:---------------------------|-------------:|------------:|--------------:|-------------:|:----------|
| REAL_Space_21Q3_All_2B_public.smi.anon | REAL_Space_21Q3_2B(public) | 1950356098 | 1935062471 | 15293627 | 0 | Available |
| ZINC-All-2020Q2-1.46B.anon | ZINC-All-20Q2-1.46B | 1468554638 | 1467030947 | 1523691 | 231 | Available |
| ZINC-For-Sale-2020Q2-1.46B.anon | ZINC-For-Sale-20Q2-1.46B | 1464949146 | 1463519428 | 1429718 | 22 | Available |
| ZINC20-ForSale-21Q3.smi.anon | ZINC20-ForSale-21Q3-1.4B | 1479284919 | 1440784765 | 38500154 | 29 | Available |
| Enamine_REAL_Public_July_2020_Q1-2_1.36B.anon | REAL_DB_20Q2 | 1361198468 | 1350462346 | 10736122 | 0 | Available |
| Wait-OK-2020Q2-1.2B.anon | Wait-OK-20Q2-1.2B | 1174063221 | 1172785190 | 1278031 | 1 | Available |
| WuXi-20Q4.smi.anon | WuXi-20Q4-600M | 2353582875 | 600762581 | 1752820294 | 284 | Available |
| MculeUltimate-20Q2.smi.anon | MculeUltimate_20Q2_126M | 126471523 | 126471523 | 0 | 0 | Available |
| WuXi-2020Q2-120M.anon | WuXi-20Q2-120M | 339132361 | 120400570 | 218731791 | 0 | Available |
| mcule_ultimate_200407_c8bxI4.anon | Mcule_ultimate_20Q2-126M | 126471523 | 45589462 | 80882061 | 0 | Available |
| BB-All-2020Q2-26.7M.anon | BB-All-20Q2-26.7M | 26787985 | 26707241 | 80744 | 16 | Available |
| In-Stock-2020Q2-13.8M.anon | In-Stock-20Q2-13.8M | 13842485 | 13829086 | 13399 | 1 | Available |
| ZINC20-InStock-21Q3.smi.anon | ZINC20-InStock-21Q3-11M | 11122445 | 11103910 | 18535 | 5 | Available |
| BBall.smi.anon | BB-All-21Q4-3.3M | 3319960 | 3319705 | 255 | 6 | Available |
| BBnow.smi.anon | BB-Now-21Q4-2M | 2076639 | 2076464 | 175 | 6 | Available |
| BB-Now-2020Q2-1.6M.anon | BB-Now-20Q2-1.6M | 1649789 | 1649386 | 403 | 4 | Available |
| BB_50.smi.anon | BB-50-21Q4-1.5M | 1483551 | 1483434 | 117 | 2 | Available |
| BB_10.smi.anon | BB-10-21Q4-1.2M | 1243321 | 1243241 | 80 | 0 | Available |
| BB_40.smi.anon | BB-40-21Q4-590K | 589959 | 589911 | 48 | 4 | Available |
| interesting.smi.anon | ZINC-Interesting-20Q2-320K | 320845 | 320773 | 72 | 1 | Available |
| ZINC-Interesting-2020Q2-300K.anon | ZINC-Interesting-20Q2-300K | 307854 | 300765 | 7089 | 1 | Available |
| TCNMP-2020Q2-31912.anon | TCNMP-20Q2-31912 | 37438 | 31912 | 5526 | 0 | Available |
| BB_30.smi.anon | BB-30-21Q4-3K | 3129 | 3119 | 10 | 0 | Available |
| WorldDrugs-2020Q2-3004.anon | WorldDrugs-20Q2-3004 | 3004 | 3003 | 1 | 0 | Available |
| HMDB-2020Q2-584.anon | HMDB-20Q2-584 | 585 | 584 | 1 | 0 | Available |
## Names
There is a Python module called [smallworld](https://github.com/benmaier/smallworld),
which implements the small world algorithm.
This is not an API to the [sw.docking.org](https://sw.docking.org/search.html) site.
The blog of the [sw.docking.org](https://sw.docking.org/search.html) site mentions a pysmallworld.
There is no mention of this in Google so I am guessing it is for a future feature?
I however need to use this now as
I need it as a publicly usable example workflow of [Fragmenstein](https://github.com/matteoferla/Fragmenstein).
Also, there is a great and wacky boardgame called [Small World](https://boardgamegeek.com/boardgame/40692/small-world),
with a curious/agonising dynamic which forces you to not be a collector.