README.md
# rnnmorph
[![Current version on PyPI](http://img.shields.io/pypi/v/rnnmorph.svg)](https://pypi.python.org/pypi/rnnmorph)
[![Python versions](https://img.shields.io/pypi/pyversions/rnnmorph.svg)](https://pypi.python.org/pypi/rnnmorph)
[![Tests Status](https://github.com/IlyaGusev/rnnmorph/actions/workflows/python-package.yml/badge.svg)](https://github.com/IlyaGusev/rnnmorph/actions/workflows/python-package.yml)
[![Code Climate](https://codeclimate.com/github/IlyaGusev/rnnmorph/badges/gpa.svg)](https://codeclimate.com/github/IlyaGusev/rnnmorph)
**Important**: please see https://github.com/natasha/slovnet#morphology-1
Morphological analyzer (POS tagger) for Russian and English languages based on neural networks and dictionary-lookup systems (pymorphy2, nltk).
### Contacts
* Telegram: [@YallenGusev](https://t.me/YallenGusev)
### Russian language, MorphoRuEval-2017 test dataset, accuracy
| Domain | Full tag | PoS tag | F.t. + lemma | Sentence f.t.| Sentence f.t.l. |
|:-------------|:---------|:--------|:-------------|:-------------|:----------------|
| Lenta (news) | 96.31% | 98.01% | 92.96% | 77.93% | 52.79% |
| VK (social) | 95.20% | 98.04% | 92.06% | 74.30% | 60.56% |
| JZ (lit.) | 95.87% | 98.71% | 90.45% | 73.10% | 43.15% |
| **All** | **95.81%**| **98.26%** | N/A | **74.92%** | N/A |
### English language, UD EWT test, accuracy
| Dataset | Full tag | PoS tag | F.t. + lemma | Sentence f.t.| Sentence f.t.l. |
|:-------------|:---------|:--------|:-------------|:-------------|:----------------|
| UD EWT test | 91.57% | 94.10% | 87.02% | 63.17% | 50.99% |
### Speed and memory consumption
Speed: from 200 to 600 words per second using CPU.
Memory consumption: about 500-600 MB for single-sentence predictions
### Install ###
```
pip install rnnmorph
```
### Usage ###
Example: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OowDoBnucMAdTh6cGuMt06BmFqGTCSvK)
```
from rnnmorph.predictor import RNNMorphPredictor
predictor = RNNMorphPredictor(language="ru")
forms = predictor.predict(["мама", "мыла", "раму"])
print(forms[0].pos)
>>> NOUN
print(forms[0].tag)
>>> Case=Nom|Gender=Fem|Number=Sing
print(forms[0].normal_form)
>>> мама
print(forms[0].vector)
>>> [0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1]
```
### Training ###
Simple model training:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Rh46pHS3FP8NHqdux1o2AF32U5xn1V5J)
### Acknowledgements ###
* Anastasyev D. G., Gusev I. O., Indenbom E. M., 2018, [Improving Part-of-speech Tagging Via Multi-task Learning and Character-level Word Representations](http://www.dialog-21.ru/media/4282/anastasyevdg.pdf)
* Anastasyev D. G., Andrianov A. I., Indenbom E. M., 2017, [Part-of-speech Tagging with Rich Language Description](http://www.dialog-21.ru/media/3895/anastasyevdgetal.pdf), [презентация](http://www.dialog-21.ru/media/4102/anastasyev.pdf)
* [Дорожка по морфологическому анализу "Диалога-2017"](http://www.dialog-21.ru/evaluation/2017/morphology/)
* [Материалы дорожки](https://github.com/dialogue-evaluation/morphoRuEval-2017)
* [Morphine by kmike](https://github.com/kmike/morphine), [CRF classifier for MorphoRuEval-2017 by kmike](https://github.com/kmike/dialog2017)
* [Universal Dependencies](http://universaldependencies.org/)
* Tobias Horsmann and Torsten Zesch, 2017, [Do LSTMs really work so well for PoS tagging? – A replication study](http://www.ltl.uni-due.de/wp-content/uploads/horsmannZesch_emnlp2017.pdf)
* Barbara Plank, Anders Søgaard, Yoav Goldberg, 2016, [Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss](https://arxiv.org/abs/1604.05529)