====================================
Testing, benchmarking and evaluation
====================================

Unit tests
==========
In addition to the standard unit tests located in the project's `tests` directory, Inscriptis also contains
test cases that focus solely on the HTML to text conversion. These are located in the `tests/html` directory.
Each of these tests consists of two files:

 1. `test-name.html` and
 2. `test-name.txt`

The `.txt` file contains the reference text output for the given HTML file.
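
A minimal sketch of how such file pairs could be collected and checked is shown below. The helper names `collect_test_pairs` and `check_pair` and the `convert` parameter are hypothetical and not taken from the Inscriptis test suite; ignoring trailing whitespace is likewise an assumption::

    from pathlib import Path


    def collect_test_pairs(test_dir):
        """Return (html_path, txt_path) tuples for every HTML file that
        has a reference text file with the same base name."""
        pairs = []
        for html_path in sorted(Path(test_dir).glob("*.html")):
            txt_path = html_path.with_suffix(".txt")
            if txt_path.exists():
                pairs.append((html_path, txt_path))
        return pairs


    def check_pair(html_path, txt_path, convert):
        """Convert the HTML file with `convert` (any callable mapping an
        HTML string to text) and compare it with the reference output."""
        html = html_path.read_text(encoding="utf-8")
        reference = txt_path.read_text(encoding="utf-8")
        # Assumption: trailing whitespace is not significant.
        return convert(html).rstrip() == reference.rstrip()

With Inscriptis installed, `inscriptis.get_text` could be passed as `convert`.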

Since Inscriptis 2.0 there may also be a third file named `test-name.json` in the `tests/html` directory, which contains a JSON dictionary with the keys

 1. `annotation_rules`, containing the annotation rules for extracting metadata from the corresponding HTML file, and
 2. `result`, which stores the surface forms of the extracted metadata.


Example::

    {"annotation_rules": {
        "h1": ["heading"],
        "b": ["emphasis"]
     },
     "result": [
        ["heading", "The first"],
        ["heading", "The second"],
        ["heading", "Subheading"]
     ]
    }
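
As an illustration, a file with this layout could be loaded and validated with the standard `json` module. The function `load_annotation_test` and the constant `REQUIRED_KEYS` are hypothetical names used only for this sketch::

    import json

    REQUIRED_KEYS = {"annotation_rules", "result"}


    def load_annotation_test(json_text):
        """Parse a test-name.json document and return (rules, result)."""
        data = json.loads(json_text)
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        # JSON arrays become lists; convert each result entry to a tuple
        # so that (label, surface form) pairs can be compared easily.
        result = [tuple(entry) for entry in data["result"]]
        return data["annotation_rules"], result


    example = """
    {"annotation_rules": {"h1": ["heading"], "b": ["emphasis"]},
     "result": [["heading", "The first"], ["heading", "The second"]]}
    """
    rules, result = load_annotation_test(example)

Here `rules` maps HTML tag names to lists of annotation labels, and `result` holds the expected (label, surface form) pairs.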


Text conversion output comparison and benchmarking
==================================================
The Inscriptis project contains a benchmarking script that compares different HTML to text conversion approaches.
The script runs the different approaches on a list of URLs, `url_list.txt`, and saves the text output into a timestamped folder in `benchmarking/benchmarking_results` for manual comparison.
Additionally, the processing speed of every approach is measured per URL and saved in a text file called `speed_comparisons.txt` in the respective timestamped folder.
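
The idea of timing every approach on the same input can be sketched as follows. This is not the code of `run_benchmarking.py`; the function `time_converters` and the stand-in converters are assumptions made for illustration only::

    import time


    def time_converters(converters, html):
        """Run every converter on the same HTML string and return a dict
        mapping the converter name to (text, elapsed_seconds)."""
        results = {}
        for name, convert in converters.items():
            start = time.perf_counter()
            text = convert(html)
            results[name] = (text, time.perf_counter() - start)
        return results


    # Stand-in converters; a real run would plug in the actual
    # HTML to text libraries instead.
    converters = {"upper": str.upper, "strip": str.strip}
    timings = time_converters(converters, " <p>x</p> ")
    for name, (text, elapsed) in timings.items():
        print(f"{name}: {elapsed:.6f}s")

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.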

To run the benchmarking script, execute `run_benchmarking.py` from within the `benchmarking` folder.
In `def pipeline()`, select which HTML to text algorithms are executed by modifying::

   run_lynx = True
   run_justext = True
   run_html2text = True
   run_beautifulsoup = True
   run_inscriptis = True

The URLs to be parsed are specified in `url_list.txt`, one per line with no additional formatting. URLs need to be complete (including `http://` or `https://`),
e.g.::

   http://www.informationscience.ch
   https://en.wikipedia.org/wiki/Information_science
   ...
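
A URL list in this format could be read and checked for completeness as sketched below. The function `read_url_list` is hypothetical, and both skipping blank lines and rejecting incomplete URLs with an exception are assumptions rather than documented behaviour of the benchmarking script::

    def read_url_list(lines):
        """Return the URLs found in `lines` (one URL per line)."""
        urls = []
        for line in lines:
            url = line.strip()
            if not url:
                # Assumption: blank lines are ignored.
                continue
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"incomplete URL: {url!r}")
            urls.append(url)
        return urls


    sample = [
        "http://www.informationscience.ch\n",
        "https://en.wikipedia.org/wiki/Information_science\n",
    ]
    urls = read_url_list(sample)

An open file object can be passed directly as `lines`, for example `read_url_list(open("url_list.txt", encoding="utf-8"))`.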