====================================
Testing, benchmarking and evaluation
====================================
Unit tests
==========
In addition to the standard unit tests located in the project's `tests` directory, Inscriptis also contains
test cases that focus solely on the HTML to text conversion. These are located in the `tests/html` directory.
These tests consist of two files:
1. `test-name.html` and
2. `test-name.txt`
The `.txt` file contains the reference text output for the given HTML file.
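The pairing of the two files can be sketched as follows. This is a minimal illustration, not the project's actual test code: the helper names `collect_test_cases` and `check_conversion` are hypothetical, and ignoring trailing whitespace in the comparison is an assumption.

```python
from pathlib import Path


def collect_test_cases(directory):
    """Return (html_path, txt_path) pairs for every complete test case."""
    cases = []
    for html_path in sorted(Path(directory).glob("*.html")):
        txt_path = html_path.with_suffix(".txt")
        # Skip HTML files that have no reference output next to them.
        if txt_path.is_file():
            cases.append((html_path, txt_path))
    return cases


def check_conversion(convert, directory):
    """Run `convert` on every HTML file and list the test names that differ."""
    failures = []
    for html_path, txt_path in collect_test_cases(directory):
        expected = txt_path.read_text(encoding="utf-8")
        actual = convert(html_path.read_text(encoding="utf-8"))
        # Assumption: trailing whitespace is not significant.
        if actual.rstrip() != expected.rstrip():
            failures.append(html_path.stem)
    return failures
```

Any callable that maps an HTML string to text can be passed as `convert`, for example `inscriptis.get_text`. An empty list returned for `tests/html` would then mean that every reference output was matched.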
Since Inscriptis 2.0, there may also be a third file named `test-name.json` in the `tests/html` directory, which contains a JSON dictionary with the keys
1. `annotation_rules` containing the annotation rules for extracting metadata from the corresponding HTML file, and
2. `result` which stores the surface forms of the extracted metadata.
Example::

    {"annotation_rules": {
        "h1": ["heading"],
        "b": ["emphasis"]
     },
     "result": [
        ["heading", "The first"],
        ["heading", "The second"],
        ["heading", "Subheading"]
     ]
    }

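A file with this layout could be loaded and sanity-checked roughly as shown here. The function `load_annotation_case` is a hypothetical helper, and the rule that every label in `result` must be produced by some annotation rule is an assumption made for illustration, not a documented requirement.

```python
import json


def load_annotation_case(json_text):
    """Parse a `test-name.json` document and return (rules, result)."""
    data = json.loads(json_text)
    rules = data["annotation_rules"]
    # Each result entry is a [label, surface form] pair.
    result = [(label, surface) for label, surface in data["result"]]
    # Assumption: every label in the result is produced by some rule.
    known_labels = {label for labels in rules.values() for label in labels}
    unknown = {label for label, _ in result} - known_labels
    if unknown:
        raise ValueError(f"labels without a rule: {sorted(unknown)}")
    return rules, result
```

The returned `result` list holds `(label, surface form)` tuples that can be compared directly against the annotations extracted from the corresponding HTML file.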
Text conversion output comparison and benchmarking
==================================================
The Inscriptis project contains a benchmarking script that compares different HTML to text conversion approaches.
The script runs the different approaches on a list of URLs, `url_list.txt`, and saves the text output into a timestamped folder in `benchmarking/benchmarking_results` for manual comparison.
Additionally, the processing speed of every approach per URL is measured and saved in a text file called `speed_comparisons.txt` in the respective timestamped folder.
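The per-approach timing described above can be illustrated with a small sketch. It is not taken from `run_benchmarking.py`: the names `time_converters` and `format_speed_table` are hypothetical, and the output format is an assumption rather than the actual layout of `speed_comparisons.txt`.

```python
from time import perf_counter


def time_converters(converters, html):
    """Run each converter on `html` and return {name: (text, seconds)}."""
    results = {}
    for name, convert in converters.items():
        start = perf_counter()
        text = convert(html)
        results[name] = (text, perf_counter() - start)
    return results


def format_speed_table(timings):
    """Render one line per approach, fastest first."""
    ordered = sorted(timings.items(), key=lambda item: item[1][1])
    return "\n".join(f"{name}: {seconds:.6f} s" for name, (_, seconds) in ordered)
```

Here `converters` maps an approach name to any callable that takes an HTML string and returns text, so the same loop can cover every approach that is enabled for a run.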
To run the benchmarking script, execute `run_benchmarking.py` from within the folder `benchmarking`.
In `def pipeline()`, select which HTML to text algorithms are executed by modifying::

    run_lynx = True
    run_justext = True
    run_html2text = True
    run_beautifulsoup = True
    run_inscriptis = True

In `url_list.txt`, specify the URLs to be parsed by adding them to the file, one per line with no additional formatting. URLs need to be complete (including http:// or https://),
e.g.::

    http://www.informationscience.ch
    https://en.wikipedia.org/wiki/Information_science
    ...

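The "one complete URL per line" requirement can be checked before a benchmarking run with a few lines of Python. This is a sketch under stated assumptions: `read_url_list` is a hypothetical helper that is not part of the benchmarking script, and skipping blank lines is a convenience chosen for this example.

```python
from urllib.parse import urlparse


def read_url_list(lines):
    """Return the complete http(s) URLs from an iterable of lines."""
    urls = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # assumption: blank lines are ignored
        parsed = urlparse(url)
        # A complete URL needs both a scheme and a host name.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"incomplete URL: {url!r}")
        urls.append(url)
    return urls
```

Because an open file object iterates over its lines, the helper can be called as `read_url_list(open("url_list.txt", encoding="utf-8"))`. An entry such as `www.example.com` without a scheme raises a `ValueError`.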