fair-search/fairsearch-elasticsearch-plugin

View on GitHub
docs/usage.rst

Summary

Maintainability
Test Coverage
How to use the fair-search plugin
***************************

Now it is time to finally perform a fair re-scoring. 

The usual flow for the fairsearch plug-in is this one:

* a user executes a query in the search engine, and during this process, 
* indicates s/he wants to apply the fairsearch plug-in.

To achive this we are going to use a functionality provided by Elasticsearch named *re-scoring*.

=======================
Assumptions and preconditions for this example
=======================

Lets suppose we have already in our search engine this set of documents:

.. code-block:: json

    Doc1 { body: "hello hello hello hello hello hello hello hello hello hello", gender: "m" }
    Doc3 { body: "hello hello hello hello hello hello hello hello hello bye", gender: "m" }
    Doc5 { body: "hello hello hello hello hello hello hello hello bye bye", gender: "m" }
    Doc7 { body: "hello hello hello hello hello hello hello bye bye bye", gender: "m" }
    Doc9 { body: "hello hello hello hello hello hello bye bye bye bye", gender: "m" }
    Doc2 { body: "hello hello hello hello hello bye bye bye bye bye", gender: "f" }
    Doc4 { body: "hello hello hello hello bye bye bye bye bye bye", gender: "f" }
    Doc6 { body: "hello hello hello bye bye bye bye bye bye bye", gender: "f" }
    Doc8 { body: "hello hello bye bye bye bye bye bye bye bye", gender: "f" }
    Doc10 { body: "hello bye bye bye bye bye bye bye bye bye", gender: "f" }

In this example, women will be our protected category. As we see in the "body" of the documents above, the word "hello" occurs more in the ones having ``gender=m`` (male) than in the ones having ``gender=f`` (female). 

=======================
How does a search looks like
=======================

Lets first imagine we execute a normal search for "hello", one without using the fairsearch plug-in. The results would look like this:

.. code-block:: json

    GET test/_search
    {
        "query": {
            "match": {
                "body": "hello"
            }
        }
    }


This request will return all documents that match the word *hello*, sorted by their relevance scoring. For this particular dataset we would get these results:

    Doc1, Doc3, Doc5, Doc7, Doc9, Doc2, Doc4, Doc6, Doc8, Doc10

and if we take a closer look, these will be:

    m, m, m, m, m, f, f, f, f, f

with all men as first top results, however as we could see in the :doc:`theory` section, there are many situations where we might aim for a fair result. To achieve this we will use the plug-in.

A request with the re-score function will look like this:

.. code-block:: json

    GET test/_search
    {
        "query": {
            "match": {
                "body": "hello"
            }
        },
        "rescore": {
            "fair_rescorer": {
                "protected_key": "gender",
                "protected_value": "f",
                "significance_level": 0.1,
                "min_proportion_protected": 0.6
            }
        }
    }

this request is actually doing an Elasticsearch *match* query (it also could do any other type of query, for example a *bool* or a *multi match*), then after the results are calculated (in every shard) it applies the fair topK algorithm.

This request will give you a response where the target number of protected elements will be scored in relevant places, that for our example will be:

    Doc1, Doc3, Doc2, Doc5, Doc4, Doc7, Doc9, Doc6, Doc8, Doc10
    
or, in terms of gender:

    m, m, f, m, f, m, m, f, f, f

with a much fairer distribution of elements of the protected class (i.e., some women appear in the top positions).

=======================
Details on parameters
=======================

As we saw in the :doc:`theory`, for the fair query to work, we would need an *mtable* against which the re-scoring will be executed. To make the use of the plug-in easier, the re-score function does that in the background. So, the previous re-score call will create an *mtable* with the following parameters:

- ``k=10``, as 10 is the default documents that are returned by Elasticsearch
- ``α=0.1``, as that is the significance level specified in the query
- ``p=0.6``, as that is the minimum proportion of protected elements specified in the top results


Elasticsearch size parameters
-------------------------
Elasticsearch supports the following size parameters:

- ``size``, which is the total number of documents returned by ElasticSearch (defaults to 10)
- ``window_size``, which is the number of documents to be re-ranked by the re-score function and returned to the user (defaults to 10).

For example, if you make a query like this:

.. code-block:: json

    GET test/_search
    {
       "size": 50,
        "query": {
            "match": {
                "body": "hello"
            }
        },
        "rescore": {
            "window_size": 10,
            "fair_rescorer": {
                "protected_key": "gender",
                "protected_value": "f",
                "significance_level": 0.1,
                "min_proportion_protected": 0.6
            }
        }
    }

This will try to find initial 50 documents that match the query and, then, re-rank and *return* only the top 10. So, the plug-in will choose ``window_size`` as the ``k`` parameter or, if smaller, the actual numbers of documents initially returned. 

Ensuring there are protected elements in the topK results
-------------------------

A recommended way of using the plug-in is to specify a higher number for ``size``, so that more elements can be included in the initial list for more fair re-ranking. Then, after the re-scoring phase, only ``window_size`` number of documents will be returned.