fair-search/fairsearch-elasticsearch-plugin

View on GitHub
docs/theory.rst

Summary

Maintainability
Test Coverage
The Theory
**********

Ranking can be biased
-----------------

Core concepts: ranking bias
~~~~~~~~~~~~~~~~~~~~~~
Search engines today are used to rank many different types of items, including items that represent *people*. Job recruiting search engines, marketplaces for consulting and other services, dating apps, etc. have at its core the idea of ranking/ordering people from most relevant to less relevant, which often means from "best" to "worst".

Traditional scoring methods such as TF-IDF or BM25 (the two most popular ones) can introduce a certain degree of bias; the main motivation of this plugin is to provider methods to search without having this problem.

A computer system is biased [Friedman 1996] if:

    It systematically and unfairly discriminate[s] against certain individuals or groups of individuals in favor of
    others. A system discriminates unfairly if it denies an opportunity or a good or if it assigns an undesirable
    outcome to an individual or a group of individuals on grounds that are unreasonable or inappropriate.

In algorithmic bias, an important concept is that of a *protected group*, which is a category of individuals protected by law, voluntary commitments, or other reasons. Search results are considered *unfair* if members of a protected group are systematically ranked lower than those of a non-protected group.

Examples where a fair search would be required are for instance, the US Equal Employment Opportunity Commission, which sets a
goal of 12% of workers with disabilities in federal agencies in the US, while in Spain, a minimum of 40% of political candidates in voting districts exceeding a certain size must be women.

Real-world example: Job Search
~~~~~~~~~~~~~~~~~~~~~~

Consider the three rankings in the table below, corresponding to searches for an “economist,” “market research analyst,” and “copywriter” in a job search engine, i.e., an online platform for jobs that is used by recruiters and headhunters to find suitable candidates.

Positions 1, 2, ..., 10 are the top-10 ranking positions. A letter ``m`` indicates the candidate is a man, while ``f`` indicates the candidate is a woman.

+-------------+----+----+----+----+----+----+----+----+----+------+-----------------------------+-----------------------------+
| Query       | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | R10  | Men:Women (top-10)          | Men:Women (top-40)          |
+=============+====+====+====+====+====+====+====+====+====+======+=============================+=============================+
| Econ.       | f  | m  | m  | m  | m  | m  | m  | m  | m  | m    | 90%:10%                     | 73%:27%                     |
+-------------+----+----+----+----+----+----+----+----+----+------+-----------------------------+-----------------------------+
| Analyst     | f  | m  | f  | f  | f  | f  | f  | m  | f  | f    | 20%:80%                     | 43%:57%                     |
+-------------+----+----+----+----+----+----+----+----+----+------+-----------------------------+-----------------------------+
| Copywr.     | m  | m  | m  | m  | m  | m  | f  | m  | m  | m    | 90%:10%                     | 73%:27%                     |
+-------------+----+----+----+----+----+----+----+----+----+------+-----------------------------+-----------------------------+

While analyzing the extent to which candidates of both genders are represented as we go down these lists, we can observe that the proportions keep changing (compare the top-10 against the the top-40).

As a consequence, recruiters examining these lists will see different proportions depending on the point at which they decide to stop. This can cause under represented groups have not a fair outcome, so limiting the visibility.

The FA*IR algorithm
-----------------

What is the fairness criterion applied by FA*IR?
~~~~~~~~~~~~~~~~~~~~~~

A *prefix* of a list are the first elements of the list; for instance, the list *(A, B, C)* has prefixes *(A)*, *(A, B)*, and *(A, B, C)*.

The fairness criterion in FA*IR [Zehlike et al. 2017] requires that the number of protected elements *in every prefix* of the list corresponds to the number of protected elements we would expect if they where picked at random using Bernoulli trials (independent “coin tosses”) with success probability *p*.

This correspondence is not exact, and there is a parameter *α* corresponding to the accepted probability of a Type I error, which means rejecting a fair ranking in this test. A typical value of *α* could be 0.1, or 10%.

Given *p*, *α*, and *k*, which is the total length of the list to be returned, an M-table is computed. This M-table indicates what is the minimum number of protected elements at every prefix.

Example
~~~~~~~~~~~~~~~~~~~~~~

This example illustrates how the re-ranker works, but we will be omitting a correction on *α* that will be explained next.

Suppose *p=0.5*, this means that we would like a list in which the protected candidates are at least 50% of every prefix. Suppose *α=0.1*, meaning we accept a 10% of Type I error.

The M-table in this case is:

+----------+---+---+---+---+---+---+---+---+---+----+
+ Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
+==========+===+===+===+===+===+===+===+===+===+====+
| M        | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 3 | 3  |
+----------+---+---+---+---+---+---+---+---+---+----+

This means that, among the top 3 elements, even if there is no protected item, we would still consider the list to be fair, because if you toss a fair coin (*p=0.5*) 3 times, the chance of getting "heads" 3 times is above 10% (remember *α=0.1*). However, among the top 4 items, at least one of them has to be protected, because if you toss a fair coin, the chance of getting *heads* 4 times is below 10%, hence, with this *α* it is not believable that the coin was fair in the first place.

The rest of the M table is easy to interpret; for instance: among the top-5 elements there has to be at least 1 protected, among the top-7 there must be 2 at least, and among the top-9 there must be 3 at least.

Corrections for multiple hypotheses testing
~~~~~~~~~~~~~~~~~~~~~~

The FA*IR plug-in does not use directly the parameter *α*, but computes a *corrected α*, which is in general smaller. For instance, for *p=0.5*, *α=0.1*, *k=100*, the *corrected α=0.0207*.

The *corrected α* accounts for the fact that, in a list of size *k*, there will be *k* tests performed, one for every prefix (for instance, 100). Hence, the probability of failing in at least one prefix is larger than *α* (because there are 100 attempts being made). The correction mechanism is explained in the FA*IR paper [Zehlike et al. 2017].

References
-----------------

[Friedman 1996] Friedman, B., & Nissenbaum, H. (1996). `Bias in computer systems <https://vsdesign.org/publications/pdf/64_friedman.pdf>`_. ACM Transactions on Information Systems (TOIS), 14(3), 330-347.

[Zehlike et al. 2017] Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., and Baeza-Yates, R. (2017, November). `FA*IR: A fair top-k ranking algorithm <https://arxiv.org/abs/1706.06368>`_. Proc. CIKM 2017 (pp. 1569-1578). ACM Press.