data/README.txt from wikimedia/mediawiki-extensions-CirrusSearch

Origins of the utr30.txt file

UTR30 was a proposal drafted by the Unicode Consortium, but it was withdrawn
for lack of consensus.
See https://unicode.org/reports/tr30/.
Despite that, Lucene uses this proposal in its analysis-icu module.
Because UTR30 is not part of Unicode, generating this data is not trivial.
The data is a concatenation of multiple files. The output used by Lucene is a
binary file suited to being loaded as a Normalizer2.
These APIs are not available in PHP, so the idea is to generate a rule file that
the Transliterator can understand.
Source files:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/icu/src/data/utr30

Assemble the files into a single normalization rule file:
gennorm2 -s . BasicFoldings.txt DiacriticFolding.txt DingbatFolding.txt HanRadicalFolding.txt NativeDigitFolding.txt -o combined.nrm --combined

1/ Edit combined.nrm to rewrite every Unicode code point reference in the
notation understood by the Transliterator:
0020 should become \u0020

2/ Remove the last four lines (from 1D185..1D18B); for some reason they cannot be loaded with the rest of the file.

3/ Join all lines with a ';'

4/ Prepend the NFD, case folding and naive accent removal rules:

::NFD;::Upper;::Lower;::[:Nonspacing Mark:] Remove;::NFC;[\_\-\'\u2019\u02BC]>\u0020

This basically means:
Decompose, upper case, lower case, remove combining accents, recompose, fold some characters to a space.
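
To make the chain above concrete, here is a rough approximation of it in
Python, using only the standard unicodedata module. It is not the ICU
Transliterator itself: the function name prefix_fold is hypothetical, and the
result may differ from ICU on edge cases (for example when the two libraries
ship different Unicode versions).

```python
import unicodedata

# Characters folded to a space by the last prepended rule:
# underscore, hyphen, apostrophe, U+2019 and U+02BC.
FOLD_TO_SPACE = "_-'\u2019\u02BC"


def prefix_fold(text):
    """Approximate the prepended rules; an illustration, not the ICU behaviour."""
    # ::NFD - decompose, so accents become separate combining characters
    text = unicodedata.normalize("NFD", text)
    # ::Upper then ::Lower - a simple form of case folding
    text = text.upper().lower()
    # ::[:Nonspacing Mark:] Remove - drop characters of general category Mn
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # ::NFC - recompose whatever is left
    text = unicodedata.normalize("NFC", text)
    # [\_\-\'\u2019\u02BC] > \u0020 - fold some punctuation to a space
    return "".join(" " if ch in FOLD_TO_SPACE else ch for ch in text)
```

With this sketch, prefix_fold("Crème-Brûlée") gives "creme brulee".
Going through upper case before lower case also turns "Straße" into "strasse",
which lower casing alone would not do.
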
The rules from BasicFoldings.txt, DiacriticFolding.txt, DingbatFolding.txt, HanRadicalFolding.txt and
NativeDigitFolding.txt are applied after the prepended rules above.
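
Steps 1/ to 4/ can also be scripted instead of edited by hand. The following
Python sketch is one possible way to do it, under several assumptions that
should be checked against the real combined.nrm: one rule per line, '#'
starting a comment, and code points written as bare 4 to 6 digit hex
numbers. The names build_rules and escape_code_point are hypothetical, and
writing code points above FFFF as \UXXXXXXXX is an assumption about what the
Transliterator accepts.

```python
import re

# Assumption: code points appear as standalone 4 to 6 digit hex tokens.
HEX_TOKEN = re.compile(r"\b[0-9A-Fa-f]{4,6}\b")

# The rules prepended in step 4, copied verbatim from this README.
PREFIX = r"::NFD;::Upper;::Lower;::[:Nonspacing Mark:] Remove;::NFC;[\_\-\'\u2019\u02BC]>\u0020"


def escape_code_point(match):
    """Step 1: rewrite 0020 as \\u0020 (longer values as \\UXXXXXXXX)."""
    hex_digits = match.group(0).upper()
    if len(hex_digits) == 4:
        return "\\u" + hex_digits
    return "\\U" + hex_digits.zfill(8)


def build_rules(lines, drop_last=4):
    """Turn the lines of combined.nrm into one Transliterator rule string."""
    rules = []
    for line in lines:
        # Assumption: anything after '#' is a comment; blank lines are skipped.
        line = line.split("#", 1)[0].strip()
        if line:
            rules.append(line)
    # Step 2: drop the trailing rules that cannot be loaded.
    if drop_last:
        rules = rules[:-drop_last]
    escaped = [HEX_TOKEN.sub(escape_code_point, rule) for rule in rules]
    # Steps 3 and 4: join everything with ';', prepended rules first.
    return ";".join([PREFIX] + escaped)
```

A possible use is build_rules(open("combined.nrm", encoding="utf-8")), writing
the returned string to utr30.txt. Whether the result loads in the PHP
Transliterator still has to be verified by hand.
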