Module:MakeSortKey

Note: This module is used on a lot of pages. In order not to put too much load on the servers, edits should be kept to a bare minimum. Please discuss proposed changes on the talk page first.
Afterwards, changes can initially be done at and tested with Module:MakeSortKey/sandbox.
Editing a module causes all pages that use the module to be re-rendered. If the module is used often, this can put a lot of load on the servers since it fills up the job queue.
Keep in mind that displays produced by modules used on file description pages also show up on other wikis.
Notes:
The computed sort keys are composite and use a NUL byte as the separator between collation levels. That NUL byte is not suitable for output in HTML. To use this module to create sort keys usable in HTML (for example in sortable tables), globally replace these NUL bytes ('\000' in Lua strings, or '%z' in patterns) with ' !' (i.e. SPACE + exclamation mark), which is the smallest string that is preserved by HTML whitespace compression and that still distinguishes a sort key that is a prefix of another, longer sort key.
For now this module computes only locale-neutral sort keys. The language code parameter is not used yet, but it may be used in the future, with additional preprocessing, for example to sort 'ä' as 'ae' in German or as a separate letter after 'z' in Swedish, or to treat the dotted and dotless 'i' specially in Turkish.
Although it now works correctly for most primary keys used in Japanese and Korean (written in kana or Hangul), it does not attempt to sort sinographic characters (which are sorted only by their numeric code point value), as that would require a large table that is not implemented.
The secondary key is also very crude: it contains lowercase letters followed by their decomposed diacritics (in normalized order), but does not guarantee that these diacritics sort lower than a following base character: there is still no remapping/renumbering, so the base letters are still present in the secondary key.
The third key is just a verbatim copy of the input string in its original form (this conforms to UCA as the "last chance" key that takes into account any remaining differences of encoding). This collator is therefore defined with a collation strength between 1 and 2, suitable for locale-neutral sorting.
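The NUL-byte replacement described in the notes above can be sketched as follows. This is only a minimal illustration and not part of the module: the helper name toHtmlSortKey and the sample key are hypothetical, and the sample assumes a composite key whose three levels are joined by NUL bytes.

```lua
-- Hypothetical helper (not part of Module:MakeSortKey): make a composite
-- sort key safe for HTML by replacing each NUL separator with ' !'.
local function toHtmlSortKey(key)
	-- '%z' matches a NUL byte in Lua patterns; the outer parentheses drop
	-- the second value (the replacement count) returned by gsub.
	return (key:gsub('%z', ' !'))
end

-- Made-up composite key: three levels separated by NUL bytes.
local sample = 'abc' .. '\000' .. 'abc' .. '\000' .. 'Abc'
local htmlKey = toHtmlSortKey(sample)
-- htmlKey no longer contains any NUL byte; each separator is now ' !'.
```

Because SPACE sorts before any letter or digit, a key that is a prefix of a longer key still sorts first after the replacement.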
Code

--[==[ This provides a very crude method to sort list items for languages using letters with diacritics,
or variant forms, to sort them with their associated base letters.
- '[\192-\223].' matches all valid UTF-8 sequences for characters in U+0080..U+07FF encoded on 2 bytes;
- '[\224-\239]..' matches all valid UTF-8 sequences for characters in U+0800..U+FFFF encoded on 3 bytes;
- '[\240-\247]...' matches all valid UTF-8 sequences for characters in U+10000..U+10FFFF encoded on 4 bytes.
These also match invalid UTF-8 sequences where characters matched by '.' may not be valid trailing bytes,
but we assume that the labels are already valid UTF-8 strings.

Based on the Unicode Collation Algorithm (UCA), but still very crude (does not support all scripts).

We could use a full UCA implementation, but it is complex and not needed for now to sort short lists of
short labels, and we can accept minor deviations of the resulting sort order for an 'automatic' sort
which is for now only language neutral (a more exact sort could be specified for specific languages
using data overrides to this 'automatic' sort).

So for now the language parameter is still unused.

--]==]
-- First usage was in [[Module:Countries]].

local lower, toNFD, toNFKD, toNFC = mw.ustring.lower, mw.ustring.toNFD, mw.ustring.toNFKD, mw.ustring.toNFC

-- The following mapping tables are synchronized with Unicode version 12.1 (7 May 2019).
local substLower = {
	-- Each key is multibyte in UTF-8 (e.g. first byte is 194 or 195 for U+0080..U+00FF).
	-- Map only uppercase letters to their lowercase version, preserving all diacritics.
	-- Not all of them are mapped by mw.ustring.lower(), notably compatibility characters,
	-- because it only uses simple case pairs mapped by the Unicode Character Database.
	-- This mapping is still language-neutral and generates the crude secondary sort key.
	-- So locale-dependent mappings (e.g. for Turkic dotted/dotless i/j, or sharp s) are
	-- not supported here.

	-- TODO: reduce this set to only handle case mappings for letters that don't have a
	-- simple case pairing. We can do that because we now use the NF(K)D decompositions,
	-- and can handle most (decomposed) base letters with mw.ustring.lower(), which
	-- supports more scripts. Note that we must still handle some decompositions because
	-- they're not part of NFD or NFKD (see the UCA DUCET for known exceptions).

        -- TODO: handle locale-sensitive case pairs.

	-- Latin alphabet.
	['À'] = 'à', ['Á'] = 'á', ['Â'] = 'â', ['Ã'] = 'ã', ['Ä'] = 'ä', ['Å'] = 'å', ['Ā'] = 'ā', ['Ǎ'] = 'ǎ', ['Ă'] = 'ă', ['Ą'] = 'ą', ['Ǎ'] = 'ǎ', ['Ȁ'] = 'ȁ', ['Ȃ'] = 'ȃ', ['Ḁ'] = 'ḁ', ['Ạ'] = 'ạ', ['Ả'] = 'ả', ['Ấ'] = 'ấ', ['Ầ'] = 'ầ',
	['Ǟ'] = 'ǟ', ['Ǡ'] = 'ǡ', ['Ǻ'] = 'ǻ', ['Ẩ'] = 'ẩ', ['Ẫ'] = 'ẫ', ['Ậ'] = 'ậ', ['Ắ'] = 'ắ', ['Ằ'] = 'ằ', ['Ẳ'] = 'ẳ', ['Ẵ'] = 'ẵ', ['Ặ'] = 'ặ',
	['Æ'] = 'æ', ['Ǣ'] = 'ǣ', ['Ǽ'] = 'ǽ',
	['Ɓ'] = 'ɓ', ['Ƃ'] = 'ƃ', ['Ḃ'] = 'ḃ', ['Ḅ'] = 'ḅ', ['Ḇ'] = 'ḇ',
	['Ç'] = 'ç', ['Č'] = 'č', ['Ć'] = 'ć', ['Ĉ'] = 'ĉ', ['Ċ'] = 'ċ', ['Č'] = 'č', ['Ƈ'] = 'ƈ', ['Ḉ'] = 'ḉ',
	['Ď'] = 'ď', ['Đ'] = 'đ', ['Ɖ'] = 'ɖ', ['Ɗ'] = 'ɗ', ['Ƌ'] = 'ƌ', ['Ḋ'] = 'ḋ', ['Ḍ'] = 'ḍ', ['Ḏ'] = 'ḏ', ['Ḑ'] = 'ḑ', ['Ḓ'] = 'ḓ',
	['Ð'] = 'ð',
	['Ǳ'] = 'ǳ', ['ǲ'] = 'ǳ', ['Ǆ'] = 'ǆ', ['ǅ'] = 'ǆ',
	['È'] = 'è', ['É'] = 'é', ['Ê'] = 'ê', ['Ë'] = 'ë', ['Ē'] = 'ē', ['Ė'] = 'ė', ['Ĕ'] = 'ĕ', ['Ę'] = 'ę', ['Ě'] = 'ě', ['Ȅ'] = 'ȅ', ['Ȇ'] = 'ȇ', ['Ẹ'] = 'ẹ', ['Ẻ'] = 'ẻ', ['Ẽ'] = 'ẽ', ['Ế'] = 'ế', ['Ề'] = 'ề',
	['Ḕ'] = 'ḕ', ['Ḗ'] = 'ḗ', ['Ḙ'] = 'ḙ', ['Ḛ'] = 'ḛ', ['Ḝ'] = 'ḝ', ['Ể'] = 'ể', ['Ễ'] = 'ễ', ['Ệ'] = 'ệ',
	['Ǝ'] = 'ɘ', ['Ə'] = 'ə', ['Ɛ'] = 'ɛ',
	['Ƒ'] = 'ƒ', ['Ḟ'] = 'ḟ',
	['Ĝ'] = 'ĝ', ['Ğ'] = 'ğ', ['Ġ'] = 'ġ', ['Ģ'] = 'ģ', ['Ɠ'] = 'ɠ', ['Ǥ'] = 'ǥ', ['Ǧ'] = 'ǧ', ['Ǵ'] = 'ǵ', ['Ḡ'] = 'ḡ', ['Ɣ'] = 'ɣ',
	['Ƣ'] = 'ƣ',
	['Ĥ'] = 'ĥ', ['Ħ'] = 'ħ', ['Ḣ'] = 'ḣ', ['Ḥ'] = 'ḥ', ['Ḧ'] = 'ḧ', ['Ḩ'] = 'ḩ', ['Ḫ'] = 'ḫ',
	['Ì'] = 'ì', ['Í'] = 'í', ['Î'] = 'î', ['Ï'] = 'ï', ['Ĩ'] = 'ĩ', ['Ī'] = 'ī', ['Ĭ'] = 'ĭ', ['Į'] = 'į', ['İ'] = 'i', ['Ɩ'] = 'ɩ', ['Ɨ'] = 'ɨ', ['Ǐ'] = 'ǐ', ['Ȉ'] = 'ȉ', ['Ȋ'] = 'ȋ', ['Ḭ'] = 'ḭ', ['Ḯ'] = 'ḯ', ['Ỉ'] = 'ỉ', ['Ị'] = 'ị',
	['Ĳ'] = 'ĳ',
	['Ĵ'] = 'ĵ',
	['Ʒ'] = 'ʒ', ['Ƹ'] = 'ƹ', ['Ǯ'] = 'ǯ',
	['Ķ'] = 'ķ', ['Ǩ'] = 'ǩ', ['Ƙ'] = 'ƙ', ['Ǩ'] = 'ǩ', ['Ḱ'] = 'ḱ', ['Ḳ'] = 'ḳ', ['Ḵ'] = 'ḵ',
	['Ĺ'] = 'ĺ', ['Ļ'] = 'ļ', ['Ľ'] = 'ľ', ['Ŀ'] = 'ŀ', ['Ł'] = 'ł', ['Ḷ'] = 'ḷ', ['Ḹ'] = 'ḹ', ['Ḻ'] = 'ḻ', ['Ḽ'] = 'ḽ',
	['Ǉ'] = 'ǉ', ['ǈ'] = 'ǉ',
	['Ɯ'] = 'ɯ', ['Ḿ'] = 'ḿ', ['Ṁ'] = 'ṁ', ['Ṃ'] = 'ṃ',
	['Ñ'] = 'ñ', ['Ń'] = 'ń', ['Ņ'] = 'ņ', ['Ň'] = 'ň', ['Ɲ'] = 'ɲ', ['Ṅ'] = 'ṅ', ['Ṇ'] = 'ṇ', ['Ṉ'] = 'ṉ', ['Ṋ'] = 'ṋ',
	['Ŋ'] = 'ŋ',
	['Ǌ'] = 'ǌ', ['ǋ'] = 'ǌ',
	['Ò'] = 'ò', ['Ó'] = 'ó', ['Ô'] = 'ô', ['Õ'] = 'õ', ['Ö'] = 'ö', ['Ø'] = 'ø', ['Ō'] = 'ō', ['Ő'] = 'ő', ['Ǒ'] = 'ǒ', ['Ǫ'] = 'ǫ', ['Ǭ'] = 'ǭ', ['Ǫ'] = 'ǫ', ['Ǭ'] = 'ǭ', ['Ŏ'] = 'ŏ', ['Ő'] = 'ő', ['Ɔ'] = 'ɔ', ['Ɵ'] = 'ɵ', ['Ơ'] = 'ơ', ['Ọ'] = 'ọ', ['Ỏ'] = 'ỏ', ['Ố'] = 'ố', ['Ồ'] = 'ồ',
	['Ṍ'] = 'ṍ', ['Ṏ'] = 'ṏ', ['Ṑ'] = 'ṑ', ['Ṓ'] = 'ṓ', ['Ổ'] = 'ổ', ['Ỗ'] = 'ỗ', ['Ộ'] = 'ộ', ['Ớ'] = 'ớ', ['Ờ'] = 'ờ', ['Ở'] = 'ở', ['Ỡ'] = 'ỡ', ['Ợ'] = 'ợ',
	['Œ'] = 'œ',
	['Ƥ'] = 'ƥ', ['Ṕ'] = 'ṕ', ['Ṗ'] = 'ṗ',
	['Ŕ'] = 'ŕ', ['Ŗ'] = 'ŗ', ['Ř'] = 'ř',
	['Ş'] = 'ş', ['Š'] = 'š', ['Ś'] = 'ś', ['Ŝ'] = 'ŝ', ['Ṡ'] = 'ṡ', ['Ṣ'] = 'ṣ', ['Ṥ'] = 'ṥ', ['Ṧ'] = 'ṧ', ['Ṩ'] = 'ṩ',
	['Ƨ'] = 'ƨ', ['Ʃ'] = 'ʃ',
	['Ţ'] = 'ţ', ['Ť'] = 'ť', ['Ŧ'] = 'ŧ', ['Ƭ'] = 'ƭ', ['Ʈ'] = 'ʈ', ['Ṫ'] = 'ṫ', ['Ṭ'] = 'ṭ', ['Ṯ'] = 'ṯ', ['Ṱ'] = 'ṱ',
	['Þ'] = 'þ',
	['Ù'] = 'ù', ['Ú'] = 'ú', ['Û'] = 'û', ['Ü'] = 'ü', ['Ü'] = 'ü', ['Ũ'] = 'ũ', ['Ū'] = 'ū', ['Ŭ'] = 'ŭ', ['Ů'] = 'ů', ['Ű'] = 'ű', ['Ű'] = 'ű', ['Ų'] = 'ų', ['Ǔ'] = 'ǔ', ['Ȕ'] = 'ȕ', ['Ȗ'] = 'ȗ', ['Ṳ'] = 'ṳ', ['Ṵ'] = 'ṵ', ['Ṷ'] = 'ṷ', ['Ʊ'] = 'ʊ',
	['Ǖ'] = 'ǖ', ['Ǘ'] = 'ǘ', ['Ǚ'] = 'ǚ', ['Ǜ'] = 'ǜ', ['Ṹ'] = 'ṹ', ['Ṻ'] = 'ṻ', ['Ụ'] = 'ụ', ['Ủ'] = 'ủ', ['Ứ'] = 'ứ', ['Ừ'] = 'ừ', ['Ử'] = 'ử', ['Ữ'] = 'ữ', ['Ự'] = 'ự',
	['Ʋ'] = 'ʋ', ['Ṽ'] = 'ṽ', ['Ṿ'] = 'ṿ',
	['Ŵ'] = 'ŵ', ['Ẁ'] = 'ẁ', ['Ẃ'] = 'ẃ', ['Ẅ'] = 'ẅ', ['Ẇ'] = 'ẇ', ['Ẉ'] = 'ẉ',
	['Ẋ'] = 'ẋ', ['Ẍ'] = 'ẍ',
	['Ý'] = 'ý', ['Ŷ'] = 'ŷ', ['Ÿ'] = 'ÿ', ['Ƴ'] = 'ƴ', ['Ẏ'] = 'ẏ', ['Ỳ'] = 'ỳ', ['Ỵ'] = 'ỵ', ['Ỷ'] = 'ỷ', ['Ỹ'] = 'ỹ',
	['Ź'] = 'ź', ['Ż'] = 'ż', ['Ž'] = 'ž', ['Ƶ'] = 'ƶ', ['Ẑ'] = 'ẑ', ['Ẓ'] = 'ẓ', ['Ẕ'] = 'ẕ',

	-- Greek alphabet (and some uppercase symbols convertible to lowercase symbols)
	['Α'] = 'α', ['Ά'] = 'ά', ['Ὰ'] = 'ὰ', ['Ά'] = 'ά', ['Ᾰ'] = 'ᾰ', ['Ᾱ'] = 'ᾱ', ['Ἀ'] = 'ἀ', ['Ἁ'] = 'ἁ',
	['Ἄ'] = 'ἄ', ['Ἅ'] = 'ἅ', ['Ἂ'] = 'ἂ', ['Ἃ'] = 'ἃ', ['Ἆ'] = 'ἆ', ['Ἇ'] = 'ἇ',
	['ᾼ'] = 'ᾳ', ['ᾈ'] = 'ᾀ', ['ᾉ'] = 'ᾁ', ['ᾊ'] = 'ᾂ', ['ᾋ'] = 'ᾃ', ['ᾌ'] = 'ᾄ', ['ᾍ'] = 'ᾅ', ['ᾎ'] = 'ᾆ', ['ᾏ'] = 'ᾇ',
	['Β'] = 'β',
	['Γ'] = 'γ',
	['Δ'] = 'δ',
	['Ε'] = 'ε', ['Έ'] = 'έ',
	['Ζ'] = 'ζ',
	['Η'] = 'η', ['Ή'] = 'ή', ['Ͱ'] = 'ͱ',
	['Θ'] = 'θ',
	['Ι'] = 'ι', ['Ί'] = 'ί', ['Ϊ'] = 'ϊ',
	['Κ'] = 'κ',
	['Λ'] = 'λ',
	['Μ'] = 'μ',
	['Ν'] = 'ν',
	['Ξ'] = 'ξ',
	['Ο'] = 'ο', ['Ό'] = 'ό',
	['Π'] = 'π',
	['Ϙ'] = 'ϙ',
	['Ρ'] = 'ρ',
	['Σ'] = 'σ', ['Ϲ'] = 'ϲ', ['Ͼ'] = 'ͼ', ['Ͻ'] = 'ͻ', ['Ͽ'] = 'ͽ', ['Ϛ'] = 'ϛ',
	['Τ'] = 'τ',
	['Υ'] = 'υ', ['Ύ'] = 'ύ', ['Ϋ'] = 'ϋ',
	['Φ'] = 'φ',
	['Χ'] = 'χ',
	['Ψ'] = 'ψ',
	['Ω'] = 'ω', ['Ώ'] = 'ώ',
	['Ϝ'] = 'ϝ', ['Ͷ'] = 'ͷ',
	['Ϟ'] = 'ϟ',
	['Ϡ'] = 'ϡ', ['Ͳ'] = 'ͳ',
	['Ϸ'] = 'ϸ',
	['Ϻ'] = 'ϻ',

	-- Coptic alphabet
	['Ϣ'] = 'ϣ',
	['Ϥ'] = 'ϥ',
	['Ϧ'] = 'ϧ',
	['Ϩ'] = 'ϩ',
	['Ϫ'] = 'ϫ',
	['Ϭ'] = 'ϭ',
	['Ϯ'] = 'ϯ',

	['Ⲁ'] = 'ⲁ',
	['Ⲃ'] = 'ⲃ',
	['Ⲅ'] = 'ⲅ', ['Ⳮ'] = 'ⳮ',
	['Ⲇ'] = 'ⲇ',
	['Ⲉ'] = 'ⲉ', ['Ⳕ'] = 'ⳕ',
	['Ⲋ'] = 'ⲋ',
	['Ⲍ'] = 'ⲍ',
	['Ⲏ'] = 'ⲏ', ['Ⳏ'] = 'ⳏ', ['Ⳓ'] = 'ⳓ',	['Ⳛ'] = 'ⳛ', ['Ⳝ'] = 'ⳝ',
	['Ⲑ'] = 'ⲑ',
	['Ⲓ'] = 'ⲓ', ['Ⳗ'] = 'ⳗ', ['Ⳙ'] = 'ⳙ',
	['Ⲕ'] = 'ⲕ', ['Ⲹ'] = 'ⲹ', ['Ⳉ'] = 'ⳉ',
	['Ⲗ'] = 'ⲗ', ['Ⳑ'] = 'ⳑ',
	['Ⲙ'] = 'ⲙ',
	['Ⲛ'] = 'ⲛ', ['Ⲻ'] = 'ⲻ', ['Ⲽ'] = 'ⲽ', ['Ⳟ'] = 'ⳟ', ['Ⳡ'] = 'ⳡ',
	['Ⲝ'] = 'ⲝ',
	['Ⲟ'] = 'ⲟ',
	['Ⲡ'] = 'ⲡ',
	['Ⲣ'] = 'ⲣ',
	['Ⲥ'] = 'ⲥ', ['Ⳁ'] = 'ⳁ', ['Ⳃ'] = 'ⳃ', ['Ⳅ'] = 'ⳅ', ['Ⳇ'] = 'ⳇ', ['Ⳋ'] = 'ⳋ', ['Ⳍ'] = 'ⳍ', ['Ⳬ'] = 'ⳬ',
	['Ⲧ'] = 'ⲧ',
	['Ⲩ'] = 'ⲩ', ['Ⳣ'] = 'ⳣ',
	['Ⲫ'] = 'ⲫ',
	['Ⲭ'] = 'ⲭ',
	['Ⲯ'] = 'ⲯ',
	['Ⲱ'] = 'ⲱ', ['Ⲿ'] = 'ⲿ',
	['Ⲳ'] = 'ⲳ',
	['Ⲵ'] = 'ⲵ',
	['Ⲷ'] = 'ⲷ',
	['Ⳳ'] = 'ⳳ',

	-- Cyrillic alphabet (note that string:lower() does not change non-Latin capital letters)
	['А'] = 'а', ['Ӑ'] = 'ӑ', ['Ӓ'] = 'ӓ', ['Ӕ'] = 'ӕ',
	['Б'] = 'б',
	['В'] = 'в',
	['Г'] = 'г', ['Ѓ'] = 'ѓ', ['Ґ'] = 'ґ', ['Ғ'] = 'ғ', ['Ҕ'] = 'ҕ',
	['Д'] = 'д',
	['Е'] = 'е', ['Ё'] = 'ё', ['Ӗ'] = 'ӗ', ['Є'] = 'є', ['Ѥ'] = 'ѥ',
	['Ђ'] = 'ђ',
	['Ж'] = 'ж', ['Ӂ'] = 'ӂ', ['Ӝ'] = 'ӝ', ['Җ'] = 'җ',
	['З'] = 'з', ['Ҙ'] = 'ҙ', ['Ӟ'] = 'ӟ', ['Ӡ'] = 'ӡ',
	['Ѳ'] = 'ѳ',
	['И'] = 'и', ['Ӣ'] = 'ӣ', ['Ӥ'] = 'ӥ', ['І'] = 'і',
	['Й'] = 'й', ['Ї'] = 'ї', ['Ј'] = 'ј',
	['К'] = 'к', ['Ќ'] = 'ќ', ['Қ'] = 'қ', ['Ҝ'] = 'ҝ', ['Ҟ'] = 'ҟ', ['Ҡ'] = 'ҡ', ['Ӄ'] = 'ӄ',
	['Л'] = 'л', ['Љ'] = 'љ',
	['М'] = 'м',
	['Н'] = 'н', ['Њ'] = 'њ', ['Ң'] = 'ң', ['Ҥ'] = 'ҥ', ['Ӈ'] = 'ӈ',
	['Ѯ'] = 'ѯ',
	['О'] = 'о', ['Ӧ'] = 'ӧ', ['Ө'] = 'ө', ['Ӫ'] = 'ӫ', ['Ѹ'] = 'ѹ',
	['П'] = 'п', ['Ҧ'] = 'ҧ',
	['Ҁ'] = 'ҁ',
	['Р'] = 'р',
	['С'] = 'с', ['Ҫ'] = 'ҫ', ['Ѕ'] = 'ѕ',
	['Т'] = 'т', ['Ҭ'] = 'ҭ', ['Ћ'] = 'ћ', ['Ҵ'] = 'ҵ',
	['У'] = 'у', ['Ў'] = 'ў', ['Ӯ'] = 'ӯ', ['Ӱ'] = 'ӱ', ['Ӳ'] = 'ӳ', ['Ү'] = 'ү', ['Ұ'] = 'ұ',
	['Ѵ'] = 'ѵ', ['Ѷ'] = 'ѷ',
	['Ф'] = 'ф',
	['Х'] = 'х', ['Ҳ'] = 'ҳ',
	['Ц'] = 'ц',
	['Ч'] = 'ч', ['Ӵ'] = 'ӵ', ['Ҷ'] = 'ҷ', ['Ҹ'] = 'ҹ', ['Ҽ'] = 'ҽ', ['Ҿ'] = 'ҿ', ['Ӌ'] = 'ӌ',
	['Ш'] = 'ш',
	['Щ'] = 'щ',
	['Ъ'] = 'ъ',
	['Ы'] = 'ы', ['Ӹ'] = 'ӹ',
	['Ҩ'] = 'ҩ', ['Һ'] = 'һ',
	['Џ'] = 'џ',
	['Ь'] = 'ь', ['Ѣ'] = 'ѣ',
	['Э'] = 'э', ['Ә'] = 'ә', ['Ӛ'] = 'ӛ',
	['Ю'] = 'ю',
	['Я'] = 'я',
	['Ѱ'] = 'ѱ',
	['Ѡ'] = 'ѡ', ['Ѽ'] = 'ѽ', ['Ѿ'] = 'ѿ', ['Ѻ'] = 'ѻ',
	['Ѧ'] = 'ѧ', ['Ѩ'] = 'ѩ',
	['Ѫ'] = 'ѫ', ['Ѭ'] = 'ѭ',

	-- Armenian alphabet
	['Ա'] = 'ա',
	['Բ'] = 'բ',
	['Գ'] = 'գ',
	['Դ'] = 'դ',
	['Ե'] = 'ե',
	['Զ'] = 'զ',
	['Է'] = 'է',
	['Ը'] = 'ը',
	['Թ'] = 'թ',
	['Ժ'] = 'ժ',
	['Ի'] = 'ի',
	['Լ'] = 'լ',
	['Խ'] = 'խ',
	['Ծ'] = 'ծ',
	['Կ'] = 'կ',
	['Հ'] = 'հ',
	['Ձ'] = 'ձ',
	['Ղ'] = 'ղ',
	['Ճ'] = 'ճ',
	['Մ'] = 'մ',
	['Յ'] = 'յ',
	['Ն'] = 'ն',
	['Շ'] = 'շ',
	['Ո'] = 'ո',
	['Չ'] = 'չ',
	['Պ'] = 'պ',
	['Ջ'] = 'ջ',
	['Ռ'] = 'ռ',
	['Ս'] = 'ս',
	['Վ'] = 'վ',
	['Տ'] = 'տ',
	['Ր'] = 'ր',
	['Ց'] = 'ց',
	['Ւ'] = 'ւ',
	['Փ'] = 'փ',
	['Ք'] = 'ք',
	['Օ'] = 'օ',
	['Ֆ'] = 'ֆ',

	-- Georgian alphabet - tricameral: Assomtavruli rounded (Khutsuri capitals) and Nuskhuri square handscript (Khutsuri lowercase) to modern unicameral Mkhedruli rounded (lowercase)
	['Ⴀ'] = 'ა', ['ⴀ'] = 'ა',
	['Ⴁ'] = 'ბ', ['ⴁ'] = 'ბ',
	['Ⴂ'] = 'გ', ['ⴂ'] = 'გ',
	['Ⴃ'] = 'დ', ['ⴃ'] = 'დ',
	['Ⴄ'] = 'ე', ['ⴄ'] = 'ე',
	['Ⴅ'] = 'ვ', ['ⴅ'] = 'ვ',
	['Ⴆ'] = 'ზ', ['ⴆ'] = 'ზ',
	['Ⴇ'] = 'თ', ['ⴇ'] = 'თ',
	['Ⴈ'] = 'ი', ['ⴈ'] = 'ი',
	['Ⴉ'] = 'კ', ['ⴉ'] = 'კ',
	['Ⴊ'] = 'ლ', ['ⴊ'] = 'ლ',
	['Ⴋ'] = 'მ', ['ⴋ'] = 'მ',
	['Ⴌ'] = 'ნ', ['ⴌ'] = 'ნ',
	['Ⴍ'] = 'ო', ['ⴍ'] = 'ო',
	['Ⴎ'] = 'პ', ['ⴎ'] = 'პ',
	['Ⴏ'] = 'ჟ', ['ⴏ'] = 'ჟ',
	['Ⴐ'] = 'რ', ['ⴐ'] = 'რ',
	['Ⴑ'] = 'ს', ['ⴑ'] = 'ს',
	['Ⴒ'] = 'ტ', ['ⴒ'] = 'ტ',
	['Ⴓ'] = 'უ', ['ⴓ'] = 'უ',
	['Ⴔ'] = 'ფ', ['ⴔ'] = 'ფ',
	['Ⴕ'] = 'ქ', ['ⴕ'] = 'ქ',
	['Ⴖ'] = 'ღ', ['ⴖ'] = 'ღ',
	['Ⴗ'] = 'ყ', ['ⴗ'] = 'ყ',
	['Ⴘ'] = 'შ', ['ⴘ'] = 'შ',
	['Ⴙ'] = 'ჩ', ['ⴙ'] = 'ჩ',
	['Ⴚ'] = 'ც', ['ⴚ'] = 'ც',
	['Ⴛ'] = 'ძ', ['ⴛ'] = 'ძ',
	['Ⴜ'] = 'წ', ['ⴜ'] = 'წ',
	['Ⴝ'] = 'ჭ', ['ⴝ'] = 'ჭ',
	['Ⴞ'] = 'ხ', ['ⴞ'] = 'ხ',
	['Ⴟ'] = 'ჯ', ['ⴟ'] = 'ჯ',
	['Ⴠ'] = 'ჰ', ['ⴠ'] = 'ჰ',
	['Ⴡ'] = 'ჱ', ['ⴡ'] = 'ჱ',
	['Ⴢ'] = 'ჲ', ['ⴢ'] = 'ჲ',
	['Ⴣ'] = 'ჳ', ['ⴣ'] = 'ჳ',
	['Ⴤ'] = 'ჴ', ['ⴤ'] = 'ჴ',
	['Ⴥ'] = 'ჵ', ['ⴥ'] = 'ჵ',

	-- TODO: Map standard Hangul LVT syllables to standard Hangul LV syllables + T jamos (algorithmically?)
	-- and then Hangul LV syllables to standard Hangul L jamos + Hangul V jamos (algorithmically?)

	-- Fullwidth uppercase basic Latin variants to fullwidth lowercase basic Latin
	['Ａ'] = 'ａ', ['Ｂ'] = 'ｂ', ['Ｃ'] = 'ｃ', ['Ｄ'] = 'ｄ', ['Ｅ'] = 'ｅ', ['Ｆ'] = 'ｆ', ['Ｇ'] = 'ｇ', ['Ｈ'] = 'ｈ',
	['Ｉ'] = 'ｉ', ['Ｊ'] = 'ｊ', ['Ｋ'] = 'ｋ', ['Ｌ'] = 'ｌ', ['Ｍ'] = 'ｍ', ['Ｎ'] = 'ｎ', ['Ｏ'] = 'ｏ', ['Ｐ'] = 'ｐ',
	['Ｑ'] = 'ｑ', ['Ｒ'] = 'ｒ', ['Ｓ'] = 'ｓ', ['Ｔ'] = 'ｔ', ['Ｕ'] = 'ｕ', ['Ｖ'] = 'ｖ', ['Ｗ'] = 'ｗ', ['Ｘ'] = 'ｘ',
	['Ｙ'] = 'ｙ', ['Ｚ'] = 'ｚ',
}

local substDiacritics = {
	-- Each key is multibyte in UTF-8 (e.g. first byte is 194 or 195 for U+0080..U+00FF).
	-- Maps only lowercase letters to base letters without diacritics (uppercase letters are mapped to lowercase with the table above).
	-- Also splits ligatures to separate letters.
	-- This mapping (applied after the previous one which changes only case but preserves diacritics) generates the crude primary sort key.

	-- TODO: map some punctuation here

	-- Latin alphabet
	['à'] = 'a', ['á'] = 'a', ['â'] = 'a', ['ã'] = 'a', ['ä'] = 'a', ['å'] = 'a', ['ā'] = 'a', ['ǎ'] = 'a', ['ă'] = 'a', ['ą'] = 'a', ['ǎ'] = 'a', ['ǻ'] = 'a', ['ȁ'] = 'a', ['ȃ'] = 'a', ['ɐ'] = 'a', ['ḁ'] = 'a', ['ẚ'] = 'a', ['ạ'] = 'a', ['ả'] = 'a', ['ấ'] = 'a', ['ầ'] = 'a',
	['ǟ'] = 'a', ['ǡ'] = 'a', ['ẩ'] = 'a', ['ẫ'] = 'a', ['ậ'] = 'a', ['ắ'] = 'a', ['ằ'] = 'a', ['ẳ'] = 'a', ['ẵ'] = 'a', ['ặ'] = 'a',
	['ɑ'] = 'a', ['ɒ'] = 'a',
	['æ'] = 'ae', ['ǣ'] = 'ae', ['ǽ'] = 'ae',
	['ƀ'] = 'b', ['ƃ'] = 'b', ['ƅ'] = 'b', ['ɓ'] = 'b', ['ḃ'] = 'b', ['ḅ'] = 'b', ['ḇ'] = 'b',
	['ç'] = 'c', ['č'] = 'c', ['ć'] = 'c', ['ĉ'] = 'c', ['ċ'] = 'c', ['č'] = 'c', ['ƈ'] = 'c', ['ḉ'] = 'c', ['ɕ'] = 'c', ['ʗ'] = 'c',
	['ď'] = 'd', ['đ'] = 'd', ['ḋ'] = 'd', ['ḍ'] = 'd', ['ḏ'] = 'd', ['ḑ'] = 'd', ['ḓ'] = 'd',
	['ð'] = 'dh',
	['ǳ'] = 'dz', ['ǆ'] = 'dz', ['ʣ'] = 'dz', ['ʥ'] = 'dz', ['ʤ'] = 'dz',
	['è'] = 'e', ['é'] = 'e', ['ê'] = 'e', ['ë'] = 'e', ['ē'] = 'e', ['ė'] = 'e', ['ĕ'] = 'e', ['ę'] = 'e', ['ě'] = 'e', ['ȅ'] = 'e', ['ȇ'] = 'e', ['ḙ'] = 'e', ['ḛ'] = 'e', ['ḝ'] = 'e', ['ẹ'] = 'e', ['ẻ'] = 'e', ['ẽ'] = 'e',
	['ḕ'] = 'e', ['ḗ'] = 'e', ['ế'] = 'e', ['ề'] = 'e', ['ể'] = 'e', ['ễ'] = 'e', ['ệ'] = 'e',
	['ə'] = 'e', ['ɘ'] = 'e', ['ɛ'] = 'e', ['ɜ'] = 'e', ['ɝ'] = 'e', ['ɞ'] = 'e', ['ʚ'] = 'e',
	['ƒ'] = 'f', ['ḟ'] = 'f',
	['ɡ'] = 'g', ['ĝ'] = 'g', ['ğ'] = 'g', ['ġ'] = 'g', ['ǵ'] = 'g', ['ǥ'] = 'g', ['ǧ'] = 'g', ['ɠ'] = 'g', ['ḡ'] = 'g', ['ɢ'] = 'g', ['ɣ'] = 'g', ['ɤ'] = 'g',
	['ƣ'] = 'gh',
	['ĥ'] = 'h', ['ħ'] = 'h', ['ɦ'] = 'h', ['ḣ'] = 'h', ['ḥ'] = 'h', ['ḧ'] = 'h', ['ḩ'] = 'h', ['ḫ'] = 'h', ['ẖ'] = 'h', ['ɥ'] = 'h',
	['ƕ'] = 'hv',
	['ì'] = 'i', ['í'] = 'i', ['î'] = 'i', ['ï'] = 'i', ['ĩ'] = 'i', ['ī'] = 'i', ['ĭ'] = 'i', ['į'] = 'i', ['ı'] = 'i', ['ɩ'] = 'i', ['ɨ'] = 'i', ['ɪ'] = 'i', ['ǐ'] = 'i', ['ȉ'] = 'i', ['ȋ'] = 'i', ['ḭ'] = 'i', ['ḯ'] = 'i', ['ỉ'] = 'i', ['ị'] = 'i',
	['ĳ'] = 'ij',
	['ĵ'] = 'j', ['ǰ'] = 'j', ['ǯ'] = 'j', ['ɟ'] = 'j', ['ʄ'] = 'j', ['ʝ'] = 'j',
	['ʒ'] = 'j', ['ƹ'] = 'j', ['ƺ'] = 'j', ['ʓ'] = 'j',
	['ķ'] = 'k', ['ĸ'] = 'k', ['ǩ'] = 'k', ['ƙ'] = 'k', ['ǩ'] = 'k', ['ʞ'] = 'k', ['ḱ'] = 'k', ['ḳ'] = 'k', ['ḵ'] = 'k',
	['ĺ'] = 'l', ['ļ'] = 'l', ['ľ'] = 'l', ['ŀ'] = 'l', ['ł'] = 'l', ['ƚ'] = 'l', ['ƛ'] = 'l', ['ɫ'] = 'l', ['ɬ'] = 'l', ['ɭ'] = 'l', ['ḷ'] = 'l', ['ḹ'] = 'l', ['ḻ'] = 'l', ['ḽ'] = 'l',
	['ǉ'] = 'lj', ['ɮ'] = 'lj',
	['ɯ'] = 'm', ['ɰ'] = 'm', ['ɱ'] = 'm', ['ḿ'] = 'm', ['ṁ'] = 'm', ['ṃ'] = 'm',
	['ñ'] = 'n', ['ń'] = 'n', ['ņ'] = 'n', ['ň'] = 'n', ['ɲ'] = 'n', ['ƞ'] = 'n', ['ɳ'] = 'n', ['ɴ'] = 'n', ['ŉ'] = 'n', ['ṅ'] = 'n', ['ṇ'] = 'n', ['ṉ'] = 'n', ['ṋ'] = 'n',
	['ŋ'] = 'ng', ['ɧ'] = 'ng',
	['ǌ'] = 'nj',
	['ò'] = 'o', ['ó'] = 'o', ['ô'] = 'o', ['õ'] = 'o', ['ö'] = 'o', ['ø'] = 'o', ['ō'] = 'o', ['ő'] = 'o', ['ǒ'] = 'o', ['ǫ'] = 'o', ['ǭ'] = 'o', ['ŏ'] = 'o', ['ő'] = 'o', ['ɔ'] = 'o', ['ɵ'] = 'o', ['ơ'] = 'o', ['ɷ'] = 'o', ['ṍ'] = 'o', ['ṏ'] = 'o', ['ṑ'] = 'o', ['ṓ'] = 'o', ['ọ'] = 'o', ['ỏ'] = 'o',
	['ố'] = 'o', ['ồ'] = 'o', ['ổ'] = 'o', ['ỗ'] = 'o', ['ộ'] = 'o', ['ớ'] = 'o', ['ờ'] = 'o', ['ở'] = 'o', ['ỡ'] = 'o', ['ợ'] = 'o',
	['œ'] = 'oe', ['ɶ'] = 'oe',
	['ƥ'] = 'p',
	['ʠ'] = 'q',
	['ŕ'] = 'r', ['ŗ'] = 'r', ['ř'] = 'r', ['ȑ'] = 'r', ['ȓ'] = 'r', ['ɹ'] = 'r', ['ɺ'] = 'r', ['ɻ'] = 'r', ['ɼ'] = 'r', ['ɽ'] = 'r', ['ɾ'] = 'r', ['ɿ'] = 'r', ['ʀ'] = 'r', ['ʁ'] = 'r', ['ʅ'] = 'r', ['ṙ'] = 'r', ['ṛ'] = 'r', ['ṝ'] = 'r', ['ṟ'] = 'r',
	['Ʀ'] = 'r',
	['ś'] = 's', ['ŝ'] = 's', ['ş'] = 's', ['š'] = 's', ['ſ'] = 's', ['ʂ'] = 's', ['ṡ'] = 's', ['ẛ'] = 's', ['ṣ'] = 's', ['ṥ'] = 's', ['ṧ'] = 's', ['ṩ'] = 's',
	['ƨ'] = 's', ['ʃ'] = 's', ['ƪ'] = 's', ['ʆ'] = 's',
	['ţ'] = 't', ['ť'] = 't', ['ŧ'] = 't', ['ƭ'] = 't', ['ƫ'] = 't', ['ʇ'] = 't', ['ʈ'] = 't', ['ṫ'] = 't', ['ṭ'] = 't', ['ṯ'] = 't', ['ṱ'] = 't',
	['þ'] = 'th',
	['ù'] = 'u', ['ú'] = 'u', ['û'] = 'u', ['ü'] = 'u', ['ü'] = 'u', ['ũ'] = 'u', ['ū'] = 'u', ['ŭ'] = 'u', ['ů'] = 'u', ['ű'] = 'u', ['ű'] = 'u', ['ų'] = 'u', ['ǔ'] = 'u', ['ȕ'] = 'u', ['ȗ'] = 'u', ['ṳ'] = 'u', ['ṵ'] = 'u', ['ṷ'] = 'u', ['ʊ'] = 'u',
	['ǖ'] = 'u', ['ǘ'] = 'u', ['ǚ'] = 'u', ['ǜ'] = 'u', ['ṹ'] = 'u', ['ṻ'] = 'u', ['ụ'] = 'u', ['ủ'] = 'u', ['ứ'] = 'u', ['ừ'] = 'u', ['ử'] = 'u', ['ữ'] = 'u', ['ự'] = 'u',
	['ʋ'] = 'v', ['ʌ'] = 'v', ['ṽ'] = 'v', ['ṿ'] = 'v',
	['ŵ'] = 'w', ['ẁ'] = 'w', ['ẃ'] = 'w', ['ẅ'] = 'w', ['ẇ'] = 'w', ['ẉ'] = 'w', ['ẘ'] = 'w', ['ʍ'] = 'w', ['ƿ'] = 'w',
	['ẋ'] = 'x', ['ẍ'] = 'x',
	['ý'] = 'y', ['ŷ'] = 'y', ['ÿ'] = 'y', ['ƴ'] = 'y', ['ʎ'] = 'y', ['ẏ'] = 'y', ['ẙ'] = 'y', ['ỳ'] = 'y', ['ỵ'] = 'y', ['ỷ'] = 'y', ['ỹ'] = 'y',
	['ź'] = 'z', ['ż'] = 'z', ['ž'] = 'z', ['ƶ'] = 'z', ['ʐ'] = 'z', ['ʑ'] = 'z', ['ẑ'] = 'z', ['ẓ'] = 'z', ['ẕ'] = 'z',

	-- Greek alphabet (and uncased symbols that were not previously folded to lowercase, but can be given tertiary folded to lowercase approximants)
	['ά'] = 'α', ['ὰ'] = 'α', ['ά'] = 'α', ['ᾰ'] = 'α', ['ᾱ'] = 'α', ['ἀ'] = 'α', ['ἁ'] = 'α',
	['ἄ'] = 'α', ['ἅ'] = 'α', ['ἂ'] = 'α', ['ἃ'] = 'α', ['ἆ'] = 'α', ['ἇ'] = 'α',
	['ᾳ'] = 'αι', ['ᾲ'] = 'αι', ['ᾴ'] = 'αι', ['ᾶ'] = 'αι', ['ᾷ'] = 'αι', ['ᾀ'] = 'αι', ['ᾁ'] = 'αι',
	['ᾂ'] = 'αι', ['ᾃ'] = 'αι', ['ᾄ'] = 'αι', ['ᾅ'] = 'αι', ['ᾆ'] = 'αι', ['ᾇ'] = 'αι',
	['έ'] = 'ε',
	['ή'] = 'η',
	['ί'] = 'ι', ['ϊ'] = 'ι', ['ΐ'] = 'ι', ['ῒ'] = 'ι', ['ῗ'] = 'ι', ['ϳ'] = 'ι',
	['ϰ'] = 'κ',
	['ό'] = 'ο',
	['ϱ'] = 'ρ',
	['ς'] = 'σ', ['ϲ'] = 'σ', ['ͻ'] = 'σ', ['ͼ'] = 'σ', ['ͽ'] = 'σ', ['ϛ'] = 'σ',
	['ύ'] = 'υ', ['ϋ'] = 'υ',
	['ώ'] = 'ω',

	-- Coptic alphabet

	-- Cyrillic alphabet
	['ӑ'] = 'а', ['ӓ'] = 'а', ['ӕ'] = 'ае',
	['ѓ'] = 'г', ['ґ'] = 'г', ['ғ'] = 'г', ['ҕ'] = 'г',
	['ё'] = 'е', ['ӗ'] = 'е', ['є'] = 'е', ['ә'] = 'е', ['ӛ'] = 'е',
	['ӂ'] = 'ж', ['ӝ'] = 'ж', ['җ'] = 'ж',
	['ҙ'] = 'з', ['ӟ'] = 'з', ['ӡ'] = 'з',
	['і'] = 'и', ['ӣ'] = 'и', ['ӥ'] = 'и',
	['ї'] = 'й', ['ј'] = 'й',
	['ќ'] = 'к', ['қ'] = 'к', ['ҝ'] = 'к', ['ҟ'] = 'к', ['ҡ'] = 'к', ['ӄ'] = 'к',
	['љ'] = 'л',
	['њ'] = 'н', ['ң'] = 'н', ['ҥ'] = 'н', ['ӈ'] = 'н',
	['ҧ'] = 'п',
	['ӧ'] = 'о', ['ө'] = 'о', ['ӫ'] = 'о',
	['ҫ'] = 'с', ['ѕ'] = 'с',
	['ҭ'] = 'т', ['ћ'] = 'т', ['ҵ'] = 'тц',
	['ў'] = 'у', ['ӯ'] = 'у', ['ӱ'] = 'у', ['ӳ'] = 'у', ['ү'] = 'у', ['ұ'] = 'у',
	['ѷ'] = 'ѵ',
	['ҳ'] = 'х',
	['ӵ'] = 'ч', ['ҷ'] = 'ч', ['ҹ'] = 'ч', ['ӌ'] = 'ч', ['ҿ'] = 'ҽ',
	['џ'] = 'щ', ['ѹ'] = 'оу',
	['ӹ'] = 'ы',
	['ѽ'] = 'ѡ', ['ѿ'] = 'ѡ',

	-- Armenian alphabet (separate the ligatures)
	['ﬓ'] = 'մն', ['ﬔ'] = 'մե', ['ﬕ'] = 'մի', ['ﬖ'] = 'վն', ['ﬗ'] = 'մխ',

	-- TODO: Georgian alphabets

	-- TODO: Hebrew abjad

	-- Arabic abjad (remove combined diacritics or replace secondary variants by primary letter) -- TODO: decompose complex ligatures
    -- Note: The "x" comments are for easier editing, better exhibiting the syntax (Latin blocks the visual reordering caused by RTL Arabic letters)
    --[[x]]['أ'] = --[[x]]'ا',
      --[[x]]['ٱ'] = --[[x]]'ا',
      --[[x]]['ٳ'] = --[[x]]'ا',
      --[[x]]['ٲ'] = --[[x]]'ا',
      --[[x]]['ا'] = --[[x]]'ا',
      --[[x]]['آ'] = --[[x]]'ا',
    --[[x]]['ب'] = --[[x]]'ب',
      --[[x]]['ٻ'] = --[[x]]'ب',
      --[[x]]['ڀ'] = --[[x]]'ب',
      --[[x]]['پ'] = --[[x]]'ب',
      --[[x]]['ت'] = --[[x]]'ب',
      --[[x]]['ٺ'] = --[[x]]'ب',
      --[[x]]['ٿ'] = --[[x]]'ب',
      --[[x]]['ټ'] = --[[x]]'ب',
      --[[x]]['ٽ'] = --[[x]]'ب',
      --[[x]]['ٹ'] = --[[x]]'ب',
    --[[x]]['ڃ'] = --[[x]]'ج',
      --[[x]]['ڄ'] = --[[x]]'ج',
      --[[x]]['چ'] = --[[x]]'ج',
      --[[x]]['ڇ'] = --[[x]]'ج',
      --[[x]]['ح'] = --[[x]]'ج',
      --[[x]]['ځ'] = --[[x]]'ج',
      --[[x]]['ڂ'] = --[[x]]'ج',
      --[[x]]['څ'] = --[[x]]'ج',
      --[[x]]['خ'] = --[[x]]'ج',
    --[[x]]['ڋ'] = --[[x]]'د',
      --[[x]]['ڈ'] = --[[x]]'د',
      --[[x]]['ډ'] = --[[x]]'د',
      --[[x]]['ڊ'] = --[[x]]'د',
      --[[x]]['ڍ'] = --[[x]]'د',
      --[[x]]['ڎ'] = --[[x]]'د',
      --[[x]]['ڏ'] = --[[x]]'د',
      --[[x]]['ڐ'] = --[[x]]'د',
      --[[x]]['ذ'] = --[[x]]'د',
      --[[x]]['ڌ'] = --[[x]]'د',
    --[[x]]['ڕ'] = --[[x]]'ر',
      --[[x]]['ڒ'] = --[[x]]'ر',
      --[[x]]['ڔ'] = --[[x]]'ر',
      --[[x]]['ږ'] = --[[x]]'ر',
      --[[x]]['ڗ'] = --[[x]]'ر',
      --[[x]]['ڑ'] = --[[x]]'ر',
      --[[x]]['ړ'] = --[[x]]'ر',
      --[[x]]['ز'] = --[[x]]'ر',
      --[[x]]['ڙ'] = --[[x]]'ر',
      --[[x]]['ژ'] = --[[x]]'ر',
    --[[x]]['ڛ'] = --[[x]]'س',
      --[[x]]['ښ'] = --[[x]]'س',
      --[[x]]['ڜ'] = --[[x]]'س',
      --[[x]]['ش'] = --[[x]]'س',
    --[[x]]['ڝ'] = --[[x]]'ص',
      --[[x]]['ڞ'] = --[[x]]'ص',
      --[[x]]['ض'] = --[[x]]'ص',
    --[[x]]['ڟ'] = --[[x]]'ط',
      --[[x]]['ظ'] = --[[x]]'ط',
    --[[x]]['ڠ'] = --[[x]]'ع',
      --[[x]]['غ'] = --[[x]]'ع',
    --[[x]]['ف'] = --[[x]]'ڡ',
      --[[x]]['ڢ'] = --[[x]]'ڡ',
      --[[x]]['ڣ'] = --[[x]]'ڡ',
      --[[x]]['ڤ'] = --[[x]]'ڡ',
      --[[x]]['ڥ'] = --[[x]]'ڡ',
      --[[x]]['ڦ'] = --[[x]]'ڡ',
    --[[x]]['ڧ'] = --[[x]]'ق',
      --[[x]]['ڨ'] = --[[x]]'ق',
    --[[x]]['ګ'] = --[[x]]'ك',
      --[[x]]['ڮ'] = --[[x]]'ك',
      --[[x]]['ڬ'] = --[[x]]'ك',
      --[[x]]['ڭ'] = --[[x]]'ك',
      --[[x]]['ک'] = --[[x]]'ك',
      --[[x]]['ڪ'] = --[[x]]'ك',
      --[[x]]['گ'] = --[[x]]'ك',
      --[[x]]['ڰ'] = --[[x]]'ك',
      --[[x]]['ڱ'] = --[[x]]'ك',
      --[[x]]['ڳ'] = --[[x]]'ك',
      --[[x]]['ڲ'] = --[[x]]'ك',
      --[[x]]['ڴ'] = --[[x]]'ك',
    --[[x]]['ڵ'] = --[[x]]'ل',
      --[[x]]['ڶ'] = --[[x]]'ل',
      --[[x]]['ڷ'] = --[[x]]'ل',
--[=[
    --[[x]]['?'] = --[[x]]'م',
--]=]
    --[[x]]['ن'] = --[[x]]'ں',
      --[[x]]['ڼ'] = --[[x]]'ں',
      --[[x]]['ڻ'] = --[[x]]'ں',
      --[[x]]['ڽ'] = --[[x]]'ں',
    --[[x]]['ه'] = --[[x]]'ۃ',
      --[[x]]['ہ'] = --[[x]]'ۃ',
      --[[x]]['ھ'] = --[[x]]'ۃ',
      --[[x]]['ۂ'] = --[[x]]'ۃ',
      --[[x]]['ە'] = --[[x]]'ۃ',
      --[[x]]['ۀ'] = --[[x]]'ۃ',
    --[[x]]['ۄ'] = --[[x]]'و',
      --[[x]]['ۆ'] = --[[x]]'و',
      --[[x]]['ۅ'] = --[[x]]'و',
      --[[x]]['ۇ'] = --[[x]]'و',
      --[[x]]['ۈ'] = --[[x]]'و',
      --[[x]]['ۉ'] = --[[x]]'و',
    --[[x]]['ۍ'] = --[[x]]'ې',
      --[[x]]['ى'] = --[[x]]'ې',
      --[[x]]['ي'] = --[[x]]'ې',
      --[[x]]['ێ'] = --[[x]]'ې',
      --[[x]]['ۑ'] = --[[x]]'ې',
      --[[x]]['ے'] = --[[x]]'ې',
      --[[x]]['ی'] = --[[x]]'ې',
      --[[x]]['ئ'] = --[[x]]'ې',
      --[[x]]['ۓ'] = --[[x]]'ې',

	-- TODO: remap other precombined letters with diacritics or ligatures (in Indic scripts...)

	-- TODO: map standard Katakana with diacritics to standard Katakana without diacritics, and drop Katakana diacritical voice marks

	-- Fullwidth basic Latin variants (excluding uppercase letters) to standard basic Latin
	['！'] = '!', ['＂'] = '"', ['＃'] = '#', ['＄'] = '$', ['％'] = '%', ['＆'] = '&', ['＇'] = "'", ['（'] = '(',
	['）'] = ')', ['＊'] = '*', ['＋'] = '+', ['，'] = ',', ['－'] = '-', ['．'] = '.', ['／'] = '/', ['：'] = ':',
	['；'] = ';', ['＜'] = '<', ['＝'] = '=', ['＞'] = '>', ['？'] = '?', ['０'] = '0', ['１'] = '1', ['２'] = '2',
	['３'] = '3', ['４'] = '4', ['５'] = '5', ['６'] = '6', ['７'] = '7', ['８'] = '8', ['９'] = '9', ['＠'] = '@',
	['ａ'] = 'a', ['ｂ'] = 'b', ['ｃ'] = 'c', ['ｄ'] = 'd', ['ｅ'] = 'e', ['ｆ'] = 'f', ['ｇ'] = 'g', ['ｈ'] = 'h',
	['ｉ'] = 'i', ['ｊ'] = 'j', ['ｋ'] = 'k', ['ｌ'] = 'l', ['ｍ'] = 'm', ['ｎ'] = 'n', ['ｏ'] = 'o', ['ｐ'] = 'p',
	['ｑ'] = 'q', ['ｒ'] = 'r', ['ｓ'] = 's', ['ｔ'] = 't', ['ｕ'] = 'u', ['ｖ'] = 'v', ['ｗ'] = 'w', ['ｘ'] = 'x',
	['ｙ'] = 'y', ['ｚ'] = 'z', ['｛'] = '{', ['｜'] = '|', ['｝'] = '}', ['～'] = '~',
	-- Fullwidth punctuation and currency symbols variants to standard punctuation and currency symbols
	['￠'] = '¢', ['￡'] = '£', ['￢'] = '¬', ['￣'] = '¯', ['￥'] = '¥', ['￦'] = '₩',
	-- Fullwidth CJK space to standard space
	['　'] = ' ',

	-- Halfwidth symbols and punctuation variants to standard symbols and punctuation
	['￨'] = '|', ['￩'] = '←', ['￪'] = '↑', ['￫'] = '→', ['￬'] = '↓', ['￭'] = '▪', ['￮'] = '○',
	['｡'] = '。', ['｢'] = '「', ['｣'] = '」', ['､'] = '、',
	-- Halfwidth small Katakana variants to standard Katakana
	['･'] = '・', ['ｦ'] = 'ヲ', ['ｧ'] = 'ア', ['ｨ'] = 'イ', ['ｩ'] = 'ウ', ['ｪ'] = 'エ', ['ｫ'] = 'オ', ['ｬ'] = 'ヤ',
	['ｭ'] = 'ユ', ['ｮ'] = 'ヨ', ['ｯ'] = 'ツ', ['ｰ'] = 'ー',
	-- Halfwidth Katakana variants to standard Katakana
	['ｱ'] = 'ア', ['ｲ'] = 'イ', ['ｳ'] = 'ウ', ['ｴ'] = 'エ', ['ｵ'] = 'オ', ['ｶ'] = 'カ', ['ｷ'] = 'キ', ['ｸ'] = 'ク',
	['ｹ'] = 'ケ', ['ｺ'] = 'コ', ['ｻ'] = 'サ', ['ｼ'] = 'シ', ['ｽ'] = 'ス', ['ｾ'] = 'セ', ['ｿ'] = 'ソ', ['ﾀ'] = 'タ',
	['ﾁ'] = 'チ', ['ﾂ'] = 'ツ', ['ﾃ'] = 'テ', ['ﾄ'] = 'ト', ['ﾅ'] = 'ナ', ['ﾆ'] = 'ニ', ['ﾇ'] = 'ヌ', ['ﾈ'] = 'ネ',
	['ﾉ'] = 'ノ', ['ﾊ'] = 'ハ', ['ﾋ'] = 'ヒ', ['ﾌ'] = 'フ', ['ﾍ'] = 'ヘ', ['ﾎ'] = 'ホ', ['ﾏ'] = 'マ', ['ﾐ'] = 'ミ',
	['ﾑ'] = 'ム', ['ﾒ'] = 'メ', ['ﾓ'] = 'モ', ['ﾔ'] = 'ヤ', ['ﾕ'] = 'ユ', ['ﾖ'] = 'ヨ', ['ﾗ'] = 'ラ', ['ﾘ'] = 'リ',
	['ﾙ'] = 'ル', ['ﾚ'] = 'レ', ['ﾛ'] = 'ロ', ['ﾜ'] = 'ワ', ['ﾝ'] = 'ン',
	-- Half-width (TL and V) Hangul to standard (TL and V) Hangul
	-- TODO: later convert them to standard Jamos, when standard Hangul TLV syllables will be decomposed to TL+V in the first map
	['ﾡ'] = 'ﾡ', ['ﾢ'] = 'ﾢ', ['ﾣ'] = 'ﾣ', ['ﾤ'] = 'ﾤ', ['ﾥ'] = 'ﾥ', ['ﾦ'] = 'ﾦ', ['ﾧ'] = 'ﾧ', ['ﾨ'] = 'ﾨ',
	['ﾩ'] = 'ﾩ', ['ﾪ'] = 'ﾪ', ['ﾫ'] = 'ﾫ', ['ﾬ'] = 'ﾬ', ['ﾭ'] = 'ﾭ', ['ﾮ'] = 'ﾮ', ['ﾯ'] = 'ﾯ', ['ﾰ'] = 'ﾰ',
	['ﾱ'] = 'ﾱ', ['ﾲ'] = 'ﾲ', ['ﾳ'] = 'ﾳ', ['ﾴ'] = 'ﾴ', ['ﾵ'] = 'ﾵ', ['ﾶ'] = 'ﾶ', ['ﾷ'] = 'ﾷ', ['ﾸ'] = 'ﾸ',
	['ﾹ'] = 'ﾹ', ['ﾺ'] = 'ﾺ', ['ﾻ'] = 'ﾻ', ['ﾼ'] = 'ﾼ', ['ﾽ'] = 'ﾽ', ['ﾾ'] = 'ﾾ',
	['ￂ'] = 'ￂ', ['ￃ'] = 'ￃ', ['ￄ'] = 'ￄ', ['ￅ'] = 'ￅ', ['ￆ'] = 'ￆ', ['ￇ'] = 'ￇ', ['ￊ'] = 'ￊ', ['ￋ'] = 'ￋ',
	['ￌ'] = 'ￌ', ['ￍ'] = 'ￍ', ['ￎ'] = 'ￎ', ['ￏ'] = 'ￏ', ['ￒ'] = 'ￒ', ['ￓ'] = 'ￓ', ['ￔ'] = 'ￔ', ['ￕ'] = 'ￕ',
	['ￖ'] = 'ￖ', ['ￗ'] = 'ￗ', ['ￚ'] = 'ￚ', ['ￛ'] = 'ￛ', ['ￜ'] = 'ￜ',
}

--[[
	Map all the enclosing marks (gc=Me), and most isolated combining diacritics (gc=Mn),
	except the spacing marks (gc=Mc) of vowels or consonants signs in scripts that are not semitic abjads
	and musical combining spacing marks (gc=Mc) even when they have a non-zero canonical combining class,
	to an empty string for computing the primary key.
]]
local toChar = mw.ustring.char
for _, codepoint in ipairs({
	-- =========== [ccc=0], [gc=Me] (enclosing marks)
	-- Cyrillic number marks
	0x0488, 0x0489, {0xA670, 0xA672},
	-- Inherited symbol marks
	0x1ABE, {0x20DD, 0x20E0}, {0x20E2, 0x20E4},
	-- =========== [ccc=1] (overlay marks)
	-- Inherited marks: tilde, (short,long) (horizontal, oblique) bar
	{0x0334, 0x0338},
	-- Inherited vedic signs
	0x1CD4, {0x1CE2, 0x1CE8},
	-- Inherited symbol marks
	0x20D2, 0x20D3, {0x20D8, 0x20DA}, 0x20E5, 0x20E6, 0x20EA, 0x20EB,
	-- =========== [ccc=7] (nukta signs)
	-- Devanagari; Bengali; Gurmukhi; Gujarati; Oriya; Kannada
	0x093C, 0x09BC, 0x0A3C, 0x0ABC, 0x0B3C, 0x0CBC,
	-- Myanmar: dot below; Balinese: rerekan; Batak: tompi; Lepcha; Javanese: cecak telu
	0x1037, 0x1B34, 0x1BE6, 0x1C37, 0xA9B3,
	-- Kaithi; Mahajani; Sharada; Khojki; Khudawadi
	0x110BA, 0x11173, 0x111CA, 0x11236, 0x112E9,
	-- Inherited: bindu below; Grantha; Newa; Tirhuta; Siddham; Takri; Dogra; Masaram: gondi
	0x1133B, 0x1133C, 0x11446, 0x114C3, 0x115C0, 0x116B7, 0x1183A, 0x11D42,
	-- Adlam
	0x1E94A,
	-- =========== [ccc=8] (Hiragana/Katakana voicing)
	-- Hiragana/Katakana sound marks: voiced, semi-voiced
	0x3099, 0x309A,
	-- =========== [ccc=10..26] (Hebrew points)
	-- Hebrew vowels, cantillation: [ccc=10..22]; modifiers: [ccc=23..26] raf, shin dot, sin dot, varika
	{0x05B0, 0x05BD}, 0x05BF, 0x05C1, 0x05C2, 0xFB1E,
	-- =========== [ccc=27..35] (Arabic points)
	-- Arabic small vowels: [ccc=30..32]; vowels: [ccc=27..34]; superscript alef: [ccc=35]; open vowels: [ccc=27..29]
	{0x0618, 0x061A}, {0x064B, 0x0652}, 0x0670, {0x08F0, 0x08F2},
	-- =========== [ccc=36] (Syriac superscript)
	-- Syriac mark: alaph
	0x0711,
	-- =========== [ccc=200] (attached below left) : none
	-- =========== [ccc=202] (attached below)
	-- Inherited marks: palatalized hook, retroflex hook, cedilla, ogonek, is
	0x0321, 0x0322, 0x0327, 0x0328, 0x1DD0,
	-- =========== [ccc=204] (attached below right) : none
	-- =========== [ccc=208] (attached left) : none
	-- =========== [ccc=210] (attached right) : none
	-- =========== [ccc=212] (attached above left) : none
	-- =========== [ccc=214] (attached above)
	-- Inherited mark: ogonek
	0x1DCE,
	-- =========== [ccc=216] (attached above right)
	-- Inherited mark: horn
	0x031B,
	-- Tibetan mark (tsa -phru)
	0x0F39,
	-- Musical symbols: stem, sprechgesang stem, flag-1..5
	--[[DISABLED(gc=Mc): 0x1D165, 0x1D166, {0x1D16E, 0x1D172},]]
	-- =========== [ccc=218] (distinct below left)
	-- Sinographic tone mark (level)
	0x302A,
	-- =========== [ccc=220] (distinct below)
	-- Inherited marks: (grave, acute) accent, (left, right) tack, left half ring, (up, down) tack, (plus, minus) sign, dot, diaeresis, ring, comma, vertical line, bridge, inverted double arch, caron, circumflex accent, breve, inverted breve, tilde, macron, line, double line, right half ring, inverted bridge, square, seagull, equals sign, double vertical line, left angle, (left right, upwards) arrow, x, (left, right) arrowhead, right arrowhead and up arrowhead, asterisk, double ring
	{0x0316, 0x0319}, {0x031C, 0x0320}, {0x0323, 0x0326}, {0x0329, 0x0333}, {0x0339, 0x033C}, {0x0347, 0x0349}, 0x034D, 0x034E, {0x0353, 0x0356}, 0x0359, 0x035A,
	-- Hebrew accents: etnahta, tipeha, tevir, atnah hafukh, munah, mahapakh, merkha, merkha kefula, darga, yerah ben yomo; mark: lower dot
	0x0591, 0x0596, 0x059B, {0x05A2, 0x05A7}, 0x05AA, 0x05C5,
	-- Arabic marks: hamza, subscript alef; vowel sign: dot; marks: wavy hamza, small seen, empty centre stop, small meem
	0x0655, 0x0656, 0x065C, 0x065F, 0x06E3, 0x06EA, 0x06ED,
	-- Syriac marks: pthaha, zqapha, rbasa, dotted zlama (horizontal, angular), hbasa, hbasa-esasa dotted, esasa, rukkakha, two vertical dots, three dots, oblique line
	0x0731, 0x0734, {0x0737, 0x0739}, 0x073B, 0x073C, 0x073E, 0x0742, 0x0744, 0x0746, 0x0748,
	-- N'ko marks: nasalization, dantayalan
	0x07F2, 0x07FD,
	-- Mandaic marks: affrication, vocalization, gemination
	{0x0859, 0x085B},
	-- Arabic marks: small waw, turned damma, curly kasra, curly kasratan
	0x08D3, 0x08E3, 0x08E6, 0x08E9,
	-- Arabic tones: one dot, two dots, loop
	{0x08ED, 0x08EF},
	-- Arabic marks: kasra with dot, (left, right) arrowhead
	0x08F6, 0x08F9, 0x08FA,
	-- Devanagari mark: stress sign anudatta
	0x0952,
	-- Tibetan astrological signs: -khyud pa, sdong tshugs; ngas bzung marks: nyi zla, sgor rtags; symbol: padma gdan
	0x0F18, 0x0F19, 0x0F35, 0x0F37, 0x0FC6,
	-- Myanmar sign: shan council emphatic tone
	0x108D,
	-- Limbu sign: sa-i
	0x193B,
	-- Buginese vowel sign: u
	0x1A18,
	-- Tai Tham mark: cryptogrammic dot
	0x1A7F,
	-- Inherited marks: x-x, wiggly line, open, double open, (light, strong) centralization stroke, parentheses
	{0x1AB5, 0x1ABA}, 0x1ABD,
	-- Balinese Musical symbol: endep
	0x1B6C,
	-- Inherited vedic tones: yajurvedic (aggravated, -, kathaka) independent svarita, candra, yajurvedic kathaka independent svarita schroeder; marks: kathaka anudatta, dot, two dots, three dots; sign: tiryak
	{0x1CD5, 0x1CD9}, {0x1CDC, 0x1CDF}, 0x1CED,
	-- Inherited marks: snake, Latin small letter r, zigzag, wide inverted bridge, almost equal to, right arrowhead and down arrowhead, triple underdot, (rightwards, leftwards) harpoon with barb downwards, (left, right) arrow
	0x1DC2, 0x1DCA, 0x1DCF, 0x1DF9, 0x1DFD, 0x1DFF, 0x20E8, {0x20EC, 0x20EF},
	-- Kayah Li tones: plophu, calya, calya plophu
	0xA92B, 0xA92C, 0xA92D,
	-- Tai Viet vowel: u
	0xAAB4,
	-- Inherited marks: ligature (left half, right half), tilde (left half, right half), macron (left half, right half, conjoining)
	{0xFE27,0xFE2D},
	-- Phaistos Disc sign: oblique stroke
	0x101FD,
	-- Coptic Epact mark: thousands
	0x102E0,
	-- Kharoshthi signs: double ring, dot
	0x10A0D, 0x10A3A,
	-- Manichaean mark: abbreviation
	0x10AE6,
	-- Sogdian marks: dot, two dots, curve, hook, long hook, resh, stroke
	0x10F46, 0x10F47, 0x10F4B, {0x10F4D, 0x10F50},
	-- Musical symbols: accent, staccato, tenuto, staccatissimo, marcato, marcato-staccato, accent-staccato, loure, double tongue, triple tongue
	--[[DISABLED(gc=Mc): {0x1D17B, 0x1D182}, 0x1D18A, 0x1D18B,]]
	-- Mende Kikakui numbers: teens, tens, hundreds, thousands, ten thousands, hundred thousands, millions
	{0x1E8D0, 0x1E8D6},
	-- =========== [ccc=222] (distinct below right)
	-- Hebrew accents: yetiv, dehi
	0x059A, 0x05AD,
	-- Limbu sign: mukphreng
	0x1939,
	-- Sinographic tone mark: entering
	0x302D,
	-- =========== [ccc=224] (distinct left)
	-- Hangul tone marks: single dot, double dot
	0x302E, 0x302F,
	-- =========== [ccc=226] (distinct right)
	-- Musical symbol: augmentation dot
	--[[DISABLED(gc=Mc): 0x1D16D,]]
	-- =========== [ccc=228] (distinct above left)
	-- Hebrew accent: zinor
	0x05AE,
	-- Mongolian letter: ali gali dagalga
	0x18A9,
	-- Inherited marks: kavyka, dot
	0x1DF7, 0x1DF8,
	-- Sinographic tone mark: rising
	0x302B,
	-- =========== [ccc=230] (distinct above)
	-- Inherited marks: (grave, acute, circumflex) accent, tilde, macron, overline, breve, dot, diaeresis, hook, ring, double acute accent, caron, vertical line, double (vertical line, grave accent), candrabindu, inverted breve, (turned, -, reversed) comma
	{0x0300, 0x0314},
	-- Inherited marks: x, vertical tilde, double overline, grave tone mark, acute tone mark
	{0x033D, 0x0341},
	-- Greek marks: perispomeni, koronis, dialytika tonos
	{0x0342, 0x0344},
	-- Inherited marks: bridge, not tilde, homothetic, almost equal to, right arrowhead, left half ring, fermata, right half ring, zigzag
	0x0346, {0x034A, 0x034C}, {0x0350, 0x0352}, 0x0357, 0x035B,
	-- Latin small letters: a, e, i, o, u, c, d, h, m, r, t, v, x
	{0x0363,0x036F},
	-- Cyrillic marks: titlo, palatalization, dasia pneumata, psili pneumata, pokrytie
	{0x0483,0x0487},
	-- Hebrew accents: segol, shalshelet, zaqef qatan, zaqef gadol, revia, zarqa, pashta, geresh, geresh muqdam, gershayim, qarney para, telisha gedola, pazer, qadma, telisha qetana, ole, iluy
	{0x0592, 0x0595}, {0x0597, 0x0599}, {0x059C,0x05A1}, 0x05A8, 0x05A9, 0x05AB, 0x05AC,
	-- Hebrew marks: masora circle, upper dot
	0x05AF, 0x05C4,
	-- Arabic signs: sallallahou alayhe wassallam, alayhe assallam, rahmatullah alayhe, radi allahou anhu, takhallus; small marks: tah, ligature alef with lam with yeh, zain
	{0x0610, 0x0617},
	-- Arabic marks: maddah, hamza, inverted damma, noon ghunna, zwarakay; small vowel signs: v, inverted v; marks: reversed damma, fatha with two dots
	0x0653, 0x0654, {0x0657, 0x065B}, 0x065D, 0x065E,
	-- Arabic small marks: ligatures (sad, qaf) with lam with alef maksura, meem initial form, lam alef, jeem, three dots, seen, rounded zero, upright rectangular zero, dotless head of khah, meem isolated form, madda, yeh, noon
	0x06D6, 0x06D7, {0x06D8, 0x06DC}, {0x06DF, 0x06E2}, 0x06E4, 0x06E7, 0x06E8,
	-- Arabic marks: centre stop (empty, rounded filled)
	0x06EB, 0x06EC,
	-- Syriac marks: pthaha, pthaha dotted, zqapha, zqapha dotted, rbasa, hbasa, esasa, rwaha, feminine dot, qushshaya, two vertical dots, three dots, oblique line, music, barrekh
	0x0730, 0x0732, 0x0733, 0x0735, 0x0736, 0x073A, 0x073D, {0x073F,0x0741}, 0x0743, 0x0745, 0x0747, 0x0749, 0x074A,
	-- N'ko tones: short (high, low, rising), long (descending, high, low, rising)
	{0x07EB,0x07F1},
	-- N'ko mark: double dot
	0x07F3,
	-- Samaritan mark: in, in-alaf, occlusion, dagesh, epenthetic yut; vowel signs: long e, e, overlong aa, long aa, aa, overlong a, long a, a, short a, long u, u, long i, i, o, sukun; mark: nequdaa
	{0x0816, 0x0819}, {0x081B, 0x0823}, {0x0825, 0x0827}, {0x0829,0x082D},
	-- Arabic word: small high ar-rub
	0x08D4,
	-- Arabic marks: small high sad, high ain, high qaf, high noon with kasra, low noon with kasra
	{0x08D5, 0x08D9},
	-- Arabic high words: ath-thalatha, as-sajda, an-nisf, sakta, qif, waqfa
	{0x08DA, 0x08DF},
	-- Arabic marks: footnote marker, sign safha
	0x08E0, 0x08E1,
	-- Arabic marks: curly (fatha, damma, fathatan, dammatan)
	0x08E4, 0x08E5, 0x08E7, 0x08E8,
	-- Arabic tones: one dot, two dots, loop
	{0x08EA, 0x08EC},
	-- Arabic marks: small high waw, fatha with ring, fatha with dot, left arrowhead, right arrowhead, double right arrowhead, double right arrowhead with dot, right arrowhead with dot, damma with dot, sideways noon ghunna
	{0x08F3, 0x08F5}, 0x08F7, 0x08F8, {0x08FB, 0x08FF},
	-- Devanagari accents: stress sign udatta, grave, acute
	0x0951, 0x0953, 0x0954,
	-- Bengali mark: sandhi
	0x09FE,
	-- Tibetan signs: nyi zla naa da, sna ldan, lci rtags, yang rtags
	0x0F82, 0x0F83, 0x0F86, 0x0F87,
	-- Ethiopic marks: gemination and vowel length, vowel length, gemination
	{0x135D, 0x135F},
	-- Khmer sign: atthacan
	0x17DD,
	-- Limbu sign: kemphreng
	0x193A,
	-- Buginese vowel sign: i
	0x1A17,
	-- Tai Tham signs: tone-1, tone-2, khuen tone-3, khuen tone-4, khuen tone-5, ra haam, mai sam, khuen-lue karan
	{0x1A75, 0x1A7C},
	-- Inherited marks: doubled circumflex accent, diaeresis-ring, infinity, downwards arrow, triple dot, parentheses, double parentheses
	{0x1AB0, 0x1AB4}, 0x1ABB, 0x1ABC,
	-- Balinese Musical symbols: tegeh kempul, kempli, jegogan, kempul with jegogan, kempli with jegogan, bende, gong
	0x1B6B, {0x1B6D, 0x1B73},
	-- Inherited vedic tones: karshana, shara, prenkha, double svarita, triple svarita, rigvedic Kashmiri independent svarita, candra, ring, double ring
	0x1CD0, 0x1CD2, 0x1CDA, 0x1CDB, 0x1CE0, 0x1CF4, 0x1CF8,0x1CF9,
	-- Inherited marks: dotted (grave, acute) accent
	0x1DC0, 0x1DC1,
	-- Inherited marks: suspension, macron-acute, grave-macron, macron-grave, acute-macron, grave-acute-grave, acute-grave-acute
	{0x1DC3, 0x1DC9},
	-- Inherited marks: breve-macron, macron-breve, ur, us
	0x1DCB, 0x1DCC, 0x1DD1, 0x1DD2,
	-- Latin small letters: flattened open a, ae, ao, av, ç, insular d, eth, g, capital G, k, l, capital L, capital M, n, capital N, capital R, r rotunda, s, long s, a, alpha, b, beta, schwa, f, l with double middle tilde, o with light centralization stroke, p, esh, u with light centralization stroke, w, ä, ö, ü
	{0x1DD3, 0x1DF4},
	-- Inherited marks: up tack, deletion, left arrowhead, left harpoon, right harpoon
	0x1DF5, 0x1DFB, 0x1DFE, 0x20D0, 0x20D1,
	-- Inherited marks: (anticlockwise, clockwise, left, right) arrow
	{0x20D4, 0x20D7},
	-- Inherited marks: three dots, four dots, left right arrow, annuity symbol, wide bridge, asterisk
	0x20DB, 0x20DC, 0x20E1, 0x20E7, 0x20E9, 0x20F0,
	-- Coptic marks: ni, spiritus asper, spiritus lenis
	{0x2CEF, 0x2CF1},
	-- Cyrillic letters: be, ve, ghe, de, zhe, ze, ka, el, em, en, o, pe, er, es, te, ha, tse, che, sha, shcha, fita, es-te, a, ie, djerv, monograph uk, yat, yu, iotified a, little yus, big yus, iotified big yus
	{0x2DE0, 0x2DFF},
	-- Cyrillic sign: vzmet
	0xA66F,
	-- Cyrillic letters: Ukrainian ie, i, yi, u, hard sign, yeru, soft sign, omega
	{0xA674, 0xA67B},
	-- Cyrillic signs: kavyka, payerok
	0xA67C, 0xA67D,
	-- Cyrillic letters: ef, iotified e
	0xA69E, 0xA69F,
	-- Bamum marks: koqndon, tukwentis
	0xA6F0, 0xA6F1,
	-- Devanagari digits: zero..nine
	{0xA8E0, 0xA8E9},
	-- Devanagari letters: a, u, ka, na, pa, ra, vi
	{0xA8EA, 0xA8F0},
	-- Devanagari sign: avagraha
	0xA8F1,
	-- Tai Viet sign: mai kang; vowel signs: i, ue; sign: mai khit, vowel signs: ia, am; tones: mai ek, mai tho
	0xAAB0, 0xAAB2, 0xAAB3, 0xAAB7, 0xAAB8, 0xAABE, 0xAABF, 0xAAC1,
	-- Inherited marks: ligature (left, right) half, double tilde (left, right) half, macron ((left, right) half, conjoining)
	{0xFE20, 0xFE26},
	-- Cyrillic half marks: titlo (left, right)
	0xFE2E, 0xFE2F,
	-- Old Permic letters: an, doi, zata, nenoe, sii
	{0x10376, 0x1037A},
	-- Kharoshthi signs: visarga, bar
	0x10A0F, 0x10A38,
	-- Manichaean mark: abbreviation
	0x10AE5,
	-- Hanifi Rohingya signs: harbahay, tahala, tana, tassi
	{0x10D24, 0x10D27},
	-- Sogdian marks: dot, two dots, curve, hook
	{0x10F48, 0x10F4A}, 0x10F4C,
	-- Chakma signs: candrabindu, anusvara, visarga
	{0x11100, 0x11102},
	-- Grantha digits: zero..six
	{0x11366, 0x1136C},
	-- Grantha letters: a, ka, na, vi, pa
	{0x11370, 0x11374},
	-- Newa mark: sandhi
	0x1145E,
	-- Pahawh Hmong marks: cim (tub, so, kes, khav, suam, hom, taum)
	{0x16B30, 0x16B36},
	-- Musical Symbol marks: doit, rip, flip, smear, bend, down bow, up bow, harmonic, snap pizzicato
	--[[DISABLED(gc=Mc): {0x1D185, 0x1D1AD},]]
	-- Greek Musical marks: triseme, tetraseme, pentaseme
	{0x1D242, 0x1D244},
	-- Glagolitic letters:
	{0x1E000, 0x1E006}, {0x1E008, 0x1E018}, {0x1E01B, 0x1E021}, 0x1E023, 0x1E024, {0x1E026, 0x1E02A},
	-- Nyiakeng Puachue Hmong tones: b, m, j, v, s, g, d
	{0x1E130, 0x1E136},
	-- Wancho tones: tup, tupni, koi, koini
	{0x1E2EC, 0x1E2EF},
	-- Adlam marks: alif lengthener, vowel lengthener, gemination, hamza, consonant modifier, geminate consonant modifier
	{0x1E944, 0x1E949},
	-- =========== [ccc=232] (distinct above right)
	-- Inherited marks: comma, left angle, dot, kavyka
	0x0315, 0x031A, 0x0358, 0x1DF6,
	-- Sinographic tone mark: departing
	0x302C,
	-- =========== [ccc=233] (distinct below two bases)
	-- Inherited marks: breve, macron, rightwards arrow, inverted breve
	0x035C, 0x035F, 0x0362, 0x1DFC,
	-- =========== [ccc=234] (distinct above two bases)
	-- Inherited marks: breve, macron, tilde, inverted breve, circumflex
	0x035D, 0x035E, 0x0360, 0x0361, 0x1DCD,
	-- =========== [ccc=240] (distinct below or right)
	-- Greek iota subscript: ypogegrammeni
	0x0345,
	-- ===========
}) do
	if type(codepoint) == 'number' then
		substDiacritics[toChar(codepoint)] = ''
	else
		for cp = codepoint[1], codepoint[2] do
			substDiacritics[toChar(cp)] = ''
		end
	end
end

local function makeSortKey(label, lang)
	-- The tertiary key is the label itself, with all of its differences preserved.
	-- Compute the secondary key: compatibility differences removed (NFKD) and
	-- letter case folded to lower case.
	-- Note: mw.ustring.lower() only uses the Unicode language-neutral case pairs,
	-- so it may not be correct for every language. Turkish, for example, would need
	-- preprocessing after decomposition to NFD or NFKD in order to preserve the
	-- primary distinction between dotted and undotted 'i' (the locale-neutral case
	-- mappings assume that the letters 'i' and 'j' are "soft-dotted").
	-- The `lang` parameter is currently unused.
	local snd = lower(toNFKD(label))
		:gsub('[\192-\223].', substLower)
		:gsub('[\224-\239]..', substLower)
		:gsub('[\240-\247]...', substLower)
	-- Compute the primary key (remove diacritics and other minor differences).
	local pri = snd
		:gsub('[\192-\223].', substDiacritics)
		:gsub('[\224-\239]..', substDiacritics)
		:gsub('[\240-\247]...', substDiacritics)
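	-- Renormalize the subkeys to NFD, then join the levels with NUL separators:
	-- a level identical to the one before it is left empty, and trailing empty
	-- levels (with their separators) are dropped.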
	pri, snd, label = toNFD(pri), toNFD(snd), toNFD(label)
	if pri == snd then
		if snd == label then
			return pri
		else
			return pri .. '\0\0' .. label
		end
	else
		if snd == label then
			return pri .. '\0' .. snd
		else
			return pri .. '\0' .. snd .. '\0' .. label
		end
	end
end
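
-- Illustrative sketch only (an assumption, not part of the original module and not
-- exported below). The composite key returned by makeSortKey() separates collation
-- levels with NUL bytes, which are not suitable for output in HTML. As described in
-- the module notes, one workaround is to replace every NUL byte by ' !' (SPACE +
-- exclamation mark). The helper name `toHtmlSortKey` is hypothetical.
local function toHtmlSortKey(key)
	-- '%z' matches a NUL byte in Lua 5.1 patterns; the outer parentheses discard
	-- the substitution count that gsub() returns as a second value.
	return (key:gsub('%z', ' !'))
end
-- Possible use: local htmlKey = toHtmlSortKey(makeSortKey(label, lang))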

-- exports
return {
	makeSortKey = makeSortKey, -- for use from Lua or for local debugging only; the raw result contains NUL separators, so do not use it directly to generate HTML, wikitext, or sort keys for MediaWiki categories!
}