Module talk:Countries



Johnuniq (talk) 03:42, 9 September 2017 (UTC)


This mode (without all=1) should display no red links at all: the pages that the links point to must exist. When there are several alternative names, the link will try to avoid alternatives that are actually redirects or disambiguation pages and will attempt to use the next alternative available (alternatives are listed in the data module associated with the template).

The displayed labels will be translated in the user language, using labels defined in Wikidata if the data module specifies a "Qid=Qnnn" for the country or territory (it will also use language fallbacks using Wikidata labels in other languages if there's no exact match for the user language): if this fails (Qid not found in Wikidata), the first (untranslated) name given in the data module (and which normally matches the names used for page titles in Commons, most frequently in English) will be displayed as is.

From Category:Netherlands:

  • {{Countries of Europe|prefix=:Category:}}

From Category:1882 in Finland:

  • {{Countries of Europe|prefix=:Category:1882 in}}

From Category:August 2013 in Germany:

  • {{Countries of Europe|prefix=:Category:August 2013 in}}

From Category:310s architecture in Italy:

  • {{Countries of Europe|prefix=:Category:310s architecture in}}

From Category:Countries of Africa:

  • {{Countries of Africa|prefix=:Category:}}

From Category:Musical groups from Senegal:

  • {{Countries of Africa|prefix=:Category:Musical groups from}}

All pages

All possible links that may exist on Commons, usually with titles in English (not translated), showing the actual part of the page title which will be used. The list is sorted differently according to their title in Commons (along with alternative titles separated by '|', in order of preference, the first one being the one used by the translated links when all=1 is not used). No test is performed to see if the links are going to an existing page, so there are red links.

This all=1 mode is useful in a parent category when creating (or maintaining) subcategories with consistent names. In this mode, the text content of the generated box does not vary according to the specified prefix and suffix parameters; only the target of the links will change. Once the listed pages or categories have been maintained and are stable, you can reset "all=1" to "all=" to return to normal mode, which will be easier to navigate for users.

  • {{Countries of Europe|prefix=:Category:|all=1}}
  • {{Countries of Europe|prefix=:Category:1882 in|all=1}}
  • {{Countries of Europe|prefix=:Category:August 2013 in|all=1}}
  • {{Countries of Europe|prefix=:Category:310s architecture in|all=1}}
  • {{Countries of Africa|prefix=:Category:|all=1}}
  • {{Countries of Africa|prefix=:Category:Musical groups from|all=1}}

Name of module

Zhuyifei1999 moved the module. That's fine but I want to record that there was a reason for using Module:Country as the name. This module could easily be extended to deal with other groups of countries and could then be used to implement, for example, Template:Countries of Africa. To put it another way, there is nothing in the module that is specific to countries of Europe apart from the data tables. If it were considered desirable, I could refactor the module so "Europe" or "Africa" etc. was accepted as a parameter passed from the template. Then the module would fetch the appropriate table of data from a submodule. For example, Template:Countries of Europe would be changed from
{{#invoke:Countries of Europe|main}} to
{{#invoke:Countries|main|Europe}} where "Europe" would select the appropriate table of data.
Module:Countries or some other generic name could be used if preferred. Johnuniq (talk) 05:17, 9 September 2017 (UTC)
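A minimal sketch of the proposed parameter-driven lookup follows. It is only an illustration: the `groups` table and the `getGroupData` function are hypothetical names, and the inline tables stand in for data submodules that a real Scribunto module would more likely fetch with `mw.loadData`.

```lua
-- Hypothetical stand-ins for data submodules such as Module:Countries/Europe.
-- In Scribunto these tables would normally be fetched with mw.loadData().
local groups = {
  Europe = { title = 'Countries of Europe', names = { 'Finland', 'Germany', 'Italy' } },
  Africa = { title = 'Countries of Africa', names = { 'Senegal' } },
}

-- Return the data table for the group named in the first #invoke parameter.
local function getGroupData(args)
  local group = args[1]
  local data = groups[group]
  if not data then
    error('Unknown group of countries: ' .. tostring(group))
  end
  return data
end

-- {{#invoke:Countries|main|Europe}} would pass 'Europe' as the first argument.
local data = getGroupData({ 'Europe' })
print(data.title)
```

With this shape, adding another group of countries only means adding another data table; the shared code does not change.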

You may want to make it more maintainable in that case, e.g. use Lua's internal sorting instead of hardcoding the order for each language. For reference, the naming of this module was already challenged in phab:T171392#3583337 but was not answered afaict. --Zhuyifei1999 (talk) 05:36, 9 September 2017 (UTC)
I saw the phab post but I did not want to focus on the rather trivial issue of what the module is called, particularly given that the whole post rejected the idea of using a module. My first comment about the template/module was at Template talk:Countries of Europe#Module:Country on 1 September 2017 (five days before the phab post). My comment included '(with a generic name in case it is enhanced for other "country" purposes)' and I thought that was enough.
I wondered about having the module sort the names but there does not appear to be any reasonable way to do that for all cases as there is no way I can see to compare strings for many languages. Johnuniq (talk) 05:48, 9 September 2017 (UTC)

@Zhuyifei1999: I have started looking at how the module could be made more general to handle other groups of countries. The code may be easier than I had anticipated and I am wondering whether to proceed. That is, do you think this module should have the ability to implement templates such as the following:

{{Countries in the United Kingdom}} + {{Countries of Africa}} + {{Countries of Asia}} + {{Countries of Central America}} + {{Countries of Europe}} + {{Countries of North America}} + {{Countries of North America (subcontinent)}} + {{Countries of Oceania}} + {{Countries of South America}} + {{Countries of the Americas}} + {{Countries of the Caribbean}}

If I examined them, I may find some do not fit the mold, or there might be other reasons to continue using the existing templates. Shall I work on it a bit more and put something in the sandbox of this module? If I finish it and the result appears desirable, this module could be moved, perhaps to Module:Countries. Johnuniq (talk) 11:27, 10 September 2017 (UTC)

No objections from me --Zhuyifei1999 (talk) 16:27, 10 September 2017 (UTC)
I moved the module and have updated it with a version that uses a parameter in the template to specify which group of countries is required. At the moment it only handles Europe. In due course I will see what is involved in adding data for other groups. Johnuniq (talk) 07:57, 12 September 2017 (UTC)

Template categories and documentation

Four templates are using the module, and I'm working on another seven (my user page has a list). For simplicity and consistency, I am hoping to use the same documentation for each template. The preliminary documentation is at:

Template:Countries of Africa/doc

and it is transcluded into several other templates, for example the two mentioned below.

What categories should a Countries of template use? The following have two categories, while some others have only the first.

It looks to me as if Category:Countries of Asia should not be used for a navigation template.

@Zhuyifei1999 or others: Any thoughts on the categories or the documentation? Johnuniq (talk) 05:48, 25 September 2017 (UTC)

I'd say it's fine, as long as it use a special sorting key, such as '*' --Zhuyifei1999 (talk) 02:01, 26 September 2017 (UTC)
That doc should not be part of Africa specifically. It should be part of a generic template, so that doc should better be renamed, possibly as another subpage of the generic module if there's no generic template, and sorted in categories like the generic module. verdy_p (talk) 19:01, 16 February 2018 (UTC)
Agree, there should not be stand-alone module:Countries/Africa/doc. There should be one doc for a generic continent subpage. Incnis Mrsi (talk) 19:48, 16 February 2018 (UTC)
Disagree: the doc page of a module can perfectly be used, notably to categorize the module itself, and show proper usage, and give a basic example which should work directly when editing the module and previewing it (even if we can select another page such as a template page using the module, or a testcases page when previewing the modified module).
Modules have to be documented, even those used as data modules, and even their sandboxes! This is a question of discipline. There's nothing wrong in including in the wikitext of the "/doc" subpage of a module, a transclusion of a template using it, plus some links to related templates and categories (in the "includeonly" section to categorize the module itself and not the "/doc" subpage). verdy_p (talk) 05:49, 8 November 2018 (UTC)

Propagation of translations

@Verdy p: you probably are doing the same thing as myself. Do you have a plan? Incnis Mrsi (talk) 13:32, 16 February 2018 (UTC)

This was done only for Africa, I was asked to come here about these templates. You may do other submodules if you want, do you see a problem in what I did for Africa ? verdy_p (talk) 14:01, 16 February 2018 (UTC)
Just inquired about intentions to avoid mutual hindrance. I proceed to enrich Asia, Europe, and so on then. Incnis Mrsi (talk) 14:06, 16 February 2018 (UTC)
@Verdy p: you duplicated Bosnian text. To my Slavic intuition (Države ≈ states, powers) in context of titles.main was certainly better than (Zemlje ≈ lands, countries), and bs.Wikipedia sides with me. Incnis Mrsi (talk) 17:12, 16 February 2018 (UTC)
I did not duplicate anything, I added only what was missing... but maybe there was an unsorted entry in the template which I did not see. I have not done anything for Europe for months, and only recently worked on Africa. verdy_p (talk) 17:14, 16 February 2018 (UTC)
@Verdy p: you also injected Latin-script stuff into Kirghiz (uses Cyrillic). Check all you are about to add against d:Q15 and similar. Incnis Mrsi (talk) 18:33, 16 February 2018 (UTC)
No, what is injected (country names) comes from Wikidata, but there may be fallbacks for missing translations in Kirghiz: fill in the necessary info in Wikidata. verdy_p (talk) 18:47, 16 February 2018 (UTC)
I've checked scrupulously, there's absolutely no injected Latin for Kirghiz, in what I did. So fix what is missing in Wikidata (for the names of some countries/territories), this is not part of this module ! If you don't know how to do that, give an real example of page where something is missing (most probably coming from language fallbacks to English). I've done nothing for Kirghiz in Wikidata. verdy_p (talk) 18:50, 16 February 2018 (UTC)
I think you're speaking about this when viewing the page in Kirghiz "Башка аймактар: Bir Tawil · Ceuta · Melilla · Puntland · Sahrawi Arab Democratic Republic · Saint Helena, Ascension and Tristan da Cunha · Scattered Islands in the Indian Ocean · Somaliland · Western Sahara · Майотта аралы · Реюньон департаменти". There has never been any name in Kirghiz so far for most of them. But it's not part of that module or template: follow the links still showing English, then go to the interwiki bar to add missing translations. This is not specific to Kirghiz but applies to any language, since fallbacks are used everywhere on Commons when translations are missing. Everything is translated in Russian, French and many other languages without anything needed to handle these languages except the TWO added grouping headers that are already in Kirghiz. verdy_p (talk) 18:54, 16 February 2018 (UTC)
@Verdy p: the mess introduced and rectified (I hope, but don’t know grammar of the language). Incnis Mrsi (talk) 22:16, 16 February 2018 (UTC)


@Verdy p: Thanks for expanding the submodules but in Module:Countries/Caribbean and Module:Countries/Oceania the change to the following comment is not correct:

XYZ must consist of en alphabetic characters [A-Za-z]

While anything is possible, the code in Module:Countries requires the above. That's because of (pattern:gsub('{(%a+)}', var)) which should not be changed without good reason and extreme care. Johnuniq (talk) 00:53, 18 October 2018 (UTC)

An original comment (not written by me) specified that "[A-Za-z]" was OK so it allowed capitals there. In fact I don't see why it could not be any alphanumeric (i.e. suitable for identifiers); but "%a" already means any alphabetic (I'm not sure it's really restricted to ASCII only), and so it includes "[A-Za-z]" which is perfectly correct!
This is just a comment anyway, and for now there's nothing else than ASCII there. verdy_p (talk) 08:22, 18 October 2018 (UTC)
In case my post above is hard to interpret, the comment quoted above was written by me when I wrote the family of countries modules. The comment is correct, it is the change to the comment that is wrong. FYI the current module requires XYZ to match [A-Za-z] using only the English alphabet (ASCII letters). I posted in the hope you wouldn't change that comment in other modules. Johnuniq (talk) 04:44, 19 October 2018 (UTC)
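To illustrate the constraint discussed above, here is a small sketch of how a `gsub` call with the pattern `'{(%a+)}'` treats placeholder names. The `expand` helper and the sample variable names are assumptions made for the example, not the module's actual code.

```lua
-- Sample values for the placeholders (names chosen for illustration only).
local vars = { sectiontitle = 'Countries', sectionlist = 'Senegal' }

local function expand(pattern, values)
  -- gsub returns two values; the parentheses keep only the resulting string.
  -- With a table as replacement, the captured name is used as the lookup key.
  return (pattern:gsub('{(%a+)}', values))
end

print(expand('{sectiontitle}: {sectionlist}', vars))
-- A name containing a digit never matches '%a+', so it is left untouched.
print(expand('{section2}', vars))
```

A placeholder whose name matches but has no entry in the table is also left as is, because `gsub` keeps the original match when the lookup yields nil.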

Conditional patterns

Until now, all patterns were unconditional strings containing placeholders like '{sectiontitle}...{sectionlist}'. The problem is that we would like to remove the section title and other formatting when the {sectionlist} placeholder is empty.

I just implemented the conditional patterns (represented as ordered arrays of conditional subpatterns, instead of simple strings) to allow "smarter" usages of patterns:

  • The conditional patterns are distinguished because they are "tables" and not "strings". So this does not change any existing "country list" using simple "strings" in their patterns.
  • The tables are processed in order (the only keys are integers from 1 to N, other keys are not used, they are implicitly generated by the Lua syntax of tables) to generate their result in the same order. Each entry is a conditional subpattern, represented as an ordered array (here also, the only keys are integers 1 to N, other keys are ignored), whose first element is the subpattern, and other elements are conditions (all conditions must be true to use the subpattern in the result).
  • Each condition is for now limited to a string (this could be extended by allowing an array to implement 'or' conditions).
  • This condition string is for now limited to the name of a variable name (the same used in '{variablename}' placeholders in patterns or subpatterns).
  • This could also be extended to perform other tests on variables (given that variable names must start by a letter, and for now are limited to '{section}title' and '{section}list').

So you can now extract a '{sectiontitle}...{sectionlist}' subpattern to make it conditional (the condition being that 'sectionlist' is non empty) like this:

{'...{sectiontitle}...{sectionlist}...', 'sectionlist'}

and put this array in the ordered result containing other conditional or unconditional patterns like this:

'leading text...', --[=[ unconditional subpattern, possibly containing placeholders]=]
{'...{sectiontitle}...{sectionlist}...', 'sectionlist'}, --[=[conditional subpattern, possibly containing placeholders]=]
'...other text', --[=[ unconditional subpattern, possibly containing placeholders]=]

This ordered array may contain unconditional subpatterns (strings), or conditional subpatterns (arrays). The number of conditions after each subpattern is not limited (even if, in this example, only one is used after each subpattern).
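The evaluation of such an ordered array can be sketched as follows. This is a simplified illustration, not the code in Module:Countries: the function names `fill` and `expandPattern` are hypothetical, and only the single-condition form (one variable name that must be non-empty) is handled.

```lua
-- Substitute '{name}' placeholders from the vars table.
local function fill(text, vars)
  return (text:gsub('{(%a+)}', vars))
end

-- Expand a pattern that is either a plain string or an ordered array of
-- subpatterns; a subpattern is a string (unconditional) or a table
-- { text, variablename } used only when that variable is non-empty.
local function expandPattern(pattern, vars)
  if type(pattern) == 'string' then
    return fill(pattern, vars)
  end
  local out = {}
  for _, entry in ipairs(pattern) do
    if type(entry) == 'string' then
      out[#out + 1] = fill(entry, vars)
    else
      local name = entry[2]
      if name == nil or (vars[name] or '') ~= '' then
        out[#out + 1] = fill(entry[1], vars)
      end
    end
  end
  return table.concat(out)
end

local pattern = {
  '<div>',
  { '{sectiontitle}: {sectionlist}', 'sectionlist' },
  '</div>',
}
print(expandPattern(pattern, { sectiontitle = 'Territories', sectionlist = 'Mayotte' }))
print(expandPattern(pattern, { sectiontitle = 'Territories', sectionlist = '' }))
```

Because plain strings are still accepted at the top level, existing country lists that use a single string pattern keep working unchanged.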

Not all subpatterns need conditions. I also allowed conditions to be unions (OR) of optional conjunctions (AND) of simple conditions (variable names).

So now:

  • the "{section-title}{colon}" can be conditionally dropped if "{section-list}" is empty, or if the section has no title (in order to remove also the "{colon}" that does not need to be tested)
  • the break (or other punctuation) between sections can be conditionally dropped between two sections when one section or the other is empty; this requires using a conjunction (AND) by writing the two conditions between braces;
  • the surrounding presentation box can be conditionally dropped (with the same condition in the leading subpattern and in the trailing subpattern) when there's no content at all to display in sections (all sections are empty).
  • Every "{section-list}" can be unconditionally returned as its own subpattern (if it is empty, this causes no harm; only what surrounds it needs to be tested).

As an example see {{Countries of Africa}}, where such conditional subpatterns are used now in Module:Countries/Africa, instead of a single unconditional pattern.

verdy_p (talk) 13:23, 30 October 2018 (UTC)

Interesting. I'll try to digest that in a week or so. Is there a page where {{Countries of Africa}} is used that shows this makes a difference? Johnuniq (talk) 06:54, 31 October 2018 (UTC)
You have to look for subcategories that are partially fed: then look at the suppression of the section for territories because no territories have this category.
You may also try to use Countries of Africa in a category where there's no African country at all: the whole banner should be automatically removed (except of course if you set all=1). verdy_p (talk) 07:03, 31 October 2018 (UTC)
The principle is there, it can be used in other similar modules.
Initially I just wrote the possibility of using conditions linked by "OR" as: {subpattern, condition1, condition2, ...}
I've finally added the AND, written with a single level of {} after the subpattern as: {subpattern, {condition1, condition2, ...}, ...} (this assumes conditions are written in normalized form as unions of conjunctions).
And probably I'll add the possibility of OR/AND/OR by allowing two levels of {} such as {subpattern, {{condition1, condition2, ...}, {condition3, condition4, ...}...}...} (this allows denormalization, notably as conjunctions of unions, notably a small conjunction of long unions, because the normalized form would require longer expressions).
There's also the possibility for simple conditions to be something other than just simple variable names (the variables supported are: sectiontitle, sectionlist, lang, dir, colon), such as '!variablename' (given that variable names must start with a letter) for subpatterns used when a variable is empty (for now all simple conditions are testing if a variable is non-empty). However I won't allow arbitrary expressions in conditions. The code remains extremely fast as is, without needing to use any "operator" or any complex parser and priorities of operators (like in "#if:" expressions). verdy_p (talk) 07:24, 31 October 2018 (UTC)
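The condition forms described in this thread (top-level conditions joined by OR, a nested {} joined by AND, and a leading '!' for "is empty") could be evaluated along these lines. This is a hedged sketch under those stated assumptions: `testCondition`, `accepts` and the variable names `countrieslist` and `territorieslist` are hypothetical and not taken from the module.

```lua
-- Test one simple condition: 'name' means non-empty, '!name' means empty.
local function testCondition(cond, vars)
  local negated = cond:sub(1, 1) == '!'
  local name = negated and cond:sub(2) or cond
  local nonEmpty = (vars[name] or '') ~= ''
  return nonEmpty ~= negated
end

-- Decide whether a conditional subpattern { text, alt1, alt2, ... } is used.
-- Alternatives are joined by OR; an alternative written as a table is a
-- conjunction (AND) of simple conditions.
local function accepts(entry, vars)
  if #entry < 2 then
    return true  -- no condition at all: always used
  end
  for i = 2, #entry do
    local alt = entry[i]
    if type(alt) == 'string' then
      if testCondition(alt, vars) then return true end
    else
      local all = true
      for _, cond in ipairs(alt) do
        if not testCondition(cond, vars) then
          all = false
          break
        end
      end
      if all then return true end
    end
  end
  return false
end

local vars = { countrieslist = 'Senegal', territorieslist = '' }
-- Separator between the two sections: needs both lists to be non-empty.
print(accepts({ '<br>', { 'countrieslist', 'territorieslist' } }, vars))
-- Surrounding box: needs at least one non-empty list.
print(accepts({ '<div>', 'countrieslist', 'territorieslist' }, vars))
```

No expression parser is involved: each condition is a plain table lookup, which matches the stated goal of keeping the evaluation fast.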
Look for example in Category:Musical groups from Senegal (no "territories" are listed, so its section title is removed, as well as the break separator between the section for countries and the section for territories). verdy_p (talk) 07:34, 31 October 2018 (UTC)
Another possibility of extension is to allow the first element of unions to be not just a single string for a subpattern, but also an array being itself conditional, to allow interesting groupings (this syntax would then become recursive and it could avoid denormalized conditions).
verdy_p (talk) 08:31, 31 October 2018 (UTC)
I'll add the feature for Module:Countries/Europe (you should see then that when there are no territories in Europe with limited recognition, or no territories with special status, the label will no longer display with no countries listed after it). Here also the "Category:Musical groups from *" will then display correctly without superfluous labels. verdy_p (talk) 16:01, 31 October 2018 (UTC)
I added that description to the documentation page, in the section speaking about "Pattern", renamed "Pattern or list of subpatterns"
Now the negative conditions work (there was an initial bug that I detected only when trying to use them first for Countries of Europe, and now they are used as well for Countries of Africa). You can see their effect already in the examples shown in the Normal section at top of this page (for example in Europe), where superfluous header text or separators are no longer displayed in the navbox for sections with nothing to list. verdy_p (talk) 20:25, 31 October 2018 (UTC)

Default (automatic) sort order

See also Module:MakeSortKey for the implementation of computed sort keys. Note that sort keys are composite and use a NUL character as separator, which is not suitable for output in HTML: if you need to use it to create usable sort keys in HTML, globally replace these NUL bytes ('\0') in Lua strings by ' !' (i.e. SPACE + exclamation mark), which is the smallest string that will be preserved by HTML whitespace compression, and that still allows distinction of sort keys that are prefixes from other longer sort keys.
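The NUL replacement can be sketched in a few lines. The function name `htmlSafeSortKey` and the sample key are made up for the example; the `%z` class is the Lua 5.1 way of matching a NUL byte in a pattern (Scribunto runs Lua 5.1).

```lua
-- Replace every NUL separator in a composite sort key by SPACE + '!'.
local function htmlSafeSortKey(key)
  -- '%z' matches the byte 0; the parentheses drop gsub's second result.
  return (key:gsub('%z', ' !'))
end

-- A made-up composite key with two levels separated by NUL.
local key = 'senegal' .. '\0' .. 'Senegal'
print(htmlSafeSortKey(key))
```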

For now the automatic sort order is still very crude. It almost conforms to the UCA algorithm (of Unicode) except that it does not necessarily sort all canonically equivalent strings identically (even for the set of Latin characters). The code assumes that the text is encoded in normalized order, and in NFC form (this is true at least when base letters don't have more than one combining diacritic, and only if the diacritic is precombined with the base letter; when the diacritic is not precombined, the base letter should sort correctly at primary level with the associated lowercase base letter without any combining diacritic), and should sort also OK at secondary level if they have the same letter case, but the ordering at tertiary level (taking differences of diacritics, or differences of ligature forms) may not sort as expected (because of lack of normalization).

I plan to move the sort function into a separate module (which computes the sort key in the function "makeSortKey") which will be maintained separately. For now it works almost OK, at least for sorting page titles, provided that their translated labels (taken from Wikidata) do not use too many "exotic" letters, combining marks, special letter form variants, ligatures, or variants for punctuation marks and symbols.

For this reason, the "automatic" sort order may fail to return the expected order, but each language can fix the order by providing a modified custom order (which will override the automatic sort order, used only for English labels and for languages that have no translations but fallback to the English title in Wikidata, or to the default label indicated in the data module itself if there's no label in Wikidata, this default label which is based only on page titles that exist in Commons).

However we can do things slightly better: it's possible to implement the normalization form NFD (normally required by UCA) in pure Lua code and efficiently.

The first need will be to implement the algorithmic decomposition of Hangul syllables of type LVT and LV into Hangul jamos of type L, V and T. This can be easily done, using a simple "string:gsub()" call to detect LVT and LV syllables and replace them by 2 or 3 jamos. The pattern to use is effectively simple, because it is a single range of UTF-8 byte sequences with a static length of 3 encoded bytes between <EA,B0,80> and <ED,9E,A3>, which encode the single range of code points between U+AC00 and U+D7A3 (each 3-byte sequence can be algorithmically decoded as a single code point to decompose):

  • the first parameter of gsub() is a pattern matching the simple valid UTF-8 range <EA,B0,80> to <ED,9E,A3> to detect a single LV or LVT syllable.
  • the second parameter of gsub() is the function that receives the matched string (a single Hangul LV or LVT syllable) in its parameter and returns the replacement string made of the 2 or 3 computed jamos.
  • The function first decodes the 3 UTF-8 bytes to get the scalar value of the single Hangul syllable code point.
  • subtract 0xAC00 to give an LVT index, and perform modular arithmetic:
  • divide the LVT index by 28: the quotient gives an LV index, the remainder gives 0 if it's an LV syllable or gives a T index between 1 and 27 for the relative order of the trailing jamo;
  • divide the LV index by 21: the quotient gives the L index (between 0 and 18), the remainder gives the V index (between 0 and 20);
  • the L index, V index and the T index (if it is not zero) are converted to 2 (or 3) simple jamo code points by adding the constants for respectively the first Hangul L jamo U+1100, the first Hangul V jamo U+1161, and the first T jamo U+11A8 minus 1;
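The steps above can be sketched in pure Lua as follows. This is an illustration under stated assumptions, not the code of the module: the function names are hypothetical, the input is assumed to be valid UTF-8, and the pattern deliberately matches any 3-byte sequence with a lead byte from EA to ED, leaving the exact range check to the callback.

```lua
-- Constants of the algorithmic Hangul syllable decomposition.
local S_BASE, S_LAST = 0xAC00, 0xD7A3
local L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7  -- T_BASE is U+11A8 minus 1
local V_COUNT, T_COUNT = 21, 28

-- Encode a code point in U+0800..U+FFFF as a 3-byte UTF-8 sequence.
local function utf8char3(cp)
  return string.char(
    0xE0 + math.floor(cp / 0x1000),
    0x80 + math.floor(cp / 0x40) % 0x40,
    0x80 + cp % 0x40)
end

-- Replacement callback: receives one 3-byte sequence with lead byte EA..ED.
local function decomposeSyllable(bytes)
  local b1, b2, b3 = bytes:byte(1, 3)
  local cp = (b1 - 0xE0) * 0x1000 + (b2 - 0x80) * 0x40 + (b3 - 0x80)
  if cp < S_BASE or cp > S_LAST then
    return nil  -- not a precomposed syllable: gsub keeps the original text
  end
  local index = cp - S_BASE
  local t = index % T_COUNT             -- 0 for LV, 1..27 for LVT
  local lv = math.floor(index / T_COUNT)
  local v = lv % V_COUNT                -- 0..20
  local l = math.floor(lv / V_COUNT)    -- 0..18
  local jamos = utf8char3(L_BASE + l) .. utf8char3(V_BASE + v)
  if t > 0 then
    jamos = jamos .. utf8char3(T_BASE + t)
  end
  return jamos
end

-- Decompose every precomposed Hangul syllable found in a UTF-8 string.
local function decomposeHangul(text)
  return (text:gsub('[\234-\237][\128-\191][\128-\191]', decomposeSyllable))
end

-- U+D55C decomposes to U+1112, U+1161, U+11AB (9 bytes in UTF-8).
print(#decomposeHangul('\237\149\156'))
```

Decimal byte escapes are used instead of hexadecimal ones so that the sketch also runs on Lua 5.1; the byte-oriented "string" library is sufficient here and "mw.ustring" is not needed.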

Note that the UCD also contains some other non-algorithmic decomposition pairs for some other precomposed jamos (in the block starting at U+3131, and in the block starting at U+1100), excluding the three ranges of "simple" L, V and T jamos used by the algorithmic decomposition (needed because the canonical decompositions are not listed in the main UCD datafile).

CLDR also adds custom (non canonical) decompositions needed for collation:

  • for some Hangul symbols composed between parentheses or within circles (in the block starting at U+3200);
  • for legacy "half-width" LV hangul clusters (in a "compatibility block" from U+FFA0 to U+FFBE), which may be unambiguously mappable to standard LV jamos, except they are presented in a half-square instead of being composed in a full square, so they do not combine with any following V or T jamo in the same Hangul square;
  • for legacy "half-width" L jamos (consonants which can ambiguously also be half-width T jamos, in a "compatibility block" from U+FFC2 to U+FFDC). Unfortunately these cannot be disambiguated to standard clusters to infer the boundary between standard clusters representing a full syllable (so here also they cannot safely combine in the same Hangul square).

These last two sets of compatibility jamos require special preprocessing to perform some "guess" with a heuristic. They were used in legacy terminals or printers when it was not possible for them to support the hundreds of thousands of possible combinations (L*, V*, T*) that can make a single Hangul syllable (in fact there's an infinite number of combinations). So these terminals and printers only presented the "syllables" by splitting them in two parts, but instead of representing (L*, V*, T*) syllables, they just presented them as (L*, V*, L*) and it's impossible to determine where (L*, V*, L*) syllables are breaking (before, after, or in the middle of an L* sequence). These two sets of "half-width jamos" are then not recommended for use in Korean (not even modern Korean); they are only intended to be used on terminals or old printers (which render them by only aligning them horizontally without trying to present them in complex Hangul square layouts) working with a "monospace" font design and not needing a very large repertoire of glyphs for full syllables. These compatibility jamos are now only used as "fallbacks" if you cannot render all Hangul squares encoded normally.

The standard Hangul jamos and precomposed syllables, on the contrary, allow unambiguous separation of syllable breaks: this is extremely useful for collation (sorting) and plain text search, because these syllable breaks are inherent to the way Korean is spoken (there's a clear separation between trailing consonants and leading consonants of the next syllable), and Korean requires explicit and unambiguous separation of syllables (something not possible automatically with the two sets of compatibility jamos) to make semantic distinctions. These syllable breaks also allow line wrapping to be done accurately only on syllable break boundaries (only before L*, or after T*, or after V* if not followed by any T), i.e. at the same boundaries as standard Hangul squares.

This feature of the Korean alphabet has no equivalent in other "simple" alphabets (Latin, Greek, Cyrillic, the 3 Georgian alphabets, Armenian, Runic alphabet...) or abjads (Hebrew, Arabic, Aramaic, Phoenician, and Antique Greek, before the adoption of distinctive vowels as plain letters and the abandonment of matres lectionis in modern alphabets...), where syllable breaks are extremely complicated to determine accurately and automatically within words (because they are almost never encoded explicitly in plain text!).

By contrast, the Indic abugidas (like Devanagari), including Thai (which uses "prepended letters"), allow automatic syllable breaking (Indic scripts are calling these "syllables" under a specific name which is not strictly equivalent to what we think are single "syllables" in alphabets). The same is true for syllabaries (Ethiopic, Canadian Syllabics, Katakana, Hiragana, Bopomofo), sinograms (used in modern and traditional Chinese, Japanese, Korean, and old Vietnamese), or pictographic scripts (including hieroglyphs, and emojis).

Korean syllables are true syllables, very accurately determined, but Hangul is still a real alphabet (even if it makes two distinctive encodings for its consonants, as if there was a letter case distinction: L jamos are like "capital consonants", and T jamos are like "lowercase consonants", even if this is not visible in their shape but determined by their position in the Hangul square, where T jamos are always rendered below (L*,V*) jamos, so the difference is always visible; only the V jamos for vowels are unicameral). This Korean alphabet is in fact much simpler than Latin, very deterministic, and with far fewer exceptions.

Other alphabets (especially Latin which is even more complex than Greek or Cyrillic!) are full of exceptions and are in fact much larger with many more letters; Latin is only simple if it's restricted to the 26 bicameral letters of US-ASCII! And this is visible by the number of mappings we need to use for Latin in this module to just offer a very "crude" sort order (and it is still incomplete).

So my goal in this module will be to:

  • finish the accurate sorting of Korean (it can be completed fast without needing a lot of Lua code or large data tables);
  • separate the "makeSortKey" function in a distinct new module, maintained separately, and easier to reuse in other modules (possibly split the data tables in submodules, e.g. one per character type or script, so each part can be tested more completely and more easily);
  • keep the implementation fast for most cases (e.g. maximize the use of "string:gsub()", without necessarily depending on "mw.ustring.gsub()" which is a bit slower, even if both are now implemented in native C (by Scribunto) using "PCRE regexps" generated from the "Lua patterns" syntax; note: the internal precompilation of "Lua patterns" to "PCRE regexps" is still not cached, because Lua offers no way to "precompile" a table of substitutions given to "string:gsub()" and then use this function repeatedly with the same table of substitutions; the Lua table of substitutions should remain simple for the most frequent cases where "string:gsub()" is used: the Lua patterns must be simple and fast to compile and during matches; as much as possible, we must avoid "backtracking" for repeated characters (use '-' instead of '*' if the repeated character is not "anchored" on both sides), and complex "captures" in parentheses); for some cases, it may be faster to process short strings using simple scan loops in Lua without using gsub() with patterns;
  • implement sorting for syllabaries with modern usage, like Ethiopic, Canadian Syllabics, Bopomofo, Kanas, ... (it can be made completely and accurately, possibly using some lookup in data tables with modest size); implement a way to compare the results with what is expected with CLDR "root" locale or with the DUCET table (which have defined their test sets).
  • improve the encoding of sortkeys to make them more compact (to save resources when sorting large volumes); the best compact binary encoding (not necessarily valid UTF-8 but raw sequence of bytes with arbitrary binary values) would be useful (for internal sorting only), but an alternative encoding using "readable" strings (which are valid UTF-8) would be useful as well (e.g. to generate sort keys in sortable wikitables, without depending on complex implementations in Javascript in the browser: sort keys can be generated by the server itself and sent as part of the HTML output, e.g. in data attributes of HTML elements); this alternative could also use an even more restricted set of characters (for example as sort keys suitable in MediaWiki categories where some UTF-8 encoded characters are restricted, including ASCII whitespaces, newlines, ']' and other ASCII punctuations with special role in the MediaWiki or HTML syntax, when escaping them with numeric or symbolic character references is not the best option);
  • implement a "fast compare" function, which does not require computing and encoding sort keys, and does not require buffering weights, but can compare weights "on the fly" as they are detected.
  • implement collation/sorting in Indic abugidas (this can also be completed accurately, except for exceptions in Thai or Tibetan, which require a lookup in a not too large dictionary, that will not be implemented soon);
  • fix some missing items for the most current usages of the Latin, Greek and Cyrillic alphabets (it will remain a crude sort because actual sorting per language requires language-specific rules for their real "clusters": this module can only determine default clusters in a language-neutral way);
  • support all combining characters (used extensively in alphabets, but as well in Semitic abjads and Indic abugidas). Their tertiary weights however will be language neutral (no intent to support language-specific collation rules);
  • possibly better collate/sort punctuation and symbols (those that are part of small sets: this requires modest data tables): at least we can detect their general category in the UCD and tweak some of them, but they should then sort in binary order, always after controls, combining characters and whitespace, but before digits, letter-like symbols and "letters". "Ignorable" characters in punctuation and symbols will not be implemented (this ignorable status is language-dependent), unless this is standard in CLDR data for its "root" locale. By default, all whitespace, punctuation and symbols are NOT "ignorable", and they should sort at least using the canonical decompositions, and then the compatibility decompositions.
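As a rough illustration of the "fast compare" item in the list above, here is a minimal sketch in plain Lua. It is not the module's code: the names weight and fastCompare are hypothetical, and the weight function is a deliberately tiny stand-in (ASCII case folding only) for real collation weights. The point is only that two strings can be compared weight by weight, stopping at the first difference, without building or buffering a sort key.

```lua
-- Hypothetical primary weight for one byte: ASCII uppercase letters
-- are folded to lowercase, every other byte keeps its own value.
local function weight(byte)
  if byte >= 65 and byte <= 90 then
    return byte + 32
  end
  return byte
end

-- Compare two strings "on the fly": returns -1, 0 or 1.
-- Primary weights decide first; the first raw byte difference is only
-- remembered as a tie-breaker when all primary weights are equal.
local function fastCompare(a, b)
  local n = math.min(#a, #b)
  local tie = 0
  for i = 1, n do
    local ba, bb = a:byte(i), b:byte(i)
    local wa, wb = weight(ba), weight(bb)
    if wa ~= wb then
      return wa < wb and -1 or 1
    end
    if tie == 0 and ba ~= bb then
      tie = ba < bb and -1 or 1
    end
  end
  if #a ~= #b then
    return #a < #b and -1 or 1
  end
  return tie
end
```

Such a function could be passed to table.sort as function(x, y) return fastCompare(x, y) < 0 end. A real version would iterate over collation elements rather than bytes.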

Non-goals for now:

  • Sinograms (Han) and other ideographic (Yi) or pictographic scripts (Emojis) are too complex to be implemented (very large character set, and no strict rules, existing rules do not cover everything, there are many exceptions, even if syllable breaks are easy to determine accurately; there are also several competing conventions to sort them according to user preferences, all of them require lookup in a large dictionary to build for each collation style). To sort Chinese labels for example, or some Japanese labels that include sinograms, you still need to use explicit order (in lists of codes). Chinese also depends on the implemented "transliteration" rules between Traditional and Simplified sinograms (and this transliteration is in constant evolution and requires large data tables).
  • Historic scripts (like Runic, Egyptian or Mayan hieroglyphs, ...), invented artistic scripts possibly not even encoded in Unicode (like Klingon), and pictographic scripts (like Visible Speech or Duployan shorthand, musical symbols, or various symbols like arrows, box-drawing characters and mathematical operators), notably those that have no part encoded in the BMP, will not be supported unless there is demonstrated usage in Commons, support in Wikidata for translated labels in these languages, and a reasonably agreed convention for sorting them. They will sort only in the binary order of scalar values for each encoded code point.
  • Per-language collation (i.e. "tailorings" in CLDR) will not be implemented (it requires a lot of data tables, one per language, and there are thousands of languages!). Basically it will mostly be a reasonable subset of what the DUCET includes and what CLDR supports in its "root" locale. If you want accurate per-language sorting, you still need to use an explicit order (in lists of codes).
  • Maybe this module could later use CLDR-based collators (at least with the "root" locale), but only if ICU is made available in Lua (and its inclusion in MediaWiki includes support for collators), in order to be (possibly) faster and a bit more accurate.

verdy_p (talk) 17:29, 2 November 2018 (UTC)


Category:Pages with script errors has several newly-added pages which use this module. Any idea how to fix them? --Jarekt (talk) 19:01, 6 November 2018 (UTC)

To trim some /docs, moving the bulk of examples to separate pages? Incnis Mrsi (talk) 19:14, 6 November 2018 (UTC)
@Verdy p: Please remove some of the recent additions at Template:Countries of Africa/doc which currently shows "Expensive parser function count: 493/500" in the NewPP report. That is causing "Lua error: too many expensive function calls." in the template pages listed at Category:Pages with script errors. Johnuniq (talk) 22:53, 6 November 2018 (UTC)
Even if you had not alerted me, I was already aware of this temporary problem, which occurs only on a single doc (and on the template pages showing the shared doc page). I could not continue today because I had to be absent (so I did not post other changes that would have impacted all other pages without deeper tests). It's challenging to do tests, and a single (shared) doc page for all does not help.
Well these are additional tests. Typical pages don't use so many boxes. This just affects the doc page because it tests several other country lists, and not just one (always the same).
Ideally the doc page should only test the template being described. The doc page needs to become more local instead of sharing everything (and citing always Africa as an example for all). This is work in progress but does not affect any usability of the templates themselves in any other page.
Ideally I'd like to extend this template to support more similar templates (not just countries), and "Module:Countries" should be named something like "Module:NavListOf", with its associated template named "Template:NavListof" (with several possible redirecting aliases like "Template:List of" or "Template:Listof").
Supporting more will allow converting former template-only implementations of similar navboxes, e.g. {{NavListOf|states of the United States}}, {{NavListOf|ISO 3166-1 countries and territories}}, {{NavListOf|official languages of the European Union}}, {{NavListOf|chemical elements}}. The data modules should probably no longer be named "Module:Countries/area name" but "Module:NavListOf/countries of area name" or "Module:area name/list of countries".
There are still some missing options to generalize it for navboxes using compactable enumeration lists. Some of the generalization already exists with the new "conditional subpatterns", which extend the former patterns.
  • I also intend to have the two kinds of "bullets" turned later into parameterizable subpatterns, or
  • to integrate left or right images (this can already be done in the existing pattern) but with translatable labels as well: the translated labels can be "{variablename}" in subpatterns, where the variable is associated with a list of translations or with a Qid, as for lists of country codes, or with a list of translatable "section titles" using "{sectiontitle}" in subpatterns.
So the layout is already a bit more adaptive than it was at first, and it can still be generalized for more uses.
I'm just about to move the "sorter" code to a separate module maintained separately (this also needs independent work and can be improved a lot as well), because it has other interesting uses for unrelated modules (not generating navboxes) to present their results, e.g. to show a list of countries, regions or cities in the infobox module, where they should be displayed and ordered by their translated name.
This module should then become much simpler to maintain, while keeping the concept of data modules (which themselves could be extended later to make more use of Wikidata). verdy_p (talk) 00:02, 7 November 2018 (UTC)
To trim these errors (which categorize non-sandbox pages when the errors may actually be caused by the sandbox only), I've separated the testcases. Test cases may also break if we make too many of them in the doc page itself, causing the base template or module to be reported as well; such testcases should then be split into multiple pages.
So I'll split testcases for the normal template/module (which should all run without errors) from testcases that compare results with the normal template/module: this should then only categorize these sandbox testcases.
  • there are no pages with "script error" caused by this module (or any data module and associated template, and not even its sandbox), as reported above.
  • there are for now temporary "autocategorization errors" in the sandbox only, which I use to test the categorization of errors like this. Don't worry about that, I'll remove the error in the sandbox version, but this is the role of a sandbox to capture errors. verdy_p (talk) 07:56, 7 November 2018 (UTC)
Thank you. --Jarekt (talk) 13:24, 7 November 2018 (UTC)
Note that Module:MakeSortKey was separated (this will facilitate its reuse in other modules to generate sorted lists of translated items, including for infoboxes) and is now used by Module:Country for sorting the labels. It will be improved to support more languages. I'll document it later.
There are other things to do in Module:Countries to generalize it (see my comments above).
The existing data modules now include their translated labels (for section titles) at the top. Maybe we can derive some of these titles from Wikidata as well.
There are some new options for handling each item, for use with something other than countries: there are interesting cases for country subdivisions, for cultural topics, etc. Consider all existing category names like "X by Y" containing only subcategories, where a navbox could be used to link them to each other using {{#invoke:NavListOf|Module:Y's of X|prefix=:Category:}}, and whose main gallery page would use the same data module with {{#invoke:NavListOf|Module:Y's of X}} (here "NavListOf" refers to "Module:NavListOf", the name to which I'd like to rename Module:Countries).
To do this coherently, the
countries = { AD = {...}, ..., FR = {...}, ..., ZW = {...}, },
property in data modules should become
codes = { AD = {...}, ..., FR = {...}, ... , ZW = {...}, },
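A minimal sketch of how a module could accept both property names during such a rename, so that existing data modules keep working. The helper name getCodes and the mock tables are hypothetical and are not taken from the actual module code:

```lua
-- Hypothetical helper: return the item table of a data module,
-- preferring the new "codes" key and falling back to the old
-- "countries" key (or to an empty table if neither is present).
local function getCodes(data)
  return data.codes or data.countries or {}
end

-- Two mock data modules with placeholder contents.
local oldStyle = { countries = { FR = { "France" } } }
local newStyle = { codes = { FR = { "France" } } }
```

With this fallback, data modules could be converted one at a time instead of all at once.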
verdy_p (talk) 19:35, 7 November 2018 (UTC)

Passing a custom data module

I implemented a way to pass a data module which is not necessarily named Module:Countries/id but can be any other module. The id will use the default prefix if it does not contain any colon. Otherwise you must specify the full page name of the data module.
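The rule described above can be sketched as follows. This is only an illustration under assumptions: the function name resolveDataModule is hypothetical, and the default prefix is taken to be "Module:Countries/" as implied by the naming mentioned above.

```lua
-- Assumed default prefix for data modules (from the naming described above).
local DEFAULT_PREFIX = "Module:Countries/"

-- Hypothetical helper: an id without a colon gets the default prefix;
-- an id containing a colon is treated as a full page name.
local function resolveDataModule(id)
  -- Plain (non-pattern) search for a literal colon.
  if id:find(":", 1, true) then
    return id
  end
  return DEFAULT_PREFIX .. id
end
```

For example, resolveDataModule("Africa") gives "Module:Countries/Africa", while resolveDataModule("Module:Departments of France") is returned unchanged.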

There's an example working now for "Module:Departments of France" (which also uses subpatterns to allow several conditional sections, plus show examples of additional marks and notes added after each item link of the generated list).

Passing default options to build a template

Now it is possible to pass options directly in the {{#invoke:Countries|id or full module page name|...}} call. These default options can be overridden by a template using it.

There's an example working now for "Template:Departments of France", which defines the default option "showcode=yes", but the template can be called with the same parameter to override the default value used in the template's code.
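The override behaviour can be sketched in plain Lua as a simple merge of two option tables. This is an assumption about the general approach, not the module's actual code, and the name mergeOptions is hypothetical. In Scribunto, the defaults would typically come from frame.args (the #invoke call) and the overrides from frame:getParent().args (the template call).

```lua
-- Hypothetical helper: copy the defaults, then let any option
-- given by the caller replace the default value.
local function mergeOptions(defaults, overrides)
  local options = {}
  for key, value in pairs(defaults) do
    options[key] = value
  end
  for key, value in pairs(overrides) do
    options[key] = value
  end
  return options
end

-- Defaults as set in the template's #invoke, overridden by the caller.
local merged = mergeOptions({ showcode = "yes" }, { showcode = "no" })
```

Here merged.showcode is "no", and calling mergeOptions with an empty overrides table keeps the default "yes".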

Comparing performance, the template now using this module (and its associated data module) is much faster, uses less memory on the server, and is more powerful than the previous version, which was based only on the complex wiki syntax of ParserFunctions (so it is also much simpler to maintain the data module). verdy_p (talk) 05:52, 8 November 2018 (UTC)


Please note that the modules share a lot of functionality and it may be worth thinking about implementing missing functionality here or there, to let Countries depend on Catnav module after the latter has maybe learned a sorting stage, some more on the bidi (i.e. lang related) features and how to pull its input from countries' data modules (if a wrapper to prepare arguments is unacceptable). It may avoid double work in the long run and provide a common option interface. Written just to raise awareness of the similarities, not to put pressure on anyone to actually do migration efforts. -- 00:41, 25 February 2019 (UTC)

Interesting, thanks. I found one example although I have not looked at the module or what it does. Perhaps you could outline differences? The example is {{States of Germany}} which gives:
Johnuniq (talk) 23:08, 25 February 2019 (UTC)
I suggest reading the doc at Module:Catnav/doc, or looking at the old Template:Catnav which it is a replacement of. Greetings. -- 23:00, 27 February 2019 (UTC)
Other examples currently are {{Districts of Brandenburg}}, {{Districts of Saarbrücken}}, {{Districts of Saarland}}, {{Districts of Saxony}}, {{Districts of Saxony-Anhalt}}, {{Districts of Schleswig-Holstein}}, {{Districts of Thuringia}} to name a few, all of which are multilingual using LangSwitch first and Wikidata labels thereafter. -- 23:04, 27 February 2019 (UTC)
"Depending": well, not really; what is meant is to merge the functionalities (which are documented). Notably this module makes use of conditional sections (which Catnav still does not have), uses translation, locale-sensitive sorting and punctuation, and behaves correctly with bidirectional text (correct layout). It also handles aliases, but for now aliases cannot handle the "the=" parameter (this will be fixed so that it can be set on a name other than the first one given). The only thing still not present is the addition of icons on the right side, but they can already be implemented without changing the code of this module (only the patterns in the data module change to integrate the icon in the layout; I already have a working demo for that).
Note that the documentation already says that this can be generalized to things other than lists of countries, and it is already used for that, e.g. for departments and regions of France, CRTs, Olympic teams (nations or special), and chemical elements. One Lua table element is "countries" and should just be named "codes", but this does not change the functionalities (and the doc already acknowledges that).
I have already suggested that it should be renamed "ListOf" or "NavOf" or "NavListOf". Also it is not restricted to navigation in categories (it can navigate also across pages) so the suggested name "Catnav" for the merge is inappropriate.
Your additions are non-essential (such as icons per item, or an icon on the left/right, mostly tested for Germany and in German or English), but they can be integrated in a lighter version. I don't think that the option with only icons and no text is useful: it takes a lot of space and is not suitable for category navigation or for stacking multiple navboxes. Recognize that flags are not meaningful for many people, except for local inhabitants who recognize them; coats of arms are worse, and most of them are not even official. Regional flags are also often not official, so these can just be visual "hints" that we are on the proper category; translating the texts matters much more.
And you're still not ready with Bidi (for Hebrew/Arabic).
verdy_p (talk) 18:37, 23 May 2019 (UTC)