Unicode collation

Unicode defines collation rules for use when sorting data; see the Unicode documentation and CPAN:Unicode::Collate. This may be important as part of UnicodeSupport, because it would finally allow Perl locales to be dropped. Locales provide pre-Unicode collation rules that depend on the locale, and they are currently used when sorting topic names and similar data. However, supporting locale-like sort orders using Unicode::Collate is more work than simply using locales.

See UseUTF8 for current discussions.

Using CPAN's Unicode collation package

CPAN:Unicode::Collate seems to make this quite easy. I don't believe we need any specific options to get the default collation order working, just some code like the attached script (which includes UTF-8 data, so it can't be embedded within this page on TWiki.org, which uses ISO-8859-1). The default sort order, with or without Unicode collation, is not very useful in many situations and will typically need customization.
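For readers without access to the attachment, here is a minimal sketch of what such a script might look like. It is not the attached script: the word list and variable names are assumptions chosen to resemble the sample data discussed on this page, and the exact position of some characters can vary with the version of the collation table shipped with Unicode::Collate.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 literals
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Sample data (an assumption): Latin-script words plus Hebrew Alef and Bet.
my @words = ('gamma', 'Øresund', 'beta', 'ב', 'Århus',
             'alpha', 'Æsop', 'א', 'Aarhus');

# No options: the collator uses the default collation order (DUCET).
my $collator = Unicode::Collate->new();
my @collated = $collator->sort(@words);

# Perl's built-in sort compares code points when no locale is in effect.
my @codepoint = sort @words;

print ">>> Sorted with Unicode Collation <<<\n";
print "$_\n" for @collated;
print ">>> Default sorting without Unicode Collation <<<\n";
print "$_\n" for @codepoint;
```

Run it from the shell with perl on a UTF-8 terminal. Unicode::Collate has been bundled with Perl for a long time, so no CPAN install should be needed on most systems.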

Those sites that need a language-specific order would need to do some customization of the collation order in a plugin, as detailed on CPAN:Unicode::Collate - this could be put into a "language pack" that is re-usable, though multi-language sites might need to merge multiple languages' sort orders, which may well conflict with each other. So the "language packs" might end up being customized for some sites, but many could simply use the standard pack for their language.
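As a rough illustration of the "language pack" idea, the sketch below uses Unicode::Collate::Locale, which ships with later releases of the Unicode::Collate distribution and provides ready-made tailorings for a number of languages. The helper name collator_for is hypothetical, and whether these stock tailorings are good enough for a given site is an assumption that would need checking.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 literals
use Unicode::Collate::Locale;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return a collator tailored for a language code.
# Unicode::Collate::Locale falls back to the default order when it has
# no tailoring for the requested locale.
sub collator_for {
    my ($lang) = @_;
    return Unicode::Collate::Locale->new(locale => $lang);
}

my @words = ('gamma', 'Øresund', 'beta', 'Århus', 'alpha', 'Æsop');

# Danish tailoring: Æ, Ø and Å sort after Z.
my $danish       = collator_for('da');
my @danish_order = $danish->sort(@words);
print join(', ', @danish_order), "\n";

# The same character can sort differently per language:
# with the default order 'ä' sorts near 'a'; in Swedish it sorts after 'z'.
my $de_before_z = collator_for('de')->lt('ä', 'z') ? 1 : 0;
my $sv_before_z = collator_for('sv')->lt('ä', 'z') ? 1 : 0;
print "de: $de_before_z, sv: $sv_before_z\n";
```

A plugin could choose the language code per site or per user. Merging several languages into one order is not covered by this sketch.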

Example script and output

See the attachment for a simple test script you can run from the shell. The output is below - the non-Roman symbols are Hebrew Alef, U+05D0, and Bet, U+05D1. This is not intended to show a particularly good sort order, just how the Unicode::Collate package works in a very simple case. Most languages/cultures will require some customization of the collation order.

Output:
>>> Sorted with Unicode Collation <<<
Aarhus
Æsop
alpha
Århus
beta
gamma
Øresund
א
ב
>>> Default sorting without Unicode Collation <<<
Aarhus
alpha
beta
gamma
Århus
Æsop
Øresund
א
ב

The sort order with Unicode Collation is not ideal for some languages. For example, in Danish, Århus would sort just after Aarhus, because Aa is equivalent to Å and these are two spellings of the same town. However, the default order works for many languages without changes and is a lot better than the default order without Unicode Collation.

See Wikipedia:Collating_sequence for examples of language variations in collation orders.

Normalisation of data before sorting

The Unicode collation algorithm specifies that normalisation is done as the first step in collation, by default (see http://unicode.org/reports/tr10/#Step_1). The Unicode::Collate package can do UnicodeNormalisation if needed, and makes this quite easy.

Even if TWiki assumes all data is in Normalisation Form C (NFC) as per W3C standards, and as planned in UnicodeNormalisation (apart from MacOS, which uses NFD), the Unicode collation standard says that all data must be converted to NFD before it is sorted. However, since this is done once per data item, it should not have a big performance impact.
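The following sketch illustrates both points: a collator with default settings treats NFC and NFD spellings of the same text as equal, and the sort key for each item can be computed once and reused. The sample strings and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Normalize qw(NFC NFD);

# "Århus" with a precomposed Å (U+00C5), and the same text decomposed.
my $nfc = NFC("\x{C5}rhus");
my $nfd = NFD($nfc);      # "A" followed by U+030A COMBINING RING ABOVE

# As plain Perl strings the two forms differ ...
my $same_string = ($nfc eq $nfd) ? 1 : 0;

# ... but a collator with default settings normalises before comparing.
my $collator       = Unicode::Collate->new();
my $same_collation = $collator->eq($nfc, $nfd) ? 1 : 0;

# Compute each sort key once, then sort on the cached keys.
my @items  = ($nfd, 'beta', $nfc, 'alpha');
my %key_of = map { $_ => $collator->getSortKey($_) } @items;
my @sorted = sort { $key_of{$a} cmp $key_of{$b} } @items;

printf "string eq: %d, collator eq: %d\n", $same_string, $same_collation;
```

Caching sort keys this way keeps the normalisation cost to one pass per item, rather than one per comparison.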

Locale information for Unicode

CLDR (http://www.unicode.org/cldr/) is a Unicode Consortium repository of locale information for a huge range of languages/cultures. It may be a good starting point for customizing collation orders for specific languages.

Searching

Ideally we would use Unicode collation rules and configuration to control searching for Unicode data in TWiki (see http://unicode.org/reports/tr10/#Searching). However, this is quite complex, with benefits only in very specific cases. Searching is also performance-critical for TWiki. Hence this is probably best left to a later phase.
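To give a feel for what collation-based searching involves, here is a small sketch using the index method of Unicode::Collate. The sample text and variable names are assumptions, and this is not a proposal for how TWiki search should work.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 literals
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# level => 1 compares base letters only, ignoring accents and case.
# normalization => undef is set because the module documentation says
# index() cannot be used while normalisation is enabled (positions would
# no longer map back onto the original string).
my $searcher = Unicode::Collate->new(normalization => undef, level => 1);

my $text   = 'Ferry from Århus to Kalundborg';
my $needle = 'arhus';

# In list context index() returns (position, length), or an empty list
# when nothing matches.
my $match = '';
if (my ($pos, $len) = $searcher->index($text, $needle)) {
    $match = substr($text, $pos, $len);
}
print "matched: $match\n";
```

Because normalisation is switched off here, the data being searched would need to be in a consistent normalisation form already, which is one reason this is more complex than plain sorting.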

Unicode does provide features outside Unicode collation for case-folding etc.
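As one example of such a feature, Perl 5.16 and later expose Unicode case folding through the fc function. The sketch below is only an illustration; the sample strings are assumptions.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 literals
use feature 'fc';         # fc() needs Perl 5.16 or later

# Full case folding maps the German sharp s to "ss", so these match.
my $equal_folded = (fc('STRASSE') eq fc('straße')) ? 1 : 0;

# Plain lowercasing leaves the sharp s alone, so these do not match.
my $equal_lc = (lc('STRASSE') eq lc('straße')) ? 1 : 0;

print "fc: $equal_folded, lc: $equal_lc\n";
```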

-- Contributors: RichardDonkin - 30 Jun 2008

Discussion

I would not worry too much about Aa and Å in Danish. I think we have learned to live with that.

But if the sorting order above is done with Unicode Collation then the library is worthless because it sorts all wrong. The order of the real words should be (ignoring the Aa detail)
Aarhus
alpha
beta
gamma
Æsop
Øresund
Århus

-- KennethLavrsen - 19 Jul 2008

It's quite feasible to get the precise collation order you want using code that makes use of Unicode::Collate to change the collation order - presumably you are talking about the Danish sort order here. See CPAN:Unicode::Collate for some examples of how this is done.

My example above only uses the default Unicode collation order (aka DUCET), and perhaps using Danish words was misleading as the default order doesn't work well for this. Some languages treat accented characters as sorting near their unaccented versions, while others, such as Danish, treat the accented characters as sorting after 'Z'.

It's worth reading Unicode Technical Report 10 on Unicode collation (http://unicode.org/reports/tr10/), which gives more background and examples on this. The collation order should really depend on the user, not the site - e.g. if a German user is looking at some data including ä, they'll expect that to sort after 'a', while a Swedish user will expect it to sort after 'z'.

-- RichardDonkin - 21 Jul 2008
 
Attachment: sort-unicode.pl.txt (1 K, 11 Jul 2008 - 06:22, RichardDonkin) - Test script for Unicode collation (includes UTF-8 data)
Topic revision: r5 - 17 May 2015, CrawfordCurrie