Unicode collation
Unicode defines collation rules for use when sorting data - see the Unicode documentation and CPAN:Unicode::Collate. This may be an important part of UnicodeSupport, because it would finally allow Perl locales to be dropped (locales provide pre-Unicode collation rules, which are used when sorting topic names etc). However, supporting locale-like sort orders using Unicode::Collate is more work than simply using locales.
See UseUTF8 for current discussions.
Using CPAN's Unicode collation package
CPAN:Unicode::Collate seems to make this quite easy. I don't believe we need any specific options to get the default collation order working, just some code like the attached script (which includes UTF-8 data, so it can't be embedded within this page on TWiki.org, which uses ISO-8859-1). The default sort order, with or without Unicode Collation, is not very useful in many situations and will typically need customization.
Sites that need a language-specific order would have to customize the collation order in a plugin, as detailed on CPAN:Unicode::Collate. This could be put into a re-usable "language pack", though multi-language sites might need to merge several languages' sort orders, which may well conflict with each other. So the "language packs" might end up being customized for some sites, but many sites could simply use the standard pack for their language.
Example script and output
See the attachment for a simple test script you can run from the shell. The output is below; the non-Roman symbols are Hebrew Alef (U+05D0) and Bet (U+05D1). This is not intended to show a particularly good sort order, just how the Unicode::Collate package works in a very simple case. Most languages/cultures will require some customization of the collation order.
Output:
>>> Sorted with Unicode Collation <<<
Aarhus
Æsop
alpha
Århus
beta
gamma
Øresund
א
ב
>>> Default sorting without Unicode Collation <<<
Aarhus
alpha
beta
gamma
Århus
Æsop
Øresund
א
ב
The sort order with Unicode Collation is not ideal for some languages (e.g. in Danish, Århus would sort just after Aarhus, because Aa is equivalent to Å and these are two spellings of the same town), but it works for many languages without changes and is a lot better than the default order without Unicode Collation.
See Wikipedia:Collating_sequence for examples of language variations in collation orders.
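To make the difference between an untailored order and a language-specific one concrete, here is a minimal sketch. It is written in Python with the standard library only, it does not use Unicode::Collate, and it does not implement the Unicode collation algorithm. The `DANISH_ALPHABET` string and the `danish_key` helper are hypothetical names invented for this illustration, and the key deliberately ignores the Aa/Å equivalence.

```python
import unicodedata

# Sketch only: plain Python, not Unicode::Collate and not the full algorithm.
WORDS = ["gamma", "Århus", "beta", "Øresund", "alpha", "Æsop", "Aarhus"]

# Hypothetical, simplified Danish alphabet: æ, ø and å follow z.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: index for index, letter in enumerate(DANISH_ALPHABET)}


def danish_key(word):
    """Map each letter to its position in DANISH_ALPHABET, ignoring case."""
    # Compose first so that Å is a single code point before lower-casing.
    letters = unicodedata.normalize("NFC", word).lower()
    # Characters outside the alphabet sort after it, in code point order.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in letters]


# Plain code point order, i.e. sorting without any collation rules.
codepoint_order = sorted(WORDS)

# Order produced by the simplified Danish key.
danish_order = sorted(WORDS, key=danish_key)

print(codepoint_order)
print(danish_order)
```

The first list reproduces the "Default sorting without Unicode Collation" output above for the Latin-script words; the second places Æ, Ø and Å after z. A real language pack would tailor the collation table itself rather than use a hand-written key like this one.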
Normalisation of data before sorting
The Unicode collation algorithm specifies that normalisation (http://unicode.org/reports/tr10/#Step_1) is done as the first step in collation, by default. The Unicode::Collate package can do UnicodeNormalisation if needed, and makes this quite easy.
Even if TWiki assumes all data is in Normalisation Form C (NFC) as per W3C standards, and as planned in UnicodeNormalisation (apart from MacOS, which uses NFD), the Unicode collation standard says that all data must be converted to NFD before it is sorted. However, since this is done once per data item, it should not have a big performance impact.
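The following sketch shows why this normalisation step matters. It uses Python's standard `unicodedata` module rather than Unicode::Collate, and the variable and function names are made up for the example. Note that the final sort is by code point on NFD text, which is only a stand-in for real collation.

```python
import unicodedata

composed = "\u00c5rhus"     # Å as one precomposed code point (NFC form)
decomposed = "A\u030arhus"  # A followed by COMBINING RING ABOVE (NFD form)

# The two spellings look identical but differ code point by code point.
raw_equal = composed == decomposed


def nfd(text):
    """Return text in Normalisation Form D."""
    return unicodedata.normalize("NFD", text)


# After normalisation the two spellings compare equal.
nfd_equal = nfd(composed) == nfd(decomposed)

# sorted() calls the key function once per item, so each string is
# normalised once, not once per comparison.
ordered = sorted([composed, "Aarhus", decomposed], key=nfd)

print(raw_equal, nfd_equal)
print(len(composed), len(nfd(composed)))
```

Here `raw_equal` is false and `nfd_equal` is true, and the NFD form of the composed string is one code point longer than the original.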
CLDR (http://www.unicode.org/cldr/) is a Unicode Consortium repository of locale information for a huge range of languages/cultures. It may be a good starting point for customizing collation orders for specific languages.
Searching
Ideally we would use Unicode collation rules and configuration to control searching (http://unicode.org/reports/tr10/#Searching) for Unicode data in TWiki. However, this is quite complex, with benefits only in very specific cases. Searching is also performance-critical for TWiki. Hence this is probably best left to a later phase.
Unicode does provide features outside Unicode collation for case-folding etc.
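As an illustration of case folding used for matching without any collation rules, here is a small sketch in Python using only the standard library. It normalises, case-folds, then normalises again; `search_key` and `matches` are hypothetical helpers for this example, not TWiki or Unicode::Collate functions.

```python
import unicodedata


def search_key(text):
    """Normalise to NFD, case-fold, then normalise again."""
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded)


def matches(needle, haystack):
    """Caseless substring test on the folded, normalised forms."""
    return search_key(needle) in search_key(haystack)


# Case folding maps ß to ss, so these two spellings share a key.
same_street = search_key("STRASSE") == search_key("Straße")

# Precomposed upper-case Å and decomposed lower-case å also share a key.
same_town = search_key("\u00c5RHUS") == search_key("a\u030arhus")

# Accents are kept, so Århus and Arhus remain different.
accent_kept = search_key("Århus") != search_key("Arhus")

print(same_street, same_town, accent_kept)
print(matches("århus", "Welcome to ÅRHUS"))
```

This gives caseless matching that is also insensitive to NFC/NFD differences, but it is not language-aware: it knows nothing about, say, Aa and Å being equivalent in Danish, which is where collation-based searching would come in.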
--
Contributors: RichardDonkin - 30 Jun 2008
Discussion
I would not worry too much about Aa and Å in Danish. I think we have learned to live with that.
But if the sorting order above is done with Unicode Collation then the library is worthless because it sorts all wrong. The order of the real words should be (ignoring the Aa detail)
Aarhus
alpha
beta
gamma
Æsop
Øresund
Århus
--
KennethLavrsen - 19 Jul 2008
It's quite feasible to get the precise collation order you want by writing code that uses Unicode::Collate to change the collation order - presumably you are talking about the Danish sort order here. See CPAN:Unicode::Collate for some examples of how this is done.
My example above only uses the default Unicode collation order (aka DUCET), and perhaps using Danish words was misleading, as the default order doesn't work well for them. Some languages sort accented characters near their unaccented versions, while others, such as Danish, sort them after 'Z'.
It's worth reading Unicode Technical Report 10 (http://unicode.org/reports/tr10/) on Unicode collation, which gives more background and examples on this. The collation order should really depend on the user, not the site - e.g. if a German user is looking at some data including ä, they'll expect that to sort after 'a', while a Swedish user will expect it to sort after 'z'.
--
RichardDonkin - 21 Jul 2008