Feature Proposal: Provide relevant locale support in pure Foswiki code and eliminate perl/OS locale
Motivation
We only use LC_CTYPE and LC_COLLATE; they can easily be replaced by alternatives that are less buggy and that we can write tests for.
(see rev=9 and earlier for original proposal)
Description and Documentation
LC_COLLATE should be replaced with Unicode::Collate::Locale, which, as the name suggests, allows sorts that match the chosen locale. In addition it allows that locale to be tailored to each user, or even to each topic, search, table column etc. However, as this is a first step, I only propose a config var to denote the collate locale to use.
As the core is now Unicode, LC_CTYPE offers nothing. All Unicode characters are categorised into particular character types, and these are constant regardless of any locale in effect. Locale-dependent classification only made sense when a byte (or bytes, in various multi-byte character sets) with a given value represented a different character in different locales and hence might be classified differently.
Examples
Impact
Implementation
Remove locale code throughout core and update docs relating to i18n & l10n.
Add Unicode::Collate::Locale and replace the existing sorts that use NFKD. Unicode::Collate::Locale will be used conditionally, falling back to NFKD if it is not installed.
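A minimal sketch of how that conditional use might look; the {CollationLocale} key and the sortTopics helper are illustrative names, not existing code:
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Prefer Unicode::Collate::Locale if installed, otherwise fall back to NFKD
my $collator = eval {
    require Unicode::Collate::Locale;
    Unicode::Collate::Locale->new(
        locale        => $Foswiki::cfg{CollationLocale} || 'en',
        normalization => 'NFKD',
    );
};

sub sortTopics {
    my @topics = @_;
    return $collator->sort(@topics) if $collator;
    return sort { NFKD($a) cmp NFKD($b) } @topics;
}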
--
Contributors: JulianLevens - 05 Mar 2016
Discussion
In MooFoswikiPm, VadimBelman suggests that locale usage should be removed anyway, but I am not so sure that's true. We currently consider LC_CTYPE, and perl does support this with different UTF-8 locales. LC_COLLATE is a separate issue, as it is recommended to use the Unicode::Collate module (see http://perldoc.perl.org/perllocale.html#Unicode-and-UTF-8 for details). However, at the very least the above eliminates the key structural issues we have with locales.
I wondered about not using environment variables, e.g.
package Foswiki::Locale;

use strict;
use warnings;
use parent 'Import::Base';

our $Use  = 0;
our $Site = 'C';

{    # Eliminate absolute dependency on Foswiki::cfg
    package Foswiki;
    our %cfg;
}

our @IMPORT_MODULES = ( $Foswiki::cfg{UseLocale} // $Use ) ? ('locale') : ();

1;
And amend Foswiki::Locale::Load to use $Foswiki::Locale::Use and $Foswiki::Locale::Site accordingly.
In normal Foswiki operation that works fine. However, if you are working from within a tool script and need more specific control of the Foswiki::Locale parameters, then you would need to:
BEGIN {
    $Foswiki::Locale::Use  = 1;
    $Foswiki::Locale::Site = 'en_GB.utf8';    # Or from command-line or whatever
}
use Foswiki::Locale;    # Or require and import to eliminate the above BEGIN block, if that can work with Import::Base
The above has the following dependencies: Import::Base -> Import::Into -> Module::Runtime.
They are all pure perl and Module::Runtime is a dependency of Moo anyway.
As it's pure perl, it may be possible to eliminate Import::Base and, with more effort, Import::Into, but I'm not sure that's worth it.
Note that Moo enforces strictures on all modules by importing it wherever use Moo; is called. Therefore, I wonder if Foswiki::Object within the Foswiki::Moo-niverse could use Import::Base (or similar) to automatically import locales as appropriate wherever Foswiki::Object or its sub-classes are used. This would even eliminate the need for use Foswiki::Locale; in every module.
--
JulianLevens - 05 Mar 2016 - 09:37
After more testing I discovered that, because we introduced NFKD sorting, LC_COLLATE has become effectively void, or gives inconsistent sorting where NFKD has not been applied in all code.
If we wish to re-introduce locale-sensitive collation back into Foswiki, then all of those sorts will need to be changed to:
use Unicode::Collate::Locale;
my $collator = Unicode::Collate::Locale->new( locale => 'de', normalization => 'NFKD' );
...
my @sorted = $collator->sort( @unsorted );
...
my $compare = $collator->cmp( $a, $b );
The above also allows different users to have their own collation rules. Indeed this could even be offered as a UI option; topic by topic or table-column by table-column.
The only other use of locale is LC_CTYPE, and I am not sure how that impacts Foswiki in general, and Foswiki with {Site}{Charset}='utf-8' in particular.
Can we indeed just remove old-style locale from Foswiki?
--
JulianLevens - 05 Mar 2016 - 14:16
I'm 100% for use Unicode::Collate::Locale;. It is CLDR based, so the module is only "as good" as its maintainer - and (AFAIK) it now uses the Unicode 7.0 DUCET. The latest is 8.0, but the changes aren't very relevant for the "common" locales - so the module's condition is good.
--
JozefMojzis - 07 Mar 2016
Actually, the locale issue is wider than what is discussed here. On today's release meeting I said that LC_ALL would be great to have. Briefly it means things like:
- Localized dates (something like %DATETIME{epoch}% would be needed for this)
- Time zones are not really related to localization but are nice to have on installations with a geographically dispersed userbase.
- SpreadSheetPlugin resembles desktop spreadsheets so much that one would expect it to support localized currency, for example.
- The ability to use local translations in addition to the system-wide ones. I.e. if I'm able to choose between two, three, ... interface languages, then I'd like to be able to translate table headers, button text or labels too.
What I'm trying to say here is that even though utf8+gettext support is a huge breakthrough in the internationalization of Foswiki, in general it still remains pretty much US/English-centric and bound to a single location.
--
VadimBelman - 07 Mar 2016
Localized "things" ==
CLDR. Perl:
CPAN:Locale::CLDR .
--
JozefMojzis - 07 Mar 2016
After reading through the Development.ReleaseMeeting02X02_20160307 IRC log, especially this part:
(08:52:28 AM) JulianLevens: Does LC_TYPE still need to be supported?
(08:52:37 AM) jast: LC_CTYPE, I'm guessing?
(08:52:57 AM) jast: well, isn't it kind of the point of locales to get language-specific treatment of strings?
(08:53:30 AM) jast: in addition to CTYPE, what about case folding? what about date and number formats? etc.
(08:53:32 AM) JulianLevens: true, but could LC_TYPE be handled another way?
(08:53:57 AM) JulianLevens: It's global in nature at the moment
(08:54:22 AM) JulianLevens: sorry, LC_CTYPE
(08:54:23 AM) jast: maybe. I don't know. all I'm saying is that simply dropping locales with no replacement for most of its properties is maybe not the best way to go about it
(08:54:52 AM) JulianLevens: We have already dropped LC_COLLATE effectively
(08:54:52 AM) jast: if we do find ways to get all things locale without actually using POSIX locale support, great, I'm all for it
(08:55:21 AM) JulianLevens: Do we know of sites using locales?
(08:55:40 AM) vrurg: JulianLevens: I would use it if it's properly implemented.
and also, as a reply to Vadim's comment above, it is possible to say:
- LC_CTYPE
  - used for the classification of characters
  - The idea behind LC_CTYPE is totally wrong. A character's properties aren't locale dependent, e.g. SMALL LETTER S WITH CARON (š) remains a letter even if it isn't used in the particular language.
  - This commonly happens in US English too, when international names like "Mojžiš" are needed - the "ž" and "š" are letters even if the en_US locale doesn't "know" them.
  - So, character properties should be bound to the character itself, regardless of the locale in use. This is done in Unicode - more precisely in the UCD - where every character is assigned many properties.
  - It is possible to test characters for ANY Unicode property (see the sketch below):
    - using regexes like: $c =~ /\p{Upper}/
    - and so on. More information in man perluniprops.
  - case folding - the perl function fc (needs the perl 5.16 feature scope)
  - access to many UCD-related "things" via the CORE module CPAN:Unicode::UCD
  - getting the "list of commonly used characters" in a given language - I asked this on StackOverflow a year ago, and found the answer in the Unicode Common Locale Data Repository - CLDR.
  - Currently Unicode's CLDR is easily accessible via the perl module CPAN:Locale::CLDR
- LC_MONETARY
- LC_NUMERIC
- LC_TIME
- LC_MESSAGES
So, the Unicode properties, CPAN:Unicode::UCD and CPAN:Locale::CLDR could fully substitute for LC_ANYTHING and also for Locale::Maketext, plus much more...
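A minimal sketch of the Unicode-property, fc and Unicode::UCD approach described above (the character chosen is just an example):
use strict;
use warnings;
use utf8;
use feature 'fc';              # fold case, needs perl >= 5.16
use Unicode::UCD 'charinfo';

my $c = 'Š';                   # LATIN CAPITAL LETTER S WITH CARON

# Unicode property tests work regardless of any OS locale
print "letter\n"    if $c =~ /\p{Letter}/;
print "uppercase\n" if $c =~ /\p{Upper}/;

# locale-independent case folding
print "same letter\n" if fc($c) eq fc('š');

# character metadata from the Unicode Character Database
my $info = charinfo( ord $c );
print $info->{name}, "\n";     # prints LATIN CAPITAL LETTER S WITH CARON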
--
JozefMojzis - 09 Mar 2016
Jozef, thanks for the feedback.
First, divide this into two groups:
- Change existing locale support
- Add new locale support
It is not the intention of this proposal to add support for those, although updating the i18n/l10n documentation to explicitly state what we do and do not currently support would be worth doing. It would also be worth including rough ideas of what needs to be done.
- LC_MONETARY
  - This would need a new field type added to form fields, and this could be a plugin
  - How aware would the core need to be of these values?
  - Foswiki::Form::FieldDefinition does not appear to support a specific sort/compare, i.e. how can tables sort monetary values properly?
- LC_NUMERIC
  - Similar to the above; indeed the above should inherit from this field type so monetary fields get the user's numeric prefs
- LC_TIME
- LC_MESSAGES
Now the existing supported locale stuff:
- LC_COLLATE
  - Replace with Unicode::Collate::Locale, or just use NFKD on older perls
- LC_CTYPE
  - Unicode is active in the core, so character classification is already there for free
  - But boundaries (e.g. Stores) can encode/decode to another character set
  - OTOH if you want locale support then convert your Store to Unicode - so this is a moot point
- Key question: what will break if we remove the existing locale code?
  - Nothing in the tests for locale, AFAICS
If we need to keep LC_CTYPE support via a logical 'use locale;' in all modules (and I hope we don't), then my code changes above at least remove that dependency from all code. You've referenced Import::Into elsewhere - that's effectively the magic I'm using (see the sketch below).
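A minimal sketch of that Import::Into mechanism; the Foswiki::PragmaBase name and the {UseLocale} check are purely illustrative:
package Foswiki::PragmaBase;

use strict;
use warnings;
use Import::Into;
use locale ();    # load locale.pm without enabling the pragma here

sub import {
    my $target = caller;

    # Re-export pragmas into whichever module says "use Foswiki::PragmaBase;"
    strict->import::into($target);
    warnings->import::into($target);
    locale->import::into($target) if $Foswiki::cfg{UseLocale};
}

1;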
--
JulianLevens - 10 Mar 2016
Julian, YES, (unfortunately) you're probably right. I'm not a locale expert (I use only Unicode), so this needs more testing and discussion.
Reading through https://metacpan.org/pod/distribution/perl/pod/perllocale.pod (note: v5.22), we could have problems when the underlying system isn't Unicode.
The locale pragma is problematic because its behaviour depends on the perl version. From 5.16 the pragma can take optional arguments. Starting in v5.20, Perl fully supports UTF-8 locales, except for sorting and string comparisons. And there are other version-based differences.
The main problem lies probably in the following citation:
The current locale is used when going outside of Perl with operations like system() or qx//, if those operations are locale-sensitive.
We need to test how Foswiki will fork the external rcs, grep and similar commands (with accented characters as arguments) when the underlying locale is, for example, ISO-8859-1. (Internally Foswiki is ALWAYS Unicode.) There is no problem when the underlying system is Unicode.
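A minimal sketch (not Foswiki code) of that concern when forking external commands while the OS locale is not UTF-8; the charset, path and command are just assumptions:
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $searchString = 'Mojžiš';        # internally a Unicode character string
my $osCharset    = 'ISO-8859-2';    # assumed charset of the OS locale

# arguments must be encoded to the OS charset before leaving perl
my @cmd = ( 'grep', '-rl', encode( $osCharset, $searchString ),
            '/var/www/foswiki/data/Sandbox' );
system(@cmd);                       # list form, no shell involved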
IMHO, we don't need the locale pragma for full Unicode (OS + perl + Foswiki).
IMHO, important reading is this part of perllocale: https://metacpan.org/pod/distribution/perl/pod/perllocale.pod#Category-LC_CTYPE:-Character-Types1 - hard reading :(
My view: by not dropping locale, we will have many problems - because of perl versions, OS support, broken OS locales... and so on. Therefore, in my opinion, Foswiki should support the C locale (aka ASCII) and Unicode. Supporting the ISO-based locales means many problems. (But maybe not - I don't know.)
--
JozefMojzis - 11 Mar 2016
Having read many a fine manual I have updated the proposal and reset the clock.
Basically let's remove from Foswiki the usage of locale as provided by perl and the underlying OS.
Looking through the existing code, my concerns are not that the removal of locales will break anything, but rather the non-Unicode assumptions that have been made. I do not mean that [0-9] should be replaced by \d (or vice versa) in all cases. The Unicode character '৪' is a \d but it is actually BENGALI DIGIT FOUR (U+09EA). For a wiki word that could be fine, but treating it as part of a number is unlikely to be what you want, and it has security concerns.
Going forward I suggest that we use the \p{Digit} syntax to make it clear that we do want to match anything considered a digit by Unicode, and \P{Digit} for anything non-digit. That is to say, if you see \d in code, should that be changed to \p{PosixDigit} when we really mean [0-9], or to \p{Digit} when we want the full Unicode range of digits?
This also has the advantage that we could actually write some tests to check that 'locales' will work as we expect. The docs make it clear that locales have had many bugs in perl's implementation as well as in the underlying OS, so testing them is very difficult.
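A minimal sketch of such a test, using the Bengali digit example from above (the test descriptions are just illustrative):
use strict;
use warnings;
use Test::More tests => 3;

my $bengali_four = "\x{09EA}";    # BENGALI DIGIT FOUR

ok( $bengali_four =~ /^\d$/,             '\d matches any Unicode decimal digit' );
ok( $bengali_four !~ /^\p{PosixDigit}$/, '\p{PosixDigit} matches [0-9] only' );
ok( $bengali_four =~ /^\p{Digit}$/,      '\p{Digit} matches the full Unicode digit range' );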
--
JulianLevens - 14 Mar 2016 - 15:23
Maybe perl version 5.42 will answer "the ultimate question of locales, Unicode, and everything with their compatibility" - but we have "problems" upgrading the minimum perl version even to 5.16...
So, we should avoid bugs as much as possible. When
- perl is capable of handling any locale-specific things today (using its basic built-in features and/or CPAN modules)
- and those capabilities are more precise, cleaner and less buggy than the OS-based locale support (for Unicode)
my view is: yes, yes and again YES.
I fully agree with "Basically let's remove from Foswiki the usage of locale as provided by perl and the underlying OS", as stated above.
Any other questions, like:
- how to compose and use regexes (e.g. \p{Digit} vs \d vs \p{PosixDigit}, or even user-defined properties like \p{Foswiki::Digit} - see the sketch at the end of this comment)
- or even introducing something like Foswiki::Regex with precompiled regexes, used as if( $name =~ $fwregex->get('TopicNameRegex') )
- when and how to use Unicode normalization
- etc etc etc
we could decide per partes - step by step - when the decision time arrives (aka Moofication and such).
Like the usage of CPAN:Unicode::Collate::Locale for the sorting. ('jast' will probably complain as in http://irclogs.foswiki.org/bin/irclogger_log/foswiki?date=2014-05-17,Sat&sel=224#l220 - just kidding.)
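A minimal sketch of a user-defined Unicode property as mentioned in the list above. Note that perl requires the sub name to begin with "In" or "Is", so it would be \p{Foswiki::IsDigit} rather than \p{Foswiki::Digit}; the package, sub name and range are purely illustrative:
use strict;
use warnings;

package Foswiki;

# Illustrative only: restrict "digit" to ASCII 0-9, e.g. for numeric form fields
sub IsDigit {
    return "0030\t0039\n";    # the range U+0030 .. U+0039
}

package main;

print "ASCII digit\n"          if '4' =~ /\p{Foswiki::IsDigit}/;
print "not an ASCII digit\n"   if "\x{09EA}" !~ /\p{Foswiki::IsDigit}/;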
--
JozefMojzis - 15 Mar 2016
I got an email from GuidoBrugnara reporting an issue with the locale and accept-language header. It looks like we have some issues in how we generate the language tags. It's hard-coded to the default locale when locales are enabled. It seems as though we should be returning the correct locale for the requested language.
I realized that the search engines scan web sites using Accept-Language="en".
In this manner our pages/topics translated into Italian are not searchable.
Do you know a work-around for this problem?
Perhaps the problem will be solved with time!
From this document https://support.google.com/webmasters/answer/6144055?hl=en I understand that at least Google is prepared to interrogate the site using different Accept-Language values.
Am I worried about a problem that does not exist?
I thought about using a RewriteCond (Apache 2) rule to force the Accept-Language header to "it" when the URL is http://mysite/it/ but perhaps there is a better way to do it.
Would you have some suggestions?
best regards
GuidoBrugnara
[1] http://tech-blog.borychowski.com/index.php/2009/03/htaccess/redirect-according-to-browser-language-mod-rewrite-and-http_accept_language/
I verified that when the browser has English selected, the downloaded web page contains text in English: OK!
But the "lang" attribute in the "html" tag is set to Italian:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="it-IT" lang="it-IT">
Is that normal?
--
GeorgeClark - 21 May 2016
This has moved to 'do it' in my book, i.e. eliminate 'use locale' throughout Foswiki - there is always a better way. I'll raise this in the next ReleaseMeeting02x02_20160919.
My earlier discussions toyed with keeping locale via a Foswiki::Locale - I now reject that idea in favour of complete removal.
--
JulianLevens - 17 Sep 2016
Here's one other interesting aspect: for now Foswiki operates based on the browser's language property. However, this does not allow us to send mail notifications in the user's language, as there basically is no standard per-user config setting. Instead, all mail notifications are sent in the same default language to all recipients. Can we change that?
--
MichaelDaum - 17 Sep 2016
In my investigation of EnableLowerCaseTopicNames it has become really obvious that our case-sensitive sorting is not very friendly. Topics with lower-case first letters sort all the way at the bottom rather than being sorted together with the upper-case names. This problem exists today but is less obvious for case differences deeper in the topic name.
A poor hack is to sort using something like NFKD( lc( $a )) cmp NFKD( lc( $b )), but I would hope that this proposal will address the issue in a better way.
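A minimal sketch (not existing Foswiki code) of how Unicode::Collate::Locale could give case-insensitive topic sorting directly; the locale and topic names are just examples:
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# level 2 compares base letters and accents but ignores case differences
my $collator = Unicode::Collate::Locale->new(
    locale => 'de',
    level  => 2,
);

my @topics = qw( WebHome zebraTopic ApplePie appleSauce ÄpfelTopic );
print join( "\n", $collator->sort(@topics) ), "\n";
At level 2, ApplePie and appleSauce sort next to each other instead of being split apart by case, which is the behaviour the NFKD(lc(...)) hack tries to approximate.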
--
GeorgeClark - 17 Jan 2018