WikiWords such as VirksomhedskulturÆrø do not work in UTF-8
The recent UTF-8 improvements have revealed some problems, and here is another.
In UTF-8 Perl is supposed to know which characters are upper case and which are lower case.
But both in the Wysiwyg editor and in the rendering of topics, setting the locale to da-DK.UTF8 and the charset to utf-8 makes TWiki treat non A-Z characters as non-letters when it comes to wikiwords.
I have tested and experimented quite a lot, and I am convinced that in Perl 5.8 our wikiword regexes in TWiki.pm are correct.
The problem must be that TWiki does not see the string as utf-8 somewhere.
In ISO8859 the non A-Z wikiwords work fine except in the Javascripts for creating topics where there are still regexes with A-Z.
Here on Bugs, VirksomhedskulturÆrø points to a not yet created topic.
--
TWiki:Main/KennethLavrsen - 24 Apr 2008
"The problem must be that TWiki does not see the string as utf-8 somewhere": yes, probably. There are several functions provided for converting from the encoded UTF8 that might come from a CGI query to Perl's internal unicode representation. Unfortunately they don't always work, and are called in a rather hit-and-miss fashion. On the WYSIWYG side, as far as I know there is no code anywhere that recognises wikiwords that does not use the TWiki regexes.
A general "it doesn't work" is a truism, but isn't a useful report. Anyone who tries to fix this needs to know exactly how to reproduce a problem involving UTF8, and the above description isn't detailed enough, so I'm kicking this back for more detail (note: this does not mean I intend to fix it, I'm just triaging the issue)
--
CrawfordCurrie - 24 Apr 2008
Yes, my description is detailed enough.
But let me write it again differently:
Set up TWiki for UTF8 - this step should be obvious.
Write the word
VirksomhedskulturÆrø
Save
Look
--
KennethLavrsen - 30 Apr 2008
OK, thanks, that was exactly the sort of simple recipe I wanted. I did that, and I see that the resulting word is not a wikiword in view. When I look at the topic saved to disc I see that the string is correctly encoded using UTF8 characters, so we can eliminate WYSIWYG as a source of error. The problem has to be with the regexes that recognise wikiwords.
Confirmed as an I18N issue.
Note: I just had a lot of grief saving this topic, which suggests there are still issues with charsets in form fields.
CC - 01 May 2008
I did a lot of reading on the topic and I am convinced our
$regex{upperAlpha} = '[:upper:]';
$regex{lowerAlpha} = '[:lower:]';
$regex{numeric} = '[:digit:]';
$regex{mixedAlpha} = '[:alpha:]';
will also work in UTF8, seeing ÆØÅÉÖ as upper case and æøåéö as lower case.
But it requires that Perl at any given time sees the variable on which we use the regex as UTF8 and not as plain ASCII.
I would like to try to analyse more using some poor man's debugging. How do I easily identify whether a variable holds what Perl sees as UTF8 vs ASCII? I need a one-liner I can print out to error_log or a debug file.
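One possible debugging aid for this question, as a minimal sketch: Perl's built-in utf8::is_utf8 reports whether a scalar carries the internal UTF8 flag. The helper name utf8_status and the sample strings are hypothetical and not part of TWiki. Note that the flag only says how Perl stores the string; a string with the flag off may be plain ASCII, Latin-1, or undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical debugging helper (not a TWiki function): reports whether
# Perl has the internal UTF8 flag set on a value, and its length.
sub utf8_status {
    my ( $label, $value ) = @_;
    my $flag = utf8::is_utf8($value) ? 'on' : 'off';
    return sprintf( '%s: utf8_flag=%s length=%d', $label, $flag, length($value) );
}

# UTF-8 encoded bytes for "Aero" written with the Danish letters (AE, r, o-slash)
my $raw     = "\xc3\x86r\xc3\xb8";
my $decoded = Encode::decode( 'UTF-8', $raw );

# Written to STDERR so the lines end up in the web server's error_log
print STDERR utf8_status( 'raw',     $raw ),     "\n";
print STDERR utf8_status( 'decoded', $decoded ), "\n";
```

The one-liner form would be something like print STDERR utf8::is_utf8($text) ? "utf8\n" : "bytes\n"; placed next to the regex under suspicion.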
--
KJL - 01 May 2008
I have tried to analyze more. But I am nowhere near being able to resolve it. The unicode/utf-8 encoding/decoding is still a bit of a mystery to me.
But I have learned something.
The rendering of wikiwords happens in lib/Render.pm in the sub getRenderedVersion.
The actual lines are
unless( TWiki::isTrue( $prefs->getPreferencesValue('NOAUTOLINK')) ) {
# Handle WikiWords
$text = $this->takeOutBlocks( $text, 'noautolink', $removed );
$text =~ s/$STARTWW(?:($TWiki::regex{webNameRegex})\.)?($TWiki::regex{wikiWordRegex}|$TWiki::regex{abbrevRegex})($TWiki::regex{anchorRegex})?/_handleWikiWord( $this,$theWeb,$1,$2,$3)/geom;
$this->putBackBlocks( \$text, $removed, 'noautolink' );
}
So my first trial was to see if the problem is the regexes or the _handleWikiWord. The conclusion is that it is the regexes that do not work on $text because of the encoding used.
I tried this as an experiment.
unless( TWiki::isTrue( $prefs->getPreferencesValue('NOAUTOLINK')) ) {
# Handle WikiWords
$text = $this->takeOutBlocks( $text, 'noautolink', $removed );
$text = Encode::decode($TWiki::cfg{Site}{CharSet}, $text) if $TWiki::cfg{Site}{CharSet};
$text =~ s/$STARTWW(?:($TWiki::regex{webNameRegex})\.)?($TWiki::regex{wikiWordRegex}|$TWiki::regex{abbrevRegex})($TWiki::regex{anchorRegex})?/_handleWikiWord( $this,$theWeb,$1,$2,$3)/geom;
$text = Encode::encode($TWiki::cfg{Site}{CharSet}, $text) if $TWiki::cfg{Site}{CharSet};
$this->putBackBlocks( \$text, $removed, 'noautolink' );
}
This makes the links appear correctly for a not yet created topic.
But the minute I create the topic and view the original topic with the wikiword I get errors "Malformed UTF-8 character".
Another observation: take a wikiword SomeTopicÆØÅWithDanish in a topic. If I print $text to STDERR, I see the wikiword as
SomeTopic\xc3\x86\xc3\x98\xc3\x85WithDanish
After Encode::decode it becomes
SomeTopic\xc6\xd8\xc5WithDanish
and then the regexes work again. Shouldn't the regex engine also work on utf-8 strings in Perl 5.8?
So the issue is again the coding of strings used inside TWiki. How do we fix this? I am stuck.
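The observation above can be reproduced outside TWiki with a small sketch. The pattern below is a simplified stand-in built from the same POSIX classes, not TWiki's actual wikiWordRegex, and the helper escape_non_ascii is hypothetical. It assumes default regex semantics (no "use locale" and no unicode_strings feature), where POSIX classes on a byte string only know ASCII letters.

```perl
use strict;
use warnings;
use Encode ();

# Simplified stand-in for a wikiword pattern (an assumption, not TWiki's regex)
my $wikiword = qr/^[[:upper:]]+[[:lower:][:digit:]]+[[:upper:]]+[[:alnum:]]*$/;

# Hypothetical helper: show non-ASCII characters as \xNN escapes, like the STDERR dump
sub escape_non_ascii {
    my ($s) = @_;
    return join '', map {
        my $o = ord($_);
        ( $o > 0x7e || $o < 0x20 ) ? sprintf( '\\x%02x', $o ) : $_
    } split //, $s;
}

# UTF-8 encoded bytes, as read from the topic file
my $bytes = "SomeTopic\xc3\x86\xc3\x98\xc3\x85WithDanish";
# Perl character string after decoding
my $chars = Encode::decode( 'UTF-8', $bytes );

for my $pair ( [ bytes => $bytes ], [ chars => $chars ] ) {
    my ( $label, $s ) = @$pair;
    printf "%s: %s matches=%s\n", $label, escape_non_ascii($s),
        ( $s =~ $wikiword ? 'yes' : 'no' );
}
```

On the byte string the match stops at the first \xc3, because each Danish letter is two bytes that are not letters to the regex engine; on the decoded string each letter is one character with Unicode semantics, so the whole word matches.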
--
KennethLavrsen - 02 Jun 2008
I had similar problems with unicode and Perl. I described the steps that helped me in UnicodeProblemsAndSolutionCandidates.
--
ChristianLudwig - 09 Jun 2008
For 4.2.1 I am still a bit stuck and need a hand.
--
KennethLavrsen - 18 Jun 2008
This seems to be trying to do UseUTF8 aka UnicodeSupport, which is a rather large piece of work that affects many different parts of TWiki. Current versions of TWiki don't support Unicode at all - while you can set .utf8 in the locale etc, it's not recommended for European languages, only for those languages such as Chinese that don't care about I18N characters in WikiWords. In other words, this is not a bug, it's the missing UnicodeSupport feature that's needed here.
However, perhaps this is part of feature work on Unicode support, in which case it's a matter of ensuring that all strings processed by TWiki are not just UTF-8 bytes but are turned into Perl utf8 characters (i.e. Perl's utf8 mode as in perldoc perlunicode). This is presumably not happening somewhere, as you mention. Note that one side effect of Encode::decode is that UTF-8 byte strings turn into Perl utf8 character strings. Clearly, having the sequence of bytes in UTF-8 is not enough: each multi-byte UTF-8 sequence (two bytes for Æ, up to four in standard UTF-8) is represented (when Perl is using utf8 mode for a string) as a single 'character', i.e. it's a unit of matching in regexes, and a unit for other string operations. There are some Perl functions that will do this for you, and it also happens automatically in some cases but not all. However, moving TWiki to Perl utf8 mode is a large piece of work...
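To illustrate the "unit for other string operations" point, here is a minimal sketch comparing length and lc on UTF-8 bytes versus the decoded character string. The variable names and sample word are made up for illustration, and it assumes default string semantics (no "use locale", no unicode_strings feature).

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 bytes for the three letters AE, R, O-slash (upper case)
my $bytes = "\xc3\x86R\xc3\x98";
my $chars = Encode::decode( 'UTF-8', $bytes );

my $byte_len = length($bytes);    # counts bytes
my $char_len = length($chars);    # counts characters

# lc on the byte string only touches the ASCII "R"
my $lc_bytes = lc($bytes);
# lc on the character string lower-cases all three letters;
# encode again before writing the result out
my $lc_chars = Encode::encode( 'UTF-8', lc($chars) );

printf "length: bytes=%d chars=%d\n", $byte_len, $char_len;
printf "lc changed the Danish letters: bytes=%s chars=%s\n",
    ( $lc_bytes eq "\xc3\xa6r\xc3\xb8" ? 'yes' : 'no' ),
    ( $lc_chars eq "\xc3\xa6r\xc3\xb8" ? 'yes' : 'no' );
```

The usual pattern this points to is decoding once where data enters the program and encoding once where it leaves, rather than around individual regexes.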
I do think this should be treated as feature work and done on a branch - getting utf8 right is quite disruptive potentially, though it could be mitigated if we have a simple 'utf8 mode on/off' flag as I suggested on UseUTF8.
See my comment on UseUTF8 - have added a new Key Concepts section there that is relevant to this distinction.
--
RichardDonkin - 26 Jun 2008
OK so I will downgrade this to normal then and I will do a small update on the help text in configure and probably also in some of the installation docs about the current support of UTF8.
--
KennethLavrsen - 26 Jun 2008
Duplicate of Item5230 but has more analysis, so closed 5230 instead.
--
CrawfordCurrie - 04 Jan 2009
Found out that there is a difference between utf8 and UTF-8 in Perl. See
http://jeremy.zawodny.com/blog/archives/010546.html . I think this might help with "malformed utf-8" type errors.
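As a rough sketch of that distinction in the Encode module: the name 'utf8' selects Perl's lax encoding, while 'UTF-8' resolves to the strict one. The helper strict_decode_ok is hypothetical; it uses Encode::FB_CROAK so that malformed input raises an error that can be caught, instead of being substituted silently.

```perl
use strict;
use warnings;
use Encode ();

# The two spellings resolve to different encoding objects
my $lax_name    = Encode::find_encoding('utf8')->name;
my $strict_name = Encode::find_encoding('UTF-8')->name;
print "utf8  resolves to $lax_name\n";
print "UTF-8 resolves to $strict_name\n";

# Hypothetical helper: true if the bytes decode cleanly as strict UTF-8
sub strict_decode_ok {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its input when a CHECK value is given
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };
    return $ok ? 1 : 0;
}

print "valid UTF-8 bytes:   ", strict_decode_ok("\xc3\x86r\xc3\xb8"), "\n";
# ISO-8859-1 bytes for the same word are not valid UTF-8
print "ISO-8859-1 bytes:    ", strict_decode_ok("\xc6r\xf8"), "\n";
```

A check like this could help locate where a string is decoded twice or was never UTF-8 in the first place, which is one plausible source of "Malformed UTF-8 character" errors.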
--
StefanosKouzof - 10 Feb 2010
Fixed in utf8 branch. Awaiting merge.
--
Main.CrawfordCurrie - 17 May 2015 - 10:11