We run TWiki-4.1.0, build 12567, with the UTF-8 codepage (for Russian). Something goes wrong with Russian (UTF-8) text during generation of HTML anchors when Russian text is used in headers (h1-h6). I tried to reproduce this bug in my sandbox on your TWiki installation, but that TWiki is not configured for international characters.

I saved the original test case (HTML source code) from our TWiki installation. In Firefox 2 everything looks right (img). But in MSIE we get a very strange picture (img): the second header and the text of the second section are invisible, and the font size is increased for all text on the page.

I found the reason (img) for this behavior (the char �).

Can anybody explain to me what's wrong and how to fix this issue? Thank you.

Discussion

-- TWiki:Main/AlexeyShakin - 06 Feb 2007

The problem of the char � appearing exists in all TWiki 4.x versions and seems too complicated to solve. I spent some months and made some patches for Search.pm and Render.pm to convert multi-byte UTF-8 encoding to internal Perl 5.8 Unicode data. I got proper page rendering for most of the core TWiki pages, but failed to get the right encoding everywhere. I do not know whether it is a problem of Debian's Perl 5.8 or not, but it turned out to be impossible to control saving and reading files in UTF-8 when internal Perl 5.8 Unicode support is on. When a UTF-8 encoded Russian word is read from a file, Perl does not convert it, but it does convert on save, so I end up with a larger, wrongly encoded file on write. I failed to find a working workaround for this problem in the Perl documentation. Unfortunately, I did not have time to continue and did not try to submit a bug to Perl.

Therefore I had to return the site to koi8.
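A larger, wrongly encoded file on write is what double encoding typically looks like: bytes that were never decoded on read get encoded again on save. The following is only a minimal sketch of that effect, written in Python rather than Perl so it is easy to run; the sample word is an assumption, and treating the undecoded bytes as Latin-1 is one plausible explanation, not something confirmed in this thread.

```python
# Illustration only: a Python stand-in for the read/write asymmetry described above.
word = "Привет"                    # sample Russian word (assumption)
utf8_bytes = word.encode("utf-8")  # the bytes as stored in the topic file

# Reading without decoding: each byte is treated as one Latin-1 character.
misread = utf8_bytes.decode("latin-1")

# Saving with UTF-8 encoding switched on: every byte >= 0x80 becomes two bytes.
double_encoded = misread.encode("utf-8")

# Consistent handling: decode on read, encode on write, and the bytes survive.
round_trip = utf8_bytes.decode("utf-8").encode("utf-8")

print(len(utf8_bytes), len(double_encoded), round_trip == utf8_bytes)
```

In this sketch the file doubles in size after one save, while decoding on read and encoding on write leaves the bytes unchanged.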

-- TWiki:Main.SergejZnamenskij - 09 Feb 2007

This bug is described in UtfAnchorError but is not fixed in TWiki 4.1.1.

-- TWiki:Main.AndreyTkachenko - 11 Feb 2007

This is extremely valuable feedback, guys, please keep at it. This sort of problem can only really be debugged by people running UTF8 on their sites.

-- TWiki:Main.CrawfordCurrie - 12 Feb 2007

This is clearly a bug, not an enhancement (I changed the priority accordingly.)

-- TWiki:Main.PeterThoeny - 12 Feb 2007

Sergej - it's great that you had a go at doing Unicode support already - I did do a lot on this outside the main TWiki tree a few years back and could perhaps help you out ... Let me know if you want to re-start this as I'd be happy to provide pointers and my (rather old) code. TWiki:Codev.UnicodeSupport is the place to start, and TWiki:Codev.ProposedUTF8SupportForI18N has some other thinking. Generally, the rule is to turn off Perl locales completely when doing UTF-8 using Perl Unicode, otherwise everything is quite broken. I don't believe this problem is too complicated to solve - I did have some working code, but making it more general was a fair bit of work.

There are several ways we could solve this one (aka TWiki:Codev.UtfAnchorError):
  • Implement full TWiki:Codev.UnicodeSupport - the 'right thing to do', bringing many benefits including MediaWiki competitiveness for I18N, but quite a lot of work
  • Implement a quick hack - already available on that page, but breaks non-UTF-8 sites
  • Wrap this hack somewhat, and use a better UTF-8 regex (which already exists within TWiki.pm as part of the code for TWiki:Codev.EncodeURLsWithUTF8 - see $regex{validUtf8CharRegex}) - does not require Perl Unicode support

Third option is the best one - it's not that much work as it's just a small tweak of 2nd option for which code is available, and enables precise truncation of UTF-8 strings on Unicode character boundaries, as it should. The key is to only apply it if the 'site charset' is utf-8, as set by the startup routine in TWiki.pm - then it can be ignored by other sites. When we get full UnicodeSupport this will become much simpler of course, so a TODO note should be attached.
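To make the third option concrete, here is a minimal sketch of truncating an anchor name on a UTF-8 character boundary, applied only when the site charset is utf-8. It is written in Python for illustration, not in TWiki's Perl, and it uses a simple continuation-byte test instead of $regex{validUtf8CharRegex}. The function name `truncate_anchor`, its parameters, and the 5-byte limit are all hypothetical.

```python
def truncate_anchor(name: bytes, max_bytes: int, site_charset: str) -> bytes:
    """Cut name to at most max_bytes without splitting a UTF-8 character."""
    if len(name) <= max_bytes:
        return name
    cut = max_bytes
    if site_charset.lower() in ("utf-8", "utf8"):
        # UTF-8 continuation bytes have the form 0b10xxxxxx. If the first
        # byte we would drop is one of them, the cut falls inside a
        # character, so step back to the start of that character.
        while cut > 0 and (name[cut] & 0xC0) == 0x80:
            cut -= 1
    # Other site charsets (e.g. koi8-r) keep the plain byte cut.
    return name[:cut]


header = "Заголовок".encode("utf-8")  # sample header text (assumption)
naive = header[:5].decode("utf-8", errors="replace")
safe = truncate_anchor(header, 5, "utf-8").decode("utf-8")
print(repr(naive), repr(safe))
```

The naive byte cut ends in the middle of a two-byte character, which is where the � comes from; the boundary-aware cut returns a shorter but valid string. Sites with a single-byte charset take the unchanged code path.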

At present, KOI8-R works much better than UTF-8 for Cyrillic sites, as Sergej has found - it lets you use WikiWords in Cyrillic automatically, but real Unicode support would be much better. See TWiki:TWiki.InstallationWithI18N for details.

-- TWiki:Main.RichardDonkin - 20 Feb 2007

There's some discussion of TWiki:Codev.UnicodeSupport on TWiki:Codev.InternationalisationGuidelines at the moment.

-- TWiki:Main.RichardDonkin - 22 Feb 2007

It's a pity, Richard, that stage 2 of ProposedUTF8SupportForI18N was not finished. UTF-8 has become more and more popular, and the lack of Basic UTF-8 support as described in ProposedUTF8SupportForI18N appears to be dangerous for TWiki.

As for restarting, I am thinking about it. I have to understand what it means (what was done outside of the main branch, what should be done now, and how to contribute). The first difficult question for me is where data should be in UTF-8 and where in internal Perl 5.8 Unicode, so that most of the extensions can be reused (probably after patching). I would appreciate your hints very much.

-- TWiki:Main.SergejZnamenskij - 22 Feb 2007

Sergej, I agree completely. It's a shame Basic UTF-8 was never done and nobody else picked this up - I did do quite a lot in my own code, but it was complex to get this working well enough, and fast enough, to release into the main codebase. In the meantime, other software has provided good UTF-8 support and UTF-8 is really the default choice for I18N these days, so TWiki is looking quite out of date, particularly compared to TWiki:Codev.MediaWiki and much non-Wiki software. The good news is that Perl 5.8 has had quite a few Unicode bugs fixed by now, and so have modules such as CPAN:CGI, so it may be less painful!

It would be great if you want to restart this work, and I'd be happy to help. I could bring my old public TWiki site back up, which included a part that ran in UTF-8 mode, and provide you with that code, so you can get up to speed with what I already did. I'm on a trip for the next week, including travel over the next two weekends, so it might be a while before I get this going - however, I could simply zip the files, including both test data and code, and email them to you. Let me know what you'd like to do.

It would also be useful to see your Unicode patches.

Let's continue discussion over on TWiki:Codev.InternationalisationGuidelines, or maybe consolidate everything under the TWiki:Codev.UnicodeSupport topic. In the meantime, here's a useful article on getting Perl code to work with Unicode: http://ahinea.com/en/tech/perl-unicode-struggle.html

-- TWiki:Main.RichardDonkin - 23 Feb 2007

I hope to provide intensive testing in Cyrillic and will try to understand how to do Phase 3 of TWiki:ProposedUTF8SupportForI18N. It will take some time, though. As for the patches, I tried various changes and was so unhappy that I could not force Perl to read and write the same contents to a file in UTF-8 mode that I lost the versions that previously somewhat worked, sorry. I found out later from your topics that I should have turned locale support off, but I had not.

-- TWiki:Main.SergejZnamenskij - 23 Feb 2007

Item4074 describes the same problem as this item, but does it better, so I am setting this one to "no action".

-- MichaelTempest - 27 Jun 2010
 

ItemTemplate

Summary problem with HTML <a> generation for headers with Russian (UTF-8)
ReportedBy TWiki:Main.AlexeyShakin
Codebase
SVN Range TWiki-4.1.0, Tue, 16 Jan 2007, build 12567
AppliesTo Engine
Component I18N
Priority Normal
CurrentState No Action Required
WaitingFor
Checkins
TargetRelease n/a
ReleasedIn
Topic revision: r15 - 27 Jun 2010, MichaelTempest