You are here: Foswiki>Tasks Web>Item758 (04 Oct 2010, KennethLavrsen)Edit Attach

Item758: Raw view break Chinese characters in UTF-8

Priority: Urgent
Current State: Closed
Released In: 1.1.0
Target Release: minor

Applies To: Engine
Component: I18N
Branches:

Reported By: ChYang
Waiting For:
Last Change By: KennethLavrsen

Raw View will break some of the Chinese characters FF/FD.

You can test the Raw View on my FW 1.0 at http://ebm.twbbs.org/bin/view/Sandbox/ChineseTest

Chinese character 盧 (codepoint 30439) is represented by the byte-sequence e7 9b a7.

Something in the "raw view" processing is converting "Windows-1252" codes to character entities. For example, 0x9B in Windows-1252 corresponds to Unicode codepoint 8250. That conversion breaks the UTF-8 encoding.

There is a similar problem with ě which is represented by the byte-sequence C4 9B, as reported in Item8950.

This might be tricky to solve comprehensively, as the problem is caused by CGI::textarea. On trunk, it is this call.

Confirmed. This particular manifestation of what I consider to be a bug in CGI.pm will not cause data-loss, but the same bug might cause data loss in forms. Setting it to urgent because I don't know what else might be affected. My CGI.pm is version 3.29.

-- MichaelTempest - 27 Jun 2010

When I:

Set up {UseLocale}=1; {Site}{CharSet}='UTF-8'; {Site{Locale}='en-GB.UTF8'; (CGI version is 3.49)
Edit Wiki Text
Write 皍 in the text, save
Looks good, I see the chinese character
Copy the character shown to the cut buffer
Edit again, paste the character, and save
Looks good (I see two instances of the character)
View Wiki Text (still looks good)

So what am I doing wrong? How did you reproduce this? Do I need to view on Windows? Or has this simply been fixed in CGI?

-- CrawfordCurrie - 13 Jul 2010

I can reproduce the problem with the process you describe, using 蓋 or 盧, but not with 皍

-- MichaelTempest - 14 Jul 2010

Dyslexics of the world unite! Confirmed.

-- CrawfordCurrie - 14 Jul 2010

OK, fixed. We have to ensure the auto-encoding in CGI uses the correct character set. CGI defaults to iso-8859-1, and has a special exception for iso-8859-1 and windows1252 in CGI::escapeHTML which breaks UTF-8 content. Get this wrong, and CGI will fail to encode certain UTF-8 characters correctly.

Note that this problem exists in earlier releases as well, but I have only committed to trunk. The same fix should be portable.

-- CrawfordCurrie - 14 Jul 2010

ItemTemplate edit

Summary	Raw view break Chinese characters in UTF-8
ReportedBy	ChYang
Codebase	1.0.9, 1.0.8, 1.0.7, 1.0.6, 1.0.5, 1.0.5 beta1, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, 1.0.0 beta3, 1.0.0 beta2, 1.0.0 beta1, trunk
SVN Range	SVN 1972: Foswiki-1.0.0, Fri, 09 Jan 2009, build 1899
AppliesTo	Engine
Component	I18N
Priority	Urgent
CurrentState	Closed
WaitingFor
Checkins	distro:94cbfbd0a9fd distro:7a09a9169186
TargetRelease	minor
ReleasedIn	1.1.0