Item758: Raw view break Chinese characters in UTF-8
Priority: Urgent
Current State: Closed
Released In: 1.1.0
Target Release: minor
Applies To: Engine
Component: I18N
Branches:
Raw View will break some of the Chinese characters FF/FD.
You can test the Raw View on my FW 1.0 at
http://ebm.twbbs.org/bin/view/Sandbox/ChineseTest
Chinese character 盧 (codepoint 30439) is represented by the byte-sequence e7 9b a7.
Something in the "raw view" processing is converting "Windows-1252" codes to character entities. For example, 0x9B in Windows-1252 corresponds to Unicode codepoint 8250. That conversion breaks the UTF-8 encoding.
There is a similar problem with ě which is represented by the byte-sequence C4 9B, as reported in
Item8950.
This might be tricky to solve comprehensively, as the problem is caused by CGI::textarea. On trunk, it is
this call.
Confirmed. This particular manifestation of what I consider to be a bug in CGI.pm will not cause data-loss, but the same bug might cause data loss in forms. Setting it to urgent because I don't know what else might be affected. My CGI.pm is version 3.29.
--
MichaelTempest - 27 Jun 2010
When I:
- Set up {UseLocale}=1; {Site}{CharSet}='UTF-8'; {Site{Locale}='en-GB.UTF8'; (CGI version is 3.49)
- Edit Wiki Text
- Write
皍
in the text, save
- Looks good, I see the chinese character
- Copy the character shown to the cut buffer
- Edit again, paste the character, and save
- Looks good (I see two instances of the character)
- View Wiki Text (still looks good)
So what am I doing wrong? How did you reproduce this? Do I need to view on Windows? Or has this simply been fixed in CGI?
--
CrawfordCurrie - 13 Jul 2010
I can reproduce the problem with the process you describe, using
蓋
or
盧
, but not with
皍
--
MichaelTempest - 14 Jul 2010
Dyslexics of the world unite! Confirmed.
--
CrawfordCurrie - 14 Jul 2010
OK, fixed. We have to ensure the auto-encoding in CGI uses the correct character set. CGI defaults to iso-8859-1, and has a special exception for
iso-8859-1 and windows1252 in CGI::escapeHTML which breaks UTF-8 content. Get this wrong, and CGI will fail to encode certain UTF-8 characters correctly.
Note that this problem exists in earlier releases as well, but I have only committed to trunk. The same fix should be portable.
--
CrawfordCurrie - 14 Jul 2010