Unicode Characters in non-UTF-8 Encodings
Background
It has been suggested that Foswiki should use only UTF-8 for site locales (i.e. {Site}{Locale} should only be something.utf8). This would simplify things, but it is not achievable overnight. In the meantime, people use Unicode characters that are not representable in the site locale, encoded as entities. That seems to work until someone modifies the topics via the TinyMCE WYSIWYG editor, at which point breakage occurs (see for example Tasks.Item5990).
UnderstandingEncodings provides an excellent introduction to this problem.
Needed: A Specification
There was (originally) no specification for how Foswiki handled entities in terms of WYSIWYG editing. We needed that specification to be able to resolve a number of bugs (Tasks.Item5990, Tasks.Item2311 and some others).
Sample characters that expose encoding problems
Taken together, the characters in the table below expose the encoding issues. Some are related to the use of (X)HTML (e.g. the need to encode & and <), some to Foswiki internals (e.g. the use of non-breaking space in the WysiwygPlugin's TML2HTML/HTML2TML conversions) and some to variations between character encodings. A "-" means that the encoding cannot represent the character.
Character | Entity name | Unicode code-point | UTF-8 encoding | ISO 8859-1 | ISO 8859-15 | Windows 1252 |
& | amp | U+0026 | 0x26 | 0x26 | 0x26 | 0x26 |
(non-breaking space) | nbsp | U+00A0 | 0xC2 0xA0 | 0xA0 | 0xA0 | 0xA0 |
½ | frac12 | U+00BD | 0xC2 0xBD | 0xBD | - | 0xBD |
€ | euro | U+20AC | 0xE2 0x82 0xAC | - | 0xA4 | 0x80 |
¤ | curren | U+00A4 | 0xC2 0xA4 | 0xA4 | - | 0xA4 |
‘ | lsquo | U+2018 | 0xE2 0x80 0x98 | - | - | 0x91 |
ë | euml | U+00EB | 0xC3 0xAB | 0xEB | 0xEB | 0xEB |
δ | delta | U+03B4 | 0xCE 0xB4 | - | - | - |
♀ | n/a | U+2640 | | | | |
♂ | n/a | U+2642 | | | | |
⚥ | n/a | U+26A5 | | | | |
☿ | n/a | U+263F | | | | |
⚧ | n/a | U+26A7 | | | | |
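Rows like these can be cross-checked mechanically. The following is a minimal sketch in Python (not Foswiki's own Perl code); it assumes Python's built-in codec tables for UTF-8, ISO 8859-1, ISO 8859-15 and Windows 1252, and the helper names `encode_cell` and `table_row` are made up for illustration.

```python
# Sample characters from the table, written as code points to avoid
# depending on the encoding of this source file.
SAMPLES = ["&", "\u00a0", "\u00bd", "\u20ac", "\u00a4", "\u2018", "\u00eb", "\u03b4"]
ENCODINGS = ["utf-8", "iso-8859-1", "iso-8859-15", "cp1252"]


def encode_cell(char, encoding):
    """Return the encoded bytes as '0xNN 0xNN', or '-' if the
    encoding cannot represent the character."""
    try:
        raw = char.encode(encoding)
    except UnicodeEncodeError:
        return "-"
    return " ".join(f"0x{b:02X}" for b in raw)


def table_row(char):
    """Build the code-point column followed by one column per encoding."""
    return [f"U+{ord(char):04X}"] + [encode_cell(char, e) for e in ENCODINGS]


for ch in SAMPLES:
    print(" | ".join(table_row(ch)))
```

The same helper can be pointed at the symbol rows (♀, ♂ and so on) to fill in the columns that are still empty.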
Additional constraints
- It is sometimes necessary to encode ordinary 7-bit characters like \ (backslash) as entities in %MACRO{ ... }% parameters because of issues related to Foswiki internals - see Tasks.Item8408
- Foswiki converts spaces in TML to non-breaking spaces when the TML must be protected (e.g. for %MACRO{ ... }% parameters), so that the spaces survive editing in WYSIWYG editors such as TinyMCE.
Specification
Definition:
Plain characters are characters represented in canonical form in the site's character encoding. HTML/XML named or numeric entities are not plain characters. For example, A is a plain character, but &#65; is not. In UTF-8, a plain character may be encoded in more than one byte.
Definition:
Protected blocks are sequences of characters where the precise sequencing of characters, including whitespace and newlines, is important. The contents of protected blocks may not be modified simply by loading and saving a topic in a WYSIWYG editor (although the user may edit them in a WYSIWYG editor). Protected blocks include the specifically-marked-up <sticky> blocks, <verbatim> blocks and automatically-protected text like %MACRO{parameters}%.
- Regardless of the site's character encoding, named and numeric entities not in protected blocks may be converted to plain characters (provided that the site's character encoding is able to represent those characters).
- Regardless of the site's character encoding, named and numeric entities in protected blocks shall not be converted into a different kind of entity or into plain characters.
- Comment: In a protected block, each entity's & must be converted into an entity itself (i.e. &amp;) when converting TML to HTML, so that the rest of the original entity is "plain text" as far as the browser is concerned and the original entity can be reconstructed when converting the HTML back to TML.
- Regardless of the site's character encoding, characters that cannot be represented in the site's character encoding shall be converted to entities (by preference, named entities).
- Comment: This applies particularly to characters in an HTML2TML REST request, since those requests are encoded as UTF-8, but it could also apply to characters in a conventional form POST.
- If a character that is converted to an entity is in a protected block, then the user shall be warned of the conversion-to-entity inside a protected block.
- The warning, and how it is displayed, is TBD
- Foswiki shall treat the ISO-8859-1 encoding as if it really is Windows-1252 (since this is what most browsers do anyway)
- This specification is not complete.
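The rules above can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the WysiwygPlugin implementation (which is Perl): the function names are hypothetical, "named entity" is taken to mean the HTML 4 names in Python's `html.entities.codepoint2name`, and the browser's handling of a protected block is reduced to a plain string replacement.

```python
import codecs
from html.entities import codepoint2name  # HTML 4 names, e.g. 0x20AC -> "euro"


def effective_encoding(site_encoding):
    """Treat ISO-8859-1 as Windows-1252, as the specification requires."""
    if codecs.lookup(site_encoding).name == "iso8859-1":
        return "cp1252"
    return site_encoding


def to_site_text(text, site_encoding):
    """Convert characters the site encoding cannot represent to entities,
    preferring named entities and falling back to numeric ones."""
    out = []
    for ch in text:
        try:
            ch.encode(site_encoding)
            out.append(ch)  # representable: stays a plain character
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)


def protect_entities(block):
    """TML to HTML for a protected block: escape every '&' so the browser
    sees the rest of each entity as ordinary text."""
    return block.replace("&", "&amp;")


def unprotect_entities(html_block):
    """HTML back to TML: undo protect_entities, restoring the original entities."""
    return html_block.replace("&amp;", "&")


print(to_site_text("5 \u20ac or \u03b4 or \u2640", "iso-8859-1"))
print(to_site_text("\u20ac", effective_encoding("iso-8859-1")))
```

With a strict ISO-8859-1 target the euro sign becomes `&euro;`, whereas going through `effective_encoding` keeps it as a plain character because Windows-1252 can represent it. The warning required for conversions inside protected blocks is left out, since how it is displayed is still TBD.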
Unresolved issues
- Numerical entities (e.g. &#65; &#38; &#235; which is A & ë) in ordinary topic text:
- Is it acceptable for TinyMCE/WysiwygPlugin to convert them to plain characters or to named entities?
- MT: Yes
- CC: No to plain chars, yes to named entities
- MT 01 Jun 2010: Update: Yes, unless the character is one of &, < or >, in which case it must be encoded as an entity to produce valid HTML.
- PH: Agree with MT 01 Jun 2010 above.
- MT 26 Jun 2010: Update: This is now implemented as per "MT 01 Jun 2010" above, and there have not been any complaints. I take that as tacit acceptance
- Numerical entities (e.g. &#65; &#38; &#235; which is A & ë) in <sticky> blocks, and in automatically-protected text like %MACRO{parameters}%:
- Is it acceptable for TinyMCE/WysiwygPlugin to convert them to plain characters or to named entities?
- MT: No to plain characters, but yes to named entities
- PH: Wouldn't it simplify the code if <sticky> sections were completely protected+left alone somehow?
- CC: No to plain chars, yes to named entities
- MT: I am leaning more towards Paul's way here - that <sticky> sections should be left untouched
- I made a unilateral decision to go with Paul's way. -- MichaelTempest - 01 Jun 2010
- Named entities (e.g. &euml; &frac12; which is ë ½ or, with (say) de_AT.ISO-8859-15 as the {Site}{Locale}, &euro; which is €) in <sticky> blocks, and in automatically-protected text like %MACRO{parameters}%:
- Is it acceptable for TinyMCE/WysiwygPlugin to convert them to plain characters or to numeric entities?
- Numerical or named entities in <verbatim> blocks:
- Is it acceptable for TinyMCE/WysiwygPlugin to convert them to plain characters or to named entities?
- When the {Site}{Locale} is not based on UTF-8:
- How should UTF-8 characters be encoded if they cannot be encoded directly in the site locale?
- MT: As entities
- PH: As entities
- CC: As entities
- What if those characters are in "protected" spans that correspond to <sticky> blocks, <verbatim> blocks, or automatically-protected text?
- MT: As entities, as I assume they have been added via the WYSIWYG editor
- PH: Entities is probably acceptable, but to compare - how are these characters saved via raw editor? Should this behaviour be consistent between both?
- MT: Yes, it should be consistent. However, it is far more likely to occur when clicking WikiText in TMCE (because the REST request is encoded as UTF-8) compared to saving from the raw editor (which is most likely to be encoded in the site's encoding).
- CC: As entities, with a warning.
- MT: Agreed - a warning is needed.
- When the {Site}{Locale} is based on UTF-8:
- Is it acceptable for TinyMCE/WysiwygPlugin to convert entities into plain characters?
- What if those entities were in a <sticky> block or <verbatim> block?
- What should happen to illegal or undefined characters, where there is no character defined for the code and no workaround (this would include invalid UTF-8 codes)?
- MT: Either ignore them, remove them, replace them with ? or convert them to numeric entities. The latter is the least destructive. If the content is changed, then there should be a warning.
- PH: numeric entities, issue a warning.
- MT: Update: There is a problem with conversion to entities. By definition, the numbers in numeric character entities are Unicode codepoints. If there is no character defined for the code, and the data is not encoded in UTF-8, then by definition, there is no corresponding Unicode codepoint (if there were, then there would be a character defined for the code), and so it is not possible to convert the character to an entity. - 12 Jun 2010
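MT's point about byte values with no defined character can be illustrated with a short sketch. It assumes Python's codec tables, which leave five Windows-1252 byte values undefined; Perl's Encode module, which Foswiki actually uses, may treat these bytes differently. The helper name `undefined_bytes` is made up for this example.

```python
def undefined_bytes(encoding):
    """Return the byte values that have no character assigned
    in a single-byte encoding."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Bytes that cannot be turned into a Unicode code point, and therefore
# cannot be written as a numeric entity either.
print([f"0x{b:02X}" for b in undefined_bytes("cp1252")])

# Invalid UTF-8 is a similar case: a lead byte with no continuation byte.
try:
    b"\xc3".decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```

For such input the remaining options from the list above are to ignore, remove or replace the byte, and in each case to warn the user that the content changed.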
-- Contributors: MichaelTempest, PaulHarvey, CrawfordCurrie
Discussion
I would like to tackle some encoding-related bugs, but I am not sure what the correct behaviour is. Hence this topic. I have given my interpretation, but I would like there to be consensus as to where we should aim before I start changing things.
-- MichaelTempest - 29 May 2010
Excellent start! I have added some characters my users have been using that cause horrible death of HTML2TML. I'm not sure where these characters are in the other locales.
Regarding saving of these characters that are beyond the site locale, I am wondering if Wysiwyg really has to be the place where this magic happens - should it be unified so that this work benefits "raw" saves too?
-- PaulHarvey - 30 May 2010
When you refer to "plain characters" in the Unresolved Issues, what do you mean? Characters already encoded in the {SiteCharset}?
- Yes - e.g. characters like "A" and "ë", and when the site locale is based on UTF-8, then multi-byte UTF-8-encoded characters are also plain characters. -- MichaelTempest - 30 May 2010
-- CrawfordCurrie - 30 May 2010
I see now that {Site}{CharSet} overrides {Site}{Locale}, and is only used when perl's Encode does not recognise the {Site}{Locale}. So normally, {Site}{CharSet} is left blank and the tail-end of {Site}{Locale} determines the character encoding. Therefore, I have changed the topic to use {Site}{Locale} instead of {Site}{CharSet}.
-- MichaelTempest - 30 May 2010
I started writing a spec from the parts we agree on.
-- MichaelTempest - 31 May 2010
I folded the issues we have resolved into a twisty, so that the remaining issues (including some newer issues) stand out.
There are some other issues that came out of the discussion, that may have wider impact than just the WYSIWYG editor:
Converting to entities in protected blocks
Paul said that saving from the "raw text" editor and converting HTML2TML should exhibit the same behaviour for (typically UTF-8) characters in protected blocks that cannot be represented in the site's character encoding.
I agree. However, is there a reliable way to detect the encoding actually used for a form POST, and if so, does Foswiki try to detect the encoding used, or does it blithely assume the site encoding?
Emit a warning when converting to entities in protected blocks
Crawford said that there should be a warning when converting a character in a protected block to an entity.
Sure. How should that warning be presented? The warning could be presented after clicking WikiText, and also after clicking Save. I think the Skins would have to be modified to allow for this. What is the current UI wisdom on the best way to present this? Would RelayAlertsToTopicTop be the way to go?
At present, TMCE POSTs HTML when you click the save button, and so it does the conversion to the site's character encoding itself. Perhaps it should first convert to TML and then save, so that all HTML2TML conversions are done the same way, allowing for the necessary error-detection to be implemented once.
-- MichaelTempest - 01 Jun 2010
I would be quite concerned about forcing TMCE to call HTML2TML before save. The UI latency in Foswiki is already "average" and I would love to keep working towards "fast".
Probably as a compromise, we could implement an HTML2SAVE handler.
-- PaulHarvey - 01 Jun 2010
I like the HTML2SAVE handler idea. I'll leave that for another proposal.
-- MichaelTempest - 12 Jun 2010
I modified the HTML2TML and TML2HTML conversions to be as per this spec. See the checkins for Tasks.Item5990.
-- MichaelTempest - 26 Jun 2010