CharsetConverterContrib: some characters were deleted

CharsetConverterContrib did a good job after an upgrade from TWiki 6.0.1 to Foswiki 2.1.6, except for some characters.

These characters weren't converted correctly. They appear to have been deleted, although stepping through the text with the left/right arrow keys suggests that a character is still there.

JavaScript Escape	character example	result	remark
\xEF	geïnstalleerd
\xEB	kopiëren
\xB2	² superscript two
\xB3	³ superscript three
\xB7	· middot
\xE9	één
\xE0	à la bonheur
\x85	next line
\x80	€		$ and £ OK!
\xBB	»
\x91	‘	not applicable
\x92	’
\x93	“
\x94	”

-- Main.StijnBousard - 04 Apr 2018

The Charset Converter or bulk copy utilities can only convert based on the source character set they are told to assume. Foswiki and TWiki have always defaulted to ISO-8859-1, but they did nothing to enforce that the characters in a topic actually come from that set. Most users probably access Foswiki from Windows, which by default uses CP-1252, the "Windows" code page / character set. CP-1252 is a "superset" of ISO-8859-1 that fills in some of the gaps in the ISO character set.

See: Wikipedia:ISO/IEC_8859-1, which does not define the 0x7F-0x9F range. Wikipedia:Windows-1252 shows the additional characters.
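
To make the difference concrete, here is a small Python sketch. Python is used only for illustration; it is not part of Foswiki or of the converter, and the helper name `decode_both` is made up for this example. It decodes a few of the reported bytes once as ISO-8859-1 and once as CP-1252 and prints the resulting code points.

```python
# Illustration only: not part of Foswiki or CharsetConverterContrib.
# Bytes from the report above that fall in the 0x80-0x9F range.
REPORTED = bytes([0x80, 0x85, 0x91, 0x92, 0x93, 0x94])


def decode_both(data: bytes) -> tuple[str, str]:
    """Return the same bytes decoded as (ISO-8859-1, CP-1252)."""
    return data.decode("iso-8859-1"), data.decode("cp1252")


as_latin1, as_cp1252 = decode_both(REPORTED)

# Print the code point each byte turns into under either assumption.
for byte, latin1_char, cp1252_char in zip(REPORTED, as_latin1, as_cp1252):
    print(f"0x{byte:02X}  ISO-8859-1 -> U+{ord(latin1_char):04X}  "
          f"CP-1252 -> U+{ord(cp1252_char):04X}")

# Bytes at 0xA0 and above decode identically in both character sets.
assert b"\xe9".decode("iso-8859-1") == b"\xe9".decode("cp1252")
```

Python's ISO-8859-1 codec maps these bytes to the invisible control code points U+0080-U+009F, while CP-1252 maps them to the euro sign, the ellipsis and the curly quotes. If the converter behaves similarly (an assumption, I have not checked), that could explain text that looks deleted while the cursor still steps over something.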

I'm guessing that you used the default ISO-8859-1 character set when you ran the converter. It should have flagged warnings when it encountered the unknown characters, but given the amount of output these are easy to miss. If you have a backup of the old installation, you could run the converter again against the original data and specify the options that override the charset.

Never run the converter on data that has already been converted! Running the converter a second time will corrupt UTF-8 characters. Unfortunately we don't have any tool that can easily fix a topic containing a mixture of UTF-8 and non-UTF-8 characters. If you cannot get back to the original pre-conversion data, then the only solution I'm aware of is to manually edit the topics to replace the incorrect text.
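
As a rough illustration of why a second pass is harmful, and of how a script could check a file before touching it, here is a Python sketch. It is not the converter's own logic; the function names `classify` and `to_utf8_once` and the default source charset `cp1252` are assumptions made for this example.

```python
# Illustration only: not part of Foswiki or CharsetConverterContrib.


def classify(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8' or 'legacy-or-mixed'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict: any invalid sequence raises
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy-or-mixed"


def to_utf8_once(data: bytes, source_charset: str = "cp1252") -> bytes:
    """Convert legacy bytes to UTF-8; leave ASCII or valid UTF-8 untouched."""
    if classify(data) != "legacy-or-mixed":
        return data
    return data.decode(source_charset).encode("utf-8")


original = "één €".encode("cp1252")   # bytes as an old topic might hold them
converted = to_utf8_once(original)    # first pass: real conversion
again = to_utf8_once(converted)       # second pass: detected and skipped

# What an unguarded second pass would do to the already converted bytes.
double = converted.decode("cp1252").encode("utf-8")
print(double.decode("utf-8"))
```

The check cannot tell a pure legacy file from one that mixes UTF-8 and non-UTF-8 bytes, so a mixed topic would still need the manual repair described above. A legacy file can also happen to be valid UTF-8 by coincidence, although that is unlikely for ordinary text.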

-- Main.GeorgeClark - 04 Apr 2018 - 14:56