This question about Issue in browser: Asked
utf8 (char) does not map to Unicode
I am upgrading from 1.1.9 to 2.0.1. I moved my data files over to the new directory tree. (Note: I did not use bulk_copy.pl because it's currently broken.)
I opened Main/WebHome and realized that I had missed installing FlexWebListPlugin. After I did so, when I went back to Main/WebHome, I saw this error:
Foswiki detected an internal error - please check your Foswiki logs and webserver logs for more information.
utf8 "\x92" does not map to Unicode
I disabled FlexWebListPlugin and can view Main/WebHome again.
- A web search tells me that byte 0x92 (146 decimal) is the right single quotation mark (a so-called smart quote) in the Windows cp-1252 code page.
- Given what I know about FlexWebListPlugin, I am guessing that the character is in the description text of one of my webs.
- Trying to open System/SiteMap (after disabling FlexWebListPlugin) throws the same error.
- This would seem to confirm my guess about where the problem is
- However, a brute-force attempt to uncover the problem file will be tedious (and should be something that can be automated)
Is there a script that I can run that will locate all topic files that contain unacceptable characters that do not map to Unicode?
Essentially, I want to run just the "find bad encodings" portion of bulk_copy and identify problems. I don't even need to have it automatically fix these, only identify them.
I can imagine that such a script could be useful for other people as well...
--
VickiBrown - 17 Sep 2015
The CharsetConverterContrib has an inspect mode and will report issues. It also has a repair option that will detect alternate encodings and will convert the topic. So in your case, it will see the "smart-quotes" that are part of the Windows cp-1252 codepage, and will attempt to convert the topic with that codepage.
We still have some challenges in the conversion tools, but it's getting closer. Remaining issues:
- Topics containing more than one encoding. (For example, someone pastes in smart-quotes and also some utf-8 characters.)
- Links to attachments with high characters in the attachment name. Because they are entity-encoded in the topic, they are detected as plain ASCII and don't get converted.
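As a rough illustration of what converting "with that codepage" means at the byte level, here is a sketch using plain iconv. This is an assumption for demonstration only: it is not how CharsetConverterContrib works internally, and the function name cp1252_to_utf8 is made up for this example. The single cp-1252 byte 0x92 should come out as the three-byte UTF-8 sequence e2 80 99 (U+2019).

```shell
# Illustration only: reinterpret stdin as Windows cp-1252 and emit UTF-8.
# Plain iconv; NOT what CharsetConverterContrib does internally.
# cp1252_to_utf8 is a hypothetical helper name.
cp1252_to_utf8() {
    iconv -f CP1252 -t UTF-8
}

# Byte 0x92 (octal 222) is the cp-1252 right single quotation mark.
# Dump the converted bytes in hex to see what it turned into.
printf 'it\222s\n' | cp1252_to_utf8 | od -An -tx1
```

Running the same conversion over a topic that already contains real UTF-8 characters would mangle them, which is why the mixed-encoding case listed above is hard.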
--
GeorgeClark - 17 Sep 2015
Actually, some sites with an installed base of Windows users are reporting better luck converting by just setting the {Site}{CharSet} of the 1.1.9 system to 'cp-1252', so that the default source encoding includes the Windows characters.
--
GeorgeClark - 17 Sep 2015
FYI, this very simple grep command should work on Unix-based servers to hunt down topic files that contain high (non-ASCII) bytes:
find "$@" -name '*.txt' -print0 | LC_ALL=C xargs -0 grep -lP '[\x80-\xFF]'
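A byte-range match like this is meant to flag every file containing a non-ASCII byte, which includes topics that are already valid UTF-8. A narrower check is to ask iconv whether each file decodes as UTF-8 and list only the ones that do not. What follows is a minimal sketch, assuming iconv is installed; the function name find_bad_utf8 is hypothetical, and it does not cope with file names that contain newlines.

```shell
# Hypothetical helper: list topic files that are not valid UTF-8.
# Arguments: one or more directories to search.
find_bad_utf8() {
    find "$@" -type f -name '*.txt' | while IFS= read -r f; do
        # iconv exits non-zero when the input contains a byte sequence
        # that is not valid UTF-8 (for example a lone 0x92).
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Called as find_bad_utf8 followed by one or more data directories, it prints one path per line. It only identifies files and changes nothing.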
--
VickiBrown - 17 Sep 2015