utf8 (char) does not map to Unicode

I am upgrading from 1.1.9 to 2.0.1. I moved my data files over to the new directory tree. (Note: I did not use bulk_copy.pl because it's currently broken.)

I opened Main/WebHome and realized that I had missed installing FlexWebListPlugin. After I did so, when I went back to Main/WebHome, I saw this error:

Foswiki detected an internal error - please check your Foswiki logs and webserver logs for more information.

utf8 "\x92" does not map to Unicode

I disabled FlexWebListPlugin and can view Main/WebHome again.

A web search tells me that byte 0x92 (146 decimal) is the right single quotation mark (a so-called smart quote) in the Windows-1252 code page.
Given what I know about FlexWebListPlugin, I am guessing that the character is in the description text of one of my webs.
Trying to open System/SiteMap (after disabling FlexWebListPlugin) throws the same error.
- This would seem to confirm my guess about where the problem is.
However, a brute-force search for the problem file would be tedious (and is something that should be automated).
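
The reading of byte 0x92 can be checked from a shell. This is a minimal sketch, assuming a POSIX shell plus an `iconv` that accepts the `CP1252` charset name; the function name `show_cp1252_byte` is made up for illustration.

```shell
# Hypothetical helper: print, as hex, the UTF-8 bytes that one
# Windows-1252 byte converts to. The argument is the byte in octal.
show_cp1252_byte() {
    printf "\\$1" | iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

# Octal 222 is hex 92. U+2019 (right single quotation mark) is
# encoded in UTF-8 as the three bytes e2 80 99.
show_cp1252_byte 222    # prints e28099

# The same byte on its own is not valid UTF-8, so this validation fails:
printf '\222' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || echo "0x92 alone is not valid UTF-8"
```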

Is there a script that I can run that will locate all topic files that contain unacceptable characters that do not map to Unicode?

Essentially, I want to run just the "find bad encodings" portion of bulk_copy.pl and identify problems. I don't need it to fix them automatically, only to identify them.

I can imagine that such a script could be useful for other people as well...

-- VickiBrown - 17 Sep 2015

CharsetConverterContrib has an inspect mode that reports issues. It also has a repair option that detects alternate encodings and converts the topic. In your case, it will see the "smart quotes" that are part of the Windows cp-1252 code page and will attempt to convert the topic from that code page.

We still have some challenges in the conversion tools, but it's getting closer. Remaining issues:

- Topics containing more than one encoding (someone pastes in smart quotes and also some UTF-8 characters).
- Links to attachments with high characters in the attachment name. They are entity-encoded in the topic, so they detect as plain ASCII and don't get converted.

-- GeorgeClark - 17 Sep 2015
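
The general idea of "check the expected encoding, otherwise fall back to Windows-1252" can be sketched in shell. This is not CharsetConverterContrib's actual implementation, only an illustration that assumes `iconv` is installed; the function name `to_utf8_with_fallback` is hypothetical, and the fallback encoding is assumed rather than detected.

```shell
# Hypothetical helper: write a UTF-8 version of file $1 to stdout.
# A file that already validates as UTF-8 is passed through unchanged;
# anything else is ASSUMED (not detected) to be Windows-1252.
to_utf8_with_fallback() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        cat "$1"
    else
        iconv -f CP1252 -t UTF-8 "$1"
    fi
}
```

A whole-file fallback like this also illustrates the mixed-encoding problem listed above: a topic holding both a cp-1252 smart quote and genuine UTF-8 sequences fails UTF-8 validation, so its UTF-8 bytes would be reinterpreted as cp-1252 and come out garbled.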

Actually, some sites with an installed base of Windows users are reporting better luck converting by just setting the {Site}{CharSet} of the 1.1.9 system to 'cp-1252', so that the default source encoding includes the Windows characters.

-- GeorgeClark - 17 Sep 2015

FYI, this very simple grep command should work on Unix-based servers to hunt down files with issues:

find "$@" -name '*.txt' -print0 | LC_ALL=C xargs -0 grep -lP '[\x80-\xFF]'

-- VickiBrown - 17 Sep 2015
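
Note that a search for high bytes lists every file containing any non-ASCII byte, including topics that are already valid UTF-8. A narrower check, sketched here on the assumption that `iconv` is installed and that UTF-8 is the target encoding, lists only the files that fail UTF-8 validation. The function name `find_invalid_utf8` is hypothetical.

```shell
# Hypothetical helper: list *.txt files under the given directories
# whose contents are not valid UTF-8.
find_invalid_utf8() {
    find "$@" -type f -name '*.txt' -exec sh -c '
        for f in "$@"; do
            # iconv exits non-zero when it meets an invalid byte sequence.
            iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Example call (the path is an assumption; substitute your own data directory):
# find_invalid_utf8 /path/to/foswiki/data
```

Pure-ASCII files and correctly encoded UTF-8 files pass silently, so the output is limited to topics that still need converting.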

QuestionForm

Subject: Issue in browser
Status: Asked

Topic revision: r6 - 19 Sep 2015, VickiBrown

