Item14091: Foswiki page cache breaks UTF-8 characters, causing non-ASCII characters to become gibberish after caching.
Priority: Normal
Current State: Confirmed
Released In: 2.1.7
Target Release: patch
Applies To: Engine
Component: Cache
Branches:
Description
By enabling the Foswiki page cache ({Cache}{Enabled}) under Tuning in Configure, UTF-8 characters become gibberish, both in pages that use the %TRANSLATE% macro and in the default interface localisations themselves. This is most apparent with languages such as Russian, Bulgarian, or Chinese, but it also shows up in languages such as French.
I traced this to the Foswiki page cache because only cached pages are affected: when a page is loaded with ?cache=refresh, or when an uncached page is visited, the user interface and wiki content load correctly. However, when the page is visited normally again, the broken content is presented once more.
Example of the broken utf-8 that is presented when a cached page is retrieved
System Setup
I am using:
- Ubuntu 16.04
- Apache/2.4.18
- MySQL 5.7.12-0ubuntu1
- Foswiki 2.12
Debugging Process
At first, despite my strong assumption that the Foswiki page cache is to blame, I took steps to rule out all other possible variables. I made sure to:
- Disable CDN-level caching by setting Cloudflare to Development mode.
- Disable apache2 mod_pagespeed in case page rewriting was the problem.
- Test pages in Google Chrome incognito mode with browser-level cache disabled.
Even after doing all of these, the above symptoms were still present. I narrowed the problem down to either how MySQL is storing the cached pages, or how Foswiki is entering data into the cache.
I am using the Foswiki page cache with a locally run MySQL server. First, I checked the character set and collation of the cache database. They were not utf8, but latin1. I assumed this was the cause of the problem. Therefore, I stopped apache2, deleted the foswiki_cache_deps and foswiki_cache_pages tables in the database, and changed the character set and collation of the MySQL database to utf8 and utf8_general_ci.
I also added the following lines to the MySQL configuration file my.cnf, to make sure all tables created in future will be in utf-8:
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
init_connect='SET collation_connection = utf8_unicode_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_unicode_ci
skip-character-set-client-handshake
After doing all of this, I restarted the mysql client, and checked once more that the database is now in utf8. Then, I restarted apache2 and checked out the website.
The issue persists
However, despite explicitly configuring MySQL to use utf8, the issue persists! On first visit, all non-ASCII characters render correctly, but any subsequent cached pages present broken gibberish rather than Cyrillic or Chinese characters. It is essential for my web to have non-Latin multilingual support. I've tried looking at the configuration section again, but there are no obvious options that will solve this problem. I think this is caused by the way Foswiki enters information into the cache databases: although my installation is utf-8 by default (as it says in configure), it appears the cache is still in some other character encoding scheme. How can I solve this issue?
Workaround
A workaround for this problem is to disable caching entirely. This is not a good solution, as it has negative performance implications. I hope we can work together and find a way to solve this problem.
--
ShenZhouHong - 11 Jun 2016
Oh, please tell me if you need any additional information to debug this problem. I'm extremely new to Perl and Foswiki, and this is my first time configuring something like this. I'll leave this page open and I'll be standing by if anything further is requested of me. It's my first time making a real bug report, so if I missed anything please let me know.
--
ShenZhouHong - 11 Jun 2016
For testing purposes (taken from the Project Gutenberg EBook of "Journey to the West"):
第一回 靈根育孕源流出 心性修持大道生
詩曰:
混沌未分天地亂,茫茫渺渺無人見。
自從盤古破鴻濛,開闢從茲清濁辨。
覆載群生仰至仁,發明萬物皆成善。
欲知造化會元功,須看西遊釋厄傳。
--
MarkusUeberall - 11 Jun 2016
I've added the text here. I've currently turned off caching so it displays perfectly fine - but when caching is turned on it becomes gibberish.
https://csc.uwc.wiki/Sandbox/UTF8Test
This is what the text becomes once caching is turned on:
第ä¸å éæ ¹è²åæºæµåºãå¿æ§ä¿®æ大éç
è©©æ°ï¼ æ··æ²æªå天å°äºï¼è«è«æ¸ºæ¸ºç¡äººè¦ã èªå¾ç¤å¤ç ´é´»æ¿ï¼éé¢å¾è²æ¸
æ¿è¾¨ã è¦è¼ç¾¤çä»°è³ä»ï¼ç¼æè¬ç©çæåã 欲ç¥é åæå
åï¼é ç西ééåå³ã
I've tried caching using the SQLite cache implementation as well, and the issue persists there too. It appears to be a problem with how Foswiki writes data to the cache itself. Internationalization support is one of Foswiki's priorities, and this bug should be fixed in order to allow proper internationalization.
--
ShenZhouHong - 11 Jun 2016
I have seen the same with the default (sqlite I think) caching store and UTF-8 setting with western non-ASCII characters (öäüèčš etc.). It hasn't been annoying enough for me to dig into it yet.
--
PhilippeKehl - 11 Jun 2016
Glad to see that this bug can be replicated. I hope a solution can be found for it soon.
--
ShenZhouHong - 11 Jun 2016
Two short notes:
- On f.o, foswiki_cache tables are still latin1 based, and as you can see below, caching this page still works.
- From the above, it's not clear whether the DB cache tables were recreated after changing the encoding; AFAIK, existing tables are not converted automatically when changing the DB system defaults.
--
MarkusUeberall - 11 Jun 2016
I've deleted the foswiki_cache_pages and foswiki_cache_deps tables after changing the encoding, and before starting apache. The tables were automatically created again, but I haven't deleted the whole database.
--
ShenZhouHong - 11 Jun 2016
Michael pointed out on IRC that the cache databases are only the indices to the cache, and are used to find and invalidate cache entries when topics are updated. The actual cached pages are written to the directory configured in $Foswiki::cfg{Cache}{RootDir}, typically the working/cache directory. So encoding issues in the database won't have anything at all to do with the cached data.
The cache files are named using a hashed filename, for example d538d946e4202519b18dfeb2342b97ae. If your cache encoding is corrupted, it's something related to writing/reading these files.
--
GeorgeClark - 12 Jun 2016
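One way to inspect such a cache file is to read it as raw bytes and test whether it looks like UTF-8 that has been encoded a second time. The following is only a heuristic sketch, not Foswiki code: the helper name looks_double_encoded is made up for illustration, and the file path is whatever you pass on the command line.

```perl
use strict;
use warnings;
use Encode ();

# Heuristic: data is "probably double-encoded" if decoding it once as UTF-8
# yields only code points below 256, and those code points, taken as bytes,
# again form valid UTF-8 containing non-ASCII characters.
sub looks_double_encoded {
    my ($raw) = @_;
    my $strict = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    # First decode: must be valid UTF-8 at all.
    my $once = eval { Encode::decode( 'UTF-8', $raw, $strict ) };
    return 0 unless defined $once;

    # Characters above U+00FF cannot come from a byte-wise second encoding.
    return 0 if $once =~ /[^\x00-\xFF]/;

    # Pure ASCII looks the same either way, so nothing to detect.
    return 0 unless $once =~ /[\x80-\xFF]/;

    # Second decode: treat the code points as bytes and try UTF-8 again.
    my $as_bytes = Encode::encode( 'ISO-8859-1', $once, $strict );
    my $twice = eval { Encode::decode( 'UTF-8', $as_bytes, $strict ) };
    return defined $twice ? 1 : 0;
}

# Optional command-line use: pass the path of one cache file.
if (@ARGV) {
    my $path = $ARGV[0];
    open( my $fh, '<:raw', $path ) or die "Can't read $path - $!\n";
    my $raw = do { local $/; <$fh> };
    close($fh);
    print looks_double_encoded($raw)
      ? "$path looks double-encoded\n"
      : "$path does not look double-encoded\n";
}
```

The check can misfire on unusual content (for example, a page that consists only of Latin-1 range characters that happen to form valid UTF-8 byte sequences), so treat the result as a hint rather than proof.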
The cache file is written in lib/Foswiki/PageCache/DBI.pm:
#writeDebug("saving data of $webTopic into $fileName");
open( $FILE, '>:encoding(utf-8)', $fileName )
or die "Can't create file $fileName - $!\n";
print $FILE $variation->{data};
close($FILE);
--
GeorgeClark - 12 Jun 2016
Shen, I was trying to register your site in order to repro the error. However the registration code did not make it through to me. The email server says
<www-data@dilijan>: Sender address rejected: need fully-qualified address
This is just a side note.
I then tried to reproduce the error on my installation with the above test text but was unable to get any encoding errors. Could you add some more info on which Perl version you are using and what the settings in your LocalSite.cfg are? Best would be to attach it here, with any private information removed first, of course.
Please also make sure that no Cloudflare CDN or mod_pagespeed is activated. Whatever module might get in the middle: please disable it so that the raw results are delivered by Foswiki.
--
MichaelDaum - 13 Jun 2016
I have the problem on my production Foswiki but not in my dev-Foswiki. The only difference is that the first runs in mod_perl and the other as CGI. Otherwise it's the same host, httpd, Perl etc. In both instances the string "äöüč" appears as "äöü" in the cache file. Why's that? "äöü" is also what I see in the page mod_perl-served from the cache (Content-type headers etc. look okay). The topic.txt file has the correct "äöüč" in both installations.
I have {HttpCompress} disabled because I cannot get that to work in the mod_perl instance (I get weird Firefox "content" or "decoding" error pages, or something like that). It does work on the CGI installation. Maybe that's related? Some encoding weirdness messing up the gzipped data?
The problem does not occur on pages that have a <dirtyarea>.
Any ideas where to look?
--
PhilippeKehl - 17 Jun 2016
I'm wondering if this is somehow related to API differences in mod_perl vs. plain old CGI / FastCGI. We have many sites running with fcgi / FastCGI without issues, including foswiki.org.
--
GeorgeClark - 18 Jun 2016
Michael, site registration is currently limited to holders of an @uwcchina.org address. You are right, the email doesn't seem to work either. I'm planning to rebuild the entire Foswiki site in light of the trouble I am facing, in hopes I can reach a solution.
Philippe, I have the exact same problem with content encoding when I turn on {HttpCompress}. The content encoding problem also disappears when pages are loaded with the ?cache=refresh parameter.
George - I'll try to perform a clean reinstall of the site with mod_perl rather than the CGI engine, since I am using FastCGI right now. Perhaps that is the root of the issues?
--
ShenZhouHong - 22 Jun 2016
Actually we are aware of several sites, including foswiki.org, successfully using FastCGI without any character set issues, so I doubt that switching to mod_perl will help. I'm not sure where to go now with this.
--
GeorgeClark - 23 Jun 2016
For those who encounter this problem: have you verified that all locale-specific settings are correct? (See "How to set up a clean UTF-8 environment in Linux" and "Unicode-processing issues in Perl and how to cope with it".)
--
MarkusUeberall - 27 Jun 2016
I have the problem with mod_perl, but not with CGI.
My system locales should be alright. The system default is en_GB.UTF-8, and "locale -a" says that it is available. Perl should be alright too (LC_ALL=en_GB.UTF-8 perl -e 'print "hello\n";' doesn't complain about a missing locale, which it would if the locale wasn't available).
I've now tried various combinations of LANG and LC_ALL settings in /etc/apache/envvars and of the {Site}{Locale}, {UseLocale} and {Store}{Encoding} settings, without any success.
I'll try getting mod_perl with cache compression working. Somehow I'm unable to prevent Apache's mod_deflate from compressing the output again (which is why I have {HttpCompress} disabled).
Or I'll try fcgi.
--
PhilippeKehl - 01 Jul 2016
mod_perl is really not recommended when you also want performance. mod_deflate is a problem as well: Foswiki already caches compressed pages, so there is no need to compress the page again on every new request. I'd highly recommend fcgi. Sure, even if that fixes your encoding issues, we still wouldn't really know what caused them...meh
--
MichaelDaum - 01 Jul 2016
I was able to fix it for me (mod_perl, {HttpCompress} off) by removing the ":encoding(utf-8)" from the open() call in Foswiki::PageCache::DBI::setPageVariation(). I.e. I changed
open( $FILE, '>:encoding(utf-8)', $fileName )
to
open( $FILE, '>', $fileName )
And now it works.
Now also the working/cache/..... file shows the original äöü content instead of the garbled version. That somehow makes more sense to me. However, my CGI installation of foswiki has the garbled version in the cache file but all is fine and öäü displays correctly. I'm confused.
I haven't yet traced where $variation->{data} is filled in or what it looks like.
--
PhilippeKehl - 01 Jul 2016
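A minimal sketch of why dropping the layer can make a difference, under the assumption (not confirmed in this thread) that $variation->{data} already holds UTF-8 encoded bytes when it reaches the open() call. The in-memory file handles and the write_with helper are illustrative only and are not part of Foswiki.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "äöüč" as a character string, then as UTF-8 bytes (8 bytes).
my $chars = "\x{E4}\x{F6}\x{FC}\x{10D}";
my $bytes = encode( 'UTF-8', $chars );

# Write $data to an in-memory file using the given open() mode
# and return the bytes that ended up in the "file".
sub write_with {
    my ( $mode, $data ) = @_;
    my $buffer = '';
    open( my $fh, $mode, \$buffer )
      or die "Can't open in-memory file - $!\n";
    print $fh $data;
    close($fh);
    return $buffer;
}

# With the layer, every input byte is treated as a Latin-1 character
# and encoded to UTF-8 again. Without it, the bytes pass through as-is.
my $layered = write_with( '>:encoding(utf-8)', $bytes );
my $plain   = write_with( '>',                 $bytes );

printf "input bytes:         %d\n", length $bytes;
printf "written with layer:  %d\n", length $layered;
printf "written without:     %d\n", length $plain;
```

Under that assumption the layered write produces 16 bytes from the 8 input bytes, which is the classic double-encoding pattern, while the plain write keeps the 8 bytes unchanged. If the data were a decoded character string instead, the layer would be the correct choice, so which variant is right depends on what the caller passes in.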
I'll try fcgi sometime.
--
PhilippeKehl - 01 Jul 2016
It works fine with fcgi (and {HttpCompress} off, as I still cannot get that to work with Apache -- it insists on mod_deflate-ing the content).
I'm still puzzled by the garbled content in the cache file (e.g. "äöü" instead of the original "äöüč"), but the cached pages now display correctly. Why would the cache not store the "raw" contents?
I'm confused by all the ":encoding(utf-8)" vs ":raw" stuff. I never had to use any of those in the perl/CGI apps I wrote (which would store strings in files and databases).
--
PhilippeKehl - 25 Sep 2016
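One possible explanation for a cache file that looks garbled on disk yet displays correctly, offered purely as an assumption: if the file is read back through a matching :encoding(utf-8) layer, the extra encoding step applied on write is undone on read. The sketch below uses in-memory files and does not reflect how Foswiki actually reads its cache; it also does not explain why mod_perl behaved differently.

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 bytes for "äöüč", standing in for already-encoded page data.
my $bytes = encode( 'UTF-8', "\x{E4}\x{F6}\x{FC}\x{10D}" );

# Write the bytes through the encoding layer: the "file" is double-encoded.
my $file = '';
open( my $out, '>:encoding(utf-8)', \$file )
  or die "Can't open in-memory file for writing - $!\n";
print $out $bytes;
close($out);

# Read back through the same layer: one level of encoding is removed.
open( my $in_layer, '<:encoding(utf-8)', \$file )
  or die "Can't open in-memory file for reading - $!\n";
my $via_layer = do { local $/; <$in_layer> };
close($in_layer);

# Read back raw: the double-encoded bytes are returned untouched.
open( my $in_raw, '<:raw', \$file )
  or die "Can't open in-memory file for reading - $!\n";
my $via_raw = do { local $/; <$in_raw> };
close($in_raw);

print "layered read matches original bytes: ",
  ( $via_layer eq $bytes ? "yes" : "no" ), "\n";
print "raw read matches original bytes:     ",
  ( $via_raw eq $bytes ? "yes" : "no" ), "\n";
```

In this sketch the symmetric write and read cancel out, while a raw read of the same file exposes the double encoding. A mismatch between how the file is written and how it is read would therefore be enough to produce the gibberish described above.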
I can confirm that removing the encoding layer fixes the issue.
--
MichaelDaum - 01 Oct 2018