Item5485: Attachments containing special characters (e.g. german umlauts: ä ö ü) can not be opened any more
Priority: Normal
Current State: No Action Required
Released In:
Target Release: n/a
Applies To: Engine
Component: I18N
Branches:
I migrated a TWiki installation from version 4.0.4 to version 4.2 running in a windows/cygwin environment.
Any attachment with special characters in the name (e.g. ä, ö, ü) can not be opened any more. The response is "Access denied", "error 403".
All attachments not containing those characters can be opened without problems as before.
Version 4.0.4 had no problems in storing and opening attachments containing these special characters.
--
TWiki:Main/MichaelSchmidt - 31 Mar 2008
It took a bit of hacking to create an attachment with those characters in the name, because TWiki filters those characters from attachment names by default. But once I did, I found it opened just fine, with no 403.
You need to provide more detail; a step-by-step description of how you arrived at the error.
--
CrawfordCurrie - 31 Mar 2008
- Sample attachment without umlaut:
- Sample attachment with an umlaut:
Error: (3) can't find FehlerGlblRemöte3.jpg in Tasks
On my TWiki 4.2.0 installation the error can easily be reproduced, simply by attaching a file containing a special character in its name, as in the attached JPEG files. When I try to open the attachment with the umlaut, the error appears.
The version of this Bugs web handles this file without problems (but it runs version 5.0.0).
OK, you can say "don't store attachments with umlauts". However, my German users already stored quite a lot of Word documents containing these umlauts as attachments when we used version 4.0.4, and they want to open some of them again.
--
TWiki:Main.MichaelSchmidt - 31 Mar 2008
I would never say that. But AFAIK there have been no changes to the handling code since 4.2 was released (you may know better). What's more, opening an attachment (by clicking on it) normally has nothing to do with TWiki; it's opening a URL. The "Access denied" is coming from your Apache server, I suspect.
--
TWiki:Main.CrawfordCurrie - 31 Mar 2008
Michael, do you use the same configuration with both versions (4.2 and 4.0.4)? {UseLocale}, {Site}{Locale}, {Site}{CharSet}, {Upper/LowerNational}. Maybe not all of them affect the problem, but it's worth a look.
--
TWiki:Main.OliverKrueger - 31 Mar 2008
Yes, I have thoroughly checked this and it is the same configuration.
However, I observed the following:
In version 4.0.4, when the mouse is over an attachment link, the string in the status line looks as follows: siteurl/bin/viewfile/web/topic?rev=n;filename=file.ext
In version 4.2 the string in the status line looks as follows: siteurl/pub/web/topic/file.ext
In other words: in the new version the attachment is accessed directly via its URL. This is quite a difference from the previous version, where the file was apparently served through the viewfile script.
Does this mean we have to ban the special characters in file names and replace all occurrences of these characters in previously stored attachments with standard ASCII characters? Or is there another solution to this problem?
--
TWiki:Main.MichaelSchmidt - 01 Apr 2008
The use of a direct link instead of viewfile helps performance, and many requested it because the viewfile URLs were difficult to use with tools like wget.
You need to look at the links that are created in the rendered page (view the page source) and compare them with the actual file names. Maybe we have some encoding issue.
It is impossible to guess your configuration. Please attach the LocalSite.cfg to this bug report (remove passwords and email addresses you want to keep secret first).
Also attach an actual topic (the raw file) so we can see what is in the META data of the topic.
--
TWiki:Main.KennethLavrsen - 01 Apr 2008
It also seems to be an encoding issue to me.
I have attached the current config file, a sample test page and three screenshots to demonstrate the problem.
As you can see in the source of the page, the file name in the "meta" statement contains the special character (umlaut ö) while the link statement generated by TWiki contains the encoding "%f6" for this character. The file cannot be displayed by TWiki. If I duplicate the link statement and replace the encoding by the character, the file can be displayed.
When I click on the file name displayed in the attachment area, TWiki generates a URL containing the encoding for the character, but the webserver is not able to access the file using this URL.
If I modify the URL replacing the encoding with the character, the webserver can access the file.
--
TWiki:Main.MichaelSchmidt - 10 Apr 2008
Assigned to
I18N
CC
Michael: You said you "migrated a TWiki installation"--by this, did you mean you copied the contents/configuration into a new environment (new computer and/or new web server)?
I 'played' with this issue on a Linux box which also hosts a Windows VM where I installed the provided Windows Installer (after 'deactivating' TWiki::Sandbox::sanitizeAttachmentName and attaching a file called plügin.gif w/ u-umlaut in both cases).
After that, I had a look at the Apache logs and tried to access the attachment/icon directly, using all three possibilities to encode the umlaut: plügin.gif, pl%FCgin.gif, and pl%C3%BCgin.gif. NB: If you use umlauts directly, they will automatically be encoded first--in all cases ({IE7,Firefox2.0.0.14}/Win->TWiki/{Win,Linux}, {Firefox/Konqueror}/Linux->TWiki/Linux), the resulting GET statement used the third encoding. (By "access directly", I mean that I copied the URLs into the browser's address bar; if you use the WYSIWYG editor, links within a topic will be converted (differently) by the plugin itself, which 'strangely' will result in attachments being displayed correctly there but not in "view" mode.)
Interestingly, the installed Apache/2.2.8 (Linux Mandriva) instance was unable to 'map' that third form onto the filename, but was ok with the second, while the Apache/2.2.4 (Win32) instance contained in the aforementioned Windows installer would only accept the third encoding (exactly as described above). Therefore, I guess that during the migration, that very behaviour changed. And while I clearly would call this a (mapping) bug, it's not TWiki's fault but the web server's... since the requests technically 'bypass' TWiki, checking/unifying the form of the escape codes used within TWiki topics can only be regarded as a work-around...
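To make the second and third forms concrete, here is a minimal Python sketch (not TWiki code) that percent-encodes the same file name under the two byte encodings discussed above. It only assumes the standard library's urllib.parse.quote; the file name is the one from the test.

```python
from urllib.parse import quote

name = "plügin.gif"

# Percent-encode the same file name under two different byte encodings.
latin1_form = quote(name, encoding="iso-8859-1")  # "ü" is the single byte 0xFC
utf8_form = quote(name, encoding="utf-8")         # "ü" is the two bytes 0xC3 0xBC

print(latin1_form)  # pl%FCgin.gif
print(utf8_form)    # pl%C3%BCgin.gif
```

The web server has to map one of these byte sequences back onto the bytes of the file name on disk, which is why a server that expects one form can fail on the other.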
--
TWiki:Main.MarkusUeberall - 12 May 2008
Addendum: If you cannot change the web server's behaviour, you may get away with a rather small set of rewriting rules for "wrong encodings" of german umlauts, but I didn't test this...
--
TWiki:Main.MarkusUeberall - 12 May 2008
For Apache backends, this might be useful to read:
https://issues.apache.org/bugzilla/show_bug.cgi?id=24333
--
TWiki:Main.OliverKrueger - 12 May 2008
I just found an easier solution to the problem: use mod_encoding (http://webdav.todo.gr.jp/).
After I added the following lines to my Apache configuration (cf. TWiki.ApacheConfigGenerator or Foswiki:Support.ApacheConfigGenerator), all URI encodings mentioned above worked (you may want to check that the given server encoding matches your setting, though):
<IfModule mod_encoding.c>
    EncodingEngine on
    SetServerEncoding iso-8859-15
</IfModule>
Example:
---++ Attachments with german umlauts
   * Attachment w/ german umlaut in name: <img src="/pub/Tasks/Item5485/pl%C3%BCgin.gif" alt="pl%C3%BCgin.gif">
   * Attachment w/ german umlaut in name: <img src="/pub/Tasks/Item5485/pl%FCgin.gif" alt="pl%FCgin.gif">
   * Attachment w/ german umlaut in name: <img src="/pub/Tasks/Item5485/plügin.gif" alt="plügin.gif">
should be displayed as follows:
--
TWiki:Main.MarkusUeberall - 16 May 2008
Re-opening this (with Normal priority) to remind Markus to document this (in Codev internationalisation docs) for the benefit of other TWiki users.
--
TWiki:Main.CrawfordCurrie - 17 May 2008
Attachments with I18N were working fine in Cairo but something broke when things were refactored.
Some things that may be relevant:
- TWiki:Codev.EncodeURLsWithUTF8 was written specifically to handle this case - for URLs that are served by TWiki, it converts from (URL-encoded) UTF-8 in the URL to the {Site}{CharSet} e.g. ISO-8859-15. For URLs served by Apache, it tries to pre-encode the generated URL in the attachment links that it provides, such that the URL is already in the site charset (because there won't be any chance for TWiki to fix the encoding when the attachment is served directly by Apache.) So this is something of a regression. I did work on this a while back but can't find the bug right now.
- TWiki:Codev.ApacheTwoBreaksNonUTF8EncodedURLsOnWindows - this is an Apache on Windows bug with non-UTF-8 URLs (e.g. pl%FCgin.gif not pl%C3%BCgin.gif) - I managed to get a patch into Apache 2.0.54 that mostly fixed this, but a TWiki patch may still be necessary on Windows (see topic for patch). That was to do with URL components such as PATH_INFO but it's worth a try with 2.0.54.
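The conversion described in the first point can be sketched roughly as follows. This is an illustration in Python, not the actual TWiki implementation: the function name and the SITE_CHARSET constant are hypothetical, and the "try UTF-8 first, otherwise assume the site charset" rule is a heuristic assumption.

```python
from urllib.parse import unquote_to_bytes

SITE_CHARSET = "iso-8859-15"  # assumption: stands in for {Site}{CharSet}

def url_path_to_site_charset(path: str, site_charset: str = SITE_CHARSET) -> bytes:
    """Return the bytes of a URL path as they would be named in the site charset."""
    raw = unquote_to_bytes(path)  # undo the %XX escapes
    try:
        text = raw.decode("utf-8")  # strict: fails on e.g. a lone 0xFC byte
    except UnicodeDecodeError:
        return raw  # not valid UTF-8, so assume it is already in the site charset
    return text.encode(site_charset, errors="replace")

print(url_path_to_site_charset("/pub/Tasks/Item5485/pl%C3%BCgin.gif"))
print(url_path_to_site_charset("/pub/Tasks/Item5485/pl%FCgin.gif"))
```

Both calls yield the same bytes, with "ü" as the single byte 0xFC. Note that the heuristic can misfire: some byte sequences are valid in both encodings, so this is only safe when such names are unlikely.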
--
TWiki:Main.RichardDonkin - 28 Jun 2008
Suggestion: When uploading, there's already a conversion of spaces to underscores -- this check should be expanded to also convert umlauts using the same logic that's used by TWiki:TWiki.TWikiRegistration.
--
TorbenGB - 08 Jan 2009
I am not sure I understand Torben's suggestion correctly. But if the proposal is to translate all non-English characters in attachment names to _ then I find that to be a very poor work-around for most of the populations on this planet. We need to provide a proper fix.
I cannot confirm where we are with this on Foswiki 1.0.0 because my test server is 6 time zones away and has decided to lock up. But my feeling is that with the right settings in configure, simple things like uploading attachments with umlauts and getting them back should be possible to get to work.
If some old installation uploaded them and stored the name as UTF8 while Foswiki thinks everything is ISO, then we probably cannot fix our way out of it, but at least new uploads and downloads should work. And the _ workaround would not solve that anyway.
--
KennethLavrsen - 09 Jan 2009
I have the same problem after switching from TWiki to Foswiki last week. I have topics with German umlauts and attachments with German umlauts. They can't be opened any more because of different encoding.
But when I create a new topic with an umlaut attachment, it works fine. And I cannot see any difference in file system/encoding, HTML source in the browser or topic source. I have to check the tips above again (I don't have access to my media files through the internet).
--
ChristianSchmidt - 09 Feb 2009
Christian, did you look at the %META:FILEATTACHMENT{name="..." ...} entries as well (i.e., the TML source, which can be accessed in the browser by appending ?raw=debug in the address bar)? If you didn't change anything else (e.g., Apache, Perl, ...) apart from the application, this would be the next place to look at.
Also note that the WYSIWYG editor may change the encoding as can be seen by looking at the summary of this entry ("ä ö ü" instead of "ÄÖÜ") -- however, if you didn't edit the page with the attachments in question following the switch, this can be ruled out (if you did, try to compare it with a copy of a topic taken from a backup of your TWiki installation).
In either case, I'd suspect that mod_encoding should solve your problems.
--
MarkusUeberall - 10 Feb 2009
Markus, mod_encoding doesn't change anything. I tried SetServerEncoding utf-8 and iso88591 and iso-8859-1.
I'll attach a PDF with all information I have. I can't see any difference between old topics/attachments (not working) and new ones (working fine).
Because of the amount of data (attachments are movies) I have no real backup of the old server; I had to move the data to the new one and delete it on the old one. The files are OK: I copied one movie to my Mac with scp, renamed it, and it plays.
There are some differences between the server:
- old: locale: de_DE.utf8
- new: locale: en_US.utf8
- old: files in root fs (ext3)
- new: files in mounted ext3-partition
But I don't think that's the problem.
I can give you a read-only access to the wiki, write me an email.
--
ChristianSchmidt - 13 Feb 2009
Additional Info: kind of fixed my problem. I had wrong folder names, file names were correct. So I renamed only folders and it works. I still have problems with Anchor-links (like in TOC) with umlauts, but we can live with this restriction.
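For anyone in a similar situation, a small script can show which directory and file names are stored in which encoding, since that is hard to see in a terminal or browser. This is a sketch under assumptions: the helper names are made up, and any name that is neither ASCII nor valid UTF-8 is simply labelled "non-utf-8" (most likely a legacy charset such as ISO-8859-1).

```python
import os

def classify_name(raw: bytes) -> str:
    """Label a raw file or directory name by how its bytes decode."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "non-utf-8"

def scan(root: str):
    """Yield (path, label) for every non-ASCII directory or file name below root."""
    # Walking with a bytes path makes os.walk return the raw bytes of each name.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for raw in dirnames + filenames:
            label = classify_name(raw)
            if label != "ascii":
                yield os.path.join(dirpath, raw), label
```

Calling scan() on the pub directory of the installation and printing the results lists folders and files separately, so a mismatch between folder-name and file-name encoding like the one above becomes visible.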
--
ChristianSchmidt - 27 Feb 2009
I still have problems with Anchor-links (like in TOC) with umlauts: see Item817.
--
ChristianLudwig - 27 Feb 2009
Kenneth, I didn't mean to suggest that all umlauts should be converted to underscores; I agree that would be a poor solution.
I meant that the user registration will intelligently create a username without umlauts even when there are umlauts in the person's name, e.g. Müller becomes Mueller. That transliteration logic could be implemented for file names as well, and that would be a rather elegant solution.
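As an illustration of what such a transliteration could look like for file names, here is a minimal Python sketch. It is not the logic TWiki registration actually uses; the mapping table and the function name are assumptions for the example, and only German umlauts and ß are covered.

```python
import unicodedata

# Assumed mapping for illustration; covers German umlauts and sharp s only.
UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def transliterate(name: str) -> str:
    """Replace German umlauts in a name with their two-letter ASCII spelling."""
    # Normalise first so that "u" + combining diaeresis is treated like "ü".
    composed = unicodedata.normalize("NFC", name)
    return "".join(UMLAUT_MAP.get(ch, ch) for ch in composed)

print(transliterate("Müller"))      # Mueller
print(transliterate("plügin.gif"))  # pluegin.gif
```

Characters outside the table pass through unchanged, so this only helps German names and would need a larger table for other languages.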
--
TorbenGB - 14 Jan 2010
The longer-term solution we are all looking for is using UTF8 everywhere and making that work, so that no matter where in the world you are from, you can name topics and attachments the way your language is written. This is also why you have not seen much effort put into the workarounds.
--
KennethLavrsen - 14 Jan 2010
Closed in favour of Item9462 which is a duplicate of this, but contains a solution.
--
CrawfordCurrie - 29 Aug 2010