Item13696: Some attachments can be unreachable with non-UTF8 store encoding.
Priority: Urgent
Current State: Closed
Released In: 2.0.2
Target Release: patch
When an attachment is uploaded, the filename is saved into the Store with the configured {Store}{Encoding}. However, browser links are always generated in UTF-8, so on systems not using viewfile for attachments, attachments with non-ASCII characters in their filenames become unreachable.
This can be addressed in 3 ways:
- Convert the store to UTF-8. This is the recommended solution.
- Use viewfile for all attachment access.
- Use a plugin completePageHandler to rewrite any pub URLs from UTF-8 to the configured Store encoding.
--
GeorgeClark - 10 Sep 2015
There is a bit more to that quote. It starts out with a caveat: the URI scheme ... represents textual data consisting of characters from the UCS. If we start from the native URI that Apache understands, http://somesite/pub/Sandbox/iso-8859-1-location, it does not contain characters from the UCS. It is consistently encoded as, and needs to be requested as, an ISO-8859-1 URI.
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
So the initial assumption that the data consists of UCS characters is not met. The problem is that Foswiki is generating a URI as if the server location did contain UCS characters, so the generated URI is incorrect. Trying to modify Apache (and lighttpd and nginx and IIS ...) to convert from the UTF-8 location back to the real ISO-8859-1 location doesn't make sense. We need to generate the correct URLs in the first place.
Ignoring Foswiki:
- The URI of the file is pub/Sandbox/TestUtf8Attach/_Andr%e9.jpg
- It does not contain any UCS characters.
But Foswiki generates:
- URI pub/Sandbox/TestUtf8Attach/_Andr%c3%a9.jpg
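To see why the second URI misses the file, it may help to look at the bytes each URI asks the web server for. The following is only an illustrative Python sketch (Foswiki itself is Perl); it assumes the file was stored with an ISO-8859-1 name and that the server maps the percent-decoded bytes straight to a filename.

```python
from urllib.parse import unquote_to_bytes

# Bytes of the filename as written by an ISO-8859-1 store (assumption).
stored_name = "_André.jpg".encode("iso-8859-1")

# Bytes the server sees after percent-decoding each URI.
native = unquote_to_bytes("_Andr%e9.jpg")
generated = unquote_to_bytes("_Andr%c3%a9.jpg")

print(native == stored_name)     # the native URI matches the file on disk
print(generated == stored_name)  # the UTF-8 URI asks for a different name
```

The first comparison holds and the second does not, which is the "unreachable attachment" symptom described in this task.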
--
GeorgeClark - 10 Sep 2015
See also Item13697.
--
GeorgeClark - 11 Sep 2015
OK, I removed all my stupid comments. Sorry for bothering.
--
JozefMojzis - 11 Sep 2015
On a 2.0 system with iso-8859-1 encoding, an IMG link is generated:
<img src="/pub/Sandbox/TestUtf8Attach/_Andr%c3%a9.jpg" alt="_André.jpg" width='113' height='85' />
On a 1.1.9 system, same encoding, same file, same topic, the link in the html source is:
<img src="/pub/Litterbox/TestUTF8Attach/_Andr%e9.jpg" alt="_André.jpg" width='113' height='85' />
--
GeorgeClark - 11 Sep 2015
Attached a plugin module which appears to fix the URLs up on a 2.0 system with iso-8859-1 store.
--
GeorgeClark - 11 Sep 2015
Crawford, I've checked this in as part of Foswiki.pm, rather than using a plugin completePageHandler. Please take a moment to review it if you could. I think I've covered the tag types that will refer to pub URL locations. Unit tests are all still passing.
--
GeorgeClark - 12 Sep 2015
I think that this still needs some work. It would be better to reverse back to the store filename, and determine if it exists with either encoding. Only replace the path if one of them is actually reachable. Not sure how easy it is to reliably recover the filename from the url.
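The check described above could look roughly like the sketch below. It is written in Python purely for illustration and is not the code that was checked in; `fix_pub_path` and the `exists` callback are hypothetical names, and the store is simulated with a set of byte paths rather than a real filesystem.

```python
from urllib.parse import quote, unquote


def fix_pub_path(url_path, exists, store_encoding="iso-8859-1"):
    """Re-encode a UTF-8 percent-encoded pub path for the store, if reachable.

    `exists` is a hypothetical callback that takes the on-disk path as bytes.
    """
    try:
        # Recover the Unicode name from the UTF-8 link, then the bytes
        # the same name would have in the store encoding.
        name = unquote(url_path, encoding="utf-8", errors="strict")
        store_bytes = name.encode(store_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return url_path  # cannot be mapped reliably; leave the link alone
    utf8_bytes = name.encode("utf-8")

    if exists(utf8_bytes):
        return url_path  # the generated link already works
    if exists(store_bytes):
        return quote(store_bytes)  # percent-encode the store's bytes
    return url_path  # neither file exists; do not guess


# Simulated store holding one ISO-8859-1 filename (assumption for the demo).
on_disk = {b"/pub/Sandbox/TestUtf8Attach/_Andr\xe9.jpg"}
fixed = fix_pub_path(
    "/pub/Sandbox/TestUtf8Attach/_Andr%c3%a9.jpg", on_disk.__contains__
)
print(fixed)
```

Links that are not valid UTF-8, or whose target exists under neither encoding, are returned unchanged, so the rewrite only happens when it makes an attachment reachable.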
--
GeorgeClark - 12 Sep 2015
Implemented as PubLinkFixupPlugin. Shipped as a default extension.
--
GeorgeClark - 19 Sep 2015
Still an issue. On a new site running with a non-utf-8 store, we should not need to be "fixing" up URLs. Attachment pub URLs should be inserted with the correct encoding.
In Foswiki::Attach::getAttachmentLink(), we should use the Store encoding, and not UTF-8, when generating a link. Since Foswiki::encodeUrl is hardcoded to UTF-8, we either need to extend it to optionally request the Store encoding, or consider an encodePubUrl which uses the Store encoding.
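As a rough illustration of the second option, an encoder that takes the store encoding as a parameter might be shaped like this. The sketch is Python, not Foswiki's Perl, and `encode_pub_url` with its `store_encoding` parameter is a hypothetical stand-in for the proposed encodePubUrl, not an existing API.

```python
from urllib.parse import quote


def encode_pub_url(path, store_encoding="utf-8"):
    """Percent-encode a pub path using the store's byte encoding.

    Hypothetical helper: defaults to UTF-8 so existing callers keep their
    current behaviour; a non-UTF-8 store passes its own encoding.
    """
    # errors="strict" raises if the name cannot be represented in the
    # store encoding, rather than silently emitting a wrong link.
    return quote(path, safe="/", encoding=store_encoding, errors="strict")


latin1_link = encode_pub_url("/pub/Sandbox/TestUtf8Attach/_André.jpg", "iso-8859-1")
utf8_link = encode_pub_url("/pub/Sandbox/TestUtf8Attach/_André.jpg")
print(latin1_link)
print(utf8_link)
```

Keeping UTF-8 as the default means only the attachment-link code path would need to pass the configured encoding.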
--
GeorgeClark - 20 Sep 2015
Releasing this as is. My rationale is that any Store encoding other than utf-8 is probably transitional. Writing links in utf-8 avoids future fixup once conversion happens.
--
GeorgeClark - 29 Sep 2015