Bug: TWiki on Mac OS X server with I18N generates odd-looking file names
InternationalisationEnhancements, when tested on Mac OS X, generate odd-looking file names because of
UnicodeNormalisation issues in the HFS+ and UFS filesystems. TWiki itself works OK, but the filenames are awkward for administrators working on the command line.
It's also possible that attachments using some I18N characters, uploaded from Mac clients and downloaded by Windows/Unix clients, could cause problems - not tested.
(See
MozillaURLEncodingWithI18N for original bug report from
InternationalisationEnhancements - turned out to be mainly Mozilla UTF8 URL encoding issues.)
Test case
See comment by
StefanLindmark in
InternationalisationEnhancements. Browser has not been configured in any way, but there are no configuration notes for Mozilla. I'm attaching TWiki.cfg and testenv output.
Environment
| TWiki version: | Alpha20021202 |
| TWiki plugins: | - |
| Server OS: | Mac OS X 10.2.1 |
| Web server: | Apache 2.0.40 |
| Perl version: | 5.6.0 |
| Client OS: | Mac OS X 10.2.1 |
| Web Browser: | Mozilla 1.2 |
--
StefanLindmark - 03 Dec 2002
Follow up
Fix record
(From emails)
MacOS is creating quite weird looking filenames, but TWiki is working fine, so I'm setting this to
BugResolved. If people using TWiki
I18N find the filenames annoying on
MacOS X, please open a new bug.
StefanLindmark is now testing on Perl 5.6.1 on Linux, which works fine.
--
RichardDonkin - 10 Dec 2002
I've done some more testing to shed some light on how file names are treated in OS X. What I did was:
- Created the topic AnAufHinterInNebenÜberUnterVorZwischen in TWiki running on Linux stored on reiserfs filesystem
- Created the topic AnAufHinterInNebenÜberUnterVorZwischen in TWiki running on OS X stored on HFS+ filesystem
- Created the folder AnAufHinterInNebenÜberUnterVorZwischen in Finder running on OS X stored on HFS+ filesystem
Then I listed each of them with ls, redirecting the output to a file (ls > filename), in the same environment in which it was created. The resulting output files have been attached to this topic. Hopefully they will be of use to the people putting their skills into further development of i18n.
--
StefanLindmark - 11 Dec 2002
One implication that needs to be investigated is portability. If I run TWiki with i18n enhancements on a server running OS X, what happens if I want to move the site to a box with a different OS/filesystem? Would it be possible to transfer the files straight over to the new environment or would there be a need to recode the filenames?
--
StefanLindmark - 14 Dec 2002
Only one way to find out, so I tried it by doing this:
- tar cvf an.tar AnAuf*
- scp an.tar mysite.net:upload
- ssh mysite.net
- cd upload
- tar xvf an.tar
- ls AnAuf*
- ls ../twiki/data/Sandbox/AnAuf*
The result can be seen below:
So I guess this is something to worry about if you want to be able to move files around between different systems, as your server platform may shift over time.
--
StefanLindmark - 17 Dec 2002
Interesting - however, I think the best longer-term solution is to find out why
MacOS X is UTF8-encoding filenames and see if it can be configured to avoid this, or to show the names to the user in ISO8859-1 (or perhaps to just support UTF8 filenames and topic names). Transforming UTF8 filenames into ISO8859-1 when moving server platforms would be another option.
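As a rough illustration of that last option, here is a minimal Python sketch of converting a single UTF-8 filename to ISO8859-1. The function name is made up for this example, and the NFC normalisation step is an assumption on my part: a decomposed name (base letter followed by a separate combining accent) cannot be encoded in ISO8859-1 at all, so it has to be composed first.

```python
import unicodedata


def utf8_name_to_latin1(raw_name: bytes) -> bytes:
    """Convert a UTF-8 encoded filename to ISO8859-1 (hypothetical helper).

    Raises UnicodeEncodeError if a character has no ISO8859-1 equivalent.
    """
    text = raw_name.decode("utf-8")
    # Assumption: compose "U" + combining diaeresis into a single "Ü" first.
    text = unicodedata.normalize("NFC", text)
    return text.encode("iso-8859-1")


# Precomposed and decomposed spellings of "AnÜber" give the same result.
print(utf8_name_to_latin1(b"An\xc3\x9cber"))   # b'An\xdcber'
print(utf8_name_to_latin1(b"AnU\xcc\x88ber"))  # b'An\xdcber'
```

An actual migration would loop over the data and pub directories and rename each file, which is not shown here.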
--
RichardDonkin - 19 Dec 2002
Apple technote #1150 documents the Unicode filename encoding of the HFS+ filesystem. With _trace enabled on the RCS operations, the
debug.txt
file shows that RCS commands are using 8-bit single character ISO-8859 encoding of filenames (i.e. "å" encoded as E5 hex). But files are still written with Unicode filenames. One idea could be to use
RcsLite and see if
ci
and friends in their Apple-distributed form are the cause of this.
Transforming filenames on server platform moves makes data portable, but I'm sceptical about having TWiki running on OS X generating a lot of files on the backend that are difficult to browse, back up, restore, etc. I haven't even started thinking about how useful the available backup tools will be when filenames turn up with mixed charsets and script styles (e.g. starting with western chars, and then reversing script direction to right-to-left and using non-western characters). I guess it would be more difficult to handle that situation than having stray "?" replacing 8-bit chars, still leading to recognizable filenames. So my ambition is still to try to find out how to move TWiki away from this Unicode stuff on OS X and make it behave as it does on other common Unix systems like Linux and Solaris.
--
StefanLindmark - 21 Dec 2002
I've been doing a lot more research into Unicode (see
InternationalisationUTF8) and it's a bit clearer what was happening here from reading the HFS+ doc's
http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties Unicode section - basically, HFS+ appears to prefer to work in Unicode 2.1, storing characters internally in 16-bit values, and also normalises all filename characters into a decomposed form (i.e. "å" is encoded as "a" followed by the accent as a separate Unicode character). This can be seen in the Finder generated attachment below, which presumably is correct.
The TWiki-generated attachment looks like UTF-8 encoding of the precomposed character (i.e. "å" as a single Unicode codepoint, encoded as two bytes in UTF-8).
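To make the byte-level difference concrete, here is a small Python sketch that prints the encodings of "å" discussed above. It uses the standard Unicode decomposed form (NFD) from Python's unicodedata module; HFS+ uses its own variant, which I am assuming behaves the same for this character. The helper name hex_bytes is only for this example.

```python
import unicodedata

precomposed = "\u00e5"  # "å" as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # "a" + U+030A COMBINING RING ABOVE


def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


print(hex_bytes(precomposed, "iso-8859-1"))  # e5        (what the RCS commands use)
print(hex_bytes(precomposed, "utf-8"))       # c3 a5     (precomposed, two bytes)
print(hex_bytes(decomposed, "utf-8"))        # 61 cc 8a  (decomposed, three bytes)
```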
UPDATE: HFS+ actually uses an Apple-modified version of Unicode's Normalisation Form D (NFD, i.e. decomposed), whereas Unix/Linux and
http://www.w3c.org W3C standards use Normalisation Form C (NFC, i.e. precomposed).
MacOS X 10.2 seems to have recognised this issue and at least provides an API to normalise into NFC, but in any case TWiki would need to normalise filenames read out from the filesystem into NFC - without this, it appears that the conversion back to ISO-8859-1 doesn't work. This is really a
MacOS X implementation issue but can be worked around. Possible solutions include:
- TWiki code to do the normalisation to NFC, configurable through something like $normaliseToUnicodeNFC in TWiki.cfg and enabled on HFS+ filesystems but not on the UFS (Unix style) filesystem. There are some Apple developer docs that describe this in more detail. This is the main option; it enables non-NFD-capable browsers (e.g. Konqueror 3.1.1) to work with MacOS X and I18N.
- Try using a UTF-8 or other locale setting when administering TWiki files so that the conversion from Unicode NFD format to ISO-8859-1 is avoided or works properly. RCS may not work well with Unicode NFD format, though this should be largely transparent to RCS. This will also be necessary, since the first option doesn't change the use of NFD for filenames.
- Research/test using Perl 5.8.x in case this has addressed this issue. Not covered by Perl 5.8, may be covered by Perl 6.
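The first option could look roughly like the sketch below. It is written in Python purely for illustration (TWiki itself is Perl), and the function names and the normalise_to_nfc parameter are hypothetical stand-ins for a setting such as $normaliseToUnicodeNFC; it also uses standard NFC rather than any Apple-specific API.

```python
import os
import unicodedata


def normalise_names(names, normalise_to_nfc=True):
    """Return names in NFC when the (hypothetical) switch is on, unchanged otherwise."""
    if not normalise_to_nfc:
        return list(names)
    return [unicodedata.normalize("NFC", name) for name in names]


def list_topic_files(directory, normalise_to_nfc=True):
    """List a data directory, applying the switch above to every name read."""
    return sorted(normalise_names(os.listdir(directory), normalise_to_nfc))


# A name as an NFD filesystem would return it: "U" followed by a combining diaeresis.
from_disk = ["AnAufHinterInNebenU\u0308berUnterVorZwischen.txt"]
print(normalise_names(from_disk) == ["AnAufHinterInNeben\u00dcberUnterVorZwischen.txt"])  # True
```

With the switch off the names pass through untouched, which is the behaviour you would want on filesystems that do not decompose filenames.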
Some useful links on Apple's NFD-based normalisation in HFS+:
On testing the Finder-generated file below, using IE5.5 in UTF-8 encoding mode, it was displayed correctly - so IE at least is able to display UTF-8 NFD filenames.
The TWiki-generated file has been corrupted somehow, since the capital Ü was transformed into 0xDBA2, which is an Asian character.
--
RichardDonkin - 11 Sep 2003
I now have a plan for how to solve this issue as part of
ProposedUTF8SupportForI18N.
If you do need to convert a whole set of filenames from one character encoding to another, have a look at Bjoern Jacke's convmv (man page: http://j3e.de/linux/convmv/man/ , download: http://j3e.de/linux/convmv/ ) - suggested by the author in email.
--
RichardDonkin - 14 Oct 2003
It seems that UFS filesystems have the
http://lists.apple.com/archives/unix-porting/2002/Mar/msg00147.html same NFD behaviour on Darwin (the
FreeBSD-based Unix underlying
MacOS), so it's not just HFS+.
There's a related issue mentioned in
http://lists.w3.org/Archives/Public/www-international/2003OctDec/0079.html this W3C list thread - if a
MacOS X user attaches a file with a Unicode NFD filename to a TWiki page, by default TWiki would store the filename in UTF-8 without changing the normalisation. This would then mean that users on some other platforms (e.g. Konqueror on Linux) would probably have the NFD filename rendered incorrectly even if the server is not
MacOS based!
Also, when TWiki is in UTF-8 mode,
MacOS X's builtin conversion of Unicode NFD to ISO-8859-1 etc does not apply - the unconverted Unicode NFD characters from the filesystem will remain in NFD mode, resulting in a similar problem.
So it seems that normalisation will be important if there are any
MacOS clients or servers involved in a TWiki deployment, and hence for all public TWiki sites.
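A quick way to see the attachment problem, sketched in Python: the NFD name sent by a Mac client and the NFC name that a link or another client would use look identical on screen but do not compare equal until one of them is normalised. The function name and the example filename are hypothetical.

```python
import unicodedata


def attachment_store_name(uploaded_name: str) -> str:
    """Normalise an uploaded filename to NFC before storing it (hypothetical helper)."""
    return unicodedata.normalize("NFC", uploaded_name)


mac_upload = "U\u0308bersicht.txt"  # NFD: "U" + combining diaeresis
link_target = "\u00dcbersicht.txt"  # NFC: precomposed "Ü"

print(mac_upload == link_target)                         # False
print(attachment_store_name(mac_upload) == link_target)  # True
```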
Are there any Mac users out there who could test this?
--
RichardDonkin - 14 Feb 2004
Back in 2004, Mozilla suite and Thunderbird fixed the problem of
MacOS exposing Unicode NFD normalisation of filenames to the outside world (caused a problem with
MacOS clients attaching files) - solution was to convert data from
MacOS clients from NFD into NFC (which is what the rest of the world uses), see
MozillaBug:227547.
--
RichardDonkin - 01 Oct 2006
http://www.nntp.perl.org/group/perl.macosx/2005/04/msg8847.html Interesting thread about I18n Filenames and CGI upload - may cause some problems on
MacOS X at some point, due to use of NFD normalisation by HFS+. See also Bugs:Item3652 re other attachment issues.
--
RichardDonkin - 18 Mar 2007