SEARCHes of type word do not work if the word is non-English and the wiki is set up for UTF8
This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmussen er ikke så indlysende en kandidat til posten som for blot et par uger siden, skriver Morgenavisen Jyllands-Posten.
followed by these searches
First Regex which works
Item5529
SEARCHes of type word do not work if word is non English and the wiki is setup for UTF8 This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmuss...
Number of topics: 1
Then query
Item5529
SEARCHes of type word do not work if word is non English and the wiki is setup for UTF8 This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmuss...
Number of topics: 1
And finally word
Item5529
… kommende statsminister Lars Løkke Rasmussen er ikke så …
Number of topics: 1
Note that here on Bugs we do not run UTF8. You have to copy the examples to a UTF8 Foswiki.
It also seems that the query does not really work at all with text ~ here on Bugs, which runs iso-8859. It does work on my T42 with utf-8.
--
KennethLavrsen - 13 Apr 2008
The problem is still there, also after SVN16656.
Regex works. But both word and query searches do not work if the word you search for contains non-English characters and Foswiki runs UTF8.
--
KennethLavrsen - 13 Apr 2008
Would this be a simple (workaround) fix? Scan for punctuation and whitespace instead of Perl word boundaries.
--
PeterThoeny - 12 May 2008
I do not understand the idea behind this workaround.
We are talking about searching for plain simple words.
If you cannot search for plain words in languages that do not use only A-Z (that is the majority of this world) then the search is in practice totally worthless. This needs to be fixed if people are to be able to use UTF8.
--
KennethLavrsen - 13 May 2008
Search also does not work when the searched word contains a single quote, like Foswiki's.
--
ArthurClemens - 24 May 2008
Query using text ~ "something" does not work with English words either, nor in ISO-8859-1. It seems Query is simply broken now.
--
KennethLavrsen - 26 May 2008
After having fixed 5529 (Sven still needs to check in the fix on SVN) I have been able to debug this one further and I know the exact root cause.
The problem is in lib\Foswiki\Store\SearchAlgorithms\Forking.pm.
It only occurs in a search where we are looking for word boundaries, but it is not the \b that is the problem.
These are the code lines:
if ($options->{wordboundaries} ) {
    $searchString = '\b'.quotemeta( $searchString ).'\b';
}
and the problem is the quotemeta( $searchString ) call, which screws up the string when it contains unicode characters.
Crawford, you added this code originally. What is the quotemeta supposed to do? We obviously need to do a similar operation in a different way, but before I just remove the function I need to understand what it is doing and what to watch out for.
--
KennethLavrsen - 30 May 2008
I'm really surprised to hear that quotemeta fails with UTF-8 encoding.
quotemeta is a standard Perl function used to escape regular expression meta-characters in the search string. However, when you read the doc in detail, you can see that it is absolute shit. I quote:
all characters not matching "/[A-Za-z_0-9]/" will be preceded by a backslash in the returned string, regardless of any locale settings.
Note the "regardless of any locale settings" bit, which ensures it won't work for any multibyte character encoding.
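A minimal sketch of the quotemeta behaviour described above. It assumes the search string arrives as raw UTF-8 bytes (no Perl UTF8 flag) and that neither "use locale" nor the unicode_strings feature is in effect; the variable names are illustrative only and the sample word comes from the Danish test text.

```perl
use strict;
use warnings;
use Encode ();

# "Løkke" as raw UTF-8 bytes: the letter ø is the two bytes 0xC3 0xB8.
my $bytes = "L\xC3\xB8kke";

# On a byte string, quotemeta backslashes every byte outside [A-Za-z_0-9],
# so a backslash lands in front of each half of the multibyte character.
my $quoted_bytes = quotemeta($bytes);

# After decoding to a Perl character string, ø is a single word character
# and quotemeta leaves it alone.
my $chars        = Encode::decode('UTF-8', $bytes);
my $quoted_chars = quotemeta($chars);

printf "bytes: %d -> %d\n", length($bytes), length($quoted_bytes);
printf "chars: %d -> %d\n", length($chars), length($quoted_chars);
```

The byte string grows by two backslashes while the decoded string is unchanged. Whether the extra backslashes are what breaks the forked search is an inference from this thread, not something the sketch proves.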
The simplest solution I can think of is to replace quotemeta with a method that actually recognises valid grep meta-characters:
if ($options->{wordboundaries} ) {
    $searchString =~ s#([][|/\\$^*()+{};@?.{}])#\\$1#g; # Can't use quotemeta because $searchString may be UTF8 encoded
    $searchString = '\b'.$searchString.'\b';
}
If the above code doesn't work, try converting the string to unicode first, by adding
$searchString = Encode::decode($Foswiki::cfg{Site}{CharSet}, $searchString) if $Foswiki::cfg{Site}{CharSet};
as the first line in the condition block. If this causes a "Wide character in print" error, then add
$searchString = Encode::encode($Foswiki::cfg{Site}{CharSet}, $searchString) if $Foswiki::cfg{Site}{CharSet};
as the last line in the condition block.
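The decode/encode steps suggested above can be sketched outside Foswiki as follows. This is only an illustration: $charset is a hard-coded stand-in for $Foswiki::cfg{Site}{CharSet}, and the escaping step is left out to keep the round trip visible.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for $Foswiki::cfg{Site}{CharSet}; hard-coded here so the
# sketch runs without a Foswiki installation.
my $charset = 'utf-8';

# "måske" as it would arrive on a UTF-8 site: raw bytes, å = 0xC3 0xA5.
my $searchString = "m\xC3\xA5ske";

# First line of the condition block: site-charset bytes -> Perl characters.
my $decoded = Encode::decode($charset, $searchString);

# Escaping would happen here; this sketch only adds the word boundaries.
my $pattern = '\b' . $decoded . '\b';

# Last line of the condition block: Perl characters -> site-charset bytes,
# so that whatever receives the pattern gets bytes again.
my $encoded = Encode::encode($charset, $pattern);

printf "decoded length: %d, encoded length: %d\n",
    length($decoded), length($encoded);
```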
Note that all uses of quotemeta in the code that operate on data that is potentially UTF8-encoded will be similarly affected. I think this problem would "just go away" if Foswiki used unicode strings internally - this is a problem specific to multibyte encodings such as UTF8.
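For illustration, the idea of escaping only real meta-characters can be written without a regex substitution at all. The helpers escape_meta and word_pattern below are hypothetical and not part of Foswiki, and the list of meta-characters is an assumed set that has not been checked against the grep actually used by Forking.pm.

```perl
use strict;
use warnings;

# Assumed set of ASCII meta-characters to escape (hypothetical choice).
my %IS_META = map { $_ => 1 } split //, '\\[]|/$^*()+{}?.';

# Hypothetical helper: backslash-escape only the characters listed above,
# leaving every other byte or character untouched.
sub escape_meta {
    my ($string) = @_;
    return join '', map { $IS_META{$_} ? "\\$_" : $_ } split //, $string;
}

# Hypothetical helper: build the word-boundary pattern for a search term.
sub word_pattern {
    my ($string) = @_;
    return '\b' . escape_meta($string) . '\b';
}

print word_pattern("L\xC3\xB8kke"), "\n";    # multibyte term, bytes untouched
print word_pattern('3.50'), "\n";            # the dot is escaped
print word_pattern("Foswiki's"), "\n";       # the single quote is left alone
```

Because the lookup works one character at a time and only touches ASCII, it behaves the same on raw UTF-8 bytes, iso-8859-1 bytes and decoded Perl strings.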
--
CrawfordCurrie - 30 May 2008
Working on this.
Tried the first solution and it works.
Tried the 2nd solution with Encode::decode. It also works. I did not need to use the Encode::encode.
I only tried with my test topic and only in utf-8.
I will try various other searches and combinations before I check in a fix.
For the moment I am mostly keen on the 2nd solution because it seems less of a hack.
I cannot help thinking that the
$searchString = Encode::decode($Foswiki::cfg{Site}{CharSet}, $searchString) if $Foswiki::cfg{Site}{CharSet};
operation should happen a lot earlier in the code to prevent other bugs that we have not seen reveal themselves yet.
Something for me to investigate a little further this weekend.
Thanks for following up on my questions Crawford.
--
KennethLavrsen - 31 May 2008
Note that you will have to test with at least one multibyte encoding (e.g. UTF-8) with a multibyte search string, at least one high bit encoding such as iso-8859-1 checking high-bit characters, normal 7-bit ascii, and you should also really test all legal meta-characters in regex searches.
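The test matrix described above could start out as a small table-driven check like the one below. It is a sketch under several assumptions: Perl's own regex engine stands in for the forked grep, escape_meta is a hypothetical escaping helper, and the sample texts are made up around the Danish example. A real test would have to go through Foswiki's search on sites configured for each encoding.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the escaping step; only ASCII meta-characters
# are backslashed.
sub escape_meta {
    my ($string) = @_;
    my %is_meta = map { $_ => 1 } split //, '\\[]|/$^*()+{}?.';
    return join '', map { $is_meta{$_} ? "\\$_" : $_ } split //, $string;
}

# [ description, topic text, search word, should the word be found? ]
my @cases = (
    [ 'UTF-8 multibyte',     "Lars L\xC3\xB8kke Rasmussen", "L\xC3\xB8kke", 1 ],
    [ 'iso-8859-1 high-bit', "Lars L\xF8kke Rasmussen",     "L\xF8kke",     1 ],
    [ '7-bit ASCII',         'Lars Rasmussen',              'Lars',         1 ],
    [ 'meta-character',      'the price is 3.50 today',     '3.50',         1 ],
    [ 'meta not a wildcard', 'the price is 3x50 today',     '3.50',         0 ],
);

my @failures;
for my $case (@cases) {
    my ( $name, $text, $word, $expected ) = @$case;
    my $pattern = '\b' . escape_meta($word) . '\b';
    my $found   = ( $text =~ /$pattern/ ) ? 1 : 0;
    push @failures, $name if $found != $expected;
}
print @failures ? "FAILED: @failures\n" : "all cases pass\n";
```

The sample words start and end with ASCII letters on purpose. How \b treats a word that begins or ends with a non-ASCII byte depends on the matching engine and locale, and is not covered by this sketch.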
--
CrawfordCurrie - 31 May 2008
I tried to Encode the $searchString much earlier. It seems to have a negative effect on the non-word types of search, resulting in search results containing garbage. So it is a bit of a can of worms.
I continue learning all I can, but we probably have to settle for the fix that targets this particular problem for 4.2.1.
--
KennethLavrsen - 31 May 2008
I decided to go for the solution that does not use Encode. When we later change Foswiki to use utf-8 throughout, additional hidden Encode conversions could do harm, so the regex substitution is a better short-term solution that will still work when we go utf-8.
--
KennethLavrsen - 02 Jun 2008
Imported from TWikibugs:Item5529 by Babar because the regex is faulty. Created Item8657 to fix this fix.
--
OlivierRaginel - 03 Mar 2010