Strategy for Regular Expression Support
Foswiki strives to support the rich Perl regular expression syntax for end users, for example in searching. However, because Foswiki has to interface with third-party tools and libraries, it is not always possible to support all the features of Perl regular expressions in all places.
Any developer who implements an interface to such a third-party tool must make every effort to map all the functionality of Perl regular expressions to the tool. The following table lists the features of Perl regular expressions that are understood to be supported by a number of common third-party tools. The features are chosen from those described in http://www.regular-expressions.info/refflavors.html.
The "Required by Foswiki" column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Loss of one or more of the features in this column will affect Foswiki (or one or more important extensions) functionality in some way. This won't necessarily make Foswiki unusable, but should be borne in mind. Note also that Foswiki internal regexes may use meta-syntax that might need to be escaped/modified for different regex flavours, e.g. ( and \(.
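The ( versus \( difference mentioned above can be illustrated with a small sketch. It assumes the input is written in ERE-style syntax (bare metacharacters, as Perl uses) and the target is GNU BRE, where the grouping, interval and alternation operators are the backslashed forms. The function name `ere_to_gnu_bre` is hypothetical and not part of Foswiki, the treatment of } as \} is an assumption, and only escaping is handled; features that BRE lacks altogether are not translated.

```python
# Characters whose "special" and "literal" spellings are swapped between
# ERE (special when bare) and GNU BRE (special when backslashed).
# Assumption: this set follows the GNU grep manual's description of BRE.
SWAPPED = set("(){}|+?")


def ere_to_gnu_bre(pattern: str) -> str:
    """Rewrite an ERE-style pattern using GNU BRE escaping conventions."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            # An escaped metacharacter in ERE is a bare literal in BRE;
            # any other escape is passed through untouched.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch == "[":
            # Copy a bracket expression verbatim (POSIX rules: no escapes inside).
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1  # a leading ']' is a literal member of the set
            while j < n and pattern[j] != "]":
                if pattern.startswith("[:", j):
                    k = pattern.find(":]", j + 2)
                    j = k + 2 if k != -1 else j + 1
                else:
                    j += 1
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch in SWAPPED:
            # A bare ERE metacharacter needs a backslash in GNU BRE.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `ere_to_gnu_bre(r"a\(b")` gives `a(b`, because the escaped literal parenthesis in ERE is an unescaped literal in BRE.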
Developers working with regular expressions must take great care, when exposing features of non-Perl regular expressions to end users, that they don't use features which are sparsely supported.
In the event that an external tool supports regular expression syntax that is not compatible with Perl, the calling code must defuse the regex feature that is not Perl-compatible. This may result in some loss of functionality, but is necessary to avoid confusing users.
† The PCRE library can be compiled with Unicode support, but this is not always done. Check.
‡ MySQL/MariaDB do have options to incorporate PCRE rather than POSIX ERE. Indeed, MariaDB 10 (when finally GA) will include PCRE with Unicode as standard.
As can be seen, there is great variability in regular expression support. This is especially true of SQL interfaces to databases, where the ANSI standard for pattern matching is so pathetic that most databases support some extension. Even where standards (such as POSIX) have been implemented, they are at times arbitrarily constrained or extended. The following table provides a guideline as to what is supported in SQL by a number of common database implementations.
TODO: add Tcl ARE to the top table
Foswiki internally uses \Q...\E to disable metacharacters within regular expressions. A quick search of a Foswiki install finds it in a number of places. Is the above list only meant to list features that are pushed down into regexes used by the Store and Search engines?
Foswiki/Configure/Checker.pm
Foswiki/Plugins/SmiliesPlugin.pm
Foswiki/Plugins/EditTablePlugin/Core.pm
Foswiki/Plugins/WysiwygPlugin/TML2HTML.pm
Foswiki/Plugins/WysiwygPlugin/HTML2TML/Node.pm
Foswiki/Plugins/TwistyPlugin.pm
Foswiki/Prefs.pm
Foswiki/Contrib/BuildContrib/Targets/manifest.pm
Foswiki/Macros/WEBLIST.pm
Foswiki/Macros/TOPICLIST.pm
Foswiki/Macros/LANGUAGES.pm
Unit/TestRunner.pm
CPAN/lib/Text/Patch.pm
CPAN/lib/Crypt/PasswdMD5.pm
CPAN/lib/Locale/Maketext/Extract/Plugin/Base.pm
Foswiki.pm
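
If a regex containing \Q...\E ever had to be handed to an engine that does not understand that construct, one option would be to expand it into backslash-escaped literals first. The following is a minimal sketch of that idea, not Foswiki code: the function name `expand_quoted` and the `META` character set are assumptions, and the set would need adjusting for the target flavour.

```python
# Characters escaped inside a \Q...\E span. Assumption: this set covers the
# metacharacters shared by Perl, PCRE and POSIX ERE outside bracket expressions.
META = set("\\.^$*+?()[]{}|")


def expand_quoted(pattern: str) -> str:
    """Replace each \\Q...\\E span with individually escaped literal characters."""
    out = []
    i, n = 0, len(pattern)
    quoting = False
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False  # end of the quoted span
                i += 2
            else:
                ch = pattern[i]
                out.append("\\" + ch if ch in META else ch)
                i += 1
        elif two == "\\Q":
            quoting = True  # start of a quoted span
            i += 2
        elif pattern[i] == "\\" and i + 1 < n:
            # Any other escape (including an escaped backslash) is copied as is,
            # so that "\\Q" is not mistaken for the start of a quoted span.
            out.append(two)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

An unterminated \Q is treated as quoting to the end of the pattern, which matches how Perl behaves.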
-- GeorgeClark - 24 Dec 2013
I can only imagine that this is about Store and Search engines, but CrawfordCurrie was the original author so he'll need to confirm.
To summarise the usual SQL suspects we have:
- MySQL: POSIX ERE, PCRE via library
- MariaDB: POSIX ERE, PCRE via library (standard from version 10)
- Oracle: POSIX ERE
- PostgreSQL: Tcl ARE
The following databases support regex only after installing an extra library.
I would suggest the possibility of pruning some of the columns above. It appears to me we only want columns for actual known targets.
I suppose we may end up with similar issues with any NoSQL type stores, but the active development is currently around SQL stores.
I've amended the table to highlight a few Foswiki required rows. I find the marker for Foswiki required gets lost when scrolling up and down. Do you agree this approach highlights these rows better? In which case I'll complete the job and remove the Foswiki required column. Alternatively, I could add an explicit marker when Foswiki support is not required and another flag when we're not quite sure ;).
It seems to me that there are other rows that should be marked Foswiki required, e.g. '.', but maybe CrawfordCurrie had good reason not to include it.
-- JulianLevens - 25 Dec 2013
Correct; as described in the intro, the intention is to document constraints imposed on regexes by external third-party tools. The "Foswiki required" column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Since '.' is not used in composing this regex, Foswiki can continue to operate without it.
I'm not a fan of the highlighted rows; it puts undue focus on the Foswiki requires column. What we should do, though, is identify a subset of PCRE that is (1) well supported across the range of search engines and (2) adequate for end-user searching. At the moment the doc is weasely about what regex features are available (RegularExpression) to an end user; it really ought to say something definitive like "POSIX ERE".
JulianLevens can you please update the "Databases" table with what you know about the different SQL implementations? Thanks.
Really we need a Task to reduce the number of regex features that core depends on. For example, [0-9] is more widely supported than \d.
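As an illustration of what such a reduction could look like, the sketch below rewrites the Perl shorthand classes \d, \w and \s into explicit bracket expressions. It is only a sketch under stated assumptions: the function name `portable_classes` is hypothetical, the replacements are ASCII-only (Perl's own classes can match more under Unicode rules), the negated forms \D, \W and \S are left untouched, and a literal ] placed first in a bracket expression is not handled.

```python
# Replacements used outside a bracket expression and inside one.
# Assumption: ASCII-only equivalents are acceptable for the target engine.
OUTSIDE = {"d": "[0-9]", "w": "[A-Za-z0-9_]", "s": "[[:space:]]"}
INSIDE = {"d": "0-9", "w": "A-Za-z0-9_", "s": "[:space:]"}


def portable_classes(pattern: str) -> str:
    """Rewrite \\d, \\w and \\s in a Perl-style pattern as explicit classes."""
    out = []
    i, n = 0, len(pattern)
    in_bracket = False
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            table = INSIDE if in_bracket else OUTSIDE
            # Unknown escapes (e.g. \. or \b) are copied unchanged.
            out.append(table.get(nxt, ch + nxt))
            i += 2
            continue
        if in_bracket and pattern.startswith("[:", i):
            # Copy a POSIX class such as [:alpha:] without ending the bracket.
            k = pattern.find(":]", i + 2)
            if k != -1:
                out.append(pattern[i:k + 2])
                i = k + 2
                continue
        if ch == "[" and not in_bracket:
            in_bracket = True
        elif ch == "]" and in_bracket:
            in_bracket = False
        out.append(ch)
        i += 1
    return "".join(out)
```

Inside an existing bracket expression the shorthand is expanded without adding a second pair of brackets, so `[\w.]` becomes `[A-Za-z0-9_.]`.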
I just noticed that the column for GNU BRE is very wrong. However I have to go and cook now.
-- CrawfordCurrie - 25 Dec 2013
That could be me. I fixed up the table formatting by replacing | with %VBAR%. I have since tried to make the POSIX ERE, POSIX BRE, GNU ERE and GNU BRE columns consistent with http://www.regular-expressions.info, but that site appears to be inconsistent with http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html, which does not specify the use of \} with GNU BRE. I also saw that POSIX ERE and BRE both support [[:<:]] instead of \b.
-- MichaelTempest - 25 Dec 2013
I've updated the database table.
Crawford, I was indeed confused about the meaning of the 'Foswiki required' column in that I thought it related to a minimum requirement for user search. As you say, we need to actually define what a Foswiki user can expect to work.
BTW: I've found that the standard backlinks regex used in a regexp SQL clause e.g.
select * from metaText where value regexp 'DefaultPreferences([^A-Za-z0-9]|$)|Default([^A-Za-z0-9]*)Preferences([^A-Za-z0-9]|$)|System.DefaultPreferences([^A-Za-z0-9]|$)' \G
took 43s to complete; because of MySQL caching data, subsequent calls are a rapid 0s. However, the 3 separate parts of the alternation are much quicker (< 1s each). So why is so much time required? They are all table scans, so it shouldn't be an I/O bottleneck, which suggests that this regex performs surprisingly poorly under POSIX ERE. This will need more investigation; note that a simple search for '.' with the original Perl regex as a post-filter is much faster.
MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default|Preferences'\G -- AutoDocket([^A-Za-z0-9]|$)|Auto([^A-Za-z0-9]*)Docket([^A-Za-z0-9]|$)|System.AutoDocket([^A-Za-z0-9]|$)' \G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (9.13 sec)
MariaDB [foswiki]> select count(fobid) from metaText where value regexp '.'\G
*************************** 1. row ***************************
count(fobid): 418800
1 row in set (0.53 sec)
MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default' or value regexp 'Preferences'\G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (0.76 sec)
I suspect it's related to this: http://www.regular-expressions.info/posix.html (see the "POSIX ERE Alternation Returns The Longest Match" section).
Note that these timings are not the complete process with Perl and DBI, where all the data is transferred to Perl for processing. I.e. reading 418800 strings into Perl for further filtering may still be the overall worst performer, but that's not how it feels.
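The comparison above (one regexp with alternation versus separate regexp clauses joined with OR) could be automated by splitting a pattern at its top-level | characters. The sketch below shows one way to do that, assuming POSIX bracket-expression rules and a database driver that uses %s placeholders; `split_alternation` and `or_clause` are hypothetical names. The timings above were only measured for the simple 'Default|Preferences' case, so whether this helps for the full backlinks regex is untested.

```python
def split_alternation(pattern: str) -> list:
    """Split a regex at '|' characters that are not nested in (...) or [...]."""
    parts, current = [], []
    depth = 0
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            current.append(pattern[i:i + 2])  # keep escapes intact
            i += 2
            continue
        if ch == "[":
            # Copy the whole bracket expression; '|' inside it is a literal.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1
            while j < n and pattern[j] != "]":
                if pattern.startswith("[:", j):
                    k = pattern.find(":]", j + 2)
                    j = k + 2 if k != -1 else j + 1
                else:
                    j += 1
            current.append(pattern[i:j + 1])
            i = j + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            parts.append("".join(current))  # end of one top-level branch
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def or_clause(column: str, pattern: str):
    """Build '(col REGEXP %s OR col REGEXP %s ...)' plus its bind values."""
    branches = split_alternation(pattern)
    sql = " OR ".join(f"{column} REGEXP %s" for _ in branches)
    return f"({sql})", branches
```

Applied to the backlinks regex quoted earlier, `split_alternation` yields three branches, each of which keeps its own nested ([^A-Za-z0-9]|$) group intact.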
Why is it rarely simple?
-- JulianLevens - 25 Dec 2013