RegularExpressions (21 Feb 2014, CrawfordCurrie)

Strategy for Regular Expression Support

Foswiki strives to support the rich Perl regular expression syntax for end users, for example in searching. However, because Foswiki has to interface with third party tools and libraries, it is not always possible to support all the features of Perl regular expressions in all places.

Any developer who implements an interface to such a third-party tool must make every effort to map all the functionality of Perl regular expressions to the tool. The following table lists the features of Perl regular expressions that are understood to be supported by a number of common third-party tools. The features are chosen from those described in http://www.regular-expressions.info/refflavors.html.

The Required by Foswiki column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Loss of one or more of the features in this column will affect the functionality of Foswiki (or of one or more important extensions) in some way. This won't necessarily make Foswiki unusable, but should be borne in mind. Note also that Foswiki internal regexes may use meta-syntax that needs to be escaped or modified for different regex flavours, e.g. a group is opened with ( in some flavours and with \( in others.
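As an illustration of the escaping difference mentioned above, the following Python sketch rewrites a simple POSIX ERE pattern into GNU BRE syntax by swapping the escaped and unescaped forms of ( ) { } | ? and +, following the GNU BRE column of the table below. The names ere_to_gnu_bre and _bracket_end are hypothetical and not part of Foswiki. This is a minimal sketch that ignores corner cases such as a leading * or anchors in the middle of a pattern.

```python
# Characters whose escaped/unescaped meaning is swapped between ERE and GNU BRE.
SWAPPED = set("(){}|?+")


def _bracket_end(s, start):
    """Return the index just past the ']' closing the bracket expression at s[start]."""
    i = start + 1
    if i < len(s) and s[i] == "^":
        i += 1
    if i < len(s) and s[i] == "]":  # a leading ']' is a literal
        i += 1
    while i < len(s):
        if s.startswith("[:", i):  # skip over a class such as [:alpha:]
            close = s.find(":]", i + 2)
            i = close + 2 if close != -1 else i + 1
        elif s[i] == "]":
            return i + 1
        else:
            i += 1
    return len(s)  # unterminated bracket: copy the rest unchanged


def ere_to_gnu_bre(ere):
    """Rewrite an ERE pattern using GNU BRE escaping (hypothetical helper)."""
    out = []
    i, n = 0, len(ere)
    while i < n:
        ch = ere[i]
        if ch == "[":
            # Bracket expressions are copied verbatim; metacharacters are literal inside.
            j = _bracket_end(ere, i)
            out.append(ere[i:j])
            i = j
        elif ch == "\\" and i + 1 < n:
            nxt = ere[i + 1]
            # An escaped literal in ERE becomes a bare literal in BRE.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED:
            # A bare operator in ERE needs a backslash in GNU BRE.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, ere_to_gnu_bre("a(b|c)+") returns the GNU BRE form with every operator backslash-escaped. POSIX BRE has no alternation, ? or +, so this mapping only applies to the GNU flavour.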

Developers working with regular expressions must take great care, when exposing features of non-Perl regular expressions to end users, that they don't use features which are sparsely supported.

Perl Regex Feature	PCRE †	Java	XPath	GNU ERE	XML	POSIX ERE ‡	GNU BRE	POSIX BRE	.NET
Single-character-generating escapes									
Backslash escapes one metacharacter									
\d shorthand for digits	ascii	ascii							
\w shorthand for word characters	ascii	ascii							
\s shorthand for whitespace	ascii	ascii	ascii		ascii				
\D, \W and \S shorthand negated character classes									
\x00 through \xFF (ASCII character)									
\n (LF), \r (CR) and \t (tab)									
. (dot; any character except line break)									
\Q...\E escapes a string of metacharacters									
\f (form feed) and \v (vtab)									
\a (bell) and \e (escape)									
\cA through \cZ (control character)									
\ca through \cz (control character)									
Character classes features									
[abc] character class									
[^abc] negated character class									
[a-z] character class range									
Hyphen in [\d-z] is a literal									?
Backslash escapes one character class metacharacter									
\Q...\E escapes a string of character class metacharacters		Java 6							
[\b] backspace									
[:alpha:] POSIX character class	ascii								
\p{IsAlpha} POSIX character class									
Anchors									
\b (at the beginning or end of a word)	ascii			ascii			ascii		
\B (NOT at the beginning or end of a word)	ascii			ascii			ascii		
^ (start of string/line)									
$ (end of string/line)									
\A (start of string)									
\Z (end of string, before final line break)									
\z (end of string)									
Grouping, references and quantifiers									
(regex) (numbered capturing group)							`\(regex\)`	`\(regex\)`	
\1 through \9 (backreferences)									
| (alternation)							`\|`		
? (0 or 1)							`\?`		
* (0 or more)									
+ (1 or more)							`\+`		
{n} (exactly n)							`\{n\}`	`\{n\}`	
{n,m} (between n and m)							`\{n,m\}`	`\{n,m\}`	
{n,} (n or more)							`\{n,\}`	`\{n,\}`	
? after any of the above quantifiers to make it "lazy"									
(?:regex) (non-capturing group)									
\10 through \99 (backreferences)					n/a	n/a			
Forward references \1 through \9					n/a	n/a			
Nested references \1 through \9					n/a	n/a			?
Backreferences to non-existent groups are an error					n/a	n/a			?
Backreferences to failed groups also fail					n/a	n/a			?
(?>regex) (atomic group)									
(?=regex) (positive lookahead)									
(?!regex) (negative lookahead)									
(?<=text) (fixed length positive lookbehind)		finite length							
(?<!text) (fixed length negative lookbehind)		finite length							
\G (start of match attempt)									
(?(?=regex)then|else) (using any lookaround)									
(?(1)then|else)									
Flags, Spacing and Comments									
(?i) (case insensitive)			flag						
(?s) (dot matches newlines)			flag						
(?m) (^ and $ match at line breaks)			flag						
(?x) (free-spacing mode)			flag						
(?-ismxn) (turn off mode modifiers)									
(?ismxn:group) (mode modifiers local to group)									
(?#comment)									
Free-spacing syntax supported									
Character class is a single token				n/a	n/a	n/a	n/a	n/a	?
# starts a comment				n/a	n/a	n/a	n/a	n/a	
Unicode support									
\X (Unicode grapheme)									
\x{0} through \x{FFFF} (Unicode character)									
\pL through \pC (Unicode properties)									
\p{L} through \p{C} (Unicode properties)									
\p{Lu} through \p{Cn} (Unicode property)									
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties)									
\p{IsL} through \p{IsC} (Unicode properties)									
\p{IsLu} through \p{IsCn} (Unicode property)									
\p{Letter} through \p{Other} (Unicode properties)									
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property)									
\p{IsLetter} through \p{IsOther} (Unicode properties)									
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property)									
\p{Arabic} through \p{Yi} (Unicode script)									
\p{IsArabic} through \p{IsYi} (Unicode script)									
\p{BasicLatin} through \p{Specials} (Unicode block)									
\p{InBasicLatin} through \p{InSpecials} (Unicode block)									
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block)									
Part between {} in all of the above is case insensitive									
Spaces, hyphens and underscores allowed in all long names listed above (e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)		Java 5							
\P (negated variants of all \p as listed above)									
\p{^...} (negated variants of all \p{...} as listed above)									

In the event that an external tool supports regular expression syntax that is not compatible with Perl, the calling code must defuse the regex feature that is not Perl-compatible. This may result in some loss of functionality, but is necessary to avoid confusing users.
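One way to read "defuse" is sketched below in Python. GNU regex implementations give \< and \> (word start and end) and \` and \' (buffer start and end) a special meaning, whereas Perl treats a backslash before any of these characters as a plain literal. The sketch strips the backslash so the tool sees the literal that a Perl user would expect. The name defuse_gnu_escapes is hypothetical, and the sketch assumes these escapes do not occur inside bracket expressions.

```python
import re

# Escapes that GNU regex treats as anchors but Perl treats as plain literals.
GNU_ONLY_ESCAPES = set("<>`'")


def defuse_gnu_escapes(pattern):
    """Drop the backslash from escapes that only GNU regex interprets (hypothetical helper)."""

    def fix(match):
        char = match.group(1)
        # Keep every other escape (\d, \., \\ ...) exactly as written.
        return char if char in GNU_ONLY_ESCAPES else match.group(0)

    # Scanning escape by escape means an escaped backslash is consumed as a
    # unit, so the character after it is never mistaken for an escape.
    return re.sub(r"\\(.)", fix, pattern, flags=re.DOTALL)
```

With this, a user pattern containing \< is passed on as a literal < rather than silently becoming a word-start anchor.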

† The PCRE library can be compiled with Unicode support, but is not always. Check.

‡ MySQL/MariaDB do have options to incorporate PCRE rather than POSIX ERE. Indeed, MariaDB 10 (once it finally reaches GA) will include PCRE with Unicode support as standard.

As can be seen, there is great variability in regular expression support. This is especially true of SQL interfaces to databases, where the ANSI standard for pattern matching is so pathetic that most databases support some extension. Even where standards (such as POSIX) have been implemented, they are at times arbitrarily constrained or extended. The following table provides a guideline as to what is supported in SQL by a number of common database implementations.

Database	Native	With extensions
MySQL	ANSI	PCRE
MariaDB	ANSI	As this is a MySQL fork, the above library should also work here. See also MariaDB 10 PCRE
Oracle	POSIX ERE with variations	PCRE may be possible (see this reference). However, I have not found any documentation on how to install it or how to use it from within Oracle
PostgreSQL	ANSI, plus Tcl ARE, POSIX ERE or POSIX BRE, depending on the SQL function used.
SQLite	ANSI	PCRE is usually added as an extension
Microsoft SQL Server	ANSI	.NET, see http://www.codeproject.com/Articles/19502/A-T-SQL-Regular-Expression-Library-for-SQL-Server and http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
DB2 LUW	ANSI	PCRE via user-defined functions (UDFs): http://www.ibm.com/developerworks/data/library/techarticle/0301stolze/0301stolze.html
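To make the SQLite row concrete, here is a small Python sketch of how a REGEXP operator can be supplied as an extension. SQLite evaluates X REGEXP Y by calling a user-defined function regexp(Y, X). Python's re module stands in for a PCRE extension here, which is an assumption made purely for illustration; the table and column names are borrowed from the queries further down this page, and the sample rows are invented.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs the SQL REGEXP operator; SQLite passes the pattern first."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
# Register the function under the name SQLite looks up for the REGEXP operator.
conn.create_function("regexp", 2, regexp)

conn.execute("create table metaText (fobid integer, value text)")
conn.executemany(
    "insert into metaText values (?, ?)",
    [
        (1, "See DefaultPreferences."),
        (2, "Default Preferences"),
        (3, "unrelated"),
    ],
)

rows = conn.execute(
    "select fobid from metaText where value regexp ? order by fobid",
    (r"DefaultPreferences([^A-Za-z0-9]|$)",),
).fetchall()
```

Whatever engine backs the function determines the regex flavour the SQL query actually gets, which is why the extension column matters as much as the native one.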

TODO: add Tcl ARE to the top table

Foswiki internally uses \Q...\E to disable metacharacters within regular expressions. A quick search of a Foswiki install finds it in a number of places. Is the above list only meant to list features that are pushed down into regexes used by the Store and Search engines?

Foswiki/Configure/Checker.pm
Foswiki/Plugins/SmiliesPlugin.pm
Foswiki/Plugins/EditTablePlugin/Core.pm
Foswiki/Plugins/WysiwygPlugin/TML2HTML.pm
Foswiki/Plugins/WysiwygPlugin/HTML2TML/Node.pm
Foswiki/Plugins/TwistyPlugin.pm
Foswiki/Prefs.pm
Foswiki/Contrib/BuildContrib/Targets/manifest.pm
Foswiki/Macros/WEBLIST.pm
Foswiki/Macros/TOPICLIST.pm
Foswiki/Macros/LANGUAGES.pm
Unit/TestRunner.pm
CPAN/lib/Text/Patch.pm
CPAN/lib/Crypt/PasswdMD5.pm
CPAN/lib/Locale/Maketext/Extract/Plugin/Base.pm
Foswiki.pm
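If a pattern containing \Q...\E ever had to be pushed down to an engine that lacks it (for example POSIX ERE), the quoted span would have to be expanded into backslash-escaped literals first. The Python sketch below shows one way to do that. The names escape_ere and expand_quoted are hypothetical, and the sketch assumes the pattern contains no escaped backslash immediately before a Q.

```python
import re

# Characters that are special in POSIX ERE outside a bracket expression.
ERE_META = set(".[]()*+?{}|^$\\")


def escape_ere(text):
    """Backslash-escape every ERE metacharacter in a literal string."""
    return "".join("\\" + c if c in ERE_META else c for c in text)


def expand_quoted(pattern):
    """Replace each \\Q...\\E span with its escaped literal text (hypothetical helper).

    As in Perl, a \\Q without a matching \\E runs to the end of the pattern.
    """
    return re.sub(
        r"\\Q(.*?)(?:\\E|\Z)",
        lambda m: escape_ere(m.group(1)),
        pattern,
        flags=re.DOTALL,
    )
```

A custom escaper is used rather than Python's re.escape because the latter also escapes characters such as the hyphen and space, and a backslash before an ordinary character is undefined in POSIX ERE.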

-- GeorgeClark - 24 Dec 2013

I can only imagine that this is about Store and Search engines, but CrawfordCurrie was the original author so he'll need to confirm.

To summarise the usual SQL suspects we have:

MySQL: POSIX ERE, PCRE via library
MariaDB: POSIX ERE, PCRE via library (standard from version 10)
Oracle: POSIX ERE
PostgreSQL: Tcl ARE

The following databases support regex only after installing an extra library.

DB2: Install PCRE support via http://www.ibm.com/developerworks/data/library/techarticle/0301stolze/0301stolze.html
SQL Server: Install .NET support via http://blogs.msdn.com/b/sqlclr/archive/2005/06/29/regex.aspx

I would suggest the possibility of pruning some of the columns above. It appears to me we only want columns for actual known targets.

I suppose we may end up with similar issues with any NoSQL type stores, but the active development is currently around SQL stores.

I've amended the table to highlight a few Foswiki required rows. I find the icon for Foswiki required gets lost when scrolling up and down. Do you agree this approach highlights these rows better? In which case I'll complete the job and remove the Foswiki required column. Alternatively, I could add an explicit icon for when Foswiki support is not required and another flag when we're not quite sure ;).

It seems to me that there are other rows that should be marked Foswiki required, e.g. '.', but maybe CrawfordCurrie had good reason not to include it.

-- JulianLevens - 25 Dec 2013

Correct; as described in the intro, the intention is to document constraints imposed on regexes by external third party tools. The "Foswiki required" column documents regex features used by the core code when using the search engine. For example, when searching for topic references, the core code assembles a regex and then uses the search engine to look for it. Since '.' is not used in composing this regex, then Foswiki can continue to operate without it.

I'm not a fan of the highlighted rows; it puts undue focus on the Foswiki requires column. What we should do, though, is identify a subset of PCRE that is (1) well supported across the range of search engines and (2) adequate for end-user searching. At the moment the doc is weasely about what regex features are available (RegularExpression) to an end user; it really ought to say something definitive like "POSIX ERE".

JulianLevens can you please update the "Databases" table with what you know about the different SQL implementations? Thanks.

Really we need a Task to reduce the number of regex features that core depends on. For example, [0-9] is more widely supported than \d.
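As a rough illustration of what such a reduction might look like, the Python sketch below rewrites the Perl shorthands \d, \w and \s (and their negations) into bracket expressions that POSIX engines understand. The name expand_shorthands and the mapping are assumptions, not existing Foswiki code. The mapping is ASCII-only, and the sketch assumes the shorthands appear only outside bracket expressions.

```python
import re

# ASCII-only equivalents; Unicode-aware engines may match more for \d, \w and \s.
SHORTHANDS = {
    "d": "[0-9]",
    "D": "[^0-9]",
    "w": "[A-Za-z0-9_]",
    "W": "[^A-Za-z0-9_]",
    "s": "[[:space:]]",
    "S": "[^[:space:]]",
}


def expand_shorthands(pattern):
    """Replace Perl shorthand classes with POSIX bracket expressions (hypothetical helper)."""

    def fix(match):
        # Any escape not listed in the mapping is left exactly as written.
        return SHORTHANDS.get(match.group(1), match.group(0))

    # Escapes are consumed one at a time, so an escaped backslash followed
    # by the letter d is not mistaken for \d.
    return re.sub(r"\\(.)", fix, pattern, flags=re.DOTALL)
```

Applied to the core's own patterns, a rewrite like this would let the same regex run on engines that lack the shorthand rows in the table above.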

I just noticed that the column for GNU BRE is very wrong. However I have to go and cook now.

-- CrawfordCurrie - 25 Dec 2013

That could be me. I fixed up the table formatting by replacing | with %VBAR%. I have since tried to make the POSIX ERE, POSIX BRE, GNU ERE and GNU BRE columns consistent with http://www.regular-expressions.info but that site appears to be inconsistent with http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html which does not specify the use of \} with GNU BRE. I also saw that POSIX ERE and BRE both support [[:<:]] instead of \b.

-- MichaelTempest - 25 Dec 2013

I've updated the database table.

Crawford, I was indeed confused about the meaning of the 'Foswiki required' column in that I thought it related to a minimum requirement for user search. As you say we need to actually define what a FW user can expect to work.

BTW: I've found that the standard backlinks regex used in a regexp SQL clause e.g.

select * from metaText where value regexp 'DefaultPreferences([^A-Za-z0-9]|$)|Default([^A-Za-z0-9]*)Preferences([^A-Za-z0-9]|$)|System.DefaultPreferences([^A-Za-z0-9]|$)' \G

took 43s to complete; because MySQL caches data, subsequent calls return in a rapid 0s. However, the 3 separate parts of the alternation are much quicker (< 1s each). So why is so much time required? They are all table scans, so it shouldn't be an I/O bottleneck, which suggests that this regex performs surprisingly poorly under POSIX ERE. This will need more investigation. Note that a simple search for '.' with the original Perl regex as a post filter is much faster.

MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default|Preferences'\G -- AutoDocket([^A-Za-z0-9]|$)|Auto([^A-Za-z0-9]*)Docket([^A-Za-z0-9]|$)|System.AutoDocket([^A-Za-z0-9]|$)' \G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (9.13 sec)

MariaDB [foswiki]> select count(fobid) from metaText where value regexp '.'\G
*************************** 1. row ***************************
count(fobid): 418800
1 row in set (0.53 sec)

MariaDB [foswiki]> select count(fobid) from metaText where value regexp 'Default' or value regexp 'Preferences'\G
*************************** 1. row ***************************
count(fobid): 2551
1 row in set (0.76 sec)

I suspect it's related to this: http://www.regular-expressions.info/posix.html see the POSIX ERE Alternation Returns The Longest Match section.

Note that these timings do not cover the complete process with Perl and DBI, where all the data is transferred to Perl for processing. I.e. reading 418800 strings into Perl for further filtering may still be the overall worst performer, but that's not how it feels.
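The workaround hinted at by the timings above (separate regexp clauses joined with or, instead of one alternation) could be automated along these lines. This Python sketch splits a pattern at its top-level | characters and builds a parameterised where clause. The function names are hypothetical, the %s placeholder style is an assumption about the database driver, and whether the rewritten query is actually faster would need to be measured on the target engine.

```python
def split_top_level_alternation(pattern):
    """Split a regex at '|' characters that are outside groups and brackets."""
    parts, current = [], []
    depth = 0
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            current.append(pattern[i:i + 2])  # escaped character, copied as-is
            i += 2
            continue
        if ch == "[":
            # Copy the whole bracket expression; '|' and '(' are literal inside.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":  # a leading ']' is a literal
                j += 1
            end = pattern.find("]", j)
            end = n if end == -1 else end + 1
            current.append(pattern[i:end])
            i = end
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def regexp_where(column, pattern):
    """Return (where_clause, params) with one regexp test per alternative."""
    parts = split_top_level_alternation(pattern)
    clause = " or ".join(f"{column} regexp %s" for _ in parts)
    return clause, parts
```

For the backlinks pattern quoted above, this yields three regexp tests on value, one per alternative, while the inner ([^A-Za-z0-9]|$) groups are left intact.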

Why is it rarely simple?