A smaller, more compatible regular expression engine for XS
The new regular expression engine in XS achieves several important goals:
- Support for three regular expression proposals under consideration for inclusion in a future edition of the standard
- Significantly smaller ROM footprint, about 88% smaller.
- Easier integration with XS
The PCRE engine has worked well for XS for nearly 15 years. It has been necessary to modify PCRE to integrate cleanly with XS, for example to use the XS memory heap instead of system memory. More recently, we have enhanced the functionality of PCRE to achieve compatibility with the ECMAScript standard.
There are several proposals currently in the late stages of consideration by TC39, the ECMAScript language committee. We found it impractical to enhance PCRE further to support these features, in particular look behind assertions. We explored using PCRE2, but it is an even bigger engine and achieving a clean integration proved too difficult.
I am no specialist of regular expressions. I studied the theory a bit, before implementing a parser in C, following the specified grammar and using the hand coded parsing technique that is everywhere in XS.
The character ranges were kind of fun because the parser has to union or inverse character sets. I saw character sets like one dimension graphical regions and used the same algorithms as Piu! Graphical regions are small enough and fast enough for hit testing, which is the most frequent operation when executing regular expressions.
The new regular expression engine achieves excellent compatibility as verified by test262 tests.
- 100% (178 of 178 tests pass) of test262
- 99% (1684 of 1698 tests pass) of test262
- 100% of the dotAll Stage 4 proposal
- 100% of the look behind assertions Stage 3 proposal
- 100% of the named capture groups Stage 3 proposal
The new engine does not implement the unicode property escapes Stage 3 proposal. There is no technical barrier to implementing this proposal. However, the required Unicode tables are larger than practical on the devices XS targets.
The PCRE engine, as used by XS, compiles to 219,888 bytes. The embedded devices XS is designed to target typically have between 512KB and 2MB of ROM. PCRE requires such a large fraction of the total ROM that it is usually stripped from embedded projects.
The new regular expression engine compiles to 25,644 bytes, or just 11.6% of the size of PCRE. This reduction in code size, nearly an order of magnitude, makes regular expressions practical for use by many more projects.
Note: Many factors influence the size of compiled code. The numbers provided here are based on a GCC build for ESP8266 using a release build of the Moddable SDK.
The question naturally arises as to how the new regular expression engine can be so much smaller. There is no single answer, but a variety of reasons:
- For portability, PCRE duplicates capabilities already present in XS, for example for memory management and managing character encoding.
- The new regular expression engine takes a different approach to operations on character sets.
Performance and RAM
We have not yet performed a detailed analysis of the performance and RAM use of the new regular expression engine. Some details are worth noting:
- The engine uses XS memory heaps exclusively, reducing the free space needed in system memory, speeding allocations, and reducing fragmentation.
- The engine runs with a smaller native stack size, eliminating the need to increase the native stack to use regular expressions reliably in all circumstances.
- Performance appears to be similar to PCRE. As performance has not been a focus of development of new regular expression engine, there are likely optimization opportunities.