Sunday, August 1, 2010

Beware a slow death by regular expressions!

Overview & Motivation

The Java programming language provides great support for regular expressions. In general, the performance of regular expressions in Java is very good. The trade-off for that excellent performance is a (somewhat) protracted API. To see what I mean, have a a look at a canonical regular expression example from Sun's javadoc:

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

Kind of verbose, considering all we've accomplished is checking a single CharacterSequence for a match against one regular expression! Such complaints about decreased simplicity are likely what motivated Sun to provide some regular expression "convenience" methods inside the API for the String class. Specifically, as of version 6, the String API provides five such convenience methods:

  • public boolean matches(String regex);
  • public String replaceFirst(String regex, String replacement);
  • public String replaceAll(String regex, String replacement);
  • public String[] split(String regex);
  • public String[] split(String regex, int limit);

Each of the above convenience methods constructs and delegates the method call to Pattern and Matcher objects. What's not immediately obvious from the documentation for those convenience methods is that they become extremely inefficient as you execute them many thousands of times.

Reg. Exp. API Timing Data

Consider the following bit of code that illuminates the inefficiency of the String.matches(regex) method:

public class RegExpExerciser
{
 public static void main(String[] args)
 {

  int numTimes = 10000000;

  String regexp = "a*b";
  String toMatch = "aaaaab";

  long begin1 = System.currentTimeMillis();
  for (int i = 0; i < numTimes; i++)
  {
   toMatch.matches(regexp);
  }
  System.out.println(
   "Milliseconds using convenience method from String: " 
   + (System.currentTimeMillis() - begin1)
  );

  long begin2 = System.currentTimeMillis();
  Pattern p = Pattern.compile(regexp);
  for (int i = 0; i < numTimes; i++)
  {
   p.matcher(toMatch).matches();
  }
  System.out.println(
   "Milliseconds when re-using compiled reg. exp. pattern: " 
   + (System.currentTimeMillis() - begin2)
  );

 }
}

Executing the above code, my box reveals these times:

Milliseconds using convenience method from String: 5074
Milliseconds when re-using compiled reg. exp. pattern: 1720

RegExpHelper.java to the Rescue!

The code sample above clearly demonstrates that the time to compile a regular expression Pattern dominates the time required to actually check whether a given string matches the pattern. Utilizing that key insight, I offer a helper class in the Big-Oh Software Common library that provides simple APIs for working with regular expressions without sacrificing performance: RegExpHelper.java (javadoc). The key feature to note about that class is that it caches compiled regular expressions, making its methods both simple and efficient.

No comments:

Post a Comment