Just Say No to Regex
String manipulation (matching, extraction, splitting, removals, replacements etc.) in Java traditionally resorts to two approaches.
For the simplest cases (such as taking the part before or after a delimiter, or between two delimiters), it takes an input.indexOf(myChar) call followed by an input.substring(startIndex, endIndex) call. Along the way, some remember to check whether the index is -1, and some just feel lucky and don't bother.
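As a point of reference, the index-based approach described above might look like the following sketch. The class and method names (DelimiterDemo, before, between) are made up for illustration; only String.indexOf and String.substring come from the JDK.

```java
import java.util.Optional;

final class DelimiterDemo {
  /** Returns the text before the first delimiter, or empty if the delimiter is absent. */
  static Optional<String> before(String input, char delimiter) {
    int index = input.indexOf(delimiter);
    if (index < 0) {
      // Without this check, substring(0, -1) throws StringIndexOutOfBoundsException.
      return Optional.empty();
    }
    return Optional.of(input.substring(0, index));
  }

  /** Returns the text between the first open and the next close, or empty if either is absent. */
  static Optional<String> between(String input, char open, char close) {
    int start = input.indexOf(open);
    if (start < 0) {
      return Optional.empty();
    }
    // Search for the closing delimiter only after the opening one.
    int end = input.indexOf(close, start + 1);
    if (end < 0) {
      return Optional.empty();
    }
    // The +1 skips the opening delimiter itself: the classic off-by-one spot.
    return Optional.of(input.substring(start + 1, end));
  }

  public static void main(String[] args) {
    System.out.println(before("key=value", '='));     // Optional[key]
    System.out.println(before("no delimiter", '='));  // Optional.empty
    System.out.println(between("f(x)", '(', ')'));    // Optional[x]
  }
}
```

Each helper needs two index checks and one offset adjustment, which is exactly where the "feeling lucky" bugs tend to creep in.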
For anything more complex, there's regex.
But regex in Java is in a sad state:
- Its recursive backtracking implementation suffers from worst-case exponential time complexity. See the StackOverflow outage.
- Regex patterns tend to be cryptic to read, especially in Java, where you need to escape \ itself, which then requires \\ for any regex escape.
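The double-escaping point can be seen in a small JDK-only sketch. The class name EscapeDemo and its constants are hypothetical; Pattern.compile, Pattern.split and Pattern.quote are standard java.util.regex calls.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
  // The regex \| (a literal pipe) has to be written as "\\|" in Java source,
  // because the Java compiler consumes one level of backslashes first.
  static final Pattern PIPE = Pattern.compile("\\|");

  // A dotted version number: the regex is [0-9]+(\.[0-9]+)*
  static final Pattern VERSION = Pattern.compile("[0-9]+(\\.[0-9]+)*");

  static String[] splitOnPipe(String input) {
    return PIPE.split(input);
  }

  public static void main(String[] args) {
    System.out.println("\\|".length());                         // 2 (a backslash and a pipe)
    System.out.println(splitOnPipe("a|b|c").length);            // 3
    System.out.println(VERSION.matcher("10863.0.0").matches()); // true
    System.out.println(Pattern.quote("|"));                     // \Q|\E
  }
}
```

Pattern.quote avoids the manual escaping for fully literal text, but it does not help once the pattern mixes literals with regex operators.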
Luckily, you don't need regex nearly as often as you may have thought!
In this page I'll try to give a few examples so hopefully you can see where I'm going.
Imagine you need to find the ChromeOS version from a device model string that looks like "Linux,CrOS,eve|x86_64,EVE D6B-A6B-C4C-F8N-P8A-A36|10863.0.0". In summary, the device model string is in the format of {OS}|{hardware}|{OS-version}.
Being a regex wizard, you may come up with a regex pattern like "^\\w+,CrOS,[^|]+\\|[^|]+\\|([0-9\\.]+)". But it's not exactly easy to read, is it (at least for the regex muggles)?
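For comparison, here is how that pattern might be wired up with java.util.regex. This is a sketch only: the class name RegexVersionDemo and the method findVersion are invented for illustration, while the pattern and the sample input are the ones quoted above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexVersionDemo {
  // The pattern from the text; group 1 captures the dotted OS version.
  private static final Pattern DEVICE_MODEL =
      Pattern.compile("^\\w+,CrOS,[^|]+\\|[^|]+\\|([0-9\\.]+)");

  /** Returns the OS version, or empty if the input doesn't match the expected format. */
  static Optional<String> findVersion(String deviceModel) {
    Matcher matcher = DEVICE_MODEL.matcher(deviceModel);
    return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    String deviceModel = "Linux,CrOS,eve|x86_64,EVE D6B-A6B-C4C-F8N-P8A-A36|10863.0.0";
    System.out.println(findVersion(deviceModel));  // Optional[10863.0.0]
  }
}
```

Note that the captured version is the dotted string "10863.0.0", not a plain integer, and that the reader has to count parentheses to know which group holds it.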
Let's just say no to regex. Try the following:
// The version is a dotted string such as "10863.0.0", so keep it as a String.
String version = new StringFormat("{...},CrOS,{...}|{hardware}|{version}")
    .parseOrThrow(deviceModel, (hardware, v) -> v);
- {hardware} and {version} are placeholders captured by the lambda.
- {...} is a wildcard placeholder not captured by the lambda.
- All other characters (such as , and |) are literal.
The code is intuitive to read. And StringFormat does no backtracking.
Need to split around a pattern?
Substring.consecutive(Character::isSpace)
    .repeatedly()
    .split(...);
Need to replace some patterns?
Substring.between("<password>", "</password>")
    .repeatedly()
    .replaceAllFrom(input, pwd -> "***");
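To see what that chain saves you from, here is a rough hand-rolled JDK equivalent. It is a sketch under one assumption: that only the text between the tags is replaced and the tags themselves are kept. The names MaskDemo and maskBetween are hypothetical, and this is not how Substring is implemented.

```java
final class MaskDemo {
  /** Replaces every span between open and close with mask, keeping the delimiters. */
  static String maskBetween(String input, String open, String close, String mask) {
    StringBuilder out = new StringBuilder();
    int cursor = 0;  // start of the text not yet copied to out
    while (true) {
      int openAt = input.indexOf(open, cursor);
      if (openAt < 0) {
        break;
      }
      int contentStart = openAt + open.length();
      int closeAt = input.indexOf(close, contentStart);
      if (closeAt < 0) {
        break;  // unbalanced opening delimiter: leave the rest untouched
      }
      // Copy everything up to and including the opening delimiter, then the mask.
      out.append(input, cursor, contentStart).append(mask);
      // The closing delimiter is copied by the next append.
      cursor = closeAt;
    }
    return out.append(input, cursor, input.length()).toString();
  }

  public static void main(String[] args) {
    String input = "<password>a</password> x <password>bc</password>";
    System.out.println(maskBetween(input, "<password>", "</password>", "***"));
    // prints <password>***</password> x <password>***</password>
  }
}
```

Three index variables and two early exits for what is conceptually a one-line operation.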
Want string substitution?
String template = "{who} is going to {where}";
Map<String, String> substitutions = Map.of(
    "who", "Arya",
    "where", "Braavos"
);
// Matches all {placeholder} syntaxes
Substring.RepeatingPattern placeholders =
    Substring.word()
        .immediatelyBetween("{", INCLUSIVE, "}", INCLUSIVE)
        .repeatedly();
// Returns "Arya is going to Braavos"
String result = placeholders.replaceAllFrom(
    template,
    // Skip the braces to turn {who} to "who",
    // then look up the map to get "Arya".
    placeholder -> substitutions.get(placeholder.skip(1, 1).toString()));
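For contrast, the regex route to the same substitution could look like the sketch below. It assumes Java 9 or later for Matcher.replaceAll(Function) and Map.of. The class name TemplateDemo is hypothetical, and leaving unknown placeholders untouched is a choice made here, not something the original example specifies.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateDemo {
  // Regex \{(\w+)\}: both braces need escaping, and each escape needs doubling.
  private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

  static String substitute(String template, Map<String, String> substitutions) {
    return PLACEHOLDER.matcher(template).replaceAll(
        // quoteReplacement keeps '$' and '\' in the values from being treated
        // as group references. Unknown placeholders are left as-is.
        match -> Matcher.quoteReplacement(
            substitutions.getOrDefault(match.group(1), match.group())));
  }

  public static void main(String[] args) {
    Map<String, String> substitutions = Map.of("who", "Arya", "where", "Braavos");
    System.out.println(substitute("{who} is going to {where}", substitutions));
    // prints Arya is going to Braavos
  }
}
```

The quoteReplacement call is easy to forget, and omitting it only fails once a substituted value happens to contain a dollar sign or a backslash.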
What about the simple cases where you may be used to reaching for indexOf()? Fiddling with indexes is prone to off-by-one errors and makes for unreadable code. Instead, consider using either StringFormat like:
new StringFormat("'{quoted}'").scan(input, quoted -> quoted);
Or Substring like:
Substring.between('\'', '\'').repeatedly().from(input);
Life will be easier without regexes, my friend.