In today’s Diary of a Dangineer[1], I learned the difference between replace and replaceAll on java.lang.String the hard way.
My task involved sanitising a large amount of data by cleaning up some strings. I thought this should be easy, and reached for Spark. I wrote a small program that would merrily prance through a bazillion lines of JSON stored on HDFS and get rid of those nasty unwanted characters. I thought I should use replaceAll, not replace, because, y’know, I wanted to replace all.
So this is what it looked like (Spark artifacts elided to prevent distraction):
    val strToBeReplaced = "(+1"
    val strToCleanup = "2kd.(+1X"
    strToCleanup.replaceAll("(+1", "")
Feeling happy with the code, I unleashed it on a cluster and left it running. Several hours and hundreds of gigabytes later, I checked the logs and saw this stack trace popping up everywhere:
    java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 1
    (+1
     ^
      at java.util.regex.Pattern.error(Pattern.java:1955)
      at java.util.regex.Pattern.sequence(Pattern.java:2123)
      at java.util.regex.Pattern.expr(Pattern.java:1996)
      at java.util.regex.Pattern.group0(Pattern.java:2905)
      at java.util.regex.Pattern.sequence(Pattern.java:2051)
      at java.util.regex.Pattern.expr(Pattern.java:1996)
      at java.util.regex.Pattern.compile(Pattern.java:1696)
      at java.util.regex.Pattern.<init>(Pattern.java:1351)
      at java.util.regex.Pattern.compile(Pattern.java:1028)
      at java.lang.String.replaceAll(String.java:2223)
After a couple of WTFs, angrily punching the air, and going through the five phases, I read the documentation of replaceAll which stated:
Replaces each substring of this string that matches the given regular expression with the given replacement.
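You see, String.replaceAll takes a regex as its first parameter, so it was trying to compile “(+1” as a regex. In that pattern, ( opens a group and + is a quantifier with nothing before it to repeat, hence the “dangling meta character” complaint. I was supposed to use String.replace, which also replaces every occurrence but treats its first parameter as a literal string rather than a regex, as the code below shows: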
    val aString = "2kd.(+1XoXo-OMGPoines(+1"
    //aString: String = 2kd.(+1XoXo-OMGPoines(+1

    aString.replaceAll("(+1", "")
    java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 1
    (+1
     ^
      at java.util.regex.Pattern.error(Pattern.java:1955)
      at java.util.regex.Pattern.sequence(Pattern.java:2123)
      at java.util.regex.Pattern.expr(Pattern.java:1996)
      at java.util.regex.Pattern.group0(Pattern.java:2905)
      at java.util.regex.Pattern.sequence(Pattern.java:2051)
      at java.util.regex.Pattern.expr(Pattern.java:1996)
      at java.util.regex.Pattern.compile(Pattern.java:1696)
      at java.util.regex.Pattern.<init>(Pattern.java:1351)
      at java.util.regex.Pattern.compile(Pattern.java:1028)
      at java.lang.String.replaceAll(String.java:2223)
      ... 32 elided

    aString.replace("(+1", "")
    //res4: String = 2kd.XoXo-OMGPoines
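For what it’s worth, here is a minimal sketch of the two ways I know of to strip a literal substring safely. The object name LiteralReplace and its method names are made up for this illustration; the only real JDK calls are String.replace, String.replaceAll and Pattern.quote.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, for illustration only.
object LiteralReplace {

  // String.replace treats the target as plain text and removes
  // every occurrence of it, so no regex is ever compiled.
  def stripLiteral(input: String, target: String): String =
    input.replace(target, "")

  // If a regex-based call is unavoidable, Pattern.quote wraps the target
  // so that metacharacters such as '(' and '+' are matched literally.
  def stripQuoted(input: String, target: String): String =
    input.replaceAll(Pattern.quote(target), "")
}
```

Both LiteralReplace.stripLiteral("2kd.(+1XoXo-OMGPoines(+1", "(+1") and the stripQuoted variant should give back 2kd.XoXo-OMGPoines. In a way I got lucky that “(+1” is not a valid regex: a target that happens to compile fails silently instead. For example, "a.b.c".replaceAll(".", "") returns an empty string, because . matches every character, while "a.b.c".replace(".", "") returns abc.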
[1] Dangineer (n): An Engineer who works with Big Data, which is dangerously addictive.
“Hey, Vijay is a hoopy Dangineer.”