Diary of a Dangineer: replaceAll vs. replace

In today’s Diary of a Dangineer[1] I learned the difference between replace and replaceAll on java.lang.String the hard way.

My task involved sanitising a large amount of data by cleaning up some strings. I thought this should be easy, and reached for Spark. I wrote a small program that would merrily prance through a bazillion lines of JSON stored on HDFS and get rid of those nasty unwanted characters. I thought I should use replaceAll, not replace, because, y’know, I wanted to replace all.

So this is what it looked like (Spark artifacts elided to prevent distraction):
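The original listing is not available here, so what follows is only a minimal sketch of the kind of call described, not the author's actual code. The class name, the method name, and the sample input are all hypothetical.

```java
// Hypothetical reconstruction: names and sample data are assumptions,
// not taken from the original program.
final class SanitiseSketch {

    // Removes every match of `unwanted` from `line`.
    // replaceAll interprets its first argument as a regular expression.
    static String sanitise(String line, String unwanted) {
        return line.replaceAll(unwanted, "");
    }

    public static void main(String[] args) {
        // '#' has no special meaning in a default Java regex, so this input is harmless.
        System.out.println(sanitise("{\"name\": \"foo#bar\"}", "#"));
    }
}
```

With a harmless string such as "#", a call like this behaves exactly as intended, which is why nothing looks wrong at first.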

Feeling happy with the code, I unleashed it on a cluster and left it running. Several hours and hundreds of gigabytes later, I checked the logs to see this stack trace popping up everywhere:
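The stack trace itself is not reproduced here. As a rough illustration, the sketch below triggers the same class of failure, assuming the string being removed was “(+1” (the value named later in this post). The class name, method name, and sample input are hypothetical.

```java
import java.util.regex.PatternSyntaxException;

// Hypothetical demo: names and sample data are assumptions.
final class RegexFailureDemo {

    // Returns true if replaceAll rejects `unwanted` as an invalid regular expression.
    static boolean failsAsRegex(String unwanted) {
        try {
            "dial (+1 555".replaceAll(unwanted, "");
            return false;
        } catch (PatternSyntaxException e) {
            // Thrown while the pattern is compiled, before any replacing happens.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsAsRegex("(+1")); // prints "true"
    }
}
```

PatternSyntaxException is an unchecked exception, so the compiler gives no warning. The call fails only at runtime, and only when the offending string is actually used as a pattern.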

After a couple of WTFs, angrily punching the air, and going through the five phases, I read the documentation of replaceAll which stated:

Replaces each substring of this string that matches the given regular expression with the given replacement.

You see, String.replaceAll takes a regex as its first parameter, and it was compiling “(+1” as a regex. I was supposed to use String.replace, which also replaces every occurrence but treats its first parameter as a literal string rather than a regex, as evident in the code below:
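Since the corrected listing is not available, here is a minimal sketch of the fix under the same assumptions as before (hypothetical class name, method names, and sample input). It also shows Pattern.quote, which is one way to keep replaceAll while forcing the first argument to be matched literally.

```java
import java.util.regex.Pattern;

// Hypothetical demo: names and sample data are assumptions.
final class LiteralReplaceDemo {

    // String.replace(CharSequence, CharSequence) replaces every occurrence
    // and treats `unwanted` as plain text, not as a regex.
    static String stripLiteral(String line, String unwanted) {
        return line.replace(unwanted, "");
    }

    // Alternative: keep replaceAll, but quote the pattern so it matches literally.
    static String stripQuoted(String line, String unwanted) {
        return line.replaceAll(Pattern.quote(unwanted), "");
    }

    public static void main(String[] args) {
        System.out.println(stripLiteral("a(+1b(+1c", "(+1")); // prints "abc"
        System.out.println(stripQuoted("a(+1b(+1c", "(+1"));  // prints "abc"
    }
}
```

For a fixed literal string, replace is the simpler choice. Pattern.quote is mostly useful when the rest of the call genuinely needs regex features.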

[1] Dangineer (n): An engineer who works with Big Data, which is dangerously addictive.
Usage: “Hey, Vijay is a hoopy Dangineer.”