Why You Should Try To Avoid Regex in Your Python

A friend of mine wants to write code for a living. I'm one of the greatest software engineers of all time (...to be), so, naturally, my friend asked me for help. He took a swing at building his first useful application: some shit that tells you the expiration date of a domain name.

After some research, he learned about web scraping. He tried to scrape the whois.com page for info, but he was having trouble. The data of interest looks like this:

Domain Name: COOL.COM  
Sponsoring Registrar IANA ID: 1659  
Whois Server: whois.uniregistrar.com  
Referral URL: http://www.uniregistrar.com  
Status: clientTransferProhibited https://www.icann.org/epp#clientTransferProhibited  
Updated Date: 06-oct-2014  
Creation Date: 12-jul-1995  
Expiration Date: 11-jul-2016  

He managed to grab the page, but he was stuck with raw HTML. After some more googling, he concluded (somehow) that he should use Python's re library to parse the HTML strings. I had to explain to him that there are real limits to what regular expressions can, and should, do.

I generally like to stay far away from regular expressions. There are a few reasons, especially if I'm writing Python:

1) You don't need regular expressions. Most of the time, you can use Python's built-in string methods. A combination of list comprehensions with string methods can be super powerful. Writing re.search('^cool', my_text) isn't necessary when you can easily do my_text.startswith('cool'). If you know the string you want to search for inside of a text, why not use Python's in like so: if "this" in "i love this shit":?
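Take the whois text from earlier. Once you have the plain text, a list comprehension plus a couple of string methods gets you the expiration date without a single pattern. Here's a rough sketch (the whois_text variable and the exact "Expiration Date:" label are assumptions based on the snippet above):

whois_text = """Domain Name: COOL.COM
Updated Date: 06-oct-2014
Creation Date: 12-jul-1995
Expiration Date: 11-jul-2016"""

# keep the line that starts with the label we care about,
# then take everything after the first colon
expiration = [
    line.split(":", 1)[1].strip()
    for line in whois_text.splitlines()
    if line.startswith("Expiration Date:")
]

print(expiration[0])  # 11-jul-2016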

2) Regular expressions can become super complicated unnecessarily. Perfect example: extracting a YouTube video ID from a valid YouTube URL. Holy shit. Say you have a link: https://www.youtube.com/watch?v=2HQaBWziYvY. You want 2HQaBWziYvY.

Why use

import re

# note: this needs to be a plain raw string in Python 3; the old ur'...' prefix is a syntax error
pattern = re.compile(r'(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/ ]{11})', re.IGNORECASE)
url = "https://www.youtube.com/watch?v=2HQaBWziYvY"

match = re.search(pattern, url)
video_id = match.group(1) if match else None  # '2HQaBWziYvY'


Chill, fam. WTF is that? Since you know the format of the string you're handling (the ID comes after v= and before any ampersand), why not just use url.split() and a slice? Why not use urllib to take the URL apart and read the query string? Your regex can complicate the shit out of things unnecessarily. And what happens when YouTube introduces a new ID format? You have to bolt more shit onto that pattern?
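Here's roughly what that looks like with the standard library instead. urlparse and parse_qs are real urllib.parse functions; the youtube_id helper name and the youtu.be fallback are my own assumptions about how you'd handle short links:

from urllib.parse import urlparse, parse_qs

def youtube_id(url):
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "v" in query:                        # https://www.youtube.com/watch?v=2HQaBWziYvY
        return query["v"][0]
    if parsed.netloc.endswith("youtu.be"):  # https://youtu.be/2HQaBWziYvY
        return parsed.path.lstrip("/")
    return None

print(youtube_id("https://www.youtube.com/watch?v=2HQaBWziYvY"))  # 2HQaBWziYvY

Every branch here says exactly which URL shape it handles, so when YouTube changes something, you know exactly what to touch.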

3) The freedom regexes provide is easy to abuse. I try to stay away from tools that let me make up my own rules entirely. Writing a pile of extremely strict (or extremely loose) regular expressions to extract readable text from, say, an SEC Form 13D document can pigeonhole the behavior of my application. I write one rule to catch an obvious pattern, and the next thing I know, I'm missing a ton of data because of false negatives from my own regexes. Then I'm forced to write more regexes, modify or extend the existing ones, or rebuild a major chunk of the application, because I've built something I can't iterate on quickly or that's too logically flawed.
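A tiny illustration of that false-negative trap, using dates like the ones in the whois output above (the second date format is hypothetical, but different registrars really do format their dates differently):

import re

dates = ["11-jul-2016", "2016-07-11T04:00:00Z"]  # second format is a made-up example

# the "obvious" rule, written after seeing only the first sample
strict = re.compile(r"\d{2}-[a-z]{3}-\d{4}")

matches = [d for d in dates if strict.fullmatch(d)]
print(matches)  # ['11-jul-2016'] -- the second date silently disappears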

Using regular expressions can be perfectly reasonable. Just don't abuse them. By reaching first for better tools (more readable, shorter, easier to write), like Python's built-in string methods and keywords, you'll save regular expressions for the cases where there's no reasonable alternative. And whether your regexes end up concise or sprawling only matters if you've actually planned what they need to match. Otherwise, you'll succumb to the aforementioned freedom and end up building your own half-baked lexical analyzer.

When you have complex text to parse and you decide to use regexes, you might find yourself writing lots and lots of patterns. If you have to overuse something, and you're writing Python, overuse the Python methods, not the regexes.


