Python Regex Complete Guide: re Module Usage
Master Python's re module from basics to advanced with practical examples and best practices.
Python's re module is a powerful tool for working with regular expressions. Whether you're doing data cleaning, text analysis, or form validation, mastering the re module will significantly boost your productivity. This guide will take you from zero to hero, covering everything you need to know about Python regular expressions.
Why Choose Python's re Module?
Python's re module offers complete regular expression support with several advantages:
- Built-in module, no installation required
- Clean, intuitive syntax
- Excellent performance for large text processing
- Rich function library for various needs
- Seamless integration with other Python modules
re Module Basics
Importing the Module
Import the module before use:
import re
Core Functions Overview
The re module provides multiple functions, each with specific purposes:
- re.match() - Match from the beginning of the string
- re.search() - Search for the first match anywhere in the string
- re.findall() - Find all matching occurrences
- re.finditer() - Return an iterator of all matches
- re.sub() - Replace matching text
- re.split() - Split string by pattern
- re.compile() - Compile regex pattern (improves performance)
Detailed Function Usage
1. re.match() - Match from Beginning
match() only checks if the pattern matches the beginning of the string:
import re
text = "Hello, World!"
pattern = r"Hello"
result = re.match(pattern, text)
if result:
    print("Match found:", result.group())  # Output: Hello
else:
    print("Match failed")
# No match case
result = re.match(r"World", text) # Returns None
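Use case: Check that input starts with an expected format. To validate a whole string, such as an email or phone number, end the pattern with $ or use re.fullmatch(), because match() ignores anything after the matched prefix.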
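The difference between prefix matching and whole-string matching is easy to miss in validation code. Here is a minimal sketch, assuming a hypothetical "exactly 11 digits" phone format chosen only for illustration:

```python
import re

# Hypothetical format for illustration: exactly 11 digits.
phone_pattern = r"\d{11}"

candidate = "13812345678abc"  # valid prefix followed by junk

# match() succeeds because the first 11 characters fit the pattern.
prefix_ok = re.match(phone_pattern, candidate) is not None

# fullmatch() requires the whole string to fit, so it rejects the input.
whole_ok = re.fullmatch(phone_pattern, candidate) is not None

print(prefix_ok, whole_ok)  # True False
```

For validation, re.fullmatch() (or a pattern ending in $) is usually the safer choice.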
2. re.search() - Search First Match
search() looks for the first match anywhere in the string:
text = "Python is awesome! Python is powerful!"
pattern = r"Python"
result = re.search(pattern, text)
if result:
    print("Found:", result.group())  # Output: Python
    print("Position:", result.start())  # Output: 0
Use case: Extract key information from log files, find specific errors or warnings.
3. re.findall() - Find All Matches
findall() returns a list of all matching occurrences:
text = "My phone: 138-1234-5678, yours: 139-8765-4321"
pattern = r"\d{3}-\d{4}-\d{4}"
phone_numbers = re.findall(pattern, text)
print(phone_numbers) # Output: ['138-1234-5678', '139-8765-4321']
Use case: Batch extract data, such as all emails, links, or image URLs from a webpage.
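One behavior worth knowing: when the pattern contains capture groups, findall() returns the groups rather than the full match. A short sketch using the same phone text:

```python
import re

text = "My phone: 138-1234-5678, yours: 139-8765-4321"

# No groups: each item is the full match.
full = re.findall(r"\d{3}-\d{4}-\d{4}", text)

# Capture groups: each item is a tuple of the groups, not the full match.
parts = re.findall(r"(\d{3})-(\d{4})-(\d{4})", text)

print(full)   # ['138-1234-5678', '139-8765-4321']
print(parts)  # [('138', '1234', '5678'), ('139', '8765', '4321')]
```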
4. re.finditer() - Get Detailed Match Info
finditer() returns an iterator of match objects with more information:
text = "Email: alice@example.com, bob@example.org, carol@example.net"  # placeholder addresses
pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
for match in re.finditer(pattern, text):
    print(f"Email: {match.group()}, Start: {match.start()}, End: {match.end()}")
Use case: When you need to know the exact position of each match.
5. re.sub() - Powerful Replacement
sub() can replace all matching occurrences:
# Simple replacement
text = "Hello, Hello, Hello"
result = re.sub(r"Hello", "Hi", text)
print(result) # Output: Hi, Hi, Hi
# Using callback function
text = "Price: 100, 200, 300"
def discount(match):
    price = int(match.group())
    return f"¥{price * 0.9:.1f}"  # Apply a 10% discount
result = re.sub(r"\d+", discount, text)
print(result) # Output: Price: ¥90.0, ¥180.0, ¥270.0
# Limit replacement count
text = "a-a-a-a"
result = re.sub(r"a", "b", text, count=2)
print(result) # Output: b-b-a-a
Use case: Batch modify text format, such as unifying date formats or cleaning special characters.
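For the date-format use case, the replacement string can refer back to captured groups. A minimal sketch, with sample dates made up for illustration:

```python
import re

text = "Start: 2024-01-25, end: 2024-02-03"

# \1, \2, \3 refer to the captured year, month, and day.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(reordered)  # Start: 25/01/2024, end: 03/02/2024

# Named groups can be referenced with \g<name>.
named = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<m>/\g<d>/\g<y>",
    text,
)
print(named)  # Start: 01/25/2024, end: 02/03/2024
```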
6. re.split() - Flexible String Splitting
split() splits a string based on regex pattern:
# Split by multiple delimiters
text = "apple,banana;orange|grape"
result = re.split(r"[,;|]", text)
print(result) # Output: ['apple', 'banana', 'orange', 'grape']
# Keep delimiters
text = "apple banana orange"
result = re.split(r"(\s+)", text)
print(result) # Output: ['apple', ' ', 'banana', ' ', 'orange']
# Limit split count
text = "one,two,three,four"
result = re.split(r",", text, maxsplit=2)
print(result) # Output: ['one', 'two', 'three,four']
Use case: Parse complex text formats, like log files or configuration files.
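When parsing real input, consecutive delimiters produce empty strings in the result. A small sketch of the problem and one possible fix (the sample text is made up):

```python
import re

text = "apple,,banana;;orange"

# Each single delimiter splits, so doubled delimiters yield empty strings.
with_empties = re.split(r"[,;]", text)
print(with_empties)  # ['apple', '', 'banana', '', 'orange']

# Adding + treats a run of delimiters as one separator.
without_empties = re.split(r"[,;]+", text)
print(without_empties)  # ['apple', 'banana', 'orange']
```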
7. re.compile() - Improve Performance
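If you use the same pattern many times, compile it once and reuse the pattern object:
# Not compiled (each call looks the pattern up in re's internal cache)
pattern = r"\b\w+\b"
text = "This is a test"
for _ in range(1000):
    words = re.findall(pattern, text)
# Compiled (the pattern object is created once and reused directly)
compiled_pattern = re.compile(r"\b\w+\b")
for _ in range(1000):
    words = compiled_pattern.findall(text)  # Skips the per-call cache lookup
Use case: When using the same regex pattern in loops or frequent calls. The re module caches recently compiled patterns, so the speedup is usually modest; compiling also gives the pattern a reusable name.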
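How much compiling helps depends on the workload, so it is worth measuring. Here is a rough sketch using timeit; timings vary by machine and Python version, so no expected numbers are shown:

```python
import re
import timeit

text = "This is a test"
compiled = re.compile(r"\b\w+\b")

# Both approaches return the same result; only per-call overhead differs.
same_result = re.findall(r"\b\w+\b", text) == compiled.findall(text)

module_time = timeit.timeit(lambda: re.findall(r"\b\w+\b", text), number=10_000)
compiled_time = timeit.timeit(lambda: compiled.findall(text), number=10_000)

print(f"module-level: {module_time:.4f}s, compiled: {compiled_time:.4f}s")
```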
Regex Pattern Details
Basic Patterns
# Character classes
pattern = r"[a-z]" # Match any lowercase letter
pattern = r"[A-Z0-9]" # Match uppercase letter or digit
pattern = r"[^0-9]" # Match non-digit
# Predefined character classes
pattern = r"\d" # Digit: [0-9] (also other Unicode digits unless re.ASCII is set)
pattern = r"\D" # Non-digit
pattern = r"\w" # Word character: [a-zA-Z0-9_], plus Unicode letters and digits unless re.ASCII is set
pattern = r"\W" # Non-word character
pattern = r"\s" # Whitespace character
pattern = r"\S" # Non-whitespace character
# Quantifiers
pattern = r"a*" # 0 or more times
pattern = r"a+" # 1 or more times
pattern = r"a?" # 0 or 1 time
pattern = r"a{3}" # Exactly 3 times
pattern = r"a{2,5}" # 2 to 5 times
pattern = r"a{2,}" # At least 2 times
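To see a few of these patterns in action, here is a short sketch. The sample strings are made up; re.ASCII is the standard flag that restricts \w, \d, and \s to ASCII:

```python
import re

# Quantifiers: {2,5} accepts between 2 and 5 repetitions.
three_ok = bool(re.fullmatch(r"a{2,5}", "aaa"))
one_ok = bool(re.fullmatch(r"a{2,5}", "a"))
print(three_ok, one_ok)  # True False

sample = "你好 abc_1"

# On str patterns, \w is Unicode-aware by default in Python 3.
unicode_words = re.findall(r"\w+", sample)
print(unicode_words)  # ['你好', 'abc_1']

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", sample, re.ASCII)
print(ascii_words)  # ['abc_1']
```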
Boundary Matching
text = "hello world hello"
# ^ Match string start
re.search(r"^hello", text) # Matches first hello
# $ Match string end
re.search(r"hello$", text) # Matches last hello
# \b Match word boundary
re.findall(r"\bhello\b", text) # Only matches standalone hello
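By default ^ and $ anchor to the whole string, which matters for multi-line input. A minimal sketch of how re.MULTILINE changes that, using made-up sample text:

```python
import re

text = "hello world\nhello again\nsay hello"

# Without MULTILINE, ^ anchors only at the start of the whole string.
start_only = re.findall(r"^hello", text)
print(start_only)  # ['hello']

# With MULTILINE, ^ also matches right after each newline.
each_line_start = re.findall(r"^hello", text, re.MULTILINE)
print(each_line_start)  # ['hello', 'hello']

# Likewise, $ matches before each newline as well as at the very end.
each_line_end = re.findall(r"hello$", text, re.MULTILINE)
print(each_line_end)  # ['hello']

# \b keeps "hello" from matching inside a longer word.
standalone = re.findall(r"\bhello\b", "hello helloworld othello")
print(standalone)  # ['hello']
```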
Grouping and Capturing
# Capture groups
text = "My birthday: 1990-05-15"
pattern = r"(\d{4})-(\d{2})-(\d{2})"
match = re.search(pattern, text)
if match:
    year = match.group(1)  # 1990
    month = match.group(2)  # 05
    day = match.group(3)  # 15
# Named groups
text = "Name: 张三, Age: 25"
pattern = r"Name: (?P<name>\w+), Age: (?P<age>\d+)"
match = re.search(pattern, text)
if match:
    print(match.group('name'))  # 张三
    print(match.group('age'))  # 25
# Non-capturing groups
pattern = r"(?:apple|banana|orange)" # Group but don't capture
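Match objects expose the captured groups in several ways. This sketch reuses the birthday example; the fruit string is made up to show the effect of a non-capturing group:

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = pattern.search("My birthday: 1990-05-15")

if match:
    print(match.group(0))     # 1990-05-15
    print(match.groups())     # ('1990', '05', '15')
    print(match.groupdict())  # {'year': '1990', 'month': '05', 'day': '15'}

# The non-capturing group is not reported, so findall() returns
# only the one captured quantity per match.
quantities = re.findall(r"(?:apple|banana) x (\d+)", "apple x 2, banana x 3")
print(quantities)  # ['2', '3']
```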
Practical Examples
Example 1: Validate Email Address
import re
def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))
# Test
print(validate_email("user@example.com")) # True (placeholder address)
print(validate_email("invalid.email")) # False
print(validate_email("test@domain")) # False
Example 2: Extract Web Links
import re
html = """
<a href="https://example.com">Link 1</a>
<a href="http://site.org/page">Link 2</a>
<a href="/relative/path">Link 3</a>
"""
pattern = r'href=["\']([^"\']+)["\']'
links = re.findall(pattern, html)
print(links)
# Output: ['https://example.com', 'http://site.org/page', '/relative/path']
Example 3: Clean Text
import re
text = "This   is  a    very   messy    text!!!!!"
# Remove extra spaces and punctuation
cleaned = re.sub(r'\s+', ' ', text)
cleaned = re.sub(r'!+', '!', cleaned)
print(cleaned) # Output: This is a very messy text!
Example 4: Log Analysis
import re
log = """
2024-01-25 10:30:45 [INFO] User login successful
2024-01-25 10:31:20 [ERROR] Database connection failed
2024-01-25 10:32:10 [INFO] Data saved
2024-01-25 10:33:05 [WARNING] Memory usage at 85%
"""
# Extract error logs
error_pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[ERROR\] (.+)'
errors = re.findall(error_pattern, log)
print(errors) # Output: ['Database connection failed']
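The same log can also be parsed line by line into structured records. This is one possible approach, combining named groups with collections.Counter; the field names (date, time, level, message) are my own labels, not part of any log standard:

```python
import re
from collections import Counter

log = """\
2024-01-25 10:30:45 [INFO] User login successful
2024-01-25 10:31:20 [ERROR] Database connection failed
2024-01-25 10:32:10 [INFO] Data saved
2024-01-25 10:33:05 [WARNING] Memory usage at 85%
"""

# MULTILINE lets ^ and $ anchor at each line of the log.
line_pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<message>.+)$",
    re.MULTILINE,
)

# One dict per log line, keyed by group name.
entries = [m.groupdict() for m in line_pattern.finditer(log)]
level_counts = Counter(entry["level"] for entry in entries)

print(level_counts["INFO"])   # 2
print(entries[1]["message"])  # Database connection failed
```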
Example 5: Data Extraction
import re
text = """
Order #1001: Apple x 2 = ¥10.00
Order #1002: Banana x 3 = ¥15.00
Order #1003: Orange x 1 = ¥8.00
"""
pattern = r'Order #(\d+): (\w+) x (\d+) = ¥(\d+\.\d+)'
orders = re.findall(pattern, text)
for order in orders:
    order_id, product, quantity, price = order
    print(f"Order ID: {order_id}, Product: {product}, Quantity: {quantity}, Price: {price}")
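findall() returns every field as a string, so extracted data usually needs converting before use. A sketch of one way to do that, assuming Decimal is acceptable for the amounts; the dictionary keys are illustrative:

```python
import re
from decimal import Decimal

text = """
Order #1001: Apple x 2 = ¥10.00
Order #1002: Banana x 3 = ¥15.00
Order #1003: Orange x 1 = ¥8.00
"""

pattern = re.compile(r"Order #(\d+): (\w+) x (\d+) = ¥(\d+\.\d+)")

# Convert each captured string to a suitable type.
orders = [
    {
        "id": int(order_id),
        "product": product,
        "quantity": int(quantity),
        "amount": Decimal(amount),
    }
    for order_id, product, quantity, amount in pattern.findall(text)
]

total = sum(order["amount"] for order in orders)
print(total)  # 33.00
```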
Best Practices
1. Use Raw Strings
# Good practice
pattern = r"\d{3}-\d{4}"
# Bad practice
pattern = "\\d{3}-\\d{4}"
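The raw-string prefix matters most for escapes that Python itself interprets. A short sketch with \b, which is a backspace character in a normal string but a word boundary in a regex:

```python
import re

# In a normal string, "\b" is the backspace character; in a raw string
# it stays as two characters, a backslash and the letter b.
plain_len, raw_len = len("\b"), len(r"\b")
print(plain_len, raw_len)  # 1 2

text = "a cat sat"

# Without the r prefix the regex engine receives backspaces, not boundaries.
no_raw = re.findall("\bcat\b", text)
with_raw = re.findall(r"\bcat\b", text)
print(no_raw)    # []
print(with_raw)  # ['cat']
```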
2. Compile Common Patterns
# Compile if used multiple times
EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
def is_valid_email(email):
    return bool(EMAIL_PATTERN.match(email))
3. Use Named Groups
# Good practice
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
# Bad practice
pattern = r"(\d{4})-(\d{2})-(\d{2})" # Need to remember indices
4. Handle Match Failures
match = re.search(pattern, text)
if match:
    # Process match result
    result = match.group(1)
else:
    # Handle no match case
    print("No match found")
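The check can be wrapped in a small helper so callers never touch a None result. A sketch; first_group is a hypothetical helper name, not part of the re module:

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

found = first_group(r"id=(\d+)", "user id=42")
missing = first_group(r"id=(\d+)", "no id here", "n/a")

print(found)    # 42
print(missing)  # n/a
```

Calling .group() directly on a failed search raises AttributeError, because re.search() returns None when nothing matches.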
5. Use Appropriate Functions
- Only need to check if it matches: re.search() or re.match()
- Need all matches: re.findall()
- Need position info: re.finditer()
- Need to replace: re.sub()
- Need to split: re.split()
Common Pitfalls
1. Greedy vs Non-greedy
text = "<div>content1</div><div>content2</div>"
# Greedy match (default)
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group()) # Matches entire string
# Non-greedy match
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group()) # Only matches first <div>
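The difference is easiest to see with findall() and a capture group around the quantified part. A minimal sketch using the same text:

```python
import re

text = "<div>content1</div><div>content2</div>"

# Greedy: .* runs to the last </div>, swallowing the tags in between.
greedy_parts = re.findall(r"<div>(.*)</div>", text)
print(greedy_parts)  # ['content1</div><div>content2']

# Non-greedy: .*? stops at the first </div> each time.
lazy_parts = re.findall(r"<div>(.*?)</div>", text)
print(lazy_parts)  # ['content1', 'content2']
```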
2. Escape Special Characters
# Characters to escape: . ^ $ * + ? { } [ ] \ | ( )
pattern = r"\.com" # Match literal .com
pattern = r"\$" # Match literal $
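When the text to match comes from a variable, for example user input, re.escape() can do the escaping for you. A short sketch with made-up sample strings:

```python
import re

# An unescaped dot matches any character, so this pattern is too loose.
loose = bool(re.search(r"5.00", "5x00"))
strict = bool(re.search(re.escape("5.00"), "5x00"))
print(loose, strict)  # True False

# re.escape() makes arbitrary input safe to use as a literal pattern.
user_input = "price (USD) $5.00?"
text = "Listed price (USD) $5.00? Yes."
found = re.search(re.escape(user_input), text).group()
print(found)  # price (USD) $5.00?
```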
3. Chinese Character Handling
# Match Chinese characters
text = "你好世界123"
pattern = r"[\u4e00-\u9fa5]+" # Common CJK ideograph range (the full CJK Unified Ideographs block runs to \u9fff)
chinese = re.findall(pattern, text)
print(chinese) # Output: ['你好世界']
# Note: In Python 3, strings support Unicode by default
Performance Tips
- Compile frequently used patterns: Use re.compile() for pre-compilation
- Avoid overusing .*: Use more specific patterns
- Use non-greedy matching: .*? instead of .*
- Avoid unnecessary grouping: Use (?:...) for non-capturing groups
- Use character classes: [abc] is faster than multiple | operators
Summary
Python's re module is powerful and easy to use. After mastering these techniques, you can:
- Efficiently process text data
- Validate user input
- Extract key information
- Batch modify text
- Analyze log files
Remember: Practice is the best teacher. Write more code, try different patterns, and you'll soon become a regex expert!
Use our online Regex Tester to practice and test your regex patterns with immediate results!
About the Author
The Regex Master Team consists of experienced developers and technical writers dedicated to simplifying regular expressions for everyone. We ensure all patterns are rigorously tested and verified to provide accurate, production-ready solutions.