How to Extract All URLs from Text Using Regex
Master URL extraction from text using regular expressions with comprehensive patterns for HTTP, HTTPS, FTP, and more.
Extracting URLs from text is a common task for web scraping, log analysis, and content processing. Regular expressions provide an efficient way to identify and extract URLs from unstructured text. In this comprehensive guide, we'll explore various regex patterns for URL extraction, from simple to advanced.
Understanding URL Structure
A URL (Uniform Resource Locator) consists of several components:
https://example.com/path/to/page?query=value#section
- Protocol: https
- Domain: example.com
- Path: /path/to/page
- Query string: query=value
- Fragment: section
Basic URL Extraction Patterns
Simple HTTP/HTTPS Pattern
https?://[^\s]+
Extracts: https://example.com from "Visit https://example.com for more info"
Breakdown:
- https?:// matches http:// or https://
- [^\s]+ matches one or more non-whitespace characters
Pros: Simple and fast
Cons: Includes trailing punctuation, doesn't validate URL format
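To make the trade-off concrete, here is a minimal Python sketch of the simple pattern. The function name extract_simple and the sample sentences are our own illustrations, not part of any library.

```python
import re

# The simple pattern from above: scheme, then any run of non-whitespace.
SIMPLE_URL = re.compile(r"https?://[^\s]+")

def extract_simple(text):
    """Return every non-whitespace run that starts with http:// or https://."""
    return SIMPLE_URL.findall(text)

print(extract_simple("Visit https://example.com for more info"))
# ['https://example.com']

# The sentence-ending period is swallowed, which is the "Cons" case above.
print(extract_simple("Docs are at https://example.com/docs."))
# ['https://example.com/docs.']
```

The second call shows why this pattern usually needs a cleanup step afterwards.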
Including FTP and Other Protocols
(https?|ftp)://[^\s]+
Extracts: http://site.com, https://secure.com, ftp://files.com
Comprehensive Protocol Support
(https?|ftp|file|mailto|tel):[^\s]+
Extracts: Various URL schemes including mailto: and tel:
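One thing to watch when using these multi-scheme patterns in Python: the scheme alternation is a capturing group, so re.findall returns only the scheme, not the whole URL. The sketch below uses finditer instead; the helper name and sample text are hypothetical.

```python
import re

# Same pattern as above; note the capturing group around the schemes.
SCHEME_URL = re.compile(r"(https?|ftp|file|mailto|tel):[^\s]+")

def extract_with_scheme(text):
    """Return (full_match, scheme) pairs for every URL-like token."""
    return [(m.group(0), m.group(1)) for m in SCHEME_URL.finditer(text)]

sample = "Mail mailto:team@example.com or call tel:+15550100 or see ftp://files.com"

# findall gives back only the captured group:
print(SCHEME_URL.findall(sample))
# ['mailto', 'tel', 'ftp']

# finditer gives access to the whole match as group(0):
print(extract_with_scheme(sample))
```

Changing the group to a non-capturing one, (?:https?|ftp|file|mailto|tel), is an alternative if you only need the full matches.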
Improved URL Extraction
Match Complete URLs
https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?
Breakdown:
- https?:\/\/ matches http:// or https://
- (?:www\.)? matches an optional www. prefix
- [a-zA-Z0-9-]+ matches the domain name (letters, digits, hyphens)
- \. matches a literal dot
- [a-zA-Z]{2,} matches the top-level domain (2+ letters)
- (?:\/[^\s]*)? matches an optional path and query string
Valid URLs:
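- https://example.com
- http://www.example.com
- https://example.com/path/to/page
- http://example.com?query=value (only http://example.com is captured, because the optional path group must start with /)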
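It is worth checking what this pattern actually captures for each input. The following sketch is our own test harness (first_match is a hypothetical helper); it also includes a subdomain case that the pattern was not designed for.

```python
import re

COMPLETE_URL = re.compile(
    r"https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?"
)

def first_match(text):
    """Return the first URL the pattern finds in text, or None."""
    m = COMPLETE_URL.search(text)
    return m.group(0) if m else None

for candidate in [
    "https://example.com",
    "http://www.example.com",
    "https://example.com/path/to/page",
    "http://example.com?query=value",   # query without a slash is cut off
    "https://sub.example.com/path",     # subdomain: match stops early
]:
    print(candidate, "->", first_match(candidate))
```

The last two inputs come back as http://example.com and https://sub.example, so treat this pattern as a starting point rather than a complete matcher.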
Advanced URL Pattern with Query Parameters
https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s?#]*)?(?:\?[^\s#]*)?(?:#[^\s]*)?
This pattern handles:
- Path components
- Query strings (?key=value)
- Fragments (#section)
Extracts: https://example.com/path?query=value#section from full text
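If you want the individual components and not just the full URL, one option is to add named groups to the pattern. The group names (host, path, query, fragment) and the split_url helper below are our own additions for illustration; the matching logic is otherwise the same as the pattern above.

```python
import re

# The advanced pattern, with named groups added around each component.
URL_PARTS = re.compile(
    r"https?://(?:www\.)?(?P<host>[a-zA-Z0-9-]+\.[a-zA-Z]{2,})"
    r"(?P<path>/[^\s?#]*)?"
    r"(?:\?(?P<query>[^\s#]*))?"
    r"(?:#(?P<fragment>[^\s]*))?"
)

def split_url(text):
    """Return a dict with the full URL and its parts, or None if nothing matches."""
    m = URL_PARTS.search(text)
    if m is None:
        return None
    return {"url": m.group(0), **m.groupdict()}

print(split_url("See https://example.com/path?query=value#section today"))
print(split_url("http://example.com?query=value"))
```

Components that are absent from the input come back as None, so the second call reports a query but no path or fragment.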
URL Patterns with Character Validation
Strict URL Validation
https?:\/\/(?:www\.)?[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*\.[a-zA-Z]{2,}(?:\/[^\s]*)?
This pattern:
- Validates domain name rules (max 63 characters per label)
- Ensures domains don't start or end with hyphens
- Supports subdomains (e.g., sub.example.com)
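The rules listed above can be exercised with a short Python sketch. We split the repeated label sub-pattern into a LABEL constant for readability; that constant and the find_strict helper are our own naming, and the assembled pattern is intended to be identical to the one shown.

```python
import re

# One DNS label: starts and ends with a letter or digit, at most 63 characters.
LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"

STRICT_URL = re.compile(
    r"https?://(?:www\.)?" + LABEL + r"(?:\." + LABEL + r")*\.[a-zA-Z]{2,}(?:/[^\s]*)?"
)

def find_strict(text):
    """Return the first strictly valid URL in text, or None."""
    m = STRICT_URL.search(text)
    return m.group(0) if m else None

print(find_strict("https://sub.example.com/page"))        # subdomain accepted
print(find_strict("https://-bad.example.com"))            # leading hyphen: None
print(find_strict("https://bad-.example.com"))            # trailing hyphen: None
print(find_strict("https://" + "a" * 64 + ".com"))        # 64-char label: None
```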
URL with Port Number
https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?::\d{1,5})?(?:\/[^\s]*)?
Extracts: https://example.com:8080/path
Breakdown:
- (?::\d{1,5})? matches an optional port number (1-5 digits)
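Since \d{1,5} accepts anything up to 99999, a range check after extraction can be useful. The sketch below adds a capturing group around the port digits (our modification) and flags ports outside 1-65535; extract_with_ports is a hypothetical helper.

```python
import re

# Port pattern from above, with the digits captured as group 1.
URL_WITH_PORT = re.compile(
    r"https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?::(\d{1,5}))?(?:/[^\s]*)?"
)

def extract_with_ports(text):
    """Return (url, port_or_None, port_is_valid) for every match."""
    results = []
    for m in URL_WITH_PORT.finditer(text):
        port = m.group(1)
        # A missing port is fine; an explicit one must fit in 1..65535.
        port_ok = port is None or 1 <= int(port) <= 65535
        results.append((m.group(0), port, port_ok))
    return results

print(extract_with_ports(
    "https://example.com:8080/path and http://example.com:99999/x"
))
```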
URL with IP Address
https?:\/\/(?:\d{1,3}\.){3}\d{1,3}(?::\d{1,5})?(?:\/[^\s]*)?
Extracts: https://192.168.1.1:8080/path
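The \d{1,3} groups accept values such as 999, so the regex alone does not guarantee a real IPv4 address. One possible approach, sketched here, is to capture the address and let Python's standard ipaddress module check it. The pattern below omits any www. prefix and captures the address as group 1; extract_ip_urls is our own helper name.

```python
import ipaddress
import re

# IP-address URL pattern with the address captured as group 1.
IP_URL = re.compile(
    r"https?://((?:\d{1,3}\.){3}\d{1,3})(?::\d{1,5})?(?:/[^\s]*)?"
)

def extract_ip_urls(text):
    """Return URLs whose host parses as a valid IPv4 address."""
    urls = []
    for m in IP_URL.finditer(text):
        try:
            ipaddress.IPv4Address(m.group(1))
        except ValueError:
            # Octet out of range (e.g. 999): skip this candidate.
            continue
        urls.append(m.group(0))
    return urls

print(extract_ip_urls(
    "Router https://192.168.1.1:8080/path and bogus http://999.1.1.1/x"
))
# ['https://192.168.1.1:8080/path']
```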
Special URL Patterns
URLs with Authentication
https?:\/\/[^:\s]+:[^@\s]+@[^\s]+
Extracts: https://user:password@example.com
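Once such a URL is found, the standard library can split out the credentials, which is usually safer than slicing the string by hand. This is a minimal sketch; the credentials in the sample text are made up and extract_credentials is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

AUTH_URL = re.compile(r"https?://[^:\s]+:[^@\s]+@[^\s]+")

def extract_credentials(text):
    """Find URLs with user:password@ and return their parsed parts."""
    found = []
    for url in AUTH_URL.findall(text):
        parts = urlsplit(url)
        found.append({
            "username": parts.username,
            "password": parts.password,
            "host": parts.hostname,
        })
    return found

print(extract_credentials("Login at https://user:secret@example.com/admin today"))
# [{'username': 'user', 'password': 'secret', 'host': 'example.com'}]
```

If the extracted URLs end up in logs, consider masking the password field first.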
Relative URLs
\/[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
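Extracts: /path/to/page and /path?query=value (without domain). Because the pattern starts at any slash, it also matches inside absolute URLs (for example, //example.com/path in https://example.com/path), so apply it only to text known to contain relative links.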
Data URLs
data:[^,\s]+,[^,\s]+
Extracts: data:text/plain;base64,SGVsbG8=
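As a small illustration of what can be done with an extracted data URL, the sketch below captures the header and payload separately (the two capturing groups are our addition) and decodes base64 payloads. It ignores non-base64 data URLs, which is a simplification.

```python
import base64
import re

# Data URL pattern with header and payload captured separately.
DATA_URL = re.compile(r"data:([^,\s]+),([^,\s]+)")

def decode_data_urls(text):
    """Return (header, decoded_bytes) for each base64 data URL in text."""
    decoded = []
    for m in DATA_URL.finditer(text):
        header, payload = m.group(1), m.group(2)
        if header.endswith(";base64"):
            decoded.append((header, base64.b64decode(payload)))
    return decoded

print(decode_data_urls("img src=data:text/plain;base64,SGVsbG8= end"))
# [('text/plain;base64', b'Hello')]
```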
Extracting URLs from Complex Text
Extract All URLs (Multiple Types)
(?:https?|ftp):\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?|(?:mailto|tel):[^\s]+
This pattern extracts:
- HTTP/HTTPS URLs
- FTP URLs
- mailto: links
- tel: links
URL Extraction with Punctuation Handling
https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s\)\]\}>"]*)?
Stops extraction at common closing punctuation: ), ], }, >, "
Example: "Visit https://example.com) now!" → Extracts https://example.com
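Note that this pattern stops at closing brackets and quotes but still lets a sentence-ending period through when the URL has a path. A possible combination, sketched below, pairs the pattern with a strip of trailing punctuation; the set of characters stripped is our own choice and may need tuning.

```python
import re

# Punctuation-aware pattern from above (single-quoted so it can contain ").
BOUNDED_URL = re.compile(
    r'https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s)\]}>"]*)?'
)

def extract_clean(text):
    """Extract URLs, then drop sentence punctuation left at the end."""
    return [url.rstrip(".,;:!?") for url in BOUNDED_URL.findall(text)]

print(extract_clean('(see https://example.com/page) and "https://example.com/docs".'))
# ['https://example.com/page', 'https://example.com/docs']

print(extract_clean("Read https://example.com/a/b."))
# ['https://example.com/a/b']
```

The trade-off is that a URL which legitimately ends in one of the stripped characters will be shortened.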
Code Examples
JavaScript URL Extraction
function extractUrls(text) {
  const urlRegex = /https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?/g;
  const urls = text.match(urlRegex);
  return urls || [];
}
const text = "Visit https://example.com and http://www.test.com for more info";
const urls = extractUrls(text);
console.log(urls);
// Output: ["https://example.com", "http://www.test.com"]
Python URL Extraction
import re

def extract_urls(text):
    url_pattern = r'https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?'
    return re.findall(url_pattern, text)

text = "Visit https://example.com and http://www.test.com"
urls = extract_urls(text)
print(urls)
# Output: ['https://example.com', 'http://www.test.com']
URL Extraction with Validation
function extractAndValidateUrls(text) {
  const urlRegex = /https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?/g;
  const urls = text.match(urlRegex) || [];
  // Validate each URL
  return urls.filter(url => {
    try {
      new URL(url);
      return true;
    } catch {
      return false;
    }
  });
}
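A Python counterpart to the function above might look like the following sketch. Python's urlsplit is lenient and rarely raises, so we add an explicit scheme and hostname check; what counts as "valid" here is our own assumption, and the loose extraction pattern is used so that the validation step has something to reject.

```python
import re
from urllib.parse import urlsplit

LOOSE_URL = re.compile(r"https?://[^\s]+")

def is_probably_valid(url):
    """Accept only http(s) URLs that parse and have a hostname."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Raised for malformed inputs such as an unclosed "[" in the host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)

def extract_and_validate_urls(text):
    """Extract with the regex, then keep only URLs that pass the check."""
    return [u for u in LOOSE_URL.findall(text) if is_probably_valid(u)]

print(extract_and_validate_urls(
    "Good https://example.com/a bad http:///nohost bad http://[oops"
))
# ['https://example.com/a']
```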
Advanced Use Cases
Extract URLs from HTML
href=["'](https?:[^"']+)["']
Extracts: https://example.com from <a href="https://example.com">Link</a>
Extract URLs from Markdown
\[([^\]]+)\]\((https?:[^)]+)\)
Captures:
- Group 1: Link text
- Group 2: URL
Extracts from: [Link](https://example.com) → URL: https://example.com, Text: Link
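With two capturing groups, Python's re.findall returns one (text, URL) tuple per link, which makes this pattern convenient to use. A minimal sketch, with extract_markdown_links as a hypothetical helper name:

```python
import re

MD_LINK = re.compile(r"\[([^\]]+)\]\((https?:[^)]+)\)")

def extract_markdown_links(markdown):
    """Return (link_text, url) tuples for every inline Markdown link."""
    return MD_LINK.findall(markdown)

print(extract_markdown_links(
    "[Link](https://example.com) and [Docs](http://test.com/docs)"
))
# [('Link', 'https://example.com'), ('Docs', 'http://test.com/docs')]
```

URLs that themselves contain a closing parenthesis will be cut short, since the pattern stops at the first ).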
Extract URLs from Social Media Posts
https?:\/\/(?:www\.)?(?:twitter|facebook|instagram|linkedin)\.com\/[^\s]+
Extracts: https://twitter.com/user/status/123456789
Best Practices
1. Use the Global Flag
// WRONG: Only finds first URL
const firstOnly = text.match(/https?:\/\/[^\s]+/);
// RIGHT: Finds all URLs
const allUrls = text.match(/https?:\/\/[^\s]+/g);
2. Validate After Extraction
// Extract with regex (match returns null when nothing is found)
const urls = text.match(urlRegex) || [];
// Validate with URL constructor
urls.forEach(url => {
  try {
    const parsed = new URL(url);
    console.log('Valid:', parsed.hostname);
  } catch (e) {
    console.log('Invalid:', url);
  }
});
3. Handle Trailing Punctuation
// Clean up trailing punctuation
function cleanUrl(url) {
  return url.replace(/[.,;!?]+$/, '');
}
// Fall back to an empty array because match returns null when nothing is found
const urls = (text.match(urlRegex) || []).map(cleanUrl);
4. Consider Performance
// Simple patterns are faster for large texts
const simpleRegex = /https?:\/\/[^\s]+/g;
// Complex patterns are more accurate but slower
const complexRegex = /https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?/g;
Common Pitfalls
Pitfall 1: Including Trailing Punctuation
// BAD: [^\s]+ also consumes the trailing period
const bad = "Visit https://example.com.".match(/https?:\/\/[^\s]+/g);
// Extracts: ["https://example.com."]
// GOOD: Strip trailing punctuation from each match
const good = bad.map(url => url.replace(/[.,;!?]+$/, ''));
// Extracts: ["https://example.com"]
Pitfall 2: Not Handling Subdomains
// BAD: Only matches example.com
https?:\/\/[a-zA-Z0-9-]+\.[a-zA-Z]{2,}
// GOOD: Matches sub.example.com
https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}
Pitfall 3: Missing Protocol
// BAD: Requires protocol
https?:\/\/[^\s]+
// GOOD: Matches URLs with or without protocol (expect false positives such as file.txt)
(?:https?:\/\/)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?
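Making the protocol optional widens the net considerably. The short sketch below (sample text and helper name are ours) shows that the protocol-optional pattern also picks up anything shaped like name.ext, so it is probably best reserved for text where bare domains are expected.

```python
import re

OPTIONAL_SCHEME = re.compile(
    r"(?:https?://)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?"
)

def extract_loose(text):
    """Return everything that looks like a domain, with or without a scheme."""
    return OPTIONAL_SCHEME.findall(text)

print(extract_loose("Open report.txt or visit example.com/docs"))
# ['report.txt', 'example.com/docs']

print(extract_loose("See https://example.com"))
# ['https://example.com']
```

Filtering the results against a list of known top-level domains is one way to reduce such false positives.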
Testing Your URL Extraction
Use our interactive Match Finder with these test cases:
Text with URLs:
Visit https://example.com and http://www.test.com/path
for more info. Also check ftp://files.com
Expected Extracted URLs:
https://example.com
http://www.test.com/path
ftp://files.com
Edge Cases:
- URL with port: https://example.com:8080
- URL with query: https://example.com?query=value
- URL with fragment: https://example.com#section
- URL with auth: https://user:password@example.com
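If you prefer to check these cases in code, the sketch below runs the multi-type pattern from the "Extract All URLs (Multiple Types)" section against the sample text and the edge cases. The harness itself and the credentials in the auth URL are our own illustrations.

```python
import re

# Multi-type pattern from the "Extract All URLs (Multiple Types)" section.
MULTI_URL = re.compile(
    r"(?:https?|ftp)://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?"
    r"|(?:mailto|tel):[^\s]+"
)

SAMPLE = (
    "Visit https://example.com and http://www.test.com/path\n"
    "for more info. Also check ftp://files.com"
)
EXPECTED = ["https://example.com", "http://www.test.com/path", "ftp://files.com"]

def first_match(text):
    """Return the first match of the multi-type pattern, or None."""
    m = MULTI_URL.search(text)
    return m.group(0) if m else None

print(MULTI_URL.findall(SAMPLE) == EXPECTED)  # True

# Edge cases: this general pattern does not cover ports, bare queries,
# fragments, or credentials, so each one is truncated or missed.
for case in [
    "https://example.com:8080",
    "https://example.com?query=value",
    "https://example.com#section",
    "https://user:password@example.com",
]:
    print(case, "->", first_match(case))
```

The first three edge cases come back as https://example.com and the last one does not match at all, which is why the specialized patterns earlier in this guide exist.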
Conclusion
URL extraction with regex is about finding the right balance between simplicity and accuracy. The pattern https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)? provides a good balance for most applications.
Remember to:
- Use the global flag to find all URLs
- Validate URLs after extraction
- Handle trailing punctuation
- Consider your specific use case (web scraping, log analysis, etc.)
For complex URL validation, consider using a dedicated URL parsing library in combination with regex for extraction.
Experiment with different patterns using our Regex Tester to find the perfect fit for your URL extraction needs!
About the Author
The Regex Master Team consists of experienced developers and technical writers dedicated to simplifying regular expressions for everyone. We ensure all patterns are rigorously tested and verified to provide accurate, production-ready solutions.