How to Extract All URLs from Text Using Regex

Extracting URLs from text is a common task for web scraping, log analysis, and content processing. Regular expressions provide an efficient way to identify and extract URLs from unstructured text. This guide walks through regex patterns for URL extraction, from simple to advanced.

Understanding URL Structure

A URL (Uniform Resource Locator) consists of several components:

https://example.com/path/to/page?query=value#section
│       │          │            │           │
│       │          │            │           └─ Fragment
│       │          │            └───────────── Query string
│       │          └────────────────────────── Path
│       └───────────────────────────────────── Domain
└───────────────────────────────────────────── Protocol
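
The same breakdown can be reproduced in code. The following is a minimal sketch, assuming Python and its standard urllib.parse module (neither is prescribed by this guide); it splits the example URL into the components labelled above.

```python
from urllib.parse import urlsplit

# Split the example URL into the components labelled in the diagram.
parts = urlsplit("https://example.com/path/to/page?query=value#section")

print(parts.scheme)    # protocol
print(parts.netloc)    # domain (would also hold port or credentials, if present)
print(parts.path)      # path
print(parts.query)     # query string, without the leading "?"
print(parts.fragment)  # fragment, without the leading "#"
```

A parser like this works on a single URL that has already been isolated; the regex patterns in the rest of this guide handle the step of finding URLs inside larger text.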

Basic URL Extraction Patterns

Simple HTTP/HTTPS Pattern

https?://[^\s]+

Extracts: https://example.com from "Visit https://example.com for more info"

Breakdown:

https?:// - Matches http:// or https://
[^\s]+ - One or more non-whitespace characters

Pros: Simple and fast
Cons: Includes trailing punctuation, doesn't validate URL format
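
To make the trade-off concrete, here is a small sketch in Python using the standard re module. The name SIMPLE_URL and the sample sentence are illustrative assumptions, not part of the original pattern list.

```python
import re

# The simple pattern: scheme followed by any run of non-whitespace characters.
SIMPLE_URL = re.compile(r"https?://[^\s]+")

text = "Visit https://example.com for more info. Docs: https://example.com/docs."
matches = SIMPLE_URL.findall(text)

# The second match keeps the sentence-ending period, which is the
# "trailing punctuation" drawback listed above.
print(matches)
```

The first URL comes out clean because it is followed by a space, while the second one picks up the final period.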

Including FTP and Other Protocols

(https?|ftp)://[^\s]+

Extracts: http://site.com, https://secure.com, ftp://files.com

Comprehensive Protocol Support

(https?|ftp|file|mailto|tel):[^\s]+

Extracts: Various URL schemes including mailto: and tel:
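
One thing to watch when using this pattern in Python: re.findall returns the contents of capturing groups rather than the whole match. The sketch below (sample text and variable names are assumptions for illustration) contrasts the pattern as written with a non-capturing variant.

```python
import re

text = "Write to mailto:team@example.com or browse https://example.com"

# Pattern as written above: the scheme sits in a capturing group.
capturing = r"(https?|ftp|file|mailto|tel):[^\s]+"
# Same pattern with a non-capturing group, so findall returns full matches.
non_capturing = r"(?:https?|ftp|file|mailto|tel):[^\s]+"

schemes_only = re.findall(capturing, text)
full_matches = re.findall(non_capturing, text)

print(schemes_only)
print(full_matches)
```

If the capturing form is needed for other reasons, re.finditer with match.group(0) is another way to get the full match.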

Improved URL Extraction

Match Complete URLs

https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?

Breakdown:

https?:\/\/ - http:// or https://
(?:www\.)? - Optional www. prefix
[a-zA-Z0-9-]+ - Domain name (letters, digits, hyphens)
\. - Literal dot
[a-zA-Z]{2,} - Top-level domain (2+ letters)
(?:\/[^\s]*)? - Optional path and query string

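Valid URLs:

https://example.com
http://www.example.com
https://example.com/path/to/page
http://example.com/?query=value

Note: the optional tail must start with /, so for http://example.com?query=value (no slash before the query) only http://example.com is matched.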

Advanced URL Pattern with Query Parameters

https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s?#]*)?(?:\?[^\s#]*)?(?:#[^\s]*)?

This pattern handles:

Path components
Query strings (?key=value)
Fragments (#section)

Extracts: https://example.com/path?query=value#section from full text
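
Because this pattern treats path, query and fragment as separate pieces, it lends itself to named groups. The following sketch is one possible arrangement in Python; the group names (host, path, query, fragment) and the sample sentence are assumptions added for illustration.

```python
import re

# Same structure as the pattern above, with each component in a named group.
URL_PARTS = re.compile(
    r"https?://(?:www\.)?(?P<host>[a-zA-Z0-9-]+\.[a-zA-Z]{2,})"
    r"(?P<path>/[^\s?#]*)?"
    r"(?:\?(?P<query>[^\s#]*))?"
    r"(?:#(?P<fragment>[^\s]*))?"
)

text = "See https://example.com/path?query=value#section for details"
m = URL_PARTS.search(text)

print(m.group(0))     # the full URL
print(m.groupdict())  # the individual components
```

Components that are absent from a URL come back as None in groupdict().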

URL Patterns with Character Validation

Strict URL Validation

https?:\/\/(?:www\.)?[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*\.[a-zA-Z]{2,}(?:\/[^\s]*)?

This pattern:

Validates domain name rules (max 63 characters per label)
Ensures domains don't start or end with hyphens
Supports subdomains (e.g., sub.example.com)
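
A quick way to check these three claims is to run the pattern with fullmatch against a few hand-picked candidates. This is a sketch in Python; the candidate list is made up for the demonstration.

```python
import re

STRICT_URL = re.compile(
    r"https?://(?:www\.)?[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
    r"\.[a-zA-Z]{2,}(?:/[^\s]*)?"
)

candidates = [
    "https://sub.example.com/page",  # subdomain: should pass
    "https://-bad.example.com",      # label starts with a hyphen: should fail
    "https://" + "a" * 64 + ".com",  # 64-character label: should fail
]

# fullmatch requires the whole candidate to fit the pattern.
results = {url: STRICT_URL.fullmatch(url) is not None for url in candidates}
for url, ok in results.items():
    print(ok, url[:40])
```

fullmatch is used here because the goal is validation of one candidate at a time; for extraction from running text, finditer or findall would be the usual choice.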

URL with Port Number

https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?::\d{1,5})?(?:\/[^\s]*)?

Extracts: https://example.com:8080/path

Breakdown:

(?::\d{1,5})? - Optional port number (1-5 digits)
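
Note that \d{1,5} accepts any number up to 99999, while valid ports end at 65535. One way to close that gap is to capture the port and check its range in code. The sketch below assumes Python; the helper name has_valid_port and the group name port are hypothetical.

```python
import re

URL_WITH_PORT = re.compile(
    r"https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}"
    r"(?::(?P<port>\d{1,5}))?(?:/[^\s]*)?"
)

def has_valid_port(url: str) -> bool:
    """Return True when the URL matches and its port, if any, is 1-65535."""
    m = URL_WITH_PORT.fullmatch(url)
    if m is None:
        return False
    port = m.group("port")
    return port is None or 1 <= int(port) <= 65535

print(has_valid_port("https://example.com:8080/path"))
print(has_valid_port("https://example.com:99999/path"))
print(has_valid_port("https://example.com/path"))
```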

URL with IP Address

https?:\/\/(?:\d{1,3}\.){3}\d{1,3}(?::\d{1,5})?(?:\/[^\s]*)?

Extracts: https://192.168.1.1:8080/path

Each octet is only checked for being 1-3 digits long, so out-of-range values such as 999.1.1.1 also match.
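
A regex on its own does not confirm that the four numbers form a real IPv4 address. The sketch below pairs a variant of the pattern (with the address in a capturing group) with Python's standard ipaddress module; the function name extract_ip_urls and the sample text are assumptions.

```python
import ipaddress
import re

# Variant of the IP pattern with the address captured in group 1.
IP_URL = re.compile(
    r"https?://((?:\d{1,3}\.){3}\d{1,3})(?::\d{1,5})?(?:/[^\s]*)?"
)

def extract_ip_urls(text: str) -> list[str]:
    """Return IP-based URLs whose host parses as a valid IPv4 address."""
    urls = []
    for m in IP_URL.finditer(text):
        try:
            ipaddress.IPv4Address(m.group(1))
        except ipaddress.AddressValueError:
            continue  # e.g. an octet above 255
        urls.append(m.group(0))
    return urls

text = "Router at https://192.168.1.1:8080/path and a typo at http://999.1.1.1/x"
print(extract_ip_urls(text))
```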

Special URL Patterns

URLs with Authentication

https?:\/\/[^:\s]+:[^@\s]+@[^\s]+

Extracts: https://user:password@example.com
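
Once such a URL has been found, the credentials can be separated from the host with a URL parser rather than more regex. A minimal sketch in Python follows; the sample text, the credentials in it, and the variable names are invented for illustration.

```python
import re
from urllib.parse import urlsplit

AUTH_URL = re.compile(r"https?://[^:\s]+:[^@\s]+@[^\s]+")

text = "Staging lives at https://user:secret@example.com/admin for now"

found = []
for url in AUTH_URL.findall(text):
    parts = urlsplit(url)
    # urlsplit exposes the credential and host pieces as attributes.
    found.append((parts.username, parts.password, parts.hostname))

print(found)
```

In log analysis this kind of match is often used to detect and redact credentials, so it is worth avoiding printing the password in real code.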

Relative URLs

\/[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?

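Extracts: /path/to/page and /path?query=value (without domain). Applied to text that also contains absolute URLs, it matches their slash-led portion too (for example //example.com/path), so use it only where links are known to be relative.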

Data URLs

data:[^,\s]+,[^,\s]+

Extracts: data:text/plain;base64,SGVsbG8=
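
As an illustration of what can be done with a matched data URL, the sketch below captures the header and payload separately and decodes the payload when the header marks it as base64. It assumes Python; the added capturing groups and the surrounding sample text are not part of the original pattern.

```python
import base64
import re

# Same pattern as above, with the header and payload in capturing groups.
DATA_URL = re.compile(r"data:([^,\s]+),([^,\s]+)")

m = DATA_URL.search("Inline payload: data:text/plain;base64,SGVsbG8= (end)")
header, payload = m.group(1), m.group(2)

decoded = None
if header.endswith(";base64"):
    # Only decode when the header declares base64 encoding.
    decoded = base64.b64decode(payload).decode("utf-8")

print(header)
print(decoded)
```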

Extracting URLs from Complex Text

Extract All URLs (Multiple Types)

(?:https?|ftp):\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?|(?:mailto|tel):[^\s]+

This pattern extracts:

HTTP/HTTPS URLs
FTP URLs
mailto: links
tel: links

URL Extraction with Punctuation Handling

https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s\)\]\}>"]*)?

Stops extraction at common closing punctuation: ), ], }, >, "

Example: "Visit (https://example.com/page) now!" → Extracts https://example.com/page
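
The effect is easiest to see on URLs that carry a path and are wrapped in brackets or quotes. Here is a short sketch in Python; the name PUNCT_AWARE and the sample sentence are assumptions.

```python
import re

# The path part stops at whitespace and at ) ] } > and double quotes.
PUNCT_AWARE = re.compile(
    r'https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s)\]}>"]*)?'
)

text = (
    'See (https://example.com/docs), [https://example.com/api] '
    'or "https://example.com/faq".'
)
matches = PUNCT_AWARE.findall(text)
print(matches)
```

The trade-off is that URLs which legitimately contain a closing parenthesis in the path will be cut short by this pattern.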

Code Examples

JavaScript URL Extraction

function extractUrls(text) {
  const urlRegex = /https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?/g;
  const urls = text.match(urlRegex);
  return urls || [];
}

const text = "Visit https://example.com and http://www.test.com for more info";
const urls = extractUrls(text);
console.log(urls);
// Output: ["https://example.com", "http://www.test.com"]

Python URL Extraction

import re

def extract_urls(text):
    url_pattern = r'https?://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?'
    return re.findall(url_pattern, text)

text = "Visit https://example.com and http://www.test.com"
urls = extract_urls(text)
print(urls)
# Output: ['https://example.com', 'http://www.test.com']

URL Extraction with Validation

function extractAndValidateUrls(text) {
  const urlRegex = /https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?/g;
  const urls = text.match(urlRegex) || [];
  
  // Validate each URL
  return urls.filter(url => {
    try {
      new URL(url);
      return true;
    } catch {
      return false;
    }
  });
}

Advanced Use Cases

Extract URLs from HTML

href=["'](https?:[^"']+)["']

Extracts: https://example.com from <a href="https://example.com">Link</a>
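
A short sketch of this pattern in Python is shown below. The HTML snippet is made up, and for anything beyond simple, well-formed markup a real HTML parser is generally a safer choice than regex.

```python
import re

# Group 1 holds the URL between the quotes of the href attribute.
HREF = re.compile(r"""href=["'](https?:[^"']+)["']""")

html = (
    '<a href="https://example.com">Link</a> '
    "<a href='http://test.com/a'>A</a> "
    '<a href="/relative">R</a>'
)
links = HREF.findall(html)
print(links)
```

The relative link is skipped because the pattern requires an http: or https: prefix inside the attribute.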

Extract URLs from Markdown

\[([^\]]+)\]\((https?:[^)]+)\)

Captures:

Group 1: Link text
Group 2: URL

Extracts from: [Link](https://example.com) → URL: https://example.com, Text: Link
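
The two groups can be collected as (text, URL) pairs. A minimal sketch in Python, with an invented Markdown sentence as input:

```python
import re

MD_LINK = re.compile(r"\[([^\]]+)\]\((https?:[^)]+)\)")

markdown = (
    "Read the [docs](https://example.com/docs) "
    "or the [FAQ](https://example.com/faq)."
)

# Group 1 is the link text, group 2 is the URL.
links = [(m.group(1), m.group(2)) for m in MD_LINK.finditer(markdown)]
print(links)
```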

Extract URLs from Social Media Posts

https?:\/\/(?:www\.)?(?:twitter|facebook|instagram|linkedin)\.com\/[^\s]+

Extracts: https://twitter.com/user/status/123456789
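
If the list of sites changes often, the alternation can be generated instead of hard-coded. The sketch below is one way to do that in Python; build_domain_pattern is a hypothetical helper, and it takes full domain names rather than the bare site names used in the pattern above.

```python
import re

def build_domain_pattern(domains):
    """Compile a URL pattern limited to the given domain names."""
    # re.escape keeps the dots in each domain literal.
    alternation = "|".join(re.escape(d) for d in domains)
    return re.compile(rf"https?://(?:www\.)?(?:{alternation})/[^\s]+")

pattern = build_domain_pattern(["twitter.com", "linkedin.com"])
text = (
    "Posts: https://twitter.com/user/status/123456789 "
    "and https://example.com/x"
)
matches = pattern.findall(text)
print(matches)
```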

Best Practices

1. Use the Global Flag

// WRONG: Only finds first URL
const firstOnly = text.match(/https?:\/\/[^\s]+/);

// RIGHT: Finds all URLs
const allUrls = text.match(/https?:\/\/[^\s]+/g);

2. Validate After Extraction

// Extract with regex (match() returns null when nothing is found)
const urls = text.match(urlRegex) || [];

// Validate with URL constructor
urls.forEach(url => {
  try {
    const parsed = new URL(url);
    console.log('Valid:', parsed.hostname);
  } catch (e) {
    console.log('Invalid:', url);
  }
});

3. Handle Trailing Punctuation

// Clean up trailing punctuation
function cleanUrl(url) {
  return url.replace(/[.,;!?]+$/, '');
}

// match() returns null when nothing is found, so fall back to an empty array
const urls = (text.match(urlRegex) || []).map(cleanUrl);

4. Consider Performance

// Simple patterns are faster for large texts
const simpleRegex = /https?:\/\/[^\s]+/g;

// Complex patterns are more accurate but slower
const complexRegex = /https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?/g;

Common Pitfalls

Pitfall 1: Including Trailing Punctuation

// BAD: [^\s]+ swallows the trailing period
const bad = "Visit https://example.com.".match(/https?:\/\/[^\s]+/g);
// Extracts: ["https://example.com."]

// GOOD: Strip trailing punctuation after matching
const good = bad.map(url => url.replace(/[.,;!?]+$/, ''));
// Extracts: ["https://example.com"]

Pitfall 2: Not Handling Subdomains

// BAD: Only matches example.com
https?:\/\/[a-zA-Z0-9-]+\.[a-zA-Z]{2,}

// GOOD: Matches sub.example.com
https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}

Pitfall 3: Missing Protocol

// BAD: Requires protocol
https?:\/\/[^\s]+

// GOOD: Matches URLs with or without protocol
// (also matches dotted tokens such as file.txt, so filter the results)
(?:https?:\/\/)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?
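
URLs found without a protocol usually need one added before they can be parsed or fetched. The sketch below shows one way to do that in Python; the helper name normalise and the default scheme of https are assumptions, not something the pattern itself implies.

```python
import re

LOOSE_URL = re.compile(r"(?:https?://)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?")

def normalise(url, default_scheme="https"):
    """Prefix a scheme when the matched URL does not already have one."""
    if re.match(r"https?://", url):
        return url
    return f"{default_scheme}://{url}"

text = "Try example.com/start or http://test.com today"
urls = [normalise(u) for u in LOOSE_URL.findall(text)]
print(urls)
```

Because the loose pattern accepts any word.word token, it is worth validating the normalised results before using them.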

Testing Your URL Extraction

Use our interactive Match Finder with these test cases:

Text with URLs:

Visit https://example.com and http://www.test.com/path
for more info. Also check ftp://files.com

Expected Extracted URLs:

https://example.com
http://www.test.com/path
ftp://files.com

Edge Cases:

URL with port: https://example.com:8080
URL with query: https://example.com?query=value
URL with fragment: https://example.com#section
URL with auth: https://user:password@example.com
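
These test cases can also be run as a small script. The sketch below, in Python, checks the multi-type pattern from earlier in this guide against the sample text and then reports which edge cases it covers in full; the variable names are illustrative.

```python
import re

# Multi-type pattern from "Extract All URLs (Multiple Types)".
MULTI = re.compile(
    r"(?:https?|ftp)://(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:/[^\s]*)?"
    r"|(?:mailto|tel):[^\s]+"
)

text = (
    "Visit https://example.com and http://www.test.com/path\n"
    "for more info. Also check ftp://files.com"
)
expected = ["https://example.com", "http://www.test.com/path", "ftp://files.com"]
found = MULTI.findall(text)
print(found == expected)

edge_cases = [
    "https://example.com:8080",
    "https://example.com?query=value",
    "https://example.com#section",
]
# fullmatch shows whether the whole URL is covered, not just a prefix.
coverage = {url: MULTI.fullmatch(url) is not None for url in edge_cases}
for url, ok in coverage.items():
    print(url, "fully matched" if ok else "only partly matched")
```

Here the sample text passes, but none of the three edge cases is matched in full, which is the signal to switch to the port-aware or query-aware patterns for inputs like these.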

Conclusion

URL extraction with regex is about finding the right balance between simplicity and accuracy. The pattern https?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)? provides a good balance for most applications.

Remember to:

Use the global flag to find all URLs
Validate URLs after extraction
Handle trailing punctuation
Consider your specific use case (web scraping, log analysis, etc.)

For complex URL validation, consider using a dedicated URL parsing library in combination with regex for extraction.

Experiment with different patterns using our Regex Tester to find the perfect fit for your URL extraction needs!

