Extracting Multiple URLs from HTML and JS Source Code Made Easy

Learn how to extract URLs from your HTML and JS source code, even when multiple URLs are present on the same line, using simple shell tools.
---

Visit these links for the original content and further details, such as alternate solutions, the latest updates on the topic, comments, and revision history. The original title of the question was: Extract URLs from HTML and JS source, with multiple on a line

If anything seems off to you, please feel free to write me at vlogize [AT] gmail [DOT] com.
---
Extracting Multiple URLs from HTML and JS Source Code Made Easy

If you've ever found yourself needing to list all the domains referenced in your HTML and JavaScript source code, you’re not alone! Many developers face the challenge of extracting URLs, especially when they appear multiple times on a single line. The good news is, with a few simple commands, you can get the job done effectively.

The Problem

When attempting to extract URLs that begin with https?://, a common method using sed may fail to return all domains present on a line containing multiple URLs. For instance, the initial command often results in only one URL being captured due to the greedy nature of the regular expression used. Here’s a sample command that many developers might use:

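The exact snippet isn't reproduced here, but a plausible sketch of such a sed attempt (assuming the same find invocation that appears later in this article, and BSD/GNU sed with -E extended regexes) would be:

find -s [^.]* -print0 | xargs -0 sed -nE 's|.*(https?://[a-z0-9-._]+).*|\1|p'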

The Issue

The problem arises because the greedy .* at both the beginning and end of the substitution consumes everything around a single match: the leading wildcard swallows any earlier URLs on the line, and s/// substitutes only once per line. The result is at most one URL per line, which can be frustrating when trying to audit or document external links.
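To see the behavior concretely, run the sed sketch above against a line containing two URLs (illustrative input):

echo 'see https://example.com/a and https://example.org/b' | sed -nE 's|.*(https?://[a-z0-9-._]+).*|\1|p'

This prints only https://example.org: the first URL disappears into the leading .*, and because the substitution fires just once per line, nothing else is reported.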

The Solution

To tackle this challenge, we can switch from sed to Perl. Perl makes it straightforward to extract every match from a single line, so no URL is lost.

Step-by-Step Explanation

Use a New Regex Pattern:

Change the previous regex to https?://([a-z0-9-._]+), which captures just the domain portion of each URL with no greedy .* wrappers around it.

Iterate Through Matches:

Use the while m{...}g construct to find all matching URLs on a line.

Printing Results:

print $1 outputs each captured domain as it is matched; the sketch after this list shows the pieces working together.
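Putting those three steps together, a minimal standalone sketch (with an illustrative input line) looks like this:

echo 'see https://example.com/a and https://example.org/b' | perl -nle 'print $1 while m{https?://([a-z0-9-._]+)}g'

This prints example.com and example.org on separate lines, confirming that every URL on the line is captured, not just one.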

Revised Command

Here's how the revised command looks using Perl (reconstructed here from the breakdown that follows):

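find -s [^.]* -print0 | xargs -0 perl -nle 'print $1 while m{https?://([a-z0-9-._]+)}g'

Note that find -s is a BSD/macOS extension that traverses entries in lexicographical order; on GNU find you can drop the flag or sort the output afterwards instead.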

Breakdown of the Perl Command

find -s [^.]* -print0: searches every non-hidden entry in the current directory (the [^.]* glob skips dotfiles), printing NUL-terminated file names so unusual characters in names can't break the pipe.

xargs -0 perl -nle: feeds that NUL-separated file list to Perl, which loops over every input line (-n), handles line endings (-l), and runs the one-liner supplied with -e.

print $1 while m{https?://([a-z0-9-._]+)}g: for each line, prints the captured domain of every match, not just the first.
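For instance, given a source file containing a line such as (illustrative content, not from the video):

<a href="https://example.com/page">link</a> fetch("https://api.example.org/v1");

the pipeline prints:

example.com
api.example.org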

Advantages of This Approach

Efficiency: Grabs all URLs on a line without missing any.

Simplicity: Uses familiar commands, making it easier to implement.

Cleaner Results: With sort -u appended, it outputs unique domains, sorted for easier reference.
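If you want that deduplicated, sorted list, the usual final step is to append the standard sort -u to the command above:

find -s [^.]* -print0 | xargs -0 perl -nle 'print $1 while m{https?://([a-z0-9-._]+)}g' | sort -u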

Conclusion

In summary, extracting multiple URLs from HTML and JavaScript can be achieved effectively with a simple change from sed to perl. By dropping the greedy .* wrappers and leveraging Perl's ability to iterate through every match on a line, you can ensure that no URLs fall through the cracks.

Now, you can confidently extract all the domains your source code refers to, streamlining your workflow and ensuring you get the complete list you need!