I want to extract the URLs from the anchor tags of an HTML file. This needs to be done in Bash using sed/awk. No Perl, please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed):
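A minimal sketch of the lynx approach, assuming lynx is on the PATH. The function name `list_links_lynx` is made up for illustration; `-dump` and `-listonly` are standard lynx options, and the trailing sed only strips the numbering from the reference list that lynx prints.

```shell
# Sketch: list the links lynx finds in a local HTML file.
# list_links_lynx is a hypothetical helper name, not part of lynx.
list_links_lynx() {
  if ! command -v lynx >/dev/null 2>&1; then
    echo "lynx is not installed" >&2
    return 127
  fi
  # -dump writes the rendered page to stdout; -listonly keeps only the link list.
  # The sed step keeps the numbered entries and drops the " 1. " style prefix.
  lynx -dump -listonly "$1" | sed -n 's/^ *[0-9][0-9]*\. //p'
}

# Usage (page.html is a placeholder path):
#   list_links_lynx page.html
```

Since lynx actually parses the page, this tends to cope better with multi-line tags and unusual quoting than a regex does. Relative links in a local file will likely be reported as resolved `file://` URLs.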
You asked for it:
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
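For illustration, a crude sed sketch along these lines (not necessarily the command this answer used). It assumes double-quoted `href` values and at most one anchor per line; the function name `extract_href_sed` is invented here.

```shell
# Sketch: print the href value of an anchor tag, one result per input line.
# Assumes double-quoted attribute values; because the leading .* is greedy,
# only the last anchor on a line is reported.
extract_href_sed() {
  sed -n 's/.*<a[^>]*[[:space:]]href="\([^"]*\)".*/\1/p' "$@"
}

# Usage (page.html is a placeholder path):
#   extract_href_sed page.html
#   some_command | extract_href_sed
```

It is case-sensitive, ignores single-quoted or unquoted attributes, and misses tags that span several lines, which is exactly the kind of breakage the warning above is about.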
An example, since you didn't provide any sample
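One possible illustration with awk, using a made-up sample line; this is a sketch and may differ from what this answer originally showed. It loops with `match()` so that several anchors on the same line are all reported. The function name `extract_hrefs_awk` and the sample HTML are assumptions.

```shell
# Sketch: print every double-quoted href found in an <a ...> tag,
# including several anchors on one line.
extract_hrefs_awk() {
  awk '{
    line = $0
    # Find the next "<a ... href="value"" prefix in what is left of the line.
    while (match(line, /<a [^>]*href="[^"]*"/)) {
      tag  = substr(line, RSTART, RLENGTH)
      line = substr(line, RSTART + RLENGTH)   # continue after this match
      sub(/.*href="/, "", tag)                # drop everything up to the value
      sub(/"$/, "", tag)                      # drop the closing quote
      print tag
    }
  }' "$@"
}

# Hypothetical sample input with two anchors on one line.
printf '%s\n' '<li><a href="/one">One</a> and <a id="k" href="http://example.com/two">Two</a></li>' \
  | extract_hrefs_awk
```

On the sample line this prints `/one` and `http://example.com/two` on separate lines. Like any regex approach, it does not handle tags split across lines or attributes quoted with single quotes.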
You can do it quite easily with the following regex, which is quite good at finding URLs:
I took it from John Gruber's article on how to find URLs in text. That lets you find all URLs in a file f.html as follows:
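As a rough sketch of the same idea, here is a much simpler pattern than Gruber's (it is not a reproduction of his regex). It only finds absolute `http://` and `https://` URLs and assumes a grep that supports `-o`, as GNU and BSD grep do. The function name `find_urls` is invented for this example.

```shell
# Sketch: print every absolute http(s) URL found in the input, one per line.
# A URL is taken to end at whitespace, a quote, or an angle bracket.
find_urls() {
  grep -ohE "https?://[^\"'<>[:space:]]+" "$@"
}

# Usage with the file name from the answer above:
#   find_urls f.html
```

Because it looks at the raw text rather than at `href` attributes, it also picks up URLs in plain text, comments, and scripts, and it can include trailing punctuation such as a closing period.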
With Xidel, an HTML/XML data extraction tool, this can be done via:
With conversion to absolute URLs:
I made a few changes to Greg Bacon's solution.
This fixes two problems:
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this. OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
You can try: