
I want to extract the URLs from within the anchor tags of an HTML file. This needs to be done in bash using sed or awk. No Perl, please.

What is the easiest way to do this?

Read this and be enlightened: stackoverflow.com/questions/1732348/… –  Dennis Williamson Dec 10 '09 at 14:44
If you don't mind that there is no guarantee you find all URLs, or that all the URLs you find are valid, use one of the examples below. If you do mind, use an appropriate tool for the job (Perl, Python, Ruby). –  Nifle Dec 10 '09 at 14:59
My previous comment is of course for any easy solution you might try. awk is powerful enough to do the job; heck, you could theoretically implement perl in awk... –  Nifle Dec 10 '09 at 15:02
Is this like one of those survivor challenges, where you have to live for three days eating only termites? If not, seriously, why the restriction? Every modern system can install at least Perl, and from there, you have the whole web. –  Randal Schwartz Dec 21 '09 at 2:33

9 Answers

You could also do something like this (provided you have lynx installed):

lynx -dump -listonly my.html
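The listing that lynx emits is a numbered "References" section; if you want bare URLs, one line of awk strips the numbering. A small sketch, assuming lynx's usual "   1. http://..." layout, which can vary between versions:

```shell
# lynx emits a numbered "References" list; awk strips the "N." prefix.
# Assumes the usual "   1. http://..." layout, which may vary by lynx version.
lynx -dump -listonly my.html | awk '/^ *[0-9]+\./ { print $2 }'
```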
Thanks for this, very useful! –  Alberto Zaccagni Feb 19 '11 at 16:55
Very useful! Thanks! –  SiddharthaRT Dec 9 '13 at 8:50
Always fun to fire up lynx! –  wprl May 1 at 16:50
In Lynx 2.8.8 this has become lynx -dump -hiddenlinks=listonly my.html –  condit May 7 at 22:17

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

Almost perfect, but what about these two cases? 1. You are matching only anchors that start with href, so <a title="Title" href="sample">Match me</a> is missed. 2. What if there are two anchors on the same line? I made these modifications to the original solution: cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a/\n<a/g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' –  Crisboot Aug 6 '12 at 10:23

An example, since you didn't provide any sample:

awk 'BEGIN{
  RS="</a>"        # treat each anchor as one record (gawk: regex RS)
  IGNORECASE=1     # case-insensitive matching (a gawk extension)
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/ ){
      gsub(/.*href=\042/,"",$o)   # strip up to the opening double quote (\042)
      gsub(/\042.*/,"",$o)        # strip the closing quote and everything after
      print $o
    }
  }
}' index.html
Does this work for '<a href="aktuell.de.selfhtml.org/"; target="_blank">SELFHTML aktuell</a>' –  malach Dec 10 '09 at 14:40
If I say it works (maybe not 100%, but 99.99% of the time), would you believe it? :) The best is to try it out yourself on various pages and see. –  ghostdog74 Dec 10 '09 at 14:54
This really did the work, many thanks for this great awk bundle! –  SomniusX Jul 1 at 8:38
grep "<a href=" sourcepage.html |
  sed "s/<a href/\\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq
  1. The first grep looks for lines containing URLs. You can add more elements after it if you want to look only at local pages, so no http, just relative paths.
  2. The first sed adds a newline in front of each <a href tag.
  3. The second sed shortens each URL after the 2nd " in the line by replacing it with the closing </a> tag and a newline. Both seds give you each URL on its own line, but there is garbage, so:
  4. The second grep href cleans the mess up.
  5. sort and uniq give you one instance of each URL present in sourcepage.html.
Nice break down of what each step should do. –  Jeremy J Starcher Sep 20 '12 at 6:52

You can do it quite easily with the following regex, which is good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows:

cat f.html | grep -o \
    -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
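Note that \w, \d and \s are Perl-style character classes that POSIX ERE does not define, so plain grep -E may not treat them as intended. If your grep has PCRE support (GNU grep's -P flag), the pattern behaves more faithfully, and the cat is unnecessary. A sketch, assuming GNU grep built with PCRE:

```shell
# \w, \d and \s are Perl character classes; GNU grep needs -P (PCRE) for them.
grep -oP '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html
```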
complicated, and fails when href is like this: ... HREF="somewhere.com/"; ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ... –  ghostdog74 Dec 10 '09 at 14:35
I tried it on the daringfireball page itself and it found all links. Other solutions may fail because href= could be somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar. –  nes1983 Dec 10 '09 at 14:45
You don't need to have a cat before the grep. Just put f.html at the end of the grep. –  monksy Apr 13 '12 at 5:10
And grep -o can fail due to a bug in some versions of grep. –  kisp Aug 23 '13 at 21:45

With Xidel, an HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/concat(resolve-uri(@href, base-uri()))" http://example.com/

I made a few changes to Greg Bacon's solution:

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. It matches anchors where href is not the first attribute.
  2. It covers the possibility of several anchors on the same line.
But at least it solves the problem; none of the other solutions do. –  Crisboot Aug 6 '12 at 12:30

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
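For the common case of double-quoted href attributes, a two-stage grep/sed pass is often enough. A minimal sketch (it assumes double quotes, so single-quoted or unquoted attributes are missed):

```shell
# Pull out href="..." attributes anywhere on a line, then strip the syntax.
# Only double-quoted hrefs are handled; single-quoted or unquoted ones are missed.
grep -o 'href="[^"]*"' index.html | sed 's/^href="//; s/"$//'
```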


You can try:

curl --silent -u "<username>:<password>" "http://<NAGIOS_HOST>/nagios/cgi-bin/status.cgi" |
  grep 'extinfo.cgi?type=1&host=' |
  grep "status" |
  awk -F'</A>' '{print $1}' |
  awk -F"'>" '{print $3"\t"$1}' |
  sed 's/<\/a>&nbsp;<\/td>//g' |
  column -c2 -t |
  awk '{print $1}'
Please format your code! –  poplitea Mar 11 '13 at 18:37
