
I want to extract the URLs from within the anchor tags of an HTML file. This needs to be done in bash using sed or awk. No Perl, please.

What is the easiest way to do this?

Read this and be enlightened: stackoverflow.com/questions/1732348/… –  Dennis Williamson Dec 10 '09 at 14:44
If you don't mind that there is no guarantee you find all URLs, or that all the URLs you find are valid, use one of the examples below. If you do mind, use an appropriate tool for the job (Perl, Python, Ruby). –  Nifle Dec 10 '09 at 14:59
My previous comment is of course for any easy solution you might try. awk is powerful enough to do the job; heck, you could theoretically implement perl in awk... –  Nifle Dec 10 '09 at 15:02
Is this like one of those survivor challenges, where you have to live for three days eating only termites? If not, seriously, why the restriction? Every modern system can install at least Perl, and from there, you have the whole web. –  Randal Schwartz Dec 21 '09 at 2:33

9 Answers

You could also do something like this (provided you have lynx installed):

lynx -dump -listonly my.html
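The listing that lynx emits is a numbered "References" section; if you want bare URLs, one line of awk strips the numbering. A small sketch, assuming lynx's usual "   1. http://..." layout, which can vary between versions:

```shell
# lynx emits a numbered "References" list; awk strips the "N." prefix.
# Assumes the usual "   1. http://..." layout, which may vary by lynx version.
lynx -dump -listonly my.html | awk '/^ *[0-9]+\./ { print $2 }'
```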
Thanks for this, very useful! –  Alberto Zaccagni Feb 19 '11 at 16:55
Very useful! Thanks! –  SiddharthaRT Dec 9 '13 at 8:50
Always fun to fire up lynx! –  wprl May 1 at 16:50
In Lynx 2.8.8 this has become lynx -dump -hiddenlinks=listonly my.html –  condit May 7 at 22:17

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

Almost perfect, but what about these two cases? 1. You are matching only anchors that start with href, so <a title="Title" href="sample">Match me</a> is missed. 2. What if there are two anchors on the same line? I made these modifications to the original solution: cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a/\n<a/g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' –  Crisboot Aug 6 '12 at 10:23

An example, since you didn't provide any sample:

awk 'BEGIN{
  RS="</a>"        # treat each anchor as one record (gawk: regex RS)
  IGNORECASE=1     # case-insensitive matching (a gawk extension)
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/ ){
      gsub(/.*href=\042/,"",$o)   # strip up to the opening double quote (\042)
      gsub(/\042.*/,"",$o)        # strip the closing quote and everything after
      print $o
    }
  }
}' index.html
Does this work for '<a href="aktuell.de.selfhtml.org/"; target="_blank">SELFHTML aktuell</a>' –  malach Dec 10 '09 at 14:40
If I say it works (maybe not 100%, but 99.99% of the time), would you believe it? :) The best is to try it out yourself on various pages and see. –  ghostdog74 Dec 10 '09 at 14:54
This really did the work, many thanks for this great awk bundle! –  SomniusX Jul 1 at 8:38
grep "<a href=" sourcepage.html |
  sed "s/<a href/\\n<a href/g" |
  sed 's/\"/\"><\/a>\n/2' |
  grep href |
  sort | uniq
  1. The first grep looks for lines containing URLs. You can add more elements after it if you want to look only at local pages, so no http, just relative paths.
  2. The first sed adds a newline in front of each <a href tag.
  3. The second sed shortens each URL after the 2nd " in the line by replacing it with the closing </a> tag and a newline. Both seds give you each URL on its own line, but there is garbage, so:
  4. The second grep href cleans the mess up.
  5. sort and uniq give you one instance of each URL present in sourcepage.html.
Nice break down of what each step should do. –  Jeremy J Starcher Sep 20 '12 at 6:52

You can do it quite easily with the following regex, which is good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows:

cat f.html | grep -o \
    -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
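Note that \w, \d and \s are Perl-style character classes that POSIX ERE does not define, so plain grep -E may not treat them as intended. If your grep has PCRE support (GNU grep's -P flag), the pattern behaves more faithfully, and the cat is unnecessary. A sketch, assuming GNU grep built with PCRE:

```shell
# \w, \d and \s are Perl character classes; GNU grep needs -P (PCRE) for them.
grep -oP '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html
```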
complicated, and fails when href is like this: ... HREF="somewhere.com/"; ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ... –  ghostdog74 Dec 10 '09 at 14:35
I tried it on the daringfireball page itself and it found all links. Other solutions may fail because href= could be somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar. –  nes1983 Dec 10 '09 at 14:45
You don't need to have a cat before the grep. Just put f.html at the end of the grep. –  monksy Apr 13 '12 at 5:10
And grep -o can fail due to a bug in some versions of grep. –  kisp Aug 23 '13 at 21:45

With Xidel, an HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/concat(resolve-uri(@href, base-uri()))" http://example.com/

I made a few changes to Greg Bacon's solution:

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. It matches anchors where href is not the first attribute.
  2. It covers the possibility of several anchors on the same line.
But at least it solves the problem; none of the other solutions do. –  Crisboot Aug 6 '12 at 12:30

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
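For the common case of double-quoted href attributes, a two-stage grep/sed pass is often enough. A minimal sketch (it assumes double quotes, so single-quoted or unquoted attributes are missed):

```shell
# Pull out href="..." attributes anywhere on a line, then strip the syntax.
# Only double-quoted hrefs are handled; single-quoted or unquoted ones are missed.
grep -o 'href="[^"]*"' index.html | sed 's/^href="//; s/"$//'
```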


You can try:

curl --silent -u "<username>:<password>" "http://<NAGIOS_HOST>/nagios/cgi-bin/status.cgi" |
  grep 'extinfo.cgi?type=1&host=' |
  grep "status" |
  awk -F'</A>' '{print $1}' |
  awk -F"'>" '{print $3"\t"$1}' |
  sed 's/<\/a>&nbsp;<\/td>//g' |
  column -c2 -t |
  awk '{print $1}'
Please format your code! –  poplitea Mar 11 '13 at 18:37
