RegEx parse XML file and return only url with file extension

Question

I have the following XML file, which has been converted to a string. I want to parse it and return only the URLs that end in a file extension.

<?xml version="1.0" encoding="utf-8"?>
<channel>
  <generator>GOA Celebrations</generator>
  <generator>GOA Celebrations</generator>
  <pubDate>13 Jan 2016</pubDate>
  <title>GOA Celebrations</title>
  <link>http://goframe.com.au/rss/herston-inbound</link>
  <language>en</language>
  <image>
    <url>http://goafame.com.au/site/templates/img/logo.png</url>
  </image>
  <item>
    <title>Herston Inbound - Wednesday, 13 Jan - Alex Peisker - 2015-12-21 09:26am - Live</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="http://goafame.com.au/site/assets/files/2991/final_cropped_shutterstock_114908098.jpg" length="" type="" />
    <guid isPermaLink="false">c5c1cb0bebd56ae38817b251ad72bedb</guid>
  </item>
  <item>
    <title>Dog 1</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="http://animaliaz-life.com/data_images/dog/dog4.jpg" length="" type="" />
  </item>
  <item>
    <title>Dog 2</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="http://cdn1.theodysseyonline.com/files/2015/12/21/6358631429926013411708851658_Dog-Pictures.jpg" length="" type="" />
  </item>
  <item>
    <title>Dog 3</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg" length="" type="" />
  </item>
</channel>

My question is how to do this using a regular expression.

The desired output would be

http://goafame.com.au/site/templates/img/logo.png
http://goafame.com.au/site/assets/files/2991/final_cropped_shutterstock_114908098.jpg
http://animaliaz-life.com/data_images/dog/dog4.jpg
http://cdn1.theodysseyonline.com/files/2015/12/21/6358631429926013411708851658_Dog-Pictures.jpg
https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg

Here's what I've tried so far:

Regex linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);

string rawString = doc.ToString();
int posCounter = 0;
foreach (Match m in linkParser.Matches(rawString))
{
    posCounter++;
    links.Add(new LinkModel
    {
        IsSelected = false,
        XmlLink = m.Value,
        NodePosition = posCounter
    });
}

Note: the XML file can come from any source, and some URLs are not located in the link element. Some are even nested. That's why I thought of using RegEx rather than XDocument.

Why do you need regex? Using XPath it would be one line of code :) - blogprogramisty.net, Mar 4 '16 at 7:37
Don't use RegEx for this. Just use an XML parser. XDocument for example. - Hein Andre Grønnestad, Mar 4 '16 at 7:38
IMHO parsing XML using regular expressions is not a good idea. I'd go for DOM or SAX. - Milan Tomeš, Mar 4 '16 at 7:39
Is it possible to get the URLs using XDocument (note that the XML files are dynamic; some of the URLs may not come from the enclosure tag)? - Prince Jea, Mar 4 '16 at 7:43

tdat00 · Accepted Answer · 2016-03-04 08:23:51Z

This pattern matches your sample. Tested at http://regexstorm.net/tester

https?://[^\s<"]+/[^\s<"]+(?:\.\w{3,4})

The idea is to find all links that contain a slash character (/) followed by something that looks like a file name, that is, text ending in a dot and a 3- or 4-character extension.
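As a rough illustration of how the pattern above could be applied in C#, here is a minimal sketch. The class name `FileUrlRegexDemo`, the `Extract` helper, and the shortened sample string are assumptions made for this example; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileUrlRegexDemo
{
    // Pattern from the answer: scheme, some host/path text, a slash,
    // then more text ending in a dot plus 3 or 4 word characters.
    // [^\s<"] keeps the match from running into markup or closing quotes.
    private static readonly Regex FileUrl = new Regex(
        @"https?://[^\s<""]+/[^\s<""]+(?:\.\w{3,4})",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every match in the raw XML string.
    public static List<string> Extract(string xml)
    {
        return FileUrl.Matches(xml)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToList();
    }

    public static void Main()
    {
        // Shortened sample modelled on the feed in the question.
        string xml =
            "<channel>" +
            "<link>http://goframe.com.au/rss/herston-inbound</link>" +
            "<image><url>http://goafame.com.au/site/templates/img/logo.png</url></image>" +
            "<item><link>http://goafame.com.au/gallery/entry/2991/</link>" +
            "<enclosure url=\"https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg\" length=\"\" type=\"\" /></item>" +
            "</channel>";

        // Prints the logo.png and maxresdefault.jpg URLs; the two plain links are skipped.
        foreach (string url in Extract(xml))
        {
            Console.WriteLine(url);
        }
    }
}
```

One limitation worth keeping in mind: the pattern does not require the extension to sit at the end of the URL, so a link such as http://example.com/v1.234/page would be matched partially, as http://example.com/v1.234.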

Raja Nadar · Accepted Answer · 2016-03-04 07:45:52Z

var allLinkValues = XDocument.Parse(doc.ToString())
                             .Root
                             .Elements("item")
                             .Select(itemElement => itemElement.Element("link").Value)
                             .ToList();

Here, XDocument.Parse(doc.ToString()) loads the document and Root points to the root element. We then select all the "item" elements and take the value of each one's "link" element.

That's the power of LINQ to XML!

XPath, XmlDocument are your other options.
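For comparison, a minimal sketch of the XPath route using XmlDocument might look like the following. The class name `XPathUrlDemo`, the `Extract` helper, and the XPath expression are assumptions chosen to fit the sample feed in the question; a different feed would need a different expression.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathUrlDemo
{
    // Hypothetical helper: returns the text of every node selected by the XPath expression.
    public static List<string> Extract(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var values = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            // InnerText works for both element nodes and attribute nodes.
            values.Add(node.InnerText);
        }
        return values;
    }

    public static void Main()
    {
        // Shortened sample modelled on the feed in the question.
        string xml =
            "<channel>" +
            "<image><url>http://goafame.com.au/site/templates/img/logo.png</url></image>" +
            "<item><link>http://goafame.com.au/gallery/entry/2991/</link>" +
            "<enclosure url=\"https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg\" length=\"\" type=\"\" /></item>" +
            "</channel>";

        // Union of the <url> element under <image> and the url attribute of <enclosure>.
        foreach (string url in Extract(xml, "//image/url | //enclosure/@url"))
        {
            Console.WriteLine(url);
        }
    }
}
```

This only finds URLs in the places the expression names, so it suits feeds whose layout is known in advance.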

In general, if the string has a well-defined structure (XML, JSON, RDF, etc.), do not reach for RegEx as the first option. There are well-established parsers for these types of documents.

The query above should get you started with XML navigation.
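If the URLs can appear in any element or attribute, one possible variation is to walk the whole tree and keep only the values that parse as HTTP(S) URLs whose path ends in a file extension. This is a sketch under those assumptions, not part of the original query; `FileUrlXmlDemo`, `Extract`, and `HasFileExtension` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class FileUrlXmlDemo
{
    // Hypothetical helper: scans every attribute value and every leaf element's text.
    public static List<string> Extract(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        IEnumerable<string> candidates = doc.Descendants()
            .SelectMany(e => e.Attributes().Select(a => a.Value)
                // Elements that contain child elements are skipped; their children are visited separately.
                .Concat(e.HasElements ? Enumerable.Empty<string>() : new[] { e.Value }));

        return candidates.Select(s => s.Trim())
                         .Where(HasFileExtension)
                         .ToList();
    }

    // True when the value is an absolute http/https URL whose path has an extension.
    private static bool HasFileExtension(string value)
    {
        Uri uri;
        return Uri.TryCreate(value, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
            && Path.GetExtension(uri.AbsolutePath).Length > 1;
    }

    public static void Main()
    {
        // Shortened sample modelled on the feed in the question.
        string xml =
            "<channel>" +
            "<title>GOA Celebrations</title>" +
            "<link>http://goframe.com.au/rss/herston-inbound</link>" +
            "<image><url>http://goafame.com.au/site/templates/img/logo.png</url></image>" +
            "<item><link>http://goafame.com.au/gallery/entry/2991/</link>" +
            "<enclosure url=\"https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg\" length=\"\" type=\"\" /></item>" +
            "</channel>";

        foreach (string url in Extract(xml))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the parser handles the markup, the filter only has to decide whether a single value is a file URL, which avoids having to make one regular expression cope with tags, quotes, and entities.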

Thank you for your answer, but my problem is that the XML is dynamic: it may come from any RSS feed, and some links are not located in the link element. - Prince Jea, Mar 4 '16 at 7:52
