I have the following XML file, which has been converted to a string. I want to parse it and return only the URLs that end with a file extension.

<?xml version="1.0" encoding="utf-8"?>
<channel>
  <generator>GOA Celebrations</generator>
  <generator>GOA Celebrations</generator>
  <pubDate>13 Jan 2016</pubDate>
  <title>GOA Celebrations</title>
  <link>http://goframe.com.au/rss/herston-inbound</link>
  <language>en</language>
  <image>
    <url>http://goafame.com.au/site/templates/img/logo.png</url>
  </image>
  <item>
    <title>Herston Inbound - Wednesday, 13 Jan - Alex Peisker - 2015-12-21 09:26am - Live</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="http://goafame.com.au/site/assets/files/2991/final_cropped_shutterstock_114908098.jpg" length="" type="" />
    <guid isPermaLink="false">c5c1cb0bebd56ae38817b251ad72bedb</guid>
  </item>
  <item>
    <title>Dog 1</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="http://animaliaz-life.com/data_images/dog/dog4.jpg" length="" type="" />
  </item>
  <item>
    <title>Dog 2</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="http://cdn1.theodysseyonline.com/files/2015/12/21/6358631429926013411708851658_Dog-Pictures.jpg" length="" type="" />
  </item>
  <item>
    <title>Dog 3</title>
    <description></description>
    <pubDate>13 Jan 2016 10:32:01 AEST</pubDate>
    <link>http://goafame.com.au/gallery/entry/2991/</link>
    <enclosure url="https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg" length="" type="" />
  </item>
</channel>

My question is how to do this using a regular expression.

The desired output would be

http://goafame.com.au/site/templates/img/logo.png
http://goafame.com.au/site/assets/files/2991/final_cropped_shutterstock_114908098.jpg
http://animaliaz-life.com/data_images/dog/dog4.jpg
http://cdn1.theodysseyonline.com/files/2015/12/21/6358631429926013411708851658_Dog-Pictures.jpg
https://i.ytimg.com/vi/AkcfB3z0_-0/maxresdefault.jpg

Here's what I've tried so far:

Regex linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);

string rawString = doc.ToString();
int posCounter = 0;
foreach (Match m in linkParser.Matches(rawString))
{
    posCounter++;
    links.Add(new LinkModel
    {
        IsSelected = false,
        XmlLink = m.Value,
        NodePosition = posCounter
    });
}

Note: the XML can come from any source, and some URLs are not located in a link element. Some are even nested. That's why I am thinking of using a regex rather than XDocument.

2

This pattern matches your sample (tested at http://regexstorm.net/tester):

https?://[^\s<"]+/[^\s<"]+(?:\.\w{3,4})

The idea is to find all links that contain a slash (/) followed by a file-name pattern, i.e. a name ending in a three- or four-character extension.
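
As a rough sketch of how the pattern could be applied in C#, something like the following should work. The class and method names (FileUrlExtractor, Extract) are made up for illustration and are not part of the question's code; the only change to the pattern is that the double quote is doubled, as a verbatim string requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class FileUrlExtractor
{
    // Same pattern as above: scheme, at least one path slash,
    // then a name ending in a 3-4 character extension.
    private static readonly Regex FileUrl = new Regex(
        @"https?://[^\s<""]+/[^\s<""]+(?:\.\w{3,4})",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns every substring of rawXml that matches the pattern,
    // in the order the matches appear in the text.
    public static List<string> Extract(string rawXml)
    {
        return FileUrl.Matches(rawXml)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToList();
    }
}
```

Calling FileUrlExtractor.Extract(doc.ToString()) would then give the list of matching URLs. Note that the match stops at the extension, so anything after it (a query string, for example) is not included in the result.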

2
var allLinkValues = XDocument.Parse(doc.ToString())
                             .Root
                             .Elements("item")
                             .Select(itemElement => itemElement.Element("link").Value)
                             .ToList();

Here, XDocument.Parse(doc.ToString()) loads the document and Root points to the root element. We then select all the "item" elements and take the value of each one's "link" element.

That's the power of LINQ to XML!

XPath and XmlDocument are your other options.

In general, if the string has a well-defined format (XML, JSON, RDF, etc.), do not reach for regex as the first option. There are well-established parsers for these types of documents.

The above query should get you started on XML navigation.
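
If the URLs can sit in any element or attribute, one possible way to stay with a parser is to walk every node and filter the values afterwards. The sketch below is only an illustration under some assumptions: the names XmlFileUrlExtractor and Extract are hypothetical, and a "file URL" is taken to mean an http(s) URL whose last path segment ends in a three- or four-character extension.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, named here only for illustration.
public static class XmlFileUrlExtractor
{
    // Assumption: the whole value must be an http(s) URL whose last
    // path segment ends in a 3-4 character extension.
    private static readonly Regex FileUrl = new Regex(
        @"^https?://\S+/[^/\s]+\.\w{3,4}$",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static List<string> Extract(string rawXml)
    {
        XDocument doc = XDocument.Parse(rawXml);

        // Text of leaf elements (those without child elements), at any depth.
        IEnumerable<string> elementValues = doc.Descendants()
            .Where(e => !e.HasElements)
            .Select(e => e.Value.Trim());

        // Values of every attribute on every element.
        IEnumerable<string> attributeValues = doc.Descendants()
            .Attributes()
            .Select(a => a.Value.Trim());

        // Element values come first, then attribute values,
        // so the result is not in strict document order.
        return elementValues.Concat(attributeValues)
            .Where(v => FileUrl.IsMatch(v))
            .ToList();
    }
}
```

Because the filter is applied to whole node values rather than to the raw text, this does not depend on the element being named "link" or on how deeply it is nested. It will miss URLs that are embedded in longer text, such as HTML inside a description element.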

  • Thank you for your answer, but my problem here is that the XML is dynamic: it may come from any RSS feed, and some links are not located in a link element. – Prince Jea Mar 4 '16 at 7:52
