
I'm trying to work out if there is a better way to achieve the following:

from lxml import html
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<p>&pound;682m</p>")
text = soup.find("p").string

print text
>>> &pound;682m

print html.fromstring(text).text
>>> £682m

So I'm trying to produce the same string that lxml returns when I do the second print. I'd rather not have to resort to lxml in order to interpret these escaped characters: can anyone provide a way of doing this with something in the standard library?

[edit: I've accepted luc's answer but both are valid: I just thought that the answer that made use of the standard library was probably more useful in a generic sense]


5 Answers

Accepted answer (132 votes)

You can also use the HTML parser from the standard library; see http://docs.python.org/library/htmlparser.html

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print h.unescape('&pound;682m')
£682m

EDIT for Python 3: the HTMLParser module has been renamed to html.parser.

>>> import html.parser
>>> h = html.parser.HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

EDIT for Python 3.4+: the HTMLParser.unescape method is deprecated. It was originally slated for removal in 3.5 and was ultimately removed in Python 3.9; use html.unescape() instead.

import html
print(html.unescape('&pound;682m'))

see https://docs.python.org/3/library/html.html
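
If the same code has to run on both old and new interpreters, one option is to pick whichever unescape is available at import time. This is only a sketch: the samples list is made up for illustration, and the Python 2 branch relies on the undocumented HTMLParser method shown above.

```python
try:
    # Python 3.4+: documented module-level function.
    from html import unescape
except ImportError:
    # Python 2 fallback: the undocumented HTMLParser method from this answer.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Illustrative inputs: a named entity, a decimal reference and a hex reference
# that all stand for the pound sign.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m"]
decoded = [unescape(s) for s in samples]
print(decoded)
```

All three forms should come out as the same string, since named entities and numeric character references are handled by the same call.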

Note that this method is not officially documented… (but has been quite stable so far). –  EOL Jan 19 '10 at 10:46
    
This method doesn't seem to unescape numeric character references like "&#8217;" on Google App Engine, though it works locally on Python 2.6. It does still decode named entities (like &quot;) at least. –  gfxmonk Jul 10 '10 at 14:40
User 'armo' would like to advise: "The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead." –  AndrewS May 19 at 22:18
    
Do you mean that in Python 3.5, it will be from html import unescape; print(unescape('&pound;682m'))? –  luc May 20 at 4:46
@luc: html.unescape() in Python 3.4+ –  J.F. Sebastian Oct 7 at 19:21

BeautifulSoup handles entity conversion:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>
    
BeautifulStoneSoup is for XML parsing. Use BeautifulSoup for HTML. –  interjay Jan 18 '10 at 16:23
    
+1. No idea how I missed this in the docs: thanks for the info. I'm going to accept luc's answer though, because his uses the standard lib which I specified in the question (not important to me) and it's probably of more general use to other people. –  jkp Jan 18 '10 at 16:23
    
interjay: fixed, the same applies to BeautifulSoup also. –  Ben James Jan 18 '10 at 16:27
    
jkp: Actually I think you are not helping people, who may continue to believe BeautifulSoup can't handle entities properly by seeing that accepted answer. If an assumption you made in your question (i.e. that BeautifulSoup couldn't do it) was incorrect, you can always edit and point that out. –  Ben James Jan 18 '10 at 16:31
convertEntities is removed from BeautifulSoup4. :( –  Gagandeep Singh Sep 22 '13 at 7:33

I would have liked to simply add a comment to Neil Aggarwal's answer, but lacked the reputation.

Based on his informative answer, but hoping to reduce the number of copies of potentially large strings in the for loop, I came up with the following alternative. It runs about 40 times faster on my platform:

import re
import HTMLParser

# 1500000 character long string
long_test_string = "abcdefg &pound;" * 10**5
parser = HTMLParser.HTMLParser()
# supply a lambda function to re.sub() to handle unescaping
# of any matched string segments
# re.sub() returns a new string, so keep the result
result = re.sub(r"(&.+?;)", lambda m: parser.unescape(m.group()), long_test_string)
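
On Python 3.4+ the same callback idea can be written with html.unescape. The sketch below also tightens the pattern so it only matches things shaped like entities, and skips a few markup-significant ones. The names ENTITY_RE, KEEP_ESCAPED and unescape_text_entities are made up for illustration, and the "keep these escaped" policy is an assumption, not something the answer above asks for.

```python
import html
import re

# Matches named (&pound;), decimal (&#163;) and hex (&#xA3;) references only.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Hypothetical policy: leave these escaped so the result is still safe markup.
KEEP_ESCAPED = {"&lt;", "&gt;", "&amp;", "&quot;"}


def unescape_text_entities(text):
    """Unescape every entity in text except the ones in KEEP_ESCAPED."""
    def replace(match):
        entity = match.group()
        if entity in KEEP_ESCAPED:
            return entity
        return html.unescape(entity)

    # A single pass over the string; no intermediate copies per entity.
    return ENTITY_RE.sub(replace, text)


sample = "<p>&pound;682m &amp; &#8217;quoted&#x2019; &lt;b&gt;</p>"
print(unescape_text_entities(sample))
```

The callback form is mainly worth it when some entities need special treatment; if every entity should be decoded, calling html.unescape on the whole string is simpler.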

This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas on how to make it better, I'm all ears, as I'm new to this):

import re
import HTMLParser

regexp = "&.+?;"
h = HTMLParser.HTMLParser()  # create the parser once, not on every iteration
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces HTML entity with unescaped value
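
For what it's worth, on Python 3.4+ the find-and-replace loop should not be needed at all, because html.unescape accepts a whole document and decodes every entity in one call. A minimal sketch, with a made-up page string standing in for the real document:

```python
import html

# Stand-in for the real document; the content is invented for illustration.
page = "<p>Cost: &pound;682m or &euro;800m, up 5&#37;</p>"

# One call decodes named and numeric references across the whole string.
clean_page = html.unescape(page)
print(clean_page)
```

Note that this decodes &lt;, &gt; and &amp; as well, so the result may no longer be valid HTML if the text contained escaped markup.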

Beautiful Soup 4 allows you to set a formatter for your output.

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

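from bs4 import BeautifulSoup

# Assumed setup: the snippet never defines soup, so this input is a guess
# that matches the output shown below.
soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>")
print(soup.prettify(formatter=None))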
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
