Decode HTML entities in Python string?

Question

I'm trying to work out if there is a better way to achieve the following:

from lxml import html
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<p>&pound;682m</p>")
text = soup.find("p").string

print text
>>> &pound;682m

print html.fromstring(text).text
>>> £682m

So I'm trying to produce the same string that lxml returns when I do the second print. I'd rather not have to resort to lxml in order to interpret these escaped characters: can anyone provide a way of doing this with something in the standard library?

[edit: I've accepted luc's answer but both are valid: I just thought that the answer that made use of the standard library was probably more useful in a generic sense]

related: Convert XML/HTML Entities into Unicode String in Python — J.F. Sebastian, Dec 18 '12 at 19:01

luc · Accepted Answer · 2014-10-08 05:36:33Z

You can also use the HTML parser from the standard library; see http://docs.python.org/library/htmlparser.html

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print h.unescape('&pound;682m')
£682m

EDIT for Python 3: the HTMLParser module has been renamed to html.parser.

>>> import html.parser
>>> h = html.parser.HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

EDIT for Python 3.4+: the unescape method is deprecated (it was announced for removal in 3.5 and was eventually removed in Python 3.9); use html.unescape() instead.

import html
print(html.unescape('&pound;682m'))

see https://docs.python.org/3/library/html.html
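
For code that has to run on more than one Python version, the three variants above can be folded into a single import. This is only a minimal sketch and not part of the original answer; the variable name decoded is purely illustrative, and the fallbacks rely on the undocumented HTMLParser.unescape method mentioned above.

```python
try:
    # Python 3.4+: the documented, module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: module renamed to html.parser
        from html.parser import HTMLParser
    except ImportError:
        # Python 2: original module name
        from HTMLParser import HTMLParser
    # Fall back to the (undocumented) method on a parser instance
    unescape = HTMLParser().unescape

decoded = unescape('&pound;682m')
print(decoded)
```

Whichever branch is taken, the rest of the program only ever calls unescape(), so the version check stays in one place.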

edited Oct 8 at 5:36

answered Jan 18 '10 at 16:17

luc

Note that this method is not officially documented… (but has been quite stable so far). – EOL Jan 19 '10 at 10:46

This method doesn't seem to unescape characters like "’" on Google App Engine, though it works locally on Python 2.6. It does still decode entities (like ") at least. – gfxmonk Jul 10 '10 at 14:40

User 'armo' would like to advise: "The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead." – AndrewS May 19 at 22:18

Do you mean that in Python 3.5, it will be from html import unescape; print(unescape('&pound;682m'))? – luc May 20 at 4:46

@luc: html.unescape() in Python 3.4+ – J.F. Sebastian Oct 7 at 19:21

Ben James · Answer 2 · 2010-01-18 16:26:38Z

BeautifulSoup handles entity conversion:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

edited Jan 18 '10 at 16:26

answered Jan 18 '10 at 16:19

Ben James

BeautifulStoneSoup is for XML parsing. Use BeautifulSoup for HTML. – interjay Jan 18 '10 at 16:23

+1. No idea how I missed this in the docs: thanks for the info. I'm going to accept luc's answer though, because his uses the standard lib, which I specified in the question (not important to me), and it's probably of more general use to other people. – jkp Jan 18 '10 at 16:23

interjay: fixed, the same applies to BeautifulSoup also. – Ben James Jan 18 '10 at 16:27

jkp: Actually I think you are not helping people, who may continue to believe BeautifulSoup can't handle entities properly by seeing that accepted answer. If an assumption you made in your question (i.e. that BeautifulSoup couldn't do it) was incorrect, you can always edit and point that out. – Ben James Jan 18 '10 at 16:31

convertEntities is removed from BeautifulSoup4. :( – Gagandeep Singh Sep 22 '13 at 7:33

Rob · Answer 3 · 2013-12-18 20:24:30Z

I would have liked to simply add a comment to Neil Aggarwal's answer, but lacked the reputation.

Based on his informative answer, but hoping to reduce the number of copies of potentially large strings in the for loop, I came up with the following alternative. It runs about 40 times faster on my platform:

import re
import HTMLParser

# 1500000 character long string
long_test_string = "abcdefg &pound;" * 10**5
parser = HTMLParser.HTMLParser()
# supply a lambda function to re.sub() to handle unescaping
# of any matched string segments
unescaped = re.sub("(&.+?;)", lambda m: parser.unescape(m.group()), long_test_string)
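
On Python 3.4+ the same idea can be sketched with html.unescape(), which already decodes a whole string in one call, so the regex step may not be needed at all. The ENTITY pattern below is my own assumption (a narrower match for named, decimal and hexadecimal references that end in a semicolon), not something taken from the answer above, and no timing claim is made for it.

```python
import html
import re

# 1500000 character long string, as in the example above
long_test_string = "abcdefg &pound;" * 10**5

# Hypothetical, narrower pattern: &name; or &#123; or &#x7B;
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Regex-driven variant: unescape only the matched segments
via_regex = ENTITY.sub(lambda m: html.unescape(m.group()), long_test_string)

# Direct variant: let html.unescape() scan the whole string itself
direct = html.unescape(long_test_string)

print(via_regex == direct)
```

If the two results agree on your data, the direct call is the simpler choice.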

Ashwini Chaudhary · Answer 4 · 2013-10-31 11:07:16Z

This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas as to how to make it better, I'm all ears, as I'm new to this):

import re
import HTMLParser

h = HTMLParser.HTMLParser()  # create the parser once, outside the loop
regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces HTML entity with unescaped value
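
One thing to watch with the find-then-replace loop: because each replacement is applied to the whole page, text that was itself escaped (such as &amp;pound;) can end up decoded twice. The sketch below ports the loop to Python 3 using html.unescape() as a stand-in for h.unescape(), and compares it with a single-pass call; the sample string and variable names are made up for illustration.

```python
import html
import re

# Hypothetical input: a literal "&pound;" that was escaped, plus a real entity
page = "&amp;pound; &pound;"

# Loop-and-replace, as in the answer above
looped = page
for entity in re.findall("&.+?;", page):
    # Replacing "&amp;" first turns "&amp;pound;" into "&pound;",
    # which the next iteration then decodes a second time
    looped = looped.replace(entity, html.unescape(entity))

# Single pass over the string: each entity is decoded exactly once
single_pass = html.unescape(page)

print(looped)
print(single_pass)
```

The loop yields two pound signs, while the single pass keeps the escaped one as the literal text &pound;, which is usually what is wanted.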

Loïc · Answer 5 · 2014-01-14 10:03:44Z

Beautiful Soup 4 allows you to set a formatter for your output. From its documentation:

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

asked	4 years ago
viewed	39010 times
active	23 days ago
