2

Suppose I have the following HTML table:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>

I'd like to convert this table to JSON, potentially in the following format:

data= [
  { 
    Name: 'John',         
    Age: 28,
    License: 'Y',
    Amount: 12.30
  },
  { 
    Name: 'Kevin',         
    Age: 25,
    License: 'Y',
    Amount: 22.30
  },
  { 
    Name: 'Smith',         
    Age: 38,
    License: 'Y',
    Amount: 52.20
  },
  { 
    Name: 'Stewart',         
    Age: 21,
    License: 'N',
    Amount: 3.80
  }
];

I've seen another example that sort of does the above, which I found here. However, there are a couple of things that I can't get working given that answer. Those are:

  • It is limited to two rows on the table. If I add an additional row, I get an error:

print(json.dumps(OrderedDict(table_data))) ValueError: too many values to unpack (expected 2)

  • The header rows of the table are not taken into account.

This is my code so far:

html_data = """
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>
"""

from bs4 import BeautifulSoup
from collections import OrderedDict
import json

table_data = [[cell.text for cell in row("td")]
                         for row in BeautifulSoup(html_data, features="lxml")("tr")]

print(json.dumps(OrderedDict(table_data)))

But I'm getting the following error:

print(json.dumps(OrderedDict(table_data))) ValueError: need more than 0 values to unpack

EDIT The answer below works perfectly if there is only one table in the HTML. What if there are two tables? For example:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>Rich</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>

If I plug this in the below code, only the first table is shown as the JSON output.

Share a link to this question
CC BY-SA 4.0
3

You can use soup.find_all:

from bs4 import BeautifulSoup as soup
s = soup(html, 'html.parser').table
h, [_, *d] = [i.text for i in s.tr.find_all('th')], [[i.text for i in b.find_all('td')] for b in s.find_all('tr')]
result = [dict(zip(h, i)) for i in d]

Output:

[{'Name': 'John', 'Age': '28', 'License': 'Y', 'Amount': '12.30'}, {'Name': 'Kevin', 'Age': '25', 'License': 'Y', 'Amount': '22.30'}, {'Name': 'Smith', 'Age': '38', 'License': 'Y', 'Amount': '52.20'}, {'Name': 'Stewart', 'Age': '21', 'License': 'N', 'Amount': '3.80'}]
Share a link to this answer
CC BY-SA 4.0
1
  • Thank you so much! This also does exactly what I wanted. I have only "approved" the other answer as the output was shown in a friendlier way.
    – Adam
    Jan 29 '20 at 14:27
2

This code does exactly what you want

from bs4 import BeautifulSoup
import json

xml_data = """
[[your xml data]]"""


if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for tr in model.table.find_all('tr', recursive=False):
        for th in tr.find_all('th', recursive=False):
            fields.append(th.text)
    for tr in model.table.find_all('tr', recursive=False):
        datum = {}
        for i, td in enumerate(tr.find_all('td', recursive=False)):
            datum[fields[i]] = td.text
        if datum:
            table_data.append(datum)

    print(json.dumps(table_data, indent=4))
Share a link to this answer
CC BY-SA 4.0
4
  • How would you modify this if you had two different tables? If I put the two tables one after the other in the "xml_data" variable, it only returns the JSON output for the first table.
    – Adam
    Feb 19 '20 at 14:58
  • 1
    At the beginngin you do a loop for table in model.find_all("table"): and then replace every following occurene of model with table
    – Rugnar
    Feb 20 '20 at 18:40
  • Where do you mean "at the beginning"? Should this be directly before the first loop? Should it encompass the both for loops? I have edited my original post for more detail. Perhaps you could edit your answer directly?
    – Adam
    Feb 24 '20 at 13:25
  • I have created a new question here: stackoverflow.com/q/60379358/3480297
    – Adam
    Feb 24 '20 at 15:41

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy

Not the answer you're looking for? Browse other questions tagged or ask your own question.