Parsing an HTML file using BeautifulSoup in a DataFrame
There is a file of the form:
<TD class="c1">111-1111</TD>
<TD class="c2">AA1111-1111</TD>
<TD class="c3">NAME1</TD>
<TD class="c4"><INPUT type="text" id="F1" readonly="readonly" value=" .368"></TD>
<TD class="c5"><INPUT type="text" id="Q1" readonly="readonly" value=""></TD>
</TR>
<TR class="r1">
<TD class="c1">222-2222</TD>
<TD class="c2">BB2222-2222</TD>
<TD class="c3">NAME2</TD>
<TD class="c4"><INPUT type="text" id="F2" readonly="readonly" value=" 1.28"></TD>
<TD class="c5"><INPUT type="text" id="Q2" readonly="readonly" value=""></TD>
</TR>
From it, I need information in the form of pandas. DataFreme, which lies in the blocks TD class= "c1", TD class= "c2", TD class= " c3 "and the value value= in TD class= "c4".
To get it, I do the following:
soup = BeautifulSoup(html,'lxml')
description = [element.text for element in soup.find_all(class_="c3")]
component = [element.text for element in soup.find_all(class_="c1")]
code = [element.text for element in soup.find_all(class_="c2")]
val = re.findall(r'value="(.*?)"', html)
value = [value for value in val if value != '']
value.insert(0, 'Value')
data = []
for a, b, c, in zip(component ,description,value):
data.append([a, b, c,])
df = pd.DataFrame(data, columns=['cod','desc','val'])
The code works, if there are suggestions for improving it (and I am sure that this code can be improved :) ), I will be happy to listen!!!
Actually, the question is, how do I get the value= value that I have now .368 to lead to a numeric value? a value of the form 0.368 ?
I will be grateful for any information !
1
1 answers
Try this way:
import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path
def get_vals(soup, filt="[class='c4']"):
ret = [x.input.attrs["value"].strip()
for x in soup.select(f"td{filt}")[1:]]
return pd.to_numeric(ret, errors="coerce")
url = r"C:\download\CONCTEXT_NCS_S0907R50B.htm"
soup = BeautifulSoup(Path(url).read_text(encoding="utf-8"), 'lxml')
df = pd.read_html(url, header=0)[0]
df["Recipe Qty"] = get_vals(soup, filt="[class='c4']")
Result:
In [123]: df
Out[123]:
Component S-W Code Description Recipe Qty Required Quantity
0 241-2905 TZ4103-3905 BLUE FTALO 0.368 NaN
1 241-6909 TZ4103-2909 OXYDE RED 1.280 NaN
2 241-7906 TZ4103-3406 RED BORDEAUX 1.120 NaN
3 X80LC-G NaN WHITE TEXTURED TOP COAT (*) 997.232 NaN
In [124]: df.dtypes
Out[124]:
Component object
S-W Code object
Description object
Recipe Qty float64
Required Quantity float64
dtype: object
1
Author: MaxU, 2019-07-17 09:20:23