Displaying the tags of a web page with indentation proportional to the depth of the element in the document tree structure
Problem: develop the MyHTMLParser class as a subclass of HTMLParser which, when fed the contents of an HTML file, prints the names of the start and end tags in the order in which they appear in the document, indented in proportion to the depth of the element in the document tree. Ignore HTML elements that do not require an end tag, such as p and br.
The HTML file used: https://easyupload.io/d45c52
The output should be:
html start
    head start
        title start
        title end
    head end
    body start
        h1 start
        h1 end
        h2 start
        h2 end
        ul start
            li start
...
        a end
    body end
html end
What I did:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # print the name of each start tag (no indentation yet)
        print(tag, "start")

    def handle_endtag(self, tag):
        # print the name of each end tag (no indentation yet)
        print(tag, "end")

# read the whole file; the with block closes it automatically
with open("w3c.html", "r") as infile:
    content = infile.read()

myparser = MyHTMLParser()
myparser.feed(content)
My output was:
html start
head start
title start
title end
head end
body start
h1 start
h1 end
p start
br start
p end
h2 start
h2 end
...
a start
a end
body end
html end
How can I fix the code so that the output is indented?
1 answer
The handle_starttag() and handle_endtag() methods need to be overridden. Each should display the name of the element corresponding to the tag, indented appropriately.
The indentation is an integer value that is incremented at each start tag and decremented at each end tag. (I ignored the elements p and br.) The indentation value should be stored as an instance variable of the parser object and initialized in the constructor.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    'HTML document parser that displays indented tags'

    def __init__(self):
        'initializes the parser and the starting indentation'
        HTMLParser.__init__(self)
        self.indent = 0    # initial indentation value

    def handle_starttag(self, tag, attrs):
        '''displays the start tag with indentation proportional to the
           depth of the tag's element in the document'''
        if tag not in {'br', 'p'}:
            print('{}{} start'.format(self.indent*' ', tag))
            self.indent += 4

    def handle_endtag(self, tag):
        '''displays the end tag with indentation proportional to the
           depth of the tag's element in the document'''
        if tag not in {'br', 'p'}:
            self.indent -= 4
            print('{}{} end'.format(self.indent*' ', tag))
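To try this out without the w3c.html file, here is a minimal sketch of the same idea that collects the lines in a list instead of printing them, which makes the result easier to check. The names IndentingParser, SKIPPED_TAGS, indented_tags, the step parameter and the SAMPLE string are all made up for this illustration; they are not part of html.parser or of the original exercise.

```python
from html.parser import HTMLParser

# Tags skipped by the answer above (assumption: extend this set as needed).
SKIPPED_TAGS = {'br', 'p'}


class IndentingParser(HTMLParser):
    """Collects indented start/end tag lines instead of printing them."""

    def __init__(self, step=4):
        super().__init__()
        self.step = step    # spaces added per nesting level (hypothetical parameter)
        self.indent = 0     # current indentation, in spaces
        self.lines = []     # collected output lines

    def handle_starttag(self, tag, attrs):
        if tag not in SKIPPED_TAGS:
            # record the tag at the current depth, then go one level deeper
            self.lines.append('{}{} start'.format(self.indent * ' ', tag))
            self.indent += self.step

    def handle_endtag(self, tag):
        if tag not in SKIPPED_TAGS:
            # come back up one level before recording the end tag
            self.indent -= self.step
            self.lines.append('{}{} end'.format(self.indent * ' ', tag))


def indented_tags(html_text):
    """Return the indented tag listing for html_text as a list of lines."""
    parser = IndentingParser()
    parser.feed(html_text)
    parser.close()
    return parser.lines


# Hypothetical sample document, not the w3c.html file from the question.
SAMPLE = ('<html><head><title>T</title></head>'
          '<body><h1>Hi</h1><p>text<br>more</p>'
          '<a href="x.html">link</a></body></html>')

if __name__ == '__main__':
    print('\n'.join(indented_tags(SAMPLE)))
```

With this sample the p and br tags are left out, and the a element lands at the same depth as h1 because the skipped p does not change the indentation. Note that the counter only stays balanced if every tag that is not skipped has a matching end tag; other elements without end tags (img, for example) would have to be added to the skipped set.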