Displaying the tags of a web page with indentation proportional to the depth of the element in the document tree structure

Problem: develop the MyHTMLParser class as a subclass of HTMLParser which, when fed an HTML file, prints the names of the start and end tags in the order in which they appear in the document, indented in proportion to the depth of each element in the document tree. Ignore HTML elements that do not require an end tag, such as p and br.

The HTML file used: https://easyupload.io/d45c52

The output should be:

html start
    head start
        title start
        title end
    head end
    body start
        h1 start
        h1 end
        h2 start
        h2 end
        ul start
            li start
...
        a end
    body end
html end

What I did:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):  # prints the name of each start tag
        print(tag, "start")

    def handle_endtag(self, tag):
        print(tag, "end")

with open("w3c.html", "r") as infile:  # the file is closed automatically
    content = infile.read()
myparser = MyHTMLParser()
myparser.feed(content)

My output was:

html start
head start
title start
title end
head end
body start
h1 start
h1 end
p start
br start
p end
h2 start
h2 end
...
a start
a end
body end
html end

How can I fix the code so that the output is indented?

Author: Ed S, 2020-01-21

1 answer

The handle_starttag() and handle_endtag() methods need to be overridden. Each should print the name of the element corresponding to the tag, indented appropriately.

The indentation is an integer that is incremented at each start tag and decremented at each end tag. (I ignored the p and br elements.) It should be stored as an instance variable of the parser object and initialized in the constructor.

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    'HTML document parser that prints indented tags'

    def __init__(self):
        'initialize the parser and the starting indentation'
        HTMLParser.__init__(self)
        self.indent = 0            # initial indentation value

    def handle_starttag(self, tag, attrs):
        '''print the start tag with indentation proportional to the
           depth of the tag's element in the document'''
        if tag not in {'br','p'}:
            print('{}{} start'.format(self.indent*' ', tag))
            self.indent += 4

    def handle_endtag(self, tag):
        '''print the end tag with indentation proportional to the
           depth of the tag's element in the document'''
        if tag not in {'br','p'}:
            self.indent -= 4
            print('{}{} end'.format(self.indent*' ', tag))
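
To check the behaviour without the w3c.html file, here is a minimal sketch of the same idea that stores the lines in a list instead of printing them directly. The names IndentedTagCollector, SKIPPED, step and sample are hypothetical and exist only for this illustration, and the guard against a negative depth is an added assumption, not part of the code above.

```python
from html.parser import HTMLParser

# Tags ignored in this sketch, matching the p and br elements skipped above.
SKIPPED = {'p', 'br'}


class IndentedTagCollector(HTMLParser):
    """Hypothetical variant that collects the indented lines in a list."""

    def __init__(self, step=4):
        super().__init__()
        self.step = step    # number of spaces added per nesting level
        self.depth = 0      # current indentation, in spaces
        self.lines = []     # collected output lines

    def handle_starttag(self, tag, attrs):
        if tag not in SKIPPED:
            self.lines.append(' ' * self.depth + tag + ' start')
            self.depth += self.step

    def handle_endtag(self, tag):
        if tag not in SKIPPED:
            # Assumption: clamp at zero so a stray end tag cannot
            # produce a negative indentation.
            self.depth = max(self.depth - self.step, 0)
            self.lines.append(' ' * self.depth + tag + ' end')


# Small made-up document used in place of w3c.html.
sample = ("<html><head><title>T</title></head>"
          "<body><h1>H</h1><p>text<br></p></body></html>")

collector = IndentedTagCollector()
collector.feed(sample)
collector.close()
print('\n'.join(collector.lines))
```

For this sample, the p and br tags do not appear in the result, title is indented by eight spaces, and the final "html end" line returns to column zero. Collecting the lines rather than printing them is only a convenience for testing; the indentation logic is the same as in the class above.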
Author: Ed S, 2020-03-21 11:25:03