XPath with Python-pick up text after tag in a div

Question

I'm trying to grab the text that comes after a tag inside a div in an HTML page. The problem is that I'm not getting the text, just an empty string. I searched elsewhere and didn't find anyone with a similar problem.

Here is the HTML:

<div class="list-view-item-title-wrapper">
    <div class="list-view-item-title-top">
        <div class="list-view-item-type">
            "Webcast"
        </div>
    </div>
    <a href="/resources/actionable-awareness-unlock-your-influence" class="list-view-item-title">
        <h2>
            "Actionable Awareness: Unlock Your Influence"
        </h2>
    </a>
    <div class="list-view-item-date">
        <i class="fa fa-calendar"></i>
        "September 24, 2020"
    </div>
    ...
</div>

And the Python:

def get_posts_elements(self, html):
    posts = self.get_posts(html)

    # - get_posts -> returns html.xpath("//div[@class='list-view-item-title-wrapper']")
    # - html -> lxml.html.fromstring(requests.get('https://www.scrum.org/resources').content)
    
    for post in posts:

        # --- Retrieved successfully:
        try:
            self.data['Type'].append(post.xpath(".//div[@class='list-view-item-type']")[0].text.strip())
        except:
            self.data['Type'].append('')

        try:
            self.data['Title'].append(post.xpath(".//a[@class='list-view-item-title']/h2")[0].text.strip())
        except:
            self.data['Title'].append('')
        
        try:
            self.data['Link'].append(urljoin(self.base_url, post.xpath(".//a[@class='list-view-item-title']/@href")[0]))
        except:
            self.data['Link'].append('')


        # --- Failing to retrieve:
        data = post.xpath(".//div[@class='list-view-item-date']")[0].text
        print(data)

In this case, I want to get the date of each post, the same way I do with the title and type. In the example above it would be "September 24, 2020", but I only get an empty string.

My imports:

import lxml.html as parser
import requests
from urllib.parse import urlsplit, urljoin

python web-scraping xpath

Author: Wiliane Souza, 2020-06-27

1 answer

score 0 · Answer 1

I believe I managed to solve it using XPath axes. I used

post.xpath(".//div[@class='list-view-item-date']/descendant-or-self::*/text()")[1]

Instead of

post.xpath(".//div[@class='list-view-item-date']")[0].text

In short, /descendant-or-self::* selects the node itself plus all of its children and deeper descendants, so the trailing /text() collects every text node inside the div, including the one that comes after the <i> tag. That is how I was finally able to get the text. I also needed to change the index, since the text I want is always the second item in the list.
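
For anyone wondering why the original .text call came back blank: in lxml, an element's .text holds only the text before its first child, and any text that follows a child's closing tag is stored on that child as .tail. Below is a minimal sketch of this, not a drop-in replacement for the scraper. It assumes the markup from the question, and it assumes the quotes around the date are just how the browser's dev tools display text nodes, so they are left out of the sample. SAMPLE and the variable names are made up for the illustration.

```python
import lxml.html

# Hypothetical sample, trimmed down from the HTML in the question.
SAMPLE = """<div class="list-view-item-title-wrapper">
    <div class="list-view-item-date">
        <i class="fa fa-calendar"></i>
        September 24, 2020
    </div>
</div>"""

post = lxml.html.fromstring(SAMPLE)
date_div = post.xpath(".//div[@class='list-view-item-date']")[0]

# .text is only the text before the first child (<i>), which is whitespace here.
text_before_icon = date_div.text

# The text after </i> is stored as the "tail" of the <i> element.
date_from_tail = date_div.find("i").tail.strip()

# Same result as in the answer: the second text node in document order.
date_from_axis = post.xpath(
    ".//div[@class='list-view-item-date']/descendant-or-self::*/text()"
)[1].strip()

# Index-free alternatives: drop whitespace-only text nodes,
# or concatenate all text in the subtree with text_content().
text_nodes = [t.strip() for t in date_div.xpath("./text()") if t.strip()]
date_from_content = date_div.text_content().strip()

print(repr(text_before_icon))
print(date_from_tail, date_from_axis, text_nodes, date_from_content, sep=" | ")
```

The last two variants do not rely on the date being at a fixed position in the list, which may be safer if some posts have no icon or have extra child elements.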