XPath with Python: get the text after a tag inside a div
I'm trying to grab the text that comes after a tag inside a div in an HTML page. The problem is that I get an empty string instead of the text. I searched elsewhere and didn't find anyone with a similar problem :/
Here is the HTML:
<div class="list-view-item-title-wrapper">
    <div class="list-view-item-title-top">
        <div class="list-view-item-type">
            "Webcast"
        </div>
    </div>
    <a href="/resources/actionable-awareness-unlock-your-influence" class="list-view-item-title">
        <h2>
            "Actionable Awareness: Unlock Your Influence"
        </h2>
    </a>
    <div class="list-view-item-date">
        <i class="fa fa-calendar"></i>
        "September 24, 2020"
    </div>
    ...
</div>
And the Python:
def get_posts_elements(self, html):
    posts = self.get_posts(html)
    # - get_posts -> returns html.xpath("//div[@class='list-view-item-title-wrapper']")
    # - html -> parser.fromstring(requests.get('https://www.scrum.org/resources').text)
    for post in posts:
        # --- Retrieved successfully:
        try:
            self.data['Type'].append(post.xpath(".//div[@class='list-view-item-type']")[0].text.strip())
        except (IndexError, AttributeError):
            # No match, or the element has no text
            self.data['Type'].append('')
        try:
            self.data['Title'].append(post.xpath(".//a[@class='list-view-item-title']/h2")[0].text.strip())
        except (IndexError, AttributeError):
            # Fall back on the 'Title' list (the original appended to 'Type' here)
            self.data['Title'].append('')
        try:
            self.data['Link'].append(urljoin(self.base_url, post.xpath(".//a[@class='list-view-item-title']/@href")[0]))
        except IndexError:
            self.data['Link'].append('')
        # --- Fails to retrieve:
        data = post.xpath(".//div[@class='list-view-item-date']")[0].text
        print(data)
In this case, I want to get the date text of each post, as I do with the title and type. In the example above it would be "September 24, 2020", but I only get an empty string.
My imports:
import lxml.html as parser
import requests
from urllib.parse import urlsplit, urljoin
1 answer
I believe I managed to solve it using XPath axes. I used
post.xpath(".//div[@class='list-view-item-date']/descendant-or-self::*/text()")[1]
Instead of
post.xpath(".//div[@class='list-view-item-date']")[0].text
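In short, descendant-or-self::* selects the node itself plus all of its children and grandchildren (every descendant), so the text() step returns every text node under the div. The .text attribute, by contrast, only holds the text that comes before the element's first child, and here the date comes after the <i> tag. That is how I was finally able to reach the text. I also needed to change the index, since the text I want is always the second item in the list.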
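To make the difference concrete, here is a minimal sketch run against a hypothetical fragment modelled on the markup in the question (the SNIPPET string and the variable names are my own, and I left out the quotes around the date, assuming they are only how the browser inspector displays text). It shows that .text holds only whitespace, while the date can be read from the tail of the <i> element, from the div's text() nodes, or from text_content().

```python
import lxml.html

# Hypothetical fragment modelled on the markup in the question.
SNIPPET = """
<div class="list-view-item-title-wrapper">
  <div class="list-view-item-date">
    <i class="fa fa-calendar"></i>
    September 24, 2020
  </div>
</div>
"""

post = lxml.html.fromstring(SNIPPET.strip())
date_div = post.xpath(".//div[@class='list-view-item-date']")[0]

# .text covers only the text before the first child element,
# which here is just the whitespace in front of <i>.
text_before_icon = (date_div.text or "").strip()

# lxml stores text that follows an element as that element's "tail".
icon = date_div.find("i")
date_from_tail = (icon.tail or "").strip()

# XPath alternative: take the div's own text nodes and skip the blank ones,
# so the result does not depend on a fixed index.
text_nodes = [t.strip() for t in date_div.xpath("./text()") if t.strip()]
date_from_xpath = text_nodes[0] if text_nodes else ""

# text_content() joins all text inside the element, descendants included.
date_from_content = date_div.text_content().strip()

print(date_from_tail)
```

Filtering out the blank text nodes instead of hard-coding [1] is just one possible design choice; it should keep working if the whitespace around the <i> tag changes.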