Web scraping with BeautifulSoup: find_next does not return the text

Question


I want to extract the text from the excerpt below:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000">Mon 9 Mar 2020</div>

The text would be "Mon 9 Mar 2020". But when I do:

date = match_bar[0].find_next('div', {'class': 'matchDate renderMatchDateContainer'})

I get the following back, without the text itself:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>

Even when I add '.text', the result is empty. I don't have much experience with HTML.

Update:

I realized that when I run the code:

my_url = 'https://www.premierleague.com/match/{}'.format(i)
client = urlopen(my_url)
page_html = client.read()

The excerpt in question already appears like this in page_html, without the text:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>

In the browser, however, I can see the text.

Could anyone help? Thanks.


python html web-scraping beautifulsoup

Author: Otávio Simões Silveira, 2020-04-14


2 answers

Hello, how are you?

If your goal is to extract the text of a specific div, you don't need find_next. Just use the find function of bs4 and specify the class of that div.

Here is an example of how the code would look:

from bs4 import BeautifulSoup

html = """<div class='matchDate renderMatchDateContainer' 
          data-kickoff='1583784000000'>Mon 9 Mar 2020</div>"""

soup = BeautifulSoup(html, 'html.parser')

getValueFromDiv = soup.find('div', class_='matchDate renderMatchDateContainer').text

print(getValueFromDiv)

The output of this code was:

Mon 9 Mar 2020


Author: Jefferson Matheus Duarte, 2020-04-14 18:28:32

score 0 · Accepted Answer

Now that I have access to the link, I understand your problem better, so let's go through it.

You are not getting the text because the site fills in its content as the page loads in the browser. A plain request for the HTML returns only the unpopulated page body, since the values are inserted afterwards, during page loading.
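One way to check this diagnosis without opening a browser is to compare the div in the raw response with the div copied from the browser. The sketch below uses only the standard library; the class DivTextProbe and the function probe are names made up for this illustration, and the sketch assumes the target div contains no nested divs.

```python
from html.parser import HTMLParser


class DivTextProbe(HTMLParser):
    """Collects the attributes and text of the first div carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found_attrs = None   # attributes of the matched div, once found
        self.found_text = ''      # text seen inside the matched div
        self._inside = False      # True while parsing inside the matched div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'div' and self.found_attrs is None and self.css_class in classes:
            self.found_attrs = attrs
            self._inside = True

    def handle_endtag(self, tag):
        # Assumption: no nested divs, so the first closing div ends the match.
        if tag == 'div':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.found_text += data


def probe(html, css_class='matchDate'):
    """Return (attributes, text) of the first matching div, or (None, '')."""
    parser = DivTextProbe(css_class)
    parser.feed(html)
    parser.close()
    return parser.found_attrs, parser.found_text.strip()


# The two snippets from the question: raw response vs. what the browser shows.
raw = '<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>'
rendered = '<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000">Mon 9 Mar 2020</div>'

print(probe(raw))
print(probe(rendered))
```

If the text is empty for the raw response but present in the browser copy, the content is being filled in after the initial HTML arrives, and no choice of BeautifulSoup method will recover it from that response.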

On to the solution. One possible solution, and the one I recommend, is the Selenium automation library. To use as little processing as possible, we add an argument that runs the browser in headless mode, so the automated browser does not display what it opens. Selenium can then load the page and return the HTML body already filled in with the values.

If you have never worked with Selenium, I strongly recommend reading its documentation, because you will need to download the driver and edit the path given in "executable_path". The code below solves the problem:

from bs4 import BeautifulSoup
from selenium import webdriver

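def obterCodigoFonte(url):
    # Run Chrome headless so no browser window is displayed
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    # Point executable_path at your downloaded chromedriver
    # (newer Selenium 4 releases take a Service object instead)
    driver = webdriver.Chrome(executable_path=r'.\chromedriver.exe', options=chrome_options)
    try:
        driver.get(url)
        # page_source is the HTML as the browser sees it after loading the page
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()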

def processarCodigoFonte(cf):
    soup = BeautifulSoup(cf, 'html.parser')
    getValueFromDiv = soup.find('div', class_='matchDate renderMatchDateContainer')
    return getValueFromDiv.text


url = 'https://www.premierleague.com/match/46889'
codigoFonte = obterCodigoFonte(url)
print(processarCodigoFonte(codigoFonte))
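
A lighter alternative may avoid the browser entirely: the raw HTML shown in the question already carries data-kickoff="1583784000000", even when the div text is empty. The sketch below assumes that this value is a Unix timestamp in milliseconds (the page does not say so, but it converts to the same date the browser displays) and formats it in UTC. The function name kickoff_to_date is made up for this illustration.

```python
from datetime import datetime, timezone


def kickoff_to_date(kickoff_ms):
    """Convert a data-kickoff value (assumed to be epoch milliseconds) to a UTC date string."""
    moment = datetime.fromtimestamp(int(kickoff_ms) / 1000, tz=timezone.utc)
    # Assembled by hand so the day has no leading zero, as in "Mon 9 Mar 2020".
    return f'{moment:%a} {moment.day} {moment:%b %Y}'


# Value taken from the div in the question
print(kickoff_to_date('1583784000000'))
```

With BeautifulSoup, the attribute can be read from the tag itself, for example date['data-kickoff'] on the tag returned by find_next in the question. The site may display the date in the visitor's time zone, so a kickoff close to midnight could differ by a day from this UTC result.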