Parsing JavaScript rendered pages in Python with pyppeteer
Where is my table?
I already wrote a blog post about Parsing HTML Tables in Python with pandas. Using requests or even pandas directly worked nicely.
I wanted to play with some data from a race I recently ran: Lundaloppet. The results are available here: http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25
Let's try to get that table!
import pandas as pd
dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
No tables found... So what is going on? Let's look at what is returned by requests.
import requests
from IPython.display import display_html
r = requests.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.text
display_html(r.text, raw=True)
There is no table in the HTML sent by the server. The table is rendered on the client side by AngularJS. We can check that by looking at the page source in Chrome:
How do you parse a JavaScript rendered page in Python? Don't we need a browser to run the JavaScript code? By googling, I found Requests-HTML, which has JavaScript support!
Requests-HTML
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
display_html(table.html, raw=True)
Wow! Isn't that magic? We'll explore a bit later how this works.
What I want to get is all the results, not just the first 25. I tried increasing the pageSize passed in the URL, but that didn't help. Even passing a lower value always returns 25 rows. Not sure how the API is implemented...
An issue I had with requests-html is that sometimes r.html.find('table', first=True) returned None or an empty table...
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=63&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
pd.read_html(table.html)[0]
That's probably a timing issue (the rendering might take longer sometimes). I tried playing with the wait and sleep arguments of r.html.render(wait=1, sleep=1) but couldn't make the problem completely go away. This is an issue because I don't need just one page but 135.
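A workaround could be to retry the rendering until a non-empty table shows up. Here is a minimal sketch (get_table_with_retry is my own helper; the retry count is arbitrary and I haven't verified that this removes the problem entirely):

import pandas as pd
from requests_html import HTMLSession

def get_table_with_retry(url, retries=3):
    """Render the page repeatedly until a non-empty table is found."""
    session = HTMLSession()
    r = session.get(url)
    for _ in range(retries):
        # render again, giving the JavaScript some more time to run
        r.html.render(sleep=1)
        table = r.html.find('table', first=True)
        # only accept the table once it contains at least one data cell
        if table is not None and '<td' in table.html:
            return pd.read_html(table.html)[0]
    raise RuntimeError(f'no table rendered after {retries} attempts')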
I started to look at requests-html code to see how this was implemented. That's how I discovered pyppeteer.
Pyppeteer
Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library.
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
Pyppeteer allows you to do the same from Python. So there is no magic. You just let Chromium load and render the page with the latest JavaScript and browser features. This is super powerful.
The first time you run pyppeteer, it even downloads a recent version of Chromium. So no initial setup is required.
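By the way, if you prefer to use a Chrome/Chromium you already have installed, launch accepts an executablePath option (a small sketch; the path is an example, adjust it for your system):

from pyppeteer import launch

async def open_browser():
    # use a locally installed Chromium instead of the downloaded one
    # (the path is an example, adjust it for your system)
    return await launch(executablePath='/usr/bin/chromium-browser')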
Pyppeteer is based on asyncio. Requests-HTML hides that behind a simple interface, which of course means less flexibility.
So let's explore pyppeteer. The first example from the documentation shows how to take a screenshot of a page.
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Let's try that with our page. Note that I pass the fullPage option, otherwise the page is cut off.
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.screenshot({'path': 'pyppeteer_screenshot.png', 'fullPage': True})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Here is the screenshot taken:
Nice, no? This example showed us how to load a page:
- create a browser
- create a new page
- goto a page
There are several functions that can be used to retrieve elements from the page, like querySelector or querySelectorEval. The latter is the one we're going to use to retrieve the table. We pass the table selector and apply the outerHTML function to get the HTML representation of the table:
table = await page.querySelectorEval('table', '(element) => element.outerHTML')
We can then pass that to pandas.
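Since querySelectorEval returns the table HTML as a string, feeding it to pandas is a one-liner (read_html returns a list of DataFrames, hence the [0]):

df = pd.read_html(table)[0]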
One thing we wanted is to wait for the table to be rendered before trying to retrieve it. We can use the waitForSelector function for that.
I initially tried to use the table selector, but that sometimes returned an empty table. So I chose a class from one of the row elements, td.res-startNo, to be sure that the table was rendered.
import asyncio
import pandas as pd
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.waitForSelector('td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    await browser.close()
    return pd.read_html(table)[0]

df = asyncio.get_event_loop().run_until_complete(main())
df
That's a bit more code than with Requests-HTML, but we get finer control. Let's refactor that code to retrieve all the results of the race.
import asyncio
import pandas as pd
from pyppeteer import launch

URL = 'http://results.neptron.se/#/lundaloppet2018/results?sortOrder=Place&raceId=99&pageSize=25&page={}'

async def get_page(browser, url, selector):
    """Return a page after waiting for the given selector"""
    page = await browser.newPage()
    await page.goto(url)
    await page.waitForSelector(selector)
    return page

async def get_num_pages(browser):
    """Return the total number of pages available"""
    page = await get_page(browser, URL.format(0), 'div.ng-isolate-scope')
    num_pages = await page.querySelectorEval(
        'div.ng-isolate-scope',
        '(element) => element.getAttribute("data-num-pages")')
    return int(num_pages)

async def get_table(browser, page_nb):
    """Return the table from the given page number as a pandas dataframe"""
    print(f'Get table from page {page_nb}')
    page = await get_page(browser, URL.format(page_nb), 'td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    return pd.read_html(table)[0]

async def get_results():
    """Return all the results as a pandas dataframe"""
    browser = await launch()
    num_pages = await get_num_pages(browser)
    print(f'Number of pages: {num_pages}')
    # Python 3.6 asynchronous comprehensions! Nice!
    dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
    await browser.close()
    df = pd.concat(dfs, ignore_index=True)
    return df
This code could be made a bit more generic, but it's good enough for what I want and I think it's quite straightforward. We first get the total number of pages and then load each page to retrieve the table. Once done, we just have to concatenate all those tables into one.
One thing to note is the use of Python asynchronous comprehensions. This is a Python 3.6 feature and it makes the code really Pythonic. It just works as it would with synchronous functions:
dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
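Note that this loads the pages one after the other. We could go further and fetch several pages concurrently with asyncio.gather. Here is a sketch (untested; get_all_tables is my own helper reusing the get_table function defined above, and the semaphore is there to avoid opening 135 tabs at once):

async def get_all_tables(browser, num_pages, max_concurrency=5):
    """Fetch all result pages, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_get_table(page_nb):
        async with semaphore:
            return await get_table(browser, page_nb)

    # gather preserves the order of the results
    return await asyncio.gather(
        *(bounded_get_table(page_nb) for page_nb in range(num_pages)))

For this post though, the sequential version is plenty.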
Let's run that code!
df = asyncio.get_event_loop().run_until_complete(get_results())
That's it! We got all the results from the Lundaloppet 2018 in a nice pandas DataFrame.
len(df)
df.head()
df.tail()
Let's save the result to a CSV file:
df.to_csv('lundaloppet2018.csv', index=False)
Summary
With frameworks like AngularJS, React, Vue.js... more and more websites use client-side rendering. To parse those websites, you can't just request the HTML from the server. Parsing them requires running some JavaScript.
Pyppeteer makes that possible. Thanks to Headless Chromium, it gives you access to the full power of a browser from Python. I find that really impressive!
I tried to use Selenium in the past but didn't find it very easy to get started with. That wasn't the case with Pyppeteer. To be fair, that was a while ago, and the two projects are quite different in scope: Selenium is not just about browser automation, it allows you to perform cross-browser testing, whereas Pyppeteer is limited to Chrome/Chromium. Anyway, I'll probably look more at Pyppeteer for web application testing.
For simple tasks, Requests-HTML is a nice wrapper and gives you a simple API. If you want more control, use Pyppeteer directly.
One last note: to run this code in a Jupyter notebook, you should use tornado 4. asyncio code doesn't play well with IPython and tornado 5. See this GitHub issue: asyncio will be running by default with tornado 5. There is some work in progress for a nicer integration.
What about the Lundaloppet results you might ask? I'll explore them in another post!