# Parsing JavaScript rendered pages in Python with pyppeteer

## Where is my table?

I already wrote a blog post about [Parsing HTML Tables in Python with pandas](https://beenje.github.io/blog/posts/parsing-html-tables-in-python-with-pandas/). Using [requests](http://docs.python-requests.org/en/master/) or even [pandas](https://pandas.pydata.org) directly worked nicely.

I wanted to play with some data from a race I recently ran: [Lundaloppet](http://www.lundaloppet.se/info/resultat/).
The results are available here: http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25
![Results Lundaloppet 2018](/images/pyppeteer/results_lundaloppet_2018.png)

Let's try to get that table!

In [1]:
import pandas as pd

In [2]:
dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')

ValueError: No tables found

No tables found... So what is going on? Let's look at what is returned by requests.

In [3]:
import requests
from IPython.display import display_html

In [4]:
r = requests.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.text

'ï»¿<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" ng-app="app">\r\n<head>\r\n    <title ng-bind="event.name || \'Neptron Timing\'">Neptron Timing</title>\r\n\r\n    <meta charset="utf-8">\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1">\r\n    <meta name="description" content="Neptron Timing event results">\r\n\r\n    <link rel="shortcut icon" href="favicon.ico">\r\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css">\r\n    <link rel="stylesheet" href="content/app.min.css">\r\n    <script src="scripts/iframeResizer.contentWindow.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/es6-shim/0.35.0/es6-shim.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap

In [5]:
display_html(r.text, raw=True)

There is no table in the HTML sent by the server.
The table is rendered on the client side by AngularJS. We can check that by looking at the page source in Chrome:
![Results Lundaloppet 2018 source](/images/pyppeteer/results_lundaloppet_2018_source.png)
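
One detail worth checking is what actually gets requested. Everything after the `#` in that URL is a fragment, and HTTP clients do not send fragments to the server. Here is a small sketch using only the standard library's `urllib.parse` (the variable names are mine) to show what the server sees:

```python
from urllib.parse import urlsplit

url = 'http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25'
parts = urlsplit(url)

# The '?' comes after the '#', so it belongs to the fragment, not to the query string
print(repr(parts.query))     # ''
print(parts.fragment)        # /lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25

# This is the URL that is effectively requested from the server
server_url = parts._replace(fragment='').geturl()
print(server_url)            # http://results.neptron.se/
```

So both pandas and requests only ever ask for the root page. The race, page and page size are read from the fragment by the JavaScript application once it runs in a browser.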

How do you parse a JavaScript-rendered page in Python? Don't we need a browser to run the JavaScript code?
By googling, I found [Requests-HTML](https://github.com/kennethreitz/requests-html), which has JavaScript support!

## Requests-HTML

In [6]:
from requests_html import HTMLSession

In [7]:
session = HTMLSession()
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)

In [8]:
display_html(table.html, raw=True)

| Unnamed: 0 | Place (race) | Place (cat) | Bib no | Category | Name | Unnamed: 6 | Association | Progress | Time | Status |
|---|---|---|---|---|---|---|---|---|---|---|
|  | 1 | 1 | 6922 | P10 | Hans Larsson |  | MAI | Finish | 33:22 | Finished |
|  | 2 | 2 | 6514 | P10 | Filip Helmroth |  | IK Lerum Friidrott | Finish | 33:37 | Finished |
|  | 3 | 3 | 3920 | P10 | David Hartman |  | Björnstorps IF | Finish | 33:39 | Finished |
|  | 4 | 4 | 3926 | P10 | Henrik Orre |  | Björnstorps IF | Finish | 34:24 | Finished |
|  | 5 | 5 | 2666 | P10 | Jesper Bokefors |  | Malmö AI | Finish | 34:51 | Finished |
|  | 6 | 6 | 5729 | P10 | Juan Negreira |  | Lunds universitet | Finish | 35:19 | Finished |
|  | 7 | 7 | 3649 | P10 | Jim Webb |  |  | Finish | 35:23 | Finished |
|  | 8 | 8 | 3675 | P10 | Nils Wetterberg |  | Ekmans Löpare i Lund | Finish | 35:39 | Finished |
|  | 9 | 9 | 4880 | P10 | Hannes Hjalmarsson |  | Lunds kommun | Finish | 35:41 | Finished |
|  | 10 | 10 | 6929 | P10 | Freyi Karlsson |  | Ekmans löpare i lund | Finish | 35:42 | Finished |


Wow! Isn't that magic? We'll explore a bit later how this works.

What I want to get is all the results, not just the first 25. I tried increasing the *pageSize* passed in the URL, but that didn't help. Even passing a lower value always returns 25 rows. Not sure how the API is implemented...

An issue I had with requests-html is that sometimes `r.html.find('table', first=True)` returned `None` or an empty table...

In [9]:
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=63&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
pd.read_html(table.html)[0]

IndexError: list index out of range

That's probably a timing issue (the rendering sometimes takes longer). I tried playing with the `wait` and `sleep` arguments of `r.html.render(wait=1, sleep=1)` but couldn't make the problem go away completely. This is an issue because I don't need just one page but 115.

I started to look at the [requests-html](https://github.com/kennethreitz/requests-html/blob/master/requests_html.py) code to see how this was implemented. That's how I discovered [pyppeteer](https://github.com/miyakogi/pyppeteer).

## Pyppeteer

[Pyppeteer](https://miyakogi.github.io/pyppeteer/) is an unofficial Python port of [puppeteer](https://github.com/GoogleChrome/puppeteer), the JavaScript library for automating (headless) Chrome/Chromium.

> Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

Pyppeteer allows you to do the same from Python.
So there is no *magic*. You just let Chromium load and render the page with the latest JavaScript and browser features. This is super powerful.

The first time you run pyppeteer, it even downloads a recent version of Chromium. So no initial setup is required.

Pyppeteer is based on [asyncio](https://docs.python.org/3/library/asyncio.html). Requests-HTML hides this behind a simple interface, at the cost of some flexibility.
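
To give an idea of what hiding asyncio looks like, here is a minimal sketch of a blocking wrapper around a coroutine. It is not the actual Requests-HTML code: `render` and `render_sync` are hypothetical names, and the coroutine only pretends to render a page.

```python
import asyncio


async def render(url):
    """Stand-in coroutine (hypothetical): a real one would await pyppeteer calls."""
    await asyncio.sleep(0)  # yield to the event loop, as a browser call would
    return f'<html><!-- rendered {url} --></html>'


def render_sync(url):
    """Blocking wrapper: the caller never sees the event loop."""
    return asyncio.run(render(url))


html = render_sync('http://example.com')
print(html)
```

The wrapper is convenient, but the caller can no longer interleave other awaits (waiting for a selector, evaluating JavaScript) between the steps. That is the flexibility you get back by using pyppeteer directly.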

So let's explore pyppeteer. The first example from the documentation is how to take a screenshot of a page.

```python
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```

Let's try that with our page. Note that I pass the `fullPage` option; otherwise the screenshot is cropped to the viewport.

In [10]:
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.screenshot({'path': 'pyppeteer_screenshot.png', 'fullPage': True})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Here is the screenshot taken:
![Pyppeteer screenshot](/images/pyppeteer/pyppeteer_screenshot.png)

Nice, no?
This example showed us how to load a page:

- create a browser
- create a new page
- goto a page
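
The three steps above can be wrapped in a small helper that also guarantees the browser is closed if something fails along the way. This is only a sketch: `open_page` and its `launcher` parameter are names I made up, and the parameter exists so the helper can be tried with a stand-in instead of a real Chromium.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def open_page(url, launcher=None):
    """Yield a loaded page and always close the browser afterwards."""
    if launcher is None:
        # Imported lazily so the helper can be exercised without pyppeteer installed
        from pyppeteer import launch as launcher
    browser = await launcher()          # 1. create a browser
    try:
        page = await browser.newPage()  # 2. create a new page
        await page.goto(url)            # 3. go to the URL
        yield page
    finally:
        # Runs even if goto() or the code using the page raises
        await browser.close()


# Usage inside a coroutine (requires pyppeteer):
#     async with open_page('http://example.com') as page:
#         await page.screenshot({'path': 'example.png'})
```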

Several functions can retrieve elements from the page, like [querySelector](https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.querySelector) or [querySelectorEval](https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.querySelectorEval). The latter is the one we are going to use to retrieve the table. We use the `table` selector and read the element's `outerHTML` property to get the HTML representation of the table:

```python
table = await page.querySelectorEval('table', '(element) => element.outerHTML')
```

We can then pass that to pandas.

We also want to wait for the table to be rendered before trying to retrieve it. We can use the [waitForSelector](https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.waitForSelector) function for that.
I initially tried the `table` selector but that sometimes returned an empty table. So I chose the class of a cell inside a result row, `td.res-startNo`, to be sure that the rows were rendered.

In [11]:
import asyncio
import pandas as pd
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.waitForSelector('td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    await browser.close()
    return pd.read_html(table)[0]

df = asyncio.get_event_loop().run_until_complete(main())
df

|  | Unnamed: 0 | Place(race) | Place(cat) | Bib no | Category | Name | Unnamed: 6 | Association | Progress | Time | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | 1 | 1 | 6922 | P10 | Hans Larsson |  | MAI | Finish | 33:22 | Finished |
| 1 |  | 2 | 2 | 6514 | P10 | Filip Helmroth |  | IK Lerum Friidrott | Finish | 33:37 | Finished |
| 2 |  | 3 | 3 | 3920 | P10 | David Hartman |  | Björnstorps IF | Finish | 33:39 | Finished |
| 3 |  | 4 | 4 | 3926 | P10 | Henrik Orre |  | Björnstorps IF | Finish | 34:24 | Finished |
| 4 |  | 5 | 5 | 2666 | P10 | Jesper Bokefors |  | Malmö AI | Finish | 34:51 | Finished |
| 5 |  | 6 | 6 | 5729 | P10 | Juan Negreira |  | Lunds universitet | Finish | 35:19 | Finished |
| 6 |  | 7 | 7 | 3649 | P10 | Jim Webb |  |  | Finish | 35:23 | Finished |
| 7 |  | 8 | 8 | 3675 | P10 | Nils Wetterberg |  | Ekmans Löpare i Lund | Finish | 35:39 | Finished |
| 8 |  | 9 | 9 | 4880 | P10 | Hannes Hjalmarsson |  | Lunds kommun | Finish | 35:41 | Finished |
| 9 |  | 10 | 10 | 6929 | P10 | Freyi Karlsson |  | Ekmans löpare i lund | Finish | 35:42 | Finished |


That's a bit more code than with requests-HTML but we have finer control.
Let's refactor that code to retrieve all the results of the race.

In [12]:
import asyncio
import pandas as pd
from pyppeteer import launch

URL = 'http://results.neptron.se/#/lundaloppet2018/results?sortOrder=Place&raceId=99&pageSize=25&page={}'


async def get_page(browser, url, selector):
    """Return a page after waiting for the given selector"""
    page = await browser.newPage()
    await page.goto(url)
    await page.waitForSelector(selector)
    return page


async def get_num_pages(browser):
    """Return the total number of pages available"""
    page = await get_page(browser, URL.format(0), 'div.ng-isolate-scope')
    num_pages = await page.querySelectorEval(
        'div.ng-isolate-scope',
        '(element) => element.getAttribute("data-num-pages")')
    # Close the tab so it does not stay open for the rest of the run
    await page.close()
    return int(num_pages)


async def get_table(browser, page_nb):
    """Return the table from the given page number as a pandas dataframe"""
    print(f'Get table from page {page_nb}')
    page = await get_page(browser, URL.format(page_nb), 'td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    # Close the tab once the HTML is extracted, otherwise one tab per page stays open
    await page.close()
    return pd.read_html(table)[0]


async def get_results():
    """Return all the results as a pandas dataframe"""
    browser = await launch()
    num_pages = await get_num_pages(browser)
    print(f'Number of pages: {num_pages}')
    # Python 3.6 allows await inside a comprehension (PEP 530). Pages are loaded one at a time.
    dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
    await browser.close()
    df = pd.concat(dfs, ignore_index=True)
    return df

This code could be made a bit more generic but that's good enough for what I want.
I think it's quite straightforward. We first get the total number of pages and then load each page to retrieve the table.
Once done, we just have to concatenate all those tables in one.

One thing to note is the use of `await` inside a list comprehension, which [PEP 530 (asynchronous comprehensions)](https://www.python.org/dev/peps/pep-0530/) made possible in Python 3.6. It makes the code really Pythonic: it reads just like a synchronous comprehension, and each page is still fetched one after the other:
```python
dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
```
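
To make the sequential behaviour visible, here is a small standard-library sketch that compares the comprehension with `asyncio.gather`. `fake_get_table` is a stand-in I made up that just sleeps instead of driving a browser, so the timings only illustrate the scheduling, not real page loads.

```python
import asyncio
import time


async def fake_get_table(page_nb):
    """Stand-in for get_table: pretend each page takes 0.05 s to load."""
    await asyncio.sleep(0.05)
    return page_nb


async def sequential(num_pages):
    # Same shape as the comprehension above: each await finishes before the next starts
    return [await fake_get_table(n) for n in range(num_pages)]


async def concurrent(num_pages):
    # gather schedules all coroutines at once and returns results in argument order
    return await asyncio.gather(*(fake_get_table(n) for n in range(num_pages)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential(5))
con_result, con_time = timed(concurrent(5))
print(f'sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s')
```

With a real browser, gathering everything at once would open one tab per page at the same time, so the sequential version is the gentler choice for the site and for your machine. Limiting concurrency (for example with an `asyncio.Semaphore`) would be a middle ground, but I haven't tried it here.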

Let's run that code!

In [13]:
df = asyncio.get_event_loop().run_until_complete(get_results())

Number of pages: 115
Get table from page 0
Get table from page 1
Get table from page 2
Get table from page 3
Get table from page 4
Get table from page 5
Get table from page 6
Get table from page 7
Get table from page 8
Get table from page 9
Get table from page 10
Get table from page 11
Get table from page 12
Get table from page 13
Get table from page 14
Get table from page 15
Get table from page 16
Get table from page 17
Get table from page 18
Get table from page 19
Get table from page 20
Get table from page 21
Get table from page 22
Get table from page 23
Get table from page 24
Get table from page 25
Get table from page 26
Get table from page 27
Get table from page 28
Get table from page 29
Get table from page 30
Get table from page 31
Get table from page 32
Get table from page 33
Get table from page 34
Get table from page 35
Get table from page 36
Get table from page 37
Get table from page 38
Get table from page 39
Get table from page 40
Get table from page 41
Get table from page 42


That's it! We got all the results from the Lundaloppet 2018 in a nice pandas DataFrame.

In [14]:
len(df)

2872
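
As a quick sanity check, the number of rows should be consistent with the number of pages reported earlier. A tiny sketch, assuming every page except possibly the last holds 25 rows:

```python
import math

total_rows = 2872  # len(df) from the cell above
page_size = 25     # rows shown per page by the site

# Number of pages needed to hold all rows, the last one possibly partial
expected_pages = math.ceil(total_rows / page_size)
print(expected_pages)  # 115
```

That matches the `Number of pages: 115` printed when the scraping started.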

In [15]:
df.head()

|  | Unnamed: 0 | Place(race) | Place(cat) | Bib no | Category | Name | Unnamed: 6 | Association | Progress | Time | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | 1.0 | 1.0 | 6922 | P10 | Hans Larsson |  | MAI | Finish | 33:22 | Finished |
| 1 |  | 2.0 | 2.0 | 6514 | P10 | Filip Helmroth |  | IK Lerum Friidrott | Finish | 33:37 | Finished |
| 2 |  | 3.0 | 3.0 | 3920 | P10 | David Hartman |  | Björnstorps IF | Finish | 33:39 | Finished |
| 3 |  | 4.0 | 4.0 | 3926 | P10 | Henrik Orre |  | Björnstorps IF | Finish | 34:24 | Finished |
| 4 |  | 5.0 | 5.0 | 2666 | P10 | Jesper Bokefors |  | Malmö AI | Finish | 34:51 | Finished |


In [16]:
df.tail()

|  | Unnamed: 0 | Place(race) | Place(cat) | Bib no | Category | Name | Unnamed: 6 | Association | Progress | Time | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2867 |  |  |  | 6855 | T10 | porntepin sooksaengprasit |  | Lunds universitet |  |  | Not started |
| 2868 |  |  |  | 6857 | P10 | Gabriel Teku |  | Lunds universitet |  |  | Not started |
| 2869 |  |  |  | 6888 | P10 | Viktor Karlsson |  | Genarps if |  |  | Not started |
| 2870 |  |  |  | 6892 | P10 | Emil Larsson |  |  |  |  | Not started |
| 2871 |  |  |  | 6893 | P10 | Göran Larsson |  |  |  |  | Not started |


Let's save the result to a CSV file.

In [17]:
df.to_csv('lundaloppet2018.csv', index=False)

## Summary

With frameworks like [AngularJS](https://angularjs.org), [React](https://reactjs.org), [Vue.js](https://vuejs.org)... more and more websites use client-side rendering. To parse those websites, you can't just request HTML from the server. Parsing them requires running some JavaScript.

[Pyppeteer](https://miyakogi.github.io/pyppeteer/) makes that possible. Thanks to [Headless Chromium](https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md), it gives you access to the full power of a browser from Python. I find that really impressive!

I tried to use [Selenium](https://www.seleniumhq.org) in the past but didn't find it very easy to get started with. That wasn't the case with Pyppeteer. To be fair, it was a while ago and the two projects are quite different. Selenium is not just about browser automation: it allows you to perform cross-browser testing, whereas Pyppeteer is limited to Chrome/Chromium.
Anyway, I'll probably look more at Pyppeteer for web application testing.

For simple tasks, [Requests-HTML](https://github.com/kennethreitz/requests-html) is a nice wrapper and gives you a simple API. If you want more control, use Pyppeteer directly.

One last note. To run this code in a Jupyter notebook, you should use tornado 4, because asyncio code doesn't play well with IPython and tornado 5. See this GitHub issue: [asyncio will be running by default with tornado 5](https://github.com/ipython/ipython/issues/11030). There is some [work in progress](https://github.com/ipython/ipython/pull/11155) for a nice integration.

What about the Lundaloppet results you might ask? I'll explore them in another post!