Parsing JavaScript rendered pages in Python with pyppeteer
Where is my table?
I already wrote a blog post about Parsing HTML Tables in Python with pandas. Using requests or even pandas directly worked nicely.
I wanted to play with some data from a race I recently ran: Lundaloppet. The results are available here: http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25
Let's try to get that table!
import pandas as pd
dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
No tables found... So what is going on? Let's look at what is returned by requests.
import requests
from IPython.display import display_html
r = requests.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.text
display_html(r.text, raw=True)
There is no table in the HTML sent by the server. The table is rendered on the client side by AngularJS. We can check that by looking at the page source in Chrome:
How do you parse a JavaScript rendered page in Python? Don't we need a browser to run the JavaScript code? By googling, I found Requests-HTML, which has JavaScript support!
Requests-HTML
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
display_html(table.html, raw=True)
Wow! Isn't that magic? We'll explore a bit later how this works.
What I want to get is all the results, not just the first 25. I tried increasing the pageSize passed in the URL, but that didn't help. Even passing a lower value always returns 25 rows. Not sure how the API is implemented...
An issue I had with requests-html is that sometimes r.html.find('table', first=True) returned None or an empty table...
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=63&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
pd.read_html(table.html)[0]
That's probably a timing issue (the rendering might take longer sometimes). I tried playing with the wait and sleep arguments of r.html.render(wait=1, sleep=1) but couldn't make the problem completely go away. This is an issue because I don't need just one page but 135.
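A workaround could be to retry the rendering until a non-empty table shows up. Here is a minimal sketch (get_table_with_retry is my own helper; the retry count is arbitrary and I haven't verified that this removes the problem entirely):

import pandas as pd
from requests_html import HTMLSession

def get_table_with_retry(url, retries=3):
    """Render the page repeatedly until a non-empty table is found."""
    session = HTMLSession()
    r = session.get(url)
    for _ in range(retries):
        # render again, giving the JavaScript some more time to run
        r.html.render(sleep=1)
        table = r.html.find('table', first=True)
        # only accept the table once it contains at least one data cell
        if table is not None and '<td' in table.html:
            return pd.read_html(table.html)[0]
    raise RuntimeError(f'no table rendered after {retries} attempts')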
I started to look at requests-html code to see how this was implemented. That's how I discovered pyppeteer.
Pyppeteer
Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library.
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
Pyppeteer allows you to do the same from Python. So there is no magic. You just let Chromium load and render the page with the latest JavaScript and browser features. This is super powerful.
The first time you run pyppeteer, it even downloads a recent version of Chromium. So no initial setup is required.
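By the way, if you prefer to use a Chrome/Chromium you already have installed, launch accepts an executablePath option (a small sketch; the path is an example, adjust it for your system):

from pyppeteer import launch

async def open_browser():
    # use a locally installed Chromium instead of the downloaded one
    # (the path is an example, adjust it for your system)
    return await launch(executablePath='/usr/bin/chromium-browser')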
Pyppeteer is based on asyncio. Requests-HTML hides that behind a simple interface, which of course means less flexibility.
So let's explore pyppeteer. The first example from the documentation shows how to take a screenshot of a page.
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Let's try that with our page. Note that I pass the fullPage option, otherwise the page is cut off.
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.screenshot({'path': 'pyppeteer_screenshot.png', 'fullPage': True})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Here is the screenshot taken:
Nice, no? This example showed us how to load a page:
- create a browser
- create a new page
- goto a page
There are several functions that can be used to retrieve elements from the page, like querySelector or querySelectorEval. The latter is the one we're going to use to retrieve the table. We pass the table selector and apply the outerHTML function to get the HTML representation of the table:
table = await page.querySelectorEval('table', '(element) => element.outerHTML')
We can then pass that to pandas.
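Since querySelectorEval returns the table HTML as a string, feeding it to pandas is a one-liner (read_html returns a list of DataFrames, hence the [0]):

df = pd.read_html(table)[0]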
One thing we wanted is to wait for the table to be rendered before trying to retrieve it. We can use the waitForSelector function for that.
I initially tried to use the table selector, but that sometimes returned an empty table. So I chose a class from one of the row elements, td.res-startNo, to be sure that the table was rendered.
import asyncio
import pandas as pd
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.waitForSelector('td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    await browser.close()
    return pd.read_html(table)[0]

df = asyncio.get_event_loop().run_until_complete(main())
df
That's a bit more code than with Requests-HTML, but we get finer control. Let's refactor that code to retrieve all the results of the race.
import asyncio
import pandas as pd
from pyppeteer import launch

URL = 'http://results.neptron.se/#/lundaloppet2018/results?sortOrder=Place&raceId=99&pageSize=25&page={}'

async def get_page(browser, url, selector):
    """Return a page after waiting for the given selector"""
    page = await browser.newPage()
    await page.goto(url)
    await page.waitForSelector(selector)
    return page

async def get_num_pages(browser):
    """Return the total number of pages available"""
    page = await get_page(browser, URL.format(0), 'div.ng-isolate-scope')
    num_pages = await page.querySelectorEval(
        'div.ng-isolate-scope',
        '(element) => element.getAttribute("data-num-pages")')
    return int(num_pages)

async def get_table(browser, page_nb):
    """Return the table from the given page number as a pandas dataframe"""
    print(f'Get table from page {page_nb}')
    page = await get_page(browser, URL.format(page_nb), 'td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    return pd.read_html(table)[0]

async def get_results():
    """Return all the results as a pandas dataframe"""
    browser = await launch()
    num_pages = await get_num_pages(browser)
    print(f'Number of pages: {num_pages}')
    # Python 3.6 asynchronous comprehensions! Nice!
    dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
    await browser.close()
    df = pd.concat(dfs, ignore_index=True)
    return df
This code could be made a bit more generic, but it's good enough for what I want and I think it's quite straightforward. We first get the total number of pages and then load each page to retrieve the table. Once done, we just have to concatenate all those tables into one.
One thing to note is the use of Python asynchronous comprehensions. This is a Python 3.6 feature and it makes the code really Pythonic. It just works as it would with synchronous functions:
dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
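Note that this loads the pages one after the other. We could go further and fetch several pages concurrently with asyncio.gather. Here is a sketch (untested; get_all_tables is my own helper reusing the get_table function defined above, and the semaphore is there to avoid opening 135 tabs at once):

async def get_all_tables(browser, num_pages, max_concurrency=5):
    """Fetch all result pages, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_get_table(page_nb):
        async with semaphore:
            return await get_table(browser, page_nb)

    # gather preserves the order of the results
    return await asyncio.gather(
        *(bounded_get_table(page_nb) for page_nb in range(num_pages)))

For this post though, the sequential version is plenty.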
Let's run that code!
df = asyncio.get_event_loop().run_until_complete(get_results())
That's it! We got all the results from the Lundaloppet 2018 in a nice pandas DataFrame.
len(df)
df.head()
df.tail()
Let's save the result to a CSV file:
df.to_csv('lundaloppet2018.csv', index=False)
Summary
With frameworks like AngularJS, React, Vue.js... more and more websites use client-side rendering. To parse those websites, you can't just request the HTML from the server. Parsing them requires running some JavaScript.
Pyppeteer makes that possible. Thanks to Headless Chromium, it gives you access to the full power of a browser from Python. I find that really impressive!
I tried to use Selenium in the past but didn't find it very easy to get started with. That wasn't the case with Pyppeteer. To be fair, that was a while ago, and the two projects are quite different in scope: Selenium is not just about browser automation, it allows you to perform cross-browser testing, whereas Pyppeteer is limited to Chrome/Chromium. Anyway, I'll probably look more at Pyppeteer for web application testing.
For simple tasks, Requests-HTML is a nice wrapper and gives you a simple API. If you want more control, use Pyppeteer directly.
One last note: to run this code in a Jupyter notebook, you should use tornado 4. asyncio code doesn't play well with IPython and tornado 5. See this GitHub issue: asyncio will be running by default with tornado 5. There is some work in progress for a nicer integration.
What about the Lundaloppet results you might ask? I'll explore them in another post!