Parsing JavaScript rendered pages in Python with pyppeteer

Where is my table?

I already wrote a blog post about Parsing HTML Tables in Python with pandas. Using requests, or even pandas directly, worked nicely.

I wanted to play with some data from a race I recently ran: Lundaloppet. The results are available here: http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25

[Image: Results Lundaloppet 2018]

Let's try to get that table!

In [1]:
import pandas as pd
In [2]:
dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-881110a1fe3d> in <module>()
----> 1 dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')

~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
    985                   decimal=decimal, converters=converters, na_values=na_values,
    986                   keep_default_na=keep_default_na,
--> 987                   displayed_only=displayed_only)

~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    813             break
    814     else:
--> 815         raise_with_traceback(retained)
    816 
    817     ret = []

~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    401         if traceback == Ellipsis:
    402             _, _, traceback = sys.exc_info()
--> 403         raise exc.with_traceback(traceback)
    404 else:
    405     # this version of raise is a syntax error in Python 3

ValueError: No tables found

No tables found... So what is going on? Let's look at what is returned by requests.

In [3]:
import requests
from IPython.display import display_html
In [4]:
r = requests.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.text
Out[4]:
'<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" ng-app="app">\r\n<head>\r\n    <title ng-bind="event.name || \'Neptron Timing\'">Neptron Timing</title>\r\n\r\n    <meta charset="utf-8">\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1">\r\n    <meta name="description" content="Neptron Timing event results">\r\n\r\n    <link rel="shortcut icon" href="favicon.ico">\r\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css">\r\n    <link rel="stylesheet" href="content/app.min.css">\r\n    <script src="scripts/iframeResizer.contentWindow.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/es6-shim/0.35.0/es6-shim.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.6/js/bootstrap.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/angular.js/1.4.8/angular.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/angular.js/1.4.8/angular-route.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.13.0/moment.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.0.2/Chart.min.js"></script>\r\n    <script src="https://maps.googleapis.com/maps/api/js?key=AIzaSyD7OPJoYN6W9qUHU1L_fEr_5ut8tQN8r2A"></script>\r\n</head>\r\n<body>\r\n    <div class="navbar navbar-inverse navbar-static-top" role="navigation">\r\n        <div class="container">\r\n            <div class="navbar-header">\r\n                <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">\r\n                    <span class="sr-only">Toggle navigation</span>\r\n                    <span class="icon-bar"></span>\r\n                  
  <span class="icon-bar"></span>\r\n                    <span class="icon-bar"></span>\r\n                </button>\r\n                <a class="navbar-brand" href="#">Neptron Timing</a>\r\n            </div>\r\n            <div class="collapse navbar-collapse">\r\n                <ul class="nav navbar-nav">\r\n                    <li><a href="#/">Events</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/event">Info</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/results">Results</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/leaderboard">Leaderboard</a></li>\r\n                    <li ng-show="event.id && event.tracking"><a href="#/{{event.id}}/tracking">Tracking</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/favorites">Favorites</a></li>\r\n                    <li ng-show="event.id && event.sprints.length > 0"><a href="#/{{event.id}}/sprint">Sprint</a></li>\r\n                    <li ng-show="event.id && event.teamCompetitions.length > 0"><a href="#/{{event.id}}/teams">Teams</a></li>\r\n                </ul>\r\n            </div><!--/.nav-collapse -->\r\n        </div>\r\n    </div>\r\n  <script type="text/javascript">\r\n\r\nvar fixLidingloppetMessage = function() {\r\n\tvar str = window.location.href || \'\';\r\n\tvar cssStyle = (str.match(\'lidingolor2017\') ? 
\'\' : \'none\');\r\n\t//console.log(\'changed: \'+str, cssStyle);\r\n\t$(\'#nytamin-fix\').css(\'display\', cssStyle);\r\n}\r\n$(window).bind(\'hashchange\', function() {\r\n\tfixLidingloppetMessage();\r\n});\r\nwindow.setInterval(fixLidingloppetMessage, 1000);\r\n\r\n</script>\r\n\r\n<div class="container-fluid">\r\n\t<div id="nytamin-fix" class="panel panel-primary" style="display: none; margin: 2em;">\r\n\t  <div class="panel-heading">Liding&ouml;loppet.se</div>\r\n\t  <div class="panel-body">\r\n\t\t\r\n\t\t<strong><a href="http://213.39.39.152">Click here to get back to Liding&ouml;loppet\'s homepage!</a></strong>\r\n\r\n\t  </div>\r\n\t</div>\r\n</div>\r\n    <div class="container-fluid" ng-view></div>\r\n  <div class="nt-app-links" style="margin:10px 20px">\r\n    <a href="https://www.raceone.com/redirect" id="download-icon-footer-ios" target="_blank">\r\n      <img class="download-icon" src="https://www.raceone.com/wp-content/uploads/2016/09/app-store-e1475238488598.png" alt="">\r\n    </a>\r\n    <a href="https://www.raceone.com/redirect" id="download-icon-footer-android" target="_blank">\r\n      <img class="download-icon" src="https://www.raceone.com/wp-content/uploads/2016/09/google-play-e1475238513871.png" alt="">\r\n    </a>\r\n  </div>\r\n\r\n    <script type="text/javascript" src="scripts/app.js"></script>\r\n\r\n    <!-- AddThis Button BEGIN -->\r\n    <div class="addthis_toolbox addthis_default_style addthis_32x32_style">\r\n        <a class="addthis_button_facebook"></a>\r\n        <a class="addthis_button_twitter"></a>\r\n        <a class="addthis_button_linkedin"></a>\r\n        <a class="addthis_button_email"></a>\r\n        <a class="addthis_button_print"></a>\r\n        <a class="addthis_button_textme"></a>\r\n        <a class="addthis_button_compact"></a>\r\n    </div>\r\n    <script type="text/javascript" src="//s7.addthis.com/js/300/addthis_widget.js#pubid=ra-5364e093794f9d2f"></script>\r\n    <!-- AddThis Button END -->\r\n\r\n    
<!--<div class="applinks">\r\n        <a href="https://itunes.apple.com/se/app/neptron-timing/id709776903" target="_blank"><img class="appstore" alt="Get it on iTunes" src="content/appstore.svg" /></a>\r\n        <a href="https://play.google.com/store/apps/details?id=se.neptron.timing" target="_blank"><img class="playstore" alt="Get it on Google Play" src="content/playstore.png" /></a>\r\n    </div>-->\r\n\r\n</body>\r\n</html>\r\n'
In [5]:
display_html(r.text, raw=True)
 Neptron Timing

There is no table in the HTML sent by the server. The table is rendered on the client side by AngularJS. We can check that by looking at the page source in Chrome:

[Image: Results Lundaloppet 2018 source]
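
The same check can be done without opening a browser. Here is a minimal sketch using only the standard library; the TableCounter class, the count_tables helper and the two sample HTML strings are made up for illustration and are not part of the original notebook.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'table':
            self.tables += 1


def count_tables(html):
    """Return the number of <table> elements found in the given HTML string."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical samples: a shell page with only a placeholder for the app,
# and a page where the table is already present in the markup.
shell = '<html><body><div class="container-fluid" ng-view></div></body></html>'
rendered = '<html><body><table><tr><td>1</td></tr></table></body></html>'
print(count_tables(shell), count_tables(rendered))
```

Calling count_tables(r.text) on the response above should tell you whether pandas has any chance of finding a table in it.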

How do you parse a JavaScript-rendered page in Python? Don't we need a browser to run the JavaScript code? By googling, I found Requests-HTML, which has JavaScript support!

Requests-HTML

In [6]:
from requests_html import HTMLSession
In [7]:
session = HTMLSession()
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
In [8]:
display_html(table.html, raw=True)
Place (race)  Place (cat)  Bib no  Category  Name                       Association            Progress  Time   Status
1             1            6922    P10       Hans Larsson               MAI                    Finish    33:22  Finished
2             2            6514    P10       Filip Helmroth             IK Lerum Friidrott     Finish    33:37  Finished
3             3            3920    P10       David Hartman              Björnstorps IF         Finish    33:39  Finished
4             4            3926    P10       Henrik Orre                Björnstorps IF         Finish    34:24  Finished
5             5            2666    P10       Jesper Bokefors            Malmö AI               Finish    34:51  Finished
6             6            5729    P10       Juan Negreira              Lunds universitet      Finish    35:19  Finished
7             7            3649    P10       Jim Webb                                          Finish    35:23  Finished
8             8            3675    P10       Nils Wetterberg            Ekmans Löpare i Lund   Finish    35:39  Finished
9             9            4880    P10       Hannes Hjalmarsson         Lunds kommun           Finish    35:41  Finished
10            10           6929    P10       Freyi Karlsson             Ekmans löpare i lund   Finish    35:42  Finished
11            11           5995    P10       Shijie Xu                  Lunds universitet      Finish    35:43  Finished
12            12           5276    P10       Stuart Ansell              Lunds universitet      Finish    36:02  Finished
13            13           3917    P10       Christer Friberg           Björnstorps IF         Finish    36:15  Finished
14            14           5647    P10       Roger Lindskog             Lunds universitet      Finish    36:15  Finished
15            15           3616    P10       Andreas Thell              Ystads IF Friidrott    Finish    36:20  Finished
16            16           6382    P10       Tommy Olofsson             Tetra Pak IF           Finish    36:20  Finished
17            17           3183    P10       Kristoffer Loo                                    Finish    36:36  Finished
18            18           2664    P10       Alfred Bodenäs             Triathlon Syd          Finish    36:44  Finished
19            19           6979    P10       Daniel Jonsson                                    Finish    36:54  Finished
20            20           4977    P10       Johan Lindgren             Lunds kommun           Finish    36:58  Finished
21            21           3495    P10       Erik Schultz-Eklund        Agape Lund             Finish    37:20  Finished
22            22           3571    P10       Daniel Strandberg          Malmö AI               Finish    37:28  Finished
23            23           3121    P10       Martin Larsson             inQore-part of Qgroup  Finish    37:32  Finished
24            24           5955    P10       Johan Vallon-Christersson  Lunds universitet      Finish    37:33  Finished
25            25           6675    P10       Kristian Haggärde          Björnstorps IF         Finish    37:34  Finished

Wow! Isn't that magic? We'll explore a bit later how this works.

What I want to get is all the results, not just the first 25. I tried increasing the pageSize passed in the URL, but that didn't help. Even passing a lower value always returns 25 rows. Not sure how the API is implemented...
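
One detail that may be relevant here: in this URL the paging parameters come after the #, so they are part of the fragment, and an HTTP client does not send the fragment to the server. The sketch below only splits the URL with the standard library to show that. How the AngularJS app then turns page and pageSize into its own API calls is not something I have checked.

```python
from urllib.parse import urlsplit, parse_qs

url = 'http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25'
parts = urlsplit(url)

# What actually goes into the HTTP request: the path and the (empty) query.
print(repr(parts.path), repr(parts.query))

# Everything after '#' stays on the client side.
print(parts.fragment)

# The route and the paging parameters both live inside the fragment.
route, _, params = parts.fragment.partition('?')
print(route, parse_qs(params))
```

So whatever value is passed as pageSize, the server sees the same request for "/". It is the client-side code that decides what to do with it.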

An issue I had with requests-html is that sometimes r.html.find('table', first=True) returned None or an empty table...

In [9]:
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=63&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
pd.read_html(table.html)[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-e9d6c036862c> in <module>()
      2 r.html.render()
      3 table = r.html.find('table', first=True)
----> 4 pd.read_html(table.html)[0]

IndexError: list index out of range

That's probably a timing issue (the rendering might take longer sometimes). I tried playing with the wait and sleep arguments of r.html.render(wait=1, sleep=1) but couldn't make the problem go away completely. This is an issue because I don't need just one page but 115.
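
A common workaround for this kind of flakiness is to retry until something usable comes back. The sketch below is generic and is not what I ended up using: retry is a hypothetical helper, and fetch stands for any function that renders the page and returns the table HTML (or nothing when the table isn't there yet).

```python
import time


def retry(fetch, attempts=5, delay=1.0):
    """Call fetch() until it returns a non-empty result or attempts run out.

    fetch is any zero-argument callable. A falsy result (None, '' or an
    empty list) is treated as "not rendered yet".
    """
    for attempt in range(1, attempts + 1):
        result = fetch()
        if result:
            return result
        if attempt < attempts:
            time.sleep(delay)  # give the page some time before trying again
    raise RuntimeError(f'no result after {attempts} attempts')


# Stand-in for "render the page and look for the table":
# it fails twice, then succeeds.
outcomes = iter([None, '', '<table>...</table>'])
print(retry(lambda: next(outcomes), attempts=5, delay=0))
```

Retrying hides the problem rather than solving it, and with many pages the extra waiting adds up.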

I started to look at requests-html code to see how this was implemented. That's how I discovered pyppeteer.

Pyppeteer

Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript library for automating (headless) Chrome/Chromium.

Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

Pyppeteer allows you to do the same from Python. So there is no magic. You just let Chromium load and render the page with the latest JavaScript and browser features. This is super powerful.

The first time you run pyppeteer, it even downloads a recent version of Chromium. So no initial setup is required.

Pyppeteer is based on asyncio. Requests-HTML hides this behind a simple interface, at the cost of some flexibility.
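
To give an idea of what "hiding asyncio" can look like, here is a minimal sketch of a blocking wrapper around a coroutine. It is not the actual Requests-HTML code. fetch_title and fetch_title_sync are made-up names, and the coroutine only pretends to talk to a browser.

```python
import asyncio


async def fetch_title(url):
    """Stand-in coroutine; a real one would drive the browser with pyppeteer."""
    await asyncio.sleep(0)  # pretend to wait for the browser
    return f'title of {url}'


def fetch_title_sync(url):
    """Blocking wrapper: run the coroutine to completion on a private event loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(fetch_title(url))
    finally:
        loop.close()


print(fetch_title_sync('http://example.com'))
```

The caller never sees async or await. The price is that such a wrapper can't be called from code that is already running inside an event loop, and you can no longer interleave several operations yourself.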

So let's explore pyppeteer. The first example from the documentation is how to take a screenshot of a page.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Let's try that with our page. Note that I pass the fullPage option; otherwise the screenshot is cropped to the viewport.

In [10]:
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.screenshot({'path': 'pyppeteer_screenshot.png', 'fullPage': True})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Here is the screenshot taken:

[Image: Pyppeteer screenshot]

Nice, no? This example showed us how to load a page:

  • create a browser
  • create a new page
  • goto a page
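
These three steps, plus closing the browser at the end, can be wrapped in a small helper so that the browser is closed even if something fails in between. This is only a sketch: open_page is a hypothetical name, it needs Python 3.7+ for asynccontextmanager, and the launch function is passed in as an argument (pyppeteer's launch in real use) so the helper itself doesn't import anything outside the standard library.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def open_page(launch, url):
    """Yield a loaded page and always close the browser afterwards.

    launch is expected to be pyppeteer.launch (an assumption of this
    sketch); any coroutine function returning an object with newPage()
    and close() works.
    """
    browser = await launch()            # create a browser
    try:
        page = await browser.newPage()  # create a new page
        await page.goto(url)            # go to the URL
        yield page
    finally:
        await browser.close()


# Intended usage, left as comments because it needs pyppeteer and Chromium:
#
#   from pyppeteer import launch
#
#   async def main():
#       async with open_page(launch, 'http://example.com') as page:
#           await page.screenshot({'path': 'example.png'})
#
#   asyncio.get_event_loop().run_until_complete(main())
```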

There are several functions that can be used to retrieve elements from the page, like querySelector or querySelectorEval. The latter is the one we are going to use to retrieve the table. We select the table element and read its outerHTML property to get the HTML representation of the table:

table = await page.querySelectorEval('table', '(element) => element.outerHTML')

We can then pass that to pandas.

We also want to wait for the table to be rendered before trying to retrieve it. We can use the waitForSelector function for that. I initially tried to wait on the table selector, but that sometimes returned an empty table. So I chose the class of a cell inside a row, td.res-startNo, to be sure that the rows were rendered.
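
Conceptually, waiting for a selector means checking a condition repeatedly until it holds or a timeout expires. The sketch below illustrates that idea with plain asyncio. It is not how pyppeteer implements waitForSelector, and wait_for is a made-up helper. It also shows why the choice of condition matters: a check that is satisfied by the empty table comes back too early, while a check on the rows comes back only once they exist.

```python
import asyncio
import time


async def wait_for(predicate, timeout=30.0, interval=0.1):
    """Poll predicate() until it returns something truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        await asyncio.sleep(interval)


# Stand-in for a page whose rows only show up on the third check.
checks = iter([[], [], ['td.res-startNo']])
loop = asyncio.new_event_loop()
print(loop.run_until_complete(wait_for(lambda: next(checks), interval=0.01)))
loop.close()
```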

In [11]:
import asyncio
import pandas as pd
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.waitForSelector('td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    await browser.close()
    return pd.read_html(table)[0]

df = asyncio.get_event_loop().run_until_complete(main())
df
Out[11]:
Unnamed: 0 Place(race) Place(cat) Bib no Category Name Unnamed: 6 Association Progress Time Status
0 NaN 1 1 6922 P10 Hans Larsson NaN MAI Finish 33:22 Finished
1 NaN 2 2 6514 P10 Filip Helmroth NaN IK Lerum Friidrott Finish 33:37 Finished
2 NaN 3 3 3920 P10 David Hartman NaN Björnstorps IF Finish 33:39 Finished
3 NaN 4 4 3926 P10 Henrik Orre NaN Björnstorps IF Finish 34:24 Finished
4 NaN 5 5 2666 P10 Jesper Bokefors NaN Malmö AI Finish 34:51 Finished
5 NaN 6 6 5729 P10 Juan Negreira NaN Lunds universitet Finish 35:19 Finished
6 NaN 7 7 3649 P10 Jim Webb NaN NaN Finish 35:23 Finished
7 NaN 8 8 3675 P10 Nils Wetterberg NaN Ekmans Löpare i Lund Finish 35:39 Finished
8 NaN 9 9 4880 P10 Hannes Hjalmarsson NaN Lunds kommun Finish 35:41 Finished
9 NaN 10 10 6929 P10 Freyi Karlsson NaN Ekmans löpare i lund Finish 35:42 Finished
10 NaN 11 11 5995 P10 Shijie Xu NaN Lunds universitet Finish 35:43 Finished
11 NaN 12 12 5276 P10 Stuart Ansell NaN Lunds universitet Finish 36:02 Finished
12 NaN 13 13 3917 P10 Christer Friberg NaN Björnstorps IF Finish 36:15 Finished
13 NaN 14 14 5647 P10 Roger Lindskog NaN Lunds universitet Finish 36:15 Finished
14 NaN 15 15 3616 P10 Andreas Thell NaN Ystads IF Friidrott Finish 36:20 Finished
15 NaN 16 16 6382 P10 Tommy Olofsson NaN Tetra Pak IF Finish 36:20 Finished
16 NaN 17 17 3183 P10 Kristoffer Loo NaN NaN Finish 36:36 Finished
17 NaN 18 18 2664 P10 Alfred Bodenäs NaN Triathlon Syd Finish 36:44 Finished
18 NaN 19 19 6979 P10 Daniel Jonsson NaN NaN Finish 36:54 Finished
19 NaN 20 20 4977 P10 Johan Lindgren NaN Lunds kommun Finish 36:58 Finished
20 NaN 21 21 3495 P10 Erik Schultz-Eklund NaN Agape Lund Finish 37:20 Finished
21 NaN 22 22 3571 P10 Daniel Strandberg NaN Malmö AI Finish 37:28 Finished
22 NaN 23 23 3121 P10 Martin Larsson NaN inQore-part of Qgroup Finish 37:32 Finished
23 NaN 24 24 5955 P10 Johan Vallon-Christersson NaN Lunds universitet Finish 37:33 Finished
24 NaN 25 25 6675 P10 Kristian Haggärde NaN Björnstorps IF Finish 37:34 Finished

That's a bit more code than with requests-HTML but we have finer control. Let's refactor that code to retrieve all the results of the race.

In [12]:
import asyncio
import pandas as pd
from pyppeteer import launch

URL = 'http://results.neptron.se/#/lundaloppet2018/results?sortOrder=Place&raceId=99&pageSize=25&page={}'


async def get_page(browser, url, selector):
    """Return a page after waiting for the given selector"""
    page = await browser.newPage()
    await page.goto(url)
    await page.waitForSelector(selector)
    return page


async def get_num_pages(browser):
    """Return the total number of pages available"""
    page = await get_page(browser, URL.format(0), 'div.ng-isolate-scope')
    num_pages = await page.querySelectorEval(
        'div.ng-isolate-scope',
        '(element) => element.getAttribute("data-num-pages")')
    return int(num_pages)


async def get_table(browser, page_nb):
    """Return the table from the given page number as a pandas dataframe"""
    print(f'Get table from page {page_nb}')
    page = await get_page(browser, URL.format(page_nb), 'td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    return pd.read_html(table)[0]


async def get_results():
    """Return all the results as a pandas dataframe"""
    browser = await launch()
    num_pages = await get_num_pages(browser)
    print(f'Number of pages: {num_pages}')
    # Python 3.6 asynchronous comprehensions! Nice!
    dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
    await browser.close()
    df = pd.concat(dfs, ignore_index=True)
    return df

This code could be made a bit more generic but that's good enough for what I want. I think it's quite straightforward. We first get the total number of pages and then load each page to retrieve the table. Once done, we just have to concatenate all those tables in one.

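One thing to note is the use of await inside a list comprehension. This has been allowed since Python 3.6 (PEP 530, Asynchronous Comprehensions) and makes the code really Pythonic. It works just as it would with synchronous functions, so the pages are fetched one after the other: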

dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
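
Because each await in that comprehension finishes before the next one starts, the pages are loaded sequentially. If that turns out to be too slow, one option is to fetch several pages concurrently with asyncio.gather, using a semaphore to limit how many are in flight. This is a sketch I did not run against the real site: gather_pages and fake_fetch are made-up names, and fake_fetch stands in for get_table.

```python
import asyncio


async def gather_pages(fetch, num_pages, limit=5):
    """Fetch pages 0..num_pages-1 concurrently, at most `limit` at a time.

    fetch is a function taking a page number and returning an awaitable,
    for example: lambda page_nb: get_table(browser, page_nb)
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(page_nb):
        async with semaphore:
            return await fetch(page_nb)

    # gather() returns the results in the order the awaitables were passed in,
    # whatever order they finish in.
    return await asyncio.gather(*(bounded(n) for n in range(num_pages)))


async def fake_fetch(page_nb):
    """Stand-in for get_table: later pages finish first."""
    await asyncio.sleep(0.01 * (3 - page_nb))
    return f'table {page_nb}'


loop = asyncio.new_event_loop()
print(loop.run_until_complete(gather_pages(fake_fetch, 3)))
loop.close()
```

Whether Chromium and the results server are happy with several tabs loading at once is something to try carefully, starting with a small limit.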

Let's run that code!

In [13]:
df = asyncio.get_event_loop().run_until_complete(get_results())
Number of pages: 115
Get table from page 0
Get table from page 1
Get table from page 2
Get table from page 3
Get table from page 4
Get table from page 5
Get table from page 6
Get table from page 7
Get table from page 8
Get table from page 9
Get table from page 10
Get table from page 11
Get table from page 12
Get table from page 13
Get table from page 14
Get table from page 15
Get table from page 16
Get table from page 17
Get table from page 18
Get table from page 19
Get table from page 20
Get table from page 21
Get table from page 22
Get table from page 23
Get table from page 24
Get table from page 25
Get table from page 26
Get table from page 27
Get table from page 28
Get table from page 29
Get table from page 30
Get table from page 31
Get table from page 32
Get table from page 33
Get table from page 34
Get table from page 35
Get table from page 36
Get table from page 37
Get table from page 38
Get table from page 39
Get table from page 40
Get table from page 41
Get table from page 42
Get table from page 43
Get table from page 44
Get table from page 45
Get table from page 46
Get table from page 47
Get table from page 48
Get table from page 49
Get table from page 50
Get table from page 51
Get table from page 52
Get table from page 53
Get table from page 54
Get table from page 55
Get table from page 56
Get table from page 57
Get table from page 58
Get table from page 59
Get table from page 60
Get table from page 61
Get table from page 62
Get table from page 63
Get table from page 64
Get table from page 65
Get table from page 66
Get table from page 67
Get table from page 68
Get table from page 69
Get table from page 70
Get table from page 71
Get table from page 72
Get table from page 73
Get table from page 74
Get table from page 75
Get table from page 76
Get table from page 77
Get table from page 78
Get table from page 79
Get table from page 80
Get table from page 81
Get table from page 82
Get table from page 83
Get table from page 84
Get table from page 85
Get table from page 86
Get table from page 87
Get table from page 88
Get table from page 89
Get table from page 90
Get table from page 91
Get table from page 92
Get table from page 93
Get table from page 94
Get table from page 95
Get table from page 96
Get table from page 97
Get table from page 98
Get table from page 99
Get table from page 100
Get table from page 101
Get table from page 102
Get table from page 103
Get table from page 104
Get table from page 105
Get table from page 106
Get table from page 107
Get table from page 108
Get table from page 109
Get table from page 110
Get table from page 111
Get table from page 112
Get table from page 113
Get table from page 114

That's it! We got all the results from the Lundaloppet 2018 in a nice pandas DataFrame.

In [14]:
len(df)
Out[14]:
2872
In [15]:
df.head()
Out[15]:
Unnamed: 0 Place(race) Place(cat) Bib no Category Name Unnamed: 6 Association Progress Time Status
0 NaN 1.0 1.0 6922 P10 Hans Larsson NaN MAI Finish 33:22 Finished
1 NaN 2.0 2.0 6514 P10 Filip Helmroth NaN IK Lerum Friidrott Finish 33:37 Finished
2 NaN 3.0 3.0 3920 P10 David Hartman NaN Björnstorps IF Finish 33:39 Finished
3 NaN 4.0 4.0 3926 P10 Henrik Orre NaN Björnstorps IF Finish 34:24 Finished
4 NaN 5.0 5.0 2666 P10 Jesper Bokefors NaN Malmö AI Finish 34:51 Finished
In [16]:
df.tail()
Out[16]:
Unnamed: 0 Place(race) Place(cat) Bib no Category Name Unnamed: 6 Association Progress Time Status
2867 NaN NaN NaN 6855 T10 porntepin sooksaengprasit NaN Lunds universitet NaN NaN Not started
2868 NaN NaN NaN 6857 P10 Gabriel Teku NaN Lunds universitet NaN NaN Not started
2869 NaN NaN NaN 6888 P10 Viktor Karlsson NaN Genarps if NaN NaN Not started
2870 NaN NaN NaN 6892 P10 Emil Larsson NaN NaN NaN NaN Not started
2871 NaN NaN NaN 6893 P10 Göran Larsson NaN NaN NaN NaN Not started

Let's save the result to a CSV file.

In [17]:
df.to_csv('lundaloppet2018.csv', index=False)

Summary

With frameworks like AngularJS, React and Vue.js, more and more websites use client-side rendering. To parse those websites, you can't just request the HTML from the server: you need to run some JavaScript.

Pyppeteer makes that possible. Thanks to Headless Chromium, it gives you access to the full power of a browser from Python. I find that really impressive!

I tried to use Selenium in the past but didn't find it very easy to get started with. That wasn't the case with Pyppeteer. To be fair, that was a while ago, and the two projects are quite different. Selenium is not just about browser automation: it also allows you to perform cross-browser testing, whereas Pyppeteer is limited to Chrome/Chromium. Anyway, I'll probably look more at Pyppeteer for web application testing.

For simple tasks, Requests-HTML is a nice wrapper and gives you a simple API. If you want more control, use Pyppeteer directly.

One last note. To run this code in a Jupyter notebook, you should use tornado 4. asyncio code doesn't play well with IPython and tornado 5. See this GitHub issue: "asyncio will be running by default with tornado 5". There is some work in progress for a nicer integration.

What about the Lundaloppet results you might ask? I'll explore them in another post!
