{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parsing JavaScript rendered pages in Python with pyppeteer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Where is my table?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I already wrote a blog post about [Parsing HTML Tables in Python with pandas](https://beenje.github.io/blog/posts/parsing-html-tables-in-python-with-pandas/). Using [requests](http://docs.python-requests.org/en/master/) or even [pandas](https://pandas.pydata.org) directly worked nicely.\n", "\n", "I wanted to play with some data from a race I recently ran: [Lundaloppet](http://www.lundaloppet.se/info/resultat/).\n", "The results are available here: http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25\n", "![Results Lundaloppet 2018](/images/pyppeteer/results_lundaloppet_2018.png)\n", "\n", "Let's try to get that table!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "No tables found", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdfs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_html\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36mread_html\u001b[0;34m(io, match, flavor, header, index_col, skiprows, attrs, 
parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)\u001b[0m\n\u001b[1;32m 985\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdecimal\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconverters\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconverters\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mna_values\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_values\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 986\u001b[0m \u001b[0mkeep_default_na\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkeep_default_na\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 987\u001b[0;31m displayed_only=displayed_only)\n\u001b[0m", "\u001b[0;32m~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36m_parse\u001b[0;34m(flavor, io, match, attrs, encoding, displayed_only, **kwargs)\u001b[0m\n\u001b[1;32m 813\u001b[0m \u001b[0;32mbreak\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 814\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 815\u001b[0;31m \u001b[0mraise_with_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mretained\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 816\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 817\u001b[0m \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/compat/__init__.py\u001b[0m in \u001b[0;36mraise_with_traceback\u001b[0;34m(exc, traceback)\u001b[0m\n\u001b[1;32m 401\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtraceback\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mEllipsis\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 402\u001b[0m \u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtraceback\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexc_info\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 403\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtraceback\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 404\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 405\u001b[0m \u001b[0;31m# this version of raise is a syntax error in Python 3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: No tables found" ] } ], "source": [ "dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No tables found... So what is going on? Let's look at what is returned by requests." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from IPython.display import display_html" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\r\\n\\r\\n\\r\\n Neptron Timing\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n\\r\\n\\r\\n
\\r\\n
\\r\\n
\\r\\n \\r\\n Neptron Timing\\r\\n
\\r\\n
\\r\\n \\r\\n
\\r\\n
\\r\\n
\\r\\n \\r\\n\\r\\n
\\r\\n\\t
\\r\\n\\t
Lidingöloppet.se
\\r\\n\\t
\\r\\n\\t\\t\\r\\n\\t\\tClick here to get back to Lidingöloppet\\'s homepage!\\r\\n\\r\\n\\t
\\r\\n\\t
\\r\\n
\\r\\n
\\r\\n
\\r\\n \\r\\n \"\"\\r\\n \\r\\n \\r\\n \"\"\\r\\n \\r\\n
\\r\\n\\r\\n \\r\\n\\r\\n \\r\\n
\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n
\\r\\n \\r\\n \\r\\n\\r\\n \\r\\n\\r\\n\\r\\n\\r\\n'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = requests.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')\n", "r.text" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\r\n", "\r\n", "\r\n", " Neptron Timing\r\n", "\r\n", " \r\n", " \r\n", " \r\n", " \r\n", "\r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", "\r\n", "\r\n", "
\r\n", "
\r\n", "
\r\n", " \r\n", " Neptron Timing\r\n", "
\r\n", "
\r\n", " \r\n", "
\r\n", "
\r\n", "
\r\n", " \r\n", "\r\n", "
\r\n", "\t
\r\n", "\t
Lidingöloppet.se
\r\n", "\t
\r\n", "\t\t\r\n", "\t\tClick here to get back to Lidingöloppet's homepage!\r\n", "\r\n", "\t
\r\n", "\t
\r\n", "
\r\n", "
\r\n", "
\r\n", " \r\n", " \"\"\r\n", " \r\n", " \r\n", " \"\"\r\n", " \r\n", "
\r\n", "\r\n", " \r\n", "\r\n", " \r\n", "
\r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", " \r\n", "
\r\n", " \r\n", " \r\n", "\r\n", " \r\n", "\r\n", "\r\n", "\r\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_html(r.text, raw=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no table in the HTML sent by the server.\n", "The table is rendered on the client side by AngularJS. We can check that by looking at the page source in Chrome:\n", "![Results Lundaloppet 2018 source](/images/pyppeteer/results_lundaloppet_2018_source.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do you parse JavaScript rendered page in Python? Don't we need a browser to run the JavaScript code?\n", "By googling, I found [Requests-HTML](https://github.com/kennethreitz/requests-html) that has JavaScript support!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requests-HTML" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from requests_html import HTMLSession" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "session = HTMLSession()\n", "r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')\n", "r.html.render()\n", "table = r.html.find('table', first=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 Place
(race)
Place
(cat)
Bib noCategoryNameAssociationProgressTimeStatus
1
1
6922
P10
Hans Larsson
MAI
Finish
33:22
Finished
2
2
6514
P10
Filip Helmroth
IK Lerum Friidrott
Finish
33:37
Finished
3
3
3920
P10
David Hartman
Björnstorps IF
Finish
33:39
Finished
4
4
3926
P10
Henrik Orre
Björnstorps IF
Finish
34:24
Finished
5
5
2666
P10
Jesper Bokefors
Malmö AI
Finish
34:51
Finished
6
6
5729
P10
Juan Negreira
Lunds universitet
Finish
35:19
Finished
7
7
3649
P10
Jim Webb
Finish
35:23
Finished
8
8
3675
P10
Nils Wetterberg
Ekmans Löpare i Lund
Finish
35:39
Finished
9
9
4880
P10
Hannes Hjalmarsson
Lunds kommun
Finish
35:41
Finished
10
10
6929
P10
Freyi Karlsson
Ekmans löpare i lund
Finish
35:42
Finished
11
11
5995
P10
Shijie Xu
Lunds universitet
Finish
35:43
Finished
12
12
5276
P10
Stuart Ansell
Lunds universitet
Finish
36:02
Finished
13
13
3917
P10
Christer Friberg
Björnstorps IF
Finish
36:15
Finished
14
14
5647
P10
Roger Lindskog
Lunds universitet
Finish
36:15
Finished
15
15
3616
P10
Andreas Thell
Ystads IF Friidrott
Finish
36:20
Finished
16
16
6382
P10
Tommy Olofsson
Tetra Pak IF
Finish
36:20
Finished
17
17
3183
P10
Kristoffer Loo
Finish
36:36
Finished
18
18
2664
P10
Alfred Bodenäs
Triathlon Syd
Finish
36:44
Finished
19
19
6979
P10
Daniel Jonsson
Finish
36:54
Finished
20
20
4977
P10
Johan Lindgren
Lunds kommun
Finish
36:58
Finished
21
21
3495
P10
Erik Schultz-Eklund
Agape Lund
Finish
37:20
Finished
22
22
3571
P10
Daniel Strandberg
Malmö AI
Finish
37:28
Finished
23
23
3121
P10
Martin Larsson
inQore-part of Qgroup
Finish
37:32
Finished
24
24
5955
P10
Johan Vallon-Christersson
Lunds universitet
Finish
37:33
Finished
25
25
6675
P10
Kristian Haggärde
Björnstorps IF
Finish
37:34
Finished
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_html(table.html, raw=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow! Isn't that magic? We'll explore a bit later how this works.\n", "\n", "What I want to get is all the results, not just the first 25. I tried increasing the *pageSize* passed in the URL, but that didn't help. Even passing a lower value always returns 25 rows. Not sure how the API is implemented...\n", "\n", "An issue I had with requests-html is that sometimes `r.html.find('table', first=True)` returned `None` or an empty table..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "ename": "IndexError", "evalue": "list index out of range", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhtml\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrender\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mtable\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhtml\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'table'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfirst\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_html\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhtml\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mIndexError\u001b[0m: list index out of range" ] } ], 
"source": [ "r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=63&pageSize=25')\n", "r.html.render()\n", "table = r.html.find('table', first=True)\n", "pd.read_html(table.html)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's probably a timing issue (the rendering might take longer sometimes). I tried playing with the `wait` and `sleep` arguments of `r.html.render(wait=1, sleep=1)` but couldn't make the problem completely go away. This is an issue because I don't need just one page but 135.\n", "\n", "I started to look at the [requests-html](https://github.com/kennethreitz/requests-html/blob/master/requests_html.py) code to see how this was implemented. That's how I discovered [pyppeteer](https://github.com/miyakogi/pyppeteer)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pyppeteer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pyppeteer](https://miyakogi.github.io/pyppeteer/) is an unofficial Python port of [puppeteer](https://github.com/GoogleChrome/puppeteer), the JavaScript (headless) Chrome/Chromium browser automation library.\n", "\n", "> Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.\n", "\n", "Pyppeteer allows you to do the same from Python.\n", "So there is no *magic*. You just let Chromium load and render the page with the latest JavaScript and browser features. This is super powerful.\n", "\n", "The first time you run pyppeteer, it even downloads a recent version of Chromium. So no initial setup is required.\n", "\n", "Pyppeteer is based on [asyncio](https://docs.python.org/3/library/asyncio.html). requests-html hides this behind a simple interface, at the cost of some flexibility.\n", "\n", "So let's explore pyppeteer. 
The first example from the documentation is how to take a screenshot of a page.\n", "\n", "```python\n", "import asyncio\n", "from pyppeteer import launch\n", "\n", "async def main():\n", "    browser = await launch()\n", "    page = await browser.newPage()\n", "    await page.goto('http://example.com')\n", "    await page.screenshot({'path': 'example.png'})\n", "    await browser.close()\n", "\n", "asyncio.get_event_loop().run_until_complete(main())\n", "```\n", "\n", "Let's try that with our page. Note that I pass the `fullPage` option; otherwise the page is cut off." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "from pyppeteer import launch\n", "\n", "\n", "async def main():\n", "    browser = await launch()\n", "    page = await browser.newPage()\n", "    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')\n", "    await page.screenshot({'path': 'pyppeteer_screenshot.png', 'fullPage': True})\n", "    await browser.close()\n", "\n", "asyncio.get_event_loop().run_until_complete(main())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the screenshot taken:\n", "![Pyppeteer screenshot](/images/pyppeteer/pyppeteer_screenshot.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice, no?\n", "This example showed us how to load a page:\n", "\n", "- create a browser\n", "- create a new page\n", "- go to a page\n", "\n", "There are several functions that can be used to retrieve elements from the page, like [querySelector](https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.querySelector) or [querySelectorEval](https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.querySelectorEval). We are going to use `querySelectorEval` to retrieve the table. 
We use the `table` selector and read the element's `outerHTML` property to get the HTML representation of the table:\n", "\n", "```python\n", "table = await page.querySelectorEval('table', '(element) => element.outerHTML')\n", "```\n", "\n", "We can then pass that to pandas.\n", "\n", "One thing we want is to wait for the table to be rendered before trying to retrieve it. We can use the [waitForSelector](https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.waitForSelector) function for that.\n", "I initially tried to wait for the `table` selector, but that sometimes returned an empty table. So I wait for a cell that only exists inside a result row, `td.res-startNo`, to be sure that the table rows have been rendered." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Place(race)Place(cat)Bib noCategoryNameUnnamed: 6AssociationProgressTimeStatus
0NaN116922P10Hans LarssonNaNMAIFinish33:22Finished
1NaN226514P10Filip HelmrothNaNIK Lerum FriidrottFinish33:37Finished
2NaN333920P10David HartmanNaNBjörnstorps IFFinish33:39Finished
3NaN443926P10Henrik OrreNaNBjörnstorps IFFinish34:24Finished
4NaN552666P10Jesper BokeforsNaNMalmö AIFinish34:51Finished
5NaN665729P10Juan NegreiraNaNLunds universitetFinish35:19Finished
6NaN773649P10Jim WebbNaNNaNFinish35:23Finished
7NaN883675P10Nils WetterbergNaNEkmans Löpare i LundFinish35:39Finished
8NaN994880P10Hannes HjalmarssonNaNLunds kommunFinish35:41Finished
9NaN10106929P10Freyi KarlssonNaNEkmans löpare i lundFinish35:42Finished
10NaN11115995P10Shijie XuNaNLunds universitetFinish35:43Finished
11NaN12125276P10Stuart AnsellNaNLunds universitetFinish36:02Finished
12NaN13133917P10Christer FribergNaNBjörnstorps IFFinish36:15Finished
13NaN14145647P10Roger LindskogNaNLunds universitetFinish36:15Finished
14NaN15153616P10Andreas ThellNaNYstads IF FriidrottFinish36:20Finished
15NaN16166382P10Tommy OlofssonNaNTetra Pak IFFinish36:20Finished
16NaN17173183P10Kristoffer LooNaNNaNFinish36:36Finished
17NaN18182664P10Alfred BodenäsNaNTriathlon SydFinish36:44Finished
18NaN19196979P10Daniel JonssonNaNNaNFinish36:54Finished
19NaN20204977P10Johan LindgrenNaNLunds kommunFinish36:58Finished
20NaN21213495P10Erik Schultz-EklundNaNAgape LundFinish37:20Finished
21NaN22223571P10Daniel StrandbergNaNMalmö AIFinish37:28Finished
22NaN23233121P10Martin LarssonNaNinQore-part of QgroupFinish37:32Finished
23NaN24245955P10Johan Vallon-ChristerssonNaNLunds universitetFinish37:33Finished
24NaN25256675P10Kristian HaggärdeNaNBjörnstorps IFFinish37:34Finished
\n", "
" ], "text/plain": [ " Unnamed: 0 Place(race) Place(cat) Bib no Category \\\n", "0 NaN 1 1 6922 P10 \n", "1 NaN 2 2 6514 P10 \n", "2 NaN 3 3 3920 P10 \n", "3 NaN 4 4 3926 P10 \n", "4 NaN 5 5 2666 P10 \n", "5 NaN 6 6 5729 P10 \n", "6 NaN 7 7 3649 P10 \n", "7 NaN 8 8 3675 P10 \n", "8 NaN 9 9 4880 P10 \n", "9 NaN 10 10 6929 P10 \n", "10 NaN 11 11 5995 P10 \n", "11 NaN 12 12 5276 P10 \n", "12 NaN 13 13 3917 P10 \n", "13 NaN 14 14 5647 P10 \n", "14 NaN 15 15 3616 P10 \n", "15 NaN 16 16 6382 P10 \n", "16 NaN 17 17 3183 P10 \n", "17 NaN 18 18 2664 P10 \n", "18 NaN 19 19 6979 P10 \n", "19 NaN 20 20 4977 P10 \n", "20 NaN 21 21 3495 P10 \n", "21 NaN 22 22 3571 P10 \n", "22 NaN 23 23 3121 P10 \n", "23 NaN 24 24 5955 P10 \n", "24 NaN 25 25 6675 P10 \n", "\n", " Name Unnamed: 6 Association Progress \\\n", "0 Hans Larsson NaN MAI Finish \n", "1 Filip Helmroth NaN IK Lerum Friidrott Finish \n", "2 David Hartman NaN Björnstorps IF Finish \n", "3 Henrik Orre NaN Björnstorps IF Finish \n", "4 Jesper Bokefors NaN Malmö AI Finish \n", "5 Juan Negreira NaN Lunds universitet Finish \n", "6 Jim Webb NaN NaN Finish \n", "7 Nils Wetterberg NaN Ekmans Löpare i Lund Finish \n", "8 Hannes Hjalmarsson NaN Lunds kommun Finish \n", "9 Freyi Karlsson NaN Ekmans löpare i lund Finish \n", "10 Shijie Xu NaN Lunds universitet Finish \n", "11 Stuart Ansell NaN Lunds universitet Finish \n", "12 Christer Friberg NaN Björnstorps IF Finish \n", "13 Roger Lindskog NaN Lunds universitet Finish \n", "14 Andreas Thell NaN Ystads IF Friidrott Finish \n", "15 Tommy Olofsson NaN Tetra Pak IF Finish \n", "16 Kristoffer Loo NaN NaN Finish \n", "17 Alfred Bodenäs NaN Triathlon Syd Finish \n", "18 Daniel Jonsson NaN NaN Finish \n", "19 Johan Lindgren NaN Lunds kommun Finish \n", "20 Erik Schultz-Eklund NaN Agape Lund Finish \n", "21 Daniel Strandberg NaN Malmö AI Finish \n", "22 Martin Larsson NaN inQore-part of Qgroup Finish \n", "23 Johan Vallon-Christersson NaN Lunds universitet Finish \n", "24 Kristian Haggärde 
NaN Björnstorps IF Finish \n", "\n", " Time Status \n", "0 33:22 Finished \n", "1 33:37 Finished \n", "2 33:39 Finished \n", "3 34:24 Finished \n", "4 34:51 Finished \n", "5 35:19 Finished \n", "6 35:23 Finished \n", "7 35:39 Finished \n", "8 35:41 Finished \n", "9 35:42 Finished \n", "10 35:43 Finished \n", "11 36:02 Finished \n", "12 36:15 Finished \n", "13 36:15 Finished \n", "14 36:20 Finished \n", "15 36:20 Finished \n", "16 36:36 Finished \n", "17 36:44 Finished \n", "18 36:54 Finished \n", "19 36:58 Finished \n", "20 37:20 Finished \n", "21 37:28 Finished \n", "22 37:32 Finished \n", "23 37:33 Finished \n", "24 37:34 Finished " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import asyncio\n", "import pandas as pd\n", "from pyppeteer import launch\n", "\n", "\n", "async def main():\n", "    browser = await launch()\n", "    page = await browser.newPage()\n", "    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')\n", "    await page.waitForSelector('td.res-startNo')\n", "    table = await page.querySelectorEval('table', '(element) => element.outerHTML')\n", "    await browser.close()\n", "    return pd.read_html(table)[0]\n", "\n", "df = asyncio.get_event_loop().run_until_complete(main())\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a bit more code than with requests-HTML but we have finer control.\n", "Let's refactor that code to retrieve all the results of the race." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "import pandas as pd\n", "from pyppeteer import launch\n", "\n", "URL = 'http://results.neptron.se/#/lundaloppet2018/results?sortOrder=Place&raceId=99&pageSize=25&page={}'\n", "\n", "\n", "async def get_page(browser, url, selector):\n", "    \"\"\"Return a page after waiting for the given selector\"\"\"\n", "    page = await browser.newPage()\n", "    await page.goto(url)\n", "    await page.waitForSelector(selector)\n", "    return page\n", "\n", "\n", "async def get_num_pages(browser):\n", "    \"\"\"Return the total number of pages available\"\"\"\n", "    page = await get_page(browser, URL.format(0), 'div.ng-isolate-scope')\n", "    num_pages = await page.querySelectorEval(\n", "        'div.ng-isolate-scope',\n", "        '(element) => element.getAttribute(\"data-num-pages\")')\n", "    return int(num_pages)\n", "\n", "\n", "async def get_table(browser, page_nb):\n", "    \"\"\"Return the table from the given page number as a pandas dataframe\"\"\"\n", "    print(f'Get table from page {page_nb}')\n", "    page = await get_page(browser, URL.format(page_nb), 'td.res-startNo')\n", "    table = await page.querySelectorEval('table', '(element) => element.outerHTML')\n", "    return pd.read_html(table)[0]\n", "\n", "\n", "async def get_results():\n", "    \"\"\"Return all the results as a pandas dataframe\"\"\"\n", "    browser = await launch()\n", "    num_pages = await get_num_pages(browser)\n", "    print(f'Number of pages: {num_pages}')\n", "    # Python 3.6 asynchronous comprehensions! Nice!\n", "    dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]\n", "    await browser.close()\n", "    df = pd.concat(dfs, ignore_index=True)\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code could be made a bit more generic but that's good enough for what I want.\n", "I think it's quite straightforward. 
We first get the total number of pages and then load each page to retrieve the table.\n", "Once done, we just have to concatenate all those tables into one.\n", "\n", "One thing to note is the use of Python [asynchronous comprehensions](https://www.python.org/dev/peps/pep-0530/). This is a Python 3.6 feature that allows `await` inside a comprehension, so the code reads just as it would with synchronous functions:\n", "```python\n", "dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]\n", "```\n", "\n", "Let's run that code!" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of pages: 115\n", "Get table from page 0\n", "Get table from page 1\n", "Get table from page 2\n", "Get table from page 3\n", "Get table from page 4\n", "Get table from page 5\n", "Get table from page 6\n", "Get table from page 7\n", "Get table from page 8\n", "Get table from page 9\n", "Get table from page 10\n", "Get table from page 11\n", "Get table from page 12\n", "Get table from page 13\n", "Get table from page 14\n", "Get table from page 15\n", "Get table from page 16\n", "Get table from page 17\n", "Get table from page 18\n", "Get table from page 19\n", "Get table from page 20\n", "Get table from page 21\n", "Get table from page 22\n", "Get table from page 23\n", "Get table from page 24\n", "Get table from page 25\n", "Get table from page 26\n", "Get table from page 27\n", "Get table from page 28\n", "Get table from page 29\n", "Get table from page 30\n", "Get table from page 31\n", "Get table from page 32\n", "Get table from page 33\n", "Get table from page 34\n", "Get table from page 35\n", "Get table from page 36\n", "Get table from page 37\n", "Get table from page 38\n", "Get table from page 39\n", "Get table from page 40\n", "Get table from page 41\n", "Get table from page 42\n", "Get table from page 43\n", "Get table from page 44\n", "Get table from page 45\n", "Get table from page 46\n", 
"Get table from page 47\n", "Get table from page 48\n", "Get table from page 49\n", "Get table from page 50\n", "Get table from page 51\n", "Get table from page 52\n", "Get table from page 53\n", "Get table from page 54\n", "Get table from page 55\n", "Get table from page 56\n", "Get table from page 57\n", "Get table from page 58\n", "Get table from page 59\n", "Get table from page 60\n", "Get table from page 61\n", "Get table from page 62\n", "Get table from page 63\n", "Get table from page 64\n", "Get table from page 65\n", "Get table from page 66\n", "Get table from page 67\n", "Get table from page 68\n", "Get table from page 69\n", "Get table from page 70\n", "Get table from page 71\n", "Get table from page 72\n", "Get table from page 73\n", "Get table from page 74\n", "Get table from page 75\n", "Get table from page 76\n", "Get table from page 77\n", "Get table from page 78\n", "Get table from page 79\n", "Get table from page 80\n", "Get table from page 81\n", "Get table from page 82\n", "Get table from page 83\n", "Get table from page 84\n", "Get table from page 85\n", "Get table from page 86\n", "Get table from page 87\n", "Get table from page 88\n", "Get table from page 89\n", "Get table from page 90\n", "Get table from page 91\n", "Get table from page 92\n", "Get table from page 93\n", "Get table from page 94\n", "Get table from page 95\n", "Get table from page 96\n", "Get table from page 97\n", "Get table from page 98\n", "Get table from page 99\n", "Get table from page 100\n", "Get table from page 101\n", "Get table from page 102\n", "Get table from page 103\n", "Get table from page 104\n", "Get table from page 105\n", "Get table from page 106\n", "Get table from page 107\n", "Get table from page 108\n", "Get table from page 109\n", "Get table from page 110\n", "Get table from page 111\n", "Get table from page 112\n", "Get table from page 113\n", "Get table from page 114\n" ] } ], "source": [ "df = 
asyncio.get_event_loop().run_until_complete(get_results())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! We got all the results from the Lundaloppet 2018 in a nice pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2872" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>Place(race)</th><th>Place(cat)</th><th>Bib no</th><th>Category</th><th>Name</th><th>Unnamed: 6</th><th>Association</th><th>Progress</th><th>Time</th><th>Status</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>NaN</td><td>1.0</td><td>1.0</td><td>6922</td><td>P10</td><td>Hans Larsson</td><td>NaN</td><td>MAI</td><td>Finish</td><td>33:22</td><td>Finished</td></tr>\n",
" <tr><th>1</th><td>NaN</td><td>2.0</td><td>2.0</td><td>6514</td><td>P10</td><td>Filip Helmroth</td><td>NaN</td><td>IK Lerum Friidrott</td><td>Finish</td><td>33:37</td><td>Finished</td></tr>\n",
" <tr><th>2</th><td>NaN</td><td>3.0</td><td>3.0</td><td>3920</td><td>P10</td><td>David Hartman</td><td>NaN</td><td>Björnstorps IF</td><td>Finish</td><td>33:39</td><td>Finished</td></tr>\n",
" <tr><th>3</th><td>NaN</td><td>4.0</td><td>4.0</td><td>3926</td><td>P10</td><td>Henrik Orre</td><td>NaN</td><td>Björnstorps IF</td><td>Finish</td><td>34:24</td><td>Finished</td></tr>\n",
" <tr><th>4</th><td>NaN</td><td>5.0</td><td>5.0</td><td>2666</td><td>P10</td><td>Jesper Bokefors</td><td>NaN</td><td>Malmö AI</td><td>Finish</td><td>34:51</td><td>Finished</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Unnamed: 0 Place(race) Place(cat) Bib no Category Name \\\n", "0 NaN 1.0 1.0 6922 P10 Hans Larsson \n", "1 NaN 2.0 2.0 6514 P10 Filip Helmroth \n", "2 NaN 3.0 3.0 3920 P10 David Hartman \n", "3 NaN 4.0 4.0 3926 P10 Henrik Orre \n", "4 NaN 5.0 5.0 2666 P10 Jesper Bokefors \n", "\n", " Unnamed: 6 Association Progress Time Status \n", "0 NaN MAI Finish 33:22 Finished \n", "1 NaN IK Lerum Friidrott Finish 33:37 Finished \n", "2 NaN Björnstorps IF Finish 33:39 Finished \n", "3 NaN Björnstorps IF Finish 34:24 Finished \n", "4 NaN Malmö AI Finish 34:51 Finished " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>Place(race)</th><th>Place(cat)</th><th>Bib no</th><th>Category</th><th>Name</th><th>Unnamed: 6</th><th>Association</th><th>Progress</th><th>Time</th><th>Status</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>2867</th><td>NaN</td><td>NaN</td><td>NaN</td><td>6855</td><td>T10</td><td>porntepin sooksaengprasit</td><td>NaN</td><td>Lunds universitet</td><td>NaN</td><td>NaN</td><td>Not started</td></tr>\n",
" <tr><th>2868</th><td>NaN</td><td>NaN</td><td>NaN</td><td>6857</td><td>P10</td><td>Gabriel Teku</td><td>NaN</td><td>Lunds universitet</td><td>NaN</td><td>NaN</td><td>Not started</td></tr>\n",
" <tr><th>2869</th><td>NaN</td><td>NaN</td><td>NaN</td><td>6888</td><td>P10</td><td>Viktor Karlsson</td><td>NaN</td><td>Genarps if</td><td>NaN</td><td>NaN</td><td>Not started</td></tr>\n",
" <tr><th>2870</th><td>NaN</td><td>NaN</td><td>NaN</td><td>6892</td><td>P10</td><td>Emil Larsson</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Not started</td></tr>\n",
" <tr><th>2871</th><td>NaN</td><td>NaN</td><td>NaN</td><td>6893</td><td>P10</td><td>Göran Larsson</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Not started</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Unnamed: 0 Place(race) Place(cat) Bib no Category \\\n", "2867 NaN NaN NaN 6855 T10 \n", "2868 NaN NaN NaN 6857 P10 \n", "2869 NaN NaN NaN 6888 P10 \n", "2870 NaN NaN NaN 6892 P10 \n", "2871 NaN NaN NaN 6893 P10 \n", "\n", " Name Unnamed: 6 Association Progress Time \\\n", "2867 porntepin sooksaengprasit NaN Lunds universitet NaN NaN \n", "2868 Gabriel Teku NaN Lunds universitet NaN NaN \n", "2869 Viktor Karlsson NaN Genarps if NaN NaN \n", "2870 Emil Larsson NaN NaN NaN NaN \n", "2871 Göran Larsson NaN NaN NaN NaN \n", "\n", " Status \n", "2867 Not started \n", "2868 Not started \n", "2869 Not started \n", "2870 Not started \n", "2871 Not started " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the result to a CSV file." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df.to_csv('lundaloppet2018.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With frameworks like [AngularJS](https://angularjs.org), [React](https://reactjs.org) and [Vue.js](https://vuejs.org), more and more websites use client-side rendering. To parse those websites, you can't just request the HTML from the server. Parsing them requires running some JavaScript first.\n", "\n", "[Pyppeteer](https://miyakogi.github.io/pyppeteer/) makes that possible. Thanks to [Headless Chromium](https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md), it gives you access to the full power of a browser from Python. I find that really impressive!\n", "\n", "I tried to use [Selenium](https://www.seleniumhq.org) in the past but didn't find it very easy to get started with. That wasn't the case with Pyppeteer. To be fair, it was a while ago and both projects are quite different. It's not just about browser automation. 
Selenium allows you to perform cross-browser testing. Pyppeteer is limited to Chrome/Chromium.\n", "Anyway, I'll probably look more at Pyppeteer for web application testing.\n", "\n", "For simple tasks, [Requests-HTML](https://github.com/kennethreitz/requests-html) is a nice wrapper and gives you a simple API. If you want more control, use Pyppeteer directly.\n", "\n", "One last note. To run this code in a Jupyter notebook, you should use tornado 4. asyncio code doesn't play well with IPython and tornado 5. See this GitHub issue: [asyncio will be running by default with tornado 5](https://github.com/ipython/ipython/issues/11030). There is some [work in progress](https://github.com/ipython/ipython/pull/11155) for a nice integration.\n", "\n", "What about the Lundaloppet results, you might ask? I'll explore them in another post!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "nikola": { "category": "python", "date": "2018-06-02 22:54:45 UTC+02:00", "description": "", "link": "", "slug": "parsing-javascript-rendered-pages-in-python-with-pyppeteer", "tags": "python,pandas,requests,request-html,pyppeteer,javascript", "title": "Parsing JavaScript rendered pages in Python with pyppeteer", "type": "text" } }, "nbformat": 4, "nbformat_minor": 1 }