Parsing JavaScript rendered pages in Python with pyppeteer

Where is my table?

I already wrote a blog post about Parsing HTML Tables in Python with pandas. Using requests or even pandas directly worked nicely.

I wanted to play with some data from a race I recently ran: Lundaloppet. The results are available here: http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25 (Results Lundaloppet 2018)

Let's try to get that table!

In [1]:
import pandas as pd
In [2]:
dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-881110a1fe3d> in <module>()
----> 1 dfs = pd.read_html('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')

~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
    985                   decimal=decimal, converters=converters, na_values=na_values,
    986                   keep_default_na=keep_default_na,
--> 987                   displayed_only=displayed_only)

~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    813             break
    814     else:
--> 815         raise_with_traceback(retained)
    816 
    817     ret = []

~/miniconda3/envs/notebook/lib/python3.6/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    401         if traceback == Ellipsis:
    402             _, _, traceback = sys.exc_info()
--> 403         raise exc.with_traceback(traceback)
    404 else:
    405     # this version of raise is a syntax error in Python 3

ValueError: No tables found

No tables found... So what is going on? Let's look at what is returned by requests.

In [3]:
import requests
from IPython.display import display_html
In [4]:
r = requests.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.text
Out[4]:
'<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" ng-app="app">\r\n<head>\r\n    <title ng-bind="event.name || \'Neptron Timing\'">Neptron Timing</title>\r\n\r\n    <meta charset="utf-8">\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1">\r\n    <meta name="description" content="Neptron Timing event results">\r\n\r\n    <link rel="shortcut icon" href="favicon.ico">\r\n    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css">\r\n    <link rel="stylesheet" href="content/app.min.css">\r\n    <script src="scripts/iframeResizer.contentWindow.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/es6-shim/0.35.0/es6-shim.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.6/js/bootstrap.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/angular.js/1.4.8/angular.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/angular.js/1.4.8/angular-route.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.13.0/moment.min.js"></script>\r\n    <script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.0.2/Chart.min.js"></script>\r\n    <script src="https://maps.googleapis.com/maps/api/js?key=AIzaSyD7OPJoYN6W9qUHU1L_fEr_5ut8tQN8r2A"></script>\r\n</head>\r\n<body>\r\n    <div class="navbar navbar-inverse navbar-static-top" role="navigation">\r\n        <div class="container">\r\n            <div class="navbar-header">\r\n                <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">\r\n                    <span class="sr-only">Toggle navigation</span>\r\n                    <span class="icon-bar"></span>\r\n                    <span class="icon-bar"></span>\r\n                    <span class="icon-bar"></span>\r\n                </button>\r\n                <a class="navbar-brand" href="#">Neptron Timing</a>\r\n            </div>\r\n            <div class="collapse navbar-collapse">\r\n                <ul class="nav navbar-nav">\r\n                    <li><a href="#/">Events</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/event">Info</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/results">Results</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/leaderboard">Leaderboard</a></li>\r\n                    <li ng-show="event.id && event.tracking"><a href="#/{{event.id}}/tracking">Tracking</a></li>\r\n                    <li ng-show="event.id"><a href="#/{{event.id}}/favorites">Favorites</a></li>\r\n                    <li ng-show="event.id && event.sprints.length > 0"><a href="#/{{event.id}}/sprint">Sprint</a></li>\r\n                    <li ng-show="event.id && event.teamCompetitions.length > 0"><a href="#/{{event.id}}/teams">Teams</a></li>\r\n                </ul>\r\n            </div><!--/.nav-collapse -->\r\n        </div>\r\n    </div>\r\n  <script type="text/javascript">\r\n\r\nvar fixLidingloppetMessage = function() {\r\n\tvar str = window.location.href || \'\';\r\n\tvar cssStyle = (str.match(\'lidingolor2017\') ? 
\'\' : \'none\');\r\n\t//console.log(\'changed: \'+str, cssStyle);\r\n\t$(\'#nytamin-fix\').css(\'display\', cssStyle);\r\n}\r\n$(window).bind(\'hashchange\', function() {\r\n\tfixLidingloppetMessage();\r\n});\r\nwindow.setInterval(fixLidingloppetMessage, 1000);\r\n\r\n</script>\r\n\r\n<div class="container-fluid">\r\n\t<div id="nytamin-fix" class="panel panel-primary" style="display: none; margin: 2em;">\r\n\t  <div class="panel-heading">Liding&ouml;loppet.se</div>\r\n\t  <div class="panel-body">\r\n\t\t\r\n\t\t<strong><a href="http://213.39.39.152">Click here to get back to Liding&ouml;loppet\'s homepage!</a></strong>\r\n\r\n\t  </div>\r\n\t</div>\r\n</div>\r\n    <div class="container-fluid" ng-view></div>\r\n  <div class="nt-app-links" style="margin:10px 20px">\r\n    <a href="https://www.raceone.com/redirect" id="download-icon-footer-ios" target="_blank">\r\n      <img class="download-icon" src="https://www.raceone.com/wp-content/uploads/2016/09/app-store-e1475238488598.png" alt="">\r\n    </a>\r\n    <a href="https://www.raceone.com/redirect" id="download-icon-footer-android" target="_blank">\r\n      <img class="download-icon" src="https://www.raceone.com/wp-content/uploads/2016/09/google-play-e1475238513871.png" alt="">\r\n    </a>\r\n  </div>\r\n\r\n    <script type="text/javascript" src="scripts/app.js"></script>\r\n\r\n    <!-- AddThis Button BEGIN -->\r\n    <div class="addthis_toolbox addthis_default_style addthis_32x32_style">\r\n        <a class="addthis_button_facebook"></a>\r\n        <a class="addthis_button_twitter"></a>\r\n        <a class="addthis_button_linkedin"></a>\r\n        <a class="addthis_button_email"></a>\r\n        <a class="addthis_button_print"></a>\r\n        <a class="addthis_button_textme"></a>\r\n        <a class="addthis_button_compact"></a>\r\n    </div>\r\n    <script type="text/javascript" src="//s7.addthis.com/js/300/addthis_widget.js#pubid=ra-5364e093794f9d2f"></script>\r\n    <!-- AddThis Button END -->\r\n\r\n    <!--<div class="applinks">\r\n        <a href="https://itunes.apple.com/se/app/neptron-timing/id709776903" target="_blank"><img class="appstore" alt="Get it on iTunes" src="content/appstore.svg" /></a>\r\n        <a href="https://play.google.com/store/apps/details?id=se.neptron.timing" target="_blank"><img class="playstore" alt="Get it on Google Play" src="content/playstore.png" /></a>\r\n    </div>-->\r\n\r\n</body>\r\n</html>\r\n'
In [5]:
display_html(r.text, raw=True)
 Neptron Timing

There is no table in the HTML sent by the server. The table is rendered on the client side by AngularJS. We can check that by looking at the page source in Chrome: Results Lundaloppet 2018 source

How do you parse a JavaScript rendered page in Python? Don't we need a browser to run the JavaScript code? By googling, I found Requests-HTML, which has JavaScript support!

Requests-HTML

In [6]:
from requests_html import HTMLSession
In [7]:
session = HTMLSession()
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
In [8]:
display_html(table.html, raw=True)
Place (race)  Place (cat)  Bib no  Category  Name  Association  Progress  Time  Status
1   1   6922  P10  Hans Larsson  MAI  Finish  33:22  Finished
2   2   6514  P10  Filip Helmroth  IK Lerum Friidrott  Finish  33:37  Finished
3   3   3920  P10  David Hartman  Björnstorps IF  Finish  33:39  Finished
4   4   3926  P10  Henrik Orre  Björnstorps IF  Finish  34:24  Finished
5   5   2666  P10  Jesper Bokefors  Malmö AI  Finish  34:51  Finished
6   6   5729  P10  Juan Negreira  Lunds universitet  Finish  35:19  Finished
7   7   3649  P10  Jim Webb    Finish  35:23  Finished
8   8   3675  P10  Nils Wetterberg  Ekmans Löpare i Lund  Finish  35:39  Finished
9   9   4880  P10  Hannes Hjalmarsson  Lunds kommun  Finish  35:41  Finished
10  10  6929  P10  Freyi Karlsson  Ekmans löpare i lund  Finish  35:42  Finished
11  11  5995  P10  Shijie Xu  Lunds universitet  Finish  35:43  Finished
12  12  5276  P10  Stuart Ansell  Lunds universitet  Finish  36:02  Finished
13  13  3917  P10  Christer Friberg  Björnstorps IF  Finish  36:15  Finished
14  14  5647  P10  Roger Lindskog  Lunds universitet  Finish  36:15  Finished
15  15  3616  P10  Andreas Thell  Ystads IF Friidrott  Finish  36:20  Finished
16  16  6382  P10  Tommy Olofsson  Tetra Pak IF  Finish  36:20  Finished
17  17  3183  P10  Kristoffer Loo    Finish  36:36  Finished
18  18  2664  P10  Alfred Bodenäs  Triathlon Syd  Finish  36:44  Finished
19  19  6979  P10  Daniel Jonsson    Finish  36:54  Finished
20  20  4977  P10  Johan Lindgren  Lunds kommun  Finish  36:58  Finished
21  21  3495  P10  Erik Schultz-Eklund  Agape Lund  Finish  37:20  Finished
22  22  3571  P10  Daniel Strandberg  Malmö AI  Finish  37:28  Finished
23  23  3121  P10  Martin Larsson  inQore-part of Qgroup  Finish  37:32  Finished
24  24  5955  P10  Johan Vallon-Christersson  Lunds universitet  Finish  37:33  Finished
25  25  6675  P10  Kristian Haggärde  Björnstorps IF  Finish  37:34  Finished

Wow! Isn't that magic? We'll explore a bit later how this works.

What I want to get is all the results, not just the first 25. I tried increasing the pageSize passed in the URL, but that didn't help. Even passing a lower value always returns 25 rows. Not sure how the API is implemented...

An issue I had with requests-html is that sometimes r.html.find('table', first=True) returned None or an empty table...

In [9]:
r = session.get('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=63&pageSize=25')
r.html.render()
table = r.html.find('table', first=True)
pd.read_html(table.html)[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-e9d6c036862c> in <module>()
      2 r.html.render()
      3 table = r.html.find('table', first=True)
----> 4 pd.read_html(table.html)[0]

IndexError: list index out of range

That's probably a timing issue (the rendering might take longer sometimes). I tried playing with the wait and sleep arguments of r.html.render(wait=1, sleep=1) but couldn't make the problem completely go away. This is an issue because I don't need just one page but 115.
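
One workaround would be a small retry loop around render(), re-fetching until the table actually contains rows. Here is a rough sketch reusing the session created above (not what I ended up doing, as you'll see below):

import time


def get_rendered_table_html(url, attempts=5):
    """Retry rendering until the table contains rows (sketch only)."""
    for _ in range(attempts):
        r = session.get(url)
        r.html.render(wait=1, sleep=1)
        table = r.html.find('table', first=True)
        if table is not None and '<td' in table.html:
            return table.html
        time.sleep(1)
    raise RuntimeError(f'No table rendered for {url}')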

I started to look at requests-html code to see how this was implemented. That's how I discovered pyppeteer.

Pyppeteer

Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library.

Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

Pyppeteer allows you to do the same from Python. So there is no magic. You just let Chromium load and render the page with the latest JavaScript and browser features. This is super powerful.

The first time you run pyppeteer, it even downloads a recent version of Chromium. So no initial setup is required.
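
If you prefer to trigger that download explicitly (for example when building a Docker image), pyppeteer exposes a chromium_downloader module. A small sketch, to be checked against your pyppeteer version:

from pyppeteer import chromium_downloader

# Download Chromium up front instead of on the first launch()
# (helper names may differ between pyppeteer versions)
if not chromium_downloader.check_chromium():
    chromium_downloader.download_chromium()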

Pyppeteer is based on asyncio. This is hidden by requests-html, which gives you a simpler interface, but of course less flexibility.

So let's explore pyppeteer. The first example from the documentation is how to take a screenshot of a page.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Let's try that with our page. Note that I pass the fullPage option, otherwise the screenshot is cut.

In [10]:
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.screenshot({'path': 'pyppeteer_screenshot.png', 'fullPage': True})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Here is the screenshot taken: Pyppeteer screenshot

Nice, no? This example showed us how to load a page:

  • create a browser
  • create a new page
  • goto a page

There are several functions that can be used to retrieve elements from the page, like querySelector or querySelectorEval. The latter is the one we're going to use to retrieve the table: we select the table element and read its outerHTML property to get the HTML representation of the table:

table = await page.querySelectorEval('table', '(element) => element.outerHTML')

We can then pass that to pandas.

One thing we want is to wait for the table to be rendered before trying to retrieve it. We can use the waitForSelector function for that. I initially tried to wait on the table selector but that sometimes returned an empty table. So I chose the class of a row cell, td.res-startNo, to be sure that the table was rendered.

In [11]:
import asyncio
import pandas as pd
from pyppeteer import launch


async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://results.neptron.se/#/lundaloppet2018/?sortOrder=Place&raceId=99&page=0&pageSize=25')
    await page.waitForSelector('td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    await browser.close()
    return pd.read_html(table)[0]

df = asyncio.get_event_loop().run_until_complete(main())
df
Out[11]:
Unnamed: 0 Place(race) Place(cat) Bib no Category Name Unnamed: 6 Association Progress Time Status
0 NaN 1 1 6922 P10 Hans Larsson NaN MAI Finish 33:22 Finished
1 NaN 2 2 6514 P10 Filip Helmroth NaN IK Lerum Friidrott Finish 33:37 Finished
2 NaN 3 3 3920 P10 David Hartman NaN Björnstorps IF Finish 33:39 Finished
3 NaN 4 4 3926 P10 Henrik Orre NaN Björnstorps IF Finish 34:24 Finished
4 NaN 5 5 2666 P10 Jesper Bokefors NaN Malmö AI Finish 34:51 Finished
5 NaN 6 6 5729 P10 Juan Negreira NaN Lunds universitet Finish 35:19 Finished
6 NaN 7 7 3649 P10 Jim Webb NaN NaN Finish 35:23 Finished
7 NaN 8 8 3675 P10 Nils Wetterberg NaN Ekmans Löpare i Lund Finish 35:39 Finished
8 NaN 9 9 4880 P10 Hannes Hjalmarsson NaN Lunds kommun Finish 35:41 Finished
9 NaN 10 10 6929 P10 Freyi Karlsson NaN Ekmans löpare i lund Finish 35:42 Finished
10 NaN 11 11 5995 P10 Shijie Xu NaN Lunds universitet Finish 35:43 Finished
11 NaN 12 12 5276 P10 Stuart Ansell NaN Lunds universitet Finish 36:02 Finished
12 NaN 13 13 3917 P10 Christer Friberg NaN Björnstorps IF Finish 36:15 Finished
13 NaN 14 14 5647 P10 Roger Lindskog NaN Lunds universitet Finish 36:15 Finished
14 NaN 15 15 3616 P10 Andreas Thell NaN Ystads IF Friidrott Finish 36:20 Finished
15 NaN 16 16 6382 P10 Tommy Olofsson NaN Tetra Pak IF Finish 36:20 Finished
16 NaN 17 17 3183 P10 Kristoffer Loo NaN NaN Finish 36:36 Finished
17 NaN 18 18 2664 P10 Alfred Bodenäs NaN Triathlon Syd Finish 36:44 Finished
18 NaN 19 19 6979 P10 Daniel Jonsson NaN NaN Finish 36:54 Finished
19 NaN 20 20 4977 P10 Johan Lindgren NaN Lunds kommun Finish 36:58 Finished
20 NaN 21 21 3495 P10 Erik Schultz-Eklund NaN Agape Lund Finish 37:20 Finished
21 NaN 22 22 3571 P10 Daniel Strandberg NaN Malmö AI Finish 37:28 Finished
22 NaN 23 23 3121 P10 Martin Larsson NaN inQore-part of Qgroup Finish 37:32 Finished
23 NaN 24 24 5955 P10 Johan Vallon-Christersson NaN Lunds universitet Finish 37:33 Finished
24 NaN 25 25 6675 P10 Kristian Haggärde NaN Björnstorps IF Finish 37:34 Finished

That's a bit more code than with requests-HTML but we have finer control. Let's refactor that code to retrieve all the results of the race.

In [12]:
import asyncio
import pandas as pd
from pyppeteer import launch

URL = 'http://results.neptron.se/#/lundaloppet2018/results?sortOrder=Place&raceId=99&pageSize=25&page={}'


async def get_page(browser, url, selector):
    """Return a page after waiting for the given selector"""
    page = await browser.newPage()
    await page.goto(url)
    await page.waitForSelector(selector)
    return page


async def get_num_pages(browser):
    """Return the total number of pages available"""
    page = await get_page(browser, URL.format(0), 'div.ng-isolate-scope')
    num_pages = await page.querySelectorEval(
        'div.ng-isolate-scope',
        '(element) => element.getAttribute("data-num-pages")')
    return int(num_pages)


async def get_table(browser, page_nb):
    """Return the table from the given page number as a pandas dataframe"""
    print(f'Get table from page {page_nb}')
    page = await get_page(browser, URL.format(page_nb), 'td.res-startNo')
    table = await page.querySelectorEval('table', '(element) => element.outerHTML')
    return pd.read_html(table)[0]


async def get_results():
    """Return all the results as a pandas dataframe"""
    browser = await launch()
    num_pages = await get_num_pages(browser)
    print(f'Number of pages: {num_pages}')
    # Python 3.6 asynchronous comprehensions! Nice!
    dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
    await browser.close()
    df = pd.concat(dfs, ignore_index=True)
    return df

This code could be made a bit more generic but that's good enough for what I want. I think it's quite straightforward. We first get the total number of pages and then load each page to retrieve the table. Once done, we just have to concatenate all those tables in one.

One thing to note is the use of Python asynchronous comprehensions. This is a Python 3.6 feature and makes it really Pythonic. It just works as it would with synchronous functions:

dfs = [await get_table(browser, page_nb) for page_nb in range(0, num_pages)]
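
As an aside, since each call to get_table opens its own tab, the page fetches could also run concurrently with asyncio.gather. A rough sketch (not what I used above; with about 115 pages you would probably want to limit the number of open tabs):

async def get_results_concurrently():
    """Variant of get_results that fetches all pages concurrently (sketch)."""
    browser = await launch()
    num_pages = await get_num_pages(browser)
    # One coroutine (and one browser tab) per page; consider chunking this
    # if opening that many tabs at once is too heavy
    dfs = await asyncio.gather(
        *(get_table(browser, page_nb) for page_nb in range(num_pages)))
    await browser.close()
    return pd.concat(dfs, ignore_index=True)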

Let's run that code!

In [13]:
df = asyncio.get_event_loop().run_until_complete(get_results())
Number of pages: 115
Get table from page 0
Get table from page 1
Get table from page 2
Get table from page 3
Get table from page 4
Get table from page 5
Get table from page 6
Get table from page 7
Get table from page 8
Get table from page 9
Get table from page 10
Get table from page 11
Get table from page 12
Get table from page 13
Get table from page 14
Get table from page 15
Get table from page 16
Get table from page 17
Get table from page 18
Get table from page 19
Get table from page 20
Get table from page 21
Get table from page 22
Get table from page 23
Get table from page 24
Get table from page 25
Get table from page 26
Get table from page 27
Get table from page 28
Get table from page 29
Get table from page 30
Get table from page 31
Get table from page 32
Get table from page 33
Get table from page 34
Get table from page 35
Get table from page 36
Get table from page 37
Get table from page 38
Get table from page 39
Get table from page 40
Get table from page 41
Get table from page 42
Get table from page 43
Get table from page 44
Get table from page 45
Get table from page 46
Get table from page 47
Get table from page 48
Get table from page 49
Get table from page 50
Get table from page 51
Get table from page 52
Get table from page 53
Get table from page 54
Get table from page 55
Get table from page 56
Get table from page 57
Get table from page 58
Get table from page 59
Get table from page 60
Get table from page 61
Get table from page 62
Get table from page 63
Get table from page 64
Get table from page 65
Get table from page 66
Get table from page 67
Get table from page 68
Get table from page 69
Get table from page 70
Get table from page 71
Get table from page 72
Get table from page 73
Get table from page 74
Get table from page 75
Get table from page 76
Get table from page 77
Get table from page 78
Get table from page 79
Get table from page 80
Get table from page 81
Get table from page 82
Get table from page 83
Get table from page 84
Get table from page 85
Get table from page 86
Get table from page 87
Get table from page 88
Get table from page 89
Get table from page 90
Get table from page 91
Get table from page 92
Get table from page 93
Get table from page 94
Get table from page 95
Get table from page 96
Get table from page 97
Get table from page 98
Get table from page 99
Get table from page 100
Get table from page 101
Get table from page 102
Get table from page 103
Get table from page 104
Get table from page 105
Get table from page 106
Get table from page 107
Get table from page 108
Get table from page 109
Get table from page 110
Get table from page 111
Get table from page 112
Get table from page 113
Get table from page 114

That's it! We got all the results from the Lundaloppet 2018 in a nice pandas DataFrame.

In [14]:
len(df)
Out[14]:
2872
In [15]:
df.head()
Out[15]:
Unnamed: 0 Place(race) Place(cat) Bib no Category Name Unnamed: 6 Association Progress Time Status
0 NaN 1.0 1.0 6922 P10 Hans Larsson NaN MAI Finish 33:22 Finished
1 NaN 2.0 2.0 6514 P10 Filip Helmroth NaN IK Lerum Friidrott Finish 33:37 Finished
2 NaN 3.0 3.0 3920 P10 David Hartman NaN Björnstorps IF Finish 33:39 Finished
3 NaN 4.0 4.0 3926 P10 Henrik Orre NaN Björnstorps IF Finish 34:24 Finished
4 NaN 5.0 5.0 2666 P10 Jesper Bokefors NaN Malmö AI Finish 34:51 Finished
In [16]:
df.tail()
Out[16]:
Unnamed: 0 Place(race) Place(cat) Bib no Category Name Unnamed: 6 Association Progress Time Status
2867 NaN NaN NaN 6855 T10 porntepin sooksaengprasit NaN Lunds universitet NaN NaN Not started
2868 NaN NaN NaN 6857 P10 Gabriel Teku NaN Lunds universitet NaN NaN Not started
2869 NaN NaN NaN 6888 P10 Viktor Karlsson NaN Genarps if NaN NaN Not started
2870 NaN NaN NaN 6892 P10 Emil Larsson NaN NaN NaN NaN Not started
2871 NaN NaN NaN 6893 P10 Göran Larsson NaN NaN NaN NaN Not started

Summary

With frameworks like AngularJS, React and Vue.js, more and more websites use client-side rendering. To parse those websites, you can't just request the HTML from the server: parsing requires running some JavaScript.

Pyppeteer makes that possible. Thanks to headless Chromium, it gives you access to the full power of a browser from Python. I find that really impressive!

I tried to use Selenium in the past but didn't find it very easy to start with. That wasn't the case with Pyppeteer. To be fair, it was a while ago and the two projects are quite different. Selenium is not just about browser automation: it allows you to perform cross-browser testing, whereas Pyppeteer is limited to Chrome/Chromium. Anyway, I'll probably look more at Pyppeteer for web application testing.

For simple tasks, Requests-HTML is a nice wrapper and gives you a simple API. If you want more control, use Pyppeteer directly.

One last note: to run this code in a Jupyter notebook, you should use tornado 4; asyncio code doesn't play well with ipython and tornado 5. See this GitHub issue: asyncio will be running by default with tornado 5. There is some work in progress for a nice integration.
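
A quick sanity check you can run in the notebook itself (the pinning is done in the conda environment, for example with tornado<5):

import tornado

# Expect a 4.x version here; with tornado 5 the asyncio event loop is
# already running and the run_until_complete calls above will fail
print(tornado.version)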

What about the Lundaloppet results you might ask? I'll explore them in another post!

Parsing HTML Tables in Python with pandas

Not long ago, I needed to parse some HTML tables from our confluence website at work. I first thought: I'm gonna need requests and BeautifulSoup. As HTML tables are well defined, I did some quick googling to see if there was some recipe or lib to parse them and I found a link to pandas. What? Can pandas do that too?

I have been using pandas for quite some time and have used read_csv, read_excel, even read_sql, but I had missed read_html!

Reading excel file with pandas

Before looking at HTML tables, I want to show a quick example of how to read an Excel file with pandas. The API is really nice. If I have to look at some Excel data, I go directly to pandas.

So let's download a sample file:

In [1]:
import io
import requests
import pandas as pd
from zipfile import ZipFile
In [2]:
r = requests.get('http://www.contextures.com/SampleData.zip')
ZipFile(io.BytesIO(r.content)).extractall()

This created the SampleData.xlsx file, which includes four sheets: Instructions, SalesOrders, SampleNumbers and MyLinks. Only the SalesOrders sheet includes tabular data, so let's read it.

In [3]:
df = pd.read_excel('SampleData.xlsx', sheet_name='SalesOrders')
In [4]:
df.head()
Out[4]:
OrderDate Region Rep Item Units Unit Cost Total
0 2016-01-06 East Jones Pencil 95 1.99 189.05
1 2016-01-23 Central Kivell Binder 50 19.99 999.50
2 2016-02-09 Central Jardine Pencil 36 4.99 179.64
3 2016-02-26 Central Gill Pen 27 19.99 539.73
4 2016-03-15 West Sorvino Pencil 56 2.99 167.44

That's it. One line and you have your data in a DataFrame that you can easily manipulate, filter, convert and display in a jupyter notebook. Can it be easier than that?

Parsing HTML Tables

So let's go back to HTML tables and look at pandas.read_html.

The function accepts:

A URL, a file-like object, or a raw string containing HTML.

Let's start with a basic HTML table in a raw string.

Parsing raw string

In [5]:
html_string = """
<table>
  <thead>
    <tr>
      <th>Programming Language</th>
      <th>Creator</th> 
      <th>Year</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C</td>
      <td>Dennis Ritchie</td> 
      <td>1972</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>Guido Van Rossum</td> 
      <td>1989</td>
    </tr>
    <tr>
      <td>Ruby</td>
      <td>Yukihiro Matsumoto</td> 
      <td>1995</td>
    </tr>
  </tbody>
</table>
"""

We can render the table using IPython display_html function:

In [6]:
from IPython.display import display_html
display_html(html_string, raw=True)
Programming Language Creator Year
C Dennis Ritchie 1972
Python Guido Van Rossum 1989
Ruby Yukihiro Matsumoto 1995

Let's import this HTML table in a DataFrame. Note that the function read_html always returns a list of DataFrame objects:

In [7]:
dfs = pd.read_html(html_string)
dfs
Out[7]:
[  Programming Language             Creator  Year
 0                    C      Dennis Ritchie  1972
 1               Python    Guido Van Rossum  1989
 2                 Ruby  Yukihiro Matsumoto  1995]
In [8]:
df = dfs[0]
df
Out[8]:
Programming Language Creator Year
0 C Dennis Ritchie 1972
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995

This looks quite similar to the raw string we rendered above, but we are printing a pandas DataFrame object here! We can apply any operation we want.

In [9]:
df[df.Year > 1975]
Out[9]:
Programming Language Creator Year
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995

Pandas automatically found the header to use thanks to the <thead> tag. That tag is not mandatory to define a table and is actually often missing on the web. So what happens if it's not present?

In [10]:
html_string = """
<table>
  <tr>
    <th>Programming Language</th>
    <th>Creator</th> 
    <th>Year</th>
  </tr>
  <tr>
    <td>C</td>
    <td>Dennis Ritchie</td> 
    <td>1972</td>
  </tr>
  <tr>
    <td>Python</td>
    <td>Guido Van Rossum</td> 
    <td>1989</td>
  </tr>
  <tr>
    <td>Ruby</td>
    <td>Yukihiro Matsumoto</td> 
    <td>1995</td>
  </tr>
</table>
"""
In [11]:
pd.read_html(html_string)[0]
Out[11]:
0 1 2
0 Programming Language Creator Year
1 C Dennis Ritchie 1972
2 Python Guido Van Rossum 1989
3 Ruby Yukihiro Matsumoto 1995

In this case, we need to pass the row number to use as header.

In [12]:
pd.read_html(html_string, header=0)[0]
Out[12]:
Programming Language Creator Year
0 C Dennis Ritchie 1972
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995

Parsing an http URL

The same data we read from our Excel file is available in a table at the following address: http://www.contextures.com/xlSampleData01.html

Let's pass this url to read_html:

In [13]:
dfs = pd.read_html('http://www.contextures.com/xlSampleData01.html')
In [14]:
dfs
Out[14]:
[             0        1         2        3      4         5        6
 0    OrderDate   Region       Rep     Item  Units  UnitCost    Total
 1     1/6/2016     East     Jones   Pencil     95      1.99   189.05
 2    1/23/2016  Central    Kivell   Binder     50     19.99   999.50
 3     2/9/2016  Central   Jardine   Pencil     36      4.99   179.64
 4    2/26/2016  Central      Gill      Pen     27     19.99   539.73
 5    3/15/2016     West   Sorvino   Pencil     56      2.99   167.44
 6     4/1/2016     East     Jones   Binder     60      4.99   299.40
 7    4/18/2016  Central   Andrews   Pencil     75      1.99   149.25
 8     5/5/2016  Central   Jardine   Pencil     90      4.99   449.10
 9    5/22/2016     West  Thompson   Pencil     32      1.99    63.68
 10    6/8/2016     East     Jones   Binder     60      8.99   539.40
 11   6/25/2016  Central    Morgan   Pencil     90      4.99   449.10
 12   7/12/2016     East    Howard   Binder     29      1.99    57.71
 13   7/29/2016     East    Parent   Binder     81     19.99  1619.19
 14   8/15/2016     East     Jones   Pencil     35      4.99   174.65
 15    9/1/2016  Central     Smith     Desk      2    125.00   250.00
 16   9/18/2016     East     Jones  Pen Set     16     15.99   255.84
 17   10/5/2016  Central    Morgan   Binder     28      8.99   251.72
 18  10/22/2016     East     Jones      Pen     64      8.99   575.36
 19   11/8/2016     East    Parent      Pen     15     19.99   299.85
 20  11/25/2016  Central    Kivell  Pen Set     96      4.99   479.04
 21  12/12/2016  Central     Smith   Pencil     67      1.29    86.43
 22  12/29/2016     East    Parent  Pen Set     74     15.99  1183.26
 23   1/15/2017  Central      Gill   Binder     46      8.99   413.54
 24    2/1/2017  Central     Smith   Binder     87     15.00  1305.00
 25   2/18/2017     East     Jones   Binder      4      4.99    19.96
 26    3/7/2017     West   Sorvino   Binder      7     19.99   139.93
 27   3/24/2017  Central   Jardine  Pen Set     50      4.99   249.50
 28   4/10/2017  Central   Andrews   Pencil     66      1.99   131.34
 29   4/27/2017     East    Howard      Pen     96      4.99   479.04
 30   5/14/2017  Central      Gill   Pencil     53      1.29    68.37
 31   5/31/2017  Central      Gill   Binder     80      8.99   719.20
 32   6/17/2017  Central    Kivell     Desk      5    125.00   625.00
 33    7/4/2017     East     Jones  Pen Set     62      4.99   309.38
 34   7/21/2017  Central    Morgan  Pen Set     55     12.49   686.95
 35    8/7/2017  Central    Kivell  Pen Set     42     23.95  1005.90
 36   8/24/2017     West   Sorvino     Desk      3    275.00   825.00
 37   9/10/2017  Central      Gill   Pencil      7      1.29     9.03
 38   9/27/2017     West   Sorvino      Pen     76      1.99   151.24
 39  10/14/2017     West  Thompson   Binder     57     19.99  1139.43
 40  10/31/2017  Central   Andrews   Pencil     14      1.29    18.06
 41  11/17/2017  Central   Jardine   Binder     11      4.99    54.89
 42   12/4/2017  Central   Jardine   Binder     94     19.99  1879.06
 43  12/21/2017  Central   Andrews   Binder     28      4.99   139.72]

We have one table and can see that we need to pass the row number to use as header (because <thead> is not present).

In [15]:
dfs = pd.read_html('http://www.contextures.com/xlSampleData01.html', header=0)
dfs[0].head()
Out[15]:
OrderDate Region Rep Item Units UnitCost Total
0 1/6/2016 East Jones Pencil 95 1.99 189.05
1 1/23/2016 Central Kivell Binder 50 19.99 999.50
2 2/9/2016 Central Jardine Pencil 36 4.99 179.64
3 2/26/2016 Central Gill Pen 27 19.99 539.73
4 3/15/2016 West Sorvino Pencil 56 2.99 167.44

Nice!

Parsing an https URL

The documentation states that:

Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

This is true, but bs4 + html5lib are used as a fallback when lxml fails. I guess this is why passing an https url does work. We can confirm that with a Wikipedia page.

In [16]:
pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)', header=0)[1]
Out[16]:
Type mutable Description Syntax example
0 bool immutable Boolean value True False
1 bytearray mutable Sequence of bytes bytearray(b'Some ASCII') bytearray(b"Some ASCI...
2 bytes immutable Sequence of bytes b'Some ASCII' b"Some ASCII" bytes([119, 105, 1...
3 complex immutable Complex number with real and imaginary parts 3+2.7j
4 dict mutable Associative array (or dictionary) of key and v... {'key1': 1.0, 3: False}
5 ellipsis NaN An ellipsis placeholder to be used as an index... ...
6 float immutable Floating point number, system-defined precision 3.1415927
7 frozenset immutable Unordered set, contains no duplicates; can con... frozenset([4.0, 'string', True])
8 int immutable Integer of unlimited magnitude[76] 42
9 list mutable List, can contain mixed types [4.0, 'string', True]
10 set mutable Unordered set, contains no duplicates; can con... {4.0, 'string', True}
11 str immutable A character string: sequence of Unicode codepo... 'Wikipedia' "Wikipedia" """Spanning multiple l...
12 tuple immutable Can contain mixed types (4.0, 'string', True)But we can append element...
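
As an aside, if you want to control which parser is used instead of relying on the lxml to bs4 fallback, read_html takes a flavor argument. A small sketch (it requires beautifulsoup4 and html5lib to be installed):

import pandas as pd

# Explicitly use the bs4/html5lib parser instead of relying on the fallback
dfs = pd.read_html(
    'https://en.wikipedia.org/wiki/Python_(programming_language)',
    flavor='bs4',
    header=0,
)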

But what if the url requires authentication?

In that case we can use requests to get the HTML and pass the string to pandas!

To demonstrate authentication, we can use http://httpbin.org

We can first confirm that passing a url that requires authentication raises a 401

In [17]:
pd.read_html('https://httpbin.org/basic-auth/myuser/mypasswd')
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-17-7e6b50c9f1f3> in <module>()
----> 1 pd.read_html('https://httpbin.org/basic-auth/myuser/mypasswd')

~/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
    913                   thousands=thousands, attrs=attrs, encoding=encoding,
    914                   decimal=decimal, converters=converters, na_values=na_values,
--> 915                   keep_default_na=keep_default_na)

~/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
    747             break
    748     else:
--> 749         raise_with_traceback(retained)
    750 
    751     ret = []

~/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    383         if traceback == Ellipsis:
    384             _, _, traceback = sys.exc_info()
--> 385         raise exc.with_traceback(traceback)
    386 else:
    387     # this version of raise is a syntax error in Python 3

HTTPError: HTTP Error 401: UNAUTHORIZED
In [ ]:
r = requests.get('https://httpbin.org/basic-auth/myuser/mypasswd')
r.status_code

Yes, as expected. Let's pass the username and password with requests.

In [ ]:
r = requests.get('https://httpbin.org/basic-auth/myuser/mypasswd', auth=('myuser', 'mypasswd'))
r.status_code

We could now pass r.text to pandas. Note that http://httpbin.org was only used here to demonstrate authentication: it's a testing service that returns JSON-encoded responses and no HTML, so parsing it doesn't make sense.

The following example shows how to combine requests and pandas.

In [18]:
r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
pd.read_html(r.text, header=0)[1]
Out[18]:
Type mutable Description Syntax example
0 bool immutable Boolean value True False
1 bytearray mutable Sequence of bytes bytearray(b'Some ASCII') bytearray(b"Some ASCI...
2 bytes immutable Sequence of bytes b'Some ASCII' b"Some ASCII" bytes([119, 105, 1...
3 complex immutable Complex number with real and imaginary parts 3+2.7j
4 dict mutable Associative array (or dictionary) of key and v... {'key1': 1.0, 3: False}
5 ellipsis NaN An ellipsis placeholder to be used as an index... ...
6 float immutable Floating point number, system-defined precision 3.1415927
7 frozenset immutable Unordered set, contains no duplicates; can con... frozenset([4.0, 'string', True])
8 int immutable Integer of unlimited magnitude[76] 42
9 list mutable List, can contain mixed types [4.0, 'string', True]
10 set mutable Unordered set, contains no duplicates; can con... {4.0, 'string', True}
11 str immutable A character string: sequence of Unicode codepo... 'Wikipedia' "Wikipedia" """Spanning multiple l...
12 tuple immutable Can contain mixed types (4.0, 'string', True)But we can append element...

A more complex example

We looked at some quite simple examples so far. So let's try a page with several tables: https://en.wikipedia.org/wiki/Timeline_of_programming_languages

In [19]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages')
In [20]:
len(dfs)
Out[20]:
13

If we look at the page we have 8 tables (one per decade). Looking at our dfs list, we can see that the first interesting table is the fifth one and that we need to pass the row to use as header.

In [21]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0)
dfs[4]
Out[21]:
Year Name Chief developer, company Predecessor(s)
0 1943–45 Plankalkül (concept) Konrad Zuse none (unique language)
1 1943–46 ENIAC coding system John von Neumann, John Mauchly, J. Presper Eck... none (unique language)
2 1946 ENIAC Short Code Richard Clippinger, John von Neumann after Ala... ENIAC coding system
3 1946 Von Neumann and Goldstine graphing system (Not... John von Neumann and Herman Goldstine ENIAC coding system
4 1947 ARC Assembly Kathleen Booth[1][2] ENIAC coding system
5 1948 CPC Coding scheme Howard H. Aiken Analytical Engine order code
6 1948 Curry notation system Haskell Curry ENIAC coding system
7 1948 Plankalkül (concept published) Konrad Zuse none (unique language)
8 1949 Short Code John Mauchly and William F. Schmitt ENIAC Short Code
9 Year Name Chief developer, company Predecessor(s)

Notice that the header was repeated in the last row (to make the table easier to read on the HTML page). We can filter that out after concatenating the 8 tables together to get one DataFrame.

In [22]:
df = pd.concat(dfs[4:12])
df
Out[22]:
Year Name Chief developer, company Predecessor(s)
0 1943–45 Plankalkül (concept) Konrad Zuse none (unique language)
1 1943–46 ENIAC coding system John von Neumann, John Mauchly, J. Presper Eck... none (unique language)
2 1946 ENIAC Short Code Richard Clippinger, John von Neumann after Ala... ENIAC coding system
3 1946 Von Neumann and Goldstine graphing system (Not... John von Neumann and Herman Goldstine ENIAC coding system
4 1947 ARC Assembly Kathleen Booth[1][2] ENIAC coding system
5 1948 CPC Coding scheme Howard H. Aiken Analytical Engine order code
6 1948 Curry notation system Haskell Curry ENIAC coding system
7 1948 Plankalkül (concept published) Konrad Zuse none (unique language)
8 1949 Short Code John Mauchly and William F. Schmitt ENIAC Short Code
9 Year Name Chief developer, company Predecessor(s)
0 1950 Short Code William F Schmidt, Albert B. Tonik,[3] J.R. Logan Brief Code
1 1950 Birkbeck Assembler Kathleen Booth ARC
2 1951 Superplan Heinz Rutishauser Plankalkül
3 1951 ALGAE Edward A Voorhees and Karl Balke none (unique language)
4 1951 Intermediate Programming Language Arthur Burks Short Code
5 1951 Regional Assembly Language Maurice Wilkes EDSAC
6 1951 Boehm unnamed coding system Corrado Böhm CPC Coding scheme
7 1951 Klammerausdrücke Konrad Zuse Plankalkül
8 1951 OMNIBAC Symbolic Assembler Charles Katz Short Code
9 1951 Stanislaus (Notation) Fritz Bauer none (unique language)
10 1951 Whirlwind assembler Charles Adams and Jack Gilmore at MIT Project ... EDSAC
11 1951 Rochester assembler Nat Rochester EDSAC
12 1951 Sort Merge Generator Betty Holberton none (unique language)
13 1952 A-0 Grace Hopper Short Code
14 1952 Glennie Autocode Alick Glennie after Alan Turing CPC Coding scheme
15 1952 Editing Generator Milly Koss SORT/MERGE
16 1952 COMPOOL RAND/SDC none (unique language)
17 1953 Speedcoding John W. Backus none (unique language)
18 1953 READ/PRINT Don Harroff, James Fishman, George Ryckman none (unique language)
19 1954 Laning and Zierler system Laning, Zierler, Adams at MIT Project Whirlwind none (unique language)
... ... ... ... ...
47 2009 Chapel Brad Chamberlain, Cray Inc. HPF, ZPL
48 2009 Go Google C, Oberon, Limbo, Smalltalk
49 2009 CoffeeScript Jeremy Ashkenas JavaScript, Ruby, Python, Haskell
50 2009 Idris Edwin Brady Haskell, Agda, Coq
51 2009 Parasail S. Tucker Taft, AdaCore Modula, Ada, Pascal, ML
52 2009 Whiley David J. Pearce Java, C, Python
53 Year Name Chief developer, company Predecessor(s)
0 2010 Rust Graydon Hoare, Mozilla Alef, C++, Camlp4, Erlang, Hermes, Limbo, Napi...
1 2011 Ceylon Gavin King, Red Hat Java
2 2011 Dart Google Java, JavaScript, CoffeeScript, Go
3 2011 C++11 C++ ISO/IEC 14882:2011 C++, Standard C, C
4 2011 Kotlin JetBrains Java, Scala, Groovy, C#, Gosu
5 2011 Red Nenad Rakocevic Rebol, Scala, Lua
6 2011 Opa MLstate OCaml, Erlang, JavaScript
7 2012 Elixir José Valim Erlang, Ruby, Clojure
8 2012 Elm Evan Czaplicki Haskell, Standard ML, OCaml, F#
9 2012 TypeScript Anders Hejlsberg, Microsoft JavaScript, CoffeeScript
10 2012 Julia Jeff Bezanson, Stefan Karpinski, Viral Shah, A... MATLAB, Lisp, C, Fortran, Mathematica[9] (stri...
11 2012 P Vivek Gupta: not the politician, Ethan Jackson... NaN
12 2012 Ada 2012 ARA and Ada Europe (ISO/IEC 8652:2012) Ada 2005, ISO/IEC 8652:1995/Amd 1:2007
13 2014 Crystal Ary Borenszweig, Manas Technology Solutions Ruby, C, Rust, Go, C#, Python
14 2014 Hack Facebook PHP
15 2014 Swift Apple Inc. Objective-C, Rust, Haskell, Ruby, Python, C#, CLU
16 2014 C++14 C++ ISO/IEC 14882:2014 C++, Standard C, C
17 2015 Atari 2600 SuperCharger BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...
18 2015 Perl 6 The Rakudo Team Perl, Haskell, Python, Ruby
19 2016 Ring Mahmoud Fayed Lua, Python, Ruby, C, C#, BASIC, QML, xBase, S...
20 2017 C++17 C++ ISO/IEC 14882:2017 C++, Standard C, C
21 2017 Atari 2600 Flashback BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...
22 Year Name Chief developer, company Predecessor(s)

388 rows × 4 columns

Remove the extra header rows.

In [23]:
prog_lang = df[df.Year != 'Year']
prog_lang
Out[23]:
Year Name Chief developer, company Predecessor(s)
0 1943–45 Plankalkül (concept) Konrad Zuse none (unique language)
1 1943–46 ENIAC coding system John von Neumann, John Mauchly, J. Presper Eck... none (unique language)
2 1946 ENIAC Short Code Richard Clippinger, John von Neumann after Ala... ENIAC coding system
3 1946 Von Neumann and Goldstine graphing system (Not... John von Neumann and Herman Goldstine ENIAC coding system
4 1947 ARC Assembly Kathleen Booth[1][2] ENIAC coding system
5 1948 CPC Coding scheme Howard H. Aiken Analytical Engine order code
6 1948 Curry notation system Haskell Curry ENIAC coding system
7 1948 Plankalkül (concept published) Konrad Zuse none (unique language)
8 1949 Short Code John Mauchly and William F. Schmitt ENIAC Short Code
0 1950 Short Code William F Schmidt, Albert B. Tonik,[3] J.R. Logan Brief Code
1 1950 Birkbeck Assembler Kathleen Booth ARC
2 1951 Superplan Heinz Rutishauser Plankalkül
3 1951 ALGAE Edward A Voorhees and Karl Balke none (unique language)
4 1951 Intermediate Programming Language Arthur Burks Short Code
5 1951 Regional Assembly Language Maurice Wilkes EDSAC
6 1951 Boehm unnamed coding system Corrado Böhm CPC Coding scheme
7 1951 Klammerausdrücke Konrad Zuse Plankalkül
8 1951 OMNIBAC Symbolic Assembler Charles Katz Short Code
9 1951 Stanislaus (Notation) Fritz Bauer none (unique language)
10 1951 Whirlwind assembler Charles Adams and Jack Gilmore at MIT Project ... EDSAC
11 1951 Rochester assembler Nat Rochester EDSAC
12 1951 Sort Merge Generator Betty Holberton none (unique language)
13 1952 A-0 Grace Hopper Short Code
14 1952 Glennie Autocode Alick Glennie after Alan Turing CPC Coding scheme
15 1952 Editing Generator Milly Koss SORT/MERGE
16 1952 COMPOOL RAND/SDC none (unique language)
17 1953 Speedcoding John W. Backus none (unique language)
18 1953 READ/PRINT Don Harroff, James Fishman, George Ryckman none (unique language)
19 1954 Laning and Zierler system Laning, Zierler, Adams at MIT Project Whirlwind none (unique language)
20 1954 Mark I Autocode Tony Brooker Glennie Autocode
... ... ... ... ...
45 2008 Genie Jamie McCracken Python, Boo, D, Object Pascal
46 2008 Pure Albert Gräf Q
47 2009 Chapel Brad Chamberlain, Cray Inc. HPF, ZPL
48 2009 Go Google C, Oberon, Limbo, Smalltalk
49 2009 CoffeeScript Jeremy Ashkenas JavaScript, Ruby, Python, Haskell
50 2009 Idris Edwin Brady Haskell, Agda, Coq
51 2009 Parasail S. Tucker Taft, AdaCore Modula, Ada, Pascal, ML
52 2009 Whiley David J. Pearce Java, C, Python
0 2010 Rust Graydon Hoare, Mozilla Alef, C++, Camlp4, Erlang, Hermes, Limbo, Napi...
1 2011 Ceylon Gavin King, Red Hat Java
2 2011 Dart Google Java, JavaScript, CoffeeScript, Go
3 2011 C++11 C++ ISO/IEC 14882:2011 C++, Standard C, C
4 2011 Kotlin JetBrains Java, Scala, Groovy, C#, Gosu
5 2011 Red Nenad Rakocevic Rebol, Scala, Lua
6 2011 Opa MLstate OCaml, Erlang, JavaScript
7 2012 Elixir José Valim Erlang, Ruby, Clojure
8 2012 Elm Evan Czaplicki Haskell, Standard ML, OCaml, F#
9 2012 TypeScript Anders Hejlsberg, Microsoft JavaScript, CoffeeScript
10 2012 Julia Jeff Bezanson, Stefan Karpinski, Viral Shah, A... MATLAB, Lisp, C, Fortran, Mathematica[9] (stri...
11 2012 P Vivek Gupta: not the politician, Ethan Jackson... NaN
12 2012 Ada 2012 ARA and Ada Europe (ISO/IEC 8652:2012) Ada 2005, ISO/IEC 8652:1995/Amd 1:2007
13 2014 Crystal Ary Borenszweig, Manas Technology Solutions Ruby, C, Rust, Go, C#, Python
14 2014 Hack Facebook PHP
15 2014 Swift Apple Inc. Objective-C, Rust, Haskell, Ruby, Python, C#, CLU
16 2014 C++14 C++ ISO/IEC 14882:2014 C++, Standard C, C
17 2015 Atari 2600 SuperCharger BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...
18 2015 Perl 6 The Rakudo Team Perl, Haskell, Python, Ruby
19 2016 Ring Mahmoud Fayed Lua, Python, Ruby, C, C#, BASIC, QML, xBase, S...
20 2017 C++17 C++ ISO/IEC 14882:2017 C++, Standard C, C
21 2017 Atari 2600 Flashback BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...

380 rows × 4 columns

In what year was Python created?

In [24]:
prog_lang[prog_lang.Name == 'Python']
Out[24]:
Year Name Chief developer, company Predecessor(s)
9 1991 Python Guido van Rossum ABC, ALGOL 68, Icon, Modula-3

Conclusion

The last example should say it all.

In [25]:
import pandas as pd

dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0)
df = pd.concat(dfs[4:12])
prog_lang = df[df.Year != 'Year']

Four lines of code (including the import) and we have one DataFrame containing the data from 8 different HTML tables on one wikipedia page!

Do I need to say why I love Python and pandas? :-)

This post was written in a jupyter notebook. You can find the notebook on GitHub and download the conda environment.yml file to get all the dependencies I used.

Logging to a Tkinter ScrolledText Widget

I've been programming in Python for almost 10 years. I've written many CLI tools and some web applications (mainly using Flask), but I had never built a GUI.

PyQt seems to be one of the most popular frameworks. I had a look at it but I wasn't hooked. It looks like you really need to embrace the Qt world: you shouldn't use a Python Thread but a QThread instead. Need pySerial? Wait, there is QtSerialPort. I guess this can be a pro or a con depending on your background.

I looked more into tkinter. I must say that in my mind it was a bit old and didn't look very modern. I didn't know that Tk 8.5 came with an entirely new themed widget set to address the dated appearance. The official tutorial is quite nice and comes with code examples in different languages (including Python).

The GUI I needed to write wasn't very advanced. I wanted a kind of console where I could display log messages.

TextHandler

I quickly found an example on StackOverflow to send Python logging to a tkinter Text widget:

class TextHandler(logging.Handler):
    """This class allows you to log to a Tkinter Text or ScrolledText widget"""

    def __init__(self, text):
        # run the regular Handler __init__
        logging.Handler.__init__(self)
        # Store a reference to the Text it will log to
        self.text = text

    def emit(self, record):
        msg = self.format(record)

        def append():
            self.text.configure(state='normal')
            self.text.insert(tk.END, msg + '\n')
            self.text.configure(state='disabled')
            # Autoscroll to the bottom
            self.text.yview(tk.END)
        # This is necessary because we can't modify the Text from other threads
        self.text.after(0, append)

This looks nice but doesn't work if you try to send a log message from another thread (despite the comment), because the text widget is passed along with the logging handler to the other thread, and you can only write to a tkinter widget from the main thread.

This is explained in another StackOverflow question but I didn't like the proposed solution. If you implement specific methods as explained (put_line_to_queue), you lose the advantage of just calling the log function from different parts of the program.

QueueHandler

Using a Queue is indeed the way to share data between threads. So I implemented a simple QueueHandler:

class QueueHandler(logging.Handler):
    """Class to send logging records to a queue

    It can be used from different threads
    """

    def __init__(self, log_queue):
        super().__init__()
        self.log_queue = log_queue

    def emit(self, record):
        self.log_queue.put(record)

The handler only puts the message in a queue. I created a ConsoleUi class to poll the messages from the queue and display them in a scrolled text widget:

logger = logging.getLogger(__name__)


class ConsoleUi:
    """Poll messages from a logging queue and display them in a scrolled text widget"""

    def __init__(self, frame):
        self.frame = frame
        # Create a ScrolledText widget
        self.scrolled_text = ScrolledText(frame, state='disabled', height=12)
        self.scrolled_text.grid(row=0, column=0, sticky=(N, S, W, E))
        self.scrolled_text.configure(font='TkFixedFont')
        self.scrolled_text.tag_config('INFO', foreground='black')
        self.scrolled_text.tag_config('DEBUG', foreground='gray')
        self.scrolled_text.tag_config('WARNING', foreground='orange')
        self.scrolled_text.tag_config('ERROR', foreground='red')
        self.scrolled_text.tag_config('CRITICAL', foreground='red', underline=1)
        # Create a logging handler using a queue
        self.log_queue = queue.Queue()
        self.queue_handler = QueueHandler(self.log_queue)
        formatter = logging.Formatter('%(asctime)s: %(message)s')
        self.queue_handler.setFormatter(formatter)
        logger.addHandler(self.queue_handler)
        # Start polling messages from the queue
        self.frame.after(100, self.poll_log_queue)

    def display(self, record):
        msg = self.queue_handler.format(record)
        self.scrolled_text.configure(state='normal')
        self.scrolled_text.insert(tk.END, msg + '\n', record.levelname)
        self.scrolled_text.configure(state='disabled')
        # Autoscroll to the bottom
        self.scrolled_text.yview(tk.END)

    def poll_log_queue(self):
        # Check every 100ms if there is a new message in the queue to display
        while True:
            try:
                record = self.log_queue.get(block=False)
            except queue.Empty:
                break
            else:
                self.display(record)
        self.frame.after(100, self.poll_log_queue)

I can safely use the logger from different threads because only a queue is passed with the handler, no tkinter widget.

To demonstrate that, I created a separate thread to display the time every second:

class Clock(threading.Thread):
    """Class to display the time every seconds

    Every 5 seconds, the time is displayed using the logging.ERROR level
    to show that different colors are associated to the log levels
    """

    def __init__(self):
        super().__init__()
        self._stop_event = threading.Event()

    def run(self):
        logger.debug('Clock started')
        previous = -1
        while not self._stop_event.is_set():
            now = datetime.datetime.now()
            if previous != now.second:
                previous = now.second
                if now.second % 5 == 0:
                    level = logging.ERROR
                else:
                    level = logging.INFO
                logger.log(level, now)
            time.sleep(0.2)

    def stop(self):
        self._stop_event.set()

The full code is available on github. If you checkout the version v0.1.0 and run it, you'll see something like that:

/images/tkinter/logging_handler.png
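
For completeness, here is a minimal sketch of how those pieces can be wired together. It is not the exact App class from the repository and assumes the QueueHandler, ConsoleUi and Clock classes defined above are available in the same module:

import logging
import tkinter as tk
from tkinter import ttk
from tkinter import N, S, W, E

logging.basicConfig(level=logging.DEBUG)


def main():
    root = tk.Tk()
    root.title('Logging Handler')
    root.columnconfigure(0, weight=1)
    root.rowconfigure(0, weight=1)
    console_frame = ttk.Frame(root)
    console_frame.grid(column=0, row=0, sticky=(N, W, E, S))
    console_frame.columnconfigure(0, weight=1)
    console_frame.rowconfigure(0, weight=1)
    console = ConsoleUi(console_frame)  # defined above
    clock = Clock()                     # defined above
    clock.start()

    def quit_app():
        clock.stop()
        root.destroy()

    root.protocol('WM_DELETE_WINDOW', quit_app)
    root.mainloop()


if __name__ == '__main__':
    main()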

3-pane layout

The ConsoleUi class takes a frame as argument. It makes it easy to integrate in another layout. Let's see an example with a Paned Window widget to implement the common 3-pane layout.

Let's first create two new classes. The first one will be used to display a simple form to send a message via logging. The user can select the desired logging level:

class FormUi:

    def __init__(self, frame):
        self.frame = frame
        # Create a combobox to select the logging level
        values = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']
        self.level = tk.StringVar()
        ttk.Label(self.frame, text='Level:').grid(column=0, row=0, sticky=W)
        self.combobox = ttk.Combobox(
            self.frame,
            textvariable=self.level,
            width=25,
            state='readonly',
            values=values
        )
        self.combobox.current(0)
        self.combobox.grid(column=1, row=0, sticky=(W, E))
        # Create a text field to enter a message
        self.message = tk.StringVar()
        ttk.Label(self.frame, text='Message:').grid(column=0, row=1, sticky=W)
        ttk.Entry(self.frame, textvariable=self.message, width=25).grid(column=1, row=1, sticky=(W, E))
        # Add a button to log the message
        self.button = ttk.Button(self.frame, text='Submit', command=self.submit_message)
        self.button.grid(column=1, row=2, sticky=W)

    def submit_message(self):
        # Get the logging level numeric value
        lvl = getattr(logging, self.level.get())
        logger.log(lvl, self.message.get())

The other class is a dummy one to show the 3-pane layout:

class ThirdUi:

    def __init__(self, frame):
        self.frame = frame
        ttk.Label(self.frame, text='This is just an example of a third frame').grid(column=0, row=1, sticky=W)
        ttk.Label(self.frame, text='With another line here!').grid(column=0, row=4, sticky=W)

With those new classes, the only change required is in the App class, to create a vertical and a horizontal ttk.PanedWindow. The horizontal pane is split into two frames (the form and the console):

 class App:

     def __init__(self, root):
@@ -109,11 +148,24 @@ class App:
         root.title('Logging Handler')
         root.columnconfigure(0, weight=1)
         root.rowconfigure(0, weight=1)
-        console_frame = ttk.Frame(root)
-        console_frame.grid(column=0, row=0, sticky=(N, W, E, S))
+        # Create the panes and frames
+        vertical_pane = ttk.PanedWindow(self.root, orient=VERTICAL)
+        vertical_pane.grid(row=0, column=0, sticky="nsew")
+        horizontal_pane = ttk.PanedWindow(vertical_pane, orient=HORIZONTAL)
+        vertical_pane.add(horizontal_pane)
+        form_frame = ttk.Labelframe(horizontal_pane, text="MyForm")
+        form_frame.columnconfigure(1, weight=1)
+        horizontal_pane.add(form_frame, weight=1)
+        console_frame = ttk.Labelframe(horizontal_pane, text="Console")
         console_frame.columnconfigure(0, weight=1)
         console_frame.rowconfigure(0, weight=1)
+        horizontal_pane.add(console_frame, weight=1)
+        third_frame = ttk.Labelframe(vertical_pane, text="Third Frame")
+        vertical_pane.add(third_frame, weight=1)
+        # Initialize all frames
+        self.form = FormUi(form_frame)
         self.console = ConsoleUi(console_frame)
+        self.third = ThirdUi(third_frame)
         self.clock = Clock()
         self.clock.start()
         self.root.protocol('WM_DELETE_WINDOW', self.quit)

Note that the Clock and ConsoleUi classes were left untouched. We just pass a ttk.LabelFrame instead of a ttk.Frame to the ConsoleUi class.
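
The quit method registered with WM_DELETE_WINDOW is not shown in the diff. As a minimal sketch (assuming the App keeps the self.root and self.clock attributes shown above), it only needs to stop the clock thread and destroy the window:

    def quit(self, *args):
        # Stop the clock thread before destroying the main window
        # (self.clock and self.root are the attributes created in __init__)
        self.clock.stop()
        self.root.destroy()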

This looks more like what could be a real application:

/images/tkinter/paned_window.png

The main window and the different panes can be resized nicely:

/images/tkinter/paned_window_resized.png

As already mentioned, the full example is available on github. You can checkout the version v0.2.0 to see the 3-pane layout.

Conclusion

I want to give some credit to tkinter. It doesn't have a steep learning curve and allows you to easily create a nice GUI. You can continue using what you know in Python (Queue, Threads, modules like pySerial). I can only recommend it if you are familiar with Python and want to create a simple GUI. That being said, I'll probably try to dive more into PyQt when I have more time.

Experimenting with asyncio on a Raspberry Pi

In a previous post, I described how I built a LEGO Macintosh Classic with a Raspberry Pi and e-paper display.

For testing purposes I installed the clock demo which is part of the Embedded Artists repository. Of course I wanted to do more than display the time on this little box. I also wanted to take advantage of the button I had integrated.

One idea was to create a small web server so that I could receive and display messages. The application would basically:

  • display the time (every minute)
  • when receiving a message, stop the clock and display the message
  • when the button is pressed, start the clock again
/images/legomac/press_button.gif

I don't know about you, but this really makes me think event loop! I learnt asynchronous programming with Dave Peticolas' Twisted Introduction a few years ago. If you are not familiar with asynchronous programming, I really recommend it. I wrote a few applications using Twisted but I hadn't had the opportunity to use asyncio yet. This was a very good occasion!

asyncio

REST API using aiohttp

There are already several asyncio web frameworks to build an HTTP server. I decided to go with aiohttp which is kind of the default one.

Using this tutorial I wrote a simple REST API using aiohttp. It uses JSON Web Tokens, which is something else I had been wanting to try.

The API has only 3 endpoints:

def setup_routes(app):
    app.router.add_get('/', index)
    app.router.add_post('/login', login)
    app.router.add_post('/messages', post_message)

  • / to check that our token is valid
  • /login to login
  • /messages to post messages

async def login(request):
    config = request.app['config']
    data = await request.json()
    try:
        user = data['username']
        passwd = data['password']
    except KeyError:
        return web.HTTPBadRequest(reason='Invalid arguments')
    # We have only one user hard-coded in the config file...
    if user != config['username'] or passwd != config['password']:
        return web.HTTPBadRequest(reason='Invalid credentials')
    payload = {
        'user_id': 1,
        'exp': datetime.datetime.utcnow() + datetime.timedelta(seconds=config['jwt_exp_delta_seconds'])
    }
    jwt_token = jwt.encode(payload, config['jwt_secret'], config['jwt_algorithm'])
    logger.debug(f'JWT token created for {user}')
    return web.json_response({'token': jwt_token.decode('utf-8')})


@login_required
async def post_message(request):
    if request.content_type != 'application/json':
        return web.HTTPBadRequest()
    data = await request.json()
    try:
        message = data['message']
    except KeyError:
        return web.HTTPBadRequest()
    logger.debug(f'Message received from {request.user}: {message}')
    return web.json_response({'message': message}, status=201)


@login_required
async def index(request):
    return web.json_response({'message': 'Welcome to LegoMac {}!'.format(request.user)})
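
The login_required decorator is not shown here. A minimal sketch of what it could look like (not the exact code from the repository): check the JWT token passed in the Authorization header and reject the request with a 401 otherwise. The config keys and the request.user attribute are assumptions based on the handlers above:

import functools

import jwt
from aiohttp import web


def login_required(func):
    @functools.wraps(func)
    async def wrapper(request):
        config = request.app['config']
        token = request.headers.get('Authorization')
        if token is None:
            return web.json_response({'error': 'Unauthorized'}, status=401)
        try:
            jwt.decode(token, config['jwt_secret'],
                       algorithms=[config['jwt_algorithm']])
        except (jwt.DecodeError, jwt.ExpiredSignatureError):
            return web.json_response({'error': 'Unauthorized'}, status=401)
        # only one user is hard-coded in the config file, so expose
        # its name as request.user for the handlers above
        request.user = config['username']
        return await func(request)
    return wrapper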

Raspberry Pi GPIO and asyncio

The default Python package to control the Raspberry Pi GPIO seems to be RPi.GPIO. That's at least what is used in the ImageDemoButton.py from Embedded Artists.

An alternative is the pigpio library, which provides a daemon to access the Raspberry Pi GPIO via a pipe or socket interface. And someone (Pierre Rust) already created an asyncio-based Python client for the pigpio daemon: apigpio.

Exactly what I needed! It's basically an (incomplete) port of the original Python client provided with pigpio, but more than sufficient for my needs. I just want to get a notification when pressing the button on top of the screen.

There is an example of how to achieve that: gpio_notification.py.

E-paper display and asyncio

The last remaining piece is to make the e-paper display play nicely with asyncio.

The EPD driver uses the fuse library. It allows the display to be represented as a virtual directory of files. So sending a command consists of writing to a file.
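
In the synchronous driver, sending a command boils down to something like this (a simplified sketch, not the exact EPD.py code):

def _command(self, c):
    with open(os.path.join(self._epd_path, 'command'), 'wb') as f:
        f.write(c)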

There is a library to add file support to asyncio: aiofiles. The only thing I had to do was basically to wrap the file IO in EPD.py with aiofiles:

async def _command(self, c):
    async with aiofiles.open(os.path.join(self._epd_path, 'command'), 'wb') as f:
        await f.write(c)

You can't use await in a class __init__ method. So following some recommendations from stackoverflow, I used the factory pattern and moved the actions requiring some IO to a classmethod:

@classmethod
async def create(cls, *args, **kwargs):
    self = EPD(*args, **kwargs)
    async with aiofiles.open(os.path.join(self._epd_path, 'version')) as f:
        version = await f.readline()
        self._version = version.rstrip('\n')
    async with aiofiles.open(os.path.join(self._epd_path, 'panel')) as f:
        line = await f.readline()
        m = self.PANEL_RE.match(line.rstrip('\n'))
        if m is None:
            raise EPDError('invalid panel string')
        ...

To create an instance of the EPD class, use:

epd = await EPD.create([path='/path/to/epd'], [auto=boolean])

Putting everything together with aiohttp

Running the clock as a background task

For the clock, I adapted the clock demo from the Embedded Artists repository.

As described in the aiohttp documentation, I created a background task to display the clock every minute:

async def display_clock(app):
    """Background task to display clock every minute"""
    clock = Clock(app['epd'])
    first_start = True
    try:
        while True:
            while True:
                now = datetime.datetime.today()
                if now.second == 0 or first_start:
                    first_start = False
                    break
                await asyncio.sleep(0.5)
            logger.debug('display clock')
            await clock.display(now)
    except asyncio.CancelledError:
        logger.debug('display clock cancel')


async def start_background_tasks(app):
     app['epd'] = await EPD.create(auto=True)
     app['clock'] = app.loop.create_task(display_clock(app))


async def cleanup_background_tasks(app):
    app['clock'].cancel()
    await app['clock']


def init_app():
    """Create and return the aiohttp Application object"""
    app = web.Application()
    app.on_startup.append(start_background_tasks)
    app.on_cleanup.append(cleanup_background_tasks)
    ...

Stop the clock and display a message

When receiving a message, I first cancel the clock background task and then send the message to the e-paper display using ensure_future, so that I can return a JSON response without waiting for the message to be displayed (it takes about 5 seconds):

@login_required
async def post_message(request):
    if request.content_type != 'application/json':
        return web.HTTPBadRequest()
    data = await request.json()
    try:
        message = data['message']
    except KeyError:
        return web.HTTPBadRequest()
    # cancel the display clock
    request.app['clock'].cancel()
    logger.debug(f'Message received from {request.user}: {message}')
    now = datetime.datetime.now(request.app['timezone'])
    helpers.ensure_future(request.app['epd'].display_message(message, request.user, now))
    return web.json_response({'message': message}, status=201)

Start the clock when pressing the button

To be able to restart the clock when pressing the button, I connect to the pigpiod when starting the app (in start_background_tasks) and register the on_input callback:

async def start_background_tasks(app):
    app['pi'] = apigpio.Pi(app.loop)
    address = (app['config']['pigpiod_host'], app['config']['pigpiod_port'])
    await app['pi'].connect(address)
    await app['pi'].set_mode(BUTTON_GPIO, apigpio.INPUT)
    app['cb'] = await app['pi'].add_callback(
            BUTTON_GPIO,
            edge=apigpio.RISING_EDGE,
            func=functools.partial(on_input, app))
    ...

In the on_input callback, I re-create the clock background task but only if the previous task is done:

def on_input(app, gpio, level, tick):
    """Callback called when pressing the button on the e-paper display"""
    logger.info('on_input {} {} {}'.format(gpio, level, tick))
    if app['clock'].done():
        logger.info('restart clock')
        app['clock'] = app.loop.create_task(display_clock(app))

Running on the Pi

You might have noticed that I used some syntax that is Python 3.6 only. I don't really see myself using something else when starting a new project today :-) There are so many new things (like f-strings) that make your programs look cleaner.
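
For example, the f-strings used in the handlers above only work on Python 3.6 and later:

>>> user = 'john'
>>> message = 'Hello World!'
>>> f'Message received from {user}: {message}'
'Message received from john: Hello World!'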

On raspbian, if you install Python 3, you get 3.4... So how do you get Python 3.6 on a Raspberry Pi?

On desktop/server I usually use conda. It makes it so easy to install the Python version you want and many dependencies. There is no official installer for the armv6 architecture, but I found berryconda, which is a conda-based distribution for the Raspberry Pi! Really nice!

Another alternative is to use docker. There are official arm32v6 images based on alpine and some from resin.io.

I could have gone with berryconda, but there's one more thing I wanted: I'll have to open the HTTP server to the outside world, meaning I need HTTPS. As mentioned in another post, traefik makes that very easy if you use docker. So that's what I chose.

I created 3 containers:

  • traefik
  • pigpiod
  • aiolegomac

traefik

There are no official Traefik docker images for ARM yet, but an issue is currently open. So it should arrive soon!

In the meantime I created my own:

FROM arm32v6/alpine:3.6

RUN apk --update upgrade \
  && apk --no-cache --no-progress add ca-certificates \
  && apk add openssl \
  && rm -rf /var/cache/apk/*

RUN wget -O /usr/local/bin/traefik https://github.com/containous/traefik/releases/download/v1.3.3/traefik_linux-arm \
  && chmod a+x /usr/local/bin/traefik

ENTRYPOINT ["/usr/local/bin/traefik"]

pigpiod

For pigpiod, I first created an image based on arm32v6/alpine but I noticed I couldn't send a SIGTERM to the daemon to stop it properly... I'm not sure why. Alpine being based on musl instead of glibc might be the problem. Here is the Dockerfile I tried:

FROM arm32v6/alpine:3.6

RUN apk add --no-cache --virtual .build-deps \
  gcc \
  make \
  musl-dev \
  tar \
  && wget -O /tmp/pigpio.tar abyz.co.uk/rpi/pigpio/pigpio.tar \
  && tar -xf /tmp/pigpio.tar -C /tmp \
  && sed -i "/ldconfig/d" /tmp/PIGPIO/Makefile \
  && make -C /tmp/PIGPIO \
  && make -C /tmp/PIGPIO install \
  && rm -rf /tmp/PIGPIO /tmp/pigpio.tar \
  && apk del .build-deps

EXPOSE 8888

ENTRYPOINT ["/usr/local/bin/pigpiod", "-g"]

I even tried using tini as the entrypoint, without luck. So if someone has the explanation, please share it in the comments.

I tried with the resin/rpi-raspbian image and got it working properly right away:

FROM resin/rpi-raspbian:jessie

RUN apt-get update \
  && apt-get install -y \
     make \
     gcc \
     libc6-dev \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

RUN curl -o /tmp/pigpio.tar abyz.co.uk/rpi/pigpio/pigpio.tar \
  && tar -xf /tmp/pigpio.tar -C /tmp \
  && make -C /tmp/PIGPIO \
  && make -C /tmp/PIGPIO install \
  && rm -rf /tmp/pigpio.tar /tmp/PIGPIO

EXPOSE 8888

ENTRYPOINT ["/usr/local/bin/pigpiod", "-g"]

Note that the container has to run in privileged mode to access the GPIO.

aiolegomac

For the main application, the Dockerfile is quite standard for a Python application:

FROM resin/raspberry-pi-python:3.6

RUN apt-get update \
  && apt-get install -y \
     fonts-liberation \
     fonts-dejavu  \
     libjpeg-dev \
     libfreetype6-dev \
     libtiff5-dev \
     liblcms2-dev \
     libwebp-dev \
     zlib1g-dev \
     libyaml-0-2 \
  && apt-get autoremove \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/requirements.txt
RUN python -m venv /opt/legomac \
  && /opt/legomac/bin/pip install -r requirements.txt

COPY . /app

ENTRYPOINT ["/opt/legomac/bin/python"]
CMD ["run.py"]

What about the EPD driver? As it uses libfuse to represent the e-paper display as a virtual directory of files, the easiest was to install it on the host and to mount it as a volume inside the docker container.

Deployment

To install all that on the Pi, I wrote a small Ansible playbook.

  1. Configure the Pi as described in my previous post.

  2. Clone the playbook:

    $ git clone https://github.com/beenje/legomac.git
    $ cd legomac
    
  3. Create a file host_vars/legomac with your variables (assuming the hostname of the Pi is legomac):

    aiolegomac_hostname: myhost.example.com
    aiolegomac_username: john
    aiolegomac_password: mypassword
    aiolegomac_jwt_secret: secret
    traefik_letsencrypt_email: youremail@example.com
    traefik_letsencrypt_production: true
    
  4. Run the playbook:

    $ ansible-playbook -i hosts -k playbook.yml
    

This will install docker and the EPD driver, download the aiolegomac repository, build the 3 docker images and start everything.

Building the main application docker image on a Raspberry Pi Zero takes quite some time. So be patient :-) Just go and do something else.

When the full playbook is complete (it took about 55 minutes for me), you'll have a server with HTTPS support (thanks to Let's Encrypt) running on the Pi. It's displaying the clock every minute and you can send messages to it!

Client

HTTPie

To test the server you can of course use curl, but I really like HTTPie. It's much more user-friendly.

Let's try to access our new server:

$ http GET https://myhost.example.com
HTTP/1.1 401 Unauthorized
Content-Length: 25
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:22:42 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "error": "Unauthorized"
}

Good, we need to login:

$ http POST https://myhost.example.com/login username=john password=foo
HTTP/1.1 400 Bad Request
Content-Length: 32
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:18:39 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "error": "Invalid credentials"
}

Oops, wrong password:

$ http POST https://myhost.example.com/login username=john password='mypassword'
HTTP/1.1 200 OK
Content-Length: 134
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:21:14 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "token": "eyK0eXAiOiJRV5QiLCJhbGciOiJIUzI1NiJ9.eyJ1c3VyX2lkIjoxLCJleHAiOjE1MDB5MTIwOTh9.hECnj4u2mxvZ2r8IEC-db1T-eKTplM4kWJKZoHhtLxQ"
}

We got a token that we can use:

$ http GET https://myhost.example.com 'Authorization: eyK0eXAiOiJRV5QiLCJhbGciOiJIUzI1NiJ9.eyJ1c3VyX2lkIjoxLCJleHAiOjE1MDB5MTIwOTh9.hECnj4u2mxvZ2r8IEC-db1T-eKTplM4kWJKZoHhtLxQ'
HTTP/1.1 200 OK
Content-Length: 43
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:22:25 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "message": "Welcome to LegoMac john!"
}

Authentication is working, so we can send a message:

$ http POST https://myhost.example.com/messages message='Hello World!' 'Authorization: eyK0eXAiOiJRV5QiLCJhbGciOiJIUzI1NiJ9.eyJ1c3VyX2lkIjoxLCJleHAiOjE1MDB5MTIwOTh9.hECnj4u2mxvZ2r8IEC-db1T-eKTplM4kWJKZoHhtLxQ'
HTTP/1.1 201 Created
Content-Length: 27
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:23:46 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "message": "Hello World!"
}

Message sent! HTTPie is nice for testing, but we can make a small script to easily send messages from the command line.

requests

requests is of course the HTTP library to use in Python.

So let's write a small script to send messages to our server. We'll store the server URL and the username in a small YAML configuration file. If we don't have a token yet, or if the saved one is no longer valid, the script will retrieve one after prompting us for a password. The token is saved in the configuration file for later use.

The following script could be improved with some nicer error messages by catching exceptions. But it does the job:

import os
import click
import requests
import yaml


def get_config(filename):
    with open(filename) as f:
        config = yaml.safe_load(f)
    return config


def save_config(filename, config):
    with open(filename, 'w') as f:
        yaml.dump(config, f, default_flow_style=False)


def get_token(url, username):
    password = click.prompt('Password', hide_input=True)
    payload = {'username': username, 'password': password}
    r = requests.post(url + '/login', json=payload)
    r.raise_for_status()
    return r.json()['token']


def send_message(url, token, message):
    payload = {'message': message}
    headers = {'Authorization': token}
    r = requests.post(url + '/messages', json=payload, headers=headers)
    r.raise_for_status()


@click.command()
@click.option('--conf', '-c', default='~/.pylegomac.yml',
              help='Configuration file [default: "~/.pylegomac.yml"]')
@click.argument('message')
@click.version_option()
def pylegomac(message, conf):
    """Send message to aiolegomac server"""
    filename = os.path.expanduser(conf)
    config = get_config(filename)
    url = config['url']
    username = config['username']
    if 'token' in config:
        try:
            send_message(url, config['token'], message)
        except requests.exceptions.HTTPError as err:
            # Token no longer valid
            pass
        else:
            click.echo('Message sent')
            return
    token = get_token(url, username)
    send_message(url, token, message)
    config['token'] = token
    save_config(filename, config)


if __name__ == '__main__':
    pylegomac()

Let's first create a configuration file:

$ cat ~/.pylegomac.yml
url: https://myhost.example.com
username: john

Send a message:

$ python pylegomac.py 'Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.'
Password:
Message sent
/images/legomac/zen_of_python.jpg

Sending a new message won't request the password as the token was saved in the config file.

Conclusion

I have a nice little aiohttp server running on my Raspberry Pi that can receive and display messages. asyncio is quite pleasant to work with. I really like the async/await syntax.

All the code is on github:

  • aiolegomac (the server and client script)
  • legomac (the Ansible playbook to deploy the server)

Why did I only write a command line script to send messages and no web interface? Don't worry, that's planned! I could have used Jinja2. But I'd like to try a javascript framework. So that will be the subject of another post.

Running your application over HTTPS with traefik

I just read another very clear article from Miguel Grinberg about Running Your Flask Application Over HTTPS.

As the title suggests, it describes different ways to run a flask application over HTTPS. I have been using flask for quite some time, but I didn't even know about the ssl_context argument. You should definitely check out his article!
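
To give a rough idea of that ssl_context argument, here is a minimal sketch (Miguel's article covers this and the other options in detail):

from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    return 'Hello over HTTPS!'


if __name__ == '__main__':
    # 'adhoc' makes Werkzeug generate a throwaway self-signed certificate
    app.run(ssl_context='adhoc')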

Using nginx as a reverse proxy with a self-signed certificate or Let’s Encrypt are two options I have been using in the past.

If your app is available on the internet, you should definitely use Let's Encrypt. But if your app is only supposed to be used internally on a private network, a self-signed certificate is an option.

Traefik

I now often use docker to deploy my applications. I was looking for a way to automatically configure Let's Encrypt. I initially found nginx-proxy and docker-letsencrypt-nginx-proxy-companion. This was interesting but wasn't that straightforward to set up.

I then discovered traefik: "a modern HTTP reverse proxy and load balancer made to deploy microservices with ease". And that's really the case! I've used it to deploy several applications and I was impressed. It's written in Go, so it's a single binary. There is also a tiny docker image that makes it easy to deploy. It includes Let's Encrypt support (with automatic renewal), websocket support (no specific setup required)... and many other features.

Here is a traefik.toml configuration example:

defaultEntryPoints = ["http", "https"]

[web]
# Port for the status page
address = ":8080"

# Entrypoints, http and https
[entryPoints]
  # http should be redirected to https
  [entryPoints.http]
  address = ":80"
    [entryPoints.http.redirect]
    entryPoint = "https"
  # https is the default
  [entryPoints.https]
  address = ":443"
    [entryPoints.https.tls]

# Enable ACME (Let's Encrypt): automatic SSL
[acme]
# Email address used for registration
email = "test@traefik.io"
storageFile = "/etc/traefik/acme/acme.json"
entryPoint = "https"
onDemand = false
OnHostRule = true
  # Use a HTTP-01 acme challenge rather than TLS-SNI-01 challenge
  [acme.httpChallenge]
  entryPoint = "http"

# Enable Docker configuration backend
[docker]
endpoint = "unix:///var/run/docker.sock"
domain = "example.com"
watch = true
exposedbydefault = false

With this simple configuration, you get:

  • HTTP redirect on HTTPS
  • Let's Encrypt support
  • Docker backend support

UPDATE (2018-03-04): as mentioned by @jackminardi in the comments, Let's Encrypt disabled the TLS-SNI challenges for most new issuance. Traefik added support for the HTTP-01 challenge. I updated the above configuration to use this validation method: [acme.httpChallenge].

A simple example

I created a dummy example just to show how to run a flask application over HTTPS with traefik and Let's Encrypt. Note that traefik is made to dynamically discover backends. So you usually don't run it with your app in the same docker-compose.yml file. It usually runs separately. But to make it easier, I put both in the same file:

version: '2'
services:
  flask:
    build: ./flask
    image: flask
    command: uwsgi --http-socket 0.0.0.0:5000 --wsgi-file app.py --callable app
    labels:
      - "traefik.enable=true"
      - "traefik.backend=flask"
      - "traefik.frontend.rule=${TRAEFIK_FRONTEND_RULE}"
  traefik:
    image: traefik
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.toml:/etc/traefik/traefik.toml:ro
      - ./traefik/acme:/etc/traefik/acme
    ports:
     - "80:80"
     - "443:443"
     - "8080:8080"

Traefik requires access to the docker socket to listen for changes in the backends. It can thus automatically discover when you start and stop containers. You can override the default behaviour by using labels on your containers.

Supposing you own the myhost.example.com domain and have access to ports 80 and 443 (you can setup port forwarding if you run that on your machine behind a router at home), you can run:

$ git clone https://github.com/beenje/flask_traefik_letsencrypt.git
$ cd flask_traefik_letsencrypt
$ export TRAEFIK_FRONTEND_RULE=Host:myhost.example.com
$ docker-compose up

Voilà! Our flask app is available over HTTPS with a real SSL certificate!

/images/flask_traefik/hello_world.png

Traefik discovered the flask docker container and requested a certificate for our domain. All that automatically!

Traefik even comes with a nice dashboard:

/images/flask_traefik/traefik_dashboard.png

With this simple configuration, Qualys SSL Labs gave me an A rating :-)

/images/flask_traefik/traefik_ssl_report.png

Not as good as the A+ for Miguel's site, but not that bad! Especially considering there isn't any specific SSL setup.

A more realistic deployment

As I already mentioned, traefik is made to automatically discover backends (docker containers in my case). So you usually run it by itself.

Here is an example how it can be deployed using Ansible:

---
- name: create traefik directories
  file:
    path: /etc/traefik/acme
    state: directory
    owner: root
    group: root
    mode: 0755

- name: create traefik.toml
  template:
    src: traefik.toml.j2
    dest: /etc/traefik/traefik.toml
    owner: root
    group: root
    mode: 0644
  notify:
    - restart traefik

- name: create traefik network
  docker_network:
    name: "{{traefik_network}}"
    state: present

- name: launch traefik container with letsencrypt support
  docker_container:
    name: traefik_proxy
    image: "traefik:{{traefik_version}}"
    state: started
    restart_policy: always
    ports:
      - "80:80"
      - "443:443"
      - "{{traefik_dashboard_port}}:8080"
    volumes:
      - /etc/traefik/traefik.toml:/etc/traefik/traefik.toml:ro
      - /etc/traefik/acme:/etc/traefik/acme:rw
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # purge networks so that the container is only part of
    # {{traefik_network}} (and not the default bridge network)
    purge_networks: yes
    networks:
      - name: "{{traefik_network}}"

- name: force all notified handlers to run
  meta: flush_handlers

Nothing strange here. It's quite similar to what we had in our docker-compose.yml file. We created a specific traefik_network. Our docker containers will have to be on that same network.

Here is how we could deploy a flask application on the same server using another ansible role:

- name: launch flask container
  docker_container:
    name: flask
    image: flask
    command: uwsgi --http-socket 0.0.0.0:5000 --wsgi-file app.py --callable app
    state: started
    restart_policy: always
    purge_networks: yes
    networks:
      - name: "{{traefik_network}}"
    labels:
      traefik.enable: "true"
      traefik.backend: "flask"
      traefik.frontend.rule: "Host:myhost.example.com"
      traefik.port: "5000"

We make sure the container is on the same network as the traefik proxy. Note that the traefik.port label is only required if the container exposes multiple ports. It's thus not needed in our example.

That's basically it. As you can see, docker and Ansible make the deployment easy. And traefik takes care of the Let's Encrypt certificate.

Conclusion

Traefik comes with many other features and is well documented. You should check this Docker example that demonstrates load-balancing. Really cool.

If you use docker, you should really give traefik a try!

My LEGO Macintosh Classic with Raspberry Pi and e-paper display

At the beginning of April I read an inspiring blog post from Jannis Hermanns about a LEGO Macintosh Classic with an e-paper display. It was a really nice and cool article.

I had been playing with some Raspberry Pis before, but only with software. I had been wanting to fiddle with hardware for some time. This was the perfect opportunity!

LEGO Digital Designer

I decided to try to make my own LEGO Macintosh based on Jannis' work. His blog post is quite detailed and even includes a list of links to all the required components.

But I quickly realized there were no LEGO building instructions... I thus created my own using LEGO Digital Designer, which was fun. Looking at the pictures in Jannis' flickr album helped a lot. But getting an exact idea of the screen size wasn't easy on the computer. So I also built a small prototype of the front part to get a better idea. For that I had to wait for my e-paper display.

One modification I wanted to make was to use 1U-wide LEGO bricks on the side of the display to require less drilling. I also wanted to check if it was possible to use the button located on top of the display.

My .lxf file is on github.

/images/legomac/legomac_ldd.thumbnail.png

E-paper display

When I was about to order the 2.7 inch e-paper display from Embedded Artists, I noticed that Embedded Artists was located in Malmö, where I live :-).

I e-mailed them and was allowed to pick up my order at their office! A big thanks to them!

Raspberry Pi Zero W

The Raspberry Pi Zero W comes with WiFi, which is really nice. It does not come with a soldered GPIO header, though. I was starting to look at soldering irons when I discovered this GPIO Hammer Header:

/images/legomac/gpio_hammer_header.thumbnail.jpg

No soldering required! I used the installation jig and it was really easy to install. There is a nice video that explains how to proceed:

Connecting the display to the Pi

Based on Jannis' article, I initially thought it wasn't possible to use a ribbon cable (due to space), so I ordered some Jumper Wires. I connected the display to the Pi using the serial expansion connector as described in his blog post. It worked. With the demo from embeddedartists, I managed to display a nice cat picture :-)

/images/legomac/jumper_wires.thumbnail.jpg /images/legomac/cat.thumbnail.jpg

I then realized that the serial expansion connector didn't give access to the button on top of the display. That button could allow some interactions, like changing the mode, which would be nice. According to my prototype with 1U-wide bricks on the side, using a ribbon cable shouldn't actually be an issue. So I ordered a Downgrade GPIO Ribbon Cable for Raspberry Pi.

It required a little drilling on the right side for the cable to fit. But not that much. More is needed on the left side to center the screen. Carried away by my enthusiasm, I actually cut a bit too much on the left side (using the dremel was fun :-).

/images/legomac/drilling_left.thumbnail.jpg /images/legomac/drilling_right.thumbnail.jpg

Everything fitted nicely in the lego case:

/images/legomac/ribbon_cable.thumbnail.jpg

Button on top

With the ribbon cable, the button on top of the display is connected to pin 15 on the Raspberry Pi (BCM GPIO22). The ImageDemoButton.py part of the demo shows an example of how to use the button to change the displayed image.

Using my small prototype, I planned a small hole on top of the case. I thought I'd have to fill the brick with something hard to press the button. The 1x1 brick ended up fitting perfectly. As shown in the picture below, the side is exactly on top of the button. I added a little piece of foam inside the brick to keep it straight.

/images/legomac/button_front.thumbnail.jpg

Of course I move away from the Macintosh Classic design here... but practicality beats purity :-)

Pi configuration

Jannis' article made me discover resin.io, which is a really interesting project. I did a few tests on a Raspberry Pi 3 and it was a nice experience. But when I received my Pi Zero W, it wasn't supported by resinOS yet... This isn't the case anymore! Version 2.0.3 added support for the wifi chip.

Anyway, as Jannis already wrote about resinOS, I'll describe my tests with Raspbian. To flash the SD card, I recommend Etcher, which is an open source project also by resin.io. I'm more of a command line guy and have used dd many times, but I was pleasantly surprised. It's easy to install and use.

  1. Download and install Etcher
  2. Download the Raspbian Stretch Lite image
  3. Flash the SD card using Etcher
  4. Mount the SD card to configure it:
# Go to the boot partition
# This is an example on OSX (mount point will be different on Linux)
$ cd /Volumes/boot

# To enable ssh, create a file named ssh onto the boot partition
$ touch ssh

# Create the file wpa_supplicant.conf with your wifi settings
# Note that for Raspbian Stretch, you need the first line
# (ctrl_interface...)! This was not the case for Jessie.
$  cat << EOF > wpa_supplicant.conf
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
network={
    ssid="MyWifiNetwork"
    psk="password"
    key_mgmt=WPA-PSK
}
EOF

# Uncomment dtparam=spi=on to enable the SPI master driver
$ vi config.txt

# Leave the boot partition
$ cd
  5. Unmount the SD card and put it in the Raspberry Pi
  6. Boot the Pi

I wrote a small Ansible playbook to install the E-ink driver and the clock demo:

- name: install required dependencies
  apt:
    name: "{{item}}"
    state: present
    update_cache: yes
  with_items:
    - git
    - libfuse-dev
    - fonts-liberation
    - python-pil

- name: check if the epd-fuse service exists
  command: systemctl status epd-fuse.service
  check_mode: no
  failed_when: False
  changed_when: False
  register: epd_fuse_service

- name: clone the embeddedartists gratis repository
  git:
    repo: https://github.com/embeddedartists/gratis.git
    dest: /home/pi/gratis

- name: build the EPD driver and install the epd-fuse service
  shell: >
    COG_VERSION=V2 make rpi-epd_fuse &&
    COG_VERSION=V2 make rpi-install
  args:
    chdir: /home/pi/gratis/PlatformWithOS
  when: epd_fuse_service.rc != 0

- name: ensure the epd-fuse service is enabled and started
  service:
    name: epd-fuse
    state: started
    enabled: yes

- name: install the epd-clock service
  copy:
    src: epd-clock.service
    dest: /etc/systemd/system/epd-clock.service
    owner: root
    group: root
    mode: 0644

- name: start and enable epd-clock service
  systemd:
    name: epd-clock.service
    daemon_reload: yes
    state: started
    enabled: yes

To run the playbook, clone the repository https://github.com/beenje/legomac:

$ git clone https://github.com/beenje/legomac.git
$ cd legomac
$ ansible-playbook -i hosts -k epd-demo.yml

That's it!

Of course don't forget to change the default password on your Pi.

One more thing

There isn't much Python in this article but the Pi is running some Python code. I couldn't resist putting a Talk Python To Me sticker on the back :-) It's really a great podcast and you should definitely give it a try if you haven't yet. Thanks again to @mkennedy for the stickers!

/images/legomac/talkpythontome.thumbnail.jpg

Below are a few pictures. You can see more on flickr.

Dockerfile anti-patterns and best practices

I've been using Docker for some time now. There is already a lot of documentation available online but I recently saw the same "anti-patterns" several times, so I thought it was worth writing a post about it.

I won't repeat all the Best practices for writing Dockerfiles here. You should definitely read that page.

I want to emphasize some things that took me some time to understand.

Avoid invalidating the cache

Let's take a simple example with a Python application:

FROM python:3.6

COPY . /app
WORKDIR /app

RUN pip install -r requirements.txt

ENTRYPOINT ["python"]
CMD ["ap.py"]

It's actually an example I have seen several times online. This looks fine, right?

The problem is that the COPY . /app command will invalidate the cache as soon as any file in the current directory is updated. Let's say you just change the README file and run docker build again. Docker will have to re-install all the requirements because the RUN pip command is run after the COPY that invalidated the cache.

The requirements should only be re-installed if the requirements.txt file changes:

FROM python:3.6

WORKDIR /app

COPY requirements.txt /app/requirements.txt
RUN pip install -r requirements.txt

COPY . /app

ENTRYPOINT ["python"]
CMD ["ap.py"]

With this Dockerfile, the RUN pip command will only be re-run when the requirements.txt file changes. It will use the cache otherwise.

This is much more efficient and will save you quite some time if you have many requirements to install.

Minimize the number of layers

What does that really mean?

Each Docker image references a list of read-only layers that represent filesystem differences. Every command in your Dockerfile will create a new layer.

Let's use the following Dockerfile:

FROM centos:7

RUN yum update -y
RUN yum install -y sudo
RUN yum install -y git
RUN yum clean all

Build the docker image and check the layers created with the docker history command:

$ docker build -t centos-test .
...
$ docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED              SIZE
centos-test                      latest              1fae366a2613        About a minute ago   470 MB
centos                           7                   98d35105a391        24 hours ago         193 MB
$ docker history centos-test
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
1fae366a2613        2 minutes ago       /bin/sh -c yum clean all                        1.67 MB
999e7c7c0e14        2 minutes ago       /bin/sh -c yum install -y git                   133 MB
c97b66528792        3 minutes ago       /bin/sh -c yum install -y sudo                  81 MB
e0c7b450b7a8        3 minutes ago       /bin/sh -c yum update -y                        62.5 MB
98d35105a391        24 hours ago        /bin/sh -c #(nop)  CMD ["/bin/bash"]            0 B
<missing>           24 hours ago        /bin/sh -c #(nop)  LABEL name=CentOS Base ...   0 B
<missing>           24 hours ago        /bin/sh -c #(nop) ADD file:29f66b8b4bafd0f...   193 MB
<missing>           6 months ago        /bin/sh -c #(nop)  MAINTAINER https://gith...   0 B

There are two problems with this Dockerfile:

  1. We added too many layers for nothing.
  2. The yum clean all command is meant to reduce the size of the image but it actually does the opposite by adding a new layer!

Let's check that by removing the last command and running the build again:

FROM centos:7

RUN yum update -y
RUN yum install -y sudo
RUN yum install -y git
# RUN yum clean all

$ docker build -t centos-test .
...
$ docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             SIZE
centos-test                      latest              999e7c7c0e14        11 minutes ago      469 MB
centos                           7                   98d35105a391        24 hours ago        193 MB

The new image without the yum clean all command is indeed smaller than the previous image (1.67 MB smaller)!

If you want to remove files, it's important to do that in the same RUN command that created those files. Otherwise there is no point.

Here is the proper way to do it:

FROM centos:7

RUN yum update -y \
  && yum install -y \
  sudo \
  git \
  && yum clean all

Let's build this new image:

$ docker build -t centos-test .
...
$ docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             SIZE
centos-test                      latest              54a328ef7efd        21 seconds ago      265 MB
centos                           7                   98d35105a391        24 hours ago        193 MB
$ docker history centos-test
IMAGE               CREATED              CREATED BY                                      SIZE                COMMENT
54a328ef7efd        About a minute ago   /bin/sh -c yum update -y   && yum install ...   72.8 MB
98d35105a391        24 hours ago         /bin/sh -c #(nop)  CMD ["/bin/bash"]            0 B
<missing>           24 hours ago         /bin/sh -c #(nop)  LABEL name=CentOS Base ...   0 B
<missing>           24 hours ago         /bin/sh -c #(nop) ADD file:29f66b8b4bafd0f...   193 MB
<missing>           6 months ago         /bin/sh -c #(nop)  MAINTAINER https://gith...   0 B

The new image is only 265 MB compared to the 470 MB of the original image. There isn't much more to say :-)

If you want to know more about images and layers, you should read the documentation: Understand images, containers, and storage drivers.

Conclusion

Avoid invalidating the cache:

  • start your Dockerfile with commands that should not change often
  • put commands that can often invalidate the cache (like COPY .) as late as possible
  • only add the needed files (use a .dockerignore file)

Minimize the number of layers:

  • put related commands in the same RUN instruction
  • remove files in the same RUN command that created them

Control your accessories from Home Assistant with Siri and HomeKit

While reading more about Home Assistant, I discovered it was possible to control your accessories from Home Assistant with Siri and HomeKit. I decided to give that a try.

This requires installing Homebridge and the homebridge-homeassistant plugin.

Install Homebridge

Homebridge is a lightweight NodeJS server that emulates the iOS HomeKit API. Let's install it in the same LXC container as Home Assistant:

root@turris:~# lxc-attach -n homeassistant

I followed the Running HomeBridge on a Raspberry Pi page.

We need curl and git:

root@homeassistant:~# apt-get install -y curl git

Install Node:

root@homeassistant:~# curl -sL https://deb.nodesource.com/setup_6.x | bash -
## Installing the NodeSource Node.js v6.x repo...

## Populating apt-get cache...

root@homeassistant:~# apt-get install -y nodejs

Install avahi and other dependencies:

root@homeassistant:~# apt-get install -y libavahi-compat-libdnssd-dev

Install Homebridge and its dependencies, still following this page. Note that I had a strange problem here: the npm command didn't produce any output. I found the same issue on stackoverflow and even an issue on github. The workaround is just to open a new terminal...

root@homeassistant:~# npm install -g --unsafe-perm homebridge hap-nodejs node-gyp
root@homeassistant:~# cd /usr/lib/node_modules/homebridge/
root@homeassistant:/usr/lib/node_modules/homebridge# npm install --unsafe-perm bignum
root@homeassistant:/usr/lib/node_modules/homebridge# cd ../hap-nodejs/node_modules/mdns/
root@homeassistant:/usr/lib/node_modules/hap-nodejs/node_modules/mdns# node-gyp BUILDTYPE=Release rebuild

Install and configure homebridge-homeassistant plugin

root@homeassistant:/usr/lib/node_modules/hap-nodejs/node_modules/mdns# cd
root@homeassistant:~# npm install -g --unsafe-perm homebridge-homeassistant

Try to start Homebridge:

root@homeassistant:~# su -s /bin/bash homeassistant
homeassistant@homeassistant:~$ homebridge

Homebridge won't do anything until you've created a configuration file. So press CTRL-C and create the file ~/.homebridge/config.json:

homeassistant@homeassistant:~$ cat <<EOF >> ~/.homebridge/config.json
{
  "bridge": {
    "name": "Homebridge",
    "username": "CC:22:3D:E3:CE:30",
    "port": 51826,
    "pin": "031-45-154"
  },

  "platforms": [
    {
      "platform": "HomeAssistant",
      "name": "HomeAssistant",
      "host": "http://localhost:8123",
      "logging": false
    }
 ]
}
EOF

Note that you can change the username and pin code. You will need the PIN code to add the Homebridge accessory to HomeKit.

Check the Home Assistant plugin page for more information on how to configure the plugin.

Automatically start Homebridge

Let's configure systemd. Create the file /etc/systemd/system/home-assistant@homebridge.service:

root@homeassistant:~# cat <<EOF >> /etc/systemd/system/home-assistant@homebridge.service
[Unit]
Description=Node.js HomeKit Server
After=syslog.target network-online.target

[Service]
Type=simple
User=homeassistant
ExecStart=/usr/bin/homebridge -U /home/homeassistant/.homebridge
Restart=on-failure
RestartSec=10
KillMode=process

[Install]
WantedBy=multi-user.target
EOF

Enable and launch Homebridge:

root@homeassistant:~# systemctl --system daemon-reload
root@homeassistant:~# systemctl enable home-assistant@homebridge
Created symlink from /etc/systemd/system/multi-user.target.wants/home-assistant@homebridge.service to /etc/systemd/system/home-assistant@homebridge.service.
root@homeassistant:~# systemctl start home-assistant@homebridge

Adding Homebridge to iOS

Homebridge and the Home Assistant plugin are now running. Using the Home app on your iOS device, you should be able to add the accessory "Homebridge". See Homebridge README for more information. You will need to enter the PIN code defined in your config.json file.

You should then see the Homebridge bridge on your device:

/images/homebridge.png

And it will automatically add all the accessories defined in Home Assistant!

/images/home_accessories.png

You can now even use Siri to control your devices, like turning the TV VPN ON or OFF.

/images/siri_tv_vpn_off.png

Note that I renamed the original switch to make it easier to pronounce. As described in the README, avoid names usually used by Siri like "Radio" or "Sonos".

That's it! Homebridge is really a nice addition to Home Assistant if you have some iOS devices at home.

Docker and conda

I just read a blog post about Using Docker with Conda Environments. I do things slightly differently, so I thought I would share an example of a Dockerfile I use:

FROM continuumio/miniconda3:latest

# Install extra packages if required
RUN apt-get update && apt-get install -y \
    xxxxxx \
    && rm -rf /var/lib/apt/lists/*

# Add the user that will run the app (no need to run as root)
RUN groupadd -r myuser && useradd -r -g myuser myuser

WORKDIR /app

# Install myapp requirements
COPY environment.yml /app/environment.yml
RUN conda config --add channels conda-forge \
    && conda env create -n myapp -f environment.yml \
    && rm -rf /opt/conda/pkgs/*

# Install myapp
COPY . /app/
RUN chown -R myuser:myuser /app/*

# activate the myapp environment
ENV PATH /opt/conda/envs/myapp/bin:$PATH

I don't run source activate myapp but just use ENV to update the PATH variable. There is only one environment in the docker image. No need for the extra checks done by the activate script.

With this Dockerfile, any command will be run in the myapp environment.

Just a few additional notes:

  1. Be sure to copy only the environment.yml file before copying the full current directory. Otherwise any change in the directory would invalidate the docker cache. We only want to re-create the conda environment if environment.yml changes.
  2. I always add the conda-forge channel. Check this post if you haven't heard of it yet.
  3. I clean some cache (/var/lib/apt/lists/ and /opt/conda/pkgs/) to make the image a bit smaller.

I switched from virtualenv to conda a while ago and I really enjoy it. A big thanks to Continuum Analytics!

Home Assistant on Turris Omnia via LXC container

In a previous post, I described how to install OpenVPN client on a Turris Omnia router. To start or stop the client, I was using the command line and mentioned the LuCi Web User Interface.

Neither way is super easy or fast to access. A while ago, I wrote a small Flask web application to change some settings in my router. The application just allowed clicking on a button to run a script via ssh on the router.

So I could write a small webapp to do just that. But I recently read about Home Assistant. It's an open-source home automation platform to track and control your devices at home. There are many components available, including the Command Line Switch, which looks like exactly what I need.

The Raspberry Pi is a popular device to install Home Assistant. But my Turris Omnia is quite powerful for a router with 1 GB of RAM and 8 GB of flash. It's time to use some of that power.

From what I read, there is an OpenWrt package of Home Assistant, but I couldn't find it in the Turris Omnia available packages. Anyway, there is another feature I wanted to try: LXC containers. Home Assistant is a Python application, so it's easy to install in a Linux container, which also makes it easy to keep the version up-to-date.

So let's start!

Create a LXC container

As described here, you can create a LXC container via the LuCI web interface or via the command line:

root@turris:~# lxc-create -t download -n homeassistant
Setting up the GPG keyring
Downloading the image index
WARNING: Failed to download the file over HTTPs.
         The file was instead download over HTTP. A server replay attack may be possible!

 ---
 DIST  RELEASE  ARCH  VARIANT  BUILD
 ---
 Turris_OS  stable  armv7l  default  2017-01-22
 Turris_OS  stable  ppc  default  2017-01-22
 Alpine  3.4  armv7l  default  2017-01-22
 Debian  Jessie  armv7l  default  2017-01-22
 Gentoo  stable  armv7l  default  2017-01-22
 openSUSE  13.2  armv7l  default  2017-01-22
 openSUSE  42.2  armv7l  default  2017-01-22
 openSUSE  Tumbleweed  armv7l  default  2017-01-22
 Ubuntu  Xenial  armv7l  default  2017-01-22
 Ubuntu  Yakkety  armv7l  default  2017-01-22
 ---

 Distribution: Debian
 Release: Jessie
 Architecture: armv7l

 Flushing the cache...
 Downloading the image index
 Downloading the rootfs
 Downloading the metadata
 The image cache is now ready
 Unpacking the rootfs

 ---
 Distribution Debian version Jessie was just installed into your
 container.

 Content of the tarballs is provided by third party, thus there is
 no warranty of any kind.

As you can see above, I chose a Debian Jessie distribution.

Let's start and enter the container:

root@turris:~# lxc-start -n homeassistant
root@turris:~# lxc-attach -n homeassistant

Now that we are inside the container, we can first set the root password:

root@LXC_NAME:~# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

LXC_NAME is not a super nice hostname. Let's update it:

root@LXC_NAME:~# hostnamectl set-hostname homeassistant
Failed to create bus connection: No such file or directory

Ok... We have to install dbus. While we are at it, let's install vim because we'll need it to edit the homeassistant configuration:

root@LXC_NAME:~# apt-get update
root@LXC_NAME:~# apt-get upgrade
root@LXC_NAME:~# apt-get install -y dbus vim

Setting the hostname now works properly:

root@LXC_NAME:~# hostnamectl set-hostname homeassistant

We can exit and enter the container again to see the change:

root@LXC_NAME:~# exit
root@turris:~# lxc-attach -n homeassistant
root@homeassistant:~#

Install Home Assistant

Next, we just have to follow the Home Assistant installation instructions. They are well detailed. I'll just quickly repeat them here to make it easier to follow but you should refer to the official page for any update:

root@homeassistant:~# apt-get install python-pip python3-dev
root@homeassistant:~# pip install --upgrade virtualenv
root@homeassistant:~# adduser --system homeassistant
root@homeassistant:~# mkdir /srv/homeassistant
root@homeassistant:~# chown homeassistant /srv/homeassistant
root@homeassistant:~# su -s /bin/bash homeassistant
homeassistant@homeassistant:/root$ virtualenv -p python3 /srv/homeassistant
homeassistant@homeassistant:/root$ source /srv/homeassistant/bin/activate
(homeassistant) homeassistant@homeassistant:/root$ pip3 install --upgrade homeassistant

Just run hass to start the application and create the default configuration:

(homeassistant) homeassistant@homeassistant:/root$ hass

Press CTRL-C to exit. Check the created configuration file: /home/homeassistant/.homeassistant/configuration.yaml.

You can comment out the introduction: line:

# Show links to resources in log and frontend
#introduction:

Add a switch to Home Assistant

To start and stop our VPN we define a Command Line Switch that triggers the openvpn script on the router. Add the following at the end of the file:

switch:
  platform: command_line
  switches:
        atv_vpn:
          command_on: 'ssh root@<router IP> "/etc/init.d/openvpn start"'
          command_off: 'ssh root@<router IP> "/etc/init.d/openvpn stop"'
          friendly_name: ATV4 VPN

The LXC container is just like another computer (a virtual one) on the local network. To access the router, we have to ssh to it. For this to work without requesting a password, we have to generate an ssh key and add the public key to the authorized_keys file on the router:

homeassistant@homeassistant:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/homeassistant/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/homeassistant/.ssh/id_rsa.
Your public key has been saved in /home/homeassistant/.ssh/id_rsa.pub.

Copy the content of /home/homeassistant/.ssh/id_rsa.pub to /root/.ssh/authorized_keys (on the router not inside the container).

With this configuration, the switch will always be off when you restart Home Assistant. It also won't know if you change the state using the command line or the LuCI web interface. This can be solved by adding the optional command_state line. The command shall return exit code 0 if the switch is on. The openvpn init script on the Turris Omnia doesn't take "status" as an argument, but an easy way to check if openvpn is running is to use pgrep. Our new configuration becomes:

switch:
  platform: command_line
  switches:
        atv_vpn:
          command_on: 'ssh root@<router IP> "/etc/init.d/openvpn start"'
          command_off: 'ssh root@<router IP> "/etc/init.d/openvpn stop"'
          command_state: 'ssh root@<router IP> "pgrep /usr/sbin/openvpn"'
          friendly_name: ATV4 VPN

That's it. The switch state will now properly be updated even if the VPN is started or stopped without using the application.

If you go to http://<container IP>:8123, you should see something like this:

/images/hass_home.png

Automatically start Home Assistant

Let's configure systemd to automatically start the application. Create the file /etc/systemd/system/home-assistant@homeassistant.service:

root@homeassistant:~# cat <<EOF >> /etc/systemd/system/home-assistant@homeassistant.service
[Unit]
Description=Home Assistant
After=network.target

[Service]
Type=simple
User=homeassistant
ExecStart=/srv/homeassistant/bin/hass -c "/home/homeassistant/.homeassistant"

[Install]
WantedBy=multi-user.target
EOF

Enable and launch Home Assistant:

root@homeassistant:~# systemctl --system daemon-reload
root@homeassistant:~# systemctl enable home-assistant@homeassistant
Created symlink from /etc/systemd/system/multi-user.target.wants/home-assistant@homeassistant.service to /etc/systemd/system/home-assistant@homeassistant.service.
root@homeassistant:~# systemctl start home-assistant@homeassistant

You can check the logs with:

root@homeassistant:~# journalctl -f -u home-assistant@homeassistant

We just have to make sure the container starts automatically when we reboot the router. Set the following in /etc/config/lxc-auto:

root@turris:~# cat /etc/config/lxc-auto
config container
  option name homeassistant
  option timeout 60

Make it easy to access Home Assistant

There is one more thing we want to do: assign a fixed IP to the container. This can be done as for any machine on the LAN via the DHCP and DNS settings in the LuCI interface. In Static Leases, assign a fixed IP to the container's MAC address.

Now that the container has a fixed IP, go to http://<container IP>:8123 and create a bookmark or add an icon to your phone and tablet home screen. This makes it easy for anyone at home to turn the VPN on and off!

/images/hass_icon.png