Experimenting with asyncio on a Raspberry Pi

In a previous post, I described how I built a LEGO Macintosh Classic with a Raspberry Pi and e-paper display.

For testing purposes, I installed the clock demo, which is part of the Embedded Artists repository. Of course I wanted to do more than display the time on this little box. I also wanted to take advantage of the button I had integrated.

One idea was to create a small web server so that I could receive and display messages. The application would basically:

  • display the time (every minute)
  • when receiving a message, stop the clock and display the message
  • when the button is pressed, start the clock again
/images/legomac/press_button.gif

I don't know about you, but this really makes me think event loop! I learnt asynchronous programming from Dave Peticolas' Twisted Introduction a few years ago. If you are not familiar with asynchronous programming, I really recommend it. I wrote a few applications using Twisted, but I haven't had the opportunity to use asyncio yet. Here is a very good occasion!

asyncio

REST API using aiohttp

There are already several asyncio web frameworks to build an HTTP server. I decided to go with aiohttp which is kind of the default one.

Using this tutorial, I wrote a simple REST API using aiohttp. It uses JSON Web Tokens, which is something else I had been wanting to try.

The API has only 3 endpoints:

  • / to check that our token is valid
  • /login to log in
  • /messages to post messages

def setup_routes(app):
    app.router.add_get('/', index)
    app.router.add_post('/login', login)
    app.router.add_post('/messages', post_message)

Here are the corresponding handlers:
async def login(request):
    config = request.app['config']
    data = await request.json()
    try:
        user = data['username']
        passwd = data['password']
    except KeyError:
        return web.HTTPBadRequest(reason='Invalid arguments')
    # We have only one user hard-coded in the config file...
    if user != config['username'] or passwd != config['password']:
        return web.HTTPBadRequest(reason='Invalid credentials')
    payload = {
        'user_id': 1,
        'exp': datetime.datetime.utcnow() + datetime.timedelta(seconds=config['jwt_exp_delta_seconds'])
    }
    jwt_token = jwt.encode(payload, config['jwt_secret'], config['jwt_algorithm'])
    logger.debug(f'JWT token created for {user}')
    return web.json_response({'token': jwt_token.decode('utf-8')})


@login_required
async def post_message(request):
    if request.content_type != 'application/json':
        return web.HTTPBadRequest()
    data = await request.json()
    try:
        message = data['message']
    except KeyError:
        return web.HTTPBadRequest()
    logger.debug(f'Message received from {request.user}: {message}')
    return web.json_response({'message': message}, status=201)


@login_required
async def index(request):
    return web.json_response({'message': 'Welcome to LegoMac {}!'.format(request.user)})
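
The @login_required decorator is not shown here. A minimal sketch of what it might look like, assuming PyJWT and the same config keys as in the login handler (it sets request.user the way the handlers above expect):

import functools

import jwt
from aiohttp import web


def login_required(func):
    """Reject the request with a 401 if the JWT token is missing or invalid."""
    @functools.wraps(func)
    async def wrapper(request):
        config = request.app['config']
        token = request.headers.get('Authorization')
        if token is None:
            return web.json_response({'error': 'Unauthorized'}, status=401)
        try:
            jwt.decode(token, config['jwt_secret'],
                       algorithms=[config['jwt_algorithm']])
        except jwt.InvalidTokenError:
            return web.json_response({'error': 'Unauthorized'}, status=401)
        # there is only one hard-coded user, so just remember who is logged in
        request.user = config['username']
        return await func(request)
    return wrapper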

Raspberry Pi GPIO and asyncio

The default Python package to control the Raspberry Pi GPIO seems to be RPi.GPIO. That's at least what is used in the ImageDemoButton.py from Embedded Artists.

An alternative is the pigpio library, which provides a daemon to access the Raspberry Pi GPIO via a pipe or socket interface. And someone (Pierre Rust) already created an asyncio-based Python client for the pigpio daemon: apigpio.

Exactly what I needed! It's basically an (incomplete) port of the original Python client provided with pigpio, but more than sufficient for my needs. I just want to get a notification when pressing the button on top of the screen.

There is an example showing how to achieve that: gpio_notification.py.
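
For reference, here is a minimal sketch of that pattern (based on the apigpio calls used later in this post; the pigpiod address and GPIO number are placeholders for illustration):

import asyncio

import apigpio

BUTTON_GPIO = 22  # BCM numbering, adjust to your wiring


def on_input(gpio, level, tick):
    # called by apigpio when the GPIO changes state
    print('GPIO {} changed to level {} at tick {}'.format(gpio, level, tick))


async def subscribe(pi, address):
    await pi.connect(address)
    await pi.set_mode(BUTTON_GPIO, apigpio.INPUT)
    await pi.add_callback(BUTTON_GPIO, edge=apigpio.RISING_EDGE, func=on_input)


loop = asyncio.get_event_loop()
pi = apigpio.Pi(loop)
loop.run_until_complete(subscribe(pi, ('localhost', 8888)))
loop.run_forever()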

E-paper display and asyncio

The last remaining piece is to make the e-paper display play nicely with asyncio.

The EPD driver uses the fuse library. It allows the display to be represented as a virtual directory of files. So sending a command consists of writing to a file.

There is a library that adds file support to asyncio: aiofiles. Basically, the only thing I had to do was wrap the file IO in EPD.py with aiofiles:

async def _command(self, c):
    async with aiofiles.open(os.path.join(self._epd_path, 'command'), 'wb') as f:
        await f.write(c)

You can't use await in a class __init__ method. So, following some recommendations from stackoverflow, I used the factory pattern and moved the actions requiring IO to a classmethod:

@classmethod
async def create(cls, *args, **kwargs):
    self = EPD(*args, **kwargs)
    async with aiofiles.open(os.path.join(self._epd_path, 'version')) as f:
        version = await f.readline()
        self._version = version.rstrip('\n')
    async with aiofiles.open(os.path.join(self._epd_path, 'panel')) as f:
        line = await f.readline()
        m = self.PANEL_RE.match(line.rstrip('\n'))
        if m is None:
            raise EPDError('invalid panel string')
        ...

To create an instance of the EPD class, use:

epd = await EPD.create([path='/path/to/epd'], [auto=boolean])

Putting everything together with aiohttp

Running the clock as a background task

For the clock, I adapted the clock demo from Embedded Artists repository.

As described in the aiohttp documentation, I created a background task to display the clock every minute:

async def display_clock(app):
    """Background task to display clock every minute"""
    clock = Clock(app['epd'])
    first_start = True
    try:
        while True:
            while True:
                now = datetime.datetime.today()
                if now.second == 0 or first_start:
                    first_start = False
                    break
                await asyncio.sleep(0.5)
            logger.debug('display clock')
            await clock.display(now)
    except asyncio.CancelledError:
        logger.debug('display clock cancel')


async def start_background_tasks(app):
     app['epd'] = await EPD.create(auto=True)
     app['clock'] = app.loop.create_task(display_clock(app))


async def cleanup_background_tasks(app):
    app['clock'].cancel()
    await app['clock']


def init_app():
    """Create and return the aiohttp Application object"""
    app = web.Application()
    app.on_startup.append(start_background_tasks)
    app.on_cleanup.append(cleanup_background_tasks)
    ...

Stop the clock and display a message

When receiving a message, I first cancel the clock background task, then send the message to the e-paper display using ensure_future, so that I can return a JSON response without waiting for the message to be displayed (it takes about 5 seconds):

@login_required
async def post_message(request):
    if request.content_type != 'application/json':
        return web.HTTPBadRequest()
    data = await request.json()
    try:
        message = data['message']
    except KeyError:
        return web.HTTPBadRequest()
    # cancel the display clock
    request.app['clock'].cancel()
    logger.debug(f'Message received from {request.user}: {message}')
    now = datetime.datetime.now(request.app['timezone'])
    helpers.ensure_future(request.app['epd'].display_message(message, request.user, now))
    return web.json_response({'message': message}, status=201)

Start the clock when pressing the button

To be able to restart the clock when pressing the button, I connect to pigpiod when starting the app (in start_background_tasks) and register the on_input callback:

async def start_background_tasks(app):
    app['pi'] = apigpio.Pi(app.loop)
    address = (app['config']['pigpiod_host'], app['config']['pigpiod_port'])
    await app['pi'].connect(address)
    await app['pi'].set_mode(BUTTON_GPIO, apigpio.INPUT)
    app['cb'] = await app['pi'].add_callback(
            BUTTON_GPIO,
            edge=apigpio.RISING_EDGE,
            func=functools.partial(on_input, app))
    ...

In the on_input callback, I re-create the clock background task but only if the previous task is done:

def on_input(app, gpio, level, tick):
    """Callback called when pressing the button on the e-paper display"""
    logger.info('on_input {} {} {}'.format(gpio, level, tick))
    if app['clock'].done():
        logger.info('restart clock')
        app['clock'] = app.loop.create_task(display_clock(app))

Running on the Pi

You might have noticed that I used some syntax that is Python 3.6 only. I don't really see myself using something else when starting a new project today :-) There are so many new things (like f-strings) that make your programs look cleaner.

On Raspbian, if you install Python 3, you get 3.4... So how do you get Python 3.6 on a Raspberry Pi?

On desktop/server, I usually use conda. It makes it so easy to install the Python version you want and many dependencies. There is no official installer for the armv6 architecture, but I found berryconda, which is a conda-based distribution for the Raspberry Pi! Really nice!

Another alternative is to use docker. There are official arm32v6 images based on alpine and some from resin.io.

I could have gone with berryconda, but there was one more thing I needed: I'll have to open the HTTP server to the outside world, meaning I need HTTPS. As mentioned in another post, traefik makes that very easy if you use docker. So that's what I chose.

I created 3 containers:

  • traefik
  • pigpiod
  • aiolegomac

traefik

There are no official Traefik docker images for ARM yet, but an issue is currently open. So they should arrive soon!

In the meantime I created my own:

FROM arm32v6/alpine:3.6

RUN apk --update upgrade \
  && apk --no-cache --no-progress add ca-certificates \
  && apk add openssl \
  && rm -rf /var/cache/apk/*

RUN wget -O /usr/local/bin/traefik https://github.com/containous/traefik/releases/download/v1.3.3/traefik_linux-arm \
  && chmod a+x /usr/local/bin/traefik

ENTRYPOINT ["/usr/local/bin/traefik"]

pigpiod

For pigpiod, I first created an image based on arm32v6/alpine but I noticed I couldn't send a SIGTERM to the daemon to stop it properly... I'm not sure why. Alpine being based on musl instead of glibc might be the problem. Here is the Dockerfile I tried:

FROM arm32v6/alpine:3.6

RUN apk add --no-cache --virtual .build-deps \
  gcc \
  make \
  musl-dev \
  tar \
  && wget -O /tmp/pigpio.tar abyz.co.uk/rpi/pigpio/pigpio.tar \
  && tar -xf /tmp/pigpio.tar -C /tmp \
  && sed -i "/ldconfig/d" /tmp/PIGPIO/Makefile \
  && make -C /tmp/PIGPIO \
  && make -C /tmp/PIGPIO install \
  && rm -rf /tmp/PIGPIO /tmp/pigpio.tar \
  && apk del .build-deps

EXPOSE 8888

ENTRYPOINT ["/usr/local/bin/pigpiod", "-g"]

I even tried using tini as the entrypoint, without luck. So if someone has the explanation, please share it in the comments.

I tried with the resin/rpi-raspbian image and got it working properly right away:

FROM resin/rpi-raspbian:jessie

RUN apt-get update \
  && apt-get install -y \
     make \
     gcc \
     libc6-dev \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

RUN curl -o /tmp/pigpio.tar abyz.co.uk/rpi/pigpio/pigpio.tar \
  && tar -xf /tmp/pigpio.tar -C /tmp \
  && make -C /tmp/PIGPIO \
  && make -C /tmp/PIGPIO install \
  && rm -rf /tmp/pigpio.tar /tmp/PIGPIO

EXPOSE 8888

ENTRYPOINT ["/usr/local/bin/pigpiod", "-g"]

Note that the container has to run in privileged mode to access the GPIO.

aiolegomac

For the main application, the Dockerfile is quite standard for a Python application:

FROM resin/raspberry-pi-python:3.6

RUN apt-get update \
  && apt-get install -y \
     fonts-liberation \
     fonts-dejavu  \
     libjpeg-dev \
     libfreetype6-dev \
     libtiff5-dev \
     liblcms2-dev \
     libwebp-dev \
     zlib1g-dev \
     libyaml-0-2 \
  && apt-get autoremove \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/requirements.txt
RUN python -m venv /opt/legomac \
  && /opt/legomac/bin/pip install -r requirements.txt

COPY . /app

ENTRYPOINT ["/opt/legomac/bin/python"]
CMD ["run.py"]

What about the EPD driver? As it uses libfuse to represent the e-paper display as a virtual directory of files, the easiest solution was to install it on the host and mount it as a volume inside the docker container.

Deployment

To install all that on the Pi, I wrote a small Ansible playbook.

  1. Configure the Pi as described in my previous post.

  2. Clone the playbook:

    $ git clone https://github.com/beenje/legomac.git
    $ cd legomac
    
  3. Create a file host_vars/legomac with your variables (assuming the hostname of the Pi is legomac):

    aiolegomac_hostname: myhost.example.com
    aiolegomac_username: john
    aiolegomac_password: mypassword
    aiolegomac_jwt_secret: secret
    traefik_letsencrypt_email: youremail@example.com
    traefik_letsencrypt_production: true
    
  4. Run the playbook:

    $ ansible-playbook -i hosts -k playbook.yml
    

This will install docker and the EPD driver, download the aiolegomac repository, build the 3 docker images and start everything.

Building the main application docker image on a Raspberry Pi Zero takes quite some time. So be patient :-) Just go and do something else.

When the full playbook is complete (it took about 55 minutes for me), you'll have a server with HTTPS support (thanks to Let's Encrypt) running on the Pi. It's displaying the clock every minute and you can send messages to it!

Client

HTTPie

To test the server you can of course use curl but I really like HTTPie. It's much more user friendly.

Let's try to access our new server:

$ http GET https://myhost.example.com
HTTP/1.1 401 Unauthorized
Content-Length: 25
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:22:42 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "error": "Unauthorized"
}

Good, we need to login:

$ http POST https://myhost.example.com/login username=john password=foo
HTTP/1.1 400 Bad Request
Content-Length: 32
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:18:39 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "error": "Invalid credentials"
}

Oops, wrong password:

$ http POST https://myhost.example.com/login username=john password='mypassword'
HTTP/1.1 200 OK
Content-Length: 134
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:21:14 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "token": "eyK0eXAiOiJRV5QiLCJhbGciOiJIUzI1NiJ9.eyJ1c3VyX2lkIjoxLCJleHAiOjE1MDB5MTIwOTh9.hECnj4u2mxvZ2r8IEC-db1T-eKTplM4kWJKZoHhtLxQ"
}

We got a token that we can use:

$ http GET https://myhost.example.com 'Authorization: eyK0eXAiOiJRV5QiLCJhbGciOiJIUzI1NiJ9.eyJ1c3VyX2lkIjoxLCJleHAiOjE1MDB5MTIwOTh9.hECnj4u2mxvZ2r8IEC-db1T-eKTplM4kWJKZoHhtLxQ'
HTTP/1.1 200 OK
Content-Length: 43
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:22:25 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "message": "Welcome to LegoMac john!"
}

Authentication is working, so we can send a message:

$ http POST https://myhost.example.com/messages message='Hello World!' 'Authorization: eyK0eXAiOiJRV5QiLCJhbGciOiJIUzI1NiJ9.eyJ1c3VyX2lkIjoxLCJleHAiOjE1MDB5MTIwOTh9.hECnj4u2mxvZ2r8IEC-db1T-eKTplM4kWJKZoHhtLxQ'
HTTP/1.1 201 Created
Content-Length: 27
Content-Type: application/json; charset=utf-8
Date: Sun, 16 Jul 2017 06:23:46 GMT
Server: Python/3.6 aiohttp/2.2.3

{
    "message": "Hello World!"
}

Message sent! HTTPie is nice for testing, but we can make a small script to easily send messages from the command line.

requests

requests is of course the HTTP library to use in Python.

So let's write a small script to send messages to our server. We'll store the server URL and the username in a small YAML configuration file. If we don't have a token yet, or if the saved one is no longer valid, the script will retrieve one after prompting us for a password. The token is saved in the configuration file for later use.

The following script could be improved with some nicer error messages by catching exceptions. But it does the job:

import os
import click
import requests
import yaml


def get_config(filename):
    with open(filename) as f:
        config = yaml.load(f)
    return config


def save_config(filename, config):
    with open(filename, 'w') as f:
        yaml.dump(config, f, default_flow_style=False)


def get_token(url, username):
    password = click.prompt('Password', hide_input=True)
    payload = {'username': username, 'password': password}
    r = requests.post(url + '/login', json=payload)
    r.raise_for_status()
    return r.json()['token']


def send_message(url, token, message):
    payload = {'message': message}
    headers = {'Authorization': token}
    r = requests.post(url + '/messages', json=payload, headers=headers)
    r.raise_for_status()


@click.command()
@click.option('--conf', '-c', default='~/.pylegomac.yml',
              help='Configuration file [default: "~/.pylegomac.yml"]')
@click.argument('message')
@click.version_option()
def pylegomac(message, conf):
    """Send message to aiolegomac server"""
    filename = os.path.expanduser(conf)
    config = get_config(filename)
    url = config['url']
    username = config['username']
    if 'token' in config:
        try:
            send_message(url, config['token'], message)
        except requests.exceptions.HTTPError as err:
            # Token no more valid
            pass
        else:
            click.echo('Message sent')
            return
    token = get_token(url, username)
    send_message(url, token, message)
    config['token'] = token
    save_config(filename, config)


if __name__ == '__main__':
    pylegomac()

Let's first create a configuration file:

$ cat ~/.pylegomac.yml
url: https://myhost.example.com
username: john

Send a message:

$ python pylegomac.py 'Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.'
Password:
Message sent
/images/legomac/zen_of_python.jpg

Sending a new message won't request the password as the token was saved in the config file.

Conclusion

I have a nice little aiohttp server running on my Raspberry Pi that can receive and display messages. asyncio is quite pleasant to work with. I really like the async/await syntax.

All the code is on github:

  • aiolegomac (the server and client script)
  • legomac (the Ansible playbook to deploy the server)

Why did I only write a command line script to send messages and no web interface? Don't worry, that's planned! I could have used Jinja2. But I'd like to try a javascript framework. So that will be the subject of another post.

Running your application over HTTPS with traefik

I just read another very clear article from Miguel Grinberg about Running Your Flask Application Over HTTPS.

As the title suggests, it describes different ways to run a flask application over HTTPS. I have been using flask for quite some time, but I didn't even know about the ssl_context argument. You should definitely check his article!

Using nginx as a reverse proxy with a self-signed certificate or with Let's Encrypt are two options I have used in the past.

If your app is available on the internet, you should definitely use Let's Encrypt. But if your app is only supposed to be used internally on a private network, a self-signed certificate is an option.

Traefik

I now often use docker to deploy my applications. I was looking for a way to automatically configure Let's Encrypt. I initially found nginx-proxy and docker-letsencrypt-nginx-proxy-companion. This was interesting but wasn't that straightforward to set up.

I then discovered traefik: "a modern HTTP reverse proxy and load balancer made to deploy microservices with ease". And that's really the case! I've used it to deploy several applications and I was impressed. It's written in Go, so it's a single binary. There is also a tiny docker image that makes it easy to deploy. It includes Let's Encrypt support (with automatic renewal), websocket support (no specific setup required)... and many other features.

Here is a traefik.toml configuration example:

defaultEntryPoints = ["http", "https"]

[web]
# Port for the status page
address = ":8080"

# Entrypoints, http and https
[entryPoints]
  # http should be redirected to https
  [entryPoints.http]
  address = ":80"
    [entryPoints.http.redirect]
    entryPoint = "https"
  # https is the default
  [entryPoints.https]
  address = ":443"
    [entryPoints.https.tls]

# Enable ACME (Let's Encrypt): automatic SSL
[acme]
# Email address used for registration
email = "test@traefik.io"
storageFile = "/etc/traefik/acme/acme.json"
entryPoint = "https"
onDemand = false
OnHostRule = true

# Enable Docker configuration backend
[docker]
endpoint = "unix:///var/run/docker.sock"
domain = "example.com"
watch = true
exposedbydefault = false

With this simple configuration, you get:

  • HTTP redirect on HTTPS
  • Let's Encrypt support
  • Docker backend support

A simple example

I created a dummy example just to show how to run a flask application over HTTPS with traefik and Let's Encrypt. Note that traefik is made to dynamically discover backends. So you usually don't run it with your app in the same docker-compose.yml file. It usually runs separately. But to make it easier, I put both in the same file:

version: '2'
services:
  flask:
    build: ./flask
    image: flask
    command: uwsgi --http-socket 0.0.0.0:5000 --wsgi-file app.py --callable app
    labels:
      - "traefik.enable=true"
      - "traefik.backend=flask"
      - "traefik.frontend.rule=${TRAEFIK_FRONTEND_RULE}"
  traefik:
    image: traefik
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.toml:/etc/traefik/traefik.toml:ro
      - ./traefik/acme:/etc/traefik/acme
    ports:
     - "80:80"
     - "443:443"
     - "8080:8080"

Traefik requires access to the docker socket to listen for changes in the backends. It can thus automatically discover when you start and stop containers. You can override the default behaviour by using labels on your containers.

Supposing you own the myhost.example.com domain and have access to ports 80 and 443 (you can set up port forwarding if you run that on your machine behind a router at home), you can run:

$ git clone https://github.com/beenje/flask_traefik_letsencrypt.git
$ cd flask_traefik_letsencrypt
$ export TRAEFIK_FRONTEND_RULE=Host:myhost.example.com
$ docker-compose up

Voilà! Our flask app is available over HTTPS with a real SSL certificate!

/images/flask_traefik/hello_world.png

Traefik discovered the flask docker container and requested a certificate for our domain. All that automatically!

Traefik even comes with a nice dashboard:

/images/flask_traefik/traefik_dashboard.png

With this simple configuration, Qualys SSL Labs gave me an A rating :-)

/images/flask_traefik/traefik_ssl_report.png

Not as good as the A+ for Miguel's site, but not that bad! Especially considering there isn't any specific SSL setup.

A more realistic deployment

As I already mentioned, traefik is made to automatically discover backends (docker containers in my case). So you usually run it by itself.

Here is an example of how it can be deployed using Ansible:

---
- name: create traefik directories
  file:
    path: /etc/traefik/acme
    state: directory
    owner: root
    group: root
    mode: 0755

- name: create traefik.toml
  template:
    src: traefik.toml.j2
    dest: /etc/traefik/traefik.toml
    owner: root
    group: root
    mode: 0644
  notify:
    - restart traefik

- name: create traefik network
  docker_network:
    name: "{{traefik_network}}"
    state: present

- name: launch traefik container with letsencrypt support
  docker_container:
    name: traefik_proxy
    image: "traefik:{{traefik_version}}"
    state: started
    restart_policy: always
    ports:
      - "80:80"
      - "443:443"
      - "{{traefik_dashboard_port}}:8080"
    volumes:
      - /etc/traefik/traefik.toml:/etc/traefik/traefik.toml:ro
      - /etc/traefik/acme:/etc/traefik/acme:rw
      - /var/run/docker.sock:/var/run/docker.sock:ro
    # purge networks so that the container is only part of
    # {{traefik_network}} (and not the default bridge network)
    purge_networks: yes
    networks:
      - name: "{{traefik_network}}"

- name: force all notified handlers to run
  meta: flush_handlers

Nothing strange here. It's quite similar to what we had in our docker-compose.yml file. We created a specific traefik_network. Our docker containers will have to be on that same network.

Here is how we could deploy a flask application on the same server using another ansible role:

- name: launch flask container
  docker_container:
    name: flask
    image: flask
    command: uwsgi --http-socket 0.0.0.0:5000 --wsgi-file app.py --callable app
    state: started
    restart_policy: always
    purge_networks: yes
    networks:
      - name: "{{traefik_network}}"
    labels:
      traefik.enable: "true"
      traefik.backend: "flask"
      traefik.frontend.rule: "Host:myhost.example.com"
      traefik.port: "5000"

We make sure the container is on the same network as the traefik proxy. Note that the traefik.port label is only required if the container exposes multiple ports. It's thus not needed in our example.

That's basically it. As you can see, docker and Ansible make the deployment easy. And traefik takes care of the Let's Encrypt certificate.

Conclusion

Traefik comes with many other features and is well documented. You should check this Docker example that demonstrates load-balancing. Really cool.

If you use docker, you should really give traefik a try!

My LEGO Macintosh Classic with Raspberry Pi and e-paper display

At the beginning of April, I read an inspiring blog post from Jannis Hermanns about a LEGO Macintosh Classic with an e-paper display. It was a really nice and cool article.

I had played with some Raspberry Pis before, but only with software. I had been wanting to fiddle with hardware for some time. This was the perfect opportunity!

LEGO Digital Designer

I decided to try to make my own LEGO Macintosh based on Jannis' work. His blog post is quite detailed and even includes a list of links to all the required components.

But I quickly realized there were no LEGO building instructions... I thus created my own using LEGO Digital Designer, which was fun. Looking at the pictures in Jannis' flickr album helped a lot, but getting an exact idea of the screen size wasn't easy on the computer. So I also built a small prototype of the front part to get a better idea. For that I had to wait for my e-paper display.

One modification I wanted to make was to use 1U-width lego on the sides of the display to require less drilling. I also wanted to check if it was possible to use the button located on top of the display.

My .lxf file is on github.

/images/legomac/legomac_ldd.thumbnail.png

E-paper display

When I was about to order the 2.7 inch e-paper display from Embedded Artists, I noticed that Embedded Artists was located in Malmö, where I live :-).

I e-mailed them and was allowed to pick up my order at their office! A big thanks to them!

Raspberry Pi Zero W

The Raspberry Pi Zero W comes with Wifi, which is really nice. It does not come with a soldered GPIO header, though. I was starting to look at soldering irons when I discovered this GPIO Hammer Header:

/images/legomac/gpio_hammer_header.thumbnail.jpg

No soldering required! I used the installation jig and it was really easy to install. There is a nice video that explains how to proceed:

Connecting the display to the Pi

Based on Jannis' article, I initially thought it wasn't possible to use a ribbon cable (due to space), so I ordered some Jumper Wires. I connected the display to the Pi using the serial expansion connector as described in his blog post. It worked. With the demo from embeddedartists, I managed to display a nice cat picture :-)

/images/legomac/jumper_wires.thumbnail.jpg /images/legomac/cat.thumbnail.jpg

I then realized that the serial expansion connector didn't give access to the button on top of the display. That button could allow some interactions, like changing modes, which would be nice. According to my prototype with 1U-width lego on the sides, using a ribbon cable shouldn't actually be an issue. So I ordered a Downgrade GPIO Ribbon Cable for Raspberry Pi.

It required a little drilling on the right side for the cable to fit, but not that much. More was needed on the left side to center the screen. Carried away by my enthusiasm, I actually cut a bit too much on the left side (using the dremel was fun :-).

/images/legomac/drilling_left.thumbnail.jpg /images/legomac/drilling_right.thumbnail.jpg

Everything fitted nicely in the lego case:

/images/legomac/ribbon_cable.thumbnail.jpg

Button on top

With the ribbon cable, the button on top of the display is connected to pin 15 on the Raspberry Pi (BCM GPIO22). The ImageDemoButton.py part of the demo shows an example of how to use the button to change the displayed image.

Using my small prototype, I planned a small hole on top of the case. I thought I'd have to fill the brick with something hard to press the button, but the 1x1 brick ended up fitting perfectly. As shown in the picture below, its side is exactly on top of the button. I added a little piece of foam inside the brick to keep it straight.

/images/legomac/button_front.thumbnail.jpg

Of course I moved away from the Macintosh Classic design here... but practicality beats purity :-)

Pi configuration

Jannis' article made me discover resin.io, which is a really interesting project. I did a few tests on a Raspberry Pi 3 and it was a nice experience. But when I received my Pi Zero W, it wasn't supported by resinOS yet... This isn't the case anymore! Version 2.0.3 added support for the wifi chip.

Anyway, as Jannis already wrote about resinOS, I'll describe my tests with Raspbian. To flash the SD card, I recommend Etcher, which is an open source project from resin.io as well. I'm more of a command line guy and I have used dd many times, but I was pleasantly surprised. It's easy to install and use.

  1. Download and install Etcher
  2. Download Raspbian Jessie Lite image
  3. Flash the SD card using Etcher
  4. Mount the SD card to configure it:
# Go to the boot partition
# This is an example on OSX (mount point will be different on Linux)
$ cd /Volumes/boot

# To enable ssh, create a file named ssh onto the boot partition
$ touch ssh

# Create the file wpa_supplicant.conf with your wifi settings
$  cat << EOF > wpa_supplicant.conf
network={
    ssid="MyWifiNetwork"
    psk="password"
    key_mgmt=WPA-PSK
}
EOF

# Uncomment dtparam=spi=on to enable the SPI master driver
$ vi config.txt

# Leave the boot partition
$ cd
  5. Unmount the SD card and put it in the Raspberry Pi
  6. Boot the Pi

I wrote a small Ansible playbook to install the E-ink driver and the clock demo:

- name: install required dependencies
  apt:
    name: "{{item}}"
    state: present
    update_cache: yes
  with_items:
    - git
    - libfuse-dev
    - fonts-liberation
    - python-pil

- name: check if the epd-fuse service exists
  command: systemctl status epd-fuse.service
  check_mode: no
  failed_when: False
  changed_when: False
  register: epd_fuse_service

- name: clone the embeddedartists gratis repository
  git:
    repo: https://github.com/embeddedartists/gratis.git
    dest: /home/pi/gratis

- name: build the EPD driver and install the epd-fuse service
  shell: >
    COG_VERSION=V2 make rpi-epd_fuse &&
    COG_VERSION=V2 make rpi-install
  args:
    chdir: /home/pi/gratis/PlatformWithOS
  when: epd_fuse_service.rc != 0

- name: ensure the epd-fuse service is enabled and started
  service:
    name: epd-fuse
    state: started
    enabled: yes

- name: install the epd-clock service
  copy:
    src: epd-clock.service
    dest: /etc/systemd/system/epd-clock.service
    owner: root
    group: root
    mode: 0644

- name: start and enable epd-clock service
  systemd:
    name: epd-clock.service
    daemon_reload: yes
    state: started
    enabled: yes

To run the playbook, clone the repository https://github.com/beenje/legomac:

$ git clone https://github.com/beenje/legomac.git
$ cd legomac
$ ansible-playbook -i hosts -k epd-demo.yml

That's it!

Of course don't forget to change the default password on your Pi.

One more thing

There isn't much Python in this article, but the Pi is running some Python code. I couldn't resist putting a Talk Python To Me sticker on the back :-) It's really a great podcast and you should definitely give it a try if you haven't yet. Thanks again to @mkennedy for the stickers!

/images/legomac/talkpythontome.thumbnail.jpg

Below are a few pictures. You can see more on flickr.

Dockerfile anti-patterns and best practices

I've been using Docker for some time now. There is already a lot of documentation available online but I recently saw the same "anti-patterns" several times, so I thought it was worth writing a post about it.

I won't repeat all the Best practices for writing Dockerfiles here. You should definitely read that page.

I want to emphasize some things that took me some time to understand.

Avoid invalidating the cache

Let's take a simple example with a Python application:

FROM python:3.6

COPY . /app
WORKDIR /app

RUN pip install -r requirements.txt

ENTRYPOINT ["python"]
CMD ["ap.py"]

It's actually an example I have seen several times online. This looks fine, right?

The problem is that the COPY . /app command will invalidate the cache as soon as any file in the current directory is updated. Let's say you just change the README file and run docker build again. Docker will have to re-install all the requirements because the RUN pip command is run after the COPY that invalidated the cache.

The requirements should only be re-installed if the requirements.txt file changes:

FROM python:3.6

WORKDIR /app

COPY requirements.txt /app/requirements.txt
RUN pip install -r requirements.txt

COPY . /app

ENTRYPOINT ["python"]
CMD ["ap.py"]

With this Dockerfile, the RUN pip command will only be re-run when the requirements.txt file changes. It will use the cache otherwise.

This is much more efficient and will save you quite some time if you have many requirements to install.

Minimize the number of layers

What does that really mean?

Each Docker image references a list of read-only layers that represent filesystem differences. Every command in your Dockerfile will create a new layer.

Let's use the following Dockerfile:

FROM centos:7

RUN yum update -y
RUN yum install -y sudo
RUN yum install -y git
RUN yum clean all

Build the docker image and check the layers created with the docker history command:

$ docker build -t centos-test .
...
$ docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED              SIZE
centos-test                      latest              1fae366a2613        About a minute ago   470 MB
centos                           7                   98d35105a391        24 hours ago         193 MB
$ docker history centos-test
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
1fae366a2613        2 minutes ago       /bin/sh -c yum clean all                        1.67 MB
999e7c7c0e14        2 minutes ago       /bin/sh -c yum install -y git                   133 MB
c97b66528792        3 minutes ago       /bin/sh -c yum install -y sudo                  81 MB
e0c7b450b7a8        3 minutes ago       /bin/sh -c yum update -y                        62.5 MB
98d35105a391        24 hours ago        /bin/sh -c #(nop)  CMD ["/bin/bash"]            0 B
<missing>           24 hours ago        /bin/sh -c #(nop)  LABEL name=CentOS Base ...   0 B
<missing>           24 hours ago        /bin/sh -c #(nop) ADD file:29f66b8b4bafd0f...   193 MB
<missing>           6 months ago        /bin/sh -c #(nop)  MAINTAINER https://gith...   0 B

There are two problems with this Dockerfile:

  1. We added too many layers for nothing.
  2. The yum clean all command is meant to reduce the size of the image but it actually does the opposite by adding a new layer!

Let's check that by removing the last command and running the build again:

FROM centos:7

RUN yum update -y
RUN yum install -y sudo
RUN yum install -y git
# RUN yum clean all
$ docker build -t centos-test .
...
$ docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             SIZE
centos-test                      latest              999e7c7c0e14        11 minutes ago      469 MB
centos                           7                   98d35105a391        24 hours ago        193 MB

The new image without the yum clean all command is indeed smaller than the previous image (1.67 MB smaller)!

If you want to remove files, it's important to do that in the same RUN command that created those files. Otherwise there is no point.

Here is the proper way to do it:

FROM centos:7

RUN yum update -y \
  && yum install -y \
  sudo \
  git \
  && yum clean all

Let's build this new image:

$ docker build -t centos-test .
...
$ docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             SIZE
centos-test                      latest              54a328ef7efd        21 seconds ago      265 MB
centos                           7                   98d35105a391        24 hours ago        193 MB
$ docker history centos-test
IMAGE               CREATED              CREATED BY                                      SIZE                COMMENT
54a328ef7efd        About a minute ago   /bin/sh -c yum update -y   && yum install ...   72.8 MB
98d35105a391        24 hours ago         /bin/sh -c #(nop)  CMD ["/bin/bash"]            0 B
<missing>           24 hours ago         /bin/sh -c #(nop)  LABEL name=CentOS Base ...   0 B
<missing>           24 hours ago         /bin/sh -c #(nop) ADD file:29f66b8b4bafd0f...   193 MB
<missing>           6 months ago         /bin/sh -c #(nop)  MAINTAINER https://gith...   0 B

The new image is only 265 MB compared to the 470 MB of the original image. There isn't much more to say :-)

If you want to know more about images and layers, you should read the documentation: Understand images, containers, and storage drivers.

Conclusion

Avoid invalidating the cache:

  • start your Dockerfile with commands that should not change often
  • put commands that can often invalidate the cache (like COPY .) as late as possible
  • only add the needed files (use a .dockerignore file)

Minimize the number of layers:

  • put related commands in the same RUN instruction
  • remove files in the same RUN command that created them

Control your accessories from Home Assistant with Siri and HomeKit

While reading more about Home Assistant, I discovered it was possible to control your accessories from Home Assistant with Siri and HomeKit. I decided to give that a try.

This requires installing Homebridge and the homebridge-homeassistant plugin.

Install Homebridge

Homebridge is a lightweight NodeJS server that emulates the iOS HomeKit API. Let's install it in the same LXC container as Home Assistant:

root@turris:~# lxc-attach -n homeassistant

I followed the Running HomeBridge on a Raspberry Pi page.

We need curl and git:

root@homeassistant:~# apt-get install -y curl git

Install Node:

root@homeassistant:~# curl -sL https://deb.nodesource.com/setup_6.x | bash -
## Installing the NodeSource Node.js v6.x repo...

## Populating apt-get cache...

root@homeassistant:~# apt-get install -y nodejs

Install avahi and other dependencies:

root@homeassistant:~# apt-get install -y libavahi-compat-libdnssd-dev

Install Homebridge and its dependencies, still following this page. Note that I had a strange problem here: the npm command didn't produce any output. I found the same issue on stackoverflow and even an issue on github. The workaround is just to open a new terminal...

root@homeassistant:~# npm install -g --unsafe-perm homebridge hap-nodejs node-gyp
root@homeassistant:~# cd /usr/lib/node_modules/homebridge/
root@homeassistant:/usr/lib/node_modules/homebridge# npm install --unsafe-perm bignum
root@homeassistant:/usr/lib/node_modules/homebridge# cd ../hap-nodejs/node_modules/mdns/
root@homeassistant:/usr/lib/node_modules/hap-nodejs/node_modules/mdns# node-gyp BUILDTYPE=Release rebuild

Install and configure homebridge-homeassistant plugin

root@homeassistant:/usr/lib/node_modules/hap-nodejs/node_modules/mdns# cd
root@homeassistant:~# npm install -g --unsafe-perm homebridge-homeassistant

Try to start Homebridge:

root@homeassistant:~# su -s /bin/bash homeassistant
homeassistant@homeassistant:~$ homebridge

Homebridge won't do anything until you've created a configuration file. So press CTRL-C and create the file ~/.homebridge/config.json:

homeassistant@homeassistant:~$ cat <<EOF >> ~/.homebridge/config.json
{
  "bridge": {
    "name": "Homebridge",
    "username": "CC:22:3D:E3:CE:30",
    "port": 51826,
    "pin": "031-45-154"
  },

  "platforms": [
    {
      "platform": "HomeAssistant",
      "name": "HomeAssistant",
      "host": "http://localhost:8123",
      "logging": false
    }
 ]
}
EOF

Note that you can change the username and pin code. You will need the PIN code to add the Homebridge accessory to HomeKit.

Check the Home Assistant plugin page for more information on how to configure the plugin.

Automatically start Homebridge

Let's configure systemd. Create the file /etc/systemd/system/home-assistant@homebridge.service:

root@homeassistant:~# cat <<EOF >> /etc/systemd/system/home-assistant@homebridge.service
[Unit]
Description=Node.js HomeKit Server
After=syslog.target network-online.target

[Service]
Type=simple
User=homeassistant
ExecStart=/usr/bin/homebridge -U /home/homeassistant/.homebridge
Restart=on-failure
RestartSec=10
KillMode=process

[Install]
WantedBy=multi-user.target
EOF

Enable and launch Homebridge:

root@homeassistant:~# systemctl --system daemon-reload
root@homeassistant:~# systemctl enable home-assistant@homebridge
Created symlink from /etc/systemd/system/multi-user.target.wants/home-assistant@homebridge.service to /etc/systemd/system/home-assistant@homebridge.service.
root@homeassistant:~# systemctl start home-assistant@homebridge

Adding Homebridge to iOS

Homebridge and the Home Assistant plugin are now running. Using the Home app on your iOS device, you should be able to add the accessory "Homebridge". See Homebridge README for more information. You will need to enter the PIN code defined in your config.json file.

You should then see the Homebridge bridge on your device:

/images/homebridge.png

And it will automatically add all the accessories defined in Home Assistant!

/images/home_accessories.png

You can now even use Siri to control your devices, like turning the TV VPN ON or OFF.

/images/siri_tv_vpn_off.png

Note that I renamed the original switch to make it easier to pronounce. As described in the README, avoid names usually used by Siri like "Radio" or "Sonos".

That's it! Homebridge is really a nice addition to Home Assistant if you have some iOS devices at home.

Docker and conda

I just read a blog post about Using Docker with Conda Environments. I do things slightly differently so I thought I would share an example of Dockerfile I use:

FROM continuumio/miniconda3:latest

# Install extra packages if required
RUN apt-get update && apt-get install -y \
    xxxxxx \
    && rm -rf /var/lib/apt/lists/*

# Add the user that will run the app (no need to run as root)
RUN groupadd -r myuser && useradd -r -g myuser myuser

WORKDIR /app

# Install myapp requirements
COPY environment.yml /app/environment.yml
RUN conda config --add channels conda-forge \
    && conda env create -n myapp -f environment.yml \
    && rm -rf /opt/conda/pkgs/*

# Install myapp
COPY . /app/
RUN chown -R myuser:myuser /app/*

# activate the myapp environment
ENV PATH /opt/conda/envs/myapp/bin:$PATH

I don't run source activate myapp but just use ENV to update the PATH variable. There is only one environment in the docker image. No need for the extra checks done by the activate script.

With this Dockerfile, any command will be run in the myapp environment.

Just a few additional notes:

  1. Be sure to copy only the environment.yml file before copying the full current directory. Otherwise any change in the directory would invalidate the docker cache. We only want to re-create the conda environment if environment.yml changes.
  2. I always add the conda-forge channel. Check this post if you haven't heard of it yet.
  3. I clean some cache (/var/lib/apt/lists/ and /opt/conda/pkgs/) to make the image a bit smaller.

I switched from virtualenv to conda a while ago and I really enjoy it. A big thanks to Continuum Analytics!

Home Assistant on Turris Omnia via LXC container

In a previous post, I described how to install OpenVPN client on a Turris Omnia router. To start or stop the client, I was using the command line and mentioned the LuCi Web User Interface.

Neither way is super easy or fast to access. A while ago, I wrote a small Flask web application to change some settings in my router. The application just let me click a button to run a script on the router via ssh.

So I could write a small webapp to do just that. But I recently read about Home Assistant. It's an open-source home automation platform to track and control your devices at home. There are many components available, including the Command Line Switch, which looks like exactly what I need.

The Raspberry Pi is a popular device for running Home Assistant. But my Turris Omnia is quite powerful for a router, with 1 GB of RAM and 8 GB of flash. It's time to use some of that power.

From what I read, there is an OpenWrt package for Home Assistant. I couldn't find it in the packages available on the Turris Omnia. Anyway, there is another feature I wanted to try: LXC containers. Home Assistant is a Python application, so it's easy to install in a Linux container, and that makes it easy to keep the version up to date.

So let's start!

Create a LXC container

As described here, you can create an LXC container via the LuCI web interface or via the command line:

root@turris:~# lxc-create -t download -n homeassistant
Setting up the GPG keyring
Downloading the image index
WARNING: Failed to download the file over HTTPs.
         The file was instead download over HTTP. A server replay attack may be possible!

 ---
 DIST  RELEASE  ARCH  VARIANT  BUILD
 ---
 Turris_OS  stable  armv7l  default  2017-01-22
 Turris_OS  stable  ppc  default  2017-01-22
 Alpine  3.4  armv7l  default  2017-01-22
 Debian  Jessie  armv7l  default  2017-01-22
 Gentoo  stable  armv7l  default  2017-01-22
 openSUSE  13.2  armv7l  default  2017-01-22
 openSUSE  42.2  armv7l  default  2017-01-22
 openSUSE  Tumbleweed  armv7l  default  2017-01-22
 Ubuntu  Xenial  armv7l  default  2017-01-22
 Ubuntu  Yakkety  armv7l  default  2017-01-22
 ---

 Distribution: Debian
 Release: Jessie
 Architecture: armv7l

 Flushing the cache...
 Downloading the image index
 Downloading the rootfs
 Downloading the metadata
 The image cache is now ready
 Unpacking the rootfs

 ---
 Distribution Debian version Jessie was just installed into your
 container.

 Content of the tarballs is provided by third party, thus there is
 no warranty of any kind.

As you can see above, I chose a Debian Jessie distribution.

Let's start and enter the container:

root@turris:~# lxc-start -n homeassistant
root@turris:~# lxc-attach -n homeassistant

Now that we are inside the container, we can first set the root password:

root@LXC_NAME:~# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

LXC_NAME is not a super nice hostname. Let's update it:

root@LXC_NAME:~# hostnamectl set-hostname homeassistant
Failed to create bus connection: No such file or directory

Ok... We have to install dbus. While we are at it, let's install vim because we'll need it to edit the homeassistant configuration:

root@LXC_NAME:~# apt-get update
root@LXC_NAME:~# apt-get upgrade
root@LXC_NAME:~# apt-get install -y dbus vim

Setting the hostname now works properly:

root@LXC_NAME:~# hostnamectl set-hostname homeassistant

We can exit and enter the container again to see the change:

root@LXC_NAME:~# exit
root@turris:~# lxc-attach -n homeassistant
root@homeassistant:~#

Install Home Assistant

Next, we just have to follow the Home Assistant installation instructions. They are well detailed. I'll quickly repeat them here to make it easier to follow, but you should refer to the official page for any updates:

root@homeassistant:~# apt-get install python-pip python3-dev
root@homeassistant:~# pip install --upgrade virtualenv
root@homeassistant:~# adduser --system homeassistant
root@homeassistant:~# mkdir /srv/homeassistant
root@homeassistant:~# chown homeassistant /srv/homeassistant
root@homeassistant:~# su -s /bin/bash homeassistant
homeassistant@homeassistant:/root$ virtualenv -p python3 /srv/homeassistant
homeassistant@homeassistant:/root$ source /srv/homeassistant/bin/activate
(homeassistant) homeassistant@homeassistant:/root$ pip3 install --upgrade homeassistant

Just run hass to start the application and create the default configuration:

(homeassistant) homeassistant@homeassistant:/root$ hass

Press CTRL-C to exit. Check the created configuration file: /home/homeassistant/.homeassistant/configuration.yaml.

You can comment out the introduction: line:

# Show links to resources in log and frontend
#introduction:

Add a switch to Home Assistant

To start and stop our VPN, we define a Command Line Switch that triggers the openvpn init script on the router. Add the following at the end of the file:

switch:
  platform: command_line
  switches:
        atv_vpn:
          command_on: 'ssh root@<router IP> "/etc/init.d/openvpn start"'
          command_off: 'ssh root@<router IP> "/etc/init.d/openvpn stop"'
          friendly_name: ATV4 VPN

The LXC container is just like another computer (a virtual one) on the local network. To access the router, we have to ssh to it. For this to work without requesting a password, we have to generate an ssh key and add the public key to the authorized_keys file on the router:

homeassistant@homeassistant:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/homeassistant/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/homeassistant/.ssh/id_rsa.
Your public key has been saved in /home/homeassistant/.ssh/id_rsa.pub.

Copy the content of /home/homeassistant/.ssh/id_rsa.pub to /root/.ssh/authorized_keys (on the router not inside the container).

With this configuration, the switch will always be off when you restart Home Assistant. It also won't know if you change the state using the command line or the LuCI web interface. This can be solved by adding the optional command_state line. The command shall return exit code 0 if the switch is on. The openvpn init script on the Turris Omnia doesn't take "status" as an argument. An easy way to check if openvpn is running is to use pgrep. Our new configuration becomes:

switch:
  platform: command_line
  switches:
        atv_vpn:
          command_on: 'ssh root@<router IP> "/etc/init.d/openvpn start"'
          command_off: 'ssh root@<router IP> "/etc/init.d/openvpn stop"'
          command_state: 'ssh root@<router IP> "pgrep /usr/sbin/openvpn"'
          friendly_name: ATV4 VPN

That's it. The switch state will now properly be updated even if the VPN is started or stopped without using the application.

If you go to http://<container IP>:8123, you should see something like that:

/images/hass_home.png

Automatically start Home Assistant

Let's configure systemd to automatically start the application. Create the file /etc/systemd/system/home-assistant@homeassistant.service:

root@homeassistant:~# cat <<EOF >> /etc/systemd/system/home-assistant@homeassistant.service
[Unit]
Description=Home Assistant
After=network.target

[Service]
Type=simple
User=homeassistant
ExecStart=/srv/homeassistant/bin/hass -c "/home/homeassistant/.homeassistant"

[Install]
WantedBy=multi-user.target
EOF

Enable and launch Home Assistant:

root@homeassistant:~# systemctl --system daemon-reload
root@homeassistant:~# systemctl enable home-assistant@homeassistant
Created symlink from /etc/systemd/system/multi-user.target.wants/home-assistant@homeassistant.service to /etc/systemd/system/home-assistant@homeassistant.service.
root@homeassistant:~# systemctl start home-assistant@homeassistant

You can check the logs with:

root@homeassistant:~# journalctl -f -u home-assistant@homeassistant

We just have to make sure the container starts automatically when we reboot the router. Set the following in /etc/config/lxc-auto:

root@turris:~# cat /etc/config/lxc-auto
config container
  option name homeassistant
  option timeout 60

Make it easy to access Home Assistant

There is one more thing we want to do: assign a fixed IP to the container. This can be done like for any machine on the LAN, via the DHCP and DNS settings in the LuCI interface. In Static Leases, assign a fixed IP to the container's MAC address.

Now that the container has a fixed IP, go to http://<container IP>:8123 and create a bookmark or add an icon to your phone and tablet home screen. This makes it easy for anyone at home to turn the VPN on and off!

/images/hass_icon.png

OpenVPN source based routing

I already spoke about installing OpenVPN on a Raspberry Pi in another blog post.

I only connect to this VPN server to access content that requires a French IP address. I use the OpenVPN Connect app on my iPad and Tunnelblick on my Mac. It works nicely, but how to use this VPN on my Apple TV 4? There is no VPN client available...

At the end of last year, I finally received my Turris Omnia, which I supported on Indiegogo. It's a nice router running a free operating system based on OpenWrt, with automatic updates. If you haven't heard about it, you should check it out.

Configuring OpenVPN client on OpenWrt

Installing an OpenVPN client on OpenWrt is not very difficult. Here is a quick summary.

  1. Install the openvpn-openssl package (via the web interface or the command line)

  2. I already have a custom client config that I generated with Ansible in this post. To use this config, create the file /etc/config/openvpn:

    # cat /etc/config/openvpn
    package openvpn
    
    config openvpn myvpn
            # Set to 1 to enable this instance:
            option enabled 1
            # Include OpenVPN configuration
            option config /etc/openvpn/myclientconfig.ovpn
    
  3. Add a new interface in /etc/config/network:

    config interface 'myvpn'
           option proto 'none'
           option ifname 'tun0'
    
  4. Add a new zone to /etc/config/firewall:

    config zone
            option forward 'REJECT'
            option output 'ACCEPT'
            option name 'VPN_FW'
            option input 'REJECT'
            option masq '1'
            option network 'myvpn'
            option mtu_fix '1'
    
    config forwarding
            option dest 'VPN_FW'
            option src 'lan'
    
  5. An easy way to configure DNS servers is to add fixed DNS for the WAN interface of the router. To use Google DNS, add the following two lines to the wan interface in /etc/config/network:

    # diff -u network.save network
    @@ -20,6 +20,8 @@
     config interface 'wan'
             option ifname 'eth1'
             option proto 'dhcp'
    +        option peerdns '0'
    +        option dns '8.8.8.8 8.8.4.4'
    

If you run /etc/init.d/openvpn start with this config, you should connect successfully! All the traffic will go via the VPN. That's nice but it's not what I want. I only want my Apple TV traffic to go via the VPN. How to achieve that?

Source based routing

I quickly found this wiki page to implement source-based routing. Exactly what I want. What took me some time to realize is that, before doing that, I had to ignore the routes pushed by the server.

With my configuration, when the client connects, the server pushes some routes among which a default route that makes all the traffic go via the VPN:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.8.0.21       128.0.0.0       UG    0      0        0 tun0
...

Ignoring the routes pushed by the server can be done with the --route-noexec option. I tried to add option route_noexec 1 to my /etc/config/openvpn file, but it had no effect. It looks like, when using a custom config, you can't add other options there: you have to set everything in the custom config. I added route-noexec to my /etc/openvpn/myclientconfig.ovpn file and it worked! No more routes added. No traffic sent via the VPN.

We can now apply the changes described in the Routing wiki page.

  1. Install the ip package

  2. Add the 10 vpn line to /etc/iproute2/rt_tables so that it looks like this:

    # cat /etc/iproute2/rt_tables
    #
    # reserved values
    #
    255  local
    254  main
    253  default
    10   vpn
    0    unspec
    #
    # local
    #
    #1  inr.ruhep
    
  3. We now need to add a new rule and route when starting the client. We can do so using the OpenVPN up option. Create the /etc/openvpn/upvpn script:

    # cat /etc/openvpn/upvpn
    #!/bin/sh
    
    client=192.168.75.20
    
    tun_dev=$1
    tun_mtu=$2
    link_mtu=$3
    ifconfig_local_ip=$4
    ifconfig_remote_ip=$5
    
    echo "Routing client $client traffic through VPN"
    ip rule add from $client priority 10 table vpn
    ip route add $client dev $tun_dev table vpn
    ip route add default via $ifconfig_remote_ip dev $tun_dev table vpn
    ip route flush cache
    
  4. Create the /etc/openvpn/downvpn script to properly remove the rule and route:

    # cat /etc/openvpn/downvpn
    #!/bin/sh
    
    client=192.168.75.20
    
    tun_dev=$1
    tun_mtu=$2
    link_mtu=$3
    ifconfig_local_ip=$4
    ifconfig_remote_ip=$5
    
    echo "Delete client $client traffic routing through VPN"
    ip rule del from $client priority 10 table vpn
    ip route del $client dev $tun_dev table vpn
    ip route del default via $ifconfig_remote_ip dev $tun_dev table vpn
    ip route flush cache
    
  5. We now have to add those scripts to the client config. Here is everything I added to my /etc/openvpn/myclientconfig.ovpn file:

    # Don't add or remove routes automatically
    # Source based routing for specific client added in up script
    route-noexec
    # script-security 2 needed to run up and down scripts
    script-security 2
    # Script to run after successful TUN/TAP device open
    up /etc/openvpn/upvpn
    # Call the down script before closing the TUN device to properly remove the routing
    down-pre
    down /etc/openvpn/downvpn
    

Notice that the IP address of the machine we want to route via the VPN is hard-coded in the upvpn and downvpn scripts. This IP must be fixed. You can easily achieve that by assigning the IP to the machine's MAC address in your router's DHCP settings.

The tunnel remote IP is automatically passed as a parameter to the up and down scripts by OpenVPN.

If we run /etc/init.d/openvpn start with this config, only the traffic from the 192.168.75.20 IP address will go via the VPN!

Run /etc/init.d/openvpn stop to close the tunnel.

Conclusion

This is a nice way to route traffic through a VPN based on the source IP address.

You can of course use the router web interface to stop and start OpenVPN. In another post, I'll talk about an even more user-friendly way to control it.

Parsing and indexing PDF in Python

I have a Doxie Go scanner and I scan all the documents I receive on paper. That's nice, but it creates another problem. All the resulting PDF files have to be named, organized and stored... Doing that manually is boring and time consuming. Of course that's something I want to automate!

I even bought Hazel a while ago. It's a nice piece of software that monitors files in a folder and performs specific actions based on the rules you define. It works well, but I felt a bit limited and I thought I could probably write something more tailored to my use case. And that would be more fun :-)

Parsing PDF in Python

A quick solution I found was to run pdftotext using subprocess. I looked at PDFMiner, a pure Python PDF parser, but I found the pdftotext output to be more accurate. On macOS, you can install it using Homebrew:

$ brew install Caskroom/cask/pdftotext

Here is a simple Python function to do that:

In [1]:
import subprocess

def parse_pdf(filename):
    try:
        content = subprocess.check_output(["pdftotext", '-enc', 'UTF-8', filename, "-"])
    except subprocess.CalledProcessError as e:
        print('Skipping {} (pdftotext returned status {})'.format(filename, e.returncode))
        return None
    return content.decode('utf-8')

Let's try to parse a pdf file. We'll use requests to download a sample file.

In [2]:
import requests

url = 'http://www.cbu.edu.zm/downloads/pdf-sample.pdf'
response = requests.get(url)
with open('/tmp/pdf-sample.pdf', 'wb') as f:
    f.write(response.content)

Let's first look at the PDF:

In [3]:
from IPython.display import IFrame
IFrame('http://www.cbu.edu.zm/downloads/pdf-sample.pdf', width=600, height=870)
Out[3]:

Nothing complex. It should be easy to parse.

In [4]:
content = parse_pdf('/tmp/pdf-sample.pdf')
content
Out[4]:
"Adobe Acrobat PDF Files\nAdobe® Portable Document Format (PDF) is a universal file format that preserves all of the fonts, formatting, colours and graphics of any source document, regardless of the application and platform used to create it. Adobe PDF is an ideal format for electronic document distribution as it overcomes the problems commonly encountered with electronic file sharing. • Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat Reader. Recipients of other file formats sometimes can't open files because they don't have the applications used to create the documents. PDF files always print correctly on any printing device. PDF files always display exactly as created, regardless of fonts, software, and operating systems. Fonts, and graphics are not lost due to platform, software, and version incompatibilities. The free Acrobat Reader is easy to download and can be freely distributed by anyone. Compact PDF files are smaller than their source files and download a page at a time for fast display on the Web.\n\n• •\n\n• •\n\n\x0c"

This works quite well. The layout is not respected, but it's the text that matters. It would be easy to write some regular expressions to define rules based on the PDF content.
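For example, here is a minimal sketch of what such rules could look like (the keywords and categories are made up for illustration):

import re

def guess_document_type(content):
    # Naive rules based on keywords found in the extracted text.
    # The keywords and categories below are only examples.
    rules = {
        'invoice': re.compile(r'invoice|facture', re.IGNORECASE),
        'bank': re.compile(r'bank statement|relevé de compte', re.IGNORECASE),
    }
    for doc_type, pattern in rules.items():
        if pattern.search(content):
            return doc_type
    return 'unknown'

guess_document_type(content)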

This could be the first step in naming and organizing the scanned documents. But it would be nice to have an interface to easily search in all the files. I've already used MongoDB full text search in a webapp I wrote and it worked well for my use case. But I read about Elasticsearch and I always wanted to give it a try.

Elasticsearch Ingest Attachment Processor Plugin

I could just index the result from pdftotext, but I know there is a plugin that can parse PDF files.
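For reference, indexing the raw pdftotext output would look roughly like this (just a sketch: the documents index name is made up, and it uses the elasticsearch-py client introduced below):

from elasticsearch import Elasticsearch

es = Elasticsearch()
content = parse_pdf('/tmp/pdf-sample.pdf')
if content is not None:
    # 'documents' is an arbitrary index name used for this sketch
    es.index(index='documents', doc_type='document',
             body={'path': '/tmp/pdf-sample.pdf', 'content': content})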

The Mapper Attachments Type plugin is deprecated in 5.0.0. It has been replaced with the ingest-attachment plugin. So let's look at that.

Running Elasticsearch

To run Elasticsearch, the easiest way is to use Docker. As the official image from Docker Hub comes without any plugins, we'll create our own image. See Elasticsearch Plugin Management with Docker for more information.

Here is our Dockerfile:

FROM elasticsearch:5

RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment

Create the elasticsearch-ingest docker image:

$ docker build -t elasticsearch-ingest .

We can now run elasticsearch with the ingest-attachment plugin:

$ docker run -d -p 9200:9200 elasticsearch-ingest

Python Elasticsearch Client

We'll use elasticsearch-py to interact with our Elasticsearch cluster.

In [5]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

Let's first check that our elasticsearch cluster is alive by asking about its health:

In [6]:
es.cat.health()
Out[6]:
'1479333419 21:56:59 elasticsearch green 1 1 0 0 0 0 0 0 - 100.0%\n'

Nice! We can start playing with our ES cluster.

As described in the documentation, we first have to create a pipeline to use the Ingest Attachment Processor Plugin:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

OK, how do we do that using the Python client?

In [7]:
body = {
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)
Out[7]:
{'acknowledged': True}
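Note that the Python client also exposes the ingest APIs directly, so something like the following should work as well (I haven't tested it here):

es.ingest.put_pipeline(id='attachment', body=body)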

Now, we can send a document to our pipeline. Let's start by using the same example as in the documentation:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

Using the Python client, this gives:

In [8]:
result1 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                  body={'data': "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="})
result1
Out[8]:
{'_id': 'AVhvJKzVIvjFWZACJU_t',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'my_type',
 '_version': 1,
 'created': True,
 'result': 'created'}

Let's try to get the created document based on its id:

In [9]:
es.get(index='my_index', doc_type='my_type', id=result1['_id'])
Out[9]:
{'_id': 'AVhvJKzVIvjFWZACJU_t',
 '_index': 'my_index',
 '_source': {'attachment': {'content': 'Lorem ipsum dolor sit amet',
   'content_length': 28,
   'content_type': 'application/rtf',
   'language': 'ro'},
  'data': 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0='},
 '_type': 'my_type',
 '_version': 1,
 'found': True}

We can see that the binary data passed to the pipeline was a Rich Text Format file and that the content was extracted: Lorem ipsum dolor sit amet
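Out of curiosity, we can decode the data field ourselves to confirm it is indeed a small RTF document (a quick sanity check, not part of the original example):

import base64

raw = base64.b64decode('e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=')
print(raw.decode('ascii'))
# {\rtf1\ansi
# Lorem ipsum dolor sit amet
# \par }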

Displaying the binary data is not very useful. It doesn't matter in this example as it's quite small, but it would be much bigger for real files, even small ones. We can exclude it using _source_exclude:

In [10]:
es.get(index='my_index', doc_type='my_type', id=result1['_id'], _source_exclude=['data'])
Out[10]:
{'_id': 'AVhvJKzVIvjFWZACJU_t',
 '_index': 'my_index',
 '_source': {'attachment': {'content': 'Lorem ipsum dolor sit amet',
   'content_length': 28,
   'content_type': 'application/rtf',
   'language': 'ro'}},
 '_type': 'my_type',
 '_version': 1,
 'found': True}

Indexing PDF files

Let's try to parse the same sample pdf as before.

In [11]:
url = 'http://www.cbu.edu.zm/downloads/pdf-sample.pdf'
response = requests.get(url)

Note that we have to encode the content of the PDF before passing it to ES. The source field must be base64-encoded binary data.

In [12]:
import base64

data = base64.b64encode(response.content).decode('ascii')
In [13]:
result2 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                  body={'data': data})
result2
Out[13]:
{'_id': 'AVhvJMC6IvjFWZACJU_u',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'my_type',
 '_version': 1,
 'created': True,
 'result': 'created'}

We can get the document based on its id:

In [14]:
doc = es.get(index='my_index', doc_type='my_type', id=result2['_id'], _source_exclude=['data'])
doc
Out[14]:
{'_id': 'AVhvJMC6IvjFWZACJU_u',
 '_index': 'my_index',
 '_source': {'attachment': {'author': 'cdaily',
   'content': "Adobe Acrobat PDF Files\n\nAdobe® Portable Document Format (PDF) is a universal file format that preserves all\nof the fonts, formatting, colours and graphics of any source document, regardless of\nthe application and platform used to create it.\n\nAdobe PDF is an ideal format for electronic document distribution as it overcomes the\nproblems commonly encountered with electronic file sharing.\n\n•  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat\nReader. Recipients of other file formats sometimes can't open files because they\ndon't have the applications used to create the documents.\n\n•  PDF files always print correctly on any printing device.\n\n•  PDF files always display exactly as created, regardless of fonts, software, and\noperating systems. Fonts, and graphics are not lost due to platform, software, and\nversion incompatibilities.\n\n•  The free Acrobat Reader is easy to download and can be freely distributed by\nanyone.\n\n•  Compact PDF files are smaller than their source files and download a\npage at a time for fast display on the Web.",
   'content_length': 1073,
   'content_type': 'application/pdf',
   'date': '2000-06-28T23:21:08Z',
   'language': 'en',
   'title': 'This is a test PDF file'}},
 '_type': 'my_type',
 '_version': 1,
 'found': True}

Or with a basic search:

In [15]:
es.search(index='my_index', doc_type='my_type', q='Adobe', _source_exclude=['data'])
Out[15]:
{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': 'AVhvJMC6IvjFWZACJU_u',
    '_index': 'my_index',
    '_score': 0.45930308,
    '_source': {'attachment': {'author': 'cdaily',
      'content': "Adobe Acrobat PDF Files\n\nAdobe® Portable Document Format (PDF) is a universal file format that preserves all\nof the fonts, formatting, colours and graphics of any source document, regardless of\nthe application and platform used to create it.\n\nAdobe PDF is an ideal format for electronic document distribution as it overcomes the\nproblems commonly encountered with electronic file sharing.\n\n•  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat\nReader. Recipients of other file formats sometimes can't open files because they\ndon't have the applications used to create the documents.\n\n•  PDF files always print correctly on any printing device.\n\n•  PDF files always display exactly as created, regardless of fonts, software, and\noperating systems. Fonts, and graphics are not lost due to platform, software, and\nversion incompatibilities.\n\n•  The free Acrobat Reader is easy to download and can be freely distributed by\nanyone.\n\n•  Compact PDF files are smaller than their source files and download a\npage at a time for fast display on the Web.",
      'content_length': 1073,
      'content_type': 'application/pdf',
      'date': '2000-06-28T23:21:08Z',
      'language': 'en',
      'title': 'This is a test PDF file'}},
    '_type': 'my_type'}],
  'max_score': 0.45930308,
  'total': 1},
 'timed_out': False,
 'took': 75}

Of course Elasticsearch allows much more complex queries. But that's something for another time.
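Just to give an idea, a search using the full query DSL instead of the q shortcut could look something like this (a minimal sketch):

body = {
    'query': {
        'match': {
            'attachment.content': 'electronic document distribution'
        }
    }
}
es.search(index='my_index', doc_type='my_type', body=body, _source_exclude=['data'])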

One interesting thing is that by printing the content, we can see that even the layout is quite accurate! Much better than the pdftotext output:

In [16]:
print(doc['_source']['attachment']['content'])
Adobe Acrobat PDF Files

Adobe® Portable Document Format (PDF) is a universal file format that preserves all
of the fonts, formatting, colours and graphics of any source document, regardless of
the application and platform used to create it.

Adobe PDF is an ideal format for electronic document distribution as it overcomes the
problems commonly encountered with electronic file sharing.

•  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they
don't have the applications used to create the documents.

•  PDF files always print correctly on any printing device.

•  PDF files always display exactly as created, regardless of fonts, software, and
operating systems. Fonts, and graphics are not lost due to platform, software, and
version incompatibilities.

•  The free Acrobat Reader is easy to download and can be freely distributed by
anyone.

•  Compact PDF files are smaller than their source files and download a
page at a time for fast display on the Web.

The ingest-attachment plugin uses the Apache text extraction library Tika. It's really powerful. It detects and extracts metadata and text from many file types.

Sending the file directly to Elasticsearch is nice, but in my use case, I'd like to process the file (change its title, move it to a specific location...) based on its content. I could of course update the document in ES after processing it.
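For example, updating the indexed document after processing could look like this (a sketch: the filename and category fields are hypothetical):

es.update(index='my_index', doc_type='my_type', id=result2['_id'],
          body={'doc': {'filename': 'adobe-pdf-sample.pdf', 'category': 'documentation'}})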

It might be better in some cases to decouple the parsing and processing from the indexing. So let's check how to use Tika from Python.

Apache Tika

Tika-Python makes Apache Tika available as a Python library. It can even start a Tika REST server in the background, but this requires Java 7+ to be installed. I prefer to run the server myself using the prebuilt Docker image docker-tikaserver. That way I have control over what is running.

$ docker run --rm -p 9998:9998 logicalspark/docker-tikaserver

We can then set Tika-Python to use Client mode only:

In [17]:
import tika
tika.TikaClientOnly = True
from tika import parser
In [18]:
parsed = parser.from_file('/tmp/pdf-sample.pdf', 'http://localhost:9998/tika')
2016-11-16 22:57:14,233 [MainThread  ] [INFO ]  Starting new HTTP connection (1): localhost
In [19]:
parsed
Out[19]:
{'content': "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThis is a test PDF file\n\n\nAdobe Acrobat PDF Files\n\nAdobe® Portable Document Format (PDF) is a universal file format that preserves all\nof the fonts, formatting, colours and graphics of any source document, regardless of\nthe application and platform used to create it.\n\nAdobe PDF is an ideal format for electronic document distribution as it overcomes the\nproblems commonly encountered with electronic file sharing.\n\n•  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat\nReader. Recipients of other file formats sometimes can't open files because they\ndon't have the applications used to create the documents.\n\n•  PDF files always print correctly on any printing device.\n\n•  PDF files always display exactly as created, regardless of fonts, software, and\noperating systems. Fonts, and graphics are not lost due to platform, software, and\nversion incompatibilities.\n\n•  The free Acrobat Reader is easy to download and can be freely distributed by\nanyone.\n\n•  Compact PDF files are smaller than their source files and download a\npage at a time for fast display on the Web.\n\n\n",
 'metadata': {'Author': 'cdaily',
  'Content-Type': 'application/pdf',
  'Creation-Date': '2000-06-28T23:21:08Z',
  'Last-Modified': '2013-10-28T19:24:13Z',
  'Last-Save-Date': '2013-10-28T19:24:13Z',
  'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
   'org.apache.tika.parser.pdf.PDFParser'],
  'X-TIKA:parse_time_millis': '62',
  'access_permission:assemble_document': 'true',
  'access_permission:can_modify': 'true',
  'access_permission:can_print': 'true',
  'access_permission:can_print_degraded': 'true',
  'access_permission:extract_content': 'true',
  'access_permission:extract_for_accessibility': 'true',
  'access_permission:fill_in_form': 'true',
  'access_permission:modify_annotations': 'true',
  'created': 'Wed Jun 28 23:21:08 UTC 2000',
  'creator': 'cdaily',
  'date': '2013-10-28T19:24:13Z',
  'dc:creator': 'cdaily',
  'dc:format': 'application/pdf; version=1.3',
  'dc:title': 'This is a test PDF file',
  'dcterms:created': '2000-06-28T23:21:08Z',
  'dcterms:modified': '2013-10-28T19:24:13Z',
  'meta:author': 'cdaily',
  'meta:creation-date': '2000-06-28T23:21:08Z',
  'meta:save-date': '2013-10-28T19:24:13Z',
  'modified': '2013-10-28T19:24:13Z',
  'pdf:PDFVersion': '1.3',
  'pdf:docinfo:created': '2000-06-28T23:21:08Z',
  'pdf:docinfo:creator': 'cdaily',
  'pdf:docinfo:creator_tool': 'Microsoft Word 8.0',
  'pdf:docinfo:modified': '2013-10-28T19:24:13Z',
  'pdf:docinfo:producer': 'Acrobat Distiller 4.0 for Windows',
  'pdf:docinfo:title': 'This is a test PDF file',
  'pdf:encrypted': 'false',
  'producer': 'Acrobat Distiller 4.0 for Windows',
  'resourceName': 'pdf-sample.pdf',
  'title': 'This is a test PDF file',
  'xmp:CreatorTool': 'Microsoft Word 8.0',
  'xmpMM:DocumentID': 'uuid:0805e221-80a8-459e-a522-635ed5c1e2e6',
  'xmpTPg:NPages': '1'}}
In [20]:
print(parsed['content'].strip())
This is a test PDF file


Adobe Acrobat PDF Files

Adobe® Portable Document Format (PDF) is a universal file format that preserves all
of the fonts, formatting, colours and graphics of any source document, regardless of
the application and platform used to create it.

Adobe PDF is an ideal format for electronic document distribution as it overcomes the
problems commonly encountered with electronic file sharing.

•  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they
don't have the applications used to create the documents.

•  PDF files always print correctly on any printing device.

•  PDF files always display exactly as created, regardless of fonts, software, and
operating systems. Fonts, and graphics are not lost due to platform, software, and
version incompatibilities.

•  The free Acrobat Reader is easy to download and can be freely distributed by
anyone.

•  Compact PDF files are smaller than their source files and download a
page at a time for fast display on the Web.

I'm not sure why we get the title of the PDF inside the content. Anyway, the text is extracted properly and we even get a lot of metadata:

In [21]:
parsed['metadata']
Out[21]:
{'Author': 'cdaily',
 'Content-Type': 'application/pdf',
 'Creation-Date': '2000-06-28T23:21:08Z',
 'Last-Modified': '2013-10-28T19:24:13Z',
 'Last-Save-Date': '2013-10-28T19:24:13Z',
 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
  'org.apache.tika.parser.pdf.PDFParser'],
 'X-TIKA:parse_time_millis': '62',
 'access_permission:assemble_document': 'true',
 'access_permission:can_modify': 'true',
 'access_permission:can_print': 'true',
 'access_permission:can_print_degraded': 'true',
 'access_permission:extract_content': 'true',
 'access_permission:extract_for_accessibility': 'true',
 'access_permission:fill_in_form': 'true',
 'access_permission:modify_annotations': 'true',
 'created': 'Wed Jun 28 23:21:08 UTC 2000',
 'creator': 'cdaily',
 'date': '2013-10-28T19:24:13Z',
 'dc:creator': 'cdaily',
 'dc:format': 'application/pdf; version=1.3',
 'dc:title': 'This is a test PDF file',
 'dcterms:created': '2000-06-28T23:21:08Z',
 'dcterms:modified': '2013-10-28T19:24:13Z',
 'meta:author': 'cdaily',
 'meta:creation-date': '2000-06-28T23:21:08Z',
 'meta:save-date': '2013-10-28T19:24:13Z',
 'modified': '2013-10-28T19:24:13Z',
 'pdf:PDFVersion': '1.3',
 'pdf:docinfo:created': '2000-06-28T23:21:08Z',
 'pdf:docinfo:creator': 'cdaily',
 'pdf:docinfo:creator_tool': 'Microsoft Word 8.0',
 'pdf:docinfo:modified': '2013-10-28T19:24:13Z',
 'pdf:docinfo:producer': 'Acrobat Distiller 4.0 for Windows',
 'pdf:docinfo:title': 'This is a test PDF file',
 'pdf:encrypted': 'false',
 'producer': 'Acrobat Distiller 4.0 for Windows',
 'resourceName': 'pdf-sample.pdf',
 'title': 'This is a test PDF file',
 'xmp:CreatorTool': 'Microsoft Word 8.0',
 'xmpMM:DocumentID': 'uuid:0805e221-80a8-459e-a522-635ed5c1e2e6',
 'xmpTPg:NPages': '1'}

Conclusion

We saw different methods to extract text from PDF in Python. Depending on what you want to do, one might suit you better. And this was of course not exhaustive.

If you want to index PDFs, Elasticsearch might be all you need. The ingest-attachment plugin uses Apache Tika which is very powerful.

And thanks to Tika-Python, it's very easy to use Tika directly from Python. You can let the library start the server for you or use Docker to run your own.

GitLab Container Registry and proxy

GitLab on Synology

I installed GitLab CE on a Synology RackStation RS815+ at work. It has an Intel Atom C2538 processor, which makes it possible to run Docker on the NAS.

Official GitLab Community Edition docker images are available on Docker Hub. The documentation to use the image is quite clear and can be found here.

Ports 80 and 443 are already used by the nginx server that comes with DSM. I wanted to access GitLab using HTTPS, so I disabled port 443 in the nginx configuration. To do that, I had to modify the template /usr/syno/share/nginx/WWWService.mustache and reboot the NAS:

--- WWWService.mustache.org 2016-08-16 23:25:06.000000000 +0100
+++ WWWService.mustache 2016-09-19 13:53:45.256735700 +0100
@@ -1,8 +1,6 @@
 server {
     listen 80 default_server{{#reuseport}} reuseport{{/reuseport}};
     listen [::]:80 default_server{{#reuseport}} reuseport{{/reuseport}};
-    listen 443 default_server ssl{{#reuseport}} reuseport{{/reuseport}};
-    listen [::]:443 default_server ssl{{#reuseport}} reuseport{{/reuseport}};

     server_name _;

Port 22 is also already used by the SSH daemon, so I decided to use port 2222 instead. I created the directory /volume1/docker/gitlab to store all GitLab data. Here are the required variables in the /volume1/docker/gitlab/config/gitlab.rb config file:

external_url "https://mygitlab.example.com"

## GitLab Shell settings for GitLab
gitlab_rails['gitlab_shell_ssh_port'] = 2222

nginx['enable'] = true
nginx['redirect_http_to_https'] = true

And this is how I run the image:

docker run --detach \
    --hostname mygitlab.example.com \
    --publish 443:443 --publish 8080:80 --publish 2222:22 \
    --name gitlab \
    --restart always \
    --volume /volume1/docker/gitlab/config:/etc/gitlab \
    --volume /volume1/docker/gitlab/logs:/var/log/gitlab \
    --volume /volume1/docker/gitlab/data:/var/opt/gitlab \
    gitlab/gitlab-ce:latest

This has been working fine. Since I heard about the GitLab Container Registry, I've been wanting to give it a try.

GitLab Container Registry

To enable it, I just added the registry URL to my gitlab.rb file:

registry_external_url 'https://mygitlab.example.com:4567'

I use the existing GitLab domain with port 4567 for the registry. The TLS certificate and key are in the default path, so there is no need to specify them.

So let's restart GitLab. Don't forget to publish the new port 4567!

$ docker stop gitlab
$ docker rm gitlab
$ docker run --detach \
    --hostname mygitlab.example.com \
    --publish 443:443 --publish 8080:80 --publish 2222:22 \
    --publish 4567:4567 \
    --name gitlab \
    --restart always \
    --volume /volume1/docker/gitlab/config:/etc/gitlab \
    --volume /volume1/docker/gitlab/logs:/var/log/gitlab \
    --volume /volume1/docker/gitlab/data:/var/opt/gitlab \
    gitlab/gitlab-ce:latest

Easy! Let's test our new docker registry!

$ docker login mygitlab.example.com:4567
Username: user
Password:
Error response from daemon: Get https://mygitlab.example.com:4567/v1/users/: Service Unavailable

Hmm... Not a very useful error... I did publish port 4567 in docker, so what is happening? After looking through the logs, I found /volume1/docker/gitlab/logs/nginx/gitlab_registry_access.log. It's empty... Let's try curl:

$ curl https://mygitlab.example.com:4567/v1/users/

curl: (60) Peer certificate cannot be authenticated with known CA certificates
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.

OK, I have a self-signed certificate. So let's try with --insecure:

$ curl --insecure https://mygitlab.example.com:4567/v1/users/
404 page not found

At least I get an entry in my log file:

$ cd /volume1/docker/gitlab
$ cat logs/nginx/gitlab_registry_access.log
xxx.xx.x.x - - [21/Sep/2016:14:24:57 +0000] "GET /v1/users/ HTTP/1.1" 404 19 "-" "curl/7.43.0"

So, docker and nginx seem to be configured properly... It looks like docker login is not even trying to access my host...

Let's try with a dummy host:

$ docker login foo
Username: user
Password:
Error response from daemon: Get https://mygitlab.example.com:4567/v1/users/: Service Unavailable

Same error! Why is that? I can ping mygitlab.example.com and even access nginx on port 4567 (using curl) inside the docker container... My machine is on the same network. It can't be a proxy problem. Wait. Proxy?

That's when I remembered I had configured my docker daemon to use a proxy to access the internet! I created the file /etc/systemd/system/docker.service.d/http-proxy.conf with:

[Service]
Environment="HTTP_PROXY=http://proxy.example.com:8080/"

Reading the docker documentation, it's very clear: "If you have internal Docker registries that you need to contact without proxying you can specify them via the NO_PROXY environment variable".

Let's add the NO_PROXY variable:

[Service]
Environment="HTTP_PROXY=http://proxy.example.com:8080/" "NO_PROXY=localhost,127.0.0.1,mygitlab.example.com"

Flush the changes and restart the docker daemon:

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker

Now let's try to login again:

$ docker login mygitlab.example.com:4567
Username: user
Password:
Error response from daemon: Get https://mygitlab.example.com:4567/v1/users/: x509: certificate signed by unknown authority

This error is easy to fix (after googling). I have to add the self-signed certificate at the OS level. On my Ubuntu machine:

$ sudo cp mygitlab.example.com.crt /usr/local/share/ca-certificates/
$ sudo update-ca-certificates
$ sudo systemctl restart docker

$ docker login mygitlab.example.com:4567
Username: user
Password:
Login Succeeded

Yes! :-)

I can now push docker images to my GitLab Container Registry!

Conclusion

Setting up the GitLab Container Registry should have been easy, but my proxy settings made me lose quite some time... The proxy environment variables (HTTP_PROXY, NO_PROXY...) are not taken into account by the docker commands. The docker daemon has to be configured specifically. Something to remember!

Note that this was with docker 1.11.2. When trying the same command on my Mac with docker 1.12.1, I got a nicer error message:

$ docker --version
Docker version 1.12.1, build 6f9534c
$ docker login foo
Username: user
Password:
Error response from daemon: Get https://foo/v1/users/: dial tcp: lookup foo on xxx.xxx.xx.x:53: no such host