Parsing HTML Tables in Python with pandas

Not long ago, I needed to parse some HTML tables from our confluence website at work. I first thought: I'm gonna need requests and BeautifulSoup. As HTML tables are well defined, I did some quick googling to see if there was some recipe or lib to parse them and I found a link to pandas. What? Can pandas do that too?

I have been using pandas for quite some time and have used read_csv, read_excel, even read_sql, but I had missed read_html!

Reading excel file with pandas

Before to look at HTML tables, I want to show a quick example on how to read an excel file with pandas. The API is really nice. If I have to look at some excel data, I go directly to pandas.

So let's download a sample file file:

In [1]:
import io
import requests
import pandas as pd
from zipfile import ZipFile
In [2]:
r = requests.get('http://www.contextures.com/SampleData.zip')
ZipFile(io.BytesIO(r.content)).extractall()

This created the SampleData.xlsx file that includes four sheets: Instructions, SalesOrders, SampleNumbers and MyLinks. Only the SalesOrders sheet includes tabular data: SampleData So let's read it.

In [3]:
df = pd.read_excel('SampleData.xlsx', sheet_name='SalesOrders')
In [4]:
df.head()
Out[4]:
OrderDate Region Rep Item Units Unit Cost Total
0 2016-01-06 East Jones Pencil 95 1.99 189.05
1 2016-01-23 Central Kivell Binder 50 19.99 999.50
2 2016-02-09 Central Jardine Pencil 36 4.99 179.64
3 2016-02-26 Central Gill Pen 27 19.99 539.73
4 2016-03-15 West Sorvino Pencil 56 2.99 167.44

That's it. One line and you have your data in a DataFrame that you can easily manipulate, filter, convert and display in a jupyter notebook. Can it be easier than that?

Parsing HTML Tables

So let's go back to HTML tables and look at pandas.read_html.

The function accepts:

A URL, a file-like object, or a raw string containing HTML.

Let's start with a basic HTML table in a raw string.

Parsing raw string

In [5]:
html_string = """
<table>
  <thead>
    <tr>
      <th>Programming Language</th>
      <th>Creator</th> 
      <th>Year</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C</td>
      <td>Dennis Ritchie</td> 
      <td>1972</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>Guido Van Rossum</td> 
      <td>1989</td>
    </tr>
    <tr>
      <td>Ruby</td>
      <td>Yukihiro Matsumoto</td> 
      <td>1995</td>
    </tr>
  </tbody>
</table>
"""

We can render the table using IPython display_html function:

In [6]:
from IPython.display import display_html
display_html(html_string, raw=True)
Programming Language Creator Year
C Dennis Ritchie 1972
Python Guido Van Rossum 1989
Ruby Yukihiro Matsumoto 1995

Let's import this HTML table in a DataFrame. Note that the function read_html always returns a list of DataFrame objects:

In [7]:
dfs = pd.read_html(html_string)
dfs
Out[7]:
[  Programming Language             Creator  Year
 0                    C      Dennis Ritchie  1972
 1               Python    Guido Van Rossum  1989
 2                 Ruby  Yukihiro Matsumoto  1995]
In [8]:
df = dfs[0]
df
Out[8]:
Programming Language Creator Year
0 C Dennis Ritchie 1972
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995

This looks quite similar to the raw string we rendered above, but we are printing a pandas DataFrame object here! We can apply any operation we want.

In [9]:
df[df.Year > 1975]
Out[9]:
Programming Language Creator Year
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995

Pandas automatically found the header to use thanks to the <thead> tag. It is not mandatory to define a table and is actually often missing on the web. So what happens if it's not present?

In [10]:
html_string = """
<table>
  <tr>
    <th>Programming Language</th>
    <th>Creator</th> 
    <th>Year</th>
  </tr>
  <tr>
    <td>C</td>
    <td>Dennis Ritchie</td> 
    <td>1972</td>
  </tr>
  <tr>
    <td>Python</td>
    <td>Guido Van Rossum</td> 
    <td>1989</td>
  </tr>
  <tr>
    <td>Ruby</td>
    <td>Yukihiro Matsumoto</td> 
    <td>1995</td>
  </tr>
</table>
"""
In [11]:
pd.read_html(html_string)[0]
Out[11]:
0 1 2
0 Programming Language Creator Year
1 C Dennis Ritchie 1972
2 Python Guido Van Rossum 1989
3 Ruby Yukihiro Matsumoto 1995

In this case, we need to pass the row number to use as header.

In [12]:
pd.read_html(html_string, header=0)[0]
Out[12]:
Programming Language Creator Year
0 C Dennis Ritchie 1972
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995

Parsing a http URL

The same data we read in our excel file is available in a table at the following address: http://www.contextures.com/xlSampleData01.html

Let's pass this url to read_html:

In [13]:
dfs = pd.read_html('http://www.contextures.com/xlSampleData01.html')
In [14]:
dfs
Out[14]:
[             0        1         2        3      4         5        6
 0    OrderDate   Region       Rep     Item  Units  UnitCost    Total
 1     1/6/2016     East     Jones   Pencil     95      1.99   189.05
 2    1/23/2016  Central    Kivell   Binder     50     19.99   999.50
 3     2/9/2016  Central   Jardine   Pencil     36      4.99   179.64
 4    2/26/2016  Central      Gill      Pen     27     19.99   539.73
 5    3/15/2016     West   Sorvino   Pencil     56      2.99   167.44
 6     4/1/2016     East     Jones   Binder     60      4.99   299.40
 7    4/18/2016  Central   Andrews   Pencil     75      1.99   149.25
 8     5/5/2016  Central   Jardine   Pencil     90      4.99   449.10
 9    5/22/2016     West  Thompson   Pencil     32      1.99    63.68
 10    6/8/2016     East     Jones   Binder     60      8.99   539.40
 11   6/25/2016  Central    Morgan   Pencil     90      4.99   449.10
 12   7/12/2016     East    Howard   Binder     29      1.99    57.71
 13   7/29/2016     East    Parent   Binder     81     19.99  1619.19
 14   8/15/2016     East     Jones   Pencil     35      4.99   174.65
 15    9/1/2016  Central     Smith     Desk      2    125.00   250.00
 16   9/18/2016     East     Jones  Pen Set     16     15.99   255.84
 17   10/5/2016  Central    Morgan   Binder     28      8.99   251.72
 18  10/22/2016     East     Jones      Pen     64      8.99   575.36
 19   11/8/2016     East    Parent      Pen     15     19.99   299.85
 20  11/25/2016  Central    Kivell  Pen Set     96      4.99   479.04
 21  12/12/2016  Central     Smith   Pencil     67      1.29    86.43
 22  12/29/2016     East    Parent  Pen Set     74     15.99  1183.26
 23   1/15/2017  Central      Gill   Binder     46      8.99   413.54
 24    2/1/2017  Central     Smith   Binder     87     15.00  1305.00
 25   2/18/2017     East     Jones   Binder      4      4.99    19.96
 26    3/7/2017     West   Sorvino   Binder      7     19.99   139.93
 27   3/24/2017  Central   Jardine  Pen Set     50      4.99   249.50
 28   4/10/2017  Central   Andrews   Pencil     66      1.99   131.34
 29   4/27/2017     East    Howard      Pen     96      4.99   479.04
 30   5/14/2017  Central      Gill   Pencil     53      1.29    68.37
 31   5/31/2017  Central      Gill   Binder     80      8.99   719.20
 32   6/17/2017  Central    Kivell     Desk      5    125.00   625.00
 33    7/4/2017     East     Jones  Pen Set     62      4.99   309.38
 34   7/21/2017  Central    Morgan  Pen Set     55     12.49   686.95
 35    8/7/2017  Central    Kivell  Pen Set     42     23.95  1005.90
 36   8/24/2017     West   Sorvino     Desk      3    275.00   825.00
 37   9/10/2017  Central      Gill   Pencil      7      1.29     9.03
 38   9/27/2017     West   Sorvino      Pen     76      1.99   151.24
 39  10/14/2017     West  Thompson   Binder     57     19.99  1139.43
 40  10/31/2017  Central   Andrews   Pencil     14      1.29    18.06
 41  11/17/2017  Central   Jardine   Binder     11      4.99    54.89
 42   12/4/2017  Central   Jardine   Binder     94     19.99  1879.06
 43  12/21/2017  Central   Andrews   Binder     28      4.99   139.72]

We have one table and can see that we need to pass the row number to use as header (because <thead> is not present).

In [15]:
dfs = pd.read_html('http://www.contextures.com/xlSampleData01.html', header=0)
dfs[0].head()
Out[15]:
OrderDate Region Rep Item Units UnitCost Total
0 1/6/2016 East Jones Pencil 95 1.99 189.05
1 1/23/2016 Central Kivell Binder 50 19.99 999.50
2 2/9/2016 Central Jardine Pencil 36 4.99 179.64
3 2/26/2016 Central Gill Pen 27 19.99 539.73
4 3/15/2016 West Sorvino Pencil 56 2.99 167.44

Nice!

Parsing a https URL

The documentation states that:

Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

This is true, but bs4 + html5lib are used as a fallback when lxml fails. I guess this is why passing a https url does work. We can confirm that with a wikipedia page.

In [16]:
pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)', header=0)[1]
Out[16]:
Type mutable Description Syntax example
0 bool immutable Boolean value True False
1 bytearray mutable Sequence of bytes bytearray(b'Some ASCII') bytearray(b"Some ASCI...
2 bytes immutable Sequence of bytes b'Some ASCII' b"Some ASCII" bytes([119, 105, 1...
3 complex immutable Complex number with real and imaginary parts 3+2.7j
4 dict mutable Associative array (or dictionary) of key and v... {'key1': 1.0, 3: False}
5 ellipsis NaN An ellipsis placeholder to be used as an index... ...
6 float immutable Floating point number, system-defined precision 3.1415927
7 frozenset immutable Unordered set, contains no duplicates; can con... frozenset([4.0, 'string', True])
8 int immutable Integer of unlimited magnitude[76] 42
9 list mutable List, can contain mixed types [4.0, 'string', True]
10 set mutable Unordered set, contains no duplicates; can con... {4.0, 'string', True}
11 str immutable A character string: sequence of Unicode codepo... 'Wikipedia' "Wikipedia" """Spanning multiple l...
12 tuple immutable Can contain mixed types (4.0, 'string', True)But we can append element...

But what if the url requires authentiation?

In that case we can use requests to get the HTML and pass the string to pandas!

To demonstrate authentication, we can use http://httpbin.org

We can first confirm that passing a url that requires authentication raises a 401

In [17]:
pd.read_html('https://httpbin.org/basic-auth/myuser/mypasswd')
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-17-7e6b50c9f1f3> in <module>()
----> 1 pd.read_html('https://httpbin.org/basic-auth/myuser/mypasswd')

~/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
    913                   thousands=thousands, attrs=attrs, encoding=encoding,
    914                   decimal=decimal, converters=converters, na_values=na_values,
--> 915                   keep_default_na=keep_default_na)

~/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
    747             break
    748     else:
--> 749         raise_with_traceback(retained)
    750 
    751     ret = []

~/miniconda3/envs/jupyter/lib/python3.6/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    383         if traceback == Ellipsis:
    384             _, _, traceback = sys.exc_info()
--> 385         raise exc.with_traceback(traceback)
    386 else:
    387     # this version of raise is a syntax error in Python 3

HTTPError: HTTP Error 401: UNAUTHORIZED
In [ ]:
r = requests.get('https://httpbin.org/basic-auth/myuser/mypasswd')
r.status_code

Yes, as expected. Let's pass the username and password with requests.

In [ ]:
r = requests.get('https://httpbin.org/basic-auth/myuser/mypasswd', auth=('myuser', 'mypasswd'))
r.status_code

We could now pass r.text to pandas. http://httpbin.org was used to demonstrate authentication but it only returns JSON-encoded responses and no HTML. It's a testing service. So it doesn't make sense here.

The following example shows how to combine requests and pandas.

In [18]:
r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
pd.read_html(r.text, header=0)[1]
Out[18]:
Type mutable Description Syntax example
0 bool immutable Boolean value True False
1 bytearray mutable Sequence of bytes bytearray(b'Some ASCII') bytearray(b"Some ASCI...
2 bytes immutable Sequence of bytes b'Some ASCII' b"Some ASCII" bytes([119, 105, 1...
3 complex immutable Complex number with real and imaginary parts 3+2.7j
4 dict mutable Associative array (or dictionary) of key and v... {'key1': 1.0, 3: False}
5 ellipsis NaN An ellipsis placeholder to be used as an index... ...
6 float immutable Floating point number, system-defined precision 3.1415927
7 frozenset immutable Unordered set, contains no duplicates; can con... frozenset([4.0, 'string', True])
8 int immutable Integer of unlimited magnitude[76] 42
9 list mutable List, can contain mixed types [4.0, 'string', True]
10 set mutable Unordered set, contains no duplicates; can con... {4.0, 'string', True}
11 str immutable A character string: sequence of Unicode codepo... 'Wikipedia' "Wikipedia" """Spanning multiple l...
12 tuple immutable Can contain mixed types (4.0, 'string', True)But we can append element...

A more complex example

We looked at some quite simple examples so far. So let's try a page with several tables: https://en.wikipedia.org/wiki/Timeline_of_programming_languages

In [19]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages')
In [20]:
len(dfs)
Out[20]:
13

If we look at the page we have 8 tables (one per decade). Looking at our dfs list, we can see that the first interesting table is the fifth one and that we need to pass the row to use as header.

In [21]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0)
dfs[4]
Out[21]:
Year Name Chief developer, company Predecessor(s)
0 1943–45 Plankalkül (concept) Konrad Zuse none (unique language)
1 1943–46 ENIAC coding system John von Neumann, John Mauchly, J. Presper Eck... none (unique language)
2 1946 ENIAC Short Code Richard Clippinger, John von Neumann after Ala... ENIAC coding system
3 1946 Von Neumann and Goldstine graphing system (Not... John von Neumann and Herman Goldstine ENIAC coding system
4 1947 ARC Assembly Kathleen Booth[1][2] ENIAC coding system
5 1948 CPC Coding scheme Howard H. Aiken Analytical Engine order code
6 1948 Curry notation system Haskell Curry ENIAC coding system
7 1948 Plankalkül (concept published) Konrad Zuse none (unique language)
8 1949 Short Code John Mauchly and William F. Schmitt ENIAC Short Code
9 Year Name Chief developer, company Predecessor(s)

Notice that the header was repeated in the last row (to make the table easier to read on the HTML page). We can filter that after concatenating together the 8 tables to get one DataFrame.

In [22]:
df = pd.concat(dfs[4:12])
df
Out[22]:
Year Name Chief developer, company Predecessor(s)
0 1943–45 Plankalkül (concept) Konrad Zuse none (unique language)
1 1943–46 ENIAC coding system John von Neumann, John Mauchly, J. Presper Eck... none (unique language)
2 1946 ENIAC Short Code Richard Clippinger, John von Neumann after Ala... ENIAC coding system
3 1946 Von Neumann and Goldstine graphing system (Not... John von Neumann and Herman Goldstine ENIAC coding system
4 1947 ARC Assembly Kathleen Booth[1][2] ENIAC coding system
5 1948 CPC Coding scheme Howard H. Aiken Analytical Engine order code
6 1948 Curry notation system Haskell Curry ENIAC coding system
7 1948 Plankalkül (concept published) Konrad Zuse none (unique language)
8 1949 Short Code John Mauchly and William F. Schmitt ENIAC Short Code
9 Year Name Chief developer, company Predecessor(s)
0 1950 Short Code William F Schmidt, Albert B. Tonik,[3] J.R. Logan Brief Code
1 1950 Birkbeck Assembler Kathleen Booth ARC
2 1951 Superplan Heinz Rutishauser Plankalkül
3 1951 ALGAE Edward A Voorhees and Karl Balke none (unique language)
4 1951 Intermediate Programming Language Arthur Burks Short Code
5 1951 Regional Assembly Language Maurice Wilkes EDSAC
6 1951 Boehm unnamed coding system Corrado Böhm CPC Coding scheme
7 1951 Klammerausdrücke Konrad Zuse Plankalkül
8 1951 OMNIBAC Symbolic Assembler Charles Katz Short Code
9 1951 Stanislaus (Notation) Fritz Bauer none (unique language)
10 1951 Whirlwind assembler Charles Adams and Jack Gilmore at MIT Project ... EDSAC
11 1951 Rochester assembler Nat Rochester EDSAC
12 1951 Sort Merge Generator Betty Holberton none (unique language)
13 1952 A-0 Grace Hopper Short Code
14 1952 Glennie Autocode Alick Glennie after Alan Turing CPC Coding scheme
15 1952 Editing Generator Milly Koss SORT/MERGE
16 1952 COMPOOL RAND/SDC none (unique language)
17 1953 Speedcoding John W. Backus none (unique language)
18 1953 READ/PRINT Don Harroff, James Fishman, George Ryckman none (unique language)
19 1954 Laning and Zierler system Laning, Zierler, Adams at MIT Project Whirlwind none (unique language)
... ... ... ... ...
47 2009 Chapel Brad Chamberlain, Cray Inc. HPF, ZPL
48 2009 Go Google C, Oberon, Limbo, Smalltalk
49 2009 CoffeeScript Jeremy Ashkenas JavaScript, Ruby, Python, Haskell
50 2009 Idris Edwin Brady Haskell, Agda, Coq
51 2009 Parasail S. Tucker Taft, AdaCore Modula, Ada, Pascal, ML
52 2009 Whiley David J. Pearce Java, C, Python
53 Year Name Chief developer, company Predecessor(s)
0 2010 Rust Graydon Hoare, Mozilla Alef, C++, Camlp4, Erlang, Hermes, Limbo, Napi...
1 2011 Ceylon Gavin King, Red Hat Java
2 2011 Dart Google Java, JavaScript, CoffeeScript, Go
3 2011 C++11 C++ ISO/IEC 14882:2011 C++, Standard C, C
4 2011 Kotlin JetBrains Java, Scala, Groovy, C#, Gosu
5 2011 Red Nenad Rakocevic Rebol, Scala, Lua
6 2011 Opa MLstate OCaml, Erlang, JavaScript
7 2012 Elixir José Valim Erlang, Ruby, Clojure
8 2012 Elm Evan Czaplicki Haskell, Standard ML, OCaml, F#
9 2012 TypeScript Anders Hejlsberg, Microsoft JavaScript, CoffeeScript
10 2012 Julia Jeff Bezanson, Stefan Karpinski, Viral Shah, A... MATLAB, Lisp, C, Fortran, Mathematica[9] (stri...
11 2012 P Vivek Gupta: not the politician, Ethan Jackson... NaN
12 2012 Ada 2012 ARA and Ada Europe (ISO/IEC 8652:2012) Ada 2005, ISO/IEC 8652:1995/Amd 1:2007
13 2014 Crystal Ary Borenszweig, Manas Technology Solutions Ruby, C, Rust, Go, C#, Python
14 2014 Hack Facebook PHP
15 2014 Swift Apple Inc. Objective-C, Rust, Haskell, Ruby, Python, C#, CLU
16 2014 C++14 C++ ISO/IEC 14882:2014 C++, Standard C, C
17 2015 Atari 2600 SuperCharger BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...
18 2015 Perl 6 The Rakudo Team Perl, Haskell, Python, Ruby
19 2016 Ring Mahmoud Fayed Lua, Python, Ruby, C, C#, BASIC, QML, xBase, S...
20 2017 C++17 C++ ISO/IEC 14882:2017 C++, Standard C, C
21 2017 Atari 2600 Flashback BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...
22 Year Name Chief developer, company Predecessor(s)

388 rows × 4 columns

Remove the extra header rows.

In [23]:
prog_lang = df[df.Year != 'Year']
prog_lang
Out[23]:
Year Name Chief developer, company Predecessor(s)
0 1943–45 Plankalkül (concept) Konrad Zuse none (unique language)
1 1943–46 ENIAC coding system John von Neumann, John Mauchly, J. Presper Eck... none (unique language)
2 1946 ENIAC Short Code Richard Clippinger, John von Neumann after Ala... ENIAC coding system
3 1946 Von Neumann and Goldstine graphing system (Not... John von Neumann and Herman Goldstine ENIAC coding system
4 1947 ARC Assembly Kathleen Booth[1][2] ENIAC coding system
5 1948 CPC Coding scheme Howard H. Aiken Analytical Engine order code
6 1948 Curry notation system Haskell Curry ENIAC coding system
7 1948 Plankalkül (concept published) Konrad Zuse none (unique language)
8 1949 Short Code John Mauchly and William F. Schmitt ENIAC Short Code
0 1950 Short Code William F Schmidt, Albert B. Tonik,[3] J.R. Logan Brief Code
1 1950 Birkbeck Assembler Kathleen Booth ARC
2 1951 Superplan Heinz Rutishauser Plankalkül
3 1951 ALGAE Edward A Voorhees and Karl Balke none (unique language)
4 1951 Intermediate Programming Language Arthur Burks Short Code
5 1951 Regional Assembly Language Maurice Wilkes EDSAC
6 1951 Boehm unnamed coding system Corrado Böhm CPC Coding scheme
7 1951 Klammerausdrücke Konrad Zuse Plankalkül
8 1951 OMNIBAC Symbolic Assembler Charles Katz Short Code
9 1951 Stanislaus (Notation) Fritz Bauer none (unique language)
10 1951 Whirlwind assembler Charles Adams and Jack Gilmore at MIT Project ... EDSAC
11 1951 Rochester assembler Nat Rochester EDSAC
12 1951 Sort Merge Generator Betty Holberton none (unique language)
13 1952 A-0 Grace Hopper Short Code
14 1952 Glennie Autocode Alick Glennie after Alan Turing CPC Coding scheme
15 1952 Editing Generator Milly Koss SORT/MERGE
16 1952 COMPOOL RAND/SDC none (unique language)
17 1953 Speedcoding John W. Backus none (unique language)
18 1953 READ/PRINT Don Harroff, James Fishman, George Ryckman none (unique language)
19 1954 Laning and Zierler system Laning, Zierler, Adams at MIT Project Whirlwind none (unique language)
20 1954 Mark I Autocode Tony Brooker Glennie Autocode
... ... ... ... ...
45 2008 Genie Jamie McCracken Python, Boo, D, Object Pascal
46 2008 Pure Albert Gräf Q
47 2009 Chapel Brad Chamberlain, Cray Inc. HPF, ZPL
48 2009 Go Google C, Oberon, Limbo, Smalltalk
49 2009 CoffeeScript Jeremy Ashkenas JavaScript, Ruby, Python, Haskell
50 2009 Idris Edwin Brady Haskell, Agda, Coq
51 2009 Parasail S. Tucker Taft, AdaCore Modula, Ada, Pascal, ML
52 2009 Whiley David J. Pearce Java, C, Python
0 2010 Rust Graydon Hoare, Mozilla Alef, C++, Camlp4, Erlang, Hermes, Limbo, Napi...
1 2011 Ceylon Gavin King, Red Hat Java
2 2011 Dart Google Java, JavaScript, CoffeeScript, Go
3 2011 C++11 C++ ISO/IEC 14882:2011 C++, Standard C, C
4 2011 Kotlin JetBrains Java, Scala, Groovy, C#, Gosu
5 2011 Red Nenad Rakocevic Rebol, Scala, Lua
6 2011 Opa MLstate OCaml, Erlang, JavaScript
7 2012 Elixir José Valim Erlang, Ruby, Clojure
8 2012 Elm Evan Czaplicki Haskell, Standard ML, OCaml, F#
9 2012 TypeScript Anders Hejlsberg, Microsoft JavaScript, CoffeeScript
10 2012 Julia Jeff Bezanson, Stefan Karpinski, Viral Shah, A... MATLAB, Lisp, C, Fortran, Mathematica[9] (stri...
11 2012 P Vivek Gupta: not the politician, Ethan Jackson... NaN
12 2012 Ada 2012 ARA and Ada Europe (ISO/IEC 8652:2012) Ada 2005, ISO/IEC 8652:1995/Amd 1:2007
13 2014 Crystal Ary Borenszweig, Manas Technology Solutions Ruby, C, Rust, Go, C#, Python
14 2014 Hack Facebook PHP
15 2014 Swift Apple Inc. Objective-C, Rust, Haskell, Ruby, Python, C#, CLU
16 2014 C++14 C++ ISO/IEC 14882:2014 C++, Standard C, C
17 2015 Atari 2600 SuperCharger BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...
18 2015 Perl 6 The Rakudo Team Perl, Haskell, Python, Ruby
19 2016 Ring Mahmoud Fayed Lua, Python, Ruby, C, C#, BASIC, QML, xBase, S...
20 2017 C++17 C++ ISO/IEC 14882:2017 C++, Standard C, C
21 2017 Atari 2600 Flashback BASIC Microsoft sponsored think tank RelationalFrame... BASIC, Dartmouth BASIC (compiled programming l...

380 rows × 4 columns

In what year was Python created?

In [24]:
prog_lang[prog_lang.Name == 'Python']
Out[24]:
Year Name Chief developer, company Predecessor(s)
9 1991 Python Guido van Rossum ABC, ALGOL 68, Icon, Modula-3

Conclusion

The last example should say it all.

In [25]:
import pandas as pd

dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0)
df = pd.concat(dfs[4:12])
prog_lang = df[df.Year != 'Year']

Four lines of code (including the import) and we have one DataFrame containing the data from 8 different HTML tables on one wikipedia page!

Do I need to say why I love Python and pandas? :-)

This post was written in a jupyter notebook. You can find the notebook on GitHub and download the conda environment.yml file to get all the dependencies I used.

In [ ]:
 

Comments

Comments powered by Disqus