Parsing HTML Tables in Python with pandas
Not long ago, I needed to parse some HTML tables from our Confluence site at work. My first thought was: I'm going to need requests and BeautifulSoup. But HTML tables are well defined, so I did some quick googling to see if there was a recipe or library to parse them, and I found a link to pandas. What? Can pandas do that too?
I have been using pandas for quite some time and have used read_csv, read_excel, even read_sql, but I had missed read_html!
Reading an Excel file with pandas¶
Before looking at HTML tables, I want to show a quick example of how to read an Excel file with pandas. The API is really nice. If I have to look at some Excel data, I go directly to pandas.
So let's download a sample file:
import io
import requests
import pandas as pd
from zipfile import ZipFile

# Download the zip archive and extract its content in the current directory
r = requests.get('http://www.contextures.com/SampleData.zip')
ZipFile(io.BytesIO(r.content)).extractall()
This created the SampleData.xlsx file, which includes four sheets: Instructions, SalesOrders, SampleNumbers and MyLinks. Only the SalesOrders sheet includes tabular data, so let's read it.
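Before we do, a quick aside: if we didn't already know the workbook's layout, pandas itself could list the sheet names for us. A minimal check using pd.ExcelFile:
pd.ExcelFile('SampleData.xlsx').sheet_names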
df = pd.read_excel('SampleData.xlsx', sheet_name='SalesOrders')
df.head()
That's it. One line and you have your data in a DataFrame that you can easily manipulate, filter, convert and display in a Jupyter notebook. Can it be easier than that?
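For example, summing the sales per region only takes one more line. Note that the Region and Total column names are taken from the sample workbook as it existed when I downloaded it, so they might differ if the file changes upstream:
df.groupby('Region')['Total'].sum()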
Parsing HTML Tables¶
So let's go back to HTML tables and look at pandas.read_html.
The function accepts:
A URL, a file-like object, or a raw string containing HTML.
Let's start with a basic HTML table in a raw string.
Parsing raw string¶
html_string = """
<table>
<thead>
<tr>
<th>Programming Language</th>
<th>Creator</th>
<th>Year</th>
</tr>
</thead>
<tbody>
<tr>
<td>C</td>
<td>Dennis Ritchie</td>
<td>1972</td>
</tr>
<tr>
<td>Python</td>
<td>Guido Van Rossum</td>
<td>1989</td>
</tr>
<tr>
<td>Ruby</td>
<td>Yukihiro Matsumoto</td>
<td>1995</td>
</tr>
</tbody>
</table>
"""
We can render the table using IPython's display_html function:
from IPython.display import display_html
display_html(html_string, raw=True)
Let's import this HTML table into a DataFrame. Note that the function read_html always returns a list of DataFrame objects:
dfs = pd.read_html(html_string)
dfs
df = dfs[0]
df
This looks quite similar to the raw string we rendered above, but we are printing a pandas DataFrame object here! We can apply any operation we want.
df[df.Year > 1975]
Pandas automatically found the header to use thanks to the <thead> tag. This tag is not mandatory when defining a table, and it is actually often missing on the web. So what happens if it's not present?
html_string = """
<table>
<tr>
<th>Programming Language</th>
<th>Creator</th>
<th>Year</th>
</tr>
<tr>
<td>C</td>
<td>Dennis Ritchie</td>
<td>1972</td>
</tr>
<tr>
<td>Python</td>
<td>Guido Van Rossum</td>
<td>1989</td>
</tr>
<tr>
<td>Ruby</td>
<td>Yukihiro Matsumoto</td>
<td>1995</td>
</tr>
</table>
"""
pd.read_html(html_string)[0]
In this case, we need to pass the row number to use as header.
pd.read_html(html_string, header=0)[0]
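As mentioned above, read_html accepts a file-like object as well as a raw string. Just as a sketch, we can wrap our string in io.StringIO (reusing the io module imported earlier) and get the same result:
pd.read_html(io.StringIO(html_string), header=0)[0]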
Parsing an http URL¶
The same data we read from our Excel file is available in a table at the following address: http://www.contextures.com/xlSampleData01.html
Let's pass this URL to read_html:
dfs = pd.read_html('http://www.contextures.com/xlSampleData01.html')
dfs
We have one table and can see that we need to pass the row number to use as header (because <thead> is not present).
dfs = pd.read_html('http://www.contextures.com/xlSampleData01.html', header=0)
dfs[0].head()
Nice!
Parsing an https URL¶
The documentation states that:
Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.
This is true, but bs4 + html5lib are used as a fallback when lxml fails. I guess this is why passing an https URL does work. We can confirm that with a Wikipedia page.
pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)', header=0)[1]
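If we don't want to rely on the fallback, read_html also takes a flavor argument to choose the parser explicitly. A minimal sketch forcing bs4 (this requires BeautifulSoup4 and html5lib to be installed):
pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)', header=0, flavor='bs4')[1]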
But what if the URL requires authentication?
In that case we can use requests to get the HTML and pass the string to pandas!
To demonstrate authentication, we can use http://httpbin.org. We can first confirm that passing a URL that requires authentication fails with a 401 error.
pd.read_html('https://httpbin.org/basic-auth/myuser/mypasswd')
r = requests.get('https://httpbin.org/basic-auth/myuser/mypasswd')
r.status_code
Yes, as expected. Let's pass the username and password with requests.
r = requests.get('https://httpbin.org/basic-auth/myuser/mypasswd', auth=('myuser', 'mypasswd'))
r.status_code
We could now pass r.text to pandas. Keep in mind that http://httpbin.org is a testing service: it was handy to demonstrate authentication, but it only returns JSON-encoded responses, never HTML, so parsing its response doesn't make sense here.
The following example shows how to combine requests and pandas.
r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
pd.read_html(r.text, header=0)[1]
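Putting the pieces together, here is a sketch of the pattern for a page behind basic authentication, like the Confluence site mentioned in the introduction. The URL and credentials are made up, and a real Confluence instance might require a different authentication scheme:
# Hypothetical internal page protected by basic auth
r = requests.get('https://confluence.example.com/some/page', auth=('myuser', 'mypasswd'))
r.raise_for_status()  # stop early if authentication failed
dfs = pd.read_html(r.text)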
A more complex example¶
We looked at some fairly simple examples so far. So let's try a page with several tables: https://en.wikipedia.org/wiki/Timeline_of_programming_languages
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages')
len(dfs)
If we look at the page, we have 8 tables (one per decade). Looking at our dfs list, we can see that the first interesting table is the fifth one and that we need to pass the row number to use as header.
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0)
dfs[4]
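As an aside, instead of slicing the list by position, we could let pandas keep only the tables carrying a given HTML attribute thanks to the attrs argument. A sketch, assuming the decade tables use Wikipedia's usual wikitable class (I force the bs4 parser because it matches individual class tokens):
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0, attrs={'class': 'wikitable'}, flavor='bs4')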
Back to our table: notice that the header was repeated in the last row (to make the table easier to read on the HTML page). We can filter that out after concatenating the 8 tables into one DataFrame.
df = pd.concat(dfs[4:12])
df
Remove the extra header rows.
prog_lang = df[df.Year != 'Year']
prog_lang
In what year was Python created?
prog_lang[prog_lang.Name == 'Python']
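One caveat: because the repeated header rows were parsed as data, the Year column most likely has the object (string) dtype after the concatenation. If we want numeric comparisons, we can convert it; a minimal sketch using pd.to_numeric:
prog_lang = prog_lang.assign(Year=pd.to_numeric(prog_lang.Year, errors='coerce'))
prog_lang[prog_lang.Year > 1990].head()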
Conclusion¶
The last example should say it all.
import pandas as pd
dfs = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', header=0)
df = pd.concat(dfs[4:12])
prog_lang = df[df.Year != 'Year']
Four lines of code (including the import) and we have one DataFrame containing the data from 8 different HTML tables on one Wikipedia page!
Do I need to say why I love Python and pandas? :-)
This post was written in a Jupyter notebook. You can find the notebook on GitHub and download the conda environment.yml file to get all the dependencies I used.