Scraping Tabular Data with Pandas

Medium Article Link: https://medium.com/p/10cf2a133cbf

import pandas as pd
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)
# pd.read_html returns list
type(df_list)
list
# Total number of tables found
len(df_list)
32
df_list[2]
No. Name Net worth (USD) Age Nationality Source(s) of wealth
0 1 Jeff Bezos $113 billion 56 United States Amazon
1 2 Bill Gates $98 billion 64 United States Microsoft
2 3 Bernard Arnault and family $76 billion 71 France LVMH
3 4 Warren Buffett $67.5 billion 89 United States Berkshire Hathaway
4 5 Larry Ellison $59 billion 75 United States Oracle Corporation
5 6 Amancio Ortega $55.1 billion 84 Spain Inditex, Zara
6 7 Mark Zuckerberg $54.7 billion 35 United States Facebook
7 8 Jim Walton $54.6 billion 71 United States Walmart
8 9 Alice Walton $54.4 billion 70 United States Walmart
9 10 S. Robson Walton $54.1 billion 77 United States Walmart

Set a particular column as index

pd.read_html(url, index_col=1)[2]
No. Net worth (USD) Age Nationality Source(s) of wealth
Name
Jeff Bezos 1 $113 billion 56 United States Amazon
Bill Gates 2 $98 billion 64 United States Microsoft
Bernard Arnault and family 3 $76 billion 71 France LVMH
Warren Buffett 4 $67.5 billion 89 United States Berkshire Hathaway
Larry Ellison 5 $59 billion 75 United States Oracle Corporation
Amancio Ortega 6 $55.1 billion 84 Spain Inditex, Zara
Mark Zuckerberg 7 $54.7 billion 35 United States Facebook
Jim Walton 8 $54.6 billion 71 United States Walmart
Alice Walton 9 $54.4 billion 70 United States Walmart
S. Robson Walton 10 $54.1 billion 77 United States Walmart

Return tables containing a string or regex

pd.read_html(url, match='Number and combined net worth of billionaires by year')[0].head()
Year Number of billionaires Group's combined net worth
0 2020 2095 $8.0 trillion
1 2019 2153 $8.7 trillion
2 2018 2208 $9.1 trillion
3 2017 2043 $7.7 trillion
4 2016 1810 $6.5 trillion

Specify strings to recognize as NA/NaN

without specifying na_values

pd.read_html(url)[0].tail()
0 1
9 Net worth (1st) US$113 billion
10 Number of billionaires 2095
11 Total list net worth value US$8 trillion
12 New members to the list 178
13 Forbes: The World's Billionaires website Forbes: The World's Billionaires website

after specifying na_values

pd.read_html(url,na_values=["Forbes: The World's Billionaires website"])[0].tail()
0 1
9 Net worth (1st) US$113 billion
10 Number of billionaires 2095
11 Total list net worth value US$8 trillion
12 New members to the list 178
13 NaN NaN

Other parameters

pd.read_html(url, skiprows=3, header=0)[0].head()
Publisher Whale Media InvestmentsForbes family
0 Publication Forbes
1 First published March 1987[1]
2 Latest publication April 7, 2020
3 Current list details (2020)[2] Current list details (2020)[2]
4 Wealthiest Jeff Bezos