Scraping Tabular Data with Pandas¶
Medium Article Link: https://medium.com/p/10cf2a133cbf¶
import pandas as pd
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
df_list = pd.read_html(url)
# pd.read_html returns list
type(df_list)
list
# Total number of tables found
len(df_list)
32
df_list[2]
No. | Name | Net worth (USD) | Age | Nationality | Source(s) of wealth | |
---|---|---|---|---|---|---|
0 | 1 | Jeff Bezos | $113 billion | 56 | United States | Amazon |
1 | 2 | Bill Gates | $98 billion | 64 | United States | Microsoft |
2 | 3 | Bernard Arnault and family | $76 billion | 71 | France | LVMH |
3 | 4 | Warren Buffett | $67.5 billion | 89 | United States | Berkshire Hathaway |
4 | 5 | Larry Ellison | $59 billion | 75 | United States | Oracle Corporation |
5 | 6 | Amancio Ortega | $55.1 billion | 84 | Spain | Inditex, Zara |
6 | 7 | Mark Zuckerberg | $54.7 billion | 35 | United States | |
7 | 8 | Jim Walton | $54.6 billion | 71 | United States | Walmart |
8 | 9 | Alice Walton | $54.4 billion | 70 | United States | Walmart |
9 | 10 | S. Robson Walton | $54.1 billion | 77 | United States | Walmart |
Set a particular column as index¶
pd.read_html(url, index_col=1)[2]
No. | Net worth (USD) | Age | Nationality | Source(s) of wealth | |
---|---|---|---|---|---|
Name | |||||
Jeff Bezos | 1 | $113 billion | 56 | United States | Amazon |
Bill Gates | 2 | $98 billion | 64 | United States | Microsoft |
Bernard Arnault and family | 3 | $76 billion | 71 | France | LVMH |
Warren Buffett | 4 | $67.5 billion | 89 | United States | Berkshire Hathaway |
Larry Ellison | 5 | $59 billion | 75 | United States | Oracle Corporation |
Amancio Ortega | 6 | $55.1 billion | 84 | Spain | Inditex, Zara |
Mark Zuckerberg | 7 | $54.7 billion | 35 | United States | |
Jim Walton | 8 | $54.6 billion | 71 | United States | Walmart |
Alice Walton | 9 | $54.4 billion | 70 | United States | Walmart |
S. Robson Walton | 10 | $54.1 billion | 77 | United States | Walmart |
Return tables containing a string or regex¶
pd.read_html(url, match='Number and combined net worth of billionaires by year')[0].head()
Year | Number of billionaires | Group's combined net worth | |
---|---|---|---|
0 | 2020 | 2095 | $8.0 trillion |
1 | 2019 | 2153 | $8.7 trillion |
2 | 2018 | 2208 | $9.1 trillion |
3 | 2017 | 2043 | $7.7 trillion |
4 | 2016 | 1810 | $6.5 trillion |
Specify strings to recognize as NA/NaN¶
without specifying na_values
¶
pd.read_html(url)[0].tail()
0 | 1 | |
---|---|---|
9 | Net worth (1st) | US$113 billion |
10 | Number of billionaires | 2095 |
11 | Total list net worth value | US$8 trillion |
12 | New members to the list | 178 |
13 | Forbes: The World's Billionaires website | Forbes: The World's Billionaires website |
after specifying na_values
¶
pd.read_html(url,na_values=["Forbes: The World's Billionaires website"])[0].tail()
0 | 1 | |
---|---|---|
9 | Net worth (1st) | US$113 billion |
10 | Number of billionaires | 2095 |
11 | Total list net worth value | US$8 trillion |
12 | New members to the list | 178 |
13 | NaN | NaN |
Other parameters¶
pd.read_html(url, skiprows=3, header=0)[0].head()
Publisher | Whale Media InvestmentsForbes family | |
---|---|---|
0 | Publication | Forbes |
1 | First published | March 1987[1] |
2 | Latest publication | April 7, 2020 |
3 | Current list details (2020)[2] | Current list details (2020)[2] |
4 | Wealthiest | Jeff Bezos |