Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, jmagallanes@pucp.edu.pe
Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, magajm@uw.edu
In this session we will use Python to:
Collect data into Python as data frames
Clean the data
Merge both tables
Prepare a file for further analysis
# Location of data file
linkFile="https://github.com/eScienceWinterSchool/PythonSession/raw/master/data/HDI_Table.xlsx"
Let's read the table with pandas; since it is an Excel file, pandas requires openpyxl:
# available in my computer?
!pip show openpyxl
If it is not available, install it (for instance via Anaconda). Once it is available, continue:
import pandas as pd
hdiFile=pd.read_excel(linkFile) # only for excel
Take a look (since the source is an Excel file, it might be a good idea to open it in Excel too):
hdiFile
Let's bring in another table, this time from Wikipedia. Make sure html5lib and beautifulsoup4 are installed before running the next code (check with pip show):
!pip show html5lib beautifulsoup4
Now bring the tables:
#path
linkwiki='https://en.wikipedia.org/wiki/The_Economist_Democracy_Index'
# call
sortableTables=pd.read_html(io=linkwiki, # the link to the main webpage
                            flavor='bs4', # tell pandas to use bs4
                            attrs={"class": "wikitable sortable"}) # an attribute of the tables to scrape
Pay attention to what pd.read_html returned:
# what you got, and how many you got
type(sortableTables), len(sortableTables)
We have three tables; let's look at a couple of them:
# let's see the second one
sortableTables[1].head(10)
# let's see the third one
sortableTables[2].head(10)
Let's keep the right one:
# this is the one
demoTable=sortableTables[2].copy()
hdiFile.iloc[[3,4],:]
As you see, the column names are in different positions:
# here
hdiFile.iloc[3,2:]
# and here
hdiFile.iloc[4,:2]
Can we concatenate those values?
# save column names
RealHeaders=hdiFile.iloc[4,:2]+hdiFile.iloc[3,2:]
# these are:
RealHeaders
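The NaN values above appear because adding two pandas Series aligns them by index label, not by position; a label present in only one Series yields NaN. A minimal sketch with made-up header pieces:

```python
import pandas as pd

# two slices with disjoint index labels, like the two header rows
a = pd.Series(['HDI rank', 'Country'], index=[0, 1])
b = pd.Series(['Value', 'Rank'], index=[2, 3])

# '+' aligns on index labels, so every position is NaN here
print(a + b)

# converting to lists concatenates by position instead
print(a.to_list() + b.to_list())  # ['HDI rank', 'Country', 'Value', 'Rank']
```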
Adding Series did not concatenate; it is easier with lists, so we use .to_list():
# save column names turned to lists
RealHeaders=hdiFile.iloc[4,:2].to_list()+hdiFile.iloc[3,2:].to_list()
# these are:
RealHeaders
Let's see how it looks now:
# rename all the columns
hdiFile.columns=RealHeaders
# newDF
better_1=hdiFile.copy()
# see head
better_1.head()
Notice the repeated column names (HDI rank) and the NaN. Notice also that we do not need the last four columns. Let's solve that:
# without the last 4 columns
better_1.iloc[:,:-4]
We use the previous result to rewrite the original:
# then,
better_1=better_1.iloc[:,:-4]
We still have column names with missing values:
better_1.columns
...let's get rid of those:
#like this?
better_1.columns.dropna()
# make the change!
BetterHeaders=better_1.columns.dropna()
#result
BetterHeaders
#subsetting again
better_1=better_1.loc[:,BetterHeaders]
better_2=better_1.copy()
#see
better_2.head(20)
Notice above that the column names still carry text in parentheses, stray spaces, and inconsistent casing.
Let's see what can be done using Regular Expressions (REGEX):
# bye anything between parentheses
better_2.columns.str.replace(r'\(.+\)',"", regex=True)
# bye anything between parentheses, bye leading-trailing spaces
better_2.columns.str.replace(r'\(.+\)',"", regex=True).str.strip()
# bye anything between parentheses, bye leading-trailing spaces, title case
better_2.columns.str.replace(r'\(.+\)',"", regex=True).\
    str.strip().\
    str.title()
Let's keep this last one for a while:
#changing column names
better_2.columns=better_2.columns.str.replace(r'\(.+\)',"", regex=True).\
    str.strip().\
    str.title()
#so
better_2
Now, it is time to shorten column names:
a. CamelCase?
better_2.columns.str.replace(" ",'',regex=False)
b. Just acronyms:
# each column name split into words:
[name.split() for name in better_2.columns[2::]]
# first letter of each word
[[word[0] for word in name.split()] for name in better_2.columns[2::]]
# final result
[''.join([word[0] for word in name.split()]) for name in better_2.columns[2::]]
Let's keep the last alternative:
newNames=[''.join([word[0] for word in name.split()]) for name in better_2.columns[2::]]
better_2.columns=better_2.columns[:2].str.replace(" ",'',regex=False).to_list()+newNames
#newDF
better_3=better_2.copy()
Finally, let's look at the result and keep only the rows that actually have an HDI rank:
better_3.head(20)
better_3[~pd.isna(better_3['HdiRank'])]
# then
hdiSubset=better_3[~pd.isna(better_3['HdiRank'])].copy()
#see
hdiSubset
We still have one bad row of data:
hdiSubset.drop(index=4)
We just drop it:
hdiSubset.drop(index=4, inplace=True)
hdiSubset.reset_index(drop=True, inplace=True)
#see
hdiSubset
Use try/except to scan each column for cells made only of non-word symbols (the .str accessor raises on non-string columns, so we skip those):
for i in range(hdiSubset.shape[1]):
    try:
        print(hdiSubset.iloc[:,i][hdiSubset.iloc[:,i].str.fullmatch(r"\W+",na=False)])
    except AttributeError:
        pass
We do not have weird symbols, but if we had:
# replacing !
badSymbols=["..",'xx','tba']
hdiSubset.replace(to_replace=badSymbols,
                  value=float('nan'), # value=None would NOT set missing values
                  inplace=True)
#result
hdiSubset
# with all missing (after the first column)
hdiSubset[hdiSubset.iloc[:,1:].isna().all(axis=1)]
# with at least one missing (after the first column)
hdiSubset[hdiSubset.iloc[:,1:].isna().any(axis=1)]
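The axis=1 argument makes isna().all() / isna().any() evaluate across the columns of each row. A toy sketch (the frame here is made up):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                    'b': [2.0, 3.0, np.nan]})

# rows where every value is missing: only the last row
print(toy[toy.isna().all(axis=1)])

# rows with at least one missing value: the last two rows
print(toy[toy.isna().any(axis=1)])
```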
hdiClean=hdiSubset.copy()
Let's check the data types:
# explore
hdiClean.info()
This way you drop a column name (not the whole column):
hdiClean.columns.drop('Country')
The numbers have been recognised as object type, likely because some cells held non-numeric values (such as the bad symbols we just replaced).
# keep all numeric columns
allNumCols=hdiClean.columns.drop('Country')
allNumCols
# as easy as:
hdiClean[allNumCols]=hdiClean[allNumCols].apply(pd.to_numeric)
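By default pd.to_numeric raises an error if a cell cannot be parsed; had any stray symbol survived, the errors='coerce' option would turn it into NaN instead. A quick sketch:

```python
import pandas as pd

s = pd.Series(['1.5', '2', '..'])  # '..' is not a number

# errors='coerce' maps unparseable cells to NaN instead of raising
nums = pd.to_numeric(s, errors='coerce')
print(nums)  # 1.5, 2.0, NaN
```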
hdiFormat=hdiClean.copy()
#recheck
hdiFormat.info()
That was easy!
# can you apply math?
hdiFormat.drop(columns=['Country']).max()
Before the next process, let's quickly preprocess the table we scraped:
# brief look
demoTable
#data types
demoTable.info()
# keep some:
someCols=demoTable.columns[~demoTable.columns.str.contains('Δ')]
#subset
demoTable=demoTable[someCols].copy()
demoTable.columns=demoTable.columns.str.replace(r'\W',"",regex=True)
#then
demoTable
We have a categorical column; let's give it the right data type:
#rewrite the levels in ascending order:
correctLevels=['Authoritarian', 'Hybrid regime', 'Flawed democracy','Full democracy']
#format as ordinal:
demoTable.Regimetype=pd.Categorical(demoTable.Regimetype,categories=correctLevels,ordered=True)
demoFormat=demoTable.copy()
demoFormat.info()
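The ordered categorical pays off because comparisons and sorting follow the declared level order rather than alphabetical order. A minimal sketch with the same levels:

```python
import pandas as pd

levels = ['Authoritarian', 'Hybrid regime', 'Flawed democracy', 'Full democracy']
ser = pd.Series(pd.Categorical(['Full democracy', 'Authoritarian', 'Hybrid regime'],
                               categories=levels, ordered=True))

# order-aware comparison: only 'Full democracy' is above 'Hybrid regime'
print(ser > 'Hybrid regime')

# sorting respects the regime ordering, not the alphabet
print(ser.sort_values().to_list())
```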
If we are confident we did a good cleaning and formatting, this step should be easy:
# left_on/right_on are not needed here (only when the key column names differ)
HdiDemo=hdiFormat.merge(demoFormat,left_on='Country', right_on='Country')
HdiDemo
Notice the number of rows returned above, and compare it with the number of rows in each data frame:
len(HdiDemo),len(hdiFormat),len(demoFormat)
If you do not want to check country names, you can stop here.
The smaller row count of the two tables is the maximum you can expect after the merge. Let's check the key values that were not matched:
onlyHDI=set(hdiFormat.Country)-set(demoFormat.Country)
onlyDEMO=set(demoFormat.Country)-set(hdiFormat.Country)
onlyHDI
onlyDEMO
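An alternative to the set arithmetic above is an outer merge with indicator=True, which labels each row 'both', 'left_only', or 'right_only'. A sketch on toy frames (the data is made up):

```python
import pandas as pd

left = pd.DataFrame({'Country': ['Peru', 'Chile', 'Czechia'], 'hdi': [1, 2, 3]})
right = pd.DataFrame({'Country': ['Peru', 'Chile', 'Czech Republic'], 'demo': [4, 5, 6]})

m = left.merge(right, on='Country', how='outer', indicator=True)

# rows whose key failed to match in the other frame
print(m[m['_merge'] != 'both'])
```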
The previous objects (onlyDEMO, onlyHDI) hold the key values that found no match in the other data frame. If you want to recover some of these values, you may follow these steps (you may need to install thefuzz):
from thefuzz import process as fz
# take a country from onlyDEMO
# and get the country in onlyHDI that matches it best, with the match score!
# notice I sorted onlyDEMO
[(aDemoCountry,fz.extractOne(aDemoCountry, onlyHDI)) for aDemoCountry in sorted(onlyDEMO)]
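If thefuzz is not at hand, the standard library's difflib implements a similar closest-match idea (not thefuzz's exact scorer); the helper below is hypothetical:

```python
import difflib

candidates = ['Czechia', "Lao People's Democratic Republic", 'Turkiye']

def closest(name, pool):
    # pick the candidate with the highest SequenceMatcher similarity
    best = max(pool, key=lambda c: difflib.SequenceMatcher(None, name, c).ratio())
    score = round(difflib.SequenceMatcher(None, name, best).ratio() * 100)
    return best, score

print(closest('Czech Republic', candidates))
```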
Some countries will never find a match, so let's drop them:
notInHDI=['North Korea','Taiwan']
demoFormat_sub=demoFormat[~demoFormat.Country.isin(notInHDI)].copy() # copy avoids a SettingWithCopyWarning later
Other countries did find a match, but the fuzzy matcher picked the wrong one. Let's fix those by hand:
# dictionary of changes
changesDEMO={'Czech Republic':'Czechia',
'Laos':"Lao People's Democratic Republic"}
# make the replacement
demoFormat_sub['Country']=demoFormat_sub['Country'].replace(changesDEMO)
As we have made changes to the column values, let's redo the fuzzy-matching check:
onlyHDI=set(hdiFormat.Country)-set(demoFormat_sub.Country)
onlyDEMO=set(demoFormat_sub.Country)-set(hdiFormat.Country)
[(aDemoCountry,fz.extractOne(aDemoCountry, onlyHDI)) for aDemoCountry in sorted(onlyDEMO)]
All matches are great. Let's create a dictionary of changes:
changesDEMO={aDemoCountry:fz.extractOne(aDemoCountry, onlyHDI)[0] for aDemoCountry in sorted(onlyDEMO)}
changesDEMO
# replace in democracy
demoFormat_sub['Country']=demoFormat_sub['Country'].replace(changesDEMO)
Let's do a new merge:
# did you get more rows?
HdiDemo_2=hdiFormat.merge(demoFormat_sub)
# lengths
len(HdiDemo_2),len(HdiDemo), len(hdiFormat),len(demoFormat_sub),len(demoFormat)
The format should still be good:
HdiDemo_2.info()
HdiDemo_2.describe().loc[['min','max']].T #T for transposing
Since the columns have very different ranges, it is a good idea to request a boxplot (make sure matplotlib is installed):
import matplotlib.pyplot as plt
HdiDemo_2.plot(kind='box', rot=90,fontsize=5)
plt.semilogy();
Notice that our concern here is the numeric data. Scaling is rarely an issue for categorical columns, though some cases may require thought.
Let me get the column names of the numeric columns:
import numpy as np
colsToScale = HdiDemo_2.select_dtypes([np.number]).columns
colsToScale
Time to produce new ranges (make sure you have previously installed scikit-learn):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_minmax = scaler.fit_transform(HdiDemo_2.loc[:,colsToScale].to_numpy())
df_scaled = pd.DataFrame(df_minmax, columns=colsToScale)
Let's explore the result:
df_scaled.describe().loc[['min','max']].T
df_scaled.plot(kind='box', rot=90,fontsize=5);
Let's add a suffix to the column names:
df_scaled.columns=df_scaled.columns+"_mM"
# concat to the right (instead of bottom) with axis=1
pd.concat([HdiDemo_2,df_scaled],axis=1)
So this is our last version:
hdidem_plus=pd.concat([HdiDemo_2,df_scaled],axis=1)
hdidem_plus.head()
hdidem_plus.to_pickle("hdidem_plus.pkl")
# you will need: DF=pd.read_pickle("hdidem_plus.pkl")
# or:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://...../hdidem_plus.pkl"),compression=None)
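A quick round-trip check of the pickle workflow, using a temporary file so nothing real is overwritten (the frame is made up):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Country': ['Peru'], 'Hdi': [0.77]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pkl')
    df.to_pickle(path)           # serialize, preserving dtypes
    back = pd.read_pickle(path)  # restore

print(back.equals(df))  # True
```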
If you also want an R-native version of the file, rpy2 can save the data frame as RDS (check that rpy2 is installed):
!pip show rpy2
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from rpy2.robjects.packages import importr
base = importr('base')
base.saveRDS(hdidem_plus,file="hdidem_plus.RDS")
#In R, you call it with: DF = readRDS("hdidem_plus.RDS")
#or, if read from cloud: DF = readRDS(url("https://...../hdidem_plus.RDS"))