Python Programming
for big Data Analysis and Visualisation

Session 5 : Data Exploration¶

Welcome to Session 5 of the Python Programming for big Data Analysis and Visualisation course. In this notebook you will find the material covered during this session.

Exercises¶

There are 5 exercises and 1 mini-project in this notebook.

  • Exercise 1
  • Exercise 2
  • Exercise 3
  • Exercise 4
  • Exercise 5
  • Project: Tips

Recap: visualisation¶

Last time we saw how to make the most common plots using pyplot, pandas and seaborn. In particular we saw:

  • Scatterplots to study the relationship between two variables (typically continuous):
    • plt.scatter();
    • pd.DataFrame.plot.scatter();
    • sns.scatterplot() (and sns.swarmplot() if one variable is categorical).
  • Lineplots as the most basic type of plot, using the very powerful plt.plot().
  • Barplots to study the relationship between a categorical and a continuous variable:
    • pd.DataFrame.plot.bar() and pd.Series.plot.bar();
  • Histograms to study the distribution of a variable (typically continuous):
    • plt.hist();
    • pd.DataFrame.hist();
    • sns.histplot() for continuous variables;
    • sns.countplot() for categorical variables.
  • Boxplots to study the relationship between a categorical and a continuous variable:
    • pd.DataFrame.boxplot();
    • sns.boxplot() and sns.violinplot().

All of these libraries use matplotlib in the background, so we can use common pyplot commands to enrich our representations: plt.xlabel(), plt.ylabel(), plt.xlim(), plt.ylim(), plt.grid(), plt.title(), etc...
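As a quick reminder, here is a minimal sketch of how these commands can decorate a plot (assuming pyplot is imported and the titanic DataFrame from the previous session is loaded with an age column):

In [ ]:
# a minimal sketch: decorate a histogram with common pyplot commands
plt.hist(titanic["age"].dropna(), bins=20)
plt.xlabel("age (years)")
plt.ylabel("number of passengers")
plt.title("Age distribution of Titanic passengers")
plt.xlim(0, 80)
plt.grid(True)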

We also saw two common ways to manipulate your DataFrames:

  • Sorting is an important operation on continuous variables: for example titanic.sort_values('age') and titanic.sort_index().
  • Split-apply-combine to group the rows of your data according to the values of a categorical variable, in order to manipulate each group separately: for example titanic.groupby('sex').mean() (other aggregation methods such as .sum(), .count() or .median() work the same way); a short reminder sketch follows below.
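A minimal reminder sketch of these two operations (assuming the titanic DataFrame is already loaded):

In [ ]:
# sorting: youngest passengers first
titanic.sort_values("age").head()
# split-apply-combine: mean age per sex
titanic.groupby("sex")["age"].mean()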

More pandas¶

We will see three more fundamental pandas techniques:

  1. Working with missing data;
  2. Renaming columns;
  3. Changing column types.
In [3]:
# first let's import some libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Working with missing data¶

Working with missing data is very common. A typical example of missing data could be a csv-formatted data file like the following:

name,sex,age,fare,survived
Allen,male,,8.05,0
...

(the age of Mr. Allen is missing).

Missing data in pandas is represented by the numpy.nan value (NaN), which is of type float.

Checking if data is missing can be done using pd.DataFrame.isna() and pd.Series.isna().
Missing data can be removed from a DataFrame using pd.DataFrame.dropna(); note that the axis, how and subset options of pd.DataFrame.dropna() (see the examples below and the help) let you remove only the problematic missing data.
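To illustrate the how and subset options, here is a minimal sketch on a small, made-up DataFrame (the toy frame below is purely illustrative):

In [ ]:
import numpy as np

# a toy DataFrame with some missing values (illustration only)
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                    "b": [4.0, 5.0, np.nan]})
toy.isna().sum()          # number of missing values per column
toy.dropna(how="all")     # drop only rows where *all* values are missing
toy.dropna(subset=["a"])  # drop rows where column "a" is missing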

Let's load a modified version of the titanic dataset that contains a few rows with some missing data:

In [2]:
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic_na.csv", index_col=0)
In [3]:
titanic.info()  # .info informs us on null (missing) data
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, Allen to Vestrom
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sex       9 non-null      object 
 1   age       8 non-null      float64
 2   fare      9 non-null      float64
 3   survived  10 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 400.0+ bytes
In [4]:
titanic.isna()
Out[4]:
sex age fare survived
name
Allen False True False False
Braund False False False False
Cumings True False False False
Futrelle False True True False
Futrelle False False False False
Heikkinen False False False False
Jussila False False False False
Madsen False False False False
Sloper False False False False
Vestrom False False False False
In [7]:
titanic.dropna()
Out[7]:
sex age fare survived
name
Braund male 22.0 7.25 0
Futrelle male 37.0 53.10 0
Heikkinen female 26.0 7.92 1
Jussila female 20.0 9.82 0
Madsen male 24.0 7.14 1
Sloper male 28.0 35.50 1
Vestrom female 14.0 7.85 0
In [11]:
titanic.dropna(axis='columns')
Out[11]:
survived
name
Allen 0
Braund 0
Cumings 1
Futrelle 1
Futrelle 0
Heikkinen 1
Jussila 0
Madsen 1
Sloper 1
Vestrom 0

Exercise 1¶

Load the full titanic dataset from "https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv", using the first column as index. Then:

  1. Create a new dataframe by removing rows with NA values: how many rows were removed?
  2. Create a new dataframe by removing columns with NA values: how many columns were removed? How many rows are left?
In [10]:
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv").set_index("name")
In [15]:
titanic1 = titanic.dropna()
len(titanic) - len(titanic1)
Out[15]:
708
In [17]:
titanic2 = titanic.dropna(axis="columns")
print(len(titanic.columns) - len(titanic2.columns))
len(titanic2)
3
Out[17]:
891
In [23]:
titanic.dropna(subset=["age"])
Out[23]:
sex age fare survived
name
Braund male 22.0 7.25 0
Cumings NaN 38.0 71.28 1
Futrelle male 37.0 53.10 0
Heikkinen female 26.0 7.92 1
Jussila female 20.0 9.82 0
Madsen male 24.0 7.14 1
Sloper male 28.0 35.50 1
Vestrom female 14.0 7.85 0

Renaming columns¶

We've seen how to modify the DataFrame index using pd.DataFrame.set_index() and pd.DataFrame.reset_index(); but what about modifying column labels? We can use pd.DataFrame.rename(). The columns argument is a dict, where each key is a column name to rename and each value is the new name: simple and expressive.

In [75]:
# rename column age to Age
titanic.rename(columns={"age": "Age"})
Out[75]:
sex Age fare survived
name
Allen male NaN 8.05 0
Braund male 22.0 7.25 0
Cumings NaN 38.0 71.28 1
Futrelle female NaN NaN 1
Futrelle male 37.0 53.10 0
Heikkinen female 26.0 7.92 1
Jussila female 20.0 9.82 0
Madsen male 24.0 7.14 1
Sloper male 28.0 35.50 1
Vestrom female 14.0 7.85 0

Changing column types¶

We've seen that pd.DataFrame.info() informs us of the type of each column of a DataFrame, initially determined automatically by pd.read_csv(). It may sometimes be useful to change the type to something more appropriate: we can use pd.DataFrame.astype(). The argument to astype is a dict, where each key is a column name and each value is the desired type for that column (just like the columns argument of pd.DataFrame.rename()!).

In [57]:
titanic.dropna().astype({"age": int, "survived": bool, "sex": "category"})
Out[57]:
sex age fare survived
name
Braund male 22 7.25 False
Futrelle male 37 53.10 False
Heikkinen female 26 7.92 True
Jussila female 20 9.82 False
Madsen male 24 7.14 True
Sloper male 28 35.50 True
Vestrom female 14 7.85 False

Remember: method chaining (i.e. pd.DataFrame.first_operation().second_operation().third_operation()) is an expressive and concise way to perform multiple operations on a DataFrame!

Exercise 2¶

Load the full titanic dataset from "https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv", using the first column as index.
Perform the following operations in a chain:

  1. rename the column pclass to "ticket_class";
  2. remove rows with any NA values in the age column;
  3. set the type of age to int, of survived to bool and of ticket_class to category.
In [8]:
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv").set_index("name")
In [9]:
titanic = titanic \
    .rename(columns={"pclass": "ticket_class"}) \
    .dropna(subset=["age"]) \
    .astype({"age": int, "survived": bool, "ticket_class": "category"})
titanic.head()
Out[9]:
passengerId survived ticket_class sex age sibsp parch ticket fare cabin embarked
name
Braund, Mr. Owen Harris 1 False 3 male 22 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 True 1 female 38 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 True 3 female 26 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 True 1 female 35 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 False 3 male 35 0 0 373450 8.0500 NaN S

Data Exploration¶

Exploratory Data Analysis (EDA) is a set of techniques and approaches aimed at analyzing data sets to summarize their main characteristics. It is a fundamental step towards using the data to train a predictive model (e.g. in machine learning) and to make hypotheses about the phenomena described in the data.

EDA has two main goals:

  1. Determine whether the data makes sense and is ready for analysis, or whether it needs further cleaning/manipulation;
  2. Find relevant patterns and trends in the data.

We will split EDA into 4 phases:

  1. Initial exploration
  2. Monovariate analysis
  3. Bivariate analysis
  4. Multivariate analysis

Initial exploration¶

The initial exploration tries to answer the following questions:

  1. Has the data been loaded correctly?
    Do we need to rename columns or set an index?
  2. Is the dataset complete or does it contain missing data?
    Do we need to drop NAs?
  3. What are the variables and their types?
    Have the variable types been guessed correctly by pd.read_csv?
In [10]:
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv").set_index("name")
titanic.head()
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, Braund, Mr. Owen Harris to Dooley, Mr. Patrick
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   passengerId  891 non-null    int64  
 1   survived     891 non-null    int64  
 2   pclass       891 non-null    int64  
 3   sex          891 non-null    object 
 4   age          714 non-null    float64
 5   sibsp        891 non-null    int64  
 6   parch        891 non-null    int64  
 7   ticket       891 non-null    object 
 8   fare         891 non-null    float64
 9   cabin        204 non-null    object 
 10  embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 83.5+ KB
In [12]:
titanic = titanic\
    .dropna(subset=["age", "embarked"]) \
    .rename(columns={"pclass": "ticket_class"})\
    .astype({"survived": bool, "ticket_class": "category", 
             "age": int, "sex": "category", "embarked": "category"})

Monovariate analysis¶

Obtain descriptive statistics of each numerical variable using pd.DataFrame.describe().

Visual inspection of each variable according to its type:

  • continuous: sns.histplot() (note option kde=True)
  • categorical: sns.countplot()
In [138]:
titanic.describe()
Out[138]:
passengerId age sibsp parch fare
count 712.000000 712.000000 712.000000 712.000000 712.000000
mean 448.589888 29.622191 0.514045 0.432584 34.567251
std 258.683191 14.502891 0.930692 0.854181 52.938648
min 1.000000 0.000000 0.000000 0.000000 0.000000
25% 222.750000 20.000000 0.000000 0.000000 8.050000
50% 445.000000 28.000000 0.000000 0.000000 15.645850
75% 677.250000 38.000000 1.000000 1.000000 33.000000
max 891.000000 80.000000 5.000000 6.000000 512.329200
In [139]:
titanic.describe(include="all")
Out[139]:
passengerId survived ticket_class sex age sibsp parch ticket fare cabin embarked
count 712.000000 712 712.0 712 712.000000 712.000000 712.000000 712 712.000000 183 712
unique NaN 2 3.0 2 NaN NaN NaN 541 NaN 133 3
top NaN False 3.0 male NaN NaN NaN 347082 NaN G6 S
freq NaN 424 355.0 453 NaN NaN NaN 7 NaN 4 554
mean 448.589888 NaN NaN NaN 29.622191 0.514045 0.432584 NaN 34.567251 NaN NaN
std 258.683191 NaN NaN NaN 14.502891 0.930692 0.854181 NaN 52.938648 NaN NaN
min 1.000000 NaN NaN NaN 0.000000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 222.750000 NaN NaN NaN 20.000000 0.000000 0.000000 NaN 8.050000 NaN NaN
50% 445.000000 NaN NaN NaN 28.000000 0.000000 0.000000 NaN 15.645850 NaN NaN
75% 677.250000 NaN NaN NaN 38.000000 1.000000 1.000000 NaN 33.000000 NaN NaN
max 891.000000 NaN NaN NaN 80.000000 5.000000 6.000000 NaN 512.329200 NaN NaN
In [140]:
# Categorical : countplot
plt.subplot(2,2,1)
sns.countplot(x="survived", data=titanic)
plt.subplot(2,2,2)
sns.countplot(x="ticket_class", data=titanic)
plt.subplot(2,2,3)
sns.countplot(x="sibsp", data=titanic)
plt.subplot(2,2,4)
sns.countplot(x="parch", data=titanic)
plt.tight_layout()
[Figure: countplots of survived, ticket_class, sibsp and parch]
In [141]:
# Continuous : histplot
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
sns.histplot(x="age", data=titanic)
plt.subplot(2,2,2)
sns.histplot(x="sibsp", data=titanic)
plt.subplot(2,2,3)
sns.histplot(x="parch", data=titanic)
plt.subplot(2,2,4)
sns.histplot(x="fare", data=titanic, kde=True)
plt.tight_layout()
[Figure: histograms of age, sibsp, parch and fare]

Bivariate analysis¶

Evaluate the statistical relationships between pairs of numerical variables using pd.DataFrame.corr(). A powerful way to visualise the resulting correlation matrix is with sns.heatmap(); note that the annot option makes interpretation easier.

Visual inspection of the relationship between pairs of variables according to their type:

  • continuous vs categorical: sns.boxplot() and sns.violinplot()
  • continuous vs continuous: sns.jointplot() (note options kind="hex" and kind="kde")
  • categorical vs categorical: sns.countplot() (using x=... and hue=...)

Remember to use the option hue to add a further categorical variable.

In [8]:
# numeric_only=True keeps only numerical columns (required in recent pandas versions)
sns.heatmap(titanic.corr(method='spearman', numeric_only=True), annot=True)
Out[8]:
<AxesSubplot:>
[Figure: annotated Spearman correlation heatmap]
In [15]:
# Continuous vs categorical : boxplot (or violinplot)
# ... 2 examples:
sns.boxplot(x="survived", y="fare", data=titanic, hue="ticket_class")
plt.figure()
sns.violinplot(x="survived", y="fare", data=titanic, hue="sex", split=True, cut=0)
Out[15]:
<Axes: xlabel='survived', ylabel='fare'>
[Figures: boxplot of fare vs survived split by ticket_class; violin plot of fare vs survived split by sex]
In [10]:
# Continuous vs continuous : jointplot
# ... 2 examples, using the only 2 continuous variables in this dataset
sns.jointplot(x="age", y="fare", data=titanic, hue="sex")
plt.figure()
sns.jointplot(x="age", y="fare", data=titanic, kind="hex")
# The kind=hex option allows us to see where most points are located (2D histogram)
Out[10]:
<seaborn.axisgrid.JointGrid at 0x7fa592f84e20>
[Figures: jointplot of age vs fare coloured by sex; hexbin jointplot of age vs fare]
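The kind="kde" option mentioned above is not run in this notebook; a minimal sketch would be:

In [ ]:
# kernel density estimate of the joint age/fare distribution (a sketch, not run here)
sns.jointplot(x="age", y="fare", data=titanic, kind="kde")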
In [143]:
# Categorical vs categorical : countplots using x= and hue=
# ... 2 examples
sns.countplot(x="survived", hue="sex", data=titanic)
plt.figure()
sns.countplot(x="ticket_class", hue="sex", data=titanic)
Out[143]:
<AxesSubplot:xlabel='ticket_class', ylabel='count'>
[Figures: countplots of survived by sex and of ticket_class by sex]

Multivariate analysis¶

Pseudo-multivariate visual inspection of the relationship between multiple variables according to their type. In addition to the use of hue in several cases above, there are two common techniques to extend a plot to take into account further categorical variables:

  1. Extend a scatterplot using sns.relplot();
  2. Extend a boxplot or countplot using sns.catplot() (remember kind="box" and kind="count" respectively).

Relplots¶

In [144]:
# Extend the age vs fare scatterplot to also consider sex, survived and ticket_class: 5 variables in total!
sns.relplot(x="age", y="fare", data=titanic, hue="sex", row='survived', col='ticket_class')
Out[144]:
<seaborn.axisgrid.FacetGrid at 0x7f8b8f422f70>
[Figure: relplot of age vs fare coloured by sex, faceted by survived (rows) and ticket_class (columns)]

Catplots¶

In [145]:
# Extend the survived vs fare boxplot to also consider sex and ticket_class
sns.catplot(x="survived", y="fare", data=titanic, hue="sex", kind='box', col='ticket_class')
Out[145]:
<seaborn.axisgrid.FacetGrid at 0x7f8b8f5dee50>
[Figure: boxplots of fare vs survived split by sex, one panel per ticket_class]
In [146]:
# Extend the survived countplot to also consider sex and ticket_class
sns.catplot(x="survived", data=titanic, hue="sex", kind='count', col='ticket_class')
Out[146]:
<seaborn.axisgrid.FacetGrid at 0x7f8bad127160>
[Figure: countplots of survived split by sex, one panel per ticket_class]

Hypothesis testing¶

Hypothesis testing is one of the workhorses of science: it is how we can draw conclusions or make decisions based on finite samples of data. In this context, we will use hypothesis testing to determine whether or not our dataset provides sufficient evidence to claim that a given null hypothesis is false.

R is a language dedicated to statistics, while Python is a general-purpose language with statistics modules. R has more statistical analysis features than Python, and specialised syntax for them, but Python can still perform basic hypothesis testing. In particular, we will be using the stats module of the scipy library.

In [68]:
from scipy import stats

Continuous vs categorical (binary): comparing two means¶

Many experimental measurements are reported as continuous numerical values, and the simplest comparison we can make is between two groups: the basic test for such situations is the t-test. The t-test comes in multiple flavors, and the specific type of test to use depends on a variety of factors. Below are listed the scipy.stats functions that perform the test, always returning the t statistic and the p-value.

  • one-sample test: ttest_1samp(x, popmean)

the null hypothesis is that the mean of sample x is equal to a given population mean popmean.

  • two-sample tests:

the null hypothesis is that samples x1 and x2 have the same mean;

  • for independent samples: ttest_ind(x1, x2)
    the test assumes equal variances (homoscedasticity) in the two groups;
    specifying equal_var=False performs Welch's t-test, which relaxes this assumption.
  • for paired samples: ttest_rel(x1, x2)
In [116]:
t, p_value = stats.ttest_1samp(titanic.loc[:, "age"], 42.4)
In [151]:
t, p_value = stats.ttest_ind(titanic.set_index("survived").loc[False, "age"], 
                             titanic.set_index("survived").loc[True, "age"])
In [150]:
titanic.groupby("survived").var().loc[:, "age"]
Out[150]:
survived
False    200.664832
True     221.781601
Name: age, dtype: float64
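The two group variances above are close but not identical; as a hedged sketch, Welch's t-test relaxes the equal-variance assumption by passing equal_var=False:

In [ ]:
# Welch's t-test: same comparison as above, without assuming equal variances
t, p_value = stats.ttest_ind(titanic.set_index("survived").loc[False, "age"],
                             titanic.set_index("survived").loc[True, "age"],
                             equal_var=False)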

Exercise 3¶

Determine if there is a statistically significant association between the age of passengers and their sex.

In [157]:
titanic.groupby("sex").var().loc[:, "age"]
Out[157]:
sex
female    196.056658
male      215.736213
Name: age, dtype: float64
In [158]:
t, p_value = stats.ttest_ind(titanic.set_index("sex").loc["male", "age"], 
                             titanic.set_index("sex").loc["female", "age"])

Non-parametric tests for comparing two means¶

All t-tests assume that errors are normally distributed. Non-parametric tests have been developed that relax this assumption. The non-parametric versions of the two-sample tests are very similar to the t-tests, and return the test statistic and the p-value.

  • for independent samples: mannwhitneyu(x1, x2)
  • for paired samples: wilcoxon(x1, x2)
In [162]:
u, p_value = stats.mannwhitneyu(titanic.set_index("survived").loc[False, "fare"], 
                                titanic.set_index("survived").loc[True, "fare"])

Continuous vs categorical: comparing several means¶

Analysis of variance (ANOVA) is a collection of statistical models that generalise t-tests beyond two means. In particular one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. ANOVA can be performed using stats.f_oneway(s1, s2, s3, ...), returning the test statistic as well as the p-value.

ANOVA has important assumptions that must be satisfied in order for the associated p-value to be valid:

  1. The samples are independent.
  2. Each sample is from a normally distributed population.
  3. The population standard deviations of the groups are all equal (homoscedasticity).

The Kruskal-Wallis H-test is a non-parametric version of ANOVA that relaxes assumptions 2 and 3. The null hypothesis is that the population medians of all groups are equal. The K-W test can be performed using stats.kruskal(s1, s2, s3, ...), returning the test statistic as well as the p-value. For the p-value to be valid, the number of samples in each group must not be too small; a typical rule is that each sample must have at least 5 measurements.
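A quick way to check this rule of thumb is to look at the group sizes, for example:

In [ ]:
# number of passengers in each ticket class (each group should have at least 5)
titanic.groupby("ticket_class").size()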

In [181]:
titanic.groupby("ticket_class").var().loc[:, "age"]
Out[181]:
ticket_class
1    218.755405
2    196.763880
3    156.028997
Name: age, dtype: float64
In [182]:
f, p_value = stats.f_oneway(titanic.set_index("ticket_class").loc[1, "age"],
                            titanic.set_index("ticket_class").loc[2, "age"],
                            titanic.set_index("ticket_class").loc[3, "age"])

Exercise 4¶

Determine if there is a statistically significant association between the fare paid by passengers and their ticket_class.

In [179]:
titanic.groupby("ticket_class").var().loc[:, "fare"]
Out[179]:
ticket_class
1    6608.637062
2     173.908290
3     100.865030
Name: fare, dtype: float64
In [200]:
h, p_value = stats.kruskal(titanic.set_index("ticket_class").loc[1, "fare"],
                           titanic.set_index("ticket_class").loc[2, "fare"],
                           titanic.set_index("ticket_class").loc[3, "fare"])

Continuous vs continuous: correlation¶

The statistical relationship between two continuous variables can be quantified by means of a correlation coefficient. It is possible to perform an associated test, where the null hypothesis is that such a coefficient is equal to zero. Below are the methods that perform such tests, and return the calculated coefficient as well as the p-value:

  • Pearson's correlation coefficient

is a measure of linear correlation between two sets of data: stats.pearsonr(x, y)

  • Spearman's correlation coefficient

is a measure of monotonic correlation between two sets of data: stats.spearmanr(x, y)

In [166]:
r, p_value = stats.spearmanr(titanic.loc[:, "fare"], 
                             titanic.loc[:, "age"])

Categorical vs categorical: comparing frequencies¶

Many measurement devices in biology are based on sampling and counting of molecules. The simplest comparisons we can make involve comparing such counts or frequencies among groups. Count data can often be conveniently stored in contingency tables; the family of $\chi^2$ tests is used in the analysis of contingency tables.

We can calculate contingency tables with pandas, for example using pd.crosstab(series1, series2), or pd.Series.value_counts() for one-dimensional tables.

Below are listed the scipy.stats functions that perform the tests, returning (unless otherwise indicated) the $\chi^2$ statistic and the p-value.

  • one-sample test: chisquare(f_obs, f_exp)

the null hypothesis is that the categorical data in the (one-dimensional) contingency table f_obs has the same frequencies as f_exp; by default, f_exp is taken as a uniform distribution.

  • two-sample test: chi2_contingency(observed)

the null hypothesis is that observed frequencies in the contingency table observed correspond to independent variables; expected frequencies are computed based on the marginal sums under the assumption of independence. It returns 4 values: $\chi^2$ statistic, p-value, number of degrees of freedom and expected contingency table.

An often-quoted guideline for the validity of the two-sample $\chi^2$ test is that the observed and expected frequencies in each cell should be at least 5. Fisher's exact test (based on the hypergeometric distribution) is an exact alternative that does not rely on this assumption; it can be performed on a 2x2 contingency table using fisher_exact(observed), returning the odds ratio and the p-value.

In [117]:
chi2, p_value = stats.chisquare(titanic.loc[:, "sex"].value_counts())
In [82]:
contingency = pd.crosstab(titanic.loc[:, "sex"], titanic.loc[:, "survived"])
contingency
Out[82]:
survived False True
sex
female 64 195
male 360 93
In [197]:
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
# odds_ratio, p_value = stats.fisher_exact(contingency)

Exercise 5¶

Determine if there is a statistically significant association between the ticket_class of passengers and their survival.

In [187]:
contingency = pd.crosstab(titanic.loc[:, "ticket_class"], titanic.loc[:, "survived"])
In [199]:
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

mini-Project¶

Tips¶


Take the Tips dataset from the previous notebook and perform a full EDA using all the techniques described above.

On the basis of the EDA, formulate 3 hypotheses on the underlying phenomenon.

The data is available at the following address:

https://marcopasi.github.io/physenbio_pyDAV/data/tips.csv

Here is a description of the variables:

  • total_bill: total bill (cost of the meal), including tax, in US dollars
  • tip: tip (gratuity) in US dollars
  • sex: sex of the person paying for the meal (Female, Male)
  • smoker: smoker in the party? (Yes, No)
  • day: day of the week (Thu, Fri, Sat, Sun)
  • time: time of day (Lunch, Dinner)
  • size: size of the party
In [43]:
tips = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/tips.csv")
_tips = tips  # keep a reference to the raw data before changing column types
In [44]:
tips.head()
Out[44]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [45]:
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
In [46]:
tips = tips.astype({"sex":"category", "smoker":"category", "day":"category", "time":"category"})
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB
In [47]:
tips.describe()
tips.describe(include=['category'])
Out[47]:
sex smoker day time
count 244 244 244 244
unique 2 2 4 2
top Male No Sat Dinner
freq 157 151 87 176
In [48]:
# univariate
plt.subplot(2,2,1)
sns.countplot(x="sex", data=tips)
plt.subplot(2,2,2)
sns.countplot(x="smoker", data=tips)
plt.subplot(2,2,3)
sns.countplot(x="day", data=tips)
plt.subplot(2,2,4)
sns.countplot(x="time", data=tips)
plt.tight_layout()
[Figure: countplots of sex, smoker, day and time]
In [53]:
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.histplot(x="size", data=tips)
plt.subplot(2,2,2)
sns.histplot(x="total_bill", data=tips, kde=True)
plt.subplot(2,1,2)
sns.histplot(x="tip", data=tips, kde=True)
plt.tight_layout()
[Figure: histograms of size, total_bill and tip]
In [69]:
# ...

License¶

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.