Python Programming for Big Data Analysis and Visualisation
Session 5: Data Exploration¶
Welcome to Session 5 of the Python Programming for Big Data Analysis and Visualisation course. In this notebook you will find the material covered during this session.
Exercises¶
There are 5 exercises and 1 mini-project in this notebook.
Recap: visualisation¶
Last time we saw how to make the most common plots using pyplot, pandas and seaborn. In particular we saw:
- Scatterplots to study the relationship between two variables (typically continuous):
  plt.scatter(); pd.DataFrame.plot.scatter(); sns.scatterplot() (and sns.swarmplot() if one variable is categorical).
- Lineplots as the most basic type of plot, using the very powerful plt.plot().
- Barplots to study the relationship between a categorical and a continuous variable:
  pd.DataFrame.plot.bar() and pd.Series.plot.bar().
- Histograms to study the distribution of a variable (typically continuous):
  plt.hist(); pd.DataFrame.hist(); sns.histplot() for continuous variables; sns.countplot() for categorical variables.
- Boxplots to study the relationship between a categorical and a continuous variable:
  pd.DataFrame.boxplot(); sns.boxplot() and sns.violinplot().
All of these libraries use matplotlib in the background, so we can use common pyplot commands to enrich our representations: plt.xlabel(), plt.ylabel(), plt.xlim(), plt.ylim(), plt.grid(), plt.title(), etc...
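For instance, here is a minimal sketch (with made-up toy values) decorating a simple lineplot with a few of these commands:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for scripted use
import matplotlib.pyplot as plt

# a basic lineplot, enriched with common pyplot decorations
plt.plot([0, 1, 2, 3], [0, 1, 4, 9])
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("Decorating a pyplot figure")
plt.grid(True)
plt.xlim(0, 3)
```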
We also saw two common ways to manipulate your DataFrames:
- Sorting is an important operation we do on continuous variables: for example
  titanic.sort_values('age') and titanic.sort_index().
- Split-apply-combine to group rows of your data according to the value of a categorical variable, in order to manipulate each group separately: for example
  titanic.groupby('sex').mean() (see also other methods specified above).
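As a refresher, split-apply-combine on a tiny hypothetical DataFrame (made-up values, not the titanic data):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "male", "female"],
                   "age": [20.0, 30.0, 40.0, 50.0]})

# split by "sex", apply mean() to each group, combine into one Series
mean_age = df.groupby("sex")["age"].mean()
```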
More pandas¶
We will see three more fundamental pandas techniques:
- Working with missing data;
- Renaming columns;
- Changing column types.
# first let's import some libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Working with missing data¶
Working with missing data is very common. A typical example of missing data could be a csv-formatted data file like the following:
name,sex,age,fare,survived
Allen,male,,8.05,0
...
(the age of Mr. Allen is missing).
Missing data in pandas is represented by the numpy.nan value, which is of type float.
Checking if data is missing can be done using pd.DataFrame.isna() and pd.Series.isna().
Missing data can be removed from a DataFrame by using pd.DataFrame.dropna(); note that the options axis, how and subset of pd.DataFrame.dropna() (see examples below and the help) can help with only removing the problematic missing data.
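The behaviour of these options can be seen on a tiny hypothetical DataFrame (the values below are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, np.nan],
                   "c": [7.0, 8.0, 9.0]})

df.isna()                   # boolean mask of missing values
df.dropna()                 # drop rows containing any NaN: only row 0 survives
df.dropna(how="all")        # drop rows where *all* values are NaN: keeps all 3 rows
df.dropna(subset=["a"])     # only consider column "a": drops row 1
df.dropna(axis="columns")   # drop columns containing any NaN: only "c" survives
```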
Let's load a modified version of the titanic dataset that contains a few rows with missing data:
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic_na.csv", index_col=0)
titanic.info() # .info informs us on null (missing) data
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, Allen to Vestrom
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sex       9 non-null      object 
 1   age       8 non-null      float64
 2   fare      9 non-null      float64
 3   survived  10 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 400.0+ bytes
titanic.isna()
| sex | age | fare | survived | |
|---|---|---|---|---|
| name | ||||
| Allen | False | True | False | False |
| Braund | False | False | False | False |
| Cumings | True | False | False | False |
| Futrelle | False | True | True | False |
| Futrelle | False | False | False | False |
| Heikkinen | False | False | False | False |
| Jussila | False | False | False | False |
| Madsen | False | False | False | False |
| Sloper | False | False | False | False |
| Vestrom | False | False | False | False |
titanic.dropna()
| sex | age | fare | survived | |
|---|---|---|---|---|
| name | ||||
| Braund | male | 22.0 | 7.25 | 0 |
| Futrelle | male | 37.0 | 53.10 | 0 |
| Heikkinen | female | 26.0 | 7.92 | 1 |
| Jussila | female | 20.0 | 9.82 | 0 |
| Madsen | male | 24.0 | 7.14 | 1 |
| Sloper | male | 28.0 | 35.50 | 1 |
| Vestrom | female | 14.0 | 7.85 | 0 |
titanic.dropna(axis='columns')
| survived | |
|---|---|
| name | |
| Allen | 0 |
| Braund | 0 |
| Cumings | 1 |
| Futrelle | 1 |
| Futrelle | 0 |
| Heikkinen | 1 |
| Jussila | 0 |
| Madsen | 1 |
| Sloper | 1 |
| Vestrom | 0 |
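Dropping is not the only option: missing values can also be replaced using pd.DataFrame.fillna() (not covered in detail in this session); a minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, np.nan, 30.0], index=["A", "B", "C"], name="age")

# replace NaN with the mean of the non-missing values (a common simple imputation)
filled = age.fillna(age.mean())
```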
Exercise 1¶
Load the full titanic dataset from "https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv", using the first column as index. Then:
- Create a new dataframe by removing rows with NA values: how many lines were removed?
- Create a new dataframe by removing columns with NA values: how many columns were removed? How many lines are left?
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv").set_index("name")
titanic1 = titanic.dropna()
len(titanic) - len(titanic1)
708
titanic2 = titanic.dropna(axis="columns")
print(len(titanic.columns) - len(titanic2.columns))
len(titanic2)
3
891
titanic.dropna(subset=["age"])
| sex | age | fare | survived | |
|---|---|---|---|---|
| name | ||||
| Braund | male | 22.0 | 7.25 | 0 |
| Cumings | NaN | 38.0 | 71.28 | 1 |
| Futrelle | male | 37.0 | 53.10 | 0 |
| Heikkinen | female | 26.0 | 7.92 | 1 |
| Jussila | female | 20.0 | 9.82 | 0 |
| Madsen | male | 24.0 | 7.14 | 1 |
| Sloper | male | 28.0 | 35.50 | 1 |
| Vestrom | female | 14.0 | 7.85 | 0 |
Renaming columns¶
We've seen modifying the DataFrame index using pd.DataFrame.set_index() and pd.DataFrame.reset_index(); but what about modifying column labels? We can use pd.DataFrame.rename(). The columns argument is a dict, where each key is a column name to rename, and each value is the new name: simple and expressive.
# rename column age to Age
titanic.rename(columns={"age": "Age"})
| sex | Age | fare | survived | |
|---|---|---|---|---|
| name | ||||
| Allen | male | NaN | 8.05 | 0 |
| Braund | male | 22.0 | 7.25 | 0 |
| Cumings | NaN | 38.0 | 71.28 | 1 |
| Futrelle | female | NaN | NaN | 1 |
| Futrelle | male | 37.0 | 53.10 | 0 |
| Heikkinen | female | 26.0 | 7.92 | 1 |
| Jussila | female | 20.0 | 9.82 | 0 |
| Madsen | male | 24.0 | 7.14 | 1 |
| Sloper | male | 28.0 | 35.50 | 1 |
| Vestrom | female | 14.0 | 7.85 | 0 |
Changing column types¶
We've seen that pd.DataFrame.info() informs us on the type of each column of a DataFrame, initially determined automatically by pd.read_csv(). It may sometimes be useful to change the type to something more appropriate: we can use pd.DataFrame.astype(). The argument to astype is a dict, where each key is a column name, and each value is the desired type for that column (similar to the columns argument of pd.DataFrame.rename() !).
titanic.dropna().astype({"age": int, "survived": bool, "sex": "category"})
| sex | age | fare | survived | |
|---|---|---|---|---|
| name | ||||
| Braund | male | 22 | 7.25 | False |
| Futrelle | male | 37 | 53.10 | False |
| Heikkinen | female | 26 | 7.92 | True |
| Jussila | female | 20 | 9.82 | False |
| Madsen | male | 24 | 7.14 | True |
| Sloper | male | 28 | 35.50 | True |
| Vestrom | female | 14 | 7.85 | False |
Remember: method chaining (i.e. pd.DataFrame.first_operation().second_operation().third_operation()) is an expressive and concise way to perform multiple operations on a DataFrame !
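A minimal sketch of such a chain on a hypothetical two-column DataFrame (each method returns a new DataFrame, so the calls compose left to right):

```python
import pandas as pd

df = pd.DataFrame({"x": ["1", "2", "3"], "y": [0.1, 0.2, 0.3]})

result = (df
          .rename(columns={"x": "count"})           # first operation
          .astype({"count": int})                   # second operation
          .sort_values("count", ascending=False))   # third operation
```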
Exercise 2¶
Load the full titanic dataset from "https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv", using the first column as index.
Perform the following operations in a chain:
- rename the column pclass to "ticket_class";
- remove rows with NA values in the age column;
- set the type of age to int, of survived to bool and of ticket_class to category.
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv").set_index("name")
titanic = titanic \
.rename(columns={"pclass": "ticket_class"}) \
.dropna(subset=["age"]) \
.astype({"age": int, "survived": bool, "ticket_class": "category"})
titanic.head()
| passengerId | survived | ticket_class | sex | age | sibsp | parch | ticket | fare | cabin | embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| name | |||||||||||
| Braund, Mr. Owen Harris | 1 | False | 3 | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | True | 1 | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| Heikkinen, Miss. Laina | 3 | True | 3 | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | True | 1 | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| Allen, Mr. William Henry | 5 | False | 3 | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Data Exploration¶
Exploratory Data Analysis (EDA) is a set of techniques and approaches aimed at analyzing data sets to summarize their main characteristics. It is a fundamental step towards using the data to train a predictive model (e.g. in machine learning) and to make hypotheses about the phenomena described in the data.
EDA has two main goals:
- Determine whether the data makes sense and is ready for analysis, or whether it needs further cleaning/manipulation;
- Find relevant patterns and trends in the data.
We will split EDA in 4 phases:
- Initial exploration
- Monovariate analysis
- Bivariate analysis
- Multivariate analysis
Initial exploration¶
Tries to answer the following questions:
- Has the data been loaded correctly?
  Do we need to rename columns or set an index?
- Is the dataset complete or does it contain missing data?
  Do we need to drop NAs?
- What are the variables and their types?
  Have the variable types been guessed correctly by pd.read_csv?
titanic = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/titanic.csv").set_index("name")
titanic.head()
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, Braund, Mr. Owen Harris to Dooley, Mr. Patrick
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   passengerId  891 non-null    int64  
 1   survived     891 non-null    int64  
 2   pclass       891 non-null    int64  
 3   sex          891 non-null    object 
 4   age          714 non-null    float64
 5   sibsp        891 non-null    int64  
 6   parch        891 non-null    int64  
 7   ticket       891 non-null    object 
 8   fare         891 non-null    float64
 9   cabin        204 non-null    object 
 10  embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 83.5+ KB
titanic = titanic\
.dropna(subset=["age", "embarked"]) \
.rename(columns={"pclass": "ticket_class"})\
.astype({"survived": bool, "ticket_class": "category",
"age": int, "sex": "category", "embarked": "category"})
Monovariate analysis¶
Obtain descriptive statistics of each numerical variable using pd.DataFrame.describe().
Visual inspection of each variable according to its type:
- continuous: sns.histplot() (note option kde=True)
- categorical: sns.countplot()
titanic.describe()
| passengerId | age | sibsp | parch | fare | |
|---|---|---|---|---|---|
| count | 712.000000 | 712.000000 | 712.000000 | 712.000000 | 712.000000 |
| mean | 448.589888 | 29.622191 | 0.514045 | 0.432584 | 34.567251 |
| std | 258.683191 | 14.502891 | 0.930692 | 0.854181 | 52.938648 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 222.750000 | 20.000000 | 0.000000 | 0.000000 | 8.050000 |
| 50% | 445.000000 | 28.000000 | 0.000000 | 0.000000 | 15.645850 |
| 75% | 677.250000 | 38.000000 | 1.000000 | 1.000000 | 33.000000 |
| max | 891.000000 | 80.000000 | 5.000000 | 6.000000 | 512.329200 |
titanic.describe(include="all")
| passengerId | survived | ticket_class | sex | age | sibsp | parch | ticket | fare | cabin | embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 712.000000 | 712 | 712.0 | 712 | 712.000000 | 712.000000 | 712.000000 | 712 | 712.000000 | 183 | 712 |
| unique | NaN | 2 | 3.0 | 2 | NaN | NaN | NaN | 541 | NaN | 133 | 3 |
| top | NaN | False | 3.0 | male | NaN | NaN | NaN | 347082 | NaN | G6 | S |
| freq | NaN | 424 | 355.0 | 453 | NaN | NaN | NaN | 7 | NaN | 4 | 554 |
| mean | 448.589888 | NaN | NaN | NaN | 29.622191 | 0.514045 | 0.432584 | NaN | 34.567251 | NaN | NaN |
| std | 258.683191 | NaN | NaN | NaN | 14.502891 | 0.930692 | 0.854181 | NaN | 52.938648 | NaN | NaN |
| min | 1.000000 | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
| 25% | 222.750000 | NaN | NaN | NaN | 20.000000 | 0.000000 | 0.000000 | NaN | 8.050000 | NaN | NaN |
| 50% | 445.000000 | NaN | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 15.645850 | NaN | NaN |
| 75% | 677.250000 | NaN | NaN | NaN | 38.000000 | 1.000000 | 1.000000 | NaN | 33.000000 | NaN | NaN |
| max | 891.000000 | NaN | NaN | NaN | 80.000000 | 5.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
# Categorical : countplot
plt.subplot(2,2,1)
sns.countplot(x="survived", data=titanic)
plt.subplot(2,2,2)
sns.countplot(x="ticket_class", data=titanic)
plt.subplot(2,2,3)
sns.countplot(x="sibsp", data=titanic)
plt.subplot(2,2,4)
sns.countplot(x="parch", data=titanic)
plt.tight_layout()
# Continuous : histplot
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
sns.histplot(x="age", data=titanic)
plt.subplot(2,2,2)
sns.histplot(x="sibsp", data=titanic)
plt.subplot(2,2,3)
sns.histplot(x="parch", data=titanic)
plt.subplot(2,2,4)
sns.histplot(x="fare", data=titanic, kde=True)
plt.tight_layout()
Bivariate analysis¶
Evaluate the statistical relationships between pairs of numerical variables using pd.DataFrame.corr(). A powerful way to visualise the resulting correlation matrix is using sns.heatmap(); note the option annot makes interpretation easier.
Visual inspection of the relationship between pairs of variables according to their type:
- continuous vs categorical: sns.boxplot() and sns.violinplot()
- continuous vs continuous: sns.jointplot() (note options kind="hex" and kind="kde")
- categorical vs categorical: sns.countplot() (using x=... and hue=...)
Remember to use the option hue to add a further categorical variable.
sns.heatmap(titanic.corr(method='spearman', numeric_only=True), annot=True)  # numeric_only restricts to numerical columns (required in recent pandas)
# Continuous vs categorical : boxplot (or violinplot)
# ... 2 examples:
sns.boxplot(x="survived", y="fare", data=titanic, hue="ticket_class")
plt.figure()
sns.violinplot(x="survived", y="fare", data=titanic, hue="sex", split=True, cut=0)
# Continuous vs continuous : jointplot
# ... 2 examples, using the only 2 continuous variables in this dataset
sns.jointplot(x="age", y="fare", data=titanic, hue="sex")
plt.figure()
sns.jointplot(x="age", y="fare", data=titanic, kind="hex")
# The kind=hex option allows us to see where most points are located (2D histogram)
# Categorical vs categorical : countplots using x= and hue=
# ... 2 examples
sns.countplot(x="survived", hue="sex", data=titanic)
plt.figure()
sns.countplot(x="ticket_class", hue="sex", data=titanic)
Multivariate analysis¶
Pseudo-multivariate visual inspection of the relationship between multiple variables according to their type. In addition to the use of hue in several cases above, there are two common techniques to extend a plot to take into account further categorical variables:
- Extend a scatterplot using sns.relplot();
- Extend a boxplot or countplot using sns.catplot() (remember kind="box" and kind="count" respectively).
Relplots¶
# Extend the age vs fare scatterplot to consider also sex, survived and pclass : 5 variables in total !!!
sns.relplot(x="age", y="fare", data=titanic, hue="sex", row='survived', col='ticket_class')
Catplots¶
# Extend the survived vs fare boxplot to consider also sex and pclass
sns.catplot(x="survived", y="fare", data=titanic, hue="sex", kind='box', col='ticket_class')
# Extend the survived countplot to consider also sex and pclass
sns.catplot(x="survived", data=titanic, hue="sex", kind='count', col='ticket_class')
Hypothesis testing¶
Hypothesis testing is one of the workhorses of science: it is how we can draw conclusions or make decisions based on finite samples of data; in this context, we will be using hypothesis testing to determine whether or not our dataset provides sufficient evidence to claim that a given null hypothesis is false.
R is a language dedicated to statistics, while Python is a general-purpose language with statistics modules. R has more statistical analysis features and specialized syntaxes, but Python can still perform basic hypothesis testing. In particular, we will be using the stats module of the scipy library.
from scipy import stats
Continuous vs categorical (binary): comparing two means¶
Many experimental measurements are reported as real-valued numbers, and the simplest comparison we can make is between two groups: the basic test for such situations is the t-test. The t-test comes in multiple flavors, and the specific type of test to use depends on a variety of factors. Below are listed the scipy.stats functions that perform the test, always returning the t statistic and the p-value.
- one-sample test: ttest_1samp(x, popmean)
  the null hypothesis is that the mean of sample x is equal to a given population mean popmean.
- two-sample tests: the null hypothesis is that samples x1 and x2 have the same mean;
  - for independent samples: ttest_ind(x1, x2)
    the test assumes equal variances (homoscedasticity) in the two groups;
    specifying equal_var=False performs Welch's t-test, which relaxes this assumption.
  - for paired samples: ttest_rel(x1, x2)
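On synthetic samples (made-up values, only to illustrate the call signatures), the variants look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(loc=0.0, scale=1.0, size=100)    # group 1
x2 = rng.normal(loc=0.5, scale=1.0, size=100)    # group 2, shifted mean
x1_after = x1 + rng.normal(0.2, 0.1, size=100)   # paired "after" measurement of group 1

t1, p1 = stats.ttest_1samp(x1, 0.0)                  # H0: mean(x1) == 0
t2, p2 = stats.ttest_ind(x1, x2)                     # H0: equal means, equal variances
t2w, p2w = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's t-test
t3, p3 = stats.ttest_rel(x1, x1_after)               # paired samples
```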
t, p_value = stats.ttest_1samp(titanic.loc[:, "age"], 42.4)
t, p_value = stats.ttest_ind(titanic.set_index("survived").loc[False, "age"],
titanic.set_index("survived").loc[True, "age"])
titanic.groupby("survived").var().loc[:, "age"]
survived
False    200.664832
True     221.781601
Name: age, dtype: float64
Exercise 3¶
Determine if there is a statistically significant association between the age of passengers and their sex.
titanic.groupby("sex").var().loc[:, "age"]
sex
female    196.056658
male      215.736213
Name: age, dtype: float64
t, p_value = stats.ttest_ind(titanic.set_index("sex").loc["male", "age"],
titanic.set_index("sex").loc["female", "age"])
Non-parametric tests for comparing two means¶
All t-tests assume that errors are normally distributed. Non-parametric tests have been developed that relax this assumption. The non-parametric versions of the two-sample tests are very similar to the t-tests, and return the test statistic and the p-value.
- for independent samples: mannwhitneyu(x1, x2)
- for paired samples: wilcoxon(x1, x2)
u, p_value = stats.mannwhitneyu(titanic.set_index("survived").loc[False, "fare"],
titanic.set_index("survived").loc[True, "fare"])
Continuous vs categorical: comparing several means¶
Analysis of variance (ANOVA) is a collection of statistical models that generalise t-tests beyond two means. In particular one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. ANOVA can be performed using stats.f_oneway(s1, s2, s3, ...), returning the test statistic as well as the p-value.
ANOVA has important assumptions that must be satisfied in order for the associated p-value to be valid:
- The samples are independent.
- Each sample is from a normally distributed population.
- The population standard deviations of the groups are all equal (homoscedasticity).
The Kruskal-Wallis H-test is a non-parametric version of ANOVA that relaxes assumptions 2 and 3. The null hypothesis is that the population medians of all the groups are equal. The K-W test can be performed using stats.kruskal(s1, s2, s3, ...), returning the test statistic as well as the p-value. For the p-value to be valid, the number of measurements in each group must not be too small; a typical rule is that each sample must have at least 5 measurements.
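A sketch on synthetic groups (made-up values): one group has a clearly shifted mean, so both tests should reject the null hypothesis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, size=50)
g2 = rng.normal(0.0, 1.0, size=50)
g3 = rng.normal(2.0, 1.0, size=50)  # clearly shifted mean

f, p_anova = stats.f_oneway(g1, g2, g3)  # one-way ANOVA
h, p_kw = stats.kruskal(g1, g2, g3)      # non-parametric alternative
```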
titanic.groupby("ticket_class").var().loc[:, "age"]
ticket_class
1    218.755405
2    196.763880
3    156.028997
Name: age, dtype: float64
f, p_value = stats.f_oneway(titanic.set_index("ticket_class").loc[1, "age"],
titanic.set_index("ticket_class").loc[2, "age"],
titanic.set_index("ticket_class").loc[3, "age"])
Exercise 4¶
Determine if there is a statistically significant association between the fare paid by passengers and their ticket_class.
titanic.groupby("ticket_class").var().loc[:, "fare"]
ticket_class
1    6608.637062
2     173.908290
3     100.865030
Name: fare, dtype: float64
h, p_value = stats.kruskal(titanic.set_index("ticket_class").loc[1, "fare"],
titanic.set_index("ticket_class").loc[2, "fare"],
titanic.set_index("ticket_class").loc[3, "fare"])
Continuous vs continuous: correlation¶
The statistical relationship between two continuous variables can be quantified by means of a correlation coefficient. It is possible to perform an associated test, where the null hypothesis is that such a coefficient is equal to zero. Below are the methods that perform such tests, and return the calculated coefficient as well as the p-value:
- Pearson's correlation coefficient is a measure of linear correlation between two sets of data: stats.pearsonr(x, y)
- Spearman's correlation coefficient is a measure of monotonic correlation between two sets of data: stats.spearmanr(x, y)
r, p_value = stats.spearmanr(titanic.loc[:, "fare"],
titanic.loc[:, "age"])
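As a sanity check, on perfectly linear synthetic data both coefficients are exactly 1:

```python
import numpy as np
from scipy import stats

x = np.arange(10.0)
y = 2.0 * x + 1.0  # exact linear (hence also monotonic) relationship

r, p_r = stats.pearsonr(x, y)     # linear correlation: r == 1.0
rho, p_s = stats.spearmanr(x, y)  # monotonic correlation: rho == 1.0
```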
Categorical vs categorical: comparing frequencies¶
Many measurement devices in biology are based on sampling and counting of molecules. The simplest comparisons we can make involve comparing such counts or frequencies among groups. Count data can often be conveniently stored in contingency tables; the family of $\chi^2$ tests is used in the analysis of contingency tables.
We can calculate contingency tables with pandas, for example using pd.crosstab(series1, series2), or pd.Series.value_counts() for one-dimensional tables.
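On a tiny hypothetical example (made-up values):

```python
import pandas as pd

sex = pd.Series(["male", "female", "male", "male", "female"], name="sex")
survived = pd.Series([True, True, False, True, False], name="survived")

table = pd.crosstab(sex, survived)  # 2x2 contingency table of counts
counts = sex.value_counts()         # one-dimensional table
```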
Below are listed the scipy.stats functions that perform the tests, returning (unless otherwise indicated) the $\chi^2$ statistic and the p-value.
- one-sample test: chisquare(f_obs, f_exp)
  the null hypothesis is that the categorical data in the (one-dimensional) contingency table f_obs has the same frequencies as f_exp; by default, f_exp is taken as a uniform distribution.
- two-sample test: chi2_contingency(observed)
  the null hypothesis is that the observed frequencies in the contingency table observed correspond to independent variables; expected frequencies are computed based on the marginal sums under the assumption of independence. It returns 4 values: the $\chi^2$ statistic, the p-value, the number of degrees of freedom and the expected contingency table.
An often quoted guideline for the validity of the two-sample $\chi^2$ test is that the observed and expected frequencies in each cell should be at least 5. Fisher's exact test (or hypergeometric test) is an alternative, equivalent test that relaxes this assumption; it can be performed on a 2x2 contingency table using fisher_exact(observed), returning the odds-ratio and the p-value.
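A minimal sketch of Fisher's exact test on a made-up 2x2 table:

```python
from scipy import stats

observed = [[8, 2],   # e.g. group A: 8 successes, 2 failures (made-up counts)
            [1, 5]]   #      group B: 1 success,  5 failures

odds_ratio, p_value = stats.fisher_exact(observed)
# odds_ratio is the sample odds ratio: (8*5)/(2*1) == 20
```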
chi2, p_value = stats.chisquare(titanic.loc[:, "sex"].value_counts())
contingency = pd.crosstab(titanic.loc[:, "sex"], titanic.loc[:, "survived"])
contingency
| survived | False | True |
|---|---|---|
| sex | ||
| female | 64 | 195 |
| male | 360 | 93 |
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
# odds_ratio, p_value = stats.fisher_exact(contingency)
Exercise 5¶
Determine if there is a statistically significant association between the ticket_class of passengers and their survival.
contingency = pd.crosstab(titanic.loc[:, "ticket_class"], titanic.loc[:, "survived"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
mini-Project¶
Tips¶
Take the Tips dataset from the previous notebook and perform a full EDA using all the techniques described above.
On the basis of the EDA, formulate 3 hypotheses on the underlying phenomenon.
The data is available at the following address:
https://marcopasi.github.io/physenbio_pyDAV/data/tips.csv
Here is a description of the variables:
| Variable | Definition | Key |
|---|---|---|
| total_bill | Total bill (cost of the meal), including tax, in US dollars | |
| tip | Tip (gratuity) in US dollars | |
| sex | Sex of person paying for the meal | Female, Male |
| smoker | Smoker in party? | Yes, No |
| day | Day of the week | Thu, Fri, Sat, Sun |
| time | Time of day | Lunch, Dinner |
| size | Size of the party |
tips = pd.read_csv("https://marcopasi.github.io/physenbio_pyDAV/data/tips.csv")
_tips = tips  # keep a reference to the original DataFrame before modifying it
tips.head()
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
tips = tips.astype({"sex":"category", "smoker":"category", "day":"category", "time":"category"})
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB
tips.describe()
tips.describe(include=['category'])
| sex | smoker | day | time | |
|---|---|---|---|---|
| count | 244 | 244 | 244 | 244 |
| unique | 2 | 2 | 4 | 2 |
| top | Male | No | Sat | Dinner |
| freq | 157 | 151 | 87 | 176 |
# univariate
plt.subplot(2,2,1)
sns.countplot(x="sex", data=tips)
plt.subplot(2,2,2)
sns.countplot(x="smoker", data=tips)
plt.subplot(2,2,3)
sns.countplot(x="day", data=tips)
plt.subplot(2,2,4)
sns.countplot(x="time", data=tips)
plt.tight_layout()
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.histplot(x="size", data=tips)
plt.subplot(2,2,2)
sns.histplot(x="total_bill", data=tips, kde=True)
plt.subplot(2,1,2)
sns.histplot(x="tip", data=tips, kde=True)
plt.tight_layout()
# ...
License¶

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.