This was my final project for Udacity's course Intro to Data Analysis.
The main task of the project is to analyze a dataset and then communicate the findings about it. The Python libraries NumPy, Pandas, and Matplotlib are used to make the analysis easier.
Udacity's Introduction to the project
For the final project, you will conduct your own data analysis and create a file to share that documents your findings. You should start by taking a look at your dataset and brainstorming what questions you could answer using it. Then you should use Pandas and NumPy to answer the questions you are most interested in, and create a report sharing the answers. You will not be required to use statistics or machine learning to complete this project, but you should make it clear in your communications that your findings are tentative. This project is open-ended in that we are not looking for one right answer.
Choosing a dataset
There are two available data sets:
- Titanic Data - Contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic. You can view a description of this dataset on the Kaggle website, where the data was obtained.
- Baseball Data - A data set containing complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. This dataset contains many files, but you can choose to analyze only the one(s) you are most interested in.
Getting started
To stay organized, I created a new folder for the project, which contains this IPython notebook and the dataset "titanic_data.csv".
Analyzing The Data
In [1]:
# importing the python libraries that will be used for analysis and visualization
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# reading the csv file using pandas
titanic_df = pd.read_csv('titanic_data.csv')
Now I'm going to take a look at the dataset to see what data it contains.
In [3]:
# viewing the first 10 rows of the data set
titanic_df.head(10)
Out[3]:
Questions
The first thing I wondered about each passenger was whether they survived, and, of those 891 passengers, how many survived and how many didn't. How was survival related to other factors? Did age have an effect? Did being male or female change the chance of survival? And did being with relatives give a higher chance of survival? For now, the variables I'm interested in are:
Variable | Definition | Key | Notes |
---|---|---|---|
survival | Survival | 0 = No, 1 = Yes | dependent variable |
Sex | Sex | | independent variable |
Age | Age in years | | independent variable |
SibSp | # of siblings / spouses aboard the Titanic | | independent variable |
Parch | # of parents / children aboard the Titanic | | independent variable |
In [4]:
# viewing the first 10 rows of the data set
titanic_df.head(10)
Out[4]:
I noticed a few missing values in the age column and many in the cabin column, so I'll investigate the data further to see how many values are missing and whether there are any other problems with the data.
In [5]:
# descriptive statistics
titanic_df.describe()
Out[5]:
The describe function displayed only the numerical columns, so I'll count the values in the columns with other data types.
In [6]:
# take a look at the data type of each column
titanic_df.dtypes
Out[6]:
In [7]:
# counting non numerical columns
print('count of Name column : ', titanic_df['Name'].count())
print('count of Sex column : ', titanic_df['Sex'].count())
print('count of Ticket column : ', titanic_df['Ticket'].count())
print('count of Cabin column : ', titanic_df['Cabin'].count())
print('count of Embarked column : ', titanic_df['Embarked'].count())
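The five separate count calls above could also be collapsed into a single expression. A minimal sketch, assuming a small made-up frame `demo_df` (a hypothetical stand-in for `titanic_df`, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_df: two text columns and one numeric column.
demo_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Cabin': ['C85', np.nan, np.nan, 'E46'],
    'Age': [22.0, np.nan, 26.0, 35.0],
})

# Keep only the non-numeric columns, then count the non-null values in each.
non_numeric_counts = demo_df.select_dtypes(exclude='number').count()
print(non_numeric_counts)
```

Applied to `titanic_df`, the same expression should report one count per non-numeric column without naming each column by hand.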
In [8]:
titanic_df.isnull().sum()
Out[8]:
- Only 204 passengers out of 891 have a cabin number. Since most of the data is missing from this column and there are no inferences I want to draw from it, I'll just ignore it.
- The age column has 177 missing values. By default, pandas skips null values when performing calculations, so the analysis of the effect of age on survival will be done on only 714 passengers.
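The way pandas skips null values can be seen on a tiny series. This is only an illustration with made-up ages, not values from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

n_valid = ages.count()                 # counts only the non-null entries
mean_default = ages.mean()             # NaN is skipped by default (skipna=True)
mean_strict = ages.mean(skipna=False)  # NaN propagates when skipping is disabled

print(n_valid, mean_default, mean_strict)
```

The default mean is computed over the three valid ages only, which is the behaviour the age analysis relies on.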
Survival rate in general
In [9]:
# counting how many passengers survived and how many didn't
survival_count = titanic_df.groupby('Survived')['PassengerId'].count()
survival_count
Out[9]:
In [10]:
((survival_count/891)*100).plot(kind='pie', autopct='%.2f', figsize=[5,5])
Out[10]:
As we can see, 61.62% (549) of the passengers didn't survive.
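The percentages can also be computed without hard-coding the total of 891. A minimal sketch, assuming a made-up `survived` series (hypothetical, standing in for `titanic_df['Survived']`):

```python
import pandas as pd

# Made-up survival flags: 0 = did not survive, 1 = survived.
survived = pd.Series([0, 0, 0, 1, 1])

# normalize=True returns proportions, so the denominator follows the data.
survival_pct = survived.value_counts(normalize=True) * 100
print(survival_pct)
```

Because the denominator comes from the series itself, the result stays correct if rows are filtered out beforehand.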
Now I'm going to see whether survival was related to other factors, such as age and sex.
Survival in relation to Age
In [11]:
# get the rows with a valid age value, keeping only the 2 columns needed right now: Survived and Age
valid_age_df = titanic_df[titanic_df['Age'].notnull()][['Survived', 'Age']]
len(valid_age_df)
Out[11]:
I got 714 valid entries that I can do the analysis on.
In [12]:
# descriptive statistics on the age
valid_age_df['Age'].describe()
Out[12]:
In [13]:
valid_age_df['Age'].plot(kind='box')
Out[13]:
In [14]:
valid_age_df['Age'].plot.hist()
Out[14]:
The mean age is about 30 years, the minimum is less than a year, and the maximum is 80.
Looking at the box plot, we can identify some outliers, and the histogram shows that most of the ages are between 20 and 40.
In [15]:
# removing the outliers
# keep only the ages that are within 3 standard deviations of the mean age
age_df = valid_age_df[np.abs(valid_age_df['Age']-valid_age_df['Age'].mean()) <= (3*valid_age_df['Age'].std())]
age_df['Age'].describe()
Out[15]:
The count is now 712: 2 outliers were removed, and the maximum age is now 71 years.
In [16]:
# group the data by survival state
age_df.groupby('Survived').describe()
Out[16]:
Age and survival Inferential Statistics
I'm going to split age into categories and perform a chi-square test. The age categories I'll create are:
- Children (00-14 years)
- Youth (15-24 years)
- youngAdults (25-44 years)
- oldAdults (45-64 years)
- Seniors (65 years and over)
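The same categories can be expressed as bin edges. The following is a sketch of an alternative to filtering each range by hand, using `pd.cut` and `pd.crosstab` on a made-up frame `demo_df` (hypothetical data, not the Titanic rows); the column name `age_cat` is also an assumption:

```python
import numpy as np
import pandas as pd

# Made-up ages and survival flags for illustration only.
demo_df = pd.DataFrame({
    'Age':      [2, 14, 15, 24, 30, 50, 70, 8],
    'Survived': [1, 1,  0,  1,  0,  0,  0,  0],
})

# Right-closed bins: (0, 14], (14, 24], (24, 44], (44, 64], (64, inf)
bins = [0, 14, 24, 44, 64, np.inf]
labels = ['children', 'youth', 'young_adult', 'old_adult', 'senior']
demo_df['age_cat'] = pd.cut(demo_df['Age'], bins=bins, labels=labels)

# Rows are age categories, columns are the survival states 0 and 1.
observed = pd.crosstab(demo_df['age_cat'], demo_df['Survived'])
print(observed)
```

The bin edges mirror the boundaries in the list above, so an age of exactly 14 lands in the children group and 15 in the youth group.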
In [17]:
# Categorizing and counting each survival state for each age category
children = age_df[(age_df['Age'] <= 14)].groupby('Survived').count()
youth = age_df[(age_df['Age'] > 14) & (age_df['Age'] <= 24)].groupby('Survived').count()
young_adult = age_df[(age_df['Age'] > 24) & (age_df['Age'] <= 44)].groupby('Survived').count()
old_adult = age_df[(age_df['Age'] > 44) & (age_df['Age'] <= 64)].groupby('Survived').count()
senior = age_df[(age_df['Age'] > 64 )].groupby('Survived').count()
print ('Children data: ', children, end='\n\n')
print ('Youth data: ', youth, end='\n\n')
print ('youngAdult data: ', young_adult, end='\n\n')
print ('oldAdult data: ', old_adult, end='\n\n')
print ('Senior data: ', senior, end='\n\n')
In [18]:
youth.sum()
Out[18]:
Chi-Squared test of Independence
For this test I need the observed value and the expected value for each category:
chi^2 = sum( (f_obs - f_exp)^2 / f_exp )
where f_obs is the observed value and f_exp is the expected value.
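To make the formula concrete, here is a small worked sketch in NumPy. It uses the male/female survival counts reported in the gender section of this post; the variable names are illustrative and not part of the notebook:

```python
import numpy as np

# Observed counts from the gender section of this post.
# Rows: male, female. Columns: survived, did not survive.
observed = np.array([[109, 468],
                     [233,  81]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)  # passengers per row category
col_totals = observed.sum(axis=0, keepdims=True)  # passengers per survival state

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / n

# Sum of (f_obs - f_exp)^2 / f_exp over all cells.
chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(chi2)
```

The expected counts always add up to the same total as the observed counts, which is a quick sanity check on the calculation.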
In [19]:
# organize the data for chi square test
age_cats = [children, youth, young_adult, old_adult, senior]
age_cats_labels = ['children', 'youth', 'young_adult', 'old_adult', 'senior']
survived = [cat.iloc[1]['Age'] if len(cat)>1 else 0 for cat in age_cats]
notsurvived = [cat.iloc[0]['Age'] for cat in age_cats]
age_observed_df = pd.DataFrame(data = {
    'survived' : survived,
    'notsurvived' : notsurvived
}, index = age_cats_labels)
N = sum(age_observed_df.sum(axis=1)) # total number
cats_count = age_observed_df.sum(axis=1) # number of passengers for each category
# calculating the expected value for each category
def generate_expected_df(observed_df):
    N = sum(observed_df.sum(axis=1)) # total number
    # calculating the expected value for each category
    expected_df = pd.DataFrame(data = {
        'survived' : (observed_df.sum(axis=1)/N) * observed_df.sum()['survived'],
        'notsurvived' : (observed_df.sum(axis=1)/N) * observed_df.sum()['notsurvived']
    })
    return expected_df
age_expected_df = generate_expected_df(age_observed_df)
In [20]:
age_observed_df.sum()
Out[20]:
In [21]:
age_expected_df.sum()
Out[21]:
In [22]:
# calculating chi squared
chi2 = (((age_observed_df - age_expected_df)**2) / age_expected_df).values.sum()
Calculating chi square using the Python SciPy library
In [23]:
# import scipy.stats
import scipy.stats as st
In [24]:
# function that calculates chi square
def calc_chi2(observed_df, expected_df):
    f_obs = list(observed_df['survived']) + list(observed_df['notsurvived'])
    f_exp = list(expected_df['survived']) + list(expected_df['notsurvived'])
    return st.chisquare(f_obs=f_obs, f_exp=f_exp)
age_chi2stat, age_chi2p = calc_chi2(age_observed_df, age_expected_df)
print('age chi^2 statistic = ', age_chi2stat)
print('age chi^2 p-value = ', age_chi2p)
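One point that may be worth cross-checking is the degrees of freedom behind the p-value. With its default `ddof=0`, `scipy.stats.chisquare` treats the flattened list as a one-way table with `len(f_obs) - 1` degrees of freedom, whereas a test of independence on an r x c table conventionally uses (r-1)(c-1). The sketch below compares the two on the male/female counts reported in the gender section of this post, using `scipy.stats.chi2_contingency` with the continuity correction turned off; it is an illustration under those assumptions, not a replacement for the notebook's results:

```python
import numpy as np
import scipy.stats as st

# Rows: male, female. Columns: survived, did not survive.
observed = np.array([[109, 468],
                     [233,  81]])

# Test of independence: dof is (rows - 1) * (columns - 1).
chi2_stat, p_ind, dof, expected = st.chi2_contingency(observed, correction=False)

# Flattened one-way call with the default ddof=0: dof is (number of cells - 1).
stat_flat, p_flat = st.chisquare(f_obs=observed.ravel(), f_exp=expected.ravel())

# Passing ddof reduces the one-way dof to match the independence test.
ddof = observed.size - 1 - dof
stat_adj, p_adj = st.chisquare(f_obs=observed.ravel(), f_exp=expected.ravel(), ddof=ddof)

print('dof (independence) =', dof)
print('p (independence)   =', p_ind)
print('p (flattened)      =', p_flat)
print('p (with ddof)      =', p_adj)
```

The statistic is the same in all three calls; only the p-value changes with the degrees of freedom.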
Chi squared test results for the Age
chi^2 statistic = 17.751100308595902
chi^2 p-value = 0.038173517948452126
So, at an alpha level of 0.05, there is a relationship between age and survival, but we need an effect size measure to see how much effect age has on survival. I'll use Cramer's V as the effect size measure:
Cramer's V = sqrt(chi2/(n(k-1)))
n = total number of passengers (sample size), k = the smaller of the number of rows or columns; in this case k = 2
In [25]:
import math
def clac_cramersv(chi2stat, n, k):
    return math.sqrt( chi2stat / ( n*(k-1) ) )
In [26]:
k = min(age_observed_df.shape)
age_cramers_v = clac_cramersv(age_chi2stat, N, k)
age_cramers_v
Out[26]:
Results and Conclusion (Survival in relation to Age)
chi^2 (4, N = 712) = 17.75, p = 0.038
Cramer's V = 0.16
These chi^2 results suggest that there is an association between age and survival, but Cramer's V indicates that the effect is small.
Survival in relation to gender
In [27]:
# count how many survived and how many didn't for each gender category
gender_df = titanic_df[['Survived', 'Sex']]
male = gender_df[(gender_df['Sex'] == 'male')].groupby('Survived').count()
female = gender_df[(gender_df['Sex'] == 'female')].groupby('Survived').count()
print('male data: ', male, end= '\n\n')
male.plot(kind='pie', autopct='%.2f', figsize=[5,5], subplots=True, title='male survival')
print('female data: ', female)
female.plot(kind='pie', autopct='%.2f', figsize=[5,5], subplots=True, title='Female survival')
Out[27]:
There are 891 passengers:
- 577 males: 109 survived (18.89%) and 468 did not (81.11%).
- 314 females: 233 survived (74.20%) and 81 did not (25.80%).
In [28]:
survived = [male.iloc[1]['Sex'], female.iloc[1]['Sex']]
notsurvived = [male.iloc[0]['Sex'], female.iloc[0]['Sex']]
sex_observed_df = pd.DataFrame(data = {
'survived' : survived,
'notsurvived' : notsurvived
}, index = ['male', 'female'])
N = sum(sex_observed_df.sum(axis=1)) # total number
# calculating the expected value for each category
sex_expected_df = generate_expected_df(sex_observed_df)
# calculate chi^2
sex_chi2stat, sex_chi2p = calc_chi2(sex_observed_df, sex_expected_df)
print('sex chi^2 statistic = ', sex_chi2stat)
print('sex chi^2 p-value = ', sex_chi2p)
# calculate Cramer's v
k = min(sex_observed_df.shape)
sex_cramers_v = clac_cramersv(sex_chi2stat, N, k)
print("sex Cramer's V = ", sex_cramers_v)
Chi squared test results for the Sex
chi^2 statistic = 263.05057407065567
chi^2 p-value = 9.83773178330153e-57 (very low)
So, at an alpha level of 0.05, or even 0.01, there is a relationship between sex and survival, and with Cramer's V = 0.54 the effect is large.
Results and Conclusion (Survival in relation to Sex)
chi^2 (1, N = 891) = 263.05, p = 9.84e-57 (p < 0.000001)
Cramer's V = 0.54
These chi^2 results suggest that there is an association between sex and survival, and Cramer's V indicates that the effect is large.
Survival in relation to family
Here I'm going to see whether being with family affects the chance of survival.
In [29]:
# slice the columns Survived, SibSp and Parch, which indicate whether a passenger had family on board
# create a new column "family": True if the passenger has family on board, False if not
family = (titanic_df['SibSp'] > 0) | (titanic_df['Parch'] > 0) # series of values for the family column
family_detailed_df = titanic_df[['Survived', 'SibSp', 'Parch']].copy() # copy to avoid pandas' SettingWithCopyWarning on the assignment below
family_detailed_df.loc[:,'family'] = family
In [30]:
family_detailed_df.head()
Out[30]:
In [31]:
family_df = family_detailed_df[['Survived', 'family']]
# getting the survival count for each family status
family_count = family_df.groupby(['Survived', 'family']).size()
print(family_count)
print('total (n) = ', family_count.sum())
print('# has family = ', family_count.loc[ : , True ].sum())
print('# no family = ', family_count.loc[ : , False ].sum())
In [32]:
family_count.plot.bar()
Out[32]:
There are 891 passengers:
- 354 with family on board: 179 survived and 175 did not.
- 537 without family on board: 163 survived and 374 did not.
In [33]:
survived = family_count.loc[1, : ].values
notsurvived = family_count.loc[0, : ].values
family_observed_df = pd.DataFrame(data = {
'survived' : survived,
'notsurvived' : notsurvived
}, index= ['nofamily', 'family'])
N = sum(family_observed_df.sum(axis=1)) # total number
# calculating the expected value for each category
family_expected_df = generate_expected_df(family_observed_df)
# calculate chi^2
family_chi2stat, family_chi2p = calc_chi2(family_observed_df, family_expected_df)
print('family chi^2 statistic = ', family_chi2stat)
print('family chi^2 p-value = ', family_chi2p)
# calculate Cramer's v
k = min(family_observed_df.shape)
family_cramers_v = clac_cramersv(family_chi2stat, N, k)
print("family Cramer's V = ", family_cramers_v)
Chi squared test results for the Family
chi^2 statistic = 36.85013084754587
chi^2 p-value = 4.949879871232269e-08 (very low)
So, at an alpha level of 0.05, or even 0.01, there is a relationship between having family on board and survival, and with Cramer's V = 0.20 the effect is small.
Results and Conclusion (Survival in relation to Family)
chi^2 (1, N = 891) = 36.85, p = 4.95e-08 (p < 0.000001)
Cramer's V = 0.20
These chi^2 results suggest that there is an association between having family on board and survival, but Cramer's V indicates that the effect is small.
Download project files from GitHub