문제

데이터 사이언스, 머신러닝, 캐글 입문자가 항상 추천 받는 타이타닉이다.
이 타이타닉 문제는 타이타닉 사고 당시 탑승했던 사람들의 신상 정보를 바탕으로 사람들의 생존 여부를 예측하는 모델을 만들어내는 것이다.
데이터 분석 도구(pandas, numpy), 데이터 시각화 도구(matplotlib, seaborn, plotly), 머신 러닝 도구(sklearn)를 사용할 것이다.
이 내용들은 이유한님의 타이타닉 튜토리얼을 보고 공부하며 만들었다.

로드맵

dataset확인 & 문제인식
EDA(exploratoty data analysis) - 탐색적 데이터 분석
1. 여러 feature들을 개별적으로 분석하고 feature들 간의 상관관계를 확인한다.
feature engineering
1. 모델을 세우기에 앞서, 모델의 성능을 높일 수 있도록 feature들을 engineering합니다.
model 만들기
1. sklearn의 RandomForestClassifier를 이용해 모델을 만든다.
model 학습 및 예측
1. trainset을 가지고 모델을 학습시킨 후, testset을 가지고 예측한다.
model 평가
1. 예측성능을 확인한다. 풀려는 문제에 따라 모델을 평가하는 방식도 달라진다.

0. 초기 세팅

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: <https://github.com/kaggle/docker-python>
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/titanic/train.csv

/kaggle/input/titanic/test.csv

/kaggle/input/titanic/gender_submission.csv

캐글에서 notebook을 새로 만들면 항상 따라오는 코드이다. pandas와 numpy를 제공한다.
그리고 문제를 해결하는데 필요한 파일이 어디에 있는지 출력해준다. (train.csv, test.csv)

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
sns.set(font_scale=2.5)
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

데이터 시각화를 위한 matplotlib.pyploy과 seaborn 모듈을 가져온다.
데이터의 NULL값을 찾기 위한 missingno 모듈과 불필요한 경고문을 출력하지 않기 위해 warnings모듈을 가져와 ignore해준다.
seaborn을 통해 출력 될 그래프의 기본 세팅을 해준다.

colors = sns.color_palette('Set2')

sns.set_theme(style="whitegrid", palette=colors, context = 'notebook', font_scale=1.5)

1. Dataset 확인

테이블화 된 데이터를 다루는 데 가장 최적화된 파이썬의 pandas 라이브러리를 사용하여 데이터를 확인해 본다.
캐글에서 dataset은 보통 train, test으로 나뉘어 있다.

dbtrain = pd.read_csv('../input/titanic/train.csv')
dbtest = pd.read_csv('../input/titanic/test.csv')
dbtrain

1.1 Null data check

dataset에 존재하는 null값이 모델에 영향을 미치기 때문에 얼마나 분포되어 있는지 확인해야 한다.
missingno 라이브러리를 이용하여 시각화 해준다.

msno.matrix(df=dbtrain.iloc[:,:], figsize=(8,8))

각 열마다 null이 얼마나 있는지 수치화해준다.

print('train.csv')
for col in dbtrain.columns:
    msg_train = 'column: {:>10}\\t Percent of NaN value: {:.2f}%'.format(col, 100 * (dbtrain[col].isnull().sum() / dbtrain[col].shape[0]))
    print(msg_train)

print('test.csv')
for col in dbtest.columns:
    msg_test = 'column: {:>10}\\t Percent of NaN value: {:.2f}%'.format(col, 100 * (dbtest[col].isnull().sum() / dbtest[col].shape[0]))
    print(msg_test)

train.csv
column: PassengerId	 Percent of NaN value: 0.00%
column:   Survived	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 19.87%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.00%
column:      Cabin	 Percent of NaN value: 77.10%
column:   Embarked	 Percent of NaN value: 0.22%
test.csv
column: PassengerId	 Percent of NaN value: 0.00%
column:     Pclass	 Percent of NaN value: 0.00%
column:       Name	 Percent of NaN value: 0.00%
column:        Sex	 Percent of NaN value: 0.00%
column:        Age	 Percent of NaN value: 20.57%
column:      SibSp	 Percent of NaN value: 0.00%
column:      Parch	 Percent of NaN value: 0.00%
column:     Ticket	 Percent of NaN value: 0.00%
column:       Fare	 Percent of NaN value: 0.24%
column:      Cabin	 Percent of NaN value: 78.23%
column:   Embarked	 Percent of NaN value: 0.00%

train, test 두 군데에서 Age(약 20%) , Cabin(약77%)의 null값이 나왔고, train만 embarked에서(0.22%)의 null값이 나왔음을 알 수 있다.

1.2 Target label 확인

target label이 어떤 분포인지 확인해봐야 한다.

f, ax = plt.subplots(1,2,figsize=(18,8))
dbtrain['Survived'].value_counts().plot.pie(explode=[0,0.1], autopct='%1.1f%%', ax=ax[0])
sns.countplot('Survived', data=dbtrain,ax=ax[1])

2. EDA(exploratoty data analysis) - 탐색적 데이터 분석

2.1 Pclass

Pclass의 인구수 분포

sns.countplot('Pclass', hue='Survived', data=dbtrain)

Pclass의 Survived분포

sns.countplot('Pclass', data=dbtrain)

Pclass의 생존률

dbtrain[['Pclass', 'Survived']].groupby(['Pclass'], as_index=True).mean().sort_values(by='Survived', ascending = False).plot.bar()

Pclass가 3인 곳에서 가장 많은 사람이 탔고, 생존률은 1이 가장 높다.
생존률이 1,2,3 순서대로 높으니 생존에 Pclass가 큰 영향을 미친다고 생각해볼 수 있다.

2.2 Sex

Sex의 Survived 분포

sns.countplot('Sex', hue='Survived', data=dbtrain)

성별에 따른 생존자 숫자

sp=dbtrain[dbtrain['Survived']==0].index
dbtrain_save = dbtrain.drop(sp)
sns.countplot('Sex', hue='Survived', data=dbtrain_save)

train 데이터 셋의 성별에 따른 분포 상태(표)

dbtrain.groupby(['Sex']).mean()

여성은 74%확률로 살아남고, 남자는 18%확률로 살아남았다.
성별도 예측 모델에 쓰일 중요한 feature이다.

2.3 Sex and Pclass

Sex, Pclass 두 가지에 관하여 생존이 어떻게 달라지는 보도록 하자

#sex & Pclass
sns.factorplot('Pclass', 'Survived', hue='Sex', data=dbtrain)

2.4 Age

나이의 분포 확인

print('oldest : {:.1f}'.format(dbtrain['Age'].max()))
print('youngest : {:.1f}'.format(dbtrain['Age'].min()))
print('average : {:.1f}'.format(dbtrain['Age'].mean()))

oldest : 80.0
youngest : 0.4
average : 29.7

생존에 따른 나이의 histogram

fig, ax=plt.subplots(1,1)
sns.kdeplot(dbtrain[dbtrain['Survived']==1]['Age'],ax=ax)
sns.kdeplot(dbtrain[dbtrain['Survived']==0]['Age'],ax=ax)
# sns.kdeplot(dbtrain['Age'])
plt.legend(['Survived=1', 'Survived=0'])

생존자 중 나이가 어린 경우가 많음을 알 수 있다.
Pclass에 따른 나이의 histogram

#Pclass & Age
sns.kdeplot(dbtrain[dbtrain['Pclass']==1]['Age'])
sns.kdeplot(dbtrain[dbtrain['Pclass']==2]['Age'])
sns.kdeplot(dbtrain[dbtrain['Pclass']==3]['Age'])
plt.legend(['Pclass=1', 'Pclass=2', 'Pclass=3'])

Pclass가 높을 수록 나이가 많은 사람의 비중이 커진다.
나이대가 변하면서 생존률이 어떻게 되는지 보자.

arr=[]
for i in range(1,80):
    arr.append(dbtrain[dbtrain['Age']<i]['Survived'].sum()/len(dbtrain[dbtrain['Age']<i]['Survived']))
    # print('head',dbtrain[dbtrain['Age']<i]['Survived'].sum(), 'bottom',len(dbtrain[dbtrain['Age']<i]['Survived']) )
plt.plot(arr)

나이가 어릴 수록 생존률이 확실히 높은 것을 확인할 수 있다.
나이가 중요한 feature임을 알 수 있다.

2.5 Pclass, Sex, Age

Pclass, Sex, Age 모두에 대하여 seaborn의 violinplot을 통해 보도록 하자
Pclass별로 Age의 분포가 어떻게 다르고, 생존 여부는 어떻게 되는지 나타내었다.

plt.figure(figsize=(14,8))
sns.violinplot('Pclass', 'Age', hue = 'Survived', data=dbtrain, split=True)

Sex, 생존에 따른 분포가 어떻게 되는지 나타내었다.

plt.figure(figsize=(14,8))
sns.violinplot('Sex', 'Age', hue = 'Survived', data=dbtrain, split=True)

종합적으로 봤을 때, 여자와 아이가 생존률이 높은 것을 알 수 있다.

2.6 Embarked

탑승한 곳에 따른 생존률을 나타내었다.

print(dbtrain[['Embarked', 'Survived']].groupby(['Embarked']).mean())
plt.plot(dbtrain[['Embarked', 'Survived']].groupby(['Embarked']).mean())
plt.legend(['Survived==1 Rate'])

          Survived
Embarked
C         0.553571
Q         0.389610
S         0.336957

C가 가장 높은 생존률을 갖고 있다.
탑승한 곳에 따른 성별 분포

sns.countplot('Embarked',hue='Sex', data=dbtrain)

S에서 남자가 여자보다 많이 탔고, C와 Q는 비슷한 비율로 탔다.
탑승한 곳에 따른 생존 분포

sns.countplot('Embarked',hue='Survived', data=dbtrain)

S는 생존률이 다른 곳보다 낮다.
탑승한 곳에 따른 Pclass 분포

sns.countplot('Embarked',hue='Pclass', data=dbtrain)

C가 생존률이 높은 이유는 Pclass가 1인 사람들이 많이 탔기 때문이다.
Embarked는 feature로 쓰긴 애매하긴 하지만 일단 넣어보도록 하겠다.

2.7 Family - SibSp(형제 자매) + Parch(부모 자녀)

SibSp와 Parch를 합하면 Family로 볼 수 있다.
가족 크기에 따른 탑승객 수 분포

dbtrain['FamilySize'] = dbtrain['SibSp']+dbtrain['Parch']+1 # self counting contain
dbtest['FamilySize'] = dbtest['SibSp']+dbtest['Parch']+1 # self counting contain
sns.countplot('FamilySize',data=dbtrain)

가족의 크기가 1 ~ 11이고 1인 가족이 많다.
가족 크기에 따른 생존률

plt.plot(dbtrain[['FamilySize', 'Survived']].groupby(['FamilySize']).mean())
plt.legend(['Survived==1 Rate'])

4명인 경우 가장 생존률이 높고, 가족 수가 많아질 수록 또는 적어질 수록 생존률이 떨어진다.

2.8 Fare

Fare는 연속적인 데이터로 볼 수 있다.

sns.distplot(dbtrain['Fare'])
print('skewness :',round(dbtrain['Fare'].skew(), 2))

skewness : 4.79

skewness는 비대칭도 인데 4.79로 많이 비대칭 이므로 정규분포 그래프 형태로 바꿔주는 것이 모델에 더 좋다.
만약 이 모양 그대로 모델에 넣어준다면 모델이 잘못 학습할 수 도 있다.
Fare에 log를 취하여 정규분포와 가까운 모양으로 만들어 줄 것이다.

dbtest.loc[dbtest.Fare.isnull(), 'Fare'] = dbtest['Fare'].mean() # testset 에 있는 nan value 를 평균값으로 치환합니다.

dbtrain['Fare'] = dbtrain['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
dbtest['Fare'] = dbtest['Fare'].map(lambda i: np.log(i) if i > 0 else 0)

fig, ax = plt.subplots(1, 1)
g = sns.distplot(dbtrain['Fare'], color='b', label='Skewness : {:.2f}'.format(dbtrain['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

이제 비대칭성이 많이 사라진 것을 알 수 있다. (skewness = 0.44)
이 부분은 feature engineering에 해당하는 부분인데 여기서 미리 했다.
모델을 학습 시키고, 성능을 높이기 위해 feature 들에 여러 조작을 가하거나, 새로운 feature을 추가하는 과정을 feature engineering이라고 한다.

2.9 Cabin

Cabin은 null이 77%이므로, 생존에 영향을 미칠 중요한 정보를 얻기 힘들다.
따라서 모델에 사용할 feature에 포함하지 않겠다.

2.10 Ticket

Ticket은 null이 없다.
하지만 형식이나 분포가 너무 불규칙적인 부분이 많아 모델에 사용할 feature에 포함하지 않겠다.

dbtrain['Ticket'].value_counts()

347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
           ..
9234        1
19988       1
2693        1
PC 17612    1
370376      1
Name: Ticket, Length: 681, dtype: int64

3. Feature Engineering

가장 먼저 dataset에 있는 null을 지워주도록 할 것이다.
null data에 채워질 데이터에 의해 모델의 성능이 변할 것이다.
Feature Engineering은 실제 모델의 학습에 쓰려고 하는 것이기 때문에 train 뿐만 아니라 test에도 똑같이 적용해야 한다.

3.1.1 filling null data(initial)

print(sum(dbtrain['Age'].isnull())) # null counting

177

Age에는 null data가 177개, 약 20% 있다.
이것을 채울 수 있는 방법은 여러가지가 있는데 여기서는 title + statistics를 사용할 것이다.
영어에는 Miss, Mr, Mrs, Dr 같은 title이 존재한다.
각 탑승객의 이름에는 이런 title이 존재하므로 이것을 사용할 것이다.
pandas에는 data를 string으로 바꿔주는 str method, 거기에 정규표현식을 적용하게 해주는 extract method가 있다. 이를 사용하여 title을 쉽게 추출할 수 있다.
title을 Initial column에 저장하겠다.

dbtrain['Initial']=dbtrain.Name.str.extract('([A-Za-x]+)\\.') # lets extract the salutations
dbtest['Initial']=dbtest.Name.str.extract('([A-Za-x]+)\\.')

pd.crosstab(dbtrain['Initial'],dbtrain['Sex']).T.style.background_gradient(cmap='BuPu')

위 테이블을 통해 남, 여가 사용하는 initial을 구분할 수 있다.
replace method를 사용하여 특정 데이터값을 원하는 값으로 치환해준다.

dbtrain['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don', 'Dona'],
                        ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr', 'Mr'],inplace=True)

dbtest['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don', 'Dona'],
                        ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr', 'Mr'],inplace=True)

dbtrain.groupby('Initial').mean()

p = dbtrain[['Initial','Survived']].groupby(['Initial']).mean() # survived rate

p.plot(kind='bar')

여성과 관계있는 Miss, Mrs이 생존률이 높은 것을 볼 수 있다.
statistics를 활용하여 null값을 채워줄 것이다.
여기서 train 데이터만 참고하여 test 데이터를 채워줘야 한다.
항상 test 데이터를 모르는 상태로 놔둔다고 생각해야 한다.

dbtrain.groupby('Initial').mean()

Age의 평균을 이용하여 null 값을 채운다.
padas dataframe을 다룰 때에는 boolean array를 이용해 indexing하는 방법이 있다.
Age 열의 null값인 셀을 해당 initial의 평균 나이로 치환하여준다.

dbtrain.loc[(dbtrain.Age.isnull())&(dbtrain.Initial=='Mr'),'Age'] = 33
dbtrain.loc[(dbtrain.Age.isnull())&(dbtrain.Initial=='Mrs'),'Age'] = 36
dbtrain.loc[(dbtrain.Age.isnull())&(dbtrain.Initial=='Master'),'Age'] = 5
dbtrain.loc[(dbtrain.Age.isnull())&(dbtrain.Initial=='Miss'),'Age'] = 22
dbtrain.loc[(dbtrain.Age.isnull())&(dbtrain.Initial=='Other'),'Age'] = 46

dbtest.loc[(dbtest.Age.isnull())&(dbtest.Initial=='Mr'),'Age'] = 33
dbtest.loc[(dbtest.Age.isnull())&(dbtest.Initial=='Mrs'),'Age'] = 36
dbtest.loc[(dbtest.Age.isnull())&(dbtest.Initial=='Master'),'Age'] = 5
dbtest.loc[(dbtest.Age.isnull())&(dbtest.Initial=='Miss'),'Age'] = 22
dbtest.loc[(dbtest.Age.isnull())&(dbtest.Initial=='Other'),'Age'] = 46

3.1.2 Fill Null in Embarked

print(sum(dbtrain['Embarked'].isnull())) # null counting

Embarked는 null값이 2개이고 S에 가장 많은 탑승객이 있었으므로, 간단하게 null값을 S로 채워준다.
dataframe의 fillna method를 이용하여 채워준다.
inplace = True로 하면 dbtrain에 fillna를 적용하게 된다.

dbtrain['Embarked'].fillna('S', inplace=True) #null값을 fillna 메소드로 's'로 채워줬다

S로 채워져서 null 값이 사라졌는지 확인해준다.

print(sum(dbtrain['Embarked'].isnull())) # null counting

3.2 Change Age(연속적인 데이터 카테고리화 하기)

Age는 연속적인 feature이다.
연속적인 상태로 써도 모델을 만들 수 있지만 카테고리화 해주고 모델을 만들어 볼 수 도 있다.
다만 연속적인 값을 카테고리화 하면 데이터의 손실이 생길 수도 있다.
첫번째 카테고리 방법으로 loc을 이용해 나이를 10살 간격으로 나눠준다.

dbtrain['Age_cat'] = 0
dbtrain.loc[dbtrain['Age'] < 10, 'Age_cat'] = 0
dbtrain.loc[(10 <= dbtrain['Age']) & (dbtrain['Age'] < 20), 'Age_cat'] = 1
dbtrain.loc[(20 <= dbtrain['Age']) & (dbtrain['Age'] < 30), 'Age_cat'] = 2
dbtrain.loc[(30 <= dbtrain['Age']) & (dbtrain['Age'] < 40), 'Age_cat'] = 3
dbtrain.loc[(40 <= dbtrain['Age']) & (dbtrain['Age'] < 50), 'Age_cat'] = 4
dbtrain.loc[(50 <= dbtrain['Age']) & (dbtrain['Age'] < 60), 'Age_cat'] = 5
dbtrain.loc[(60 <= dbtrain['Age']) & (dbtrain['Age'] < 70), 'Age_cat'] = 6
dbtrain.loc[70 <= dbtrain['Age'], 'Age_cat'] = 7

dbtest['Age_cat'] = 0
dbtest.loc[dbtest['Age'] < 10, 'Age_cat'] = 0
dbtest.loc[(10 <= dbtest['Age']) & (dbtest['Age'] < 20), 'Age_cat'] = 1
dbtest.loc[(20 <= dbtest['Age']) & (dbtest['Age'] < 30), 'Age_cat'] = 2
dbtest.loc[(30 <= dbtest['Age']) & (dbtest['Age'] < 40), 'Age_cat'] = 3
dbtest.loc[(40 <= dbtest['Age']) & (dbtest['Age'] < 50), 'Age_cat'] = 4
dbtest.loc[(50 <= dbtest['Age']) & (dbtest['Age'] < 60), 'Age_cat'] = 5
dbtest.loc[(60 <= dbtest['Age']) & (dbtest['Age'] < 70), 'Age_cat'] = 6
dbtest.loc[70 <= dbtest['Age'], 'Age_cat'] = 7

두번째로 간단한 함수를 만들어 apply method에 넣어주는 방법이 있다.
똑같이 10살 간격으로 만들어준다.

def category_age(x):
    if x < 10:
        return 0
    elif x < 20:
        return 1
    elif x < 30:
        return 2
    elif x < 40:
        return 3
    elif x < 50:
        return 4
    elif x < 60:
        return 5
    elif x < 70:
        return 6
    else:
        return 7    
    
dbtrain['Age_cat_2'] = dbtrain['Age'].apply(category_age)
dbtest['Age_cat_2'] = dbtest['Age'].apply(category_age)

두가지 방법이 잘 적용되었다면 같은 결과 값을 도출해내야 한다.
Series간 boolean 비교 후 all() method를 사용해준다.
all() method는 모든 값이 True면 True, 하나라도 False가 있으면 False를 준다.

print('1번 방법, 2번 방법 둘다 같은 결과를 내면 True 줘야함 -> ', (dbtrain['Age_cat'] == dbtrain['Age_cat_2']).all())

1번 방법, 2번 방법 둘다 같은 결과를 내면 True 줘야함 -> True

True로 나온 것을 보니 두 방법 다 잘 적용되었다.
이제 중복되는 Age_cat 열과 원래 열인 Age 열을 제거한다.

dbtrain.drop(['Age', 'Age_cat_2'], axis=1, inplace=True)
dbtest.drop(['Age',  'Age_cat_2'], axis=1, inplace=True)

2번째 필사 때 위 과정에서 나이를 10살 말고 5살로 나눠서 진행해 봤었다.
카테고리화가 더 세분화 되어서 모델이 더 정확히 예측할 것을 예상했었지만 모델의 예측율이 더 떨어지는 현상이 벌어졌었다.
왜 그런지는 궁금했지만 잘 모르겠다. 공부를 좀 더 하다 이유를 알아내면 이유를 남겨놓도록 할 것이다.

3.3 Change Initial, Embarked and Sex(string to numerical)

Initial은 Mr, Mrs, Miss, Master, Other 총 5개로 이루어져 있다.
이러한 카테고리로 표현되어 있는 데이터를 모델에 넣어주기 전에 컴퓨터가 인식할 수 있도록 수치화 해줘야한다.
map method를 통해 사전 순서대로 정리하여 mapping 시켜준다.

dbtrain['Initial'] = dbtrain['Initial'].map({'Master' : 0, 'Miss':1, 'Mr':2, 'Mrs':3, 'Other':4})
dbtest['Initial'] = dbtest['Initial'].map({'Master' : 0, 'Miss':1, 'Mr':2, 'Mrs':3, 'Other':4})

Embarked도 C, Q, S로 이루어져 있으니 map을 통해 바꿔주겠다.
그러기 앞서 특정 열에 어떤 값들이 있는지 확인해보겠다.
unique() method를 쓰거나, value_counts() 를 써서 count까지 보는 방법이 있다.

dbtrain['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

dbtrain['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

Embarked가 C, Q, S로 이루어져 있는 것을 확인했다. map을 사용하여 바꿔준다.

dbtrain['Embarked'] = dbtrain['Embarked'].map({'C':0, 'Q':1, 'S':2})
dbtest['Embarked'] = dbtest['Embarked'].map({'C':0, 'Q':1, 'S':2})

null값이 사라졌는지 확인해본다.
한개의 열만 가져온 것은 하나의 pandas Series 객체이므로, isnull()을 통해 null값인지 boolean값을 얻을 수 있다.
여기서 추가적으로 any()를 사용하여 True가 단 하나라도 있을 시 True를 반환하여 null값이 아직도 남아있는지 확인할 수 있다. (True가 반환되면 아직도 null값이 남아있는 의미)

dbtrain['Embarked'].isnull().any() #null값이 하나도 없으면 false(null 값을 'S'로 바꿨었음)

False

null 값이 없다.
Sex도 map을 이용해 바꿔준다.

dbtrain['Sex'] = dbtrain['Sex'].map({'female':0, 'male':1})
dbtest['Sex'] = dbtest['Sex'].map({'female':0, 'male':1})

이제 각 feature간의 상관관계를 한번 보려고 한다.
두 변수간의 피어슨 상관 계수(Pearson correlation)을 구하면 (-1, 1) 사이의 값을 얻을 수 있다.
-1로 다가갈 수록 음의 상관관계, 1로 갈수로 양의 상관관계를 가지며, 0은 상관관계가 없다는 것을 의미한다.

여러 feature을 하나의 matrix 형태로 보게 해주는 heatmap plot을 이용한다.
dataframe의 corr() method와 seaborn을 가지고 그린다.

heatmap_data = dbtrain[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']]
plt.figure(figsize = (14,12))
sns.heatmap(heatmap_data.astype(float).corr(), annot=True, linewidths = 0.1)

EDA에서 봤듯이 Sex, Pclass 가 Survived와 상관관계가 어느 정도 있다는 것을 알 수 있다.
그리고 fare과 Embarked도 상관관계가 있음을 알 수 있다.
또한 서로 강한 상관관계를 갖는 feature들이 없다.
모델을 학습 시킬 때, 불필요한(redundant, superfluous) feature 가 없다는 것을 의미한다.
1 또는 -1의 상관관계를 가진 feature이 있다면 우리가 얻을 수 있는 정보는 하나 뿐일 것이다.
이제 실제로 모델을 학습시키기 앞서 data preprocessing(데이터 전처리)를 진행하겠다.

3.4 One-hot encoding on Initial and Emabarked

수치화된 카테고리 데이터를 그대로 넣어도 되지만, 모델의 성능을 위해 one-hot encoding을 해줄 것이다.
Master == 0, Miss == 1, Mr == 2, Mrs == 3, Other == 4 mapping 할 것이다.
One-hot encoding 은 위 카테고리를 아래와 같이 (0, 1) 로 이루어진 5차원의 벡터로 나타내는 것을 말한다.

pandas의 get_dummies를 사용하여 mapping 해준다. Initial을 prefix로 두어서 구분이 쉽게 되도록 해준다.

dbtrain = pd.get_dummies(dbtrain, columns = ['Initial'], prefix = 'Initial')
dbtest = pd.get_dummies(dbtest, columns = ['Initial'])

dbtrain.head()

One-hot encoding으로 매핑하여 오른쪽에 새로운 initial열에 생겼다.
Embarked에도 적용해주겠다.

dbtrain = pd.get_dummies(dbtrain, columns=['Embarked'])
dbtest = pd.get_dummies(dbtest, columns=['Embarked'])

sklearn으로 Labelcoder + OneHotencoder 이용해도 가능하다.
만약 category가 100개가 넘어가는 경우가 있는데, 이때 one-hot encoding을 사용하면 열이 100개가 넘게 생겨서 모델 학습 시 힘들 수 있다. 이런 경우는 다른 경우는 다른 방법을 사용한다. 추후에 공부하고 리뷰해 보도록 해보겠다.

3.5 Drop columns

필요한 columns만 남기고 다 지워줄 것이다.

dbtrain.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
dbtest.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

dbtrain.head()

dbtest.head()

dbtest에 Survived column이 없는 것만 빼면 train과 test는 같은 columns를 가지고 있다.

4. Building machine learning model and prediction using the trained model

이제 sklearn을 이용하여 본격적으로 머신러닝 모델을 만들어주면 된다.

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

sklearn는 feature engineering, preprocessing, 지도 학습 알고리즘, 비지도 학습 알고리즘, 모델 평가, 파이프라인 등 머신러닝에 관련된 모든 작업들이 처음부터 끝까지 있다.
이 문제는 target class(Survived)가 0, 1로 이루어져 있으므로 binary classification문제이다.
train set의 Survived를 제외한 input을 가지고 모델을 최적화 시켜서 각 샘플(탑승객)의 생존 유무를 판단하는 모델을 만들 것이다.
그 후 모델이 학습하지 않았던 test set을 input으로 주어서 test set의 각 샘플(탑승객)의 생존 유무를 예측할 것이다.

4.1 Preparation - Split dataset into train, valid, test set

먼저, 학습에 쓰일 데이터와, target label(Survived)를 분리한다.

xtrain = dbtrain.drop('Survived', axis=1).values
target_label = dbtrain['Survived'].values
xtest=dbtest.values

보통 train test만 언급되지만, 실제 좋은 모델들 만들기 위하여 위해서 우리는 valid set을 따로 만들어 모델을 평가하게 된다.
train_test_split을 사용하여 쉽게 train 셋을 분리할 수 있다

xtr, xvld, ytr, yvld = train_test_split(xtrain, target_label, test_size = 0.3, random_state=2018)

랜덤 포레스트는 결정트리기반 모델이며, 여러 결정 트리들을 앙상블한 모델이다.
각 머신러닝에는 여러 파라미터들이 있다. 랜덤포레스트분류기도 n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf등 여러 파라미터들이 존재한다. 이것이 어떻게 세팅 되냐에 따라 같은 데이터셋이라 하더라도 모델의 성능이 달라진다.
일단 기본 default 세팅 먼저 진행할 것이다.
모델 객체를 만들고, fit메소드로 학습시킨다.
그런 후 valid set input을 넣어주어 예측값(X_vld sample(탑승객)의 생존여부)를 얻는다.

4.2 Model generation and prediction

model = RandomForestClassifier()
model.fit(xtr, ytr)
prediction= model.predict(xvld)

모델을 세우고 예측까지 진행하였다.

print('총 {}명 중 {:.2f}% 정확도로 생존을 맞춤'.format(yvld.shape[0], 100 * metrics.accuracy_score(prediction, yvld)))

총 268명 중 81.34% 정확도로 생존을 맞춤

모델의 성능을 출력해보았다.
아무런 파라미터 튜닝도 하지 않았는데 82%의 정확도가 나왔다.
몇 가지 파라미터를 튜닝해 봤을 때 어떤 결과가 나오는지 궁금하니 다음에 올려보도록 하겠다.

4.3 Feature importance

학습된 모델은 feature importance를 갖게 되는데 높을 수록 모델에 미치는 영향이 높다고 볼 수 있다.
학습된 모델은 기본적으로 feature importance를 가지고 있다.
pandas series를 이용하면 쉽게 sorting하여 그래프를 그릴 수 있다.

from pandas import Series

feature_importance = model.feature_importances_
Series_feat_imp = Series(feature_importance, index=dbtest.columns)

plt.figure(figsize=(8,8))
Series_feat_imp.sort_values(ascending=True).plot.barh()

우리가 얻은 모델에서는 fare이 가장 큰 영향력을 가지고 있음을 알 수 있다.
feature importance는 지금 모델에서의 importance를 나타낸다. 만약 다른 모델을 사용하게 된다면 feature importance가 다르게 나올 수 있다.
feature importance를 가지고 좀 더 정확도가 높은 모델을 얻기 위해 feature selection을 할 수도 있고, 좀 더 빠른 모델을 위해 feature을 제거할 수도 있다.
fare feature을 조정하면 예측결과가 어떻게 바뀌는지도 궁금하니 다음에 확인해 보도록 하겠다.

4.4 Prediction on Test set

이제 모델이 학습하지 않았던 테스트 셋을 모델에 주어서, 생존여부를 예측해 봐야한다.
이 결과는 contest에 submission이므로 결과는 해당 contest의 leaderboard에서 확인할 수 있다.
캐글에서 준 파일인 gender_submission.csv파일을 읽어서 제출 준비를 한다.

submission = pd.read_csv('../input/titanic/gender_submission.csv')

submission

이제 testset에 대하여 예측을 하고, 결과를 csv파일로 저장해 보겠다.

prediction = model.predict(xtest)
submission['Survived'] = prediction

submission.to_csv('./titanic_3rd_transcription.csv', index=False)

3번째 필사 notebook을 바탕으로 만들었기 때문에 titanic_3rd_transcription.csv 이란 파일로 저장이 될 것이다.

5. Conclusion

우측 바에서 결과 파일을 찾아서 submission을 위해 pc에 저장한다.

타이타닉 contest에 submission을 해준다.

step1에 아까 pc에 저장했던 파일을 업로드 하고 make submission을 해주면 contest의 leaderboard에서 내 결과를 확인해볼 수 있다.

0.74401로 12249등을 했다.

타이타닉 문제를 해결해보면서 생긴 궁금증들은 좀 더 공부해 보면서 해결을 할 예정이다.

나이를 카테고리화 하는 과정 중 카테고리화 하는 정도가 결과에 미치는 영향
one-hot encoding말고 다른 방법으로 결과를 도출해내기
파라미터를 튜닝했을 때 결과에 미치는 영향
feature importance가 가장 높은 것을 조종하면 결과가 얼마나 달라지는지

'캐글 - kaggle' 카테고리의 다른 글

캐글(kaggle) 필사하며 공부하기 (0)	2022.01.27

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

이게되네!

캐글(kaggle) - Titanic

문제

로드맵

0. 초기 세팅

1. Dataset 확인

1.1 Null data check

1.2 Target label 확인

2. EDA(exploratoty data analysis) - 탐색적 데이터 분석

2.1 Pclass

2.2 Sex

2.3 Sex and Pclass

2.4 Age

2.5 Pclass, Sex, Age

2.6 Embarked

2.7 Family - SibSp(형제 자매) + Parch(부모 자녀)

2.8 Fare

2.9 Cabin

2.10 Ticket

3. Feature Engineering

3.1.1 filling null data(initial)

3.1.2 Fill Null in Embarked

3.2 Change Age(연속적인 데이터 카테고리화 하기)

3.3 Change Initial, Embarked and Sex(string to numerical)

3.4 One-hot encoding on Initial and Emabarked

3.5 Drop columns

4. Building machine learning model and prediction using the trained model

4.1 Preparation - Split dataset into train, valid, test set

4.2 Model generation and prediction

4.3 Feature importance

4.4 Prediction on Test set

5. Conclusion

'캐글 - kaggle' 카테고리의 다른 글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

캐글(kaggle) - Titanic

문제

로드맵

0. 초기 세팅

1. Dataset 확인

1.1 Null data check

1.2 Target label 확인

2. EDA(exploratoty data analysis) - 탐색적 데이터 분석

2.1 Pclass

2.2 Sex

2.3 Sex and Pclass

2.4 Age

2.5 Pclass, Sex, Age

2.6 Embarked

2.7 Family - SibSp(형제 자매) + Parch(부모 자녀)

2.8 Fare

2.9 Cabin

2.10 Ticket

3. Feature Engineering

3.1.1 filling null data(initial)

3.1.2 Fill Null in Embarked

3.2 Change Age(연속적인 데이터 카테고리화 하기)

3.3 Change Initial, Embarked and Sex(string to numerical)

3.4 One-hot encoding on Initial and Emabarked

3.5 Drop columns

4. Building machine learning model and prediction using the trained model

4.1 Preparation - Split dataset into train, valid, test set

4.2 Model generation and prediction

4.3 Feature importance

4.4 Prediction on Test set

5. Conclusion

'캐글 - kaggle' 카테고리의 다른 글

관련글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역