Exploring Machine Learning Fundamentals: Predicting Survival on the Titanic




In 2024, one of my goals is to learn machine learning and publish a crop physiology paper in an academic journal using it. While taking machine learning courses, both online and offline, I discovered Kaggle, a popular platform for data science and machine learning competitions, datasets, and tutorials. Kaggle provides excellent datasets for practicing basic machine learning and data analysis.

If you visit the Kaggle competition page, Titanic – Machine Learning from Disaster, you can access and download the datasets. To download data from Kaggle, you’ll need to create a Kaggle account. Signing up is easy, but you may not even need to download files manually: there is a method for importing Kaggle datasets directly into Google Colab. Please check out the posts below for more details!


How to import Kaggle datasets directly into Google Colab?
How to use Google Colab for Python (power tool to analyze data)?


The code explained in this post is based on Python, and Google Colab is an excellent Python-friendly environment. If you’re interested in learning how to use Google Colab, please refer to the posts above for detailed instructions.


# to mount Google Drive on Google Colab
from google.colab import drive
drive.mount('/content/drive')

# copy the Kaggle API token (kaggle.json) into place
! mkdir -p ~/.kaggle
! cp /content/drive/MyDrive/Colab/0_archive/kaggle.json ~/.kaggle
! chmod 600 ~/.kaggle/kaggle.json

# to download dataset from Kaggle
!kaggle competitions download -c titanic -p "/content/drive/MyDrive/Colab/3_machine_learning/Practice/Titanic_Machine_Learning_from_Disaster" 
# Please check the file path in your Google Colab

# path to the zip file downloaded from Kaggle
import zipfile
import os

titanic = "/content/drive/MyDrive/Colab/3_machine_learning/Practice/Titanic_Machine_Learning_from_Disaster/titanic.zip"
# Please check the file path in your Google Colab

# Path to the directory where you want to extract the contents
titanic_path = "/content/drive/MyDrive/Colab/3_machine_learning/Practice/Titanic_Machine_Learning_from_Disaster"
# Please check the file path in your Google Colab

# Unzip the file
with zipfile.ZipFile(titanic, 'r') as zip_ref:
    zip_ref.extractall(titanic_path)

# List the contents of the extracted directory
extracted_titanic = os.listdir(titanic_path)
print("Extracted files:", extracted_titanic)

If you run the above code in your Google Colab, you’ll be able to save the data to your Google Drive. Of course, using this code is optional; alternatively, you can download the data from Kaggle and manually save it to your Google Drive. Either way, once the data is downloaded, let’s start the machine learning process step by step.



1) Data upload

First, I’ll upload the data to Google Colab. The data are saved as train.csv and test.csv in the following folder:

/content/drive/MyDrive/Colab/3_machine_learning/Practice/Titanic_Machine_Learning_from_Disaster

So, I’ll load them into two DataFrames named train and test using pandas.

import pandas as pd
import numpy as np

train = pd.read_csv("/content/drive/MyDrive/Colab/3_machine_learning/Practice/Titanic_Machine_Learning_from_Disaster/train.csv")
test = pd.read_csv("/content/drive/MyDrive/Colab/3_machine_learning/Practice/Titanic_Machine_Learning_from_Disaster/test.csv")

Let’s check the data info.

train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

In the train data, you can find the ‘Survived’ column, while it does not exist in the test data. This is because we will build our own model to predict survival on the Titanic using various variables (i.e., Pclass, Sex, Age, etc.) in the train data, and then apply the model to the test data. Of course, the actual survival status for each PassengerId in the test data exists, but it is hidden from us. When we submit our predictions, Kaggle will evaluate the accuracy of our model.

Let’s see how the data are organized.

train.head()
test.head()


2) Null data

Ensuring there are no null values in your dataset is crucial as they can distort predictions. Therefore, the first step is to check whether there are any null values present.

train.isnull().sum()
test.isnull().sum()

While deleting null values is a simple approach, it may distort the final results. There are various methods available to estimate null values. Personally, I prefer using the median value of specific groups.

This is the first step in developing your own model. For example, to substitute a null value with the median value of a group, which group should we choose? If we choose ‘Sex’ as the grouping factor, the estimate would be very coarse, because we would use the median age of all men or all women. A better grouping factor, in my view, is the title in the name, such as ‘Mr.’ or ‘Miss’, as the median age of ‘Mr.’ or ‘Miss’ provides a more accurate estimate of age. Therefore, I’ll use the title as the grouping factor to estimate age.
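To see why, here is a quick sketch (the titles variable is just for illustration) that computes the median age per title extracted from ‘Name’:

# median age per title: a quick check of why titles make a good grouping factor
titles = train["Name"].str.extract(r"([\w]+)\.", expand=False)
print(train.groupby(titles)["Age"].median())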



3) To fill in the blanks

First, I’ll set up train_test data which contains references to both the train and test data.

# to make a list containing references to the train and test
train_test=[train, test]

Then, I’ll extract titles.

for dataset in train_test: dataset["Title"]= dataset["Name"].str.extract("([\w]+)\.", expand=False)

Let’s see how many different titles there are in the train and test data.

train['Title'].value_counts()
test['Title'].value_counts()

Now, we need to convert these categorical values into numerical values, since the models work with numbers rather than text. Therefore, I’ll use the following code:

Title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 4, "Rev": 5,
                 "Mlle": 6, "Major": 7, "Col": 8, "Countess": 9, "Capt": 10,
                 "Ms": 11, "Sir": 12, "Lady": 13, "Mme": 14, "Don": 15,
                 "Jonkheer": 16, "Dona": 17}
for dataset in train_test:
    dataset["Title"]=dataset["Title"].map(Title_mapping)

Let’s check that all values were converted.

train["Title"].unique()
test["Title"].unique()

However, here is one problem: there are so many different titles. If we had another dataset containing a title such as “President”, this code would map it to NaN, since “President” is not included in the dictionary. Therefore, we can keep the most common titles and group all the others together.

For example, if I run the following code, I’ll encode ‘Mr’, ‘Miss’, ‘Mrs’, and ‘Master’ as 0, 1, 2, and 3, respectively, and assign all the others to 4. So, if another dataset has the title “President”, the president will be assigned 4.

for dataset in train_test: dataset["Title_group"]= dataset["Name"].str.extract("([\w]+)\.", expand=False)

for dataset in train_test:
    dataset['Title_group'] = dataset['Title_group'].apply(
        lambda x: 0 if x == "Mr" else 1 if x == "Miss" else 2 if x == "Mrs"
        else 3 if x == "Master" else 4)

train.head()
test.head()

Therefore, the ‘Title’ column and the ‘Title_group’ column will be different, and I’ll be using the ‘Title_group’ column.
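As a quick check (just a sketch), ‘Title_group’ should now contain only the codes 0 through 4:

# both arrays should only contain values from 0 to 4
print(train["Title_group"].unique())
print(test["Title_group"].unique())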

Now, let’s fill in the blanks. I’ll fill ‘Age’ with the median value within each ‘Title_group’,

# Age
for dataset in train_test:
    dataset["Age"] = dataset["Age"].fillna(dataset.groupby("Title_group")["Age"].transform("median"))

and you can see that all null values in ‘Age’ have been filled out.
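You can confirm this with a quick check (a sketch; both counts should print 0):

# count the remaining nulls in 'Age'
print(train["Age"].isnull().sum(), test["Age"].isnull().sum())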

Let’s fill the blank in ‘Fare’. I assume that fares differ by passenger class, so I’ll use the median value within each Pclass.

# Fare
for dataset in train_test: dataset["Fare"].fillna(dataset.groupby("Pclass")["Fare"].transform("median"), inplace=True)


4) To replace categorical values with numerical values

I already explained how to convert categorical values to numerical values when explaining ‘Title’. Let’s convert the other categorical values.

I’ve identified ‘Sex’ and ‘Embarked’ as categorical values. Since I won’t include ‘Cabin’ in my model, I won’t convert it. However, I’ll convert ‘Embarked’, which has three levels: C, Q, and S. I’ll convert them to 0, 1, and 2, respectively.

# Sex
sex_mapping = {"male": 0, "female":1}
for dataset in train_test:
    dataset["Sex"]=dataset["Sex"].map(sex_mapping)

# Embarked
dataset["Embarked"].unique()
 # array(['Q', 'S', 'C'], dtype=object)

Embarked_mapping = {"C":0, "Q":1, "S":2}
for dataset in train_test:
    dataset["Embarked"]=dataset["Embarked"].map(Embarked_mapping)

Do you remember there were null values in ‘Embarked’? I’ll use the median value within each title group to fill those blanks.

for dataset in train_test:
    dataset["Embarked"] = dataset["Embarked"].fillna(dataset.groupby("Title_group")["Embarked"].transform("median"))

Now, there are no null values except for Cabin.
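You can verify this with a quick check (a sketch): only ‘Cabin’ should still show missing values.

# count the remaining nulls per column
print(train.isnull().sum())
print(test.isnull().sum())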



5) Specific grouping

I assume that ‘Age’ is better divided into groups, because the raw values can bias the model. For example, the same one-year difference means something different between ages 1 and 2 than between ages 20 and 21, so using raw ages could distort the result. Therefore, in this case, it is better to use groups such as teenager, adult, middle-aged, and senior.

for dataset in train_test:
    dataset["Agebin"]= pd.cut(dataset["Age"], 5, labels= [0,1,2,3,4])

Also, the variation in ‘Fare’ is very wide, so it is better to divide it into different fare groups.

for dataset in train_test:
    dataset["Farebin"]= pd.cut(dataset["Fare"], 4, labels= [0,1,2,3])

Okay!! Finally, I’ve cleaned all the data. I’ll only include “Pclass”, “Sex”, “Embarked”, “Title_group”, “Agebin”, and “Farebin” in my model.



6) To create the answer sheet and final train data

First, let’s delete the columns that are not included in the model.

drop_column=["Name", "Age", "SibSp","Parch","Ticket","Fare","Cabin", "Title"]

for dataset in train_test:
    dataset.drop(drop_column, axis=1, inplace=True)

Second, I’ll extract ‘Survived’ as a separate variable. Actually, it’s an answer sheet!!

target= train["Survived"]
target

Third, since “PassengerId” merely represents the sequence of passengers, it is not used for training. Similarly, “Survived” represents the outcome and should not be included as a feature; it has already been separated into the target variable, so we now remove both from the training data.

drop_column2=["PassengerId", "Survived"]
train_data=train.drop(drop_column2, axis=1)
train_data


7) Testing the Models

It has been a long journey. From now on, let’s test the models. I’ll use five different models.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

and run each model to compare accuracy on the training data.

# Initialize Logistic Regression model
clf1 = LogisticRegression()

# Fit the model on the training data
clf1.fit(train_data, target)

# Check accuracy on the training data
accuracy = clf1.score(train_data, target)
print("Accuracy:", accuracy)
# Accuracy: 0.792368125701459

# Initialize Random Forest model
clf2 = RandomForestClassifier()

# Fit the model on the training data
clf2.fit(train_data, target)

# Check accuracy on the training data
accuracy = clf2.score(train_data, target)
print("Accuracy:", accuracy)
# Accuracy: 0.8417508417508418

# Initialize Gaussian Naive Bayes model
clf3 = GaussianNB()

# Fit the model on the training data
clf3.fit(train_data, target)

# Check accuracy on the training data
accuracy = clf3.score(train_data, target)
print("Accuracy:", accuracy)
# Accuracy: 0.792368125701459

# Initialize Decision Tree model
clf4 = DecisionTreeClassifier()

# Fit the model on the training data
clf4.fit(train_data, target)

# Check accuracy on the training data
accuracy = clf4.score(train_data, target)
print("Accuracy:", accuracy)
# Accuracy: 0.8417508417508418

# Initialize SVC model
clf5 = SVC()

# Fit the model on the training data
clf5.fit(train_data, target)

# Check accuracy on the training data
accuracy = clf5.score(train_data, target)
print("Accuracy:", accuracy)
#Accuracy: 0.8271604938271605

I’ll choose the Random Forest model.
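One caveat: the accuracies above are measured on the same data the models were fit on, so they can be optimistic. A minimal sketch using k-fold cross-validation gives a less biased estimate:

# 5-fold cross-validation: a less optimistic estimate than training accuracy
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(), train_data, target, cv=5)
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))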



8) to create the final test data

The test data should have the same format as the train data, so I’ll delete “PassengerId”.

drop_column3=["PassengerId"]
test_data=test.drop(drop_column3, axis=1)
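As a sanity check (a sketch), the test features should line up exactly with the training features:

# the columns must match the training features, in the same order
print(list(test_data.columns) == list(train_data.columns))  # expected: True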

The basic flow of machine learning!!



9) Prediction

Now, I’ll predict the survivors using the model I chose.

predict = clf2.predict(test_data)
# clf2 = Random Forest model

and I’ll combine “PassengerId” with the predicted values.

submission = pd.DataFrame({
    'PassengerId' : test['PassengerId'],
    'Survived' : predict})

Then, I’ll save this data to a .csv file.

submission.to_csv('submission.csv', index=False)

submission = pd.read_csv("submission.csv")
submission.head()

Now, I’ve created a survival prediction for each PassengerId.

For example, PassengerId 892 is a middle-aged man (Sex: 0 = male, Agebin: 2 = 3rd age group) who stayed in Pclass 3, embarked at Queenstown (Embarked: 1 = Q), and purchased a cheap ticket (Farebin: 0 = 1st fare group).

According to the model we accepted, will he survive or not? The model tells us he did not survive (0 = did not survive, 1 = survived). See the data dictionary: https://www.kaggle.com/competitions/titanic/data

Now, let’s verify how accurate our model is. First, let’s copy the submission file to Google Drive.

import shutil
shutil.copy("submission.csv", "/content/drive/My Drive/submission.csv")

After downloading the file from Google Drive, upload it to Kaggle: https://www.kaggle.com/competitions/titanic/

Click “Submit Prediction” and upload the file.

My model achieved 80% accuracy. I have tried different models, and my score is improving compared to before.

In summary, humans assume that certain variables might affect survival and allow the machine to learn the model based on these variables. We then find the best model and apply it to the actual data. This is one of the processes in machine learning.

full code: https://github.com/agronomy4future/python_code/blob/main/Exploring_Machine_Learning_Fundamentals_Predicting_Survival_on_the_Titanic.ipynb

Reference: Codable* offline course

* Codable: Korean student coding association at University of Illinois Urbana-Champaign.


