
Titanic Tutorial

  • Writer: Shreya Malraju
  • Sep 5, 2022
  • 2 min read

Updated: Oct 7, 2022

In this post, we discuss the Kaggle competition "Titanic - Machine Learning from Disaster" as a way to learn more about Machine Learning.




The competition is about creating a simple model that predicts which passengers survived the sinking of the Titanic, one of the most shocking and deadliest disasters in history, which happened in April 1912.


The total data is split into two files:

  1. train.csv

  2. test.csv

The train.csv file is used for training the ML models, based on features like each passenger's Age, Gender, Class, etc.


The test.csv file is used to check the performance of the model on new/unseen data.



Below is a walk-through of the Jupyter Notebook where the model is implemented.



CODE REVIEW


First, we need to import the modules, like numpy and pandas, that we will be using in this task.

The given train.csv data file is loaded into the environment as a pandas DataFrame using the read_csv() method.

We see that the training data has 891 records and 12 attributes, each record denoting a different passenger on the ship.
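The loading-and-inspection step can be sketched as below. Since the real train.csv lives on Kaggle, a tiny inline sample with the same column layout stands in for it here; the real file has 891 rows.

```python
import io
import pandas as pd

# Stand-in for train.csv: a tiny sample with the same columns as the
# Kaggle file (the real file has 891 rows; this is just for illustration).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,7.25
2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,71.2833
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,7.925
"""

# In the notebook this would be pd.read_csv("train.csv").
train_data = pd.read_csv(io.StringIO(csv_text))
print(train_data.shape)   # (rows, columns)
print(train_data.head())
```

With the real file, `train_data.shape` reports the record and attribute counts directly.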


Similarly, we load the test.csv file into the environment and inspect the test data.


We can see that we have 418 records in the test data with 11 attributes.


The code below finds the percentage of women who were able to survive the disaster. Taking the data from the training set, we observe that 74.2% of the women survived.

Similarly, we observe that only 18.89 percent of the men were able to survive the disaster.

From these two figures, we observe that gender plays an important role in deciding which passengers survived.
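The gender survival rates can be computed with a pandas sketch like this one. The inline rows are toy stand-ins for the real training data, where the figures come out to 74.2% for women and 18.89% for men.

```python
import pandas as pd

# Toy rows standing in for the Titanic training data.
train_data = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Survival outcomes for each gender, then the fraction that survived.
women = train_data.loc[train_data.Sex == "female"]["Survived"]
rate_women = sum(women) / len(women)

men = train_data.loc[train_data.Sex == "male"]["Survived"]
rate_men = sum(men) / len(men)

print(f"% of women who survived: {rate_women:.3f}")
print(f"% of men who survived: {rate_men:.3f}")
```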


We observe that the data contains NaN (null) values, which may lead to errors while predicting. Hence, we replace the null values of numerical attributes (e.g., Age) with the mean of the remaining values in both the training and test data.
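The imputation step looks roughly like the following; the toy Age column with gaps stands in for the real data.

```python
import pandas as pd

# Toy Age column with missing values (stand-in for the real data).
train_data = pd.DataFrame({"Age": [22.0, None, 26.0, None, 32.0]})

# Replace NaNs in a numerical attribute with the mean of the non-null values.
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].mean())
print(train_data["Age"].tolist())
```

The same `fillna` call is applied to the test data so the model never sees a NaN.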

We use a RandomForestClassifier over 4 features and train it on the train_data.



We get an accuracy of 77.5 percent using the above code, and the resulting submission.csv is submitted to the Kaggle competition.
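The training-and-submission step can be sketched as below, using the four features named in the post and the stated max_depth of 6. The toy train/test frames are stand-ins for the real files; with the real data this is the step that produces the 77.5 percent score.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for train_data / test_data (same feature columns as the post).
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "SibSp":    [1, 1, 0, 0, 1, 0],
    "Parch":    [0, 0, 0, 0, 2, 0],
})
test_data = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass":      [3, 1],
    "Sex":         ["male", "female"],
    "SibSp":       [0, 1],
    "Parch":       [0, 0],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]
y = train_data["Survived"]
X = pd.get_dummies(train_data[features])        # one-hot encode "Sex"
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# The file uploaded to Kaggle: one predicted label per test passenger.
output = pd.DataFrame({"PassengerId": test_data.PassengerId,
                       "Survived": predictions})
output.to_csv("submission.csv", index=False)
```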



CONTRIBUTION


As we can see, the accuracy was 77.5 percent using the features "Pclass", "Sex", "SibSp", and "Parch" with a RandomForestClassifier with max_depth=6 and 100 estimators.


In an attempt to increase the accuracy, I tried adding the feature "Fare" while training, increasing max_depth to 7, and increasing the number of estimators to 900.


The feature "Fare" is added to the list of features because the data suggests that the passengers who survived the disaster apparently paid a higher fare. With these changes, the accuracy increased from 77.5 percent to 78.46 percent.
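The modified training run can be sketched as follows: "Fare" joins the original four features, and the forest is made deeper and larger per the settings above. The inline rows are toy stand-ins for the real training data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the training data; the real run uses the full train.csv.
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "SibSp":    [1, 1, 0, 0, 1, 0],
    "Parch":    [0, 0, 0, 0, 2, 0],
    "Fare":     [7.25, 71.28, 7.93, 8.05, 21.08, 51.86],
})

# "Fare" added to the original four features; deeper, larger forest.
features = ["Pclass", "Sex", "SibSp", "Parch", "Fare"]
X = pd.get_dummies(train_data[features])
y = train_data["Survived"]

model = RandomForestClassifier(n_estimators=900, max_depth=7, random_state=1)
model.fit(X, y)
print(model.score(X, y))   # accuracy on the training rows
```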

The latest submission is saved as Submission3 on Kaggle.

