Original notebook from: https://www.kaggle.com/code/ttminh27/using-autoencoder-to-impute-missing-data¶
Probabilistic circuit for handling missing data¶
!pip install deeprob-kit
Using a probabilistic circuit, a sum-product network, to autoencode missing data
In [1]:
# !pip install -e git+https://github.com/deeprob-org/deeprob-kit.git@main#egg=deeprob-kit
!pip install deeprob-kit
!pip install -q kaggle
Requirement already satisfied: deeprob-kit in /usr/local/lib/python3.10/dist-packages (1.1.0) Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (1.3.2) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (1.23.5) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (1.11.4) Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (4.66.1) Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (2.1.0+cu121) Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (0.16.0+cu121) Requirement already satisfied: scikit-learn<1.2.0 in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (1.1.3) Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (3.7.1) Requirement already satisfied: networkx<2.8.3 in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (2.8.2) Requirement already satisfied: pydot in /usr/local/lib/python3.10/dist-packages (from deeprob-kit) (1.4.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn<1.2.0->deeprob-kit) (3.2.0) Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (1.2.0) Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (4.46.0) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (1.4.5) Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (23.2) Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (9.4.0) Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (3.1.1) Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->deeprob-kit) (2.8.2) Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->deeprob-kit) (3.13.1) Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->deeprob-kit) (4.5.0) Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->deeprob-kit) (1.12) Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->deeprob-kit) (3.1.2) Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->deeprob-kit) (2023.6.0) Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->deeprob-kit) (2.1.0) Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torchvision->deeprob-kit) (2.31.0) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->deeprob-kit) (1.16.0) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->deeprob-kit) (2.1.3) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision->deeprob-kit) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision->deeprob-kit) (3.6) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision->deeprob-kit) (2.0.7) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision->deeprob-kit) (2023.11.17) Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->deeprob-kit) (1.3.0)
In [5]:
In [98]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import random
import tqdm
import math
print(f'tf version: {tf.__version__}')
import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
# for filename in filenames:
# print(os.path.join(dirname, filename))
tf version: 2.15.0
In [5]:
### Upload kaggle.json with API key
from google.colab import files
files.upload()
In [6]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
In [12]:
!kaggle competitions download -c titanic
!unzip titanic.zip
titanic.zip: Skipping, found more recently modified local copy (use --force to force download) Archive: titanic.zip inflating: gender_submission.csv inflating: test.csv inflating: train.csv
Fix random seed¶
In [7]:
seed = 42
tf.random.set_seed(seed)
random.seed(seed)
np.random.seed(seed)
random.seed(seed)
np.random.seed(seed)
# torch.manual_seed(seed)
# torch.cuda.manual_seed(seed)
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False
Load data¶
In [116]:
df_full = pd.read_csv('/content/train.csv')
df_valid = pd.read_csv('/content/test.csv')
df_full
Out[116]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Define utils functions¶
Normalize ticker number¶
In [117]:
def normalize_ticket(ticket_data):
ticket_data = ticket_data.split()[0]
ticket_data = re.sub('^\d+$', 'normal', ticket_data)
ticket_data = re.sub('\.|/', '', ticket_data)
return ticket_data
Statistic of missing value¶
In [118]:
def statMissingValue(X):
lstSummary = []
for col in X.columns:
liTotal = len(X.index)
liMissing = X[col].isna().sum()
lfMissingRate = round(liMissing * 100/liTotal,2)
liZero = 0
liNUnique = X[col].nunique()
if(X[col].dtype!='object'):
liZero = X[col].isin([0]).sum()
lfZeroRate = round(liZero*100/liTotal,2)
lstSummary.append([col,str(X[col].dtype),liTotal, liNUnique, liMissing, lfMissingRate,liZero,lfZeroRate])
return pd.DataFrame(lstSummary,columns=['feature','col_type','total', 'unique', 'na','na_rate','zero','zero_rate'])
EDA¶
Review statistic on each feature¶
In [119]:
df_stat = statMissingValue(df_full)
print(df_stat.feature.to_list())
df_stat
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Out[119]:
feature | col_type | total | unique | na | na_rate | zero | zero_rate | |
---|---|---|---|---|---|---|---|---|
0 | PassengerId | int64 | 891 | 891 | 0 | 0.00 | 0 | 0.00 |
1 | Survived | int64 | 891 | 2 | 0 | 0.00 | 549 | 61.62 |
2 | Pclass | int64 | 891 | 3 | 0 | 0.00 | 0 | 0.00 |
3 | Name | object | 891 | 891 | 0 | 0.00 | 0 | 0.00 |
4 | Sex | object | 891 | 2 | 0 | 0.00 | 0 | 0.00 |
5 | Age | float64 | 891 | 88 | 177 | 19.87 | 0 | 0.00 |
6 | SibSp | int64 | 891 | 7 | 0 | 0.00 | 608 | 68.24 |
7 | Parch | int64 | 891 | 7 | 0 | 0.00 | 678 | 76.09 |
8 | Ticket | object | 891 | 681 | 0 | 0.00 | 0 | 0.00 |
9 | Fare | float64 | 891 | 248 | 0 | 0.00 | 15 | 1.68 |
10 | Cabin | object | 891 | 147 | 687 | 77.10 | 0 | 0.00 |
11 | Embarked | object | 891 | 3 | 2 | 0.22 | 0 | 0.00 |
We notice that:
- "Pclass" has only 3 unique value so it mustbe categorical type. Same for "Sex", "Embarked".
- "Ticket" seems useless but if we split it, keep only first chunk (remove numeric part), it looks like categorial type.
- "Parch", "SibSp" are numeric type due to their definition.
- "Name" looks messy. We extract only the title part.
Plot the missing value rate of each feature
In [120]:
plt.figure(figsize=(20,4))
plt.barh(df_stat.feature, df_stat.na_rate, label='na rate (%)')
plt.legend()
plt.show()
- "Cabin" looks bad and need to remove instead of doing imputation.
- "Age" has only about 20% missing value and can be impute.
- "Embarked" has 0.22% of missing value and very good to impute.
Define columns¶
In [191]:
col_id = [] #['PassengerId']
col_target = ['Survived']
col_cat_small =['Pclass', 'Sex', 'Embarked'] #['Pclass', 'Sex', 'Embarked','Name']
col_cat_big = ['Ticket','Cabin']
col_cat = col_cat_big + col_cat_small
col_num = ['Age', 'SibSp', 'Parch', 'Fare']
Simple preprocess function¶
In [122]:
def simple_preprocess(X):
df_ret = X.copy()
df_ret['Ticket'] = df_ret['Ticket'].apply(lambda x:normalize_ticket(x))
df_ret['Name'] = df_ret['Name'].str.extract(r', (\w+\.)')
df_ret[df_ret.Cabin.isna()] = 'null_value'
return df_ret
Review data and statistic after simple preprocess¶
In [123]:
simple_preprocess(df_full)
Out[123]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value |
1 | 2 | 1 | 1 | Mrs. | female | 38.0 | 1 | 0 | PC | 71.2833 | C85 | C |
2 | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value |
3 | 4 | 1 | 1 | Mrs. | female | 35.0 | 1 | 0 | normal | 53.1 | C123 | S |
4 | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value |
887 | 888 | 1 | 1 | Miss. | female | 19.0 | 0 | 0 | normal | 30.0 | B42 | S |
888 | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value |
889 | 890 | 1 | 1 | Mr. | male | 26.0 | 0 | 0 | normal | 30.0 | C148 | C |
890 | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value | null_value |
891 rows × 12 columns
In [124]:
statMissingValue(simple_preprocess(df_full))
Out[124]:
feature | col_type | total | unique | na | na_rate | zero | zero_rate | |
---|---|---|---|---|---|---|---|---|
0 | PassengerId | object | 891 | 205 | 0 | 0.00 | 0 | 0.0 |
1 | Survived | object | 891 | 3 | 0 | 0.00 | 0 | 0.0 |
2 | Pclass | object | 891 | 4 | 0 | 0.00 | 0 | 0.0 |
3 | Name | object | 891 | 13 | 1 | 0.11 | 0 | 0.0 |
4 | Sex | object | 891 | 3 | 0 | 0.00 | 0 | 0.0 |
5 | Age | object | 891 | 64 | 19 | 2.13 | 0 | 0.0 |
6 | SibSp | object | 891 | 5 | 0 | 0.00 | 0 | 0.0 |
7 | Parch | object | 891 | 5 | 0 | 0.00 | 0 | 0.0 |
8 | Ticket | object | 891 | 11 | 0 | 0.00 | 0 | 0.0 |
9 | Fare | object | 891 | 101 | 0 | 0.00 | 0 | 0.0 |
10 | Cabin | object | 891 | 148 | 0 | 0.00 | 0 | 0.0 |
11 | Embarked | object | 891 | 4 | 2 | 0.22 | 0 | 0.0 |
In [125]:
simple_preprocess(df_full).Name.value_counts()
Out[125]:
null_value 687 Mr. 93 Miss. 47 Mrs. 44 Master. 7 Dr. 3 Major. 2 Mlle. 2 Mme. 1 Lady. 1 Sir. 1 Col. 1 Capt. 1 Name: Name, dtype: int64
Using DeepProbKit¶
In [126]:
df_nna = df_full.dropna()
df_nna.head(3)
Out[126]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
In [127]:
df_nna.drop(columns=['PassengerId', 'Name'], inplace=True)
df_full.drop(columns=['PassengerId', 'Name'], inplace=True)
<ipython-input-127-2ca628d0a0c0>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_nna.drop(columns=['PassengerId', 'Name'], inplace=True)
In [128]:
CAT_VARS = ['Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']
def to_numeric_data(dataset_train, cat_vars = CAT_VARS):
for col in dataset_train.columns:
if col in cat_vars:
dataset_train[col] = pd.factorize(dataset_train[col])[0]
# else:
# dataset_train[col] = pd.to_numeric(dataset_train[col])
return dataset_train
df_nna_num = to_numeric_data(df_nna)
In [129]:
# df_nna_num['Sex'] = pd.factorize(df_nna_num['Sex'])[0]
for col in CAT_VARS:
df_nna_num[col] = pd.factorize(df_nna_num[col])[0]
df_full[col] = pd.factorize(df_full[col])[0]
df_nna_num.head(3)
<ipython-input-129-26c21c7166a0>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_nna_num[col] = pd.factorize(df_nna_num[col])[0] <ipython-input-129-26c21c7166a0>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_nna_num[col] = pd.factorize(df_nna_num[col])[0] <ipython-input-129-26c21c7166a0>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_nna_num[col] = pd.factorize(df_nna_num[col])[0] <ipython-input-129-26c21c7166a0>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_nna_num[col] = pd.factorize(df_nna_num[col])[0] <ipython-input-129-26c21c7166a0>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_nna_num[col] = pd.factorize(df_nna_num[col])[0]
Out[129]:
Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 38.0 | 0 | 0 | 0 | 71.2833 | 0 | 0 |
3 | 1 | 1 | 0 | 35.0 | 0 | 0 | 1 | 53.1000 | 1 | 1 |
6 | 0 | 1 | 1 | 54.0 | 1 | 0 | 2 | 51.8625 | 2 | 1 |
In [130]:
from deeprob.spn.structure.leaf import Categorical, Gaussian, Bernoulli
import time, os
from deeprob.spn.algorithms.inference import *
from typing import Optional, Union, Tuple
from deeprob.spn.structure.node import Node
from deeprob.spn.learning.wrappers import learn_estimator
from deeprob.spn.utils.statistics import compute_statistics
In [131]:
df_full.columns
Out[131]:
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
In [132]:
print(df_nna_num.columns)
print(df_nna_num.dtypes)
for col in df_nna_num.columns:
print(f"{col}: {df_nna_num[col].nunique()}")
df_nna_num.head(3)
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object') Survived int64 Pclass int64 Sex int64 Age float64 SibSp int64 Parch int64 Ticket int64 Fare float64 Cabin int64 Embarked int64 dtype: object Survived: 2 Pclass: 3 Sex: 2 Age: 63 SibSp: 4 Parch: 4 Ticket: 127 Fare: 93 Cabin: 133 Embarked: 3
Out[132]:
Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 38.0 | 0 | 0 | 0 | 71.2833 | 0 | 0 |
3 | 1 | 1 | 0 | 35.0 | 0 | 0 | 1 | 53.1000 | 1 | 1 |
6 | 0 | 1 | 1 | 54.0 | 1 | 0 | 2 | 51.8625 | 2 | 1 |
In [91]:
max(df_nna_num['Cabin'].unique())
Out[91]:
132
In [133]:
### SPECIFY THE DISTRIBUTION DESCRIBING EACH FEATURE: MATCH WITH ABOVE COLUMNS
distributions = [Bernoulli, Categorical, Bernoulli, Gaussian,
Categorical, Categorical, Categorical,
Gaussian, Categorical, Bernoulli ] # * 10 + [Categorical] + [Gaussian] * 5 + [Categorical, Bernoulli] + [Gaussian] * 3
### SPECIFY THE DOMAINS OF EACH DISTRIBUTION: MATCH WITH ABOVE PRINTED COLUMNS
domains = [[0, 1], list(range(1, 3 + 1)), [0, 1], (0, 80 + 1),
list(range(0, 4 + 1)), list(range(0, 4 + 1)), list(range(0, 126 + 1)),
(0, 513 + 1), list(range(0, 132 + 1)), [0, 1]
] # * 10 + [list(range(2 + 1))] + [(0, 1)] * 5 + [list(range(16 + 1)), [0, 1]] + [(0, 1)] * 3
dfn = df_nna_num.values
spn = learn_estimator(
data=dfn,
distributions=distributions,
domains=domains,
random_state=42
)
1/? [00:00, 105.29node/s]
In [134]:
df_full.isna().sum() # cabin is filled due to pd.factorize
df_full.replace(-1, math.nan, inplace=True)
df_full.isnull().any(axis=1)
Out[134]:
0 True 1 False 2 True 3 False 4 True ... 886 True 887 False 888 True 889 False 890 True Length: 891, dtype: bool
In [135]:
df_full.isna().sum()
Out[135]:
Survived 0 Pclass 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
In [136]:
nan_mask = df_full.isnull().any(axis=1)
nan_data = df_full[nan_mask]
### IMPUTATION IS DONE ONLY ON THE ENTRIES WITH MATH.NAN entries, unlike autoencoder where I can't control where it fills in missing values
mpe_nan_data = mpe(root = spn, x = nan_data.values)
In [137]:
mpe_nan_data
Out[137]:
array([[ 0. , 3. , 0. , ..., 7.25 , 3. , 0. ], [ 1. , 3. , 1. , ..., 7.925, 3. , 0. ], [ 0. , 3. , 0. , ..., 8.05 , 3. , 0. ], ..., [ 0. , 2. , 0. , ..., 13. , 3. , 0. ], [ 0. , 3. , 1. , ..., 23.45 , 3. , 0. ], [ 0. , 3. , 0. , ..., 7.75 , 3. , 2. ]])
In [138]:
df_full_spn = df_full.copy()
df_full_spn[nan_mask] = mpe_nan_data
df_full_spn.isna().sum()
Out[138]:
Survived 0 Pclass 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 0 Embarked 0 dtype: int64
In [148]:
likelihood_per_row = likelihood(spn, df_full_spn.values)
np.mean(likelihood_per_row)
Out[148]:
1.1082337e-11
In [144]:
missing_cols = df_full.columns[df_full.isnull().any()].tolist()
df_mode_imputation = df_full.copy()
for col in missing_cols:
df_mode_imputation[col].fillna(df_mode_imputation[col].mode()[0], inplace=True)
In [163]:
likelihood_per_row_mode = likelihood(spn, df_mode_imputation.values)
np.mean(likelihood_per_row_mode)
Out[163]:
9.901252e-12
In [158]:
def impute_random_values(value):
if value.ndim == 0:
return value.size == 0
elif value.ndim == 1:
return value.size == 0
else:
return value.new(value.dtype)
# Impute random values for numerical columns
df_fill_random = df_full.fillna(method='ffill')
df_fill_random.head()
Out[158]:
Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 22.0 | 0 | 0 | 0 | 7.2500 | NaN | 0.0 |
1 | 1 | 1 | 1 | 38.0 | 0 | 0 | 1 | 71.2833 | 0.0 | 1.0 |
2 | 1 | 3 | 1 | 26.0 | 1 | 0 | 2 | 7.9250 | 0.0 | 0.0 |
3 | 1 | 1 | 1 | 35.0 | 0 | 0 | 3 | 53.1000 | 1.0 | 0.0 |
4 | 0 | 3 | 0 | 35.0 | 1 | 0 | 4 | 8.0500 | 1.0 | 0.0 |
In [159]:
likelihood_per_row_rand = likelihood(spn, df_fill_random.values)
np.mean(likelihood_per_row_rand)
Out[159]:
7.066738e-12
In [217]:
df_valid_nname = df_valid.drop(columns = ['PassengerId', 'Name'])
for col in CAT_VARS:
df_valid_nname[col] = pd.factorize(df_valid_nname[col])[0]
df_valid_nname.insert(0, 'Survived', -1)
df_valid_na = df_valid_nname.replace(-1, math.nan)
df_valid_na.head(3)
Out[217]:
Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 3 | 0 | 34.5 | 0 | 0 | 0 | 7.8292 | NaN | 0 |
1 | NaN | 3 | 1 | 47.0 | 1 | 0 | 1 | 7.0000 | NaN | 1 |
2 | NaN | 2 | 0 | 62.0 | 0 | 0 | 2 | 9.6875 | NaN | 0 |
In [218]:
test_nan_mask = df_valid_na.isnull().any(axis=1)
test_nan_data = df_valid_na[test_nan_mask]
### IMPUTATION IS DONE ONLY ON THE ENTRIES WITH MATH.NAN entries, unlike autoencoder where I can't control where it fills in missing values
mpe_test_nan_data = mpe(root = spn, x = test_nan_data.values)
In [219]:
mpe_test_nan_data
Out[219]:
array([[ 1. , 3. , 0. , ..., 7.8292, 3. , 0. ], [ 1. , 3. , 1. , ..., 7. , 3. , 1. ], [ 1. , 2. , 0. , ..., 9.6875, 3. , 0. ], ..., [ 1. , 3. , 0. , ..., 7.25 , 3. , 1. ], [ 1. , 3. , 0. , ..., 8.05 , 3. , 1. ], [ 1. , 3. , 0. , ..., 22.3583, 3. , 2. ]])
In [221]:
df_valid_spn = df_valid_na.copy()
df_valid_spn[test_nan_mask] = mpe_test_nan_data
df_valid_spn.isna().sum()
df_valid_spn.head()
Out[221]:
Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 3.0 | 0.0 | 34.5 | 0.0 | 0.0 | 0.0 | 7.8292 | 3.0 | 0.0 |
1 | 1.0 | 3.0 | 1.0 | 47.0 | 1.0 | 0.0 | 1.0 | 7.0000 | 3.0 | 1.0 |
2 | 1.0 | 2.0 | 0.0 | 62.0 | 0.0 | 0.0 | 2.0 | 9.6875 | 3.0 | 0.0 |
3 | 1.0 | 3.0 | 0.0 | 27.0 | 0.0 | 0.0 | 3.0 | 8.6625 | 3.0 | 1.0 |
4 | 1.0 | 3.0 | 1.0 | 22.0 | 1.0 | 1.0 | 4.0 | 12.2875 | 3.0 | 1.0 |
In [225]:
df_valid_spn['Survived'] == 0 # WTF
Out[225]:
0 False 1 False 2 False 3 False 4 False ... 413 False 414 False 415 False 416 False 417 False Name: Survived, Length: 418, dtype: bool
Original Notebook's training of a autoencoder to fill-in missing data and build a classifier.¶
In [ ]:
Fill in missing value from random¶
In [ ]:
#
In [170]:
def fill_na_with_random(df_ref, df_na):
df_ret = df_na.copy()
for col in df_ret.columns:
ret_nan = df_ret[col][df_ret[col].isna()]
ref_n_nan = df_ref[~df_ref[col].isna()][col]
df_ret[col].loc[df_ret[col].isna()] = np.random.choice(ref_n_nan,size=len(ret_nan))
return df_ret
# fill_na_with_random(df_xtrain, df_xtrain)
Create noisy data by using permutate data in each column¶
In [164]:
def make_noisy(np_data):
np_ret = np.copy(np_data)
for i in range(np_ret.shape[1]):
np.random.shuffle(np_ret[:,i])
return np_ret
In [165]:
def mode_imputation(df, col):
df[col].fillna(df[col].mode()[0], inplace=True)
return df
def mean_imputation(df, col):
df[col].fillna(df[col].mean()[0], inplace=True)
return df
Model¶
Split dataset¶
In [175]:
df_valid.head(3)
Out[175]:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
In [193]:
# x_full = df_full[col_num+col_cat]
# y_full = df_full[col_target]
x_full = df_full_spn.drop(columns=['Survived'])
y_full = df_full_spn['Survived']
# x_valid = df_valid[col_num+col_cat]
x_valid = df_valid.drop(columns=['PassengerId', 'Name'])
x_train, x_test, y_train, y_test = train_test_split(x_full, y_full, test_size=0.25, random_state=6668)
x_ref = x_train.copy()
x_train = fill_na_with_random(x_ref, x_train)
x_test = fill_na_with_random(x_ref, x_test)
x_valid = fill_na_with_random(x_ref, x_valid)
# x_train = simple_preprocess(x_train)
# x_test = simple_preprocess(x_test)
# x_valid = simple_preprocess(x_valid)
<ipython-input-170-e38bb74c6ef7>:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_ret[col].loc[df_ret[col].isna()] = np.random.choice(ref_n_nan,size=len(ret_nan))
In [195]:
# x_train.Name.value_counts()
x_train.head(5)
Encode category pipeline¶
In [196]:
x_train.head()
Out[196]:
Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|
839 | 1.0 | 0.0 | 35.674426 | 1.0 | 0.0 | 648.0 | 29.7000 | 140.0 | 1.0 |
28 | 3.0 | 1.0 | 35.674426 | 1.0 | 0.0 | 27.0 | 7.8792 | 3.0 | 2.0 |
387 | 2.0 | 1.0 | 36.000000 | 1.0 | 0.0 | 331.0 | 13.0000 | 3.0 | 0.0 |
797 | 3.0 | 1.0 | 31.000000 | 1.0 | 0.0 | 623.0 | 8.6833 | 3.0 | 0.0 |
261 | 3.0 | 0.0 | 3.000000 | 3.0 | 2.0 | 24.0 | 31.3875 | 3.0 | 0.0 |
In [197]:
# !pip install category_encoders
Requirement already satisfied: category_encoders in /usr/local/lib/python3.10/dist-packages (2.6.3) Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.23.5) Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.1.3) Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.11.4) Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.14.1) Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.5.3) Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.5.4) Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2023.3.post1) Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.1->category_encoders) (1.16.0) Requirement already satisfied: joblib>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.3.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.2.0) Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.9.0->category_encoders) (23.2)
In [198]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder,OrdinalEncoder,StandardScaler
from sklearn.preprocessing import PowerTransformer
import category_encoders as ce
from xgboost import XGBClassifier
import lightgbm
# Preprocessing for numerical data
numerical_transformer = Pipeline(verbose=False,steps=[
('scale', StandardScaler(with_mean=True,with_std=True)),
])
# Preprocessing for categorical data
categorical_onehot_transformer = Pipeline(verbose=False,steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
categorical_count_transformer = Pipeline(verbose=False,steps=[
('count', ce.CountEncoder(min_group_size = 3)),
('scale', StandardScaler(with_mean=True,with_std=True)),
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(verbose=False,
transformers=[
('pre_cat_count', categorical_count_transformer, col_cat_big),
('pre_cat_onehot', categorical_onehot_transformer, col_cat_small),
('pre_num', numerical_transformer, col_num),
])
Execute encode category process¶
In [200]:
preprocessor.fit(x_train)
x_train_encoded = preprocessor.transform(x_train)
x_test_encoded = preprocessor.transform(x_test)
# x_valid_encoded = preprocessor.transform(x_valid)
Warning: No categorical columns found. Calling 'transform' will only return input data.
Impute missing value using auto encoder¶
In [201]:
input_dim = x_train_encoded.shape[1]
model_impute = keras.Sequential()
model_impute.add(layers.Dense(20,activation='gelu', input_dim=input_dim, kernel_initializer='he_uniform'))
model_impute.add(layers.Dense(16,activation='gelu', kernel_initializer='he_uniform'))
model_impute.add(layers.Dense(10,activation='gelu', kernel_initializer='he_uniform', name='bottleneck'))
model_impute.add(layers.Dense(16,activation='gelu', kernel_initializer='he_uniform'))
model_impute.add(layers.Dense(20,activation='gelu', kernel_initializer='he_uniform'))
model_impute.add(layers.Dense(input_dim,activation='linear', kernel_initializer='he_uniform'))
optimizer = keras.optimizers.Adam(learning_rate=0.03)
model_impute.compile(optimizer = optimizer, loss = 'msle')
model_impute.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 20) 300 dense_1 (Dense) (None, 16) 336 bottleneck (Dense) (None, 10) 170 dense_2 (Dense) (None, 16) 176 dense_3 (Dense) (None, 20) 340 dense_4 (Dense) (None, 14) 294 ================================================================= Total params: 1616 (6.31 KB) Trainable params: 1616 (6.31 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
In [202]:
tf.keras.utils.plot_model(model_impute, show_shapes=True,rankdir='LR')
Out[202]:
In [203]:
es = tf.keras.callbacks.EarlyStopping(monitor='loss', mode='min', verbose=1, patience=50)
noise_X = make_noisy(x_train_encoded)
# noise_X = np.concatenate((noise_X, make_noisy(noise_X)), axis=0)
noise_X = np.concatenate((noise_X, np.copy(x_train_encoded)), axis=0)
his = model_impute.fit(noise_X, noise_X, epochs = 2000, batch_size = 512, shuffle = True, callbacks=[es], verbose=0)
Epoch 477: early stopping
In [205]:
x_train_encoded
Out[205]:
(668, 14)
Plot learning curve¶
In [206]:
plt.figure(figsize=(12,8))
plt.plot(his.epoch,his.history['loss'], label='loss', linewidth=2)
plt.legend()
plt.show()
Create final dataset with impute missing value by auto encoder¶
In [ ]:
x_train_impute = model_impute.predict(x_train_encoded)
x_test_impute = model_impute.predict(x_test_encoded)
x_valid_impute = model_impute.predict(x_valid_encoded)
In [ ]:
x_train_encoded
In [ ]:
x_train_impute
In [ ]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
# Preprocess the data
def preprocess_data(df):
continuous_features = df.select_dtypes(include=[np.number])
categorical_features = df.select_dtypes(include=[object])
scaler = MinMaxScaler()
continuous_features = scaler.fit_transform(continuous_features)
encoder = OneHotEncoder(sparse=False)
categorical_features = encoder.fit_transform(categorical_features)
return np.hstack([continuous_features, categorical_features])
# Create a mask for missing values
def create_missing_mask(df):
return df.isnull()
# Build the autoencoder model
def build_autoencoder(input_dim, encoding_dim):
input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
return autoencoder
# Load and preprocess your dataset
df = df_full#pd.read_csv('your_dataset.csv')
missing_mask = create_missing_mask(df)
preprocessed_data = preprocess_data(df.fillna(df.mean()))
In [ ]:
df_full.mean()
In [ ]:
preprocessed_data.shape
In [ ]:
# Train the autoencoder
input_dim = preprocessed_data.shape[1]
encoding_dim = 64
autoencoder = build_autoencoder(input_dim, encoding_dim)
autoencoder.fit(preprocessed_data, preprocessed_data, epochs=5, batch_size=256, shuffle=True)
# Impute missing values
imputed_data = autoencoder.predict(preprocessed_data)
imputed_df = pd.DataFrame(imputed_data)
# imputed_df[missing_mask] = np.nan
# imputed_df.fillna(df.mean(), inplace=True)
In [ ]:
imputed_df
Model predict survival¶
Model for suvival prediction¶
In [ ]:
from sklearn.metrics import accuracy_score
def get_score(model, X, y):
y_pred = model.predict(X)
return accuracy_score(y, y_pred)
In [ ]:
# list_score = []
# for n_est in tqdm.tqdm(range(200,300,1)):
# xgb = XGBClassifier(n_estimators=n_est)
# xgb.fit(x_train_impute, y_train)
# score = get_score(xgb,x_test_impute, y_test)
# list_score.append([n_est,score])
# list_score = np.array(list_score)
In [ ]:
# plt.figure(figsize=(12,8))
# plt.plot(list_score[:,0],list_score[:,1], label='accurary')
# plt.legend()
# plt.show()
In [ ]:
# best_param = list_score[np.argmax(list_score[:,1])]
# best_param
In [ ]:
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
# Best param
# best_n_est = int(best_param[0])
best_n_est = 1000
xgb = XGBClassifier(n_estimators=best_n_est, learning_rate=1e-3, seed=6688)
xgb.fit(x_train_impute, y_train)
print(f'train score: {get_score(xgb,x_train_impute,y_train)}')
print(f'test score: {get_score(xgb,x_test_impute,y_test)}')
Submit result to leaderboard¶
In [ ]:
y_valid = xgb.predict(x_valid_impute)
In [ ]:
df_submit = pd.DataFrame({'PassengerId': df_valid.PassengerId, 'Survived': y_valid})
df_submit.head(3)
In [ ]:
df_submit.to_csv('submission.csv', index=False)
print("Submitted successful!")