When to use Artificial Intelligence to solve a problem?¶

First note: The following article is intended to audience who's trying to decide whether a problem should be solved using Artificial Intelligence methods or not. It can be for industrial purposes such as deciding where to allocate resources on (AI pipeline or not), or academic purposes, such as deciding whether a project's problem is suitable to be addressed using AI. If you enjoy exploring AI across various settings, and wants to experiment using AI or non-AI-based solutions to tackle different problems, then you can consider this article as simply sharing my experiences, which can hopefully help you in anyway possible in your journey learning AI.

Second note: I use the term 'Artificial Intelligence' as an euphemism to refer to kinds of solutions which I consider fundamentally different to heuristics-based or analytical solutions. Although the correct term should be narrow-AI based solution. Here is a table to illustrate the differences:

AI-based solution (also called narrow AI) Heuristics-based solution (also referred as analytical)
Usually probabilistic in nature Usually deterministic in nature
Usually poor explainability or interpretability Usually good explainability and interpretability
Examples involve deep neural networks, reinforcement learning Examples include linear regression, tree-search algorithms

Anyway, on the question of whether AI should be used to tackle certain problem. I come across this question quiet often with my colleagues, which is why I'll share in this following notebook my own thoughts and opinions on this matter, which is derived from my own adventure in studying AI algorithms and modelsāœŒļø.

I'm already fully convinced that Artificial Intelligence (AI) will be a core protagonist of the Fourth Industrial Revolution (see a brief discussion in my website). This has attracted a lot of enthusiasts across the world (like me) to learn the algorithms and models proposed in the quest to achieve AI. However, not all the models or algorithms as of 2022[^1] are close in achieving Artificial General Intelligence (AGI), and they are mostly specialized into solving a task in a simplified setting tackling a problem reminiscence of a human faculty (although it's true that as the years go by, assumptions and restrictions are being lifted off, and the models, algorithms and data are being tweaked or enhanced so that they can work for more generalized tasks).

Anyway, the point is that there is yet an algorithm or model that can work for every known problem in math, and I want to share with you some of the tips I've learned when deciding what kind of problem should be solved using AI and which should follow a non-AI paradigm, that is, be solved using an analytical solution.

Briefly:

  • An AI solution is preferred to problems which are hard to be distilled mathematically
    • Although the pursuit of AI is to solve any problem (i.e. domain generalization), a simple heuristic can sometimes be more time and memory efficient than an AI solution.
    • The more you understand the problem, the less likely it's to be solved using an AI algorithm.
    • Example: The Tower of Hanoi puzzle is a problem that can be solved using a reinforcement learning agent, but it's better solved through a recursive algorithm.
  • An AI solution is preferred when the variables involved in the problem (input features predicting an output) don't follow a monotonic relationship and/or can't be linearised
    • An AI solution is preferred for a task when a model has to capture the non-linear, non-monotonic relationship between input features and output.
    • Models where there exists a linear or monotonic relationship between input features and output usually focus on their parameters' interpretability and explainability, while those that don't have such linear relationships care more about the accuracy of the models' predictions and optimal performance in a task.
    • Input features that usually bear no relationship with the output usually correspond to unstructured data, while those which do display some sort of monotonic relationship are carefully-processed data, which may have research backing then up.
    • Example: The task of cat image recognition based on input features consisting of pixels is better solved using an AI algorithm like deep learning. The task of coronary artery disease (CAD) detection based on known physiological features like age, gender or cholesterol levels are better solved using logistic regression.
  • A symbiosis rather than zero-sum choice
    • A purely AI solution may optimally solve a task and can generalize to different environments, but it may be time-consuming. A pure non-AI, heuristic solution may require too much understanding of a problem beforehand and can't be generalized. A mix of both strategies, such as what AlphaGo does by mixing reinforcement learning paradigm of environment interaction based on a tree-search heuristic can be better at solving a task. Albeit the creativity that is required to extrapolate the advantages of both the AI and non-AI paradigm may also be challenging.
  • NP problems, Artificial General Intelligence and quantum computing
    • It seems as though heuristics are preferred for polynomially-solvable problems since they aren't as cumbersome and time-consuming as AI-based solutions. What about NP problems?
    • Will the above recommendations still be useful once we achieve Artificial General Intelligence?
    • NP problems have the quality of being non-polynomial due to the fundamental, discrete, binary nature of computation. If quantum computing, which proposes that a state can be evaluated to different values at the same time, can simplify NP problems to P, then will all problems be solved through a simple, AI-General heuristic?

[^1]: I have to add "2022" because the pace in which AI algorithms and models are developing is exponentially fast (see The Singularity is Near, by Ray Kurzweil) to the point that by next year, or next 5 years, we just might witness an "Universal Algorithm" which allows a model to solve any kind of task and the whole argument of when to use AI to solve a problem is pointless.

An AI solution is preferred to problems which are hard to be defined mathematically¶

Problems that can't be thoroughbly described through math and theorems are better left for a ML algorithm to solve. Converselly, problems which have a thorough mathematic definition and are understandable through math are better solved analytically. Here, to be "precisely mathematically defined" is subject to discussion. It could be that the task has a clearly defined goal where there is known to be a correct result, or that there is a precise "mathematical abstraction" of it. Let me illustrate it with an example of solving the task consisting of the puzzle of Tower of Hanoi (ToH):

tower

Figure: The game starts with the state at top left corner, at an initial state where n disks are placed in the leftmost rod, and goal is to place all disks to the rightmost rod. It's illegal to put a larger disk on top of a smaller one. Only one disk can be moved per more. Here n = 3, and it takes $2^n - 1 = 7$ moves to move all disks as required. Figure extracted from https://clipground.com/tower-of-types-of-clipart.html. For the actual game please visit: https://www.mathsisfun.com/games/towerofhanoi.html

Well... what do you think? "Should" this problem be solved using ML or analytically?

My answer would be that this problem can be solved using ML, however, it's more time and memory-efficient to simply use an analytical solution given that ToH is a mathematically, well-defined problem:

  • The end state is explicit: all n disks have to be at the rightmost rod
  • The rules of the game leave no ambiguity and they're easily understandable: there are only 3 rods, one can't move more than 1 disk per turn.
  • There is well-defined strategy to play the game. This bullet point is important, given that ToH allows for some interesting backward, inductive reasoning. To actually achieve the end state, we should be aware that the nth disk (the largest one) should move only once to the third rod, which implies the second rod has all n - 1 disks. And they got there because each one of the top level disks has been moved to the surrounding 2 rods to make space for lower-level disks, and each one of them had to make two-step moves exponentially in order to do so (this is better observed using fewer disks). Through this reasoning, ToH can be solved with a minimum, therefore optimally, $2 ^ n -1$ moves, which means that its computation should be $O(2^n)$ for both running time and memory.

In general, the more we understand a problem and can mathematically-distill it, the less likely it's to be solved using an AI algorithm. This is nonetheless different than saying we can't solve it using AI, it's just that it'll be time-cumbersome due to training, or memory inefficient due to the storage of state-value or weight matrices depending on which paradigm you're using. Moreover, assuming we're employing techniques from the connectionist school such as deep learning (DL) or reinforcement learning (RL), then because of the probabilistic/statistical nature of those approaches, we might get different results that are not the optimal, which we could have gotten through an analytical approach which gives a sole answer.

Anyhow, at the beginning, I did try to solve ToH using AI šŸ˜‚. Namely, given that ToH is not about predicting something from features, I chose RL which I learnt from a Coursera course on RL offered by the University of Alberta, which is an AI paradigm which focuses on exploration and interaction with a digital environment (the rods in this case) to maximize a reward (to reach the end state in this case). This required me to implement an environment, and solve it using a SARSA Agent following this template. The code is in my GitHub, albeit I also copied down in the Appendix Section. The general idea of RL is very interesting. In this case of ToH, the agent had to move disks randomly at the beginning, and reach the final state. Because there is a state-value matrix which stores information about visiting states that yields the highest value, it can subsequently greedily choose, iterate and modify the values of this matrix so that it can choose the sequence of states which maximizes the reward (helps it reach the end state while not making illegal moves). While I enjoyed a lot being able to apply what I've learned from Coursera on a math puzzle I liked, I do admit that I don't want to try solving ToH using RL šŸ˜‚. To be even more honest, I'm not even sure the RL code can actually complete the game in the optimal $2^n - 1$ moves, and things would get extremely ugly as the number of disks increases, which means an exponential increase in computation time. The state-value matrix I designed occupies a memory which is $O(6n^3)$, albeit it can be reduced by choosing a better way to represent the environment.

It isn't really recommended to use AI on ToH given how much of ToH can be mathematically distilled, and therefore tackled using an analytical solution. Actually, a recursive solution uploaded by Harshit Agrawal in GeeksForGeeks can solve the ToH in about 6 lines of code, it finds the optimal solution whilst avoiding the training of an agent and estimation of a value matrix:

InĀ [1]:
### Original code by Harshit Agrawal @ GeeksForGeeks
def TowerOfHanoi(n , from_rod, to_rod, aux_rod, counter = 0):
    # The code is interestingly reminiscent of the proof that ToH can be completed in 2^n - 1 moves. That
    # is, we can see that there are two lines of code (i.e. the two remaining rods) where the n - 1 discs have to 
    # switch around (the recursion) to let the final disk move from rod 1 to rod 3. 
    if n == 0:
        return
    TowerOfHanoi(n-1, from_rod, aux_rod, to_rod)
    print("Move disk",n,"from rod",from_rod,"to rod",to_rod)
    TowerOfHanoi(n-1, aux_rod, to_rod, from_rod)

n = 3
TowerOfHanoi(n, 'A', 'C', 'B')
# A, C, B are the name of rods
Move disk 1 from rod A to rod C
Move disk 2 from rod A to rod B
Move disk 1 from rod C to rod B
Move disk 3 from rod A to rod C
Move disk 1 from rod B to rod A
Move disk 2 from rod B to rod C
Move disk 1 from rod A to rod C

An AI solution is preferred when the variables involved in the problem (input features predict an output) don't follow a monotonic relationship and/or can't be linearised¶

This next reason can technically be deduced from the above tip, that is, use an AI solution if a problem can't be thoroughly mathematically abstracted (although this wouldn't be a list with only one item šŸ˜‚). Again, let's use an example in a healthcare setting, where I want to predict whether a person has a heart disease from the following set of features: gender, age, cholesterol level and heart rate. It's well backed-up by extensive research that, on average, fluctuations in these features can have a significant impact on whether a person is predicted to having heart disease or not. What's important is that there exists a linear relationship between heart disease (output) and the input features, and therefore this prediction task is better left for logistic regression or classification tree, rather than deep learning.

non-lin-lin

Figure: Some problems where input features don't share a linear relationship with the output may require adding non-linearities in between so that they can be mapped consistently. The non-linearities of the deep neural network correspond to ReLu activations. The notation is as follows: x_i are the input features, while k_j would be the class. On the right I tried to illustrate a logistic regression models where inputs are mapped directly using a sigmoid function to the output, where the output is very likely the log odds of the outocme. Illustration edited from: https://www.researchgate.net/figure/An-illustration-of-a-deep-neural-network_fig4_333049256![image.png](attachment:image.png) Even if variables share non-linear relationship, we should still first look at the available literature about whether there exists ways to linearise them, id est, exponential or potential relationships can be linearised using logarithms, we could also take the logarithm of the odds or go for standardization of input features. This reinforces a previous idea that the more we understand and mathematically distill a problem, it's better to use an analytical approach to model and solve it.

Conversely, what if the variables concerned share non-linear, non-monotonic and they can't be linearised? Then it's plausible to say that approaches like deep learning could be useful in solving this problem. Clear examples include today's famous problems of image recognition and machine translation. While both of them is about predicting some output (class label in the former, words in a desired language in the latter) given input features (pixels in the former, words in a language in the latter), there is just no research finding that suggest, for instance, that pixels follow a linear relationship with its associated class label (because otherwise a LinReg model could actually predict whether an image shows a cat or not). Another example would be if all words across all languages had an unique one-to-one mapping irrespective of context, and all shared the same grammar rules, then it's plausible to think that an analytical solution to machine translation would be a model consisting of a dictionary in charge of the one-to-one mapping and grammar rules of the desired language to produce a feasible translation. However, there are just an infinite amount of inconsistencies, irregularities, "exceptions" (and whatever other synonyms), that characterise human laguage, making it suitable to be tackled through AI algorithms.

Let's illustrate the point made above with two problems: cat image recognition from pixel inputs and coronary artery disease (CAD) detection from physiological features. Here, we'll propose that cat image recognition is a problem suitable to be tackled through deep learning because there exists a non-linear relationship between pixels and 'cat' class label, while CAD detection based on physiological features (e.g. age, sex, cholesterol levels, et al.) is more suitable for binary logistic regression because there is a linear relationship that can be taken advantage of. We can explore this assertion by running both deep learning and logistic regression on both problems (4 models in total) and analyse the results we get.

Dataset I of cat images: convolutional deep learning vs. logistic regression¶

The dataset 'train_catvnoncat.h5' and 'test_catvnoncat.h5' contain hundreds of 64 px * 64 px RGB pictures of cats and other non-cat things (e.g. landscapes, insects) that are offered in the Deep Learning Specialization in Coursera. The code below first loads and does some exploration of the dataset, and we'll subsequently run both a deep learning and logistic regression model and analyse the results we get. Here, we have to distinguish fundamental differences between the two tasks above. While image recognition is concerned with increasing the classification accuracy of a cat image, which implies that the model should, hopefully, be shape-, scale-, rotation-, brightness- invariant, the task of disease detection is more concerned on which features can significantly predict and by what extent the absence or presence of a disease. From these different purposes we can also identify when an AI solution is more suitable for each task. **When a task is more concerned in improving accuracy more than interpretability of its parameters, then it's likely to be addressed using an AI solution***. It's partly due to this reason that there is a rise in the demand of explainable AI given that AI models in the past and present are mostly concerned in solving a task rather than explaining how do they arrive at their solutions. I won't delve into details problems such as black box algorithms or dataset biases given that they're not the focus of today's notebook, but I do reinstate that they're present concerns in the AI community.

Anyhow, the assertion above can also be written as follows: Conversely, if a task is more concerned in interpretability of its parameters, then usually this entails a lower number of parameters, and which share a linear relationship with the output.

To show the above idea, we can first start running a convolutional neural network (CNN) as well as a logistic regression model to solve the task of cat image recognition and compare interpretability and accuracy results.

InĀ [1]:
import time
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
import torch                     # you need to have installed pytorch via pip (check https://www.geeksforgeeks.org/install-pytorch-on-windows/) or 
                                 # I prefer installing anaconda and then running 'conda install -c pytorch pytorch'
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# For CNN
from torch.nn import Module
from torch.nn import Conv2d
from torch.nn import Linear
from torch.nn import MaxPool2d
from torch import flatten

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2


### 1. Load the data
# Training instances
train_dataset = h5py.File('train_catvnoncat.h5', "r")
train_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
train_y = np.array(train_dataset["train_set_y"][:]) # your train set labels
train_y = train_y.reshape((1, train_y.shape[0]))
# Testing instances
test_dataset = h5py.File('test_catvnoncat.h5', "r")
test_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
test_y = np.array(test_dataset["test_set_y"][:]) # your test set labels
test_y = test_y.reshape((1, test_y.shape[0]))
# Class labels for visualization
classes = np.array(test_dataset["list_classes"][:]) # the list of classes

# Data preprocessing so that PyTorch can work with numpy arrays

tensor_train_x = torch.from_numpy(train_x_orig)
tensor_train_y = torch.from_numpy(train_y)
tensor_test_x = torch.from_numpy(test_x_orig)
tensor_test_y = torch.from_numpy(test_y)

tensor_train_x_flatten = torch.from_numpy(train_x_orig.reshape(train_x_orig.shape[0], -1)/255)
tensor_train_y_flatten = torch.from_numpy(train_y.reshape(train_y.shape[0], -1).T)
tensor_test_x_flatten = torch.from_numpy(test_x_orig.reshape(test_x_orig.shape[0], -1)/255)
tensor_test_y_flatten = torch.from_numpy(test_y.reshape(test_y.shape[0], -1).T)

# Data visualization
image_idx = 5        # You can choose your own number between 0 - 208
plt.imshow(tensor_train_x[image_idx,])
print("The label is {0}".format(classes[tensor_train_y_flatten[image_idx].item()]))


### 2. Model definition
class ConvNet(nn.Module):
    '''
    Simple Convolutional Neural Network for cat image recognition.
    It's quiet unorthodox to use a CNN for binary classification,
    but we'll see that it serves to defend the idea that image classification
    is better tackled using an AI strategy. 
    The code is adapted from: https://coderzcolumn.com/tutorials/artifical-intelligence/pytorch-convolutional-neural-networks
    '''
    def __init__(self):
        super(ConvNet, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=24, kernel_size=5, padding = 1, stride = 1),
            # the ReLu activation function led to gradient vanishing. I guess it's due to 
            # dataset (209 training instances) being too small.  
            nn.Tanh(),
            # pooling is for down-scaling values which ensure robustness to orientation and scale variance
            nn.MaxPool2d(kernel_size = 2, stride = 1, padding = 1), 
            # Dropout is to prevent overfitting. Earlier versions of the code achieved 100% training
            # accuracy with ~70% validation accuracy
            nn.Dropout(), 
            
            nn.Conv2d(in_channels=24, out_channels=12, kernel_size=3, padding = 1, stride = 1),
            nn.Tanh(),
            nn.MaxPool2d(kernel_size = 2, stride = 1, padding = 1),
            nn.Dropout(), 
            
            nn.Conv2d(in_channels=12, out_channels=6, kernel_size=3, padding = 1, stride = 1),
            nn.Tanh(),
            nn.MaxPool2d(kernel_size = 2, stride = 1, padding = 1),
            nn.Dropout(),
            
            nn.Flatten(),
            # the last layer contains 25350 perceptrons. The number emerges as follows
            # There are originally (64 + 3)^2 * 6 = 26934 perceptrons, but 
            # 26934 - 1584 randomly dropped out perceptrons
            nn.Linear(in_features = 25350, out_features = 1),  
            # The output will be passed through the Sigmoid function because it's 
            # binary classification. 
            nn.Sigmoid()         
        )
    def forward(self, x_batch):
        preds = self.seq(x_batch)
        return preds
    
The label is b'non-cat'
No description has been provided for this image
InĀ [2]:
### 3. Model initialization and hyperparameter choice
torch.manual_seed(1)                                        # for reproducibility
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'   # speeds things up if you have a GPU
#device = 'cpu'
conv_net = ConvNet()
conv_net.to(device)
loss_criterion = nn.BCELoss()                                # loss criterion for binary classification
optimizer = torch.optim.Adam(conv_net.parameters(), lr=0.001)# Adam optimizer with arbitrary low learning rate.
                                                             # A higher lr could lead to gradient vanishing. 
# We'll perform mini-batch gradient descent
batch_size = 24                                              # arbitrary, you can modify it
epochs = 100                                              # arbitrary, you can modify it
# There is an additional preprocessing step to ensure the number of channels is coherent
# with the model definition relevant for training and testing. 
tensor_train_x_cnn = tensor_train_x.permute(0, 3, 1, 2).float()
tensor_test_x_cnn = tensor_test_x.permute(0, 3, 1, 2).float()


### 4. Training model
for _ in range(epochs): 
    batches = torch.arange((tensor_train_x_cnn.shape[0]//batch_size)+1) ### Batch Indices
    # Mini-batch gradient descent
    for batch in batches:
        if batch != batches[-1]:
            start, end = int(batch*batch_size), int(batch*batch_size+batch_size)
        else:
            start, end = int(batch*batch_size), None
        X_batch, Y_batch = tensor_train_x_cnn[start:end], tensor_train_y_flatten[start:end] ## Single batch of data
        X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
        # Forward pass
        preds = conv_net(X_batch) 
        # Compute loss
        loss = loss_criterion(preds, Y_batch.float()) ## Calculate Loss
        # Optimize
        optimizer.zero_grad() ## Zero weights before calculating gradients
        loss.backward()       ## Calculate Gradients
        optimizer.step()      ## Update Weights
print("The model was able to minimize the error up to: ", loss) 
The model was able to minimize the error up to:  tensor(0.0032, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
InĀ [3]:
torch.manual_seed(1) 
### 5. Model evaluation
train_output = conv_net(tensor_train_x_cnn.to(device)) 
train_accuracy = sum((train_output > 0.5) == tensor_train_y_flatten.to(device)\
                    )/tensor_train_y_flatten.shape[0]
test_output = conv_net(tensor_test_x_cnn.to(device))
test_accuracy = sum((test_output > 0.5) == tensor_test_y_flatten.to(device)\
                   )/tensor_test_y_flatten.shape[0]
print("The model can achieve a {0}% accuracy in the training set, while a {1}% accuracy in the testing set \n"
      .format(round(train_accuracy.item()*100,2), round(test_accuracy.item()*100, 2)))

### 6. We can check which images were misclassified for fun :)
test_output[((test_output.to(device) > 0.5) == tensor_test_y_flatten.to(device)) == False]
idx = 1
plt.imshow(tensor_test_x[idx, ])
print("This image was classified as {}, when the correct label was {}"
      .format(test_output[idx].item() > 0.5, tensor_test_y_flatten[idx].item() == 1) )
The model can achieve a 99.52% accuracy in the training set, while a 62.0% accuracy in the testing set 

This image was classified as True, when the correct label was True
No description has been provided for this image
InĀ [4]:
for name, parameter in conv_net.named_parameters():
    
    #print("weight" in name)
    print(name)
    if parameter.requires_grad: 
        print("yes")
        print(torch.abs(parameter.grad.mean()))
seq.0.weight
yes
tensor(0.0018, device='cuda:0')
seq.0.bias
yes
tensor(1.0082e-05, device='cuda:0')
seq.4.weight
yes
tensor(0.0003, device='cuda:0')
seq.4.bias
yes
tensor(0.0013, device='cuda:0')
seq.8.weight
yes
tensor(0.0001, device='cuda:0')
seq.8.bias
yes
tensor(0.0008, device='cuda:0')
seq.13.weight
yes
tensor(0.0004, device='cuda:0')
seq.13.bias
yes
tensor(0.0026, device='cuda:0')

Now, we can observe that a very small, CNN can achieve quiet a robust performance for cat recognition. Because gradient descent is non-deterministic, as well as there are many hyperparamaters that can be adjusted, the accuracy in the testing set can actually climb up to >80%. You can try it for yourself šŸ˜„, by either removing the manual seeds, varying the learning rate, activation functions, strides, epochs, so on and so forth. Logistic regression, on the other hand, is deterministic, as we'll see below:

InĀ [5]:
from sklearn.linear_model import LogisticRegression

### We will work with the original numpy arrays without converting them to tensors this time
### But we'd still flatten them and normalize them given that log.reg models don't accent
### 4D input
train_x_orig_lr = train_x_orig.reshape(train_x_orig.shape[0], -1)/255    
train_y_lr = train_y.reshape(train_y.shape[0], -1).T
test_x_lr = test_x_orig.reshape(test_x_orig.shape[0], -1)/255
test_y_lr = test_y.reshape(test_y.shape[0], -1).T

lr_cat = LogisticRegression()
lr_cat.fit(train_x_orig_lr, train_y_lr)
training_acc_lr = lr_cat.score(train_x_orig_lr, train_y_lr)
test_acc_lr = lr_cat.score(test_x_lr, test_y_lr)
print("The model can achieve a {0}% accuracy in the training set, while a {1}% accuracy in the testing set \n"
      .format(round(training_acc_lr*100,2), round(test_acc_lr*100, 2)))
print("The model parameters are {0}, where we can pick a random parameter and say that a unit increase \
of a feature pixel is the expected unit change in log odds of the outcome."
      .format(lr_cat.coef_))
C:\Users\86136\anaconda3\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
The model can achieve a 100.0% accuracy in the training set, while a 72.0% accuracy in the testing set 

The model parameters are [[ 0.01848942 -0.04769157 -0.01966443 ... -0.02367532 -0.04662887
   0.05457665]], where we can pick a random parameter and say that a unit increase of a feature pixel is the expected unit change in log odds of the outcome.
C:\Users\86136\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

When running the model and obtaining the results, I was quiet shocked to see such a high testing accuracyšŸ˜‚! I guess it has to do with such a small dataset used for binary classification, i.e, the advantages of using an AI approach based on a CNN would be more noticeable if we were to deal with multi-class classification, along with a larger dataset. Regardless of such, some key points don't change:

  • Logistic regression is not suitable for image recognition. The testing accuracy for LogReg may increase by changing some hyperparameters like the solver utilized, but the fundamental idea behind Log.Reg, which is that a linear relationship can be captured between input features and output (the log odds of output to be precise, which still falls under the notion of "relationships that can be linearised") doesn't hold for image recognition.
  • There is however, an advantage of interpretability, where in this case we can make a (rather weird) statement that from the increase of any random pixel's value we can expect an unit change in the log odds of the outcome. This is some form of explainable AI, because we can't make a similar statement for the CNN above. This is because, without being too technical, while in LogReg we have 12288 features corresponding to the pixels of the image, there are > 20000 weights in the CNN corresponding to the kernel filters and fully connected layer's perceptrons (which can even increase furthermore). There is no linear relationship because a key idea behind CNNs is to add non-linearities using activation functions like Tanh or ReLu.
  • We can wonder then under what scenario would image recognition be suitable for LogReg. I'd say that if instead of using unstructured input data like matrix of pixels, we used small feature vectors like ['is_mammal', 'has_fur', weight, number_of_feet, 'can_fly'...], then because of the specificity of the features, as well as there is a plethora of zoological or biological research backing some of the features (such as all cats are mammals), then I'd argue that in this case LogReg would be better than CNN. Definately better.

Dataset II: Coronary artery disease features¶

We can do the same as above, but this time we'll explore the task of CAD detection, which is more suitable for LogReg rather than an AI-approach like a deep neural network (DNN). We'll run a DNN and LogReg, to which we'll analyze their accuracy as well as interpretability. The dataset used was taken from Kaggle, and it consists of 13 physiological features measured from 303 participants (297 after excluding those whose measurements aren't complete). There is research justifying the choice of these features, which are:

  1. age
  2. gender
  3. cp - chest pain
  4. trestbps - resting blood sugar
  5. chol - serum cholesterol
  6. fbs - fasting blood sugar
  7. restecg - resting electrographic results
  8. thalach - maximum heart rate achieved
  9. exang - excercise induced engima
  10. oldpeak - ST Depression of ECG
  11. slope (not sure what it refers to in dataset)
  12. ca (not specified in dataset)
  13. thal (not specified in dataset)

Based on these features, there are 5 classes of participants, where I think 0 denotes control. The rest, 1, 2, 3 and 4 may denote different degrees of severity of CAD, although I'm not sure given that the dataset owner didn't really explain what different classes entailed. Anyhow, we want to show that multinomial logistic regression, a non-AI approach, is more suitable in this case, so we'll begin with some data exploration, and first fit a DNN, then a Multinomial LogReg model and explain why in this case of CAD detection, LogReg is more suitable.

InĀ [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### 1. Data loading and descriptive statistics
cad_orig = pd.read_csv("Coronary_artery.csv")
cad_data = pd.read_csv("data.csv") # The same as "cad_orig", but with the factors converted from
                                   # strings to numbers. E.g. male -> 0, female -> 1
cad_data.drop(cad_data.columns[[0]], axis=1, inplace=True) # drop the first column since it consists only of
                                                           # each participant's ID number
### 1.1 Data cleaning
### There exists participants whose measurements of ca and thal are marked as '?', 
### so they'll be removed to avoid some errors below in running the models
### We're ultimately left with 297 participants
index_ca = cad_data[cad_data['ca'] == '?'].index
cad_data.drop(index_ca, inplace=True)    
index_thal = cad_data[cad_data['thal'] == '?'].index
cad_data.drop(index_thal, inplace=True)  
cad_data.dropna(inplace = True)
print("A sample of the dataset shows")
cad_data.head(5)
A sample of the dataset shows
Out[6]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal class
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3.0 0.0 6.0 0
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 2.0 2.0 7.0 1
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5 3.0 0.0 3.0 0
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4 1.0 0.0 3.0 0
InĀ [7]:
print("Descriptive statistics show:")
cad_data.describe()
Descriptive statistics show:
Out[7]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope class
count 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000 297.000000
mean 54.542088 0.676768 3.158249 131.693603 247.350168 0.144781 0.996633 149.599327 0.326599 1.055556 1.602694 0.946128
std 9.049736 0.468500 0.964859 17.762806 51.997583 0.352474 0.994914 22.941562 0.469761 1.166123 0.618187 1.234551
min 29.000000 0.000000 1.000000 94.000000 126.000000 0.000000 0.000000 71.000000 0.000000 0.000000 1.000000 0.000000
25% 48.000000 0.000000 3.000000 120.000000 211.000000 0.000000 0.000000 133.000000 0.000000 0.000000 1.000000 0.000000
50% 56.000000 1.000000 3.000000 130.000000 243.000000 0.000000 1.000000 153.000000 0.000000 0.800000 2.000000 0.000000
75% 61.000000 1.000000 4.000000 140.000000 276.000000 0.000000 2.000000 166.000000 1.000000 1.600000 2.000000 2.000000
max 77.000000 1.000000 4.000000 200.000000 564.000000 1.000000 2.000000 202.000000 1.000000 6.200000 3.000000 4.000000

Perhaps one of the most important plots to illustrate the point made above is shown below. A non-AI approach is more suitable when input-features and output share some sort of linear relationship, which is captured by the righmost column of non-zero numbers which denote the correlation between each of the features and the output. Regardless of such, let's run a DNN firstšŸ˜‚:

InĀ [8]:
correlation_mat = cad_data.corr()
plt.figure(figsize=(20,20))
ax=sns.heatmap(correlation_mat, annot = True)
plt.show(ax)

# sns.pairplot(cad_data, hue="class", diag_kind="hist")
No description has been provided for this image
InĀ [9]:
m = nn.Softmax()
input_m = torch.randn(2, 3)
output = m(input_m)
output
<ipython-input-9-a3e6e306f5b0>:3: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  output = m(input_m)
Out[9]:
tensor([[0.2103, 0.5198, 0.2699],
        [0.2325, 0.2464, 0.5210]])
InĀ [10]:
### 1.2 Splitting data and conversion to tensor
X_train, X_test, y_train, y_test = train_test_split(cad_data.drop("class", axis=1), cad_data["class"],
                                                    train_size=0.7, test_size=0.3, random_state=0)
X_train_tensor, X_test_tensor, y_train_tensor, y_test_tensor = \
torch.from_numpy(np.vstack(X_train.values).astype(np.float)),\
torch.from_numpy(np.vstack(X_test.values).astype(np.float)),\
torch.from_numpy(np.vstack(y_train.values).astype(np.float)), \
torch.from_numpy(np.vstack(y_test.values).astype(np.float))

y_train_tensor = torch.flatten(y_train_tensor.long())
y_test_tensor = torch.flatten(y_test_tensor.long())

### 2. Building the PyTorch Deep Learning model
torch.manual_seed(1)
class Net(nn.Module):
    '''
    We will emulate Coursera's deep learning model using PyTorch. The original 
    code is attached below in the Appendix Section. Please don't look if you haven't
    completed the specialization.
    In the DL course, we were asked to implement a 5-layers model 
    with the following structure = input -> relu -> relu -> relu -> sigmoid (output)
    '''
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(13, 50)  # The input size must match the amount of features. 
        self.fc2 = nn.Linear(50, 25)  # The number of perceptrons was chosen arbitrarily 
        self.fc3 = nn.Linear(25, 10)  # to make it look complex.
        self.fc4 = nn.Linear(10, 5)    # Last layer must have 5 outputs for the 5 classes to be 
                                      # be predicted by softmax, where
                                      # the max prob would denote the network's prediction

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return F.log_softmax(x, dim = 1)
### 3. Initializing the model and defining hyperparameters
my_net = Net()
loss_criterion_dl = nn.CrossEntropyLoss()  # for multi-class classification
optimizer_dl = torch.optim.Adam(my_net.parameters(), lr=0.001)
### Running the deep learning model for cat recognition
iterations = 1000                                    
for epoch in range(iterations): 
    my_net.zero_grad()  
    # Forward pass
    output = my_net(X_train_tensor.float())  
    loss = loss_criterion_dl(output, y_train_tensor)  
    # Stochastic gradient descent for weight adjustments
    loss.backward()  
    optimizer_dl.step()  
print("The model was able to minimize the error up to: ", loss) 

### Model evaluation
train_output = my_net(X_train_tensor.float())  
train_accuracy = sum(nn.Softmax(1)(train_output).argmax(1) == y_train_tensor)/y_train_tensor.shape[0]
train_accuracy = train_accuracy.item()
test_output = my_net(X_test_tensor.float())  
test_accuracy = sum(nn.Softmax(1)(X_test_tensor).argmax(1) == y_test_tensor)/y_test_tensor.shape[0]
test_accuracy = test_accuracy.item()
print("The model can achieve a {0}% accuracy in the training set, while a {1}% accuracy in the testing set \n"
      .format(round(train_accuracy*100,2), round(test_accuracy*100, 2)))
### comment: About ~250 lines of code of deep learning from the Coursera specialization
### was simplified through ~50 lines of DNN using PyTorch. It was fun though.
The model was able to minimize the error up to:  tensor(0.4472, grad_fn=<NllLossBackward>)
The model can achieve a 84.54% accuracy in the training set, while a 5.56% accuracy in the testing set 

InĀ [11]:
### Logistic regression model running the CAD dataset
from sklearn.linear_model import LogisticRegression
np.random.seed(1)
### 1. Load data
X_train, X_test, y_train, y_test = train_test_split(cad_data.drop("class", axis=1), cad_data["class"],
                                                   train_size=0.7, test_size=0.3, random_state=0)
### 2. Initialize model 
lr_cad = LogisticRegression(multi_class='multinomial')
### 3. Fit model
lr_cad.fit(X_train, y_train)
### 4. Evaluate model
training_acc_lr = lr_cad.score(X_train, y_train)
test_acc_lr = lr_cad.score(X_test, y_test)
print("The multinomial LogReg model can achieve a {0}% accuracy in the training set, while a {1}% accuracy in the testing set \n"
      .format(round(training_acc_lr*100,2), round(test_acc_lr*100, 2)))
print("There are {1} model parameters, which are are {0}, where we can pick a random parameter and say that a unit increase \
of a feature is the expected unit change in log odds of the outcome."
      .format(lr_cad.coef_, lr_cad.coef_.shape[0] * lr_cad.coef_.shape[1] ))
The multinomial LogReg model can achieve a 60.87% accuracy in the training set, while a 58.89% accuracy in the testing set 

There are 65 model parameters, which are are [[ 2.39301738e-02 -1.03061946e-01 -2.47418316e-01 -2.35785151e-02
  -1.39543167e-03  1.09133566e-02 -4.03659032e-02  5.29101246e-02
  -1.23536211e-01 -2.22637393e-01 -7.71379136e-02 -2.05627430e-01
  -5.61176488e-01]
 [-1.50238146e-02  3.99877695e-02  3.52649069e-02 -3.65654646e-03
   6.80972195e-04 -3.16127153e-02 -1.40256724e-02  8.99341750e-03
   4.33511730e-02 -5.36037256e-02  1.25803942e-02 -2.94317585e-02
   8.64128143e-02]
 [ 3.25608501e-02  3.74101668e-02  8.29935055e-02 -1.05713470e-02
   1.45558810e-03 -2.44598191e-03 -7.69949869e-02 -1.59133508e-02
   4.76870717e-02  1.39895559e-01  2.71667135e-02  3.82662493e-02
   1.75143462e-01]
 [-2.77738643e-02  2.46901574e-02  9.86667184e-02  8.85909984e-03
   9.86808738e-05  3.04585898e-02  8.44704125e-02 -1.05306866e-02
   4.40649888e-02  9.05611643e-02  2.08036631e-02  1.43457832e-01
   2.10378988e-01]
 [-1.36933450e-02  9.73852209e-04  3.04931850e-02  2.89473086e-02
  -8.39809501e-04 -7.31324926e-03  4.69161500e-02 -3.54595046e-02
  -1.15670230e-02  4.57843950e-02  1.65871427e-02  5.33351070e-02
   8.92412243e-02]], where we can pick a random parameter and say that a unit increase of a feature is the expected unit change in log odds of the outcome.
C:\Users\86136\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Alas! The deep neural network can definately achieve a higher accuracy than the log reg model. It can go even higher if we change some hyperparameters like number of perceptrons per each layers, change optimizer, increase iterations (try it yourself!šŸ‘Œ). Of course, it also overfitted the training data due to the complexities of the neural network fit on such a small dataset where there was technically no need to add non-linearities, as we can extrapolate from the heatmat above. There are workarounds to this common issue, such as regularization, maybe having a larger dataset synthethized using maximum likelihood estimate, but I won't do it since that's also out of the focus of the notebook.

Anyway, back to the main topic, the idea is that logistic regression is better suited for addressing the task of CAD detection. Although it generalizes better than the DNN by having a larger test set accuracy, another important thing to consider is its interpretability. Here, the explainability of the model's parameters can be thought of as crucial, as implications of identifying and quantifying the relevant features that are associated with CAD can have public health policy implications. Moreover, once again we can also see that an interesting characteristic of LogReg is its deterministic feature. Although results can change depending on the solver or a few hyperparameters like regularization or number of iterations, it's not as probabilistic as its DNN counterpart. At the time of writing this, this has slowly morphed into a statistical problem, which further reinforces the notion of explainability of the LogReg model's parameters. We can also plausibly discuss the significance of the parameters, as well as make causal inferences from it. Below is a summary of statistics of the estimated coefficients, albeit distilling also isn't the notebook's intention (nor something within my expertisešŸ˜‚).

InĀ [12]:
import statsmodels.api as sm
X_train, X_test, y_train, y_test = train_test_split(cad_data.drop("class", axis=1), cad_data["class"],
                                                   train_size=0.7, test_size=0.3, random_state=0)
X_train = X_train.values.astype("float64")
y_train = y_train.values.astype("float64")
X_test = X_test.values.astype("float64")
y_test = y_test.values.astype("float64")
mn_logit_model = sm.MNLogit(y_train, X_train)
fit_mn_logit_model = mn_logit_model.fit()
fit_mn_logit_model.summary()
Optimization terminated successfully.
         Current function value: 0.796120
         Iterations 9
Out[12]:
MNLogit Regression Results
Dep. Variable: y No. Observations: 207
Model: MNLogit Df Residuals: 155
Method: MLE Df Model: 48
Date: Sat, 02 Apr 2022 Pseudo R-squ.: 0.3763
Time: 16:07:52 Log-Likelihood: -164.80
converged: True LL-Null: -264.22
Covariance Type: nonrobust LLR p-value: 2.912e-20
y=1 coef std err z P>|z| [0.025 0.975]
x1 -0.0446 0.027 -1.648 0.099 -0.098 0.008
x2 0.9794 0.603 1.625 0.104 -0.202 2.161
x3 0.2384 0.245 0.972 0.331 -0.242 0.719
x4 0.0274 0.013 2.131 0.033 0.002 0.053
x5 0.0029 0.005 0.592 0.554 -0.007 0.013
x6 -1.1738 0.758 -1.549 0.121 -2.659 0.311
x7 0.1337 0.241 0.555 0.579 -0.338 0.606
x8 -0.0440 0.011 -3.945 0.000 -0.066 -0.022
x9 1.1704 0.539 2.173 0.030 0.115 2.226
x10 0.0633 0.287 0.220 0.825 -0.499 0.626
x11 0.2703 0.426 0.635 0.525 -0.564 1.104
x12 0.7982 0.335 2.382 0.017 0.141 1.455
x13 0.2476 0.132 1.875 0.061 -0.011 0.506
y=2 coef std err z P>|z| [0.025 0.975]
x1 -0.0234 0.036 -0.657 0.511 -0.093 0.046
x2 0.6922 0.809 0.856 0.392 -0.894 2.278
x3 0.5337 0.393 1.359 0.174 -0.236 1.303
x4 0.0216 0.017 1.248 0.212 -0.012 0.056
x5 0.0016 0.006 0.260 0.795 -0.011 0.014
x6 -0.1144 0.844 -0.136 0.892 -1.768 1.539
x7 -0.2439 0.329 -0.741 0.459 -0.889 0.402
x8 -0.0626 0.014 -4.469 0.000 -0.090 -0.035
x9 1.2058 0.710 1.698 0.090 -0.186 2.598
x10 0.8793 0.347 2.536 0.011 0.200 1.559
x11 -0.2558 0.615 -0.416 0.677 -1.460 0.949
x12 1.0306 0.394 2.617 0.009 0.259 1.803
x13 0.4167 0.187 2.231 0.026 0.051 0.783
y=3 coef std err z P>|z| [0.025 0.975]
x1 -0.0809 0.040 -2.040 0.041 -0.159 -0.003
x2 0.0744 0.866 0.086 0.932 -1.623 1.771
x3 0.5498 0.435 1.263 0.207 -0.304 1.403
x4 0.0349 0.019 1.852 0.064 -0.002 0.072
x5 -0.0026 0.007 -0.366 0.715 -0.016 0.011
x6 0.3754 0.823 0.456 0.648 -1.237 1.988
x7 0.6569 0.351 1.873 0.061 -0.030 1.344
x8 -0.0625 0.015 -4.042 0.000 -0.093 -0.032
x9 1.1190 0.775 1.443 0.149 -0.401 2.639
x10 0.4566 0.352 1.296 0.195 -0.234 1.147
x11 0.1707 0.649 0.263 0.792 -1.101 1.442
x12 1.6337 0.402 4.067 0.000 0.846 2.421
x13 0.6072 0.213 2.853 0.004 0.190 1.024
y=4 coef std err z P>|z| [0.025 0.975]
x1 -0.1309 0.072 -1.815 0.069 -0.272 0.010
x2 0.2766 1.374 0.201 0.840 -2.417 2.970
x3 0.4453 0.832 0.535 0.593 -1.186 2.077
x4 0.0534 0.038 1.389 0.165 -0.022 0.129
x5 -0.0069 0.011 -0.608 0.543 -0.029 0.015
x6 -2.6535 2.237 -1.186 0.236 -7.038 1.731
x7 1.1721 0.691 1.697 0.090 -0.182 2.526
x8 -0.1056 0.028 -3.714 0.000 -0.161 -0.050
x9 -1.2295 1.285 -0.957 0.339 -3.748 1.289
x10 0.5350 0.557 0.960 0.337 -0.557 1.627
x11 1.7512 1.565 1.119 0.263 -1.316 4.819
x12 1.9947 0.617 3.235 0.001 0.786 3.203
x13 1.1011 0.444 2.482 0.013 0.231 1.971

A symbiosis rather than zero-sum choice¶

The above tips sound as if one solution paradigm is distinctive to the other paradigm to the point that depending on the qualities of the problem, we should go one way or the other. I guess this is a plausible point, however, I think the tips above only work for problems on very practical settings.

For instance, if we were doing a school project where we're given the freedom to decide how to solve a problem, but we're not specified what strategy to follow as long as we can achieve certain performance bottleneck, then some ideas shared above would hopefully help you on your decision. Another practical setting would be in the industrial environment. Assuming we're given a limited amount of resources to tackle certain problem, where meeting deadlines and achieving a performance benchmark are very important, then I guess we should first ask ourselves whether a simple heuristic could solve the task instead of experimenting with AI given their exhaustive demand in time for training and computational resources.

Nonetheless, for a non-practical setting, say for research purposes or simply for the joy of getting to know first-hand what are the advantages and limits of both AI and non-AI solution paradigms, then there are no brakes for trying out both, or mixing them upšŸ‘Œ.

Finally, when reflecting on this issue I do find that a more "powerful" (if I may use that word) solution could result from a mix of both paradigms. A purely AI solution may optimally solve a task and can generalize to different environments, but it may be time-consuming or resource-demanding in terms of dataset or computational hardware. A pure non-AI, heuristic solution may require too much understanding of a problem beforehand and can't be generalized. A mix of both strategies, such as what AlphaGo does by mixing reinforcement learning paradigm of environment interaction, along with a tree-search heuristic can be better at solving a task. Albeit the creativity that is required to extrapolate the advantages of both the AI and non-AI paradigm may also be challenging. I think this kind of mixing is more noticeable for problems of rising complexity, especially in understanding human nature or human intelligence... Food for thought āœŒļø...

Closing remarks: NP, AGI and quantum computing¶

I don't usually like to end with conclusions, because I don't agree with the notion that this is "the end". Rather, I want to finish with questions for discussion and reflection. So here are some that emerged when writing this notebook:

  • It seems as though heuristics are preferred for polynomially-solvable problems since they aren't as cumbersome and time-consuming as AI-based solutions. What about NP problems?
  • Will the above recommendations still be useful once we achieve Artificial General Intelligence?
  • NP problems have the quality of being non-polynomial due to the fundamental, discrete, binary nature of computation. If quantum computing, which proposes that a state can be evaluated to different values at the same time, can simplify NP problems to P, then will all problems be solved through a simple, AI-General heuristic?

Appendix Section¶

InĀ [13]:
### SARSA AGENT THAT PLAYS TOWER OF HANOI. TEMPLATE CODE ADAPTED FROM 
### https://github.com/clarisli/RL-Easy21
import numpy as np
import random
import math 
from enum import Enum
from itertools import permutations
from copy import copy

class Environment:
    '''
    Intuitive environment for Tower of Hanoi in which
    the RL agent will learn to play
    '''
    def __init__(self, discs):
        """
        The 0 rod means there isn't any rod. I added it in order to 
        take into account the possibility of anction popping from an empty list.
        """
        self.rods = {0:[], 1:[], 2:[]}
        self.state = State(self.rods)
        self.discs = discs

    def init_state(self):
        self.state.rods = {0:[], 1:[], 2:[]}
        self.state.rods[0] += [i for i in reversed(range(self.discs))] 
        self.state.is_win = False
        self.state.num_steps = 0
        return self.state

    def moveFrom(self, myRod):
        if len(myRod) == 0:
            return []
        else:
            return [myRod.pop()]

    def step(self, state, action):
        """
        KEY FUNCTION
        Arguments:
        state: gives access to current states of three rods
        action: tuple of size 2 (a, b), where a represents
        the rod from which a disk is moved, b the rod moved to.
        Returns:
        next_state
        reward: depending on whether there was an illegal move, 
        reached the goal, or otherwise made a normal step
        """
        next_state = copy(state)
        next_state.rods[action[1]] += self.moveFrom(next_state.rods[action[0]])
        #next_state.rods[action[1]].append(next_state.rods[action[0]].pop())  # append disc to rod 1 popped from rod 0
        next_state.num_steps += 1  # to check that agent learns to solve TOH in 2^N - 1 steps
        if self.is_illegal(next_state.rods[action[1]]):
            next_state.rods[action[0]].append(state.rods[action[1]].pop())  # reverse the illegal move
            return next_state, Reward.REVERSE
        elif self.is_victory(next_state):
            next_state.is_win = True
            return next_state, Reward.GOAL

        return next_state, Reward.STEP
        
    def is_illegal(self, rod):
        """
        For example, 0: [5, 4, 3, 2, 0, 1] is illegal 
        """
        if (len(rod) == 1) or (len(rod) == 0):
            return False
        elif rod[-1] > rod[-2]:
            return True 
    def is_victory(self, state):
        return state.rods[2] == [i for i in reversed(range(self.discs))]

class Reward (Enum):
    STEP = -1
    GOAL = 10
    REVERSE = -50

class State:
    def __init__(self, rods, is_win = False):
        self.rods = rods
        self.is_win = is_win
        self.num_steps = 0  # to check if agent completed in 2**discs - 1 steps

class Action:
    """
    A field of this class gets all the possible moves that 
    the rod can move from, to.
    
    """
    def __init__(self):
        self.all_actions = list(permutations([0, 1, 2], 2))
        # all actions are [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
        self.num_actions = len(self.all_actions)
        self.enum = list(enumerate(self.all_actions)) # for filling the Q(S,A) matrix
        
class RLAgent:
    def __init__(self, environment, actions, N0=100, discount_factor=1, _lambda=0.1):
        self.env = environment
        self.actions = actions
        self.N0 = N0
        self.discount_factor = discount_factor
        self._lambda = _lambda
        self.Q = self._init_tenor()
        self.returns_count = self._init_tenor()

    def _init_tenor(self):
        """
        initializing Q(S, A)
        O(6n^3)
        Added + 1 to avoid zero index error or out of bounds
        """
        return np.zeros((self.env.discs + 1, self.env.discs + 1, self.env.discs + 1, self.actions.num_actions))  
        # 2 * 3C2 = 12, 12 possible actions, e.g, move from rod 0 to rod 2 and viceversa

    def _get_alpha(self):
        return 0.5 # for learning rate or stepsize update 

    def _get_epsilon(self):
        return 0.1 # for exploratory moves

    
    def train(self, num_episodes=10):
        for e in range(num_episodes):
            E = self._init_tenor()  
            state = self.env.init_state()
            action = self._policy(state)
            action_idx = self.actions.all_actions.index(action)
            while not state.is_win:  # play the game greedily until reaching the end state blindly at the beginning, 
                                     # and slowly getting better at choosing which actions.
                self.returns_count[len(state.rods[0])][len(state.rods[1])][len(state.rods[2])][action_idx] += 1
                next_state, reward = self.env.step(state, action)
                print(reward.value)
                print(next_state.rods)
                next_action = self._policy(next_state) # there is no action anymore if you've won
                print(next_action)
                
                td_delta = self._td_delta(state, action_idx, next_state, next_action, reward)
                E[len(state.rods[0])][len(state.rods[1])][len(state.rods[2])][action_idx] += 1
                self.Q += self._get_alpha()*td_delta*E
                E = self.discount_factor*self._lambda*E
                state = next_state
                action = next_action
                
            if e % 10 == 0:
                print("\rEpisode {}/{}.".format(e, num_episodes), end="")


        return self.Q

    def _policy(self, state):
        """
        Returns:
        tuple(a, b)
        """
        if state.is_win: return None
        if random.random() < self._get_epsilon():
            return self._get_random_action()
        else:
            epsilon = self._get_epsilon()
            state_actions = np.ones(self.actions.num_actions, dtype=float) * epsilon/self.actions.num_actions
            greedy_action = np.argmax(self.Q[len(state.rods[0])][len(state.rods[1])][len(state.rods[2])])
            state_actions[greedy_action] += (1.0 - epsilon)
            action = np.random.choice(np.arange(len(state_actions)), p=state_actions)
            return self.actions.all_actions[action]

    def _get_random_action(self):
        """
        https://stackoverflow.com/questions/66727094/how-to-randomly-choose-a-tuple-from-a-list-of-tuples
        Returns:
        Tuple(a,b)
        """
        indeces = [i for i in range(self.actions.num_actions)]
        rand_action =  self.actions.all_actions[np.random.choice(indeces)]
        return rand_action
    
    
    def _td_delta(self, s, a, next_s, next_a, r):
        if next_s.is_win:
            return r.value - self.Q[len(s.rods[0])][len(s.rods[1])][len(s.rods[2])][a]
        else:
            next_action_idx = self.actions.all_actions.index(next_a)
            return (r.value + self.discount_factor*self.Q[len(next_s.rods[0])][len(next_s.rods[1])][len(next_s.rods[2])][next_action_idx]) - self.Q[len(s.rods[0])][len(s.rods[1])][len(s.rods[2])][a]
TowerOfHanoiEnv = Environment(3)  # when using >=4 disks there may be some errors such as the model being
                                  # stuck in a loop. There seems to be a bug. 
LegalActions = Action()
myRLAgent = RLAgent(TowerOfHanoiEnv, LegalActions)
#Q = myRLAgent.train(10)
#print(Q)  # prints the state-value matrix
InĀ [14]:
### DEEP LEARNING FOR SIMPLE CAT IMAGE RECOGNITION. TEMPLATE CODE FOUND AT DEEP LEARNING COURSERA SPECIALIZATION
### PLEASE DON'T LOOK AT THE CODE IF YOU HAVEN'T COMPLETED THE DEEP LEARNING SPECIALIZATION!!!!! AHHHHHHHHH

import numpy as np
import matplotlib.pyplot as plt
import h5py


def load_data():
    train_dataset = h5py.File('train_catvnoncat.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels

    test_dataset = h5py.File('test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels

    classes = np.array(test_dataset["list_classes"][:]) # the list of classes
    
    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
    
    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy
    
    Arguments:
    Z -- numpy array of any shape
    
    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    
    A = 1/(1+np.exp(-Z))
    cache = Z
    
    return A, cache

def relu(Z):
    """
    Implement the RELU function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- a python dictionary containing "A" ; stored for computing the backward pass efficiently
    """
    
    A = np.maximum(0,Z)
    
    assert(A.shape == Z.shape)
    
    cache = Z 
    return A, cache


def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single RELU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    
    Z = cache
    dZ = np.array(dA, copy=True) # just converting dz to a correct object.
    
    # When z <= 0, you should set dz to 0 as well. 
    dZ[Z <= 0] = 0
    
    assert (dZ.shape == Z.shape)
    
    return dZ

def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    
    Z = cache
    
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)
    
    assert (dZ.shape == Z.shape)
    
    return dZ

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    
    np.random.seed(1)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) / np.sqrt(layer_dims[l-1]) #*0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

        
    return parameters

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter 
    cache -- a python dictionary containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """
    
    Z = W.dot(A) + b
    
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    
    return Z, cache

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python dictionary containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    
    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
    
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_sigmoid_forward() (there is one, indexed L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A 
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)], activation = "relu")
        caches.append(cache)
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation = "sigmoid")
    caches.append(cache)
    
    assert(AL.shape == (1,X.shape[1]))
            
    return AL, caches

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    
    m = Y.shape[1]

    # Compute loss from aL and y.
    cost = (1./m) * (-np.dot(Y,np.log(AL).T) - np.dot(1-Y, np.log(1-AL).T))
    
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())
    
    return cost

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = 1./m * np.dot(dZ,A_prev.T)
    db = 1./m * np.sum(dZ, axis = 1, keepdims = True)
    dA_prev = np.dot(W.T,dZ)
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    
    return dA_prev, dW, db

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (there are (L-1) or them, indexes from 0 to L-2)
                the cache of linear_activation_forward() with "sigmoid" (there is one, index L-1)
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation = "sigmoid")
    
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, activation = "relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
        
    return parameters

def predict(X, y, parameters):
    """
    This function is used to predict the results of a  L-layer neural network.
    
    Arguments:
    X -- data set of examples you would like to label
    parameters -- parameters of the trained model
    
    Returns:
    p -- predictions for the given dataset X
    """
    
    m = X.shape[1]
    n = len(parameters) // 2 # number of layers in the neural network
    p = np.zeros((1,m))
    
    # Forward propagation
    probas, caches = L_model_forward(X, parameters)

    
    # convert probas to 0/1 predictions
    for i in range(0, probas.shape[1]):
        if probas[0,i] > 0.5:
            p[0,i] = 1
        else:
            p[0,i] = 0
    
    #print results
    #print ("predictions: " + str(p))
    #print ("true labels: " + str(y))
    print("Accuracy: "  + str(np.sum((p == y)/m)))
        
    return p

def print_mislabeled_images(classes, X, y, p):
    """
    Plots images where predictions and truth were different.
    X -- dataset
    y -- true labels
    p -- predictions
    """
    a = p + y
    mislabeled_indices = np.asarray(np.where(a == 1))
    plt.rcParams['figure.figsize'] = (40.0, 40.0) # set default size of plots
    num_images = len(mislabeled_indices[0])
    for i in range(num_images):
        index = mislabeled_indices[1][i]
        
        plt.subplot(2, num_images, i + 1)
        plt.imshow(X[:,index].reshape(64,64,3), interpolation='nearest')
        plt.axis('off')
        plt.title("Prediction: " + classes[int(p[0,index])].decode("utf-8") + " \n Class: " + classes[y[0,index]].decode("utf-8"))

def L_layer_model(X, Y, layers_dims, learning_rate = 0.009, num_iterations = 3000, print_cost=False):
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.

    Arguments:
    X -- data, numpy array of shape (number of examples, num_px * num_px * 3)
    Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    np.random.seed(1)
    costs = []                         # keep track of cost
    
    parameters = initialize_parameters_deep(layers_dims)
    
    for i in range (num_iterations):
        
        AL, caches = L_model_forward(X, parameters)
        
        cost = compute_cost (AL, Y)
        
        grads = L_model_backward(AL, Y, caches)
        
        parameters = update_parameters(parameters, grads, learning_rate)
        
        
    # Print the cost every 100 training example
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
layers_dims = [12288, 20, 7, 5, 1] 
#parameters = L_layer_model(train_x_orig_lr.T, train_y, layers_dims, num_iterations = 2500, print_cost = True)