Nobody expects the Spanish Inquisition!
I am very familiar with and have experience in the main data analysis workflow: data inspection, data cleaning/preprocessing, data transformation, data mining and data modelling. Here are some projects using Exploratory Data Analysis or Confirmatory Data Analysis that I have done/worked on:
Building a forum Page view time-series visualizer with Python and pandas

Page view time-series visualizer
For this project we visualize time series data using a line chart, bar chart, and box plots to understand the patterns in visits and identify yearly and monthly growth.
To work through this task, we will use pandas, matplotlib and seaborn to visualize a dataset containing the number of page views each day on the freeCodeCamp forum from 2016-05-09 to 2019-12-03.

Let's import the main modules and libraries we will be working with:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
1. Reading and cleaning the data
We read our CSV file:
df = pd.read_csv('files/forum-pageviews.csv',
header = 0,
names=['Date', 'Views'],
parse_dates=['Date'],
index_col='Date')
and filter out days when the page views were in the top 2.5% or bottom 2.5% of the dataset.
mask = (df['Views'] >= df['Views'].quantile(0.025)) & (df['Views'] <= df['Views'].quantile(0.975))
df_cln = df[mask]
| Date | Views |
|---|---|
| 2016-05-19 | 19736 |
| 2016-05-26 | 18060 |
| 2016-05-27 | 19997 |
| 2016-05-28 | 19044 |
| 2016-05-29 | 20325 |
2. Plotting results
2.1 Daily views line plot:
We draw a line chart representing daily page views on the freeCodeCamp Forum from May 2016 to December 2019:

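The plotting code is not reproduced in this excerpt; a minimal sketch of the line chart, assuming the cleaned DataFrame df_cln from above (figure size and colour are illustrative choices):
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(df_cln.index, df_cln['Views'], color='red')   # one point per day
ax.set_title('Daily freeCodeCamp Forum Page Views 5/2016-12/2019')
ax.set_xlabel('Date')
ax.set_ylabel('Page Views')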
2.2 Monthly bar plot:
We copy and modify the data for the monthly bar plot:
df_bar = df_cln.copy()
df_bar['Month'] = df_bar.index.month_name()
df_bar['Year'] = df_bar.index.year
df_bar['month_index'] = df_bar.index.month
df_bar = df_bar.sort_values(by='month_index')
And draw a bar plot using Seaborn that shows the average daily page views for each month, grouped by year:

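The bar plot call itself is not shown here; a minimal sketch, assuming the df_bar DataFrame prepared above (Seaborn aggregates Views by mean per Year/Month group by default):
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(data=df_bar, x='Year', y='Views', hue='Month', ci=None, ax=ax)  # mean daily views per month, grouped by year
ax.set_xlabel('Years')
ax.set_ylabel('Average Page Views')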
2.3 Year/Month-wise box plot:
We first sort and prepare data for box plots:
df_box = df_bar.copy()
df_box['Month'] = df_bar.index.month_name().str.slice(stop=3)
Then, we draw a box plot that shows how the values compare over time and how they are distributed within a given year:

or month:

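The plotting calls are not reproduced above; a minimal sketch of the two box plots, assuming df_box from above and an explicit month order:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))
sns.boxplot(data=df_box, x='Year', y='Views', ax=ax1)
ax1.set_title('Year-wise Box Plot (Trend)')
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sns.boxplot(data=df_box, x='Month', y='Views', order=month_order, ax=ax2)
ax2.set_title('Month-wise Box Plot (Seasonality)')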
Here is the full code as a function:
Sea level prediction using linear regression
In this project, we analyze a dataset of the global average sea level change since 1880 in order to predict the sea level change through the year 2050.

The data used here is the Global Average Absolute Sea Level Change, 1880-2014 dataset from the US Environmental Protection Agency, using data from CSIRO (2015) and NOAA (2015).
Linear regression for time series prediction
We use linear regression to get the slope and y-intercept of the line of best fit. We then create a scatter plot with the line through the year 2050 to predict the sea level rise in 2050 and observe the rate of rise since the year 2000.
To work through these topics, we will use pandas, matplotlib and linregress from scipy.stats.
Let's import the main modules and libraries we will be working with:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
1. Read data:
df = pd.read_csv('files/epa-sea-level.csv', na_values = ['?'])
| | Year | CSIRO Adjusted Sea Level | Lower Error Bound | Upper Error Bound | NOAA Adjusted Sea Level |
|---|---|---|---|---|---|
| 0 | 1880 | 0.000000 | -0.952756 | 0.952756 | NaN |
| 1 | 1881 | 0.220472 | -0.732283 | 1.173228 | NaN |
| 2 | 1882 | -0.440945 | -1.346457 | 0.464567 | NaN |
| 3 | 1883 | -0.232283 | -1.129921 | 0.665354 | NaN |
| 4 | 1884 | 0.590551 | -0.283465 | 1.464567 | NaN |
We can have a look at what the sampled data looks like:

As we see, the data is a bit scattered. We will use linear regression to find the model that best fits it. The goal is to find the linear model $y = ax + b$ with the minimum possible loss. For that, we use least squares regression, which calculates
the slope, given by the equation:
$\displaystyle{ a = \frac{ N \sum xy - \sum x \sum y}{N \sum x^2 - \left( \sum x \right)^2 } }$
the y-intercept, given by the equation:
$\displaystyle{ b = \frac{ \sum y - a \sum x}{N} }$
We use the linregress function from scipy.stats to get the slope and y-intercept of the line of best fit:
X = df['Year']
Y = df['CSIRO Adjusted Sea Level']
model = linregress(X, Y)
(a, b) = (model.slope, model.intercept)
And we calculate the coordinates of the line of best fit:
X_1 = list(range(1880, 2050))
Y_1 = [ a * x + b for x in X_1 ]
We plot the line of best fit over the scatter plot, extending it to our desired prediction year (i.e. x = 2050):

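The plotting code is not included in this excerpt; a minimal sketch of the scatter plot with the fitted line, assuming X, Y, X_1 and Y_1 from above:
fig, ax = plt.subplots(figsize=(12, 6))
ax.scatter(X, Y, alpha=0.5, label='observations')      # raw sea level measurements
ax.plot(X_1, Y_1, color='red', label='fit 1880-2050')  # extrapolated line of best fit
ax.set_xlabel('Year')
ax.set_ylabel('CSIRO Adjusted Sea Level (inches)')
ax.legend()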
2. Comparison of sea rise rate:
Now we want to plot a new line of best fit using only the data from the year 2000 onwards:
mask = (df['Year'] >= 2000)
X_rec = df[mask]['Year']
Y_rec = df[mask]['CSIRO Adjusted Sea Level']
model_rec = linregress(X_rec, Y_rec)
(a_rec, b_rec) = (model_rec.slope, model_rec.intercept)
and again extend it to 2050 to see what happens if the rate of rise continues as it has since 2000:
X_2 = list(range(2000, 2050))
Y_2 = [ a_rec * x + b_rec for x in X_2]
We can then observe the difference, and how the rate of sea level rise went up:

This suggests that the sea level in 2050 will be approximately:
print('{:.2f} inch ~ {:.2f} cm'.format(Y_2[-1], Y_2[-1] * 2.54) )
>>>
15.22 inch ~ 38.65 cm
Here is the full code as a function:
- I am very familiar with and have experience in the main machine learning workflow: I have worked with TensorFlow, Keras and PyTorch, and I am familiar with:
- Regression algorithms: Linear regression, Logistic regression, ...
- Computer vision
- Natural Language Processing
- Data Mining
- Deep learning: CNN, RNN and LSTM
Here are some Machine learning projects I did/worked on myself, during my studies and/or for learning/practice.
- COVID19 fake news detection
- Dog breed classification
- Book recommendation engine
- Cardiovascular disease classification
- Breast tumour diagnostics
COVID19 fake news detection using Natural Language Processing
11th December 2020

One of the shared tasks of CONSTRAINT 2021's First Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation was "Fighting an Infodemic: COVID-19 Fake News". Here is my modest participation in the CodaLab competition COVID19 Fake News Detection in English.
Building a tweet classification algorithm using Logistic Regression
The Fighting an Infodemic: COVID-19 Fake News dataset contains tweets along with their associated binary labels: real for verified COVID19-related information and fake for hoaxes/fake news.
To work through this task we follow the usual workflow:
- Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
- Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
- Model evaluation and comparison - we evaluate the models using some metrics and compare them to find the best one.
- Model experimentation - we try to improve the model through experimentation.
Let's import the main modules and libraries we will be working with:
## Data analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import collections
## Scikit Learn Models:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
## Feature extraction
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import stop_words
## text preprocessing
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
import string
import emoji
## Scikit Learn Model evaluators:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, roc_curve
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline, FeatureUnion
1. Data analysis:
We import our csv files:
train_df = pd.read_csv('files/Constraint_English_Train - Sheet1.csv', index_col = 'id',)
val_df = pd.read_csv('files/Constraint_English_Val - Sheet1.csv', index_col = 'id')
test_df = pd.read_csv('files/english_test_with_labels - Sheet1.csv', index_col = 'id')
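The rest of the analysis operates on a DataFrame called main_df, whose construction is not shown in this excerpt; a reasonable assumption is that it simply concatenates the training and validation splits, which gives the 8,560 tweets mentioned below:
main_df = pd.concat([train_df, val_df], ignore_index=True)  # assumed: training + validation sets combined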
The data consists of 8,560 tweets labelled real for real news and fake for fake ones:
| | tweet | label |
|---|---|---|
| 0 | Nancy Pelosi said, ???if you accept a check fr... | fake |
| 1 | RT @CDCHeart_Stroke: #HeartAttacks and #stroke... | real |
| 2 | The SARS-CoV-2 has been engineered by man edit... | fake |
| 3 | We expect to complete this by the end of the d... | real |
| 4 | RT @PIB_India: Cases per million population in... | real |

We will perform some statistical analysis on the dataset and compute descriptive statistics in order to guide our preprocessing. We count:
- the number of words in each tweet:
count_words = main_df['tweet'].apply(lambda x: len(re.findall(r'\w+', x)))
main_df['count_words'] = count_words

count_mentions = main_df['tweet'].apply(lambda x: len(re.findall(r'@\w+', x)))
main_df['count_mentions'] = count_mentions

count_hashtags = main_df['tweet'].apply(lambda x: len(re.findall(r'#\w+', x)))
main_df['count_hashtags'] = count_hashtags
Most of the tweets do not contain hashtags, and the few that do tend to be real; depending on whether we remove them or not during text preprocessing, this could make a difference during training.

count_caps = main_df['tweet'].apply(lambda x: len(re.findall(r'\b[A-Z]{2,}\b', x)))
main_df['count_caps'] = count_caps
Most of the tweets do not contain capitalized words, so lowercasing the text with lower() probably won't have much of an effect.

count_excl_quest_marks = main_df['tweet'].apply(lambda x: len(re.findall(r'!|\?', x)))
main_df['count_excl_quest_marks'] = count_excl_quest_marks
The fake tweets seem to use slightly more exclamation or question marks than the real ones.

count_urls = main_df['tweet'].apply(lambda x: len(re.findall(r'http.?://[^\s]+[\s]?', x)))
main_df['count_urls'] = count_urls
It seems that more real tweets contain URLs than fake ones, which makes sense; taking into account the tweets with verifiable URLs might help our classification.

- the number of emojis in the tweet (using the emoji module):
count_emojis = main_df['tweet'].apply(lambda x: emoji.demojize(x)).apply(lambda x: len(re.findall(r':[a-z_&]+:', x)))
main_df['count_emojis'] = count_emojis
We can see that almost none of the tweets use emojis, so leaving them in would not make much of a difference.

2. Data modelling
Here we will try to build and train a model to predict our label variable:
2.1. Text processing:
We first convert the class labels into binary numerical variables for convenience:
main_df['label'] = main_df.label.map({'real':1, 'fake':0})
We write a function that cleans and preprocesses our tweets in order to feed them to our model:
stopwords = stop_words.ENGLISH_STOP_WORDS
def clean(text):
    """ Takes a string of text and preprocesses it, by removing hashtags, punctuation and numbers, and tokenizing"""
    # text = re.sub(r'@\w+', '', text) # remove mentions
    text = text.replace("#", "")  # remove the hashtag character
    text = text.replace("RT", "") # remove the retweet mention
    text = text.lower()           # lowercase
    text = "".join([char for char in text if char not in string.punctuation and not char.isdigit()]) # remove punctuation and numbers
    text = " ".join([token for token in text.split() if token not in stopwords])                     # remove stopwords
    # porter = PorterStemmer()
    # text = " ".join([porter.stem(word) for word in text.split()]) # stemming (it actually works better without)
    return text
Here is an example of how the clean function works, applied to a tweet from one of my favourite Twitter accounts:
text = '@KyloR3n hello @hottopic i want a shirt that says I’m a different person but it should be the same color and basically the same shirt'
clean(text)
>>>
hello want shirt say i’m differ person color basic shirt
Finally, we regroup everything together:
clean_tweet = main_df['tweet'].apply(clean)
main_df['clean_tweet'] = clean_tweet
main_df = main_df.drop(['tweet'], axis = 1)
We can also look at the distribution of word frequencies across all the tweets:
vect = CountVectorizer()
bag = vect.fit_transform(clean_tweet)
word_freq = dict(zip(vect.get_feature_names(), np.asarray(bag.sum(axis=0)).ravel()))
word_count = collections.Counter(word_freq)
word_count_df = pd.DataFrame(word_count.most_common(20), columns = ['word', 'frequency'])

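The chart itself is not reproduced here; a minimal sketch of plotting the 20 most frequent words, assuming the word_count_df built above:
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(data=word_count_df, x='frequency', y='word', ax=ax)
ax.set_title('20 most frequent words in the cleaned tweets')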
Data modelling
Our task is a classification problem, where we are predicting a categorical variable: label. We have labelled data with fewer than 100,000 samples. Using Scikit-Learn's documentation, we opt for the following models:
- Log Reg : LogisticRegression
- MultinomialNB : Naive Bayes
- Linear SVC : Support Vector Classification
- XGBoost : XGBoost.
X_train, X_val, y_train, y_val = train_test_split(main_df.drop(['label'], axis=1), main_df.label, test_size=0.2, random_state=1729)
In order to see which model performs best, we proceed as follows: we instantiate each model in a dictionary
models = {'logReg': LogisticRegression(max_iter = 1000),
'MultiNB': MultinomialNB(),
'lSVC': LinearSVC(class_weight='balanced'),
'XGB': XGBClassifier(use_label_encoder=False)}
and we create a function that preprocesses and vectorizes the training and validation data, then fits and scores the models:
def mod_score(models, X_train, X_test, y_train, y_test, vect=CountVectorizer()):
    '''
    Takes a dictionary of models, fits each on a training set and returns its score on a test set
    '''
    # empty dict to collect the results
    results = {}
    # feature extraction:
    X_train_vect = vect.fit_transform(X_train.clean_tweet)
    # transform testing data (using training data's features)
    X_test_vect = vect.transform(X_test.clean_tweet)
    for name, model in models.items():
        model.fit(X_train_vect, y_train)
        results[name] = model.score(X_test_vect, y_test)
    return results
and we observe the results:
mod_score(models, X_train, X_val, y_train, y_val)

The linear SVC and LogisticRegression both achieved an accuracy of 92%! That sounds great.
4. Hyper-parameters tuning
This is arguably the most expensive and time-consuming step of the workflow. As the vectorizers and the classification models both have tunable parameters, and given the small size of our sample (only 8,560 tweets), we use GridSearchCV to select the best parameters with 5-fold cross-validation.
We use the following grid parameters:
## Vectorizers hyperparameters:
# BOW hyperparameters
vect_grid = {'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
'features__pipe__vect__min_df': (1,2) }
# TF-IDF hyperparaleters
tfidf_grid = {'features__pipe__vect__analyzer' : ('word', 'char'),
'features__pipe__vect__ngram_range': ((1, 3), (3, 7), (5, 7), (2, 2), (1, 2)),
'features__pipe__vect__binary' : (True, False) }
## Classifiers hyperparameters:
# logReg hyperparameters
logReg_grid = {'model__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
'model__solver': ['lbfgs', 'liblinear'],
'model__penalty' : ('l1', 'l2')}
# MultionomialNB hyperparameters
MultiNB_grid = {'model__alpha': (0.25, 0.5, 0.75)}
# SVC hyperparameters
lSVC_grid = {'model__C': (.05, .12, .25, .5, 1, 2, 4)}
We use two vectorizing methods: the bag-of-words vectorizer (vect=CountVectorizer()) and the Term Frequency-Inverse Document Frequency vectorizer (vect=TfidfVectorizer()).

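The pipeline used for the search is not shown in this excerpt; judging from the parameter names above (features__pipe__vect__... and model__...), the vectorizer sits inside a sub-pipeline named pipe within a FeatureUnion named features, followed by a classifier step named model. A minimal sketch of one such search (logistic regression with TF-IDF features; the exact pipeline layout is an assumption):
text_features = Pipeline([('vect', TfidfVectorizer())])
full_pipe = Pipeline([
    ('features', FeatureUnion([('pipe', text_features)])),  # assumed layout matching the parameter names
    ('model', LogisticRegression(max_iter=1000)),
])
param_grid = {**vect_grid, **tfidf_grid, **logReg_grid}
search = GridSearchCV(full_pipe, param_grid,
                      cv=5, scoring='accuracy', n_jobs=-1,
                      error_score=np.nan)  # skip invalid solver/penalty combinations
search.fit(main_df['clean_tweet'], main_df['label'])
print(search.best_score_, search.best_params_)
Running an analogous search for each vectorizer/model pair gives the cross-validated accuracies summarized below: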
| | accuracy_Cvect | accuracy_TF-IDF |
|---|---|---|
| logReg | 93.049% | 94.217% |
| lSVC | 93.166% | 94.042% |
| MultiNB | 93.283% | 91.998% |
| XGB | 88.084% | 87.266% |
We clearly see that the best results were achieved by the following parametrization:
tfidf_cstm_lorReg = TfidfVectorizer(binary=True, max_df=0.5, min_df=1, ngram_range=(1, 2))
logReg_tfidf_cstm = LogisticRegression(max_iter = 1000, C=1000, penalty='l2', solver='liblinear')
Model evaluation
Now that we have seen a slight improvement in our model's performance, we can evaluate it on the test dataset. We first prepare the test data
test_df = test_df.reset_index().drop('id', axis=1)
test_df['label'] = test_df.label.map({'real':1, 'fake':0})
test_df['clean_tweet'] = test_df['tweet'].apply(clean)
test_df = test_df.drop(['tweet'], axis = 1)
and we evaluate our model on the test data:
X_train, X_test, y_train, y_test = main_df.drop(['label'], axis=1), test_df.drop(['label'], axis=1), main_df.label, test_df.label
# Preprocess and Vectorize the data:
X_train_tfidfv = tfidf_cstm_lorReg.fit_transform(X_train.clean_tweet)
X_test_tfidfv = tfidf_cstm_lorReg.transform(X_test.clean_tweet)
# Instantiate best model with best hyperparameters
history = logReg_tfidf_cstm.fit(X_train_tfidfv, y_train)
# Make predictions
y_preds = history.predict(X_test_tfidfv)
3.1. Confusion matrix:
We can get a visual idea of our model's performance by looking at where the model nailed its predictions and where it failed:
conf_mtrx = confusion_matrix(y_test, y_preds)

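The heatmap call is not reproduced here; a minimal sketch of visualizing the confusion matrix with seaborn:
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(conf_mtrx, annot=True, fmt='d', cbar=False, ax=ax)
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')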
3.2. ROC Curve and AUC Scores:
We can also compare the TPR to the FPR, i.e. a tweet that is flagged real but is fake vs. a tweet that is flagged fake although it is real:

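The plotting code is not shown in this excerpt; a minimal sketch of computing and plotting the ROC curve and AUC score from the fitted logistic regression:
y_probs = history.predict_proba(X_test_tfidfv)[:, 1]   # probability of the 'real' class
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
auc = metrics.roc_auc_score(y_test, y_probs)
plt.plot(fpr, tpr, label='AUC = {:.3f}'.format(auc))
plt.plot([0, 1], [0, 1], linestyle='--', label='chance level')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()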
3.3. Classification report:
and finally, we have a look at the main evaluation metrics:
# Create a classification report using the classification_report function
report = classification_report(y_test, y_preds, output_dict=True)
pd.DataFrame(report).T
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.939303 | 0.925490 | 0.932346 | 1020.000000 |
| 1 | 0.933040 | 0.945536 | 0.939246 | 1120.000000 |
| accuracy | 0.935981 | 0.935981 | 0.935981 | 0.935981 |
| macro avg | 0.936172 | 0.935513 | 0.935796 | 2140.000000 |
| weighted avg | 0.936025 | 0.935981 | 0.935957 | 2140.000000 |
Conclusion
We have reached a nice result: an accuracy of more than 93% and a recall of around 94%. We can try our model out manually on some made-up tweets and see how it performs. Here are some random tweets I found following the hashtag #COVID:
tweet1 = """The charity @TEDSliverpool says money raised in memory of Claudia Walsh
means it'll be able to help more people with mental health issuesClaudia died with
#coronoavirus on her 25th birthday. She also volunteered at @WhitechapelLiv. More
than £21,000 has already been raised"""
tweet2 = """Canadian-made SARS-CoV-2 detector set to ship worldwide | $KNR $KNR.ca
#biocloud #smartbuildings #smartfactory #kontrol #smartcity #smartcities #ontario
#canada #safespaces #madeincanada #innovation #ai #iot #covid19 #covidー19 #sarscov2
#coronoavirus https://bit.ly/3mXWV2O"""
tweet3 = """#Coronoavirus : "Indigenous people are stepping up as volunteer firefighters,
but they are now doubly strained: Closed borders have shriveled their income from sustainably
harvested forest exports" @ACOFOP #Guatemala @alianzabosques https://apnews.com/e3cddd53e453a2"""
tweet4 = """Masih Alinejad says Government hid news of #coronoavirus for political purposes"""
and we can test our model and predict the veracity of the tweets:
# Clean and vectorize the tweets, then make predictions
tweets_vect = tfidf_cstm_lorReg.transform([clean(t) for t in [tweet1, tweet2, tweet3, tweet4]])
y_preds = ['real' if v == 1 else 'fake' for v in history.predict(tweets_vect).tolist()]
y_preds
>>>
['fake', 'real', 'fake', 'fake']
Further development
Maybe try deep learning? A CNN using word embeddings with LogReg's feature extraction... or roBert/daBert? Ernie2.0?
Dog breed classification using convolutional neural networks 🐕
25th August 2019
In this multi-class image classification project, we use deep learning, and particularly Convolutional Neural Networks, to build a model that will help us identify up to 120 different breeds of dogs 🐩
We will use the Kaggle dog breed identification competition dataset, which consists of a collection of 10,000+ labelled images of 120 different dog breeds.

Multi-class image classification:
In image classification problems, Convolutional Neural Networks (CNNs for short) are among the most popular and effective neural network models to consider, especially when combined with data augmentation.
Here is a short summary of what we will be doing:
- Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
- Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
- Model evaluation and comparison - we evaluate the models using some metrics and compare them to find the best one.
- Model experimentation - we try to improve the model through experimentation by starting with 1000 images to make sure it works.
To work through these topics, we will use TensorFlow and Keras:
Let's import the main modules and libraries we will be working with:
# Data analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress, norm
## Tensorflow:
import tensorflow as tf
import tensorflow_hub as hub
## keras model layers and callback:
import keras
from keras import models
from keras.models import Sequential, model_from_json
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout, GlobalAveragePooling2D, BatchNormalization, InputLayer, Lambda, Input
from keras.callbacks import TensorBoard, ModelCheckpoint
from keras.optimizers import Adam, RMSprop
from keras.models import Model, Input
## Keras models
from keras import applications
from keras.applications.inception_v3 import InceptionV3
from keras.applications.nasnet import NASNetLarge
from keras.applications.inception_resnet_v2 import InceptionResNetV2
from keras.applications.xception import Xception
## Scikit Learn Model evaluators:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
## Loading images
import os
from urllib import request
from io import BytesIO
## Preprocessing:
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
1. Data analysis:
We start by reading our CSV file using pandas:
train_df = pd.read_csv("files/dog_breed/files/labels.csv")
train_df.describe()
train_df.head()
We have in total 10,222 image IDs covering 120 breeds. Our DataFrame looks like this: the rows represent image IDs, each labelled with a dog breed.
We check how these labels are distributed: we have an average of about 82 pictures per breed:

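The distribution plot is not reproduced here; a minimal sketch of checking how many images we have per breed:
breed_counts = train_df['breed'].value_counts()
fig, ax = plt.subplots(figsize=(14, 5))
breed_counts.plot(kind='bar', ax=ax)
ax.axhline(breed_counts.mean(), color='red', linestyle='--', label='mean images per breed')
ax.legend()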
1.1. Data pre-processing: Labels
We first do some one-hot encoding: we encode our labels into a boolean format
labels = np.array(train_df['breed'])
breeds = np.unique(labels)
enc_labels = [ label == np.array(breeds) for label in labels ]
and we use Scikit-Learn's train_test_split to create a validation set. We will first train our model on a subset of the data to see how it performs.
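The list of image paths img_name used below is not defined in this excerpt; a plausible construction from the id column would be (the directory is an assumption):
train_dir = 'files/dog_breed/files/train/'                      # assumed location of the training images
img_name = [train_dir + fname + '.jpg' for fname in train_df['id']]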
# dataset variables
X = img_name
y = enc_labels
# set a size parameter
SUBSET_SIZE = 1000
# set random seed
np.random.seed(1729)
# create validation set for the subset_data:
X_train, X_val, y_train, y_val = train_test_split(X[:SUBSET_SIZE],
y[:SUBSET_SIZE],
test_size = 0.2)
1.2. Data pre-processing: Images
We want to create a function that processes our images. TensorFlow's documentation suggests converting the data into batches of (image, label) tensors:
# Choose a size
IMG_SIZE = 224
def img_process(img_path):
    '''
    Processes and converts an image path into a tensor
    '''
    # read image path
    img = tf.io.read_file(img_path)
    # Tensorize the image
    img = tf.image.decode_jpeg(img, channels = 3)
    # Normalize the tensor
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image
    img = tf.image.resize(img, size = [IMG_SIZE, IMG_SIZE])
    return img
def img_label_get(img_path, label):
    '''
    Processes and converts an image path into a tensor and returns a tuple (image, label)
    '''
    img = img_process(img_path)
    return img, label
and we create a function that will create small batches:
# Define a batch size
BATCH_SIZE = 32
def Batch(img, label=None, batch_size = BATCH_SIZE, valid_data = False, test_data = False):
    '''
    Creates batches of data out of (image, label) pairs.
    Shuffles the data only if it's training data.
    Accepts test data as input (no labels).
    '''
    # valid dataset, no need to shuffle
    if valid_data == True:
        print("Creating validation data batches...")
        data = tf.data.Dataset.from_tensor_slices((tf.constant(img),   # filepaths
                                                   tf.constant(label)) # labels
                                                  ).cache().prefetch(tf.data.AUTOTUNE)
        data_batch = data.map(img_label_get).batch(BATCH_SIZE)
        print('Done!')
        return data_batch
    # test dataset, no labels
    if test_data == True:
        print("Creating test data batches...")
        data = tf.data.Dataset.from_tensor_slices((tf.constant(img))) # no labels
        data_batch = data.map(img_process).batch(BATCH_SIZE)
        print('Done!')
        return data_batch
    else:
        # Tensorize and shuffle training dataset
        print("Creating training data batches...")
        data = tf.data.Dataset.from_tensor_slices((tf.constant(img),
                                                   tf.constant(label))
                                                  ).cache().shuffle(buffer_size = len(img)).prefetch(tf.data.AUTOTUNE)
        data = data.map(img_label_get)
        data_batch = data.batch(BATCH_SIZE)
        print('Done!')
        return data_batch
Now, we batch our data sets:
train_data = Batch(X_train, y_train)
val_data = Batch(X_val, y_val, valid_data=True)
Our batches now have the following attributes:
(TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
 TensorSpec(shape=(None, 120), dtype=tf.bool, name=None))
2. Data modelling:
We will test our dataset and our modelling approach using transfer learning: we will use a well-known pretrained image classification model, MobileNet V2 from TensorFlow Hub.
2.1. Instantiate a pre-trained model:
# Setup input shape
INPUT_SHAPE = [None, 224, 224, 3] # batch, height, width, colour channels
# Setup output shape
OUTPUT_SHAPE = len(breeds) # number of unique labels
#pre trained model from tensorflow_hub:
MODEL_URL = 'https://tfhub.dev/google/imagenet/mobilenet_v2_140_224/classification/4'
We create a function that will build a keras model from our inputs:
def build_model(input_shape=INPUT_SHAPE, output_shape=OUTPUT_SHAPE, model_url=MODEL_URL):
    '''
    Creates, builds and returns a keras model from an input shape, output shape and model url
    '''
    print("Building model with:", model_url.split('/')[5])
    # Setup the layers
    model = Sequential([
        hub.KerasLayer(model_url),                  # input layer
        Dense(output_shape, activation="softmax")]) # output layer
    # Compile the model
    opt = Adam(lr=0.001) # Gradient descent optimizer
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    # Build the model
    model.build(input_shape) # Let the model know what kind of inputs it'll be getting
    return model
And we build our model:
Building model with: mobilenet_v2_140_224
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
keras_layer (KerasLayer) (None, 1001) 6158505
_________________________________________________________________
dense (Dense) (None, 120) 120240
=================================================================
Total params: 6,278,745
Trainable params: 120,240
Non-trainable params: 6,158,505
2.2. Train its upper dense layer:
NUM_EPOCHS = 100
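The fit call itself is not reproduced in this excerpt; a minimal sketch of training the top dense layer on the subset batches, assuming an EarlyStopping callback monitoring validation accuracy (the patience value is an illustrative choice):
model = build_model()  # the model built in section 2.1
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)  # assumed callback
history = model.fit(train_data,
                    epochs=NUM_EPOCHS,
                    validation_data=val_data,
                    callbacks=[early_stopping])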
Training on this subset gives a first impression: the model is clearly overfitting, as its performance on the training set far exceeds its results on the validation set:
...
Epoch 7/100
25/25 [==============================] - 5s 216ms/step - loss: 0.0757 - accuracy: 1.0000 - val_loss: 1.2328 - val_accuracy: 0.6750
Epoch 8/100
25/25 [==============================] - 5s 215ms/step - loss: 0.0581 - accuracy: 1.0000 - val_loss: 1.2111 - val_accuracy: 0.6800
...
Epoch 13/100
25/25 [==============================] - 5s 216ms/step - loss: 0.0266 - accuracy: 1.0000 - val_loss: 1.1501 - val_accuracy: 0.6950
Epoch 14/100
25/25 [==============================] - 5s 215ms/step - loss: 0.0235 - accuracy: 1.0000 - val_loss: 1.1402 - val_accuracy: 0.6950
Now, we train the model on the whole dataset (preferably using a GPU) and evaluate it:
model.evaluate(val_data)
We see that we reach an accuracy of 82% with a loss of 0.64
64/64 [==============================] - 793s 12s/step - loss: 0.6375 - accuracy: 0.8152
It's time to make some predictions
predictions = model.predict(val_data, verbose=1)
and see how our model is doing:
We can start printing out a classification report:
report = classification_report(np.argmax(predictions,axis=1), np.argmax(yf_val, axis=1), target_names=breeds, output_dict=True)
report_df = pd.DataFrame(report).T
report_df
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| affenpinscher | 0.777778 | 0.875000 | 0.823529 | 16.000000 |
| afghan_hound | 0.931034 | 0.964286 | 0.947368 | 28.000000 |
| african_hunting_dog | 1.000000 | 1.000000 | 1.000000 | 16.000000 |
| airedale | 0.964286 | 0.900000 | 0.931034 | 30.000000 |
| american_staffordshire_terrier | 0.692308 | 0.450000 | 0.545455 | 20.000000 |
| ... | ... | ... | ... | ... |
| wire-haired_fox_terrier | 0.857143 | 0.600000 | 0.705882 | 20.000000 |
| yorkshire_terrier | 0.750000 | 0.631579 | 0.685714 | 19.000000 |
| accuracy | 0.815159 | 0.815159 | 0.815159 | 0.815159 |
| macro avg | 0.804588 | 0.810483 | 0.800445 | 2045.000000 |
| weighted avg | 0.829992 | 0.815159 | 0.815969 | 2045.000000 |
Our model feels more confident about guessing the following breeds:
report_df['precision'].sort_values(ascending = False)[:10]
>>>
japanese_spaniel 1.0
soft-coated_wheaten_terrier 1.0
african_hunting_dog 1.0
curly-coated_retriever 1.0
papillon 1.0
west_highland_white_terrier 1.0
clumber 1.0
chow 1.0
saint_bernard 1.0
cardigan 1.0
Name: precision, dtype: float64
and way less about guessing the following ones:
report_df['precision'].sort_values(ascending = True)[:10]
>>>
eskimo_dog 0.181818
miniature_poodle 0.357143
walker_hound 0.400000
border_collie 0.400000
boxer 0.466667
staffordshire_bullterrier 0.466667
chihuahua 0.523810
lakeland_terrier 0.555556
flat-coated_retriever 0.588235
malamute 0.600000
Name: precision, dtype: float64
It does not seem too bad! However, as we saw in the subset training, our model may be overfitting, so these results might be misleading. Let's try to improve our model and use a more accurate evaluation:
3. Model tuning
3.1. Data augmentation:
One way to avoid overfitting is data augmentation: we take the training images and manipulate them (crop, resize) or distort them (flip, rotate) to create more generalized patterns for the model to learn from.
NOTE: Data augmentation is an augmentation in the sense that it increases the generalizability of the model, NOT an actual addition to the dataset, i.e. it does not increase the size of the dataset per se.
We will use the ImageDataGenerator class from keras.preprocessing.image, which generates batches of tensor image data with real-time data augmentation:
datagen = ImageDataGenerator(rescale=1./255,
                             validation_split=0.2,
                             rotation_range=45,      # randomly rotate pictures
                             width_shift_range=0.2,  # randomly translate pictures horizontally
                             height_shift_range=0.2, # randomly translate pictures vertically
                             shear_range=0.2,        # randomly apply shearing transformations
                             zoom_range=0.25,        # randomly zoom inside pictures
                             horizontal_flip=True,   # randomly flip half the images horizontally
                             fill_mode='nearest')
This is what a batch of augmented images looks like:

3.2. Dropout:
Dropout is another technique that helps prevent overfitting by regularizing deep neural networks: https://jmlr.org/papers/v15/srivastava14a.html
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
keras_layer (KerasLayer) (None, 1001) 6158505
_________________________________________________________________
dense (Dense) (None, 256) 256512
_________________________________________________________________
dropout (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 120) 30840
=================================================================
Total params: 6,445,857
Trainable params: 287,352
Non-trainable params: 6,158,505
3.3. K-fold cross validation:
We now evaluate our model with 5-fold cross-validation. This ensures that the estimation and evaluation of our model is as unbiased as possible. Here is a big pipeline-like block of code where we put everything together:
VALIDATION_ACCURACY = []
VALIDATION_LOSS = []
fold = 1
kf = KFold(n_splits = 5, shuffle = True, random_state = 1729)
# Set up manual K-fold cross validation:
for train_index, test_index in kf.split(train_df):
    trainData = train_df.iloc[train_index]
    testData = train_df.iloc[test_index]
    print('Initializing Kfold %s' % str(fold))
    print('Train shape:', trainData.shape)
    print('Test shape:', testData.shape)
    epochs = 100
    # prepare training and validation generators
    train_datagen = ImageDataGenerator(rescale=1./255,
                                       validation_split=0.2,
                                       rotation_range=45,      # randomly rotate pictures
                                       width_shift_range=0.2,  # randomly translate pictures horizontally
                                       height_shift_range=0.2, # randomly translate pictures vertically
                                       shear_range=0.2,        # randomly apply shearing transformations
                                       zoom_range=0.25,        # randomly zoom inside pictures
                                       horizontal_flip=True,   # randomly flip half the images horizontally
                                       fill_mode='nearest')
    valid_datagen = ImageDataGenerator(rescale=1. / 255)
    # Create a flow from dataframe for training data
    train_gen = train_datagen.flow_from_dataframe(dataframe=trainData,
                                                  directory=train_dir,
                                                  x_col="id",
                                                  y_col="breed",
                                                  subset="training",
                                                  shuffle=True,
                                                  seed=1729,
                                                  target_size=(IMG_SIZE, IMG_SIZE),
                                                  batch_size=BATCH_SIZE,
                                                  class_mode="categorical")
    valid_gen = train_datagen.flow_from_dataframe(dataframe=trainData,
                                                  directory=train_dir,
                                                  x_col="id",
                                                  y_col="breed",
                                                  subset="validation",
                                                  target_size=(IMG_SIZE, IMG_SIZE),
                                                  batch_size=BATCH_SIZE,
                                                  class_mode="categorical")
    # Create the model
    model_full_aug = Sequential([
        hub.KerasLayer(MODEL_URL),                  # input layer
        Dense(256, activation='relu'),
        Dropout(0.5),
        Dense(OUTPUT_SHAPE, activation="softmax")]) # output layer
    # Compile the model
    opt = Adam(lr=0.001) # Gradient descent optimizer
    model_full_aug.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    # Build the model
    model_full_aug.build(INPUT_SHAPE)
    # Setting callbacks
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=8)
    checkpoint = tf.keras.callbacks.ModelCheckpoint(model_dir+get_model_name(fold),
                                                    monitor='val_accuracy',
                                                    verbose=1,
                                                    save_best_only=True,
                                                    mode='max')
    # Train the model
    history_aug = model_full_aug.fit(x = train_gen,
                                     steps_per_epoch = train_gen.n // train_gen.batch_size,
                                     epochs = NUM_EPOCHS,
                                     validation_data = valid_gen,
                                     validation_steps = valid_gen.n // valid_gen.batch_size,
                                     callbacks = [early_stopping, checkpoint])
    # Load the best checkpointed weights to evaluate this fold's model
    model_full_aug.load_weights(model_dir+"model_"+str(fold)+"fold.h5")
    results = model_full_aug.evaluate(valid_gen)
    results = dict(zip(model_full_aug.metrics_names, results))
    VALIDATION_ACCURACY.append(results['accuracy'])
    VALIDATION_LOSS.append(results['loss'])
    tf.keras.backend.clear_session()
    fold += 1
And we can see the results:

Here we did not actually get any significant improvement in accuracy. However, the model is no longer overfitting, as the gap between the validation and training curves is smaller, which is still an improvement.
4. Feature extraction
At last, we will use feature extraction with models pretrained on ImageNet in order to boost our transfer learning process. This might give us a significant improvement in classification. We will explore the following models: InceptionResNetV2, Xception, InceptionV3 and NASNetLarge.
We first modify our function in order to quickly and efficiently preprocess our data. We choose the target size (331, 331, 3) because it is the one dictated by NASNetLarge when it is loaded with weights='imagenet':
TARGET_SIZE = (331,331,3)
def images_to_array(directory, dataframe, target_size = TARGET_SIZE):
    images = np.zeros([len(dataframe), target_size[0], target_size[1], target_size[2]], dtype=np.uint8)
    img = ''
    for ix, image_name in enumerate(dataframe['id'].values):
        img_dir = os.path.join(directory, image_name + '.jpg')
        img = load_img(img_dir, target_size = target_size)
        images[ix] = img_to_array(img)
    del img
    label_dict = dict(enumerate(dataframe['breed'].unique()))
    return images, label_dict
We write a function that runs the convolutional base over our data and records its output, so we can later feed it to a fairly simple classifier with a dense layer and some dropout. This is less costly, as we only need to run the convolutional base once per image. However, we will not be able to use data augmentation here, as that would be far more costly (we would need to add the Dense layers on top of the convolutional base, use ImageDataGenerator, and run everything on the input data).
def get_features(model, preprocess_input, images, target_size = TARGET_SIZE):
    conv_base = model(input_shape = target_size, include_top=False, weights='imagenet', pooling = 'avg')
    cnn = Sequential([
        InputLayer(input_shape = target_size), # input layer
        Lambda(preprocess_input),              # preprocessing layer
        conv_base ])                           # base model
    features = cnn.predict(images)
    print('feature-map shape: {}'.format(features.shape))
    return features
We start our process of feature extraction:
- InceptionResNetV2:
InceptionResNetV2(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
from keras.applications.inception_resnet_v2 import preprocess_input
resnet_preprocess = preprocess_input
resnet_features = get_features(InceptionResNetV2, resnet_preprocess, train_images)
- Xception:
Xception(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
from keras.applications.xception import preprocess_input
xception_preprocess = preprocess_input
xception_features = get_features(Xception, xception_preprocess, train_images)
- InceptionV3:
InceptionV3(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
from keras.applications.inception_v3 import preprocess_input
inception_preprocess = preprocess_input
inception_features = get_features(InceptionV3, inception_preprocess, train_images)
- NASNetLarge:
NASNetLarge(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
from keras.applications.nasnet import preprocess_input
nasnet_preprocessor = preprocess_input
nasnet_features = get_features(NASNetLarge, nasnet_preprocessor, train_images)
Finally, we stack up all our features and flatten them so we can feed them to our new CNN:
final_features = np.concatenate([resnet_features, xception_features, inception_features, nasnet_features], axis = 1)
print('final features shape: {}'.format(final_features.shape))
>>>
final features shape: (10222, 9664)
Now, we define our densely connected classifier:
FEATURES_SHAPE = final_features.shape[1]
cnn_model = Sequential([
    InputLayer(input_shape = (FEATURES_SHAPE, )),       # input layer
    BatchNormalization(),
    Dense(4096, activation = 'relu'),
    Dropout(0.5),
    Dense(1024, activation='relu'),
    Dropout(0.7),
    Dense(len(class_to_index), activation="softmax")])  # output layer
# Compile the model
opt = Adam(lr=0.001) # Gradient descent optimizer
#opt = RMSprop(lr=0.001, rho=0.9)
cnn_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
cnn_model.summary()
>>>
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
batch_normalization_301 (Bat (None, 9664) 38656
_________________________________________________________________
dense (Dense) (None, 4096) 39587840
_________________________________________________________________
dropout (Dropout) (None, 4096) 0
_________________________________________________________________
dense_1 (Dense) (None, 1024) 4195328
_________________________________________________________________
dropout_1 (Dropout) (None, 1024) 0
_________________________________________________________________
dense_2 (Dense) (None, 120) 123000
=================================================================
Total params: 43,944,824
Trainable params: 43,925,496
Non-trainable params: 19,328
and train it on the data and labels that we just collected:
# Train the model
history1 = cnn_model.fit(final_features,
labels,
steps_per_epoch = len(final_features) // len(labels),
epochs = NUM_EPOCHS,
validation_split = 0.2,
callbacks = [early_stopping])
>>>
...
Epoch 4/100
1/1 [==============================] - 1s 806ms/step - loss: 0.3666 - accuracy: 0.9083 - val_loss: 2.5747 - val_accuracy: 0.9267
Epoch 5/100
1/1 [==============================] - 1s 827ms/step - loss: 0.3241 - accuracy: 0.9175 - val_loss: 2.3619 - val_accuracy: 0.9286
...
Epoch 40/100
1/1 [==============================] - 1s 853ms/step - loss: 0.0929 - accuracy: 0.9716 - val_loss: 0.5167 - val_accuracy: 0.9364
Epoch 41/100
1/1 [==============================] - 1s 801ms/step - loss: 0.0924 - accuracy: 0.9702 - val_loss: 0.5054 - val_accuracy: 0.9389
We can finally evaluate our model and see how it is doing. Here we compare two gradient descent optimizers: Adam and RMSprop:

OPTIONAL: As our model is a bit space-consuming, we can choose to save its architecture (using the .to_json method) and its weights (using the .save_weights method):
# serialize model to JSON
model_json = cnn_model.to_json()
with open("model.json", "w") as json_file:
json_file.write(model_json)
# serialize weights to HDF5
cnn_model.save_weights("cnn_model_weights.h5")
print("Saved model to disk")
And to load it, one simply uses:
# load json and create model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
model.load_weights("cnn_model_weights.h5")
print("Loaded model from disk")
Conclusion
We reached a pretty nice accuracy (>95%), so as a nice application, we would like to use our model to predict dog breeds from random images found on the net. We will write some code to make that possible:
TARGET_SIZE = (331,331,3)
from PIL import Image
def dog_vision(model, source, local=False, target_size = TARGET_SIZE):
    img = ''
    if local:
        # Get images from a local directory
        image_paths = [source + fname for fname in os.listdir(source)]
        images = np.zeros([len(image_paths), target_size[0], target_size[1], target_size[2]], dtype=np.uint8)
        for i, path in enumerate(image_paths):
            img = load_img(path, target_size = target_size)
            images[i] = img_to_array(img)
    else:
        # Download a single image from a URL
        images = np.zeros([1, target_size[0], target_size[1], target_size[2]], dtype=np.uint8)
        req = request.urlopen(source).read()
        img = Image.open(BytesIO(req)).resize((target_size[0], target_size[1]))
        images[0] = img_to_array(img)
    del img
    # run feature extraction
    resnet_features = get_features(InceptionResNetV2, resnet_preprocess, images)
    xception_features = get_features(Xception, xception_preprocess, images)
    inception_features = get_features(InceptionV3, inception_preprocess, images)
    nasnet_features = get_features(NASNetLarge, nasnet_preprocessor, images)
    final_features = np.concatenate([resnet_features, xception_features, inception_features, nasnet_features], axis = 1)
    del resnet_features, xception_features, inception_features, nasnet_features
    # Make predictions
    predictions = model.predict(final_features)
    labels = dict(enumerate(train_df['breed'].unique()))
    # Check custom image predictions
    plt.figure(figsize=(10, 15))
    for i, image in enumerate(images):
        plt.subplot(4, 3, i+1)
        plt.xticks([])
        plt.yticks([])
        plt.title(labels[np.argmax(predictions[i])])
        plt.imshow(image);
    tf.keras.backend.clear_session()
and you can try it out too!
local_dir = "files/dogs/"
URL = "https://images.theconversation.com/files/342682/original/file-20200618-41213-iu7wbs.jpg?ixlib=rb-1.1.0&rect=9%2C4%2C3268%2C2177&q=45&auto=format&w=926&fit=clip"
# load model
from keras.models import load_model
cnn = load_model("models/cnn_dogvision.h5")
dog_vision(cnn, local_dir, local=True)
#dog_vision(cnn, URL, local=False)
Here is a test with my two dearest babies:

If you are interested in the code I used, you can find everything in:
- my GitHub repository: link
- my Colab notebook:
Building a book recommendation engine using Collaborative Filtering
20th April 2020
We use collaborative filtering to make recommendations based on other users' similar preferences. We will use a machine learning algorithm known as K-Nearest Neighbors (NearestNeighbors in scikit-learn) to produce book recommendations that are similar (based on users' ratings) to a given book.

We use the Book-Crossings dataset, which contains 1.1 million ratings (on a scale of 1-10) of 270,000 books by 90,000 users.
Collaborative Filtering with KNN
We will create a book recommendation algorithm using Scikit learn's K-Nearest Neighbors. Here is a short summary of what we will be doing:
- Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
- Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
- Model evaluation and comparison - we evaluate the models using some metrics and compare them to find the best one.
- Model experimentation - we try to improve the model through experimentation.
Let's import the main modules and libraries we will be working with:
## Data analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress, norm
from scipy.sparse import csr_matrix
## Scikit Learn Models:
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import TruncatedSVD
## String metrics
import Levenshtein as lev
1. Data analysis:
We read our files using pandas:
df_books = pd.read_csv('BX-Books.csv',
encoding = "ISO-8859-1",
sep=";",
header=0,
names=['isbn', 'title', 'author'],
usecols=['isbn', 'title', 'author'],
dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})
df_ratings = pd.read_csv('BX-Book-Ratings.csv',
encoding = "ISO-8859-1",
sep=";",
header=0,
names=['user', 'isbn', 'rating'],
usecols=['user', 'isbn', 'rating'],
dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})
df_users = pd.read_csv('BX-Users.csv',
encoding = "ISO-8859-1",
sep=";",
header=0,
names=['user', 'location', 'age'],
usecols=['user', 'location', 'age'])
As our recommendation engine is based on users' ratings, we can start by gathering some information about the users, for example by looking at the age distribution. We observe that most of the active users are between 20 and 30.

and that most of the ratings are 0:

We can see the TOP 5 rated books:
df_ratings_books = pd.merge(df_ratings, df_books, on='isbn')
df_ratings_books.sort_values('rating', ascending=False).head(5)
| | user | isbn | rating | title | author |
|---|---|---|---|---|---|
| 68283 | 273906 | 3596200261 | 10.0 | Fischer Taschenbücher, Bd.26, Schöne neu... | Aldous Huxley |
| 108227 | 211152 | 0743203631 | 10.0 | Gap Creek: The Story Of A Marriage | Robert Morgan |
| 619156 | 240144 | 0345276469 | 10.0 | Agatha Christie: An Autobiography | Agatha Christie |
| 925376 | 144194 | 0446517283 | 10.0 | The Dragon King (Crimson Shadow/R.a. Salvatore) | R. A. Salvatore |
| 709904 | 200300 | 0525946500 | 10.0 | Charleston | John Jakes |
The ratings are very unevenly distributed: some books are rated a great deal, while others have almost no ratings.

One also sees that some users tend to leave a lot of reviews, while others leave very few. Let's see the 5 most rated books:
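The DataFrame num_book_ratings used below (the number of ratings per book) is not constructed in this excerpt; a plausible definition would be:
# assumed construction: count users and ratings per ISBN, keeping isbn as a column for the merges below
num_book_ratings = df_ratings.groupby('isbn')[['user', 'rating']].count().reset_index()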
most_rated_books = num_book_ratings.sort_values('rating', ascending=False)
most_rated_books_preview = pd.merge(most_rated_books, df_books, on='isbn').drop('user', axis=1)
most_rated_books_preview.head(5)
| | isbn | rating | title | author |
|---|---|---|---|---|
| 0 | 0971880107 | 2502 | Wild Animus | Rich Shapero |
| 1 | 0316666343 | 1295 | The Lovely Bones: A Novel | Alice Sebold |
| 2 | 0385504209 | 883 | The Da Vinci Code | Dan Brown |
| 3 | 0060928336 | 732 | Divine Secrets of the Ya-Ya Sisterhood: A Novel | Rebecca Wells |
| 4 | 0312195516 | 723 | The Red Tent (Bestselling Backlist) | Anita Diamant |
For higher statistical significance, we choose to omit users with fewer than 200 ratings and books with fewer than 100 ratings:
ratings = df_ratings
#remove users that have given less than 200 ratings
users_count = ratings['user'].value_counts()
ratings = ratings[~ratings['user'].isin(users_count[users_count < 200].index)]
#remove books that have been rated less than 100 times.
books_count = ratings['isbn'].value_counts()
ratings = ratings[~ratings['isbn'].isin(books_count[books_count < 100].index)]
It turns out that the most rated book has one of the lowest average ratings:
num_book_ratings['average rating'] = df_ratings.groupby('isbn')['rating'].mean()
pd.merge(num_book_ratings, df_books[['title', 'isbn']], on='isbn').sort_values('rating', ascending=False).head()
| | isbn | user | rating | average rating | title |
|---|---|---|---|---|---|
| 215968 | 0971880107 | 2502 | 2502 | 1.019584 | Wild Animus |
| 38572 | 0316666343 | 1295 | 1295 | 4.468726 | The Lovely Bones: A Novel |
| 70803 | 0385504209 | 883 | 883 | 4.652322 | The Da Vinci Code |
| 7345 | 0060928336 | 732 | 732 | 3.448087 | Divine Secrets of the Ya-Ya Sisterhood: A Novel |
| 32372 | 0312195516 | 723 | 723 | 4.334716 | The Red Tent (Bestselling Backlist) |
This makes sense somehow: people tend to rate the books they love, or those they hate, the most.
Correlation analysis:
We use bivariate correlation, or PCC (Pearson's correlation coefficient), in order to calculate the (linear) correlation between two sets of ratings:
# pivot ratings into matrix style
df_ratings_mtrx = ratings.pivot(index='user', columns='isbn', values='rating')
As someone who has read The Da Vinci Code (with isbn = 0385504209), let's try and see the correlated results that Pearson's coefficient suggests:
davinci_corr = pd.DataFrame(df_ratings_mtrx.corrwith(df_ratings_mtrx['0385504209'], method='pearson'), columns=['pearson coeff']).dropna()
We can observe the top 10 suggestions whose ratings are most highly correlated with those of The Da Vinci Code:
| | isbn | pearson coeff | rating | average rating | title |
|---|---|---|---|---|---|
| 7 | 0385504209 | 1.000000 | 883 | 4.652322 | The Da Vinci Code |
| 15 | 0671027360 | 0.218741 | 586 | 3.718430 | Angels & Demons |
| 6 | 0375727345 | 0.214953 | 552 | 3.039855 | House of Sand and Fog |
| 1 | 0142001740 | 0.186145 | 615 | 4.219512 | The Secret Life of Bees |
| 16 | 067976402X | 0.179512 | 614 | 3.255700 | Snow Falling on Cedars |
| 14 | 059035342X | 0.166290 | 571 | 4.900175 | Harry Potter and the Sorcerer's Stone (Harry P... |
| 5 | 0345337662 | 0.149289 | 506 | 3.535573 | Interview with the Vampire |
| 4 | 0316666343 | 0.143711 | 1295 | 4.468726 | The Lovely Bones: A Novel |
| 12 | 0446672211 | 0.135783 | 585 | 4.105983 | Where the Heart Is (Oprah's Book Club (Paperba... |
| 3 | 0316601950 | 0.133880 | 568 | 3.593310 | The Pilot's Wife : A Novel |
This suggests that the correlation does indeed work, as the first suggestion is from the same author, and the next ones are all in the Mystery/Thriller genre.
Data modelling
We convert our DataFrame into a scipy sparse matrix and fill the missing values with zeros (since we will actually calculate distances between rating vectors):
df_ratings_books = pd.merge(ratings, df_books, on='isbn').dropna(axis = 0, subset = ['title']).drop_duplicates(['user', 'title'])
df_ratings_books_mtrx = df_ratings_books.pivot(index = 'title', columns = 'user', values = 'rating').fillna(0)
# convert dataframe to scipy sparse matrix for more efficient calculations
ratings_mtrx = csr_matrix(df_ratings_books_mtrx.values)
We use scikit-learn's k-nearest neighbors algorithm, NearestNeighbors, and choose metric='cosine' to calculate the cosine similarity between rating vectors:
model = NearestNeighbors(algorithm='brute',
                         leaf_size=30,
                         metric='cosine',
                         metric_params=None,
                         n_jobs=1,
                         n_neighbors=5,
                         p=2).fit(ratings_mtrx)
We write a function that uses our model to give the 5 most pertinent recommendations for a given book title:
def get_recommends(book = "", verbose=False):
    """ Takes the title of a book and checks if it is in the database, then prints 5 recommendations using KNN and returns a list of
    each recommendation with its distance; if verbose is set, it also prints the distances"""
    try:
        index = df_ratings_books_mtrx.index.get_loc(book)
    except:
        print("Couldn't find any :'(")
        return [book, ["", 0.]*5]
    knn_dist, knn_ind = model.kneighbors(df_ratings_books_mtrx.iloc[index, :].values.reshape(1, -1), n_neighbors = 6)
    recommendations = [book, []]
    for i in range(0, len(knn_dist.flatten())):
        if i == 0:
            book_to_recommend = df_ratings_books_mtrx.index[index]
            print('Recommendations for {}:\n'.format(book_to_recommend))
        else:
            book_to_recommend = df_ratings_books_mtrx.index[knn_ind.flatten()[i]]
            recommendations[1].append([book_to_recommend, knn_dist.flatten()[i]])
            if verbose:
                print('{}: {}, with a distance of {:.4f}'.format(i, book_to_recommend, knn_dist.flatten()[i]))
            else:
                print('{}: {}'.format(i, book_to_recommend))
    return recommendations
And we test it: here I picked the same book The Da Vinci Code
get_recommends(book = "The Da Vinci Code");
>>>
Recommendations for The Da Vinci Code:
1: Angels & Demons
2: Widow's Walk
3: Doing Good
4: The Blue Nowhere : A Novel
5: Touching Evil
As a bonus, we write a function that helps us, using the Jaro distance, to compare the similarity between two strings. We use the Levenshtein module for very fast computation:
def title_suggest(title, lst=list(dict.fromkeys(list(df_books['title']))), k=20):
    """Gives available suggestions of books in the database based on the Jaro distance for string matching"""
    comp = list()
    for name in lst:
        comp.append((name, lev.jaro(title, name)))
    comp.sort(key=lambda x:x[1], reverse = True)
    print("Possible suggestions:")
    for i in range(k):
        print(comp[i][0])
    return comp[:5]
And we try to get suggestions, say Memoirs of a Geisha:
title_suggest("Memoirs of a Geisha");
>>>
Possible suggestions:
Memoirs of a Geisha
Memoirs of a Geisha Uk
Memoirs of a Beatnik
Memoirs of a Pet Therapist
Memoirs of Elise
Memoirs of the Forties
Memoirs of an Invisible Man
Memoirs of a Medieval Woman
Memoirs of a Geisha : A Novel (AUDIO CASSETTE)
Memoirs of a Survivor
Memoirs of Fanny Hill
Memoirs of a Race Traitor
Memoirs
Memoirs of an Unfit Mother
Memoirs of a Bengal Civilian
Memoirs of a Mangy Lover
Promise of Paradise
Memoirs of a Voluptuary
Memoirs of a Lightkeeper's Son
Memoirs of a Dutiful Daughter
The function title_suggest can be integrated into get_recommends to handle the case where no title matches the input title.
Cardiovascular disease classification
15th February 2019
We will use a basic machine learning framework to perform a binary classification task based on clinical examination results from patients with and without cardiovascular disease, in order to train a model that can tell, from a patient's medical attributes, whether they have cardiovascular disease or not.

The original data was collected during medical examinations and belongs to the Cleveland database from the UCI Machine Learning Repository. We use a more compact version with the relevant parameters (only 14 features) from Kaggle.
Cardiovascular disease diagnosis
Here is a short summary of what we will be doing:
- Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
- Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
- Model evaluation and comparison - we evaluate the models using some metrics and compare them to find the best one.
- Feature importance - we highlight the important parameters in predicting the presence of a cardiovascular disease.
- Conclusion
Let's import the main modules and libraries we will be working with:
## Data analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress
## Scikit Learn Models:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
## Scikit Learn Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_curve, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve
## Scikit Learn preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, RobustScaler
from sklearn.decomposition import PCA
1. Data analysis:
The rows in the dataset represent patients and the columns represent examination results from various blood and medical tests.
We use the dataset to make a prediction on our target variable:
Features | Variable Type | Variable | Value Type |
---|---|---|---|
Age | Objective Feature | age | int(years) |
Gender | Objective Feature | sex | categorical code: 1 = male; 0 = female |
Chest pain type | Examination Feature | cp | categorical code : 0: Typical angina; 1: Atypical angina; 2: Non-anginal pain; 3: Asymptomatic |
Resting blood pressure | Examination Feature | trestbps | float (mm Hg) |
Serum cholesterol | Examination Feature | chol | float (mg/dl) |
Fasting blood sugar > 120 mg/dl | Examination Feature | fbs | binary |
Resting electrocardiographic results | Examination Feature | restecg | categorical code : 0: Nothing to note; 1: ST-T Wave abnormality; 2: Possible or definite left ventricular hypertrophy |
Maximum heart rate achieved | Examination Feature | thalach | float |
Exercise induced angina | Examination Feature | exang | binary |
ST depression induced by exercise relative to rest | Examination Feature | oldpeak | float |
The slope of the peak exercise ST segment | Examination Feature | slope | categorical code : 0: Upsloping; 1: Flatsloping; 2: Downsloping |
Number of major vessels colored by fluoroscopy | Examination Feature | ca | int (0-3) |
Thallium stress result | Examination Feature | thal | int: 1,3: normal; 6: fixed defect; 7: reversible defect |
Presence or absence of cardiovascular disease | Target Variable | target | binary: 1=yes, 0=no |
We start by reading our CSV file using pandas
heart_df = pd.read_csv('files/heart.csv')
heart_df.head(10)
and have a look at how many positive and negative samples there are:

Since the two counts are close to each other, the target
column can be considered balanced.
Correlation analysis
We first explore some possible correlation between independent variables, as this may suggest which independent variables may or may not have an impact on our target variable:
corr_mtrx = heart_df.corr()
A correlation matrix tells us how related each variable is to every other one. Since correlation matrices are symmetrical, one half suffices to analyse the data, hence we keep only one triangle, as follows:
#Mask the upper triangle:
mask = np.triu(np.ones(corr_mtrx.shape),0)
seaborn
's heatmap is perfect to graphically highlight our correlations:
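A minimal sketch of the heatmap call (the exact styling of the figure below may differ):
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(corr_mtrx,
            mask=mask,       # hide the upper triangle
            annot=True,      # write the correlation value in each cell
            fmt='.2f',
            cmap='coolwarm',
            ax=ax)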

A higher positive (resp. negative) value means a potential positive (resp. negative) correlation. For example, we observe a negative correlation between cardiovascular diseases and gender. Indeed, if we look closer, this is partially due to the fact that there are more male than female patients in the sample:
heart_df['sex'].value_counts(normalize=True)
>>>
1 0.683168
0 0.316832
Name: sex, dtype: float64

We also observe a negative correlation of target
with age, along with a positive correlation with thalach
. We then try to understand how age
and maximal heart rate relate, according to the presence or absence of cardiovascular disease:
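One possible way to produce such a figure (a sketch; the original plot's styling may differ) is seaborn's lmplot, which overlays one regression line per class:
# Scatter of age vs. maximal heart rate, with one regression line per target class
sns.lmplot(data=heart_df, x='age', y='thalach', hue='target',
           palette='coolwarm', height=6, aspect=1.4)
plt.xlabel('Age')
plt.ylabel('Maximum heart rate (thalach)')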

First, we observe that the maximal heart rate decreases with age: this makes sense,
as the younger someone is, the higher their maximal heart rate. What is also interesting
is that this trend is more pronounced in patients
with cardiovascular problems, as the slope of the regression for target = 1
is clearly steeper than that for target = 0
.
However, this might also be misleading. If we observe the age distribution of the patients

the Kernel Density Estimate is slightly skewed to the right, which is also
reflected in the scatter plot above, as there are more older patients.
On the other hand, the highest correlation with the target
variable is seen
with the chest pain variable cp
:
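A simple way to visualise this relation (a sketch, not necessarily the exact code behind the figure below):
# Frequency of each chest pain type, split by presence of disease
pd.crosstab(heart_df['cp'], heart_df['target']).plot(kind='bar', figsize=(10, 6))
plt.xlabel('Chest pain type (cp)')
plt.ylabel('Number of patients')
plt.legend(['No disease', 'Disease'])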

After a little bit of research, the American Heart Association supplies us with the following insights:
- Typical angina: chest pain or discomfort caused when the heart muscle doesn't get enough oxygen-rich blood.
- Atypical angina: chest pain that is not Typical Angina
- Non-anginal pain: typically esophageal spasms (not related to the heart)
- Asymptomatic: chest pain not showing signs of any disease
Interestingly, atypical angina is not supposed to be related to the heart, yet it still shows a higher ratio of participants with cardiovascular disease than without. This is due to the actual definition of 'atypical angina' in the medical field, which can be confusing. Source: PubMed
2. Data modelling
Here we will try to build a model and train in order to predict our target
variable:
np.random.seed(1729)
X = heart_df.drop('target', axis = 1)
y = heart_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Our problem is a classification task, where we are predicting a categorical variable: target
. We have labeled data with fewer than 100,000 samples. Using Scikit Learn's documentation we opt for the following models:
- Log Reg : LogisticRegression
- K-Nearest Neighbors : KNeighborsClassifier
- Random Forest : RandomForestClassifier
- SVC : Support Vector Classification
- XGBoost : XGBClassifier.
In order to see which model performs the best, we proceed as follows:
2.1. Instantiate each model in a dictionary:
models = {'logReg': LogisticRegression(max_iter = 1000),
'KNN': KNeighborsClassifier(),
'SVC': SVC(),
'RandomForestClassifier': RandomForestClassifier(),
'XGB': XGBClassifier(use_label_encoder=False)}
2.2 Create function to fit and score models:
def mod_score(models, X_train, X_test, y_train, y_test):
    '''
    Takes a dictionary of models, fits each one on the training set and returns its score on the test set
    '''
    # Fix randomness
    np.random.seed(1729)
    # empty dict to collect the results
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = model.score(X_test, y_test)
    return results
2.3 View the results:
mod_scores = mod_score(models, X_train, X_test, y_train, y_test)
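The comparison figure below can be obtained, for instance, by plotting the scores dictionary directly (a sketch):
# Bar plot of the test accuracy of each baseline model
pd.DataFrame(mod_scores, index=['accuracy']).T.plot.bar(legend=False, figsize=(8, 5))
plt.ylabel('Test accuracy')
plt.xticks(rotation=45);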

3. Model fine tuning
We will first normalize our datasets to see if it improves our scores, and then fine-tune the hyper-parameters of our classifiers to get the best possible results:
3.1. Datasets normalization
We prepare normalized and scaled versions of the data, using Normalizer, MinMaxScaler and RobustScaler (each fitted on the training set only):
# Normalized datasets
normal = Normalizer()
Xn_train = normal.fit_transform(X_train)
Xn_test = normal.transform(X_test)        # fit on the training set only, to avoid data leakage
# Min-max scaled datasets
minmax = MinMaxScaler()
Xm_train = minmax.fit_transform(X_train)
Xm_test = minmax.transform(X_test)
# Robust scaled datasets
scaler = RobustScaler()
Xsc_train = scaler.fit_transform(X_train)
Xsc_test = scaler.transform(X_test)
3.2. Hyper-parameters tuning
For lack of data samples (only about 300), we cannot afford a separate validation set to tune our models, so we use k-fold cross-validation instead.
We use RandomizedSearchCV
:
# logReg hyperparameters
logReg_grid = { 'C': np.logspace(-4, 4, 50),
'solver': ['liblinear'],
'penalty' : ['l1', 'l2']}
# RandomForest hyperparameters
RandFor_grid = {'n_estimators': np.arange(10, 1000, 50),
'max_depth': [None, 3, 5, 10],
'min_samples_split': np.arange(2, 20, 2),
'min_samples_leaf': np.arange(1, 20, 2) }
# KNN hyperparameters
KNN_grid = { 'leaf_size' : [i for i in range(1,30)],
'n_neighbors' : [i for i in range(1,21)],
'p' : [1,2]
}
# SVC hyperparameters
SVC_grid = [{'C': [1, 10, 100, 1000],
'kernel': ['linear']},
{'C': [0.1, 1, 10, 100],
'gamma': [1, 0.1, 0.01, 0.001],
'kernel': ['rbf']}]
# XGB hyperparameters
XGB_grid = {'max_depth': range(3, 10, 2),
'gamma':[i/10 for i in range(0,5)],
'subsample':[i/10 for i in range(6,10)],
'colsample_bytree':[i/10 for i in range(6,10)],
'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]}
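As an illustration, here is how the logistic regression grid could be fed to RandomizedSearchCV with 5-fold cross-validation (a sketch; n_iter and cv are assumptions, not necessarily the settings behind the figure below):
np.random.seed(1729)
rs_logReg = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                               param_distributions=logReg_grid,
                               cv=5,        # 5-fold cross-validation
                               n_iter=20,   # number of sampled hyper-parameter combinations
                               verbose=True)
rs_logReg.fit(Xn_train, y_train)
print(rs_logReg.best_params_, rs_logReg.score(Xn_test, y_test))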
Here, at least with the tuning we have been doing, RandomizedSearchCV and GridSearchCV both come up with essentially the same suggested hyper-parameters.
We compare the scores before and after tuning:

4. Model evaluation
Now we want to evaluate the logReg classifier that we have been tuning, using more metrics than just accuracy
:
# Instantiate best model with best hyperparameters
clf = LogisticRegression(max_iter = 800,
C = 339.3221771895323,
solver = 'liblinear',
penalty = 'l1').fit(Xn_train, y_train);
#make predictions
y_preds = clf.predict(Xn_test)
4.1. Confusion matrix:
A visual way to show where our model nailed its predictions and where it failed:
conf_mtrx = confusion_matrix(y_test, y_preds)
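The matrix can then be displayed with seaborn, for example (a sketch of the figure below):
fig, ax = plt.subplots(figsize=(4, 4))
sns.heatmap(conf_mtrx, annot=True, fmt='d', cbar=False, ax=ax)
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label');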

4.2. ROC Curve and AUC Scores:
We compare the true positive rate to the false positive rate, weighing a person flagged as diseased who does not actually have the disease against a person flagged as healthy who actually has a cardiovascular disease:
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(Xn_test)[:, 1])
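and plot it against the diagonal of a random classifier (a sketch):
plt.plot(fpr, tpr, label='LogisticRegression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance level')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (AUC = {:.3f})'.format(metrics.roc_auc_score(y_test, clf.predict_proba(Xn_test)[:, 1])))
plt.legend();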

4.3. Classification report
Now let's have a look at the main classification metrics:
# Create a classification report using the classification_report function
report = classification_report(y_test, y_preds, output_dict=True)
pd.DataFrame(report).T
precision | recall | f1-score | support | |
---|---|---|---|---|
0 | 0.923077 | 0.857143 | 0.888889 | 28.000000 |
1 | 0.885714 | 0.939394 | 0.911765 | 33.000000 |
accuracy | 0.901639 | 0.901639 | 0.901639 | 0.901639 |
macro avg | 0.904396 | 0.898268 | 0.900327 | 61.000000 |
weighted avg | 0.902864 | 0.901639 | 0.901264 | 61.000000 |
We tweak our previous function in order to evaluate our model on all these metrics:
def eval_mod(model, X_test, y_test):
    ''' Evaluate a fitted model on the test set '''
    y_preds = model.predict(X_test)
    acc = model.score(X_test, y_test)
    prec = precision_score(y_test, y_preds)
    recall = recall_score(y_test, y_preds)
    f1 = f1_score(y_test, y_preds)
    dft_eval = {'accuracy': acc,
                'precision': prec,
                'recall': recall,
                'f1 score': f1}
    for metric, value in dft_eval.items():
        print('{} : {:.2f}%'.format(metric, value * 100))
    return dft_eval
dft_eval = eval_mod(clf, Xn_test, y_test)
>>>
accuracy : 90.16%
precision : 88.57%
recall : 93.94%
f1 score : 91.18%
And using cross_val_score()
to cross-validate for more robust results:
def cv_eval_mod(model, X, y):
    ''' Evaluate a model with 5-fold cross-validation '''
    cv_acc = np.mean(cross_val_score(model, X, y, scoring = "accuracy", cv = 5))
    cv_prec = np.mean(cross_val_score(model, X, y, scoring = "precision", cv = 5))
    cv_recall = np.mean(cross_val_score(model, X, y, scoring = "recall", cv = 5))
    cv_f1 = np.mean(cross_val_score(model, X, y, scoring = "f1", cv = 5))
    cv_eval = {'accuracy': cv_acc,
               'precision': cv_prec,
               'recall': cv_recall,
               'f1 score': cv_f1}
    for metric, value in cv_eval.items():
        print('{} : {:.2f}%'.format(metric, value * 100))
    return cv_eval
cv_eval = cv_eval_mod(clf, Xn_test, y_test)
>>>
accuracy : 77.18%
precision : 76.90%
recall : 84.76%
f1 score : 80.45%
We can observe our evaluation:

5. Feature importance
Finally, we will find which features were important for predicting cardiovascular disease from the medical examination results we had, or in other words:
Which characteristics contribute the most to logReg predicting whether someone has cardiovascular disease or not?
#get the coefficients from logReg
clf.coef_
# Match features to columns of our df
features_dict = dict(zip(heart_df.columns, list(clf.coef_[0])))
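which we can visualise as a bar plot (a sketch of the figure below):
# Bar plot of the logistic regression coefficient of each feature
pd.DataFrame(features_dict, index=['coefficient']).T.plot.bar(legend=False, figsize=(10, 5))
plt.ylabel('logReg coefficient');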

A final remark: our model seems to have learned the pattern slope
increases $\Rightarrow$ target
$\longrightarrow$ positive.
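We can check this directly in the data (a sketch, not part of the original notebook):
# How the target splits across the three slope categories
pd.crosstab(heart_df['slope'], heart_df['target']).plot(kind='bar', figsize=(8, 5))
plt.xlabel('Slope of the peak exercise ST segment')
plt.ylabel('Number of patients');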

Here, the data does suggest such a pattern, and it seems medically plausible as well:
- Upsloping: better heart rate with exercise (uncommon)
- Flatsloping: minimal change (typical healthy heart)
- Downsloping: signs of an unhealthy heart
6. Conclusion
We analysed our sample and trained a model that reached up to 90.16% accuracy on the test set and 83.81% cross-validated accuracy. Since our target classes are fairly balanced, we would want to aim for a higher accuracy of at least 95%. However, since false negatives are fewer than false positives (see the confusion matrix), we do have a fairly high recall (91.52% on cross-validation), which is good.
Breast tumours diagnostic using Logistic regression and support vector machine
6th February 2019
We will use a basic Machine Learning framework to perform a binary classification task based on medical examination results from patients with and without a breast tumour, in order to predict whether a tumour is malignant or benign.

The original data is taken from the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. This is a very simple and compact example of a basic Machine Learning binary classification workflow.
code Breast tumour diagnosis
Here is a short summary of what we will be doing:
- Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
- Data modelling - we create and train a model to predict a target variable, then we tune its hyper-parameters in order to improve it if possible.
- Model evaluation and comparison - we evaluate the models using several metrics and compare them to find the best one.
- Model experimentation - we try to improve the model through further experimentation.
- Feature importance - we highlight the parameters that matter most in predicting the nature of a breast tumour.
- Conclusion
To work through these topics, we will use Scikit learn
, matplotlib
and pandas
.
Let's import the main modules and libraries we will be working with:
## Data analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress, norm
## Scikit Learn Models:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
## Scikit Learn Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_curve, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve
## Scikit Learn preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
1. Data analysis:
The rows in the dataset represent the 569 patients and the columns represent 30 characteristics of the cell nuclei present in images of a fine needle aspirate (FNA) of a breast mass. We use the dataset to make a prediction on our target variable:
Features | Variable Type | Variable | Value Type | |
---|---|---|---|---|
0 | Radius (mean) | Statistical Feature | radius_m | float |
1 | Texture (mean) | Statistical Feature | texture_m | float |
2 | Perimeter (mean) | Statistical Feature | perimeter_m | float |
3 | Area (mean) | Statistical Feature | area_m | float |
4 | Smoothness (mean) | Statistical Feature | smoothness_m | float |
5 | Compactness (mean) | Statistical Feature | compactness_m | float |
6 | Concavity (mean) | Statistical Feature | concavity_m | float |
7 | Concave points (mean) | Statistical Feature | cp_m | float |
8 | Symmetry (mean) | Statistical Feature | symmetry_m | float |
9 | Fractal dimension (mean) | Statistical Feature | fd_m | float |
10 | Radius error | Statistical Feature | radius_std | float |
11 | Texture error | Statistical Feature | texture_std | float |
12 | Perimeter error | Statistical Feature | perimeter_std | float |
13 | Area error | Statistical Feature | area_std | float |
14 | Smoothness error | Statistical Feature | smoothness_std | float |
15 | Compactness error | Statistical Feature | compactness_std | float |
16 | Concavity error | Statistical Feature | concavity_std | float |
17 | Concave points error | Statistical Feature | cp_std | float |
18 | Symmetry error | Statistical Feature | symmetry_std | float |
19 | Fractal dimension error | Statistical Feature | fd_std | float |
20 | Worst radius | Statistical Feature | radius_L | float |
21 | Worst texture | Statistical Feature | texture_L | float |
22 | Worst perimeter | Statistical Feature | perimeter_L | float |
23 | Worst area | Statistical Feature | area_L | float |
24 | Worst smoothness | Statistical Feature | smoothness_L | float |
25 | Worst compactness | Statistical Feature | compactness_L | float |
26 | Worst concavity | Statistical Feature | concavity_L | float |
27 | Worst concave points | Statistical Feature | cp_L | float |
28 | Worst symmetry | Statistical Feature | symmetry_L | float |
29 | Worst fractal dimension | Statistical Feature | fd_L | float |
30 | Type of tumour | Target Variable | target | categorical code: 0 = Malignant; 1 = Benign |
We build our (569 rows × 31 columns) DataFrame using pandas; here cancer is the loaded dataset (e.g. scikit-learn's load_breast_cancer Bunch, which exposes 'data', 'feature_names' and 'target'):
cancer_df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
cancer_df.head()
and have a look at how many positive and negative samples there are:

Although there are more benign than malignant samples, the target
classes are not too unbalanced. We can also plot some features against each other to observe
how they relate: for example, the mean radius and
mean smoothness of the tumours in both classes:
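A possible sketch of that scatter plot (the column names below follow scikit-learn's feature_names; adapt them if your DataFrame uses e.g. radius_m, smoothness_m as in the table above):
# Mean radius vs. mean smoothness, coloured by diagnosis (0 = malignant, 1 = benign)
sns.scatterplot(data=cancer_df, x='mean radius', y='mean smoothness',
                hue='target', alpha=0.7)
plt.title('Mean radius vs. mean smoothness');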

We can see that benign tumours tend to be smaller in radius than the malignant ones. We can also observe how some features overlap, which would make it hard to diagnose them:

Correlation matrix:
We explore some possible correlation between independent variables, as this may suggest which independent variables may or may not have an impact on our target variable:
corr_mtrx = cancer_df.corr()
#Mask the upper triangle:
mask = np.triu(np.ones(corr_mtrx.shape),0)

2. Data modelling
We prepare several versions of our datasets, using min-max normalization:
$\displaystyle{ X_n = \frac{X - X_{\min}}{X_{\max} - X_{\min}} }$
and standard scaling:
$\displaystyle{ X_{sc} = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)} }$
as well as a PCA projection of the scaled data.
We split our data into train and test sets:
np.random.seed(1729)
X = cancer_df.drop('target', axis = 1)
y = cancer_df['target']
# Non-normalized datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Normalized datasets (min-max, using the training set's statistics to avoid leakage)
Xn_train = ( X_train - X_train.min() ) / ( X_train.max() - X_train.min() )
Xn_test = ( X_test - X_train.min() ) / ( X_train.max() - X_train.min() )
# Scaled datasets
scaler = StandardScaler()
Xsc_train = scaler.fit_transform(X_train)
Xsc_test = scaler.transform(X_test)       # fit on the training set only
# Using PCA (fitted on the scaled training set)
pca = PCA(n_components=3)
Xpca_train = pca.fit_transform(Xsc_train)
Xpca_test = pca.transform(Xsc_test)
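We can quickly check how much variance the three principal components retain (a small sanity check, not in the original notebook):
# Fraction of the total variance captured by each of the 3 components
print(pca.explained_variance_ratio_)
print('Total explained variance: {:.1%}'.format(pca.explained_variance_ratio_.sum()))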
Our problem is a classification task, where we are predicting a categorical variable: target
. We have labeled data with fewer than 100,000 samples. Using Scikit Learn's documentation we opt for the following models:
- Log Reg : LogisticRegression
- K-Nearest Neighbors : KNeighborsClassifier
- SVC : Support Vector Classification
- Gaussian NB : Gaussian Naive Bayes
We test which model performs best:
# Instantiate each model in a dictionary
models = {'logReg': LogisticRegression(max_iter = 5000),
'KNN': KNeighborsClassifier(),
'SVC': SVC(),
'GNB': GaussianNB()}
# Create function to fit and score models
def mod_score(models, X_train, X_test, y_train, y_test):
    '''
    Takes a dictionary of models, fits each one on the training set and returns its score on the test set
    '''
    # Fix randomness
    np.random.seed(1729)
    # empty dict to collect the results
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = model.score(X_test, y_test)
    return results
# score the results
mod_scores = mod_score(models, X_train, X_test, y_train, y_test)
We finally get the following results:

Hyper-parameters tuning:
For lack of data samples (only 569), we cannot afford a separate validation set to tune our
models, so we use k-fold cross-validation instead. We will use
RandomizedSearchCV
# logReg hyperparameters
logReg_grid = { 'C': np.logspace(-4, 4, 50),
'solver': ['liblinear'],
'penalty' : ['l1', 'l2']}
# SVC hyperparameters
lSVC_grid = [{'C': [1, 10, 100, 1000],
'kernel': ['linear']},
{'C': [0.1, 1, 10, 100],
'gamma': [1, 0.1, 0.01, 0.001],
'kernel': ['rbf']}]
# KNN hyperparameters
KNN_grid = {'n_neighbors': [3, 4, 5, 10],
'weights': ['uniform', 'distance'],
'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
'leaf_size' : [10, 20, 30, 50],
'p' : [1,2]}
# GaussianNB hyperparameters
GNB_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
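As an illustration, the SVC grid (a short list of parameter dictionaries) can also be searched exhaustively with GridSearchCV; a minimal sketch, where cv = 5 is an assumption:
gs_SVC = GridSearchCV(SVC(), param_grid=lSVC_grid, cv=5, verbose=1)
gs_SVC.fit(Xsc_train, y_train)
print(gs_SVC.best_params_)
print('Test score: {:.4f}'.format(gs_SVC.score(Xsc_test, y_test)))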
We can visually compare the performances:

The best performing model turns out to be SVC(C = 10, kernel = 'linear', probability=True).
3. Model evaluation
We make our predictions:
# Instantiate best model with best hyperparameters
clf = SVC(C = 10, kernel = 'linear', probability=True).fit(Xsc_train, y_train);
# Make predictions
y_preds = clf.predict(Xsc_test)
3.1. Confusion matrix:
conf_mtrx = confusion_matrix(y_test, y_preds)

3.2. ROC Curve and AUC Scores
We compare the true positive rate to the false positive rate, i.e. we weigh a malignant diagnosis for a tumour that is actually benign against a benign diagnosis for a tumour that is actually malignant (the latter being the more dangerous error):
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(Xsc_test)[:, 1])

3.3. Classification report
Let's have a look at the main evaluation metrics:
# Create a classification report using the classification_report function
report = classification_report(y_test, y_preds, output_dict=True)
pd.DataFrame(report).T
precision | recall | f1-score | support | |
---|---|---|---|---|
0.0 | 0.981132 | 0.981132 | 0.981132 | 53.000000 |
1.0 | 0.988889 | 0.988889 | 0.988889 | 90.000000 |
accuracy | 0.986014 | 0.986014 | 0.986014 | 0.986014 |
macro avg | 0.985010 | 0.985010 | 0.985010 | 143.000000 |
weighted avg | 0.986014 | 0.986014 | 0.986014 | 143.000000 |
Finally, we use a 5-fold cross_val_score()
to cross-validate our metrics
# cross-validated accuracy
cv_acc = np.mean(cross_val_score(clf, X, y, scoring = "accuracy", cv = 5))
# cross-validated precision
cv_prec = np.mean(cross_val_score(clf, X, y, scoring = "precision", cv = 5))
# cross-validated recall
cv_recall = np.mean(cross_val_score(clf, X, y, scoring = "recall", cv = 5))
# cross-validated f1-score
cv_f1 = np.mean(cross_val_score(clf, X, y, scoring = "f1", cv = 5))
and get a better, more robust estimate of how our model is doing:
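A minimal way to print these cross-validated metrics in the format of the output below (a sketch using the variables computed above):
for name, value in {'accuracy': cv_acc, 'precision': cv_prec,
                    'recall': cv_recall, 'f1 score': cv_f1}.items():
    print('{} : {:.2f}%'.format(name, value * 100))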
>>>
accuracy : 95.26%
precision : 95.89%
recall : 96.64%
f1 score : 96.24%

4. Feature importance
Finally, we will find which features were important for telling malignant tumours from benign ones using the examination results we had, or in other words:
Which characteristics contribute the most to the SVC predicting whether a tumour is malignant or not?
# get the coefficients from the SVC (available since we use a linear kernel)
clf.coef_
# Match features to columns of our df
features_dict = dict(zip(cancer_df.columns, list(clf.coef_[0])))

We can already see that our model is learning and figuring out some patterns to guess malignant and benign cases:

We see for example that it relies more on the standard deviation of the fractal dimension of a tumour to eliminate the malignant cases :

and, on the other hand, relies more on the largest area as well as the standard deviation of the area to detect malignant tumours.

5. Conclusion
We analysed our sample and trained a model that reached up to 98.60% accuracy on the test set and 95.26% cross-validated accuracy. Since our target classes are fairly balanced, we would want to aim for a very high **accuracy** (of at least 95%). However, we would still like to aim for a higher recall, i.e. reducing the number of false negatives (currently 1, see the confusion matrix) to 0.
6. Future development:
I would like to work on an X-ray scan classification task, using image segmentation or image classification; I think I saw a competition about that on Kaggle. Maybe in the near future!
Here are some projects I did myself, for uni and/or for pure fun and curiosity! I also have a random junk pile of code, available here: developer_mode
Pythonic implementation of Conway's "Game of Life", using Matplotlib's FuncAnimation
18th December 2020
As a huge fan of Conway's work, his beautiful mind and his charismatic and genius approach to mathematics in general, I decided to create, for my Python class's final project, a "pythonic" implementation of his famous Game of Life, which shows how easily and how beautifully life can emerge:
code Implementing the Game of Life in Python:
Let's import the main modules and libraries we will be working with:
from IPython.display import HTML
from matplotlib.animation import FuncAnimation
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import convolve2d
1. A Board
First, we create a function that builds the board: for that, we use numpy's meshgrid function:
def create_board(size):
    """ Takes a tuple size (n,m) to create a board (n x m) """
    x = np.arange(0, size[0])
    y = np.arange(0, size[1])
    board = np.meshgrid(x, y)
    return board
We use matplotlib
's scatter function to plot our board. It will look something like this:
SIZE = (40,40)
marker_size = 350
board = create_board(SIZE)
fig, ax = plt.subplots(figsize=(16, 16))
scatter = ax.scatter(*board, animated=True, s=marker_size, edgecolor=None, c = '#ddc3a5')
ax.set_facecolor('#201e20')
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
Our board will look like this:

2. A status function
We represent the status, alive or dead, of each cell on the board as True
and False
. We can also give a starting probability in [0, 1) for living cells:
def status(size, prob_life):
    """ Random initial status; note that with '>=', a cell starts alive with probability (1 - prob_life) """
    stat = (np.random.uniform(0, 1, size=size) >= prob_life)
    return stat
Now, running the cell:
status(SIZE, prob_life=0.3)
will produce the following output:
array([[False, True, True, ..., False, False, False],
[ True, True, True, ..., True, True, True],
[ True, False, True, ..., True, False, True],
...,
[ True, True, False, ..., True, False, True],
[ True, False, True, ..., True, True, True],
[False, True, False, ..., False, True, True]])
Conway came up with a set of rules for this game:
- Survival : Any live cell with two or three live neighbors lives on to the next generation.
- Under-population : Any live cell with fewer than two live neighbors dies.
- Over-population : Any live cell with more than three live neighbors dies.
- Reproduction : Any dead cell with exactly three live neighbors becomes a live cell.
So a cell has a total of 8 neighbouring cells : up, down, left, right and diagonals. We create a function that would apply these rules to a given game status
.
3. The rules of life
We first create a function that will count how many living neighbours a cell has.
A really nice way to approach this is to ask: what happens when we reach the edge of the board? This makes me think of the game Super Mario Bros. 3 Battle mode on the SNES when I was young. Every time Mario reached the edge of the screen, he would reappear on the other edge:

This suggests that what we actually see as a board (or a TV screen in my case)
is actually a band that wraps around on itself, where Mario runs forever. We use scipy
's
2D convolution function convolve2d
and wrap the edges of the board with periodic (circular) boundary conditions, via
the parameter : boundary="wrap"
def count_alive_neighbors(status):
    """ Counts the number of neighboring alive cells """
    kernel = np.array(
        [[1, 1, 1],
         [1, 0, 1],
         [1, 1, 1]])
    count = convolve2d(status, kernel, mode='same', boundary="wrap")
    return count
and we implement Conway's rules into a function:
def life_rules(status):
    """
    Applies Conway's Game of Life rules given the current status of the game and returns a tuple of two arrays:
    the new status of each cell and the number of living neighbours of each cell
    """
    alive_neighbors = count_alive_neighbors(status)
    # Under-population : a live cell needs at least two live neighbours to survive
    survive_UP = ( alive_neighbors >= 2 )
    # Over-population : a live cell needs at most three live neighbours to survive
    survive_OP = ( alive_neighbors <= 3 )
    # Survivors :
    survive = status * survive_UP * survive_OP
    # Reproduction : a dead cell with exactly three live neighbours becomes alive
    new_status = np.where(alive_neighbors == 3, True, survive)
    new_neighbors = count_alive_neighbors(new_status)
    return new_status, new_neighbors
For visual reasons, we can colour the cells depending on the number of neighbours they have:
def get_colors(status, count):
    cmap = mpl.cm.plasma
    rescale = count / 8
    colors = [cmap(neighbors) for neighbors in rescale.flatten()]
    cell_alive = status.flatten()
    colors = [(r, g, b, 0.9) if cell else (r, g, b, 0) for cell, (r, g, b, a) in zip(cell_alive, colors)]
    return colors
Now, after spending hours reading and understanding matplotlib.animation's API, we can use the class FuncAnimation
and tweak it a tiny bit
to create a function that returns an instance of FuncAnimation
.
Basically, what FuncAnimation
does (or at least how I understood it)
is that it initialises a figure (the board + initial conditions) and runs a
core loop that keeps updating and redrawing the points that
change at each frame:
def let_there_be_life(prob_life, size=(40, 40)):
    """ Implementing the Game of Life """
    board = create_board(size)
    stat = status(size, prob_life)
    fig, ax = plt.subplots(figsize=(8, 8))
    scatter = ax.scatter(*board, animated=True, s=70, edgecolor=None, c = '#ddc3a5')
    ax.set_facecolor('#ffcce7')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    def update(frame):
        nonlocal stat
        stat, alive_neighbors = life_rules(stat)
        colors = get_colors(stat, alive_neighbors)
        scatter.set_facecolor(colors)
        return scatter
    return FuncAnimation(fig, update, frames = 500)
4. LET THERE BE LIFE!
Now we are ready to run the following code:
anim = let_there_be_life(prob_life=0.6)
html = HTML(anim.to_html5_video())
and see life emerging, right in front of our eyes:
Optional: we can also save it as a gif, using the writer writer = PillowWriter(fps=25) (from matplotlib.animation) or writer = 'imagemagick':
import time
from matplotlib.animation import PillowWriter

writer = PillowWriter(fps=25)
tic = time.perf_counter()
anim.save("game_of_life.gif", writer=writer)
toc = time.perf_counter()
print(f"Produced the gif in {toc - tic:0.4f} seconds")
Produced the gif in 90.1506 seconds
5. Further development:
My teacher initially advised me to use Bokeh as it is more web-friendly, but I was too lazy I guess... or maybe I preferred to stick with something I was more familiar with. I would gladly welcome any feedback/remarks/improvements in that direction!
I am also currently working on deploying it as a web app using FastAPI or Flask; since producing the actual animation in matplotlib takes some time (up to 2 minutes), it seems it would be easier to do most of the work on the front end... let's see!
If you are interested in the codes I used, you can find everything in