Nobody expects the Spanish Inquisition!

- Monty Python

I am very familiar with and experienced in the main data analysis workflow: data inspection, data cleaning/preprocessing, data transformation, data mining and data modelling. Here are some projects using Exploratory Data Analysis or Confirmatory Data Analysis that I have done/worked on:

Building a forum Page view time-series visualizer with Python and pandas

Data Analysis
Time series
python
pandas

code Page view time-series visualizer

For this project we visualize time series data using a line chart, bar chart, and box plots to understand the patterns in visits and identify yearly and monthly growth.

To work through this task, we will use pandas, matplotlib and seaborn to visualize a dataset containing the number of page views each day on the freeCodeCamp forum from 2016-05-09 to 2019-12-03.

Let's import the main modules and libraries we will be working with:


  import matplotlib.pyplot as plt
  import pandas as pd
  import seaborn as sns

1. Reading and cleaning the data

We read our CSV file


  df = pd.read_csv('files/forum-pageviews.csv',
                   header=0,
                   names=['Date', 'Views'],
                   parse_dates=['Date'],
                   index_col='Date')

and make sure to filter out the days when the page views were in the top 2.5% or the bottom 2.5% of the dataset.


  mask = (df['Views'] >= df['Views'].quantile(0.025)) & (df['Views'] <= df['Views'].quantile(0.975))
  df_cln = df[mask]

            Views
Date
2016-05-19  19736
2016-05-26  18060
2016-05-27  19997
2016-05-28  19044
2016-05-29  20325

2. Plotting results

2.1 Daily views line plot:

We draw a line chart representing daily page views on the freeCodeCamp Forum from May 2016 to December 2019:
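
A minimal sketch of how this chart might be drawn (assuming the cleaned df_cln from above; colors and figure size are arbitrary choices):

  # Sketch: daily page views line chart (assumes df_cln from above)
  fig, ax = plt.subplots(figsize=(14, 5))
  ax.plot(df_cln.index, df_cln['Views'], color='red', linewidth=1)
  ax.set_title('Daily Forum Page Views, 5/2016 - 12/2019')
  ax.set_xlabel('Date')
  ax.set_ylabel('Page Views')
  plt.show()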

2.2 Monthly bar plot:

We copy and modify data for monthly bar plot:


  df_bar = df_cln.copy()
  df_bar['Month'] = df_bar.index.month_name()
  df_bar['Year'] = df_bar.index.year
  df_bar['month_index'] = df_bar.index.month
  df_bar = df_bar.sort_values(by='month_index')

And draw a bar plot using Seaborn that shows the average daily page views for each month grouped by year:
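
A sketch of how this bar plot might be produced (the grouping step is an assumption, using df_bar from above):

  # Sketch: average daily page views per month, grouped by year (assumes df_bar from above)
  df_grouped = df_bar.groupby(['Year', 'Month', 'month_index'])['Views'].mean().reset_index()
  df_grouped = df_grouped.sort_values('month_index')  # keep the legend in calendar order

  fig, ax = plt.subplots(figsize=(10, 7))
  sns.barplot(data=df_grouped, x='Year', y='Views', hue='Month', ax=ax)
  ax.set_xlabel('Years')
  ax.set_ylabel('Average Page Views')
  plt.show()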

2.3 Year/Month-wise box plot:

We first sort and prepare data for box plots:


  df_box = df_bar.copy()
  df_box['Month'] = df_box.index.month_name().str.slice(stop=3)

Then, we draw a box plot that shows how the values compare over time and how they are distributed within a given year:

or month:
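
A sketch of the two box plots, assuming df_box from above:

  # Sketch: year-wise and month-wise box plots (assumes df_box from above)
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

  sns.boxplot(data=df_box, x='Year', y='Views', ax=ax1)
  ax1.set_title('Year-wise Box Plot (Trend)')

  month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
  sns.boxplot(data=df_box, x='Month', y='Views', order=month_order, ax=ax2)
  ax2.set_title('Month-wise Box Plot (Seasonality)')

  plt.show()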

Here is the full code as a function :

    If you are interested in the code I used/wrote, you can find everything in
  • my GitHub repository : link
  • my Colab notebook : Open In Colab

Sea level prediction using linear regression

Data Analysis
Time series
python
pandas

In this project, we analyze a dataset of the global average sea level change since 1880 in order to predict the sea level change through the year 2050.

The data used here is the Global Average Absolute Sea Level Change, 1880-2014 dataset from the US Environmental Protection Agency, using data from CSIRO (2015) and NOAA (2015).

code Linear regression for time series prediction

We use linear regression to get the slope and y-intercept of the line of best fit. We then create a scatter plot with the line through the year 2050 to predict the sea level rise in 2050 and observe the rate of rise since the year 2000.

To work through these topics, we will use pandas, matplotlib and linregress from scipy.stats.

Let's import the main modules and libraries we will be working with:


  import pandas as pd
  import matplotlib.pyplot as plt
  from scipy.stats import linregress

1. Read data:

  df = pd.read_csv('files/epa-sea-level.csv', na_values = ['?'])

   Year  CSIRO Adjusted Sea Level  Lower Error Bound  Upper Error Bound  NOAA Adjusted Sea Level
0  1880                  0.000000          -0.952756           0.952756                      NaN
1  1881                  0.220472          -0.732283           1.173228                      NaN
2  1882                 -0.440945          -1.346457           0.464567                      NaN
3  1883                 -0.232283          -1.129921           0.665354                      NaN
4  1884                  0.590551          -0.283465           1.464567                      NaN

We can have a look at what the sampled data looks like:

As we see, the data is a bit scattered. We will use linear regression to find the model that best fits it. The goal is basically to find the linear model $y = ax + b$ that fits the data best, i.e. with the minimum possible loss. For that we use least squares regression, which calculates

  • the slope, given by the equation:

    $\displaystyle{ a = \frac{ N \sum xy - \sum x \sum y}{N \sum x^2 - \left( \sum x \right)^2 } }$
  • the y-intercept, given by the equation:

    $\displaystyle{ b = \frac{ \sum y - a \sum x}{N} }$

We use the linregress function from scipy.stats to get the slope and y-intercept of the line of best fit:


  X = df['Year']
  Y = df['CSIRO Adjusted Sea Level']
  model = linregress(X, Y)

  (a, b) = (model.slope, model.intercept)

And we calculate the coordinates of the line of best fit:


  X_1 = list(range(1880, 2051))
  Y_1 = [ a * x + b for x in X_1 ]

We plot the line of best fit over the scatter plot, making it pass through our desired prediction year (i.e. x = 2050):
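
A sketch of that plot, assuming df, X_1 and Y_1 from above:

  # Sketch: scatter plot of the observations with the extended line of best fit
  fig, ax = plt.subplots(figsize=(12, 6))
  ax.scatter(df['Year'], df['CSIRO Adjusted Sea Level'], s=10, label='Observed data')
  ax.plot(X_1, Y_1, color='red', label='Line of best fit')
  ax.set_xlabel('Year')
  ax.set_ylabel('Sea Level (inches)')
  ax.legend()
  plt.show()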

2. Comparison of sea rise rate:

Now we want to fit a new line of best fit using only the data from the year 2000 onwards:


  mask = (df['Year'] >= 2000)
  X_rec = df[mask]['Year']
  Y_rec = df[mask]['CSIRO Adjusted Sea Level']
  model_rec = linregress(X_rec, Y_rec)

  (a_rec, b_rec) = (model_rec.slope, model_rec.intercept)

and still make it extend to 2050, to see if the rate of rise continues as it has since 2000:


  X_2 = list(range(2000, 2051))
  Y_2 = [ a_rec * x + b_rec for x in X_2]

We can then observe the difference, and how the rate of sea level rise went up:
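
We can also quantify this directly with the two slopes (a sketch using the fits above):

  # Sketch: compare the estimated rate of rise (inches per year) of the two fits
  print('rate over the whole period : {:.4f} in/year'.format(a))
  print('rate since 2000            : {:.4f} in/year'.format(a_rec))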

It suggests that the sea level in 2050 will be approximately:


  print('{:.2f} inch ~ {:.2f} cm'.format(Y_2[-1], Y_2[-1] * 2.54) )
  
  >>>
      15.22 inch ~ 38.65 cm

Here is the full code as a function:

    If you are interested in the code I used/wrote, you can find everything in
  • my GitHub repository : link
  • my Colab notebook : Open In Colab
I am very familiar with and experienced in the main machine learning workflow: I have worked with TensorFlow and Keras as well as with PyTorch, and I am familiar with:
  • Regression algorithms: linear regression, logistic regression, ...
  • Computer vision
  • Natural Language Processing
  • Data Mining
  • Deep learning: CNN, RNN and LSTM

Here are some Machine learning projects I did/worked on myself, during my studies and/or for learning/practice.

COVID-19 fake news detection using Natural Language Processing

11th December 2020

Machine learning
Deep learning
python
NLP

One of the shared tasks of CONSTRAINT 2021, the First Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, was about "Fighting an Infodemic: COVID-19 Fake News". Here is my modest participation in the CodaLab competition COVID19 Fake News Detection in English.

code Building a tweet classification algorithm using Logistic Regression

The Fighting an Infodemic: COVID-19 Fake News dataset contains tweets along with their associated binary labels: real for real/verified COVID-19 related information and fake for hoaxes/fake news.

To work through this task we follow the usual workflow:

  • Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
  • Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
  • Model evaluation and comparison - we evaluate our models using some metrics and compare them to find the best one.
  • Model experimentation - we try to improve the model through experimentation.

Let's import the main modules and libraries we will be working with:

  
    ## Data analysis libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import collections

    ## Scikit Learn Models:
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from xgboost import XGBClassifier

    ## Feature extraction
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.feature_extraction import stop_words

    ## text preprocessing
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    import re
    import string
    import emoji

    ## Scikit Learn Model evaluators:
    from sklearn.model_selection import train_test_split, cross_val_score, KFold
    from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
    from sklearn import metrics
    from sklearn.metrics import confusion_matrix, classification_report, roc_curve
    from sklearn.metrics import precision_score, recall_score, f1_score
    from sklearn.pipeline import Pipeline, FeatureUnion
  
  

1. Data analysis:

We import our csv files:


  train_df = pd.read_csv('files/Constraint_English_Train - Sheet1.csv', index_col = 'id',)
  val_df = pd.read_csv('files/Constraint_English_Val - Sheet1.csv', index_col = 'id')
  test_df = pd.read_csv('files/english_test_with_labels - Sheet1.csv', index_col = 'id')
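
The main_df used below is not defined in the snippets shown here; presumably it is the concatenation of the training and validation sets, along these lines (a sketch):

  # Sketch: main_df is presumably the training and validation sets combined
  main_df = pd.concat([train_df, val_df], ignore_index=True)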

The data consists of 8,560 tweets with the label real for real news and fake for the fake ones:

                                               tweet label
0  Nancy Pelosi said, ???if you accept a check fr...  fake
1  RT @CDCHeart_Stroke: #HeartAttacks and #stroke...  real
2  The SARS-CoV-2 has been engineered by man edit...  fake
3  We expect to complete this by the end of the d...  real
4  RT @PIB_India: Cases per million population in...  real

We will perform some statistical analysis on the dataset and compute descriptive statistics in order to better understand our data. For each tweet, we compute:

  • count words : number of words in the tweet

      count_words = main_df['tweet'].apply(lambda x: len(re.findall(r'\w+', x)))
      main_df['count_words'] = count_words

        
  • count mentions : referrals to other Twitter accounts, which are preceded by a `@`

      count_mentions = main_df['tweet'].apply(lambda x: len(re.findall(r'@\w+', x)))
      main_df['count_mentions'] = count_mentions

  • count hashtags : number of tag words, preceded by a #

      count_hashtags = main_df['tweet'].apply(lambda x: len(re.findall(r'#\w+', x)))
      main_df['count_hashtags'] = count_hashtags


    Most of the tweets do not contain hashtags, and the few that do tend to be real; it may make a difference during training whether or not we strip them out in text processing.

  • count capital words : number of uppercase words, which could be used to somehow "shout accusations" and "denounce conspiracy"

      count_caps = main_df['tweet'].apply(lambda x: len(re.findall(r'\b[A-Z]{2,}\b', x)))
      main_df['count_caps'] = count_caps


    Most of the tweets do not contain capitalized words, so it probably won't have much effect if we lower() our texts.

  • count exclamation/question marks : number of question or exclamation marks

      count_excl_quest_marks = main_df['tweet'].apply(lambda x: len(re.findall(r'!|\?', x)))
      main_df['count_excl_quest_marks'] = count_excl_quest_marks


    The fake tweets seem to be using a bit more exclamation or question marks than the real ones.

  • count URLs : number of links in the tweet, preceded by http(s), which might be a source that can be verified/denied

      count_urls = main_df['tweet'].apply(lambda x: len(re.findall(r'http.?://[^\s]+[\s]?', x)))
      main_df['count_urls'] = count_urls


    It seems that real tweets use URLs more than fake ones, which makes sense; taking into account the tweets with verifiable URLs might help our classification.

  • count emojis : finally, we count the number of emojis in each tweet (using the emoji module)

      count_emojis = main_df['tweet'].apply(lambda x: emoji.demojize(x)).apply(lambda x: len(re.findall(r':[a-z_&]+:', x)))
      main_df['count_emojis'] = count_emojis


    We can see that almost none of the tweets use emojis, hence it would not make much of a difference to leave them in.

2. Data modelling

Here we will build and train a model in order to predict our label variable:

2.1. Text processing:

We first convert the class labels into binary numerical variables for convenience:


  main_df['label'] = main_df.label.map({'real':1, 'fake':0})

We write a function that cleans and preprocesses our tweets in order to feed them to our model:


  stopwords = stop_words.ENGLISH_STOP_WORDS

  def clean(text):
      """Takes a string of text and preprocesses it: removes the hashtag and retweet
      markers, punctuation and numbers, lowercases the text and removes stop words."""
      # text = re.sub(r'@\w+', '', text)  # remove mentions
      text = text.replace("#", "")   # remove the hashtag character
      text = text.replace("RT", "")  # remove the retweet mention
      text = text.lower()            # lowercase
      text = "".join([char for char in text if char not in string.punctuation and not char.isdigit()])  # remove punctuation and numbers
      text = " ".join([token for token in text.split() if token not in stopwords])  # remove stop words
      # porter = PorterStemmer()
      # text = " ".join([porter.stem(word) for word in text.split()])  # stemming (it actually works better without)

      return text

Here is an example of how the clean function works, applied to a tweet from one of my favourite Twitter accounts:

  
  text = '@KyloR3n hello @hottopic i want a shirt that says I’m a different person but it should be the same color and basically the same shirt'

  clean(text)
  
  >>>
      hello want shirt say i’m differ person color basic shirt
  
  

Finally, we apply the cleaning to all the tweets and drop the raw text column:


  clean_tweet = main_df['tweet'].apply(clean)
  main_df['clean_tweet'] = clean_tweet
  main_df = main_df.drop(['tweet'], axis = 1)


We can also look at the distribution of word frequencies across all the tweets:


  vect = CountVectorizer()
  bag = vect.fit_transform(clean_tweet)
  word_freq = dict(zip(vect.get_feature_names(), np.asarray(bag.sum(axis=0)).ravel()))
  word_count = collections.Counter(word_freq)
  word_count_df = pd.DataFrame(word_count.most_common(20), columns = ['word', 'frequency'])


2.2. Model selection

Our problem is a classification task: we are predicting a categorical variable, label, with fewer than 100,000 labelled samples. Following Scikit-Learn's documentation, we opt for the following models:


  X_train, X_val, y_train, y_val = train_test_split(main_df.drop(['label'], axis=1), main_df.label, test_size=0.2, random_state=1729)


In order to see which model performs best, we proceed as follows: we instantiate each model in a dictionary


  models = {'logReg': LogisticRegression(max_iter = 1000),
            'MultiNB': MultinomialNB(),
            'lSVC': LinearSVC(class_weight='balanced'),
            'XGB': XGBClassifier(use_label_encoder=False)}

and we create a function that preprocesses and vectorizes the training and validation data, then fits and scores each model:


  def mod_score(models, X_train, X_test, y_train, y_test, vect=CountVectorizer()):
    '''
    Takes a dictionary of models, fits each on a training set and returns its score on a test set
    '''
    # empty dict to append to
    results = {}
    # feature extraction:
    X_train_vect = vect.fit_transform(X_train.clean_tweet)
    # transform testing data (using training data's features)
    X_test_vect = vect.transform(X_test.clean_tweet)
    for name, model in models.items():
        model.fit(X_train_vect, y_train)
        results[name] = model.score(X_test_vect, y_test)

    return results

and we observe the results:

  mod_score(models, X_train, X_val, y_train, y_val)

LinearSVC and LogisticRegression both achieved an accuracy of 92%! That sounds great.

3. Hyper-parameter tuning

This is arguably the most expensive, time- and resource-consuming step of the workflow. Since both the vectorizers and the classification models have tunable parameters, and given the small size of our sample (only 8,560 tweets), we use GridSearchCV to select the best parameters with 5-fold cross-validation.

We use the following grid parameters:


  ## Vectorizers hyperparameters:
  # BOW hyperparameters
  vect_grid = {'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
               'features__pipe__vect__min_df': (1,2) }
  # TF-IDF hyperparameters
  tfidf_grid = {'features__pipe__vect__analyzer' : ('word', 'char'),
                'features__pipe__vect__ngram_range': ((1, 3), (3, 7), (5, 7), (2, 2), (1, 2)),
                'features__pipe__vect__binary' : (True, False) }

  ## Classifiers hyperparameters:
  # logReg hyperparameters
  logReg_grid = {'model__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
                 'model__solver': ['lbfgs', 'liblinear'],
                 'model__penalty' : ('l1', 'l2')}

  # MultinomialNB hyperparameters
  MultiNB_grid = {'model__alpha': (0.25, 0.5, 0.75)}

  # SVC hyperparameters
  lSVC_grid = {'model__C': (.05, .12, .25, .5, 1, 2, 4)}

We will use two vectorizing methods: the bag-of-words vectorizer CountVectorizer (vect=CountVectorizer()) and the Term Frequency - Inverse Document Frequency vectorizer TF-IDF (vect=TfidfVectorizer()).
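
For reference, the grid keys above imply a Pipeline whose features step is a FeatureUnion wrapping a sub-pipeline named pipe with a vectorizer step vect, and whose classifier step is named model. Here is a minimal sketch of how such a search might be assembled (the TextSelector transformer and the exact structure are assumptions on my part):

  ## Sketch: a pipeline/grid-search setup consistent with the parameter names above
  class TextSelector(BaseEstimator, TransformerMixin):
      """Hypothetical transformer that selects the clean_tweet column from the DataFrame."""
      def __init__(self, key):
          self.key = key
      def fit(self, X, y=None):
          return self
      def transform(self, X):
          return X[self.key]

  pipe = Pipeline([
      ('features', FeatureUnion([
          ('pipe', Pipeline([('select', TextSelector('clean_tweet')),
                             ('vect', TfidfVectorizer())]))])),
      ('model', LogisticRegression(max_iter=1000))])

  grid = GridSearchCV(pipe,
                      param_grid={**tfidf_grid, **logReg_grid},
                      cv=5, scoring='accuracy', n_jobs=-1,
                      error_score=np.nan)  # invalid combinations (e.g. l1 with lbfgs) just score NaN
  grid.fit(X_train, y_train)
  print(grid.best_params_, grid.best_score_)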

         accuracy_Cvect  accuracy_TF-IDF
logReg          93.049%          94.217%
lSVC            93.166%          94.042%
MultiNB         93.283%          91.998%
XGB             88.084%          87.266%

We clearly see that the best results were achieved by the following parametrization:


  tfidf_cstm_lorReg = TfidfVectorizer(binary=True, max_df=0.5, min_df=1, ngram_range=(1, 2))
  logReg_tfidf_cstm = LogisticRegression(max_iter = 1000, C=1000, penalty='l2', solver='liblinear')

4. Model evaluation

Now that we have seen a slight improvement in our model's performance, we can evaluate it on the test dataset. We first prepare our test data:


  test_df = test_df.reset_index().drop('id', axis=1)
  test_df['label'] = test_df.label.map({'real':1, 'fake':0})
  test_df['clean_tweet'] = test_df['tweet'].apply(clean)
  test_df = test_df.drop(['tweet'], axis = 1)

  

and we evaluate our model on the test data:


  X_train, X_test, y_train, y_test = main_df.drop(['label'], axis=1), test_df.drop(['label'], axis=1), main_df.label, test_df.label

  # Preprocess and Vectorize the data:
  X_train_tfidfv = tfidf_cstm_lorReg.fit_transform(X_train.clean_tweet)
  X_test_tfidfv = tfidf_cstm_lorReg.transform(X_test.clean_tweet)

  # Instantiate best model with best hyperparameters
  history = logReg_tfidf_cstm.fit(X_train_tfidfv, y_train)

  # Make predictions
  y_preds = history.predict(X_test_tfidfv)

4.1. Confusion matrix:

We can get a visual idea of our model's performance by looking at where the model nailed its predictions and where it failed:


  conf_mtrx = confusion_matrix(y_test, y_preds)
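
A sketch of how this matrix might be visualized with seaborn (the class labels follow the real:1, fake:0 mapping above):

  # Sketch: plot the confusion matrix as a heatmap (assumes conf_mtrx from above)
  fig, ax = plt.subplots(figsize=(5, 4))
  sns.heatmap(conf_mtrx, annot=True, fmt='d', cbar=False,
              xticklabels=['fake', 'real'], yticklabels=['fake', 'real'], ax=ax)
  ax.set_xlabel('Predicted label')
  ax.set_ylabel('True label')
  plt.show()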

4.2. ROC curve and AUC scores:

We can also compare the true positive rate (TPR) to the false positive rate (FPR), i.e. real tweets correctly flagged as real vs. fake tweets wrongly flagged as real:
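
A sketch, assuming the fitted model and the vectorized test set from above (roc_auc_score is an extra import):

  # Sketch: ROC curve and AUC for the tuned logistic regression
  from sklearn.metrics import roc_auc_score

  y_probs = history.predict_proba(X_test_tfidfv)[:, 1]   # probability of the 'real' class
  fpr, tpr, thresholds = roc_curve(y_test, y_probs)

  fig, ax = plt.subplots(figsize=(6, 5))
  ax.plot(fpr, tpr, label='AUC = {:.3f}'.format(roc_auc_score(y_test, y_probs)))
  ax.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
  ax.set_xlabel('False Positive Rate')
  ax.set_ylabel('True Positive Rate')
  ax.legend()
  plt.show()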

4.3. Classification report:

And finally, we have a look at the main evaluation metrics:


  # Create a classification report using the classification_report function
  report = classification_report(y_test, y_preds, output_dict=True)

  pd.DataFrame(report).T

              precision    recall  f1-score      support
0              0.939303  0.925490  0.932346  1020.000000
1              0.933040  0.945536  0.939246  1120.000000
accuracy       0.935981  0.935981  0.935981     0.935981
macro avg      0.936172  0.935513  0.935796  2140.000000
weighted avg   0.936025  0.935981  0.935957  2140.000000

Conclusion

We have reached a nice result: an accuracy of over 93% and a recall of around 94%. We can manually try our model out with some random made-up tweets and see how it performs. Here are some random tweets I found by following the hashtag #COVID:


  tweet1 = """The charity @TEDSliverpool says money raised in memory of Claudia Walsh
  means it'll be able to help more people with mental health issuesClaudia died with
  #coronoavirus on her 25th birthday. She also volunteered at @WhitechapelLiv. More
  than £21,000 has already been raised"""

  tweet2 = """Canadian-made SARS-CoV-2 detector set to ship worldwide | $KNR $KNR.ca
  #biocloud #smartbuildings #smartfactory #kontrol #smartcity #smartcities  #ontario
  #canada #safespaces #madeincanada #innovation #ai #iot #covid19 #covidー19 #sarscov2
  #coronoavirus https://bit.ly/3mXWV2O"""

  tweet3 = """#Coronoavirus : "Indigenous people are stepping up as volunteer firefighters,
  but they are now doubly strained: Closed borders have shriveled their income from sustainably
  harvested forest exports" @ACOFOP #Guatemala @alianzabosques https://apnews.com/e3cddd53e453a2"""

  tweet4 = """Masih Alinejad says Government hid news of #coronoavirus for political purposes"""

  
and we can test our model and predict the veracity of the tweets:

  # Clean and vectorize the tweets, then make predictions
  vect = tfidf_cstm_lorReg.transform([clean(t) for t in [tweet1, tweet2, tweet3, tweet4]])
  y_preds = ['real' if v == 1 else 'fake' for v in history.predict(vect).tolist() ]

  y_preds
  >>>
      ['fake', 'real', 'fake', 'fake']
  
    

Further development

Maybe try deep learning? A CNN using word embeddings with LogReg's feature extraction... or roBert/daBert? Ernie 2.0?

    If you are interested in the code I used, you can find everything in
  • my GitHub repository : link
  • my Colab notebook : Open In Colab

Dog breed classification using convolutional neural networks 🐕

25th August 2019

Machine learning
Deep learning
Computer vision
python
TensorFlow

In this multi-class image classification project, we use deep learning, and particularly Convolutional Neural Networks, to build a model that will help us identify up to 120 different breeds of dogs 🐩

We will use the Kaggle dog breed identification competition dataset, which consists of a collection of 10,000+ labelled images of 120 different dog breeds.

code Multi-class image classification:

In image classification problems, Convolutional Neural Networks (CNNs for short) are among the most popular and effective neural network models to consider, especially when combined with data augmentation.

Here is a short summary of what we will be doing:

  • Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
  • Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
  • Model evaluation and comparison - we evaluate our models using some metrics and compare them to find the best one.
  • Model experimentation - we try to improve the model through experimentation, starting with 1000 images to make sure it works.

To work through these topics, we will use TensorFlow and Keras:

Let's import the main modules and libraries we will be working with:


  # Data analysis libraries
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns
  from scipy.stats import linregress, norm

  ## Tensorflow:
  import tensorflow as tf
  import tensorflow_hub as hub

  ## keras model layers and callback:
  import keras
  from keras import models
  from keras.models import Sequential, model_from_json
  from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout, GlobalAveragePooling2D, BatchNormalization, InputLayer, Lambda, Input
  from keras.callbacks import TensorBoard, ModelCheckpoint
  from keras.optimizers import Adam, RMSprop
  from keras.models import Model, Input

  ## Keras models
  from keras import applications
  from keras.applications.inception_v3 import InceptionV3
  from keras.applications.nasnet import NASNetLarge
  from keras.applications.inception_resnet_v2 import InceptionResNetV2
  from keras.applications.xception import Xception

  ## Scikit Learn Model evaluators:
  from sklearn.model_selection import train_test_split, cross_val_score, KFold
  from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
  from sklearn import metrics
  from sklearn.metrics import confusion_matrix, classification_report
  from sklearn.metrics import precision_score, recall_score, f1_score

  ## Loading images
  import os
  from urllib import request
  from io import BytesIO

  ## Preprocessing:
  from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

    

1. Data analysis:

We start by reading our CSV file using pandas :


  train_df = pd.read_csv("files/dog_breed/files/labels.csv")
  train_df.describe()
  train_df.head()

    

We have 10,222 image IDs in total, covering 120 breeds. Our DataFrame looks like this, with each row holding an image ID and its breed:

                                 id             breed
0  000bec180eb18c7604dcecc8fe0dba07       boston_bull
1  001513dfcb2ffafc82cccf4d8bbaba97             dingo
2  001cdf01b096e06d78e9e5112d419397          pekinese
3  00214f311d5d2247d5dfe4fe24b2303d          bluetick
4  0021f9ceb3235effd7fcde7f7538ed62  golden_retriever

We check how these labels are distributed: we have an average of 82 pictures per breed.
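
A sketch of how this distribution can be inspected (assuming train_df from above):

  # Sketch: number of images per breed
  breed_counts = train_df['breed'].value_counts()
  print(breed_counts.describe())  # mean, min and max number of images per breed

  breed_counts.plot(kind='bar', figsize=(18, 5), title='Images per breed')
  plt.show()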

1.1. Data pre-processing: Labels

We first apply one-hot encoding: we encode our labels into a boolean format


  labels = np.array(train_df['breed'])
  breeds = np.unique(labels)

  enc_labels = [ label == np.array(breeds) for label in labels ]

and we use Scikit-Learn's train_test_split to create a validation set. Here, we will first train our model on a subset of the data to see how it performs:

  # dataset variables
  X = img_name
  y = enc_labels

  # set a size parameter
  SUBSET_SIZE = 1000

  # set random seed
  np.random.seed(1729)

  # create validation set for the subset_data:
  X_train, X_val, y_train, y_val = train_test_split(X[:SUBSET_SIZE],
                                                    y[:SUBSET_SIZE],
                                                    test_size = 0.2)

1.2. Data pre-processing: Images

We want to create a function that would process our images: TensorFlow's documentation suggests that we convert our data into batches of (image, label) tensors:

  
    # Choose a size
    IMG_SIZE = 224

    def img_process(img_path) :
        '''
        Processes and converts an image path into a tensor
        '''
        # read image path
        img = tf.io.read_file(img_path)
        #Tensorize the image
        img = tf.image.decode_jpeg(img, channels = 3)
        #Normalize the tensor
        img = tf.image.convert_image_dtype(img, tf.float32)
        #resize the image
        img = tf.image.resize(img, size = [IMG_SIZE, IMG_SIZE])

        return img

    def img_label_get(img_path, label):
        '''
        Processes and converts an image path into a tensor and returns a tuple (image, label)
        '''
        img = img_process(img_path)
        return img, label
  
  

and we create a function that will create small batches:

  
    # Define a batch size
    BATCH_SIZE = 32

    def Batch(img, label=None, batch_size = BATCH_SIZE, valid_data = False, test_data = False):
        '''
          Creates batches of data out of an (image, label) pairs.
          Shuffles the data only if it's training data.
          Accepts test data as input (no labels).
        '''
        # valid dataset, no need to shuffle
        if valid_data == True :
            print("Creating validation data batches...")
            data = tf.data.Dataset.from_tensor_slices((tf.constant(img), # filepaths
                                                       tf.constant(label)) #labels
                                                      ).cache().prefetch(tf.data.AUTOTUNE)
            data_batch = data.map(img_label_get).batch(BATCH_SIZE)
            print('Done!')
            return data_batch

        # test dataset, no labels
        if test_data == True :
            print("Creating test data batches...")
            data = tf.data.Dataset.from_tensor_slices((tf.constant(img))) # no labels
            data_batch = data.map(img_process).batch(BATCH_SIZE)
            print('Done!')
            return data_batch

        else:
            #Tensorize and shuffle training dataset
            print("Creating training data batches...")
            data = tf.data.Dataset.from_tensor_slices((tf.constant(img),
                                                      tf.constant(label))
                                                     ).cache().shuffle(buffer_size = len(img)).prefetch(tf.data.AUTOTUNE)
            data = data.map(img_label_get)
            data_batch = data.batch(BATCH_SIZE)
            print('Done!')
            return data_batch
  
  

Now, we batch our data sets:


  train_data = Batch(X_train, y_train)
  val_data = Batch(X_val, y_val, valid_data=True)

Our batches now have the following attributes:

((TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
  TensorSpec(shape=(None, 120), dtype=tf.bool, name=None))

2. Data modeling:

We will test our dataset and our modelling approach using transfer learning: we will use a well-known pretrained model suited for dog-related classification, MobileNet V2 from TensorFlow Hub.

2.1. Instantiate a pre-trained model:
  
    # Setup input shape
    INPUT_SHAPE = [None, 224, 224, 3] # batch, height, width, colour channels

    # Setup output shape
    OUTPUT_SHAPE = len(breeds) # number of unique labels

    #pre trained model from tensorflow_hub:
    MODEL_URL = 'https://tfhub.dev/google/imagenet/mobilenet_v2_140_224/classification/4'
  
  

We create a function that will build a keras model from our inputs:

  
    def build_model(input_shape=INPUT_SHAPE, output_shape=OUTPUT_SHAPE, model_url=MODEL_URL):
      '''
      Creates and builds and returns a keras model from input shape, output shape and url
      '''
      print("Building model with:", model_url.split('/')[5])

      # Setup the layers
      model = Sequential([
      hub.KerasLayer(model_url), # input layer
      Dense(output_shape, activation="softmax")]) # output layer

      # Compile the model
      opt = Adam(lr=0.001) # gradient descent optimizer
      model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

      # Build the model
      model.build(input_shape) # Let the model know what kind of inputs it'll be getting

      return model
  
  

And we build our model:
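
Presumably something along these lines (a sketch):

  model = build_model()
  model.summary()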



  Building model with: mobilenet_v2_140_224
  Model: "sequential"
  _________________________________________________________________
  Layer (type)                 Output Shape              Param #
  =================================================================
  keras_layer (KerasLayer)     (None, 1001)              6158505
  _________________________________________________________________
  dense (Dense)                (None, 120)               120240
  =================================================================
  Total params: 6,278,745
  Trainable params: 120,240
  Non-trainable params: 6,158,505

2.2. Train its upper dense layer:

  NUM_EPOCHS = 100

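A minimal sketch of this step, assuming the build_model function and the train_data / val_data batches defined above (the early-stopping callback is an assumption of mine):

  # Sketch: fit the dense head on the 1,000-image subset
  model = build_model()

  early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)

  history = model.fit(train_data,
                      epochs=NUM_EPOCHS,
                      validation_data=val_data,
                      callbacks=[early_stopping])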

Training the model, the first impression we get is that it is clearly overfitting: its performance on the training set far exceeds its results on the validation set:



  ...
  Epoch 7/100
  25/25 [==============================] - 5s 216ms/step - loss: 0.0757 - accuracy: 1.0000 - val_loss: 1.2328 - val_accuracy: 0.6750
  Epoch 8/100
  25/25 [==============================] - 5s 215ms/step - loss: 0.0581 - accuracy: 1.0000 - val_loss: 1.2111 - val_accuracy: 0.6800
  ...
  Epoch 13/100
  25/25 [==============================] - 5s 216ms/step - loss: 0.0266 - accuracy: 1.0000 - val_loss: 1.1501 - val_accuracy: 0.6950
  Epoch 14/100
  25/25 [==============================] - 5s 215ms/step - loss: 0.0235 - accuracy: 1.0000 - val_loss: 1.1402 - val_accuracy: 0.6950

Now, we train the model on the whole dataset (preferably using a GPU):
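
A sketch of how this might look, reusing the Batch function and build_model from above on the full list of image paths X and encoded labels y (the split and callback are assumptions):

  # Sketch: split and batch the full dataset, then retrain the model
  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

  train_data = Batch(X_train, y_train)
  val_data = Batch(X_val, y_val, valid_data=True)

  model = build_model()
  model.fit(train_data,
            epochs=NUM_EPOCHS,
            validation_data=val_data,
            callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)])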


  model.evaluate(val_data)

We see that we reach an accuracy of 82% with a loss of 0.64



  64/64 [==============================] - 793s 12s/step - loss: 0.6375 - accuracy: 0.8152

It's time to make some predictions


  predictions = model.predict(val_data, verbose=1)

and see how our model is doing. We can start by printing out a classification report:


  
    report = classification_report(np.argmax(predictions,axis=1), np.argmax(yf_val, axis=1), target_names=breeds, output_dict=True)

    report_df = pd.DataFrame(report).T
    report_df
  
  
                                precision    recall  f1-score      support
affenpinscher                    0.777778  0.875000  0.823529    16.000000
afghan_hound                     0.931034  0.964286  0.947368    28.000000
african_hunting_dog              1.000000  1.000000  1.000000    16.000000
airedale                         0.964286  0.900000  0.931034    30.000000
american_staffordshire_terrier   0.692308  0.450000  0.545455    20.000000
...                                   ...       ...       ...          ...
wire-haired_fox_terrier          0.857143  0.600000  0.705882    20.000000
yorkshire_terrier                0.750000  0.631579  0.685714    19.000000
accuracy                         0.815159  0.815159  0.815159     0.815159
macro avg                        0.804588  0.810483  0.800445  2045.000000
weighted avg                     0.829992  0.815159  0.815969  2045.000000

Our model feels more confident about guessing the following breeds:

  
    report_df['precision'].sort_values(ascending = False)[:10]
  
  >>>
      japanese_spaniel               1.0
      soft-coated_wheaten_terrier    1.0
      african_hunting_dog            1.0
      curly-coated_retriever         1.0
      papillon                       1.0
      west_highland_white_terrier    1.0
      clumber                        1.0
      chow                           1.0
      saint_bernard                  1.0
      cardigan                       1.0
      Name: precision, dtype: float64
  

and way less about guessing the following ones:


  
    report_df['precision'].sort_values(ascending = True)[:10]
  
  
  >>>
      eskimo_dog                   0.181818
      miniature_poodle             0.357143
      walker_hound                 0.400000
      border_collie                0.400000
      boxer                        0.466667
      staffordshire_bullterrier    0.466667
      chihuahua                    0.523810
      lakeland_terrier             0.555556
      flat-coated_retriever        0.588235
      malamute                     0.600000
      Name: precision, dtype: float64
  

It does not seem too bad! Although, as we saw in the subset training, our model may be overfitting and these results might be misleading. Let's try to improve our model and use a more accurate evaluation:

3. Model tuning

3.1. Data augmentation:

One way to avoid overfitting is data augmentation: we take the training images and manipulate (crop, resize) or distort them (flip, rotate) to create more generalized patterns for the model to learn from.

NOTE: Data augmentation is an augmentation in the sense that it increases the generalizability of the model, NOT an actual addition to the dataset, i.e. it does not increase the size of the dataset per se.

We will use the ImageDataGenerator class from keras.preprocessing.image, which generates batches of tensor image data with real-time data augmentation:


  datagen = ImageDataGenerator(rescale=1./255,
                               validation_split=0.2,
                               rotation_range=45, # randomly rotate pictures
                               width_shift_range=0.2, # randomly translate pictures horizontally
                               height_shift_range=0.2, # randomly translate pictures vertically
                               shear_range=0.2, # randomly apply shearing transformations
                               zoom_range=0.25, # randomly zoom inside pictures
                               horizontal_flip=True, # randomly flip half the images horizontally
                               fill_mode='nearest')

  

This is what a batch of augmented images looks like:
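
A sketch of how a few augmented samples might be displayed (train_dir, the folder holding the training images, is assumed here):

  # Sketch: show one training image under several random augmentations
  sample_path = os.path.join(train_dir, train_df['id'][0] + '.jpg')  # train_dir: image folder (assumed)
  sample = np.expand_dims(img_to_array(load_img(sample_path, target_size=(IMG_SIZE, IMG_SIZE))), axis=0)

  fig, axes = plt.subplots(1, 6, figsize=(18, 3))
  for ax, batch in zip(axes, datagen.flow(sample, batch_size=1)):
      ax.imshow(batch[0])  # rescaled to [0, 1] by the generator
      ax.axis('off')
  plt.show()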

3.2. Dropouts:

Dropout is another technique that helps prevent overfitting by regularizing deep neural networks: https://jmlr.org/papers/v15/srivastava14a.html
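
The summary below corresponds to a head with a 256-unit dense layer followed by dropout, mirroring the model built in the cross-validation code further down; a sketch:

  # Sketch: add a Dense + Dropout head on top of the pretrained feature extractor
  model_drop = Sequential([
      hub.KerasLayer(MODEL_URL),                  # pretrained MobileNetV2 base
      Dense(256, activation='relu'),
      Dropout(0.5),                               # randomly drop half of the units during training
      Dense(OUTPUT_SHAPE, activation='softmax')]) # output layer

  model_drop.compile(optimizer=Adam(lr=0.001),
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])
  model_drop.build(INPUT_SHAPE)
  model_drop.summary()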

    
    
  Model: "sequential"
  _________________________________________________________________
  Layer (type)                 Output Shape              Param #
  =================================================================
  keras_layer (KerasLayer)     (None, 1001)              6158505
  _________________________________________________________________
  dense (Dense)                (None, 256)               256512
  _________________________________________________________________
  dropout (Dropout)            (None, 256)               0
  _________________________________________________________________
  dense_1 (Dense)              (None, 120)               30840
  =================================================================
  Total params: 6,445,857
  Trainable params: 287,352
  Non-trainable params: 6,158,505
    
  
3.3. K-fold cross validation:

We now evaluate our model on a 5-fold cross validation process. This will ensure that the estimation and evaluation of our model has the least bias. Here is a big pipeline-like code where we put everything together:


  VALIDATION_ACCURACY = []
  VALIDATION_LOSS = []

  fold = 1
  kf = KFold(n_splits = 5, shuffle = True, random_state = 1729)

  # Set up manual K-fold cross validation:
  for train_index, test_index in kf.split(train_df):
    trainData = train_df.iloc[train_index]
    testData = train_df.iloc[test_index]
    print('Initializing Kfold %s'%str(fold))
    print('Train shape:',trainData.shape)
    print('Test shape:',testData.shape)
    epochs = 100

    # prepare training and validation generators
    train_datagen = ImageDataGenerator(rescale=1./255,
                                       validation_split=0.2,
                                       rotation_range=45, # randomly rotate pictures
                                       width_shift_range=0.2, # randomly translate pictures horizontally
                                       height_shift_range=0.2, # randomly translate pictures vertically
                                       shear_range=0.2, # randomly apply shearing transformations
                                       zoom_range=0.25, # randomly zoom inside pictures
                                       horizontal_flip=True, # randomly flip half the images horizontally
                                       fill_mode='nearest')
    valid_datagen = ImageDataGenerator(rescale=1. / 255)

    # Create a flow from dataframe for training data
    train_gen = train_datagen.flow_from_dataframe(dataframe=trainData,
                                                  directory=train_dir,
                                                  x_col="id",
                                                  y_col="breed",
                                                  subset="training",
                                                  shuffle=True,
                                                  seed=1729,
                                                  target_size=(IMG_SIZE, IMG_SIZE),
                                                  batch_size=BATCH_SIZE,
                                                  class_mode="categorical")
    valid_gen = train_datagen.flow_from_dataframe(dataframe=trainData,
                                                  directory=train_dir,
                                                  x_col="id",
                                                  y_col="breed",
                                                  subset="validation",
                                                  target_size=(IMG_SIZE, IMG_SIZE),
                                                  batch_size=BATCH_SIZE,
                                                  class_mode="categorical")

    # Create the model
    model_full_aug = Sequential([
    hub.KerasLayer(MODEL_URL), # input layer
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(OUTPUT_SHAPE, activation="softmax")]) # output layer

    # Compile the model
    opt = Adam(lr=0.001) # Gradient descent ptimizer
    model_full_aug.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

    # Build the model
    model_full_aug.build(INPUT_SHAPE)

    # Setting callbacks
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=8)
    checkpoint = tf.keras.callbacks.ModelCheckpoint(model_dir+get_model_name(fold),
                                                    monitor='val_accuracy',
                                                    verbose=1,
                                                    save_best_only=True,
                                                    mode='max')

    # Train the model
    history_aug = model_full_aug.fit(x = train_gen,
                                    steps_per_epoch = train_gen.n // train_gen.batch_size,
                                    epochs = NUM_EPOCHS,
                                    validation_data = valid_gen,
                                    validation_steps = valid_gen.n // valid_gen.batch_size,
                                    callbacks = [early_stopping, checkpoint])

    # LOAD BEST MODEL to evaluate the performance of the model
    model_full_aug.load_weights(model_dir+"model_"+str(fold)+"fold.h5")
    results = model_full_aug.evaluate(valid_gen)
    results = dict(zip(model_full_aug.metrics_names,results))

    VALIDATION_ACCURACY.append(results['accuracy'])
    VALIDATION_LOSS.append(results['loss'])

    tf.keras.backend.clear_session()

    fold += 1

And we can see the results:
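
For instance, we can summarize the per-fold scores collected in the loop above (a sketch):

  # Sketch: summarize the cross-validated performance
  print('validation accuracy per fold:', ['{:.3f}'.format(acc) for acc in VALIDATION_ACCURACY])
  print('mean validation accuracy    : {:.3f} +/- {:.3f}'.format(np.mean(VALIDATION_ACCURACY), np.std(VALIDATION_ACCURACY)))
  print('mean validation loss        : {:.3f}'.format(np.mean(VALIDATION_LOSS)))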

Here we actually did not get any significant improvement in accuracy. However, the model is not overfitting as much: the gap between the validation and training curves is smaller, which is still an improvement.

4. Feature extraction

At last, we will use feature extraction with known models trained on ImageNet in order to boost our transfer learning process. This might give us a significant improvement in our classification. We will explore the following models: InceptionResNetV2, Xception, InceptionV3 and NASNetLarge.

We first modify our function in order to quickly and efficiently preprocess our data. We chose the target size (331, 331, 3) because it is the one dictated by NASNetLarge when loaded with weights='imagenet':


  TARGET_SIZE = (331,331,3)

  def images_to_array(directory, dataframe, target_size = TARGET_SIZE):
      images = np.zeros([len(dataframe), target_size[0], target_size[1], target_size[2]], dtype=np.uint8)
      img = ''
      for ix, image_name in enumerate(dataframe['id'].values):
          img_dir = os.path.join(directory, image_name + '.jpg')
          img = load_img(img_dir, target_size = target_size)
          images[ix] = img_to_array(img)
      del img
      label_dict = dict(enumerate(dataframe['breed'].unique()))
      return images, label_dict

We write a function that runs the convolutional base over our data and records its predictions, so we can later feed them to a pretty simple classifier with a dense layer and some dropout. This is less costly, as we only need to run the convolutional base once per image. However, we will not be able to use data augmentation here, as that would be far more costly (we would need to add Dense layers on top of the convolutional base and run the whole thing on the input data using ImageDataGenerator).

  
  def get_features(model, preprocess_input, images, target_size = TARGET_SIZE):

      conv_base = model(input_shape = target_size, include_top=False, weights='imagenet', pooling = 'avg')

      cnn = Sequential([
      InputLayer(input_shape = target_size), # input layer
      Lambda(preprocess_input), # preprocessing layer
      conv_base ]) # base model

      features = cnn.predict(images)

      print('feature-map shape: {}'.format(features.shape))
      return features
  
  
We start our process of feature extraction:
  • InceptionResNetV2:

      InceptionResNetV2(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
      from keras.applications.inception_resnet_v2 import preprocess_input

      resnet_preprocess = preprocess_input
      resnet_features = get_features(InceptionResNetV2, resnet_preprocess, train_images)

  • Xception:

      Xception(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
      from keras.applications.xception import preprocess_input

      xception_preprocess = preprocess_input
      xception_features = get_features(Xception, xception_preprocess, train_images)

  • Inception V3:

      InceptionV3(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
      from keras.applications.inception_v3 import preprocess_input

      inception_preprocess = preprocess_input
      inception_features = get_features(InceptionV3, inception_preprocess, train_images)

  • NASNetLarge:

      NASNetLarge(input_shape=TARGET_SIZE, include_top=False, weights='imagenet')
      from keras.applications.nasnet import preprocess_input

      nasnet_preprocessor = preprocess_input
      nasnet_features = get_features(NASNetLarge, nasnet_preprocessor, train_images)


Finally, we concatenate all our feature maps so we can feed them to our new classifier:


  final_features = np.concatenate([resnet_features, xception_features, inception_features, nasnet_features], axis = 1)

  print('final features shape: {}'.format(final_features.shape))
  
  >>>
      final features shape: (10222, 9664)

  

Now, we define our densely connected classifier:


  FEATURES_SHAPE = final_features.shape[1]

  cnn_model = Sequential([
  InputLayer(input_shape = (FEATURES_SHAPE, )), # input layer
  BatchNormalization(),
  Dense(4096, activation = 'relu'),
  Dropout(0.5),
  Dense(1024, activation='relu'),
  Dropout(0.7),
  Dense(len(class_to_index), activation="softmax")]) # output layer

  # Compile the model
  opt = Adam(lr=0.001) # gradient descent optimizer
  #opt = RMSprop(lr=0.001, rho=0.9)
  cnn_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  cnn_model.summary()
  
  >>>
      Model: "sequential_4"
      _________________________________________________________________
      Layer (type)                 Output Shape              Param #
      =================================================================
      batch_normalization_301 (Bat (None, 9664)              38656
      _________________________________________________________________
      dense (Dense)                (None, 4096)              39587840
      _________________________________________________________________
      dropout (Dropout)            (None, 4096)              0
      _________________________________________________________________
      dense_1 (Dense)              (None, 1024)              4195328
      _________________________________________________________________
      dropout_1 (Dropout)          (None, 1024)              0
      _________________________________________________________________
      dense_2 (Dense)              (None, 120)               123000
      =================================================================
      Total params: 43,944,824
      Trainable params: 43,925,496
      Non-trainable params: 19,328

  

and train it on the data and labels that we just collected:


  # Train the model
  history1 = cnn_model.fit(final_features,
                      labels,
                      steps_per_epoch = len(final_features) // len(labels),
                      epochs = NUM_EPOCHS,
                      validation_split = 0.2,
                      callbacks = [early_stopping])
  
  >>>
      ...
      Epoch 4/100
      1/1 [==============================] - 1s 806ms/step - loss: 0.3666 - accuracy: 0.9083 - val_loss: 2.5747 - val_accuracy: 0.9267
      Epoch 5/100
      1/1 [==============================] - 1s 827ms/step - loss: 0.3241 - accuracy: 0.9175 - val_loss: 2.3619 - val_accuracy: 0.9286
      ...
      Epoch 40/100
      1/1 [==============================] - 1s 853ms/step - loss: 0.0929 - accuracy: 0.9716 - val_loss: 0.5167 - val_accuracy: 0.9364
      Epoch 41/100
      1/1 [==============================] - 1s 801ms/step - loss: 0.0924 - accuracy: 0.9702 - val_loss: 0.5054 - val_accuracy: 0.9389

  

We can finally evaluate our model and see how it is doing. Here we considered two gradient descent optimizers, Adam and RMSprop:

OPTIONAL: As our model seems a bit space-consuming, we can choose to save it as an architecture (using the .to_json method) and weights (using the .save_weights method)


  # serialize model to JSON
  model_json = cnn_model.to_json()
  with open("model.json", "w") as json_file:
    json_file.write(model_json)
  # serialize weights to HDF5
  cnn_model.save_weights("cnn_model_weights.h5")
  print("Saved model to disk")

  

And to load it, one simply uses:

  
  # load json and create model
  json_file = open('model.json', 'r')
  loaded_model_json = json_file.read()
  json_file.close()
  model = model_from_json(loaded_model_json)
  model.load_weights("cnn_model_weights.h5")
  print("Loaded model from disk")

  
Conclusion

We reached a pretty nice accuracy (>95%), so as a nice application we would like to use our model to predict dog breeds from random images on the net. We will write some code to make that possible:

    
      TARGET_SIZE = (331,331,3)
      from PIL import Image

      def dog_vision(model, source, local=False, target_size = TARGET_SIZE):
        img = ''
        if local:
          # Get image from directory
          image_paths = [source + fname for fname in os.listdir(source)]
          images = np.zeros([len(image_paths), target_size[0], target_size[1], target_size[2]], dtype=np.uint8)
          for _, path in enumerate(image_paths):
            img = load_img(path, target_size = target_size)
            images[_] = img_to_array(img)
        else:
          images = np.zeros([1, target_size[0], target_size[1], target_size[2]], dtype=np.uint8)
          req = request.urlopen(source).read()
          img = Image.open(BytesIO(req)).resize((target_size[0], target_size[1]))
          images[0] = img_to_array(img)
        del img

          # run feature extraction
        resnet_features = get_features(InceptionResNetV2, resnet_preprocess, images)
        xception_features = get_features(Xception, xception_preprocess, images)
        inception_features = get_features(InceptionV3, inception_preprocess, images)
        nasnet_features = get_features(NASNetLarge, nasnet_preprocessor, images)
        final_features = np.concatenate([resnet_features, xception_features, inception_features, nasnet_features], axis = 1)
        del resnet_features, xception_features, inception_features, nasnet_features

          # Make predictions
        predictions = model.predict(final_features)

        labels = dict(enumerate(train_df['breed'].unique()))

        # Check custom image predictions
        plt.figure(figsize=(10, 15))
        for i, image in enumerate(images):
          plt.subplot(4, 3, i+1)
          plt.xticks([])
          plt.yticks([])
          plt.title(labels[np.argmax(predictions[i])])
          plt.imshow(image);

        tf.keras.backend.clear_session()
  
    

and you can try it out too!


  local_dir = "files/dogs/"
  URL = "https://images.theconversation.com/files/342682/original/file-20200618-41213-iu7wbs.jpg?ixlib=rb-1.1.0&rect=9%2C4%2C3268%2C2177&q=45&auto=format&w=926&fit=clip"

  # load model
  from keras.models import load_model
  cnn = load_model("models/cnn_dogvision.h5")

  dog_vision(cnn, local_dir, local=True)
  #dog_vision(cnn, URL, local=False)

  

Here is a test with my two dearest babies:

    If you are interested in the code I used, you can find everything in
  • my GitHub repository : link
  • my Colab notebook : Open In Colab
I also recommend Jason Brownlee's amazing book Deep Learning With Python as well as the incredible content of his website, and François Chollet's very complete book Deep Learning With Python which helped me a lot during this and most of my projects.

Building a book recommendation engine using Collaborative Filtering

20th April 2020

Recommendation engine
Collaborative Filtering
Machine learning
python

We use Collaborative Filtering to make recommendations based on other users' similar preferences. We will use a machine learning algorithm known as NearestNeighbors to produce book recommendations that are similar (based on users' ratings) to a given book.

We use the Book-Crossings dataset which contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users.

code Collaborative Filtering with KNN

We will create a book recommendation algorithm using Scikit learn's K-Nearest Neighbors. Here is a short summary of what we will be doing:

  • Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
  • Data modelling - we create and train a model to predict a target variable, then we tune the hyper-parameters of our model in order to improve it if possible.
  • Model evaluation and comparison - we evaluate our models using some metrics and compare them to find the best one.
  • Model experimentation - we try to improve the model through experimentation.

Let's import the main modules and libraries we will be working with:


  ## Data analysis libraries
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns
  from scipy.stats import linregress, norm
  from scipy.sparse import csr_matrix

  ## Scikit Learn Models:
  from sklearn.neighbors import NearestNeighbors
  from sklearn.decomposition import TruncatedSVD

  ## String metrics
  import Levenshtein as lev

    

1. Data analysis:

We read our files using pandas:


  df_books = pd.read_csv('BX-Books.csv',
                         encoding = "ISO-8859-1",
                         sep=";",
                         header=0,
                         names=['isbn', 'title', 'author'],
                         usecols=['isbn', 'title', 'author'],
                         dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

  df_ratings = pd.read_csv('BX-Book-Ratings.csv',
                           encoding = "ISO-8859-1",
                           sep=";",
                           header=0,
                           names=['user', 'isbn', 'rating'],
                           usecols=['user', 'isbn', 'rating'],
                           dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

  df_users = pd.read_csv('BX-Users.csv',
                           encoding = "ISO-8859-1",
                           sep=";",
                           header=0,
                           names=['user', 'location', 'age'],
                           usecols=['user', 'location', 'age'])

As our recommendation engine is based on users' ratings, we can start by gathering some information about the users, say, a look at the age distribution: we observe that most of the active users are between 20 and 30,

and that most of the ratings are 0:
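
A sketch of how these two distributions might be plotted (assuming df_users and df_ratings from above):

  # Sketch: age distribution of users and distribution of rating values
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

  df_users['age'].dropna().hist(bins=30, ax=ax1)
  ax1.set_title('Age distribution of users')
  ax1.set_xlabel('Age')

  df_ratings['rating'].value_counts().sort_index().plot(kind='bar', ax=ax2)
  ax2.set_title('Distribution of rating values')
  ax2.set_xlabel('Rating')

  plt.show()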

We can see the TOP 5 rated books:


  df_ratings_books = pd.merge(df_ratings, df_books, on='isbn')

  df_ratings_books.sort_values('rating', ascending=False).head(5)

          user        isbn  rating  title                                            author
68283   273906  3596200261    10.0  Fischer Taschenbücher, Bd.26, Schöne neu...      Aldous Huxley
108227  211152  0743203631    10.0  Gap Creek: The Story Of A Marriage               Robert Morgan
619156  240144  0345276469    10.0  Agatha Christie: An Autobiography                Agatha Christie
925376  144194  0446517283    10.0  The Dragon King (Crimson Shadow/R.a. Salvatore)  R. A. Salvatore
709904  200300  0525946500    10.0  Charleston                                       John Jakes

The ratings are very unevenly distributed: some books receive a great many ratings, while others have almost none.

One also sees that some users leave a lot of reviews, while others leave very few. Let's see the 5 most rated books:
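
The num_book_ratings DataFrame used below is not defined in the snippets shown here; presumably it counts the ratings each book received, along these lines (a sketch):

  # Sketch: number of ratings per book (the 'user' and 'rating' columns now hold counts)
  num_book_ratings = df_ratings.groupby('isbn').count()
  num_book_ratings.head()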


  most_rated_books = num_book_ratings.sort_values('rating', ascending=False)
  most_rated_books_preview = pd.merge(most_rated_books, df_books, on='isbn').drop('user', axis=1)

  most_rated_books_preview.head(5)

         isbn  rating  title                                            author
0  0971880107    2502  Wild Animus                                      Rich Shapero
1  0316666343    1295  The Lovely Bones: A Novel                        Alice Sebold
2  0385504209     883  The Da Vinci Code                                Dan Brown
3  0060928336     732  Divine Secrets of the Ya-Ya Sisterhood: A Novel  Rebecca Wells
4  0312195516     723  The Red Tent (Bestselling Backlist)              Anita Diamant

For higher statistical significance, we choose to omit users with fewer than 200 ratings and books with fewer than 100 ratings:


  ratings = df_ratings

  #remove users that have given less than 200 ratings
  users_count = ratings['user'].value_counts()
  ratings = ratings[~ratings['user'].isin(users_count[users_count < 200].index)]

  #remove books that have been rated less than 100 times.
  books_count = ratings['isbn'].value_counts()
  ratings = ratings[~ratings['isbn'].isin(books_count[books_count < 100].index)]

  

It turns out that the most rated book has one of the lowest average ratings:


  num_book_ratings['average rating'] = df_ratings.groupby('isbn')['rating'].mean()

  pd.merge(num_book_ratings, df_books[['title', 'isbn']], on='isbn').sort_values('rating', ascending=False).head()

isbn user rating average rating title
215968 0971880107 2502 2502 1.019584 Wild Animus
38572 0316666343 1295 1295 4.468726 The Lovely Bones: A Novel
70803 0385504209 883 883 4.652322 The Da Vinci Code
7345 0060928336 732 732 3.448087 Divine Secrets of the Ya-Ya Sisterhood: A Novel
32372 0312195516 723 723 4.334716 The Red Tent (Bestselling Backlist)

This makes some sense: people tend to rate heavily the books they love, and the ones they hate.

Correlation analysis:

We use bivariate correlation or PCC (Pearson's correlation coefficients) in order to calculate the (linear) correlation between two ratings:


  # pivot ratings into matrix style
  df_ratings_mtrx = ratings.pivot(index='user', columns='isbn', values='rating')

  

As someone who has read The Da Vinci Code (isbn = 0385504209), let's see the correlated results that Pearson's coefficient suggests:


  davinci_corr = pd.DataFrame(df_ratings_mtrx.corrwith(df_ratings_mtrx['0385504209'], method='pearson'), columns=['pearson coeff']).dropna()

  

We can observe the top 10 suggestions whose ratings are most highly correlated with those of The Da Vinci Code:

isbn pearson coeff rating average rating title
7 0385504209 1.000000 883 4.652322 The Da Vinci Code
15 0671027360 0.218741 586 3.718430 Angels &amp; Demons
6 0375727345 0.214953 552 3.039855 House of Sand and Fog
1 0142001740 0.186145 615 4.219512 The Secret Life of Bees
16 067976402X 0.179512 614 3.255700 Snow Falling on Cedars
14 059035342X 0.166290 571 4.900175 Harry Potter and the Sorcerer's Stone (Harry P...
5 0345337662 0.149289 506 3.535573 Interview with the Vampire
4 0316666343 0.143711 1295 4.468726 The Lovely Bones: A Novel
12 0446672211 0.135783 585 4.105983 Where the Heart Is (Oprah's Book Club (Paperba...
3 0316601950 0.133880 568 3.593310 The Pilot's Wife : A Novel

This suggests that the correlation approach does indeed work: the first suggestion is by the same author, and the next few are all Mystery/Thriller titles.

2. Data modelling

We convert our DataFrame into a scipy sparse matrix, filling the missing values with zeros (since we will be computing distances between rating vectors):


  df_ratings_books = pd.merge(ratings, df_books, on='isbn').dropna(axis = 0, subset = ['title']).drop_duplicates(['user', 'title'])
  df_ratings_books_mtrx = df_ratings_books.pivot(index = 'title', columns = 'user', values = 'rating').fillna(0)

  # convert dataframe to scipy sparse matrix for more efficient calculations
  ratings_mtrx = csr_matrix(df_ratings_books_mtrx.values)

  

We use the k-nearest neighbours algorithm through scikit-learn's NearestNeighbors, choosing metric='cosine' to compute the cosine similarity between rating vectors:


  model = NearestNeighbors(algorithm='brute',
                           leaf_size=30,
                           metric='cosine',
                           metric_params=None,
                           n_jobs=1,
                           n_neighbors=5,
                           p=2).fit(ratings_mtrx)

  

We write a function that uses our model to give the 5 most relevant recommendations for a given book title:


  def get_recommends(book = "", verbose=False):
      """ Takes a book title and checks if it is in the database, then prints 5 recommendations using KNN and returns a list of
      each recommendation with its distance; if verbose is set, it also prints the distances"""
      try:
          index = df_ratings_books_mtrx.index.get_loc(book)
      except KeyError:
          print("Couldn't find any :'(")
          return [book, ["", 0.]*5]

      knn_dist, knn_ind = model.kneighbors(df_ratings_books_mtrx.iloc[index, :].values.reshape(1, -1), n_neighbors = 6)
      recommendations = [book, []]

      for i in range(0, len(knn_dist.flatten())):
          if i == 0:
              book_to_recommend = df_ratings_books_mtrx.index[index]
              print('Recommendations for {}:\n'.format(book_to_recommend))
          else:
              book_to_recommend = df_ratings_books_mtrx.index[knn_ind.flatten()[i]]
              recommendations[1].append([book_to_recommend, knn_dist.flatten()[i]])
              if verbose:
                  print('{}: {}, with a distance of {:.4f}'.format(i, book_to_recommend, knn_dist.flatten()[i]))
              else:
                  print('{}: {}'.format(i, book_to_recommend))
      return recommendations

And we test it; here I picked the same book, The Da Vinci Code:

  
    get_recommends(book = "The Da Vinci Code");
    
    >>>
        Recommendations for The Da Vinci Code:

        1: Angels & Demons
        2: Widow's Walk
        3: Doing Good
        4: The Blue Nowhere : A Novel
        5: Touching Evil
  
  

As a bonus, we write a function that uses the Jaro distance to compare the similarity between two strings. We use the Levenshtein module for very fast computation:


  def title_suggest(title, lst=list(dict.fromkeys(list(df_books['title']))), k=20):
      """Gives available suggestions of books in the database based on the Jaro distance for string matching"""

      comp = list()
      for name in lst:
          comp.append((name, lev.jaro(title, name)))
      comp.sort(key=lambda x:x[1], reverse = True)
      print("Possible suggestions:")
      for i in range(k):
          print(comp[i][0])
      return comp[:5]

And we try to get suggestions, say for Memoirs of a Geisha:


  title_suggest("Memoirs of a Geisha");
  
  >>>
      Possible suggestions:
      Memoirs of a Geisha
      Memoirs of a Geisha Uk
      Memoirs of a Beatnik
      Memoirs of a Pet Therapist
      Memoirs of Elise
      Memoirs of the Forties
      Memoirs of an Invisible Man
      Memoirs of a Medieval Woman
      Memoirs of a Geisha : A Novel (AUDIO CASSETTE)
      Memoirs of a Survivor
      Memoirs of Fanny Hill
      Memoirs of a Race Traitor
      Memoirs
      Memoirs of an Unfit Mother
      Memoirs of a Bengal Civilian
      Memoirs of a Mangy Lover
      Promise of Paradise
      Memoirs of a Voluptuary
      Memoirs of a Lightkeeper's Son
      Memoirs of a Dutiful Daughter

The function title_suggest can be plugged into get_recommends for the case where no title matches the input.

    If you are interested in the codes I used, you can find everything in
  • my Github repository : link
  • my Colab's notebook : Open In Colab

Cardiovascular disease classification

15th February 2019

Machine learning
binary classification
python
Scikit learn

We will use a basic Machine Learning framework to perform a binary classification task based on clinical examination results from patients with and without cardiovascular disease, in order to train a model that predicts, from a patient's medical attributes, whether or not they have a cardiovascular disease.

The original data were collected during medical examinations and belong to the Cleveland database from the UCI Machine Learning Repository. We use a more compact version with the relevant parameters (only 14 features) from Kaggle.

code cardiovascular disease diagnosis

Here is a short summary of what we will be doing:

  • Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
  • Model building - we create and train a model to predict a target variable, then we tune its hyper-parameters to improve it where possible.
  • Model evaluation and comparison - we evaluate each model using several metrics and compare them to find the best one.
  • Feature importance - we highlight the parameters that matter most in predicting the presence of a cardiovascular disease.
  • Conclusion

Let's import the main modules and libraries we will be working with:


  ## Data analysis libraries
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns
  from scipy.stats import linregress

  ## Scikit Learn Models:
  from sklearn.linear_model import LogisticRegression
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.svm import SVC
  from xgboost import XGBClassifier

  ## Scikit Learn Model evaluators
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
  from sklearn import metrics
  from sklearn.metrics import confusion_matrix, roc_curve, classification_report
  from sklearn.metrics import precision_score, recall_score, f1_score
  from sklearn.metrics import plot_roc_curve

  ## Scikit Learn preprocessing
  from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, RobustScaler
  from sklearn.decomposition import PCA

    

1. Data analysis:

The rows in the dataset represent patients and the columns represent examination results from various blood and medical tests.

We use the dataset to make a prediction on our target variable:

Features Variable Type Variable Value Type
Age Objective Feature age int(years)
Gender Objective Feature sex categorical code: 1 = male; 0 = female
Chest pain type Examination Feature cp categorical code : 0: Typical angina; 1: Atypical angina; 2: Non-anginal pain; 3: Asymptomatic
Resting blood pressure Examination Feature trestbps float (mm Hg)
Serum cholestoral Examination Feature chol float (mg/dl)
Fasting blood sugar > 120 mg/dl Examination Feature fbs binary
Resting electrocardiographic results Examination Feature restecg categorical code : 0: Nothing to note; 1: ST-T Wave abnormality; 2: Possible or definite left ventricular hypertrophy
Maximum heart rate achieved Examination Feature thalach float
Exercise induced angina Examination Feature exang binary
ST depression induced by exercise relative to rest Examination Feature oldpeak float
The slope of the peak exercise ST segment Examination Feature slope categorical code : 0: Upsloping; 1: Flatsloping; 2: Downsloping
Number of major vessels colored by flourosopy Examination Feature ca int (0-3)
Thallium stress result Examination Feature thal int: 1,3: normal; 6: fixed defect; 7: reversible defect
Presence or absence of cardiovascular disease Target Variable target binary: 1=yes, 0=no

We start by reading our CSV file using pandas


  heart_df = pd.read_csv('files/heart.csv')
  heart_df.head(10)

and have a look at how many positive and negative samples there are:

Since the values are close to each other, this suggests that the target column can be considered balanced.

Correlation analysis

We first explore some possible correlation between independent variables, as this may suggest which independent variables may or may not have an impact on our target variable:


  corr_mtrx = heart_df.corr()

A correlation matrix tells us how related each variable is to the others. Since correlation matrices are symmetric, half of the matrix suffices to analyse the data, hence we mask the upper triangle as follows:


  #Mask the upper triangle:
  mask = np.triu(np.ones(corr_mtrx.shape),0)

seaborn's heatmap is perfect to graphically highlight our correlations:
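
The heatmap figure itself is not reproduced here; a minimal sketch of how it might be drawn, assuming corr_mtrx and mask from above:


  fig, ax = plt.subplots(figsize=(12, 8))
  sns.heatmap(corr_mtrx,
              mask=mask,
              annot=True,
              fmt='.2f',
              cmap='coolwarm',
              ax=ax)
  plt.show()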

A higher positive (resp. negative) value means a potential positive (resp. negative) correlation. For example, we observe a negative correlation between cardiovascular diseases and gender. Indeed, if we look closer, this is partially due to the fact that there are more male than female patients in the sample:


  heart_df['sex'].value_counts(normalize=True)

  >>>
  	1    0.683168
  	0    0.316832
  	Name: sex, dtype: float64

We also observe a negative correlation of target with age, and a positive correlation with thalach. We then try to understand how age and maximal heart rate relate in accordance with the presence, or not, of cardiovascular disease:

First, we observe that the maximal heart rate decreases with age: this makes sense, as the younger someone is, the higher their maximal heart rate. What is also interesting is that the decline is more pronounced in patients with cardiovascular problems, as the regression slope for target = 1 is clearly steeper than that for target = 0.
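
The scatter plot is not reproduced here; a minimal sketch of how the per-group regressions might be computed and drawn with linregress, assuming heart_df from above:


  fig, ax = plt.subplots(figsize=(10, 6))
  for target, colour in zip([0, 1], ['#1f77b4', '#d62728']):
      sub = heart_df[heart_df['target'] == target]
      ax.scatter(sub['age'], sub['thalach'], c=colour, alpha=0.6, label=f'target = {target}')
      # line of best fit for this group
      reg = linregress(sub['age'], sub['thalach'])
      xs = np.linspace(sub['age'].min(), sub['age'].max(), 100)
      ax.plot(xs, reg.intercept + reg.slope * xs, c=colour)
  ax.set_xlabel('Age')
  ax.set_ylabel('Maximum heart rate (thalach)')
  ax.legend()
  plt.show()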

However, this might also be misleading, due to the following fact: if we observe the age distribution of the patients,

the Kernel Density Estimate is slightly skewed to the right, which is also reflected in the scatter plot above, as there are more older patients. On the other hand, the highest correlation with the target variable is seen for the chest pain variable cp:

After a little research, the American Heart Association provides us with the following insights:

  • Typical angina: chest pain or discomfort caused when the heart muscle doesn't get enough oxygen-rich blood.
  • Atypical angina: chest pain that is not Typical Angina
  • Non-anginal pain: typically esophageal spasms (not related to the heart)
  • Asymptomatic: chest pain not showing signs of any disease

What is surprising here is that Atypical Angina is not related to the heart, yet it still shows a higher ratio of participants with cardiovascular disease than without. This comes down to the actual definition of 'Atypical Angina' in the medical field, which can be confusing. Source: PubMed

2. Data modelling

Here we will build and train a model to predict our target variable:


  np.random.seed(1729)

  X = heart_df.drop('target', axis = 1)
  y = heart_df['target']

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Our problem is a classification one, where we predict a categorical variable: target. We have labeled data with fewer than 100,000 samples. Following Scikit-Learn's documentation, we opt for the following models:

In order to see which model performs the best, we proceed as follow:

2.1. Instantiate each model in a dictionary:

  models = {'logReg': LogisticRegression(max_iter = 1000),
            'KNN': KNeighborsClassifier(),
            'SVC': SVC(),
            'RandomForestClassifier': RandomForestClassifier(),
            'XGB': XGBClassifier(use_label_encoder=False)}

2.2 Create function to fit and score models:

  def mod_score(models, X_train, X_test, y_train, y_test):
    '''
    Takes a dictionary of models, fits each on a training set and returns its score on a test set
    '''
    # Fix randomness
    np.random.seed(1729)
    # empty dict to append
    results = {}

    for name, model in models.items() :
        model.fit(X_train, y_train)
        results[name] = model.score(X_test, y_test)

    return results

2.3 View the results:

  mod_scores = mod_score(models, X_train, X_test, y_train, y_test)
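
The comparison figure is not reproduced here; one quick way to eyeball the scores, as a sketch:


  scores_df = pd.DataFrame(mod_scores, index=['accuracy']).T

  fig, ax = plt.subplots(figsize=(8, 4))
  scores_df.plot.bar(legend=False, ax=ax)
  ax.set_ylabel('Test accuracy')
  plt.show()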

3. Model fine tuning

We will first try normalizing our datasets to see if it improves our scores, and then fine-tune the hyperparameters of our classifiers to get the best possible results:

3.1. Datasets normalization

We normalize the data and also try two types of scaling: MinMax and Robust:


  # Normalized datasets
  normal = Normalizer()
  Xn_train = normal.fit_transform(X_train)
  Xn_test = normal.transform(X_test)   # only transform the test set

  # Scaled datasets (scalers are fitted on the training set only)
  minmax = MinMaxScaler()
  Xm_train = minmax.fit_transform(X_train)
  Xm_test = minmax.transform(X_test)

  scaler = RobustScaler()
  Xsc_train = scaler.fit_transform(X_train)
  Xsc_test = scaler.transform(X_test)

3.2. Hyper-parameter tuning

Given the small number of samples (only 303), we cannot afford a separate validation set to tune our models, so we use k-fold cross-validation instead.

We use RandomizedSearchCV :


  # logReg hyperparameters
  logReg_grid = { 'C': np.logspace(-4, 4, 50),
           'solver': ['liblinear'],
         'penalty' : ['l1', 'l2']}

  # RandomForest hyperparameters
  RandFor_grid = {'n_estimators': np.arange(10, 1000, 50),
                   'max_depth': [None, 3, 5, 10],
           'min_samples_split': np.arange(2, 20, 2),
            'min_samples_leaf': np.arange(1, 20, 2) }

  # KNN hyperparameters
  KNN_grid = { 'leaf_size' : [i for i in range(1,30)],
           'n_neighbors' : [i for i in range(1,21)],
                   'p' : [1,2]
           }

  # SVC hyperparameters
  SVC_grid = [{'C': [1, 10, 100, 1000],
             'kernel': ['linear']},
            {'C': [0.1, 1, 10, 100],
            'gamma': [1, 0.1, 0.01, 0.001],
          'kernel': ['rbf']}]

  # XGB hyperparameters
  XGB_grid = {'max_depth':range(3,10,2),
              'gamma':[i/10 for i in range(0,5)],
              'subsample':[i/10 for i in range(6,10)],
              'colsample_bytree':[i/10 for i in range(6,10)],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]}
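
The actual search calls are not shown above; a minimal sketch of how one of these grids could be fed to RandomizedSearchCV (here the logistic regression grid, with n_iter and cv chosen arbitrarily):


  rs_logReg = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                                 param_distributions=logReg_grid,
                                 n_iter=20,
                                 cv=5,
                                 verbose=True)
  rs_logReg.fit(Xn_train, y_train)

  rs_logReg.best_params_, rs_logReg.score(Xn_test, y_test)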

Here, at least with the modelling we have been doing, both searches come up with essentially the same suggested tuning.

We compare the resulting improvements:

4. Model evaluation

Now we evaluate the logReg classifier we have been tuning, using more metrics than just accuracy:


  # Instantiate best model with best hyperparameters
  clf = LogisticRegression(max_iter = 800,
                         C = 339.3221771895323,
                         solver = 'liblinear',
                         penalty = 'l1').fit(Xn_train, y_train);
  #make predictions
  y_preds = clf.predict(Xn_test)

4.1. Confusion matrix:

A visual way to show where our model nailed its predictions and where it failed:

  
    conf_mtrx = confusion_matrix(y_test, y_preds)
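
The confusion-matrix figure is not shown here; one quick way to visualise it, as a sketch using seaborn:


  fig, ax = plt.subplots(figsize=(4, 4))
  sns.heatmap(conf_mtrx, annot=True, fmt='d', cbar=False, ax=ax)
  ax.set_xlabel('Predicted label')
  ax.set_ylabel('True label')
  plt.show()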
  
  
4.2. ROC Curve and AUC Scores:

We compare the true positive rate to the false positive rate, i.e. the proportion of patients with the disease who are correctly flagged as positive versus the proportion of healthy patients who are incorrectly flagged as positive:


  fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(Xn_test)[:, 1])
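
The ROC figure is likewise omitted; a minimal sketch of how the curve and AUC score might be drawn from the fpr/tpr arrays above:


  auc = metrics.roc_auc_score(y_test, clf.predict_proba(Xn_test)[:, 1])

  fig, ax = plt.subplots(figsize=(6, 6))
  ax.plot(fpr, tpr, label=f'logReg (AUC = {auc:.3f})')
  ax.plot([0, 1], [0, 1], linestyle='--', label='chance level')
  ax.set_xlabel('False positive rate')
  ax.set_ylabel('True positive rate')
  ax.legend()
  plt.show()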

4.3. Classification report

Now let's have a look at the main classification metrics:


  # Create a classification report using the classification_report function
  report = classification_report(y_test, y_preds, output_dict=True)

  pd.DataFrame(report).T

precision recall f1-score support
0 0.923077 0.857143 0.888889 28.000000
1 0.885714 0.939394 0.911765 33.000000
accuracy 0.901639 0.901639 0.901639 0.901639
macro avg 0.904396 0.898268 0.900327 61.000000
weighted avg 0.902864 0.901639 0.901264 61.000000

We tweak our previous function in order to evaluate our model on all these metrics:


  def eval_mod(model, X_test, y_test):
    ''' Evaluate a model on the test set with several metrics '''
    y_preds = model.predict(X_test)
    acc = model.score(X_test, y_test)
    prec = precision_score(y_test, y_preds)
    recall = recall_score(y_test, y_preds)
    f1 = f1_score(y_test, y_preds)
    dft_eval = {'accuracy' : acc,
               'precision' : prec,
                  'recall' : recall,
                'f1 score' : f1}
    for name, value in dft_eval.items():
        print(name, ': {:.2f}%'.format(value * 100))
    return dft_eval


  dft_eval = eval_mod(clf, Xn_test, y_test)

  >>>
  	accuracy : 90.16%
  	precision : 88.57%
  	recall : 93.94%
  	f1 score : 91.18%

And we use cross_val_score() to cross-validate for more robust results:


  def cv_eval_mod(model, X, y):
      ''' Cross-validate a model on (X, y) with several metrics '''
      cv_acc = np.mean(cross_val_score(model, X, y, scoring = "accuracy", cv = 5))
      cv_prec = np.mean(cross_val_score(model, X, y, scoring = "precision", cv = 5))
      cv_recall = np.mean(cross_val_score(model, X, y, scoring = "recall", cv = 5))
      cv_f1 = np.mean(cross_val_score(model, X, y, scoring = "f1", cv = 5))
      cv_eval = {'accuracy' : cv_acc,
                'precision' : cv_prec,
                'recall' : cv_recall,
                'f1 score' : cv_f1 }
      for name, value in cv_eval.items():
          print(name, ': {:.2f}%'.format(value * 100))
      return cv_eval


  cv_eval = cv_eval_mod(clf, Xn_test, y_test)

  >>>
      accuracy : 77.18%
      precision : 76.90%
      recall : 84.76%
      f1 score : 80.45%

We can observe our evaluation:
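
The comparison figure is omitted here; a rough sketch of how the two sets of metrics could be plotted side by side, assuming dft_eval and cv_eval from above:


  eval_df = pd.DataFrame({'test set': dft_eval, 'cross-validated': cv_eval})

  fig, ax = plt.subplots(figsize=(8, 4))
  eval_df.plot.bar(ax=ax)
  ax.set_ylabel('Score')
  plt.show()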

5. Feature importance

Finally, we look for the features that were most important in predicting cardiovascular disease from the medical examination results we had, or in other words:

Which characteristics contribute the most to logReg predicting whether someone has cardiovascular disease or not?


  #get the coefficients from logReg
  clf.coef_

  # Match features to columns of our df
  features_dict = dict(zip(heart_df.columns, list(clf.coef_[0])))
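
A simple way to inspect these coefficients visually, as a sketch assuming features_dict from above:


  feature_df = pd.DataFrame(features_dict, index=['coefficient']).T

  fig, ax = plt.subplots(figsize=(10, 5))
  feature_df.plot.bar(legend=False, ax=ax)
  ax.set_ylabel('logReg coefficient')
  plt.show()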

Now, the final remark to make is that our model is learning and figuring out the pattern : slope increases $\Rightarrow$ target $\longrightarrow$ positive.

Here, the data does suggest such a thing, although it seems kind of logical as well:

  • Upsloping: better heart rate with exercise (uncommon)
  • Flatsloping: minimal change (typical healthy heart)
  • Downsloping: signs of an unhealthy heart

6. Conclusion

We analysed our sample and trained a model that reached 90.16% test accuracy and 83.81% cross-validated accuracy. As our classes are fairly balanced, we would aim for a higher accuracy, of at least 95%. However, since the false negative predictions are fewer than the false positive ones (see the confusion matrix), we do get a pretty high recall (91.52% on cross-validation), which is good.

    If you are interested in the codes I used, you can find everything in
  • my Github repository : link
  • my Colab's notebook : Open In Colab

Breast tumour diagnosis using logistic regression and support vector machines

6th February 2019

Machine learning
Binary label classification
python
Scikit learn

We will use a basic Machine Learning framework to perform a binary classification task based on medical examination results from patients with and without a breast tumour, in order to predict whether a tumour is malignant or benign.

The original data is taken from the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. This is a very simple and compact example of a basic Machine Learning binary classification workflow.

code Breast tumour diagnosis

Here is a short summary of what we will be doing:

  • Data analysis - we prepare, study and explore the dataset, highlighting important correlations.
  • Model building - we create and train a model to predict a target variable, then we tune its hyper-parameters to improve it where possible.
  • Model evaluation and comparison - we evaluate each model using several metrics and compare them to find the best one.
  • Model experimentation - we try to improve the model through experimentation.
  • Feature importance - we highlight the important parameters in predicting the nature of a breast tumour.
  • Conclusion

To work through these topics, we will use Scikit learn, matplotlib and pandas.

Let's import the main modules and libraries we will be working with:


  ## Data analysis libraries
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns
  from scipy.stats import linregress, norm

  ## Scikit Learn Models:
  from sklearn.linear_model import LogisticRegression
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.svm import SVC
  from sklearn.naive_bayes import GaussianNB

  ## Scikit Learn Model evaluators
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
  from sklearn import metrics
  from sklearn.metrics import confusion_matrix, roc_curve, classification_report
  from sklearn.metrics import precision_score, recall_score, f1_score
  from sklearn.metrics import plot_roc_curve

  ## Scikit Learn preprocessing
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA

    

1. Data analysis:

The rows in the dataset represent the 569 patients and the columns represent 30 characteristics of the cell nuclei present in an image of a fine needle aspirate (FNA) of a breast mass. We use the dataset to make a prediction on our target variable:
Features Variable Type Variable Value Type
0 Radius (mean) Statistical Feature radius_m float
1 Texture (mean) Statistical Feature texture_m float
2 Perimeter (mean) Statistical Feature perimeter_m float
3 Area (mean) Statistical Feature area_m float
4 Smoothness (mean) Statistical Feature smoothness_m float
5 Compactness (mean) Statistical Feature compactness_m float
6 Concavity (mean) Statistical Feature concavity_m float
7 Concave points (mean) Statistical Feature cp_m float
8 Symmetry (mean) Statistical Feature symmetry_m float
9 Fractal dimension (mean) Statistical Feature fd_m float
10 Radius error Statistical Feature radius_std float
11 Texture error Statistical Feature texture_std float
12 Perimeter error Statistical Feature perimeter_std float
13 Area error Statistical Feature area_std float
14 Smoothness error Statistical Feature smoothness_std float
15 Compactness error Statistical Feature compactness_std float
16 Concavity error Statistical Feature concavity_std float
17 Concave points error Statistical Feature cp_std float
18 Symmetry error Statistical Feature symmetry_std float
19 Fractal dimension error Statistical Feature fd_std float
20 Worst radius Statistical Feature radius_L float
21 Worst texture Statistical Feature texture_L float
22 Worst perimeter Statistical Feature perimeter_L float
23 Worst area Statistical Feature area_L float
24 Worst smoothness Statistical Feature smoothness_L float
25 Worst compactness Statistical Feature compactness_L float
26 Worst concavity Statistical Feature concavity_L float
27 Worst concave points Statistical Feature cp_L float
28 Worst symmetry Statistical Feature symmetry_L float
29 Worst fractal dimension Statistical Feature fd_L float
30 Type of tumour Target Variable target categorical code: 0 = Malignant; 1 = Benign

We load the data into a pandas DataFrame of 569 rows × 31 columns (here `cancer` is the dataset object, presumably the one returned by scikit-learn's load_breast_cancer):


  cancer_df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))

  cancer_df.head()

and have a look at how many positive and negative samples there are:

Since the two counts are of the same order, the target is not too unbalanced. We can plot some features to observe how they relate to each other: for example, we plot the mean radius and smoothness of the tumours in both cases:
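
The figure is not shown here; a rough sketch of how it might be drawn, assuming the scikit-learn column names 'mean radius' and 'mean smoothness':


  fig, ax = plt.subplots(figsize=(8, 6))
  for target, label, colour in [(0.0, 'malignant', '#d62728'), (1.0, 'benign', '#1f77b4')]:
      sub = cancer_df[cancer_df['target'] == target]
      ax.scatter(sub['mean radius'], sub['mean smoothness'], c=colour, alpha=0.6, label=label)
  ax.set_xlabel('mean radius')
  ax.set_ylabel('mean smoothness')
  ax.legend()
  plt.show()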

We can see that benign tumours tend to be smaller in radius than the malignant ones. We can also observe how some features overlap, which would make it hard to diagnose them:

Correlation matrix:

We explore some possible correlation between independent variables, as this may suggest which independent variables may or may not have an impact on our target variable:

  
  corr_mtrx = cancer_df.corr()

  #Mask the upper triangle:
  mask = np.triu(np.ones(corr_mtrx.shape),0)
  
  
For higher accuracy, we will normalize our data. We use min-max normalization:
$\displaystyle{ X_n = \frac{X - X_{min}}{X_{max} - X_{min}} }$
and standard scaling:
$\displaystyle{ X_{sc} = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)} }$

We split our data into train and test sets:


  np.random.seed(1729)

  X = cancer_df.drop('target', axis = 1)
  y = cancer_df['target']

  # Non-normalized datasets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

  # Normalized datasets (min-max, using the training-set statistics)
  Xn_train = ( X_train - X_train.min() ) / ( X_train.max() - X_train.min() )
  Xn_test = ( X_test - X_train.min() ) / ( X_train.max() - X_train.min() )

  # Scaled datasets (scaler fitted on the training set only)
  scaler = StandardScaler()
  Xsc_train = scaler.fit_transform(X_train)
  Xsc_test = scaler.transform(X_test)

  # Using PCA (fitted on the scaled training set only)
  pca = PCA(n_components=3)
  Xpca_train = pca.fit_transform(Xsc_train)
  Xpca_test = pca.transform(Xsc_test)

Our problem is a classification one, where we predict a categorical variable: target. We have labeled data with fewer than 100,000 samples. Following Scikit-Learn's documentation, we opt for the following models:

We test which model performs best:


  # Instantiate each model in a dictionary
  models = {'logReg': LogisticRegression(max_iter = 5000),
            'KNN': KNeighborsClassifier(),
            'SVC': SVC(),
            'GNB': GaussianNB()}

  # Create function to fit and score models
  def mod_score(models, X_train, X_test, y_train, y_test):
    '''
    Takes a dictionary of models, fits each on a training set and returns its score on a test set
    '''
    # Fix randomness
    np.random.seed(1729)
    # empty dict to append
    results = {}

    for name, model in models.items() :
        model.fit(X_train, y_train)
        results[name] = model.score(X_test, y_test)

    return results

  # score the results
  mod_scores = mod_score(models, X_train, X_test, y_train, y_test)

We finally get the following results:

Hyper-parameter tuning:

Given the small number of samples (only 569), we cannot afford a separate validation set to tune our models, so we use k-fold cross-validation instead. We will use RandomizedSearchCV:

  
    # logReg hyperparameters
    logReg_grid = { 'C': np.logspace(-4, 4, 50),
               'solver': ['liblinear'],
             'penalty' : ['l1', 'l2']}

    # SVC hyperparameters
    lSVC_grid = [{'C': [1, 10, 100, 1000],
                 'kernel': ['linear']},
                {'C': [0.1, 1, 10, 100],
                'gamma': [1, 0.1, 0.01, 0.001],
                'kernel': ['rbf']}]

    # KNN hyperparameters
    KNN_grid = {'n_neighbors': [3, 4, 5, 10],
                  'weights': ['uniform', 'distance'],
                  'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
                  'leaf_size' : [10, 20, 30, 50],
                  'p' : [1,2]}

    # GaussianNB hyperparameters
    GNB_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
  
  

We can visually compare the performances:

We clearly see that SVC performed the best, with scaling and the parameters:

  SVC(C = 10, kernel = 'linear', probability=True)

3. Model evaluation

We make our predictions:


  # Instantiate best model with best hyperparameters
  clf = SVC(C = 10, kernel = 'linear', probability=True).fit(Xsc_train, y_train);

  # Make predictions
  y_preds = clf.predict(Xsc_test)

3.1. Confusion matrix:

  conf_mtrx = confusion_matrix(y_test, y_preds)

3.2. ROC Curve and AUC Scores

We compare the true positive rate to the false positive rate, i.e. a tumour that is actually benign being diagnosed as malignant vs a tumour that is actually malignant being diagnosed as benign (the latter case being the more dangerous one):


  fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(Xsc_test)[:, 1])

3.3. Classification report

Let's have a look at the main evaluation metrics:


  # Create a classification report using the classification_report function
  report = classification_report(y_test, y_preds, output_dict=True)

  pd.DataFrame(report).T

precision recall f1-score support
0.0 0.981132 0.981132 0.981132 53.000000
1.0 0.988889 0.988889 0.988889 90.000000
accuracy 0.986014 0.986014 0.986014 0.986014
macro avg 0.985010 0.985010 0.985010 143.000000
weighted avg 0.986014 0.986014 0.986014 143.000000

Finally, we use a 5-fold cross_val_score() to cross-validate our metrics:

  
    # cross-validated accuracy
    cv_acc = np.mean(cross_val_score(clf, X, y, scoring = "accuracy", cv = 5))

    # cross-validated precision
    cv_prec = np.mean(cross_val_score(clf, X, y, scoring = "precision", cv = 5))

    # cross-validated recall
    cv_recall = np.mean(cross_val_score(clf, X, y, scoring = "recall", cv = 5))

    # cross-validated f1-score
    cv_f1 = np.mean(cross_val_score(clf, X, y, scoring = "f1", cv = 5))
  
  

and have a better and more accurate estimation of how our model is doing:


  
  >>>
      accuracy : 95.26%
      precision : 95.89%
      recall : 96.64%
      f1 score : 96.24%
  
  
4. Feature importance

Finally, we look for the features that were most important in distinguishing malignant tumours from benign ones using the results we had, or in other words:

Which characteristics contribute the most to the SVC predicting whether a tumour is malignant or not?

  
    # get the coefficients from SVC
    clf.coef_

    # Match features to columns of our df
    features_dict = dict(zip(cancer_df.columns, list(clf.coef_[0])))
  
  

We can already see that our model is learning and figuring out some patterns to guess malignant and benign cases:

We see for example that it relies more on the standard deviation of the fractal dimension of a tumour to eliminate the malignant cases :

and, on the other hand, relies more on the largest area as well as the standard deviation of the area to detect malignant tumours.

Conclusion

We analysed our sample and trained a model that reached 98.60% test accuracy and 95.26% cross-validated accuracy. As our classes here are fairly balanced, we aim for a very high accuracy (of at least 95%). However, we would still like to aim for a higher recall, reducing the number of false negatives (in this case, from 1 to 0).

5. Future development:

I would like to work on an X-ray scan classification task, using image segmentation or image classification; I think I saw a competition about that on Kaggle. Maybe in the near future!

    If you are interested in the codes I used, you can find everything in
  • my Github repository : link
  • my Colab's notebook : Open In Colab

Here are some projects I did myself, for uni and/or for pure fun and curiosity! I also have a random pile of code, available here: developer_mode

Pythonic implementation of Conway's "Game of Life", using Matplotlib's FuncAnimation

18th December 2020

Pandas
Matplotlib
Python
The game of life

As a huge fan of Conway's work, his beautiful mind and his charismatic and genius approach to mathematics in general, I decided to create, for my Python class's final project, a "pythonic" implementation of his famous Game of Life, which shows how easily and how beautifully life can emerge:

code Implementing the Game of Life in Python:

Let's import the main modules and libraries we will be working with:


  from IPython.display import HTML
  from matplotlib.animation import FuncAnimation
  import matplotlib as mpl
  import matplotlib.pyplot as plt
  import numpy as np
  from scipy.signal import convolve2d

    
1. A Board

First, we create a function that builds a board; for that, we use numpy's meshgrid function:


  def create_board(size):
      """ Takes a tuple size (n,m) to create a board (n x m) """
      x = np.arange(0, size[0])
      y = np.arange(0, size[1])
      board = np.meshgrid(x, y)
      return board

    

We use matplotlib's scatter function to plot our board. It would look something like this:


  SIZE = (40,40)
  marker_size = 350

  board = create_board(SIZE)

  fig, ax = plt.subplots(figsize=(16, 16))
  scatter = ax.scatter(*board, animated=True, s=marker_size, edgecolor=None, c = '#ddc3a5')
  ax.set_facecolor('#201e20')
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)

      

Our board will look like this:

2. A status function

We represent the alive or dead status of each cell on the board as True or False. We can also give a starting probability in [0,1[ for a cell to be alive:


  def status(size, prob_life):
      """ Returns a boolean array where each cell starts alive with probability prob_life """
      stat = (np.random.uniform(0, 1, size=size) < prob_life)
      return stat

      

Now running the cell :

  
  status(SIZE, prob_life=0.3)
  
        

will produce the following output:


  array([[False,  True,  True, ..., False, False, False],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True, False,  True, ...,  True, False,  True],
         ...,
         [ True,  True, False, ...,  True, False,  True],
         [ True, False,  True, ...,  True,  True,  True],
         [False,  True, False, ..., False,  True,  True]])

      

Conway came up with a set of rules for this game:

  • Survival : Any live cell with two or three live neighbors lives on to the next generation.
  • Under-population : Any live cell with fewer than two live neighbors dies.
  • Over-population : Any live cell with more than three live neighbors dies.
  • Reproduction : Any dead cell with exactly three live neighbors becomes a live cell.

So a cell has a total of 8 neighbouring cells : up, down, left, right and diagonals. We create a function that would apply these rules to a given game status.

3. The rules of life

We first create a function that will count how many living neighbours a cell has.

A really nice way to think about it is this: what happens when we reach the edge of the board? This reminds me of Super Mario Bros. 3 Battle mode on the SNES when I was young: every time Mario reached the edge of the screen, he would reappear at the other edge:

This suggests that what we actually see as a board (or a TV screen in my case) is really a wrapped surface on which Mario runs forever. We use scipy's 2D convolution function convolve2d and wrap the edges of the board with periodic boundary conditions, via the parameter boundary="wrap":

  
    def count_alive_neighbors(status):
        """ Counts the number of neighboring alive cells """
        kernel = np.array(
            [[1, 1, 1],
             [1, 0, 1],
             [1, 1, 1]])
        count = convolve2d(status, kernel, mode='same', boundary="wrap")
        return count
  
      

and we implement Conway's rules into a function:

  
    def life_rules(status):
        """
        Applies Conway's Game of Life rules given the current status of the game and returns a tuple of two arrays:
        a the new status of each cell and a representation of the number of its neighbors
        """
        alive_neighbors = count_alive_neighbors(status)

        # Under-population :
        survive_UP = ( alive_neighbors >= 2 )
        # Over-population :
        survive_OP = ( alive_neighbors <= 3 )
        # Survivors :
        survive = status * survive_UP * survive_OP
        # Reproduction :
        new_status = np.where(alive_neighbors == 3, True, survive)

        new_neighbors = count_alive_neighbors(new_status)
        return new_status, new_neighbors
  
      

For visual reasons, we can colour the cells depending on the number of neighbours they have:

  
    def get_colors(status, count):
        cmap = mpl.cm.plasma
        rescale = count / 8
        colors = [cmap(neighbors) for neighbors in rescale.flatten()]
        cell_alive = status.flatten()
        colors = [(r, g, b, 0.9) if cell else (r, g, b, 0) for cell, (r, g, b, a) in zip(cell_alive, colors)]
        return colors
  
      

Now, after spending hours reading and understanding matplotlib.animation's API, we can use the class FuncAnimation and tweak it a tiny bit to create a function that returns a FuncAnimation instance. Basically, what FuncAnimation does (or at least how I understood it) is initialise a figure (the board + initial conditions) and run a core loop that keeps updating and 'redrawing' the points that change at each frame:

   
  def let_there_be_life(prob_life, size=(40, 40)):
      """ Implementing the Game of Life """
      board = create_board(size)
      stat = status(size, prob_life)

      fig, ax = plt.subplots(figsize=(8, 8))
      scatter = ax.scatter(*board, animated=True, s=70, edgecolor=None, c = '#ddc3a5')
      ax.set_facecolor('#ffcce7')
      ax.get_xaxis().set_visible(False)
      ax.get_yaxis().set_visible(False)

      def update(frame):
          nonlocal stat
          stat, alive_neighbors = life_rules(stat)
          colors = get_colors(stat, alive_neighbors)
          scatter.set_facecolor(colors)
          return scatter

      return FuncAnimation(fig, update, frames = 500)
   
       
4. LET THERE BE LIFE!

Now we are ready to run the following code:


  anim = let_there_be_life(prob_life=0.6)

  html = HTML(anim.to_html5_video())

  

and see life emerging, right in front of our eyes:

Optional: we can also save it into a gif:


  import time
  from matplotlib.animation import PillowWriter

  # writer used to encode the gif (see note below)
  writer = PillowWriter(fps=25)

  tic = time.perf_counter()
  anim.save("game_of_life.gif", writer=writer)
  toc = time.perf_counter()

  print(f"Produced the gif in {toc - tic:0.4f} seconds")

  

Here we use the writer writer = PillowWriter(fps=25); writer = 'imagemagick' would also work if ImageMagick is installed.

  
Produced the gif in 90.1506 seconds
  
    
5. Further development:

My teacher initially advised me to use Bokeh as it is more web-friendly, but I was too lazy I guess... or maybe preferred to stick with something I was more familiar with. I would gladly welcome any feedback/remark/improvement in that direction!

I am also currently working on deploying it as a web app using FastAPI or Flask. As producing the actual animation in matplotlib takes some time (up to 2 minutes), that route doesn't really seem that promising, and it looks as if it would be easier to do most of the work on the front end... let's see!

If you are interested in the codes I used, you can find everything in

  • my Github repository : link
  • my Colab's notebook : Open In Colab

Have questions, comments, feedback? Feel free to reach out!