[Guest post] Part 3. Step by step - How to train an object classifier: understanding Computer Vision techniques with Python and OpenCV

In the previous post I explained how to create your own image detector with TensorFlow. It is important to distinguish between an image classifier and an image detector. So, what is the difference between object detection and object recognition? Recognition simply means establishing whether an image contains a specific object or not, while detection also requires the position of the object within the image. For example, suppose an input image contains cars, traffic lights, people, dogs, etc. The recognition task is to tell which of those objects are contained in the image.

In this new post we will explain step by step how to create your own image classifier, rather than an image detector as we did in the previous post. This time we will not use third-party technologies. If you have a reasonably powerful computer with a GPU such as an Nvidia GTX 650 or newer, you will not have any problems, but even if your computer is not very powerful you can follow this post and build a classifier without problems.

How to set up your virtual machine on Linux with Python and OpenCV

The classifier can be developed in whichever environment we prefer. The options are:

Data Science Virtual Machine for Linux (Ubuntu) on Azure

The first option is to use a preconfigured virtual machine in Azure, deployed as in the first post.

This time we can do the same, but we will change the OS: instead of Windows, we will develop everything on Linux. Why? My intention is to show you that we can carry out our machine learning projects with Python on both Windows 10 and Ubuntu Linux without problems. Thanks to Anaconda we can easily create environments that let us run the same processes on both systems. You can see more information here:



Image Data Science Virtual Machine Linux (Ubuntu) summary

We can see that this VM supports many of the current deep learning frameworks and comes with CUDA and cuDNN installed, although often without their paths configured. In the first post we explained how to configure the paths of these tools on our PC. For this project we only need a Python 2.7 environment, whereas this VM uses a Python 3.5 environment by default. Also, we do not necessarily need a graphics card to develop this artwork classifier. Therefore, developing this classifier on a virtual machine of this size is neither the best nor the most cost-effective idea.

If we do choose this option, there are two ways to develop the classifier project. One is to connect directly to the virtual machine with PuTTY and use the command line to organize files and launch the commands that configure Anaconda and the rest of the project. The other is to connect with X2Go, which provides a fairly friendly UI close to a desktop interface. Feel free to choose!

Option X2Go

We need to create a new session first

Then we need to fill this form:

  1. Host: The host that Azure gives us to connect
  2. User: The user that you assigned when you created your machine on Azure
  3. Port SSH: 22
  4. Type of session: Change KDE to XFCE

Now that we have X2Go with our session set up, we can start our project!

The first thing we must do is configure our Python environment with the libraries that our project files need. To create the environment, we can launch this command:

But that command does not specify any libraries or the version of Python we want, and this project needs Python 2.7 rather than the latest version. Therefore we need to write this command:

Once our environment is configured in Anaconda with Python 2.7, we will be able to start running our .py files, which will be explained later.

Our PC or Laptop

To carry out the project from your PC, first you must download and install Anaconda:


It is very easy to get started with Python. Thanks to Anaconda we can set up our environments easily, in the same way that we explained earlier. Here is a link to the Anaconda documentation, which explains how to configure your environments:


Download this tutorial's repository from GitHub

Download the full repository located on this page: scroll to the top, click Clone or Download, and extract all the contents directly into the C:\ directory. This establishes a specific directory structure that will be used for the rest of the post. At this point, your project folder should look like this:

Image GitHub repository solutions. API, App, Object Detection and Object classifier.


We will explain the first two solutions (API and App) in the last sections of this tutorial. The last one is the one that we will explain in more detail.

Artworks Classifier Solution

Image many artworks of the classifier

This folder contains the images of 20 artworks and the Python files needed to train the artwork classifier. It also contains a CoreML model, generated with the coremltools library, in case you want to use it in an iOS project with Xcode.

If you want to practice training your own artwork classifier, you can leave all the files as they are. You can follow along with this tutorial to see how each of the files was generated, and then run the training. You will still need to generate the K-Means cluster model and the SVM model as described in the next steps.

If you want to train your own object classifier, delete the following files (do not delete the folders):

  • All files in \images\train

Now you are ready to start from scratch and train your own object classifier. This post will assume that all the files listed above are missing and will explain how to generate them for your own training dataset. I also cover several topics that are needed to complete this tutorial; knowing them makes it easier to understand the reasons behind each process. These include:

  • Machine learning techniques
  • Algorithms
  • Image descriptors
  • Histograms and color histograms for artworks
  • K-Means Clustering
  • Bag of Visual Words

Introduction to Machine Learning Techniques (Computer Vision)

Machine learning is a branch of computer science, specifically of Artificial Intelligence. Within AI, the goal of machine learning is to create methods that give computers the ability to learn.

Within machine learning we find a subarea that we will use to create this project, called Computer Vision. It covers acquiring, processing, analyzing and understanding real-world images in order to translate an image into numerical or symbolic information that a computer can work with. The data is acquired by means such as cameras, scanners, multidimensional data, etc.

Today this discipline is used for different purposes, such as:

  • Object detection
  • Video analysis
  • 3D Vision

We will only focus on the first option, detecting objects in an image. Object detection is the part of computer vision that finds objects in an image based on their visual appearance. It can be split into two phases:

  • Extraction of image features
    • Consists of obtaining mathematical models that summarize the content of the image. These features are also called descriptors; we will cover them in more detail in later sections.
  • Search for objects based on these features
    • To search for objects, we have to classify them. There are several machine learning algorithms that allow us to assign an identifier or label to an image. We will discuss them in later sections.

Image classification is therefore the task of assigning an image a label from a predefined set of categories. That is, given an input image, our task is to analyze it and return the label that categorizes it.

For example, if we run inference on our classifier with this image:

Image “Las Meninas”

It would have to return:

Label: Las Meninas

More formally, given the previous input image of W x H pixels with three channels, R (Red), G (Green) and B (Blue), our objective is to take the W x H x 3 = N pixel values and work out how to accurately classify the contents of the image.
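To make the W x H x 3 = N count concrete, here is a minimal sketch with NumPy. The image size and the random pixel values are assumptions chosen only for illustration; a real photo of a painting would be loaded from disk instead.

```python
import numpy as np

# Hypothetical image size, chosen only for illustration.
width, height = 640, 480

# An RGB image is a height x width x 3 array of 8-bit intensities (0-255).
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

# Flattening gives the N = W x H x 3 raw values the classifier has to interpret.
pixels = image.reshape(-1)
n_values = pixels.size
print(image.shape, n_values)
```

Note that NumPy (and OpenCV) index images as rows first, so the array shape is (H, W, 3) even though we usually say "W x H".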

In addition, we must keep in mind that semantics matter in computer vision. For a human it is trivial to see that the previous image is a painting, specifically Las Meninas. For a computer it is not trivial, nor does it have any way of knowing. To analyze images, the computer distinguishes three main properties of every image:

  • Spatial environment
  • Color
  • Texture

These properties are encoded thanks to what we previously called descriptors; each descriptor specializes in space, color, texture, etc. Finally, with these characterizations of space, texture and color, the computer can apply machine learning to learn what each type of image looks like, in our case, different paintings by different artists.

For this we must also understand how computers represent images in order to analyze them. For example, if we feed the Mona Lisa (La Gioconda) into our classifier, it would represent it like this:

Image Gioconda and her feature descriptors

Another aspect to investigate when building our painting classifier is how an image or object can appear in a photo: different viewpoints of a painting, different sizes, deformations of the image, lighting and occlusions.

Types of learning

During the research prior to the project we ran into this question: what kind of learning do we want for our project? There are three types of learning:

  • Supervised
    • We have the image data (in image format or as extracted feature vectors) together with the category label associated with each image, so that we can teach our algorithm what each image category looks like.
  • Unsupervised
    • All we have is the image data itself. We do not have labels or associated categories that we can use to teach our algorithm to make accurate predictions or classifications.
  • Semi-supervised
    • Tries to be a middle ground between the previous two. We have a small set of labelled image data and use that labelled information to generate more training data from the unlabelled data.

In the end we opted for supervised learning, since our system has an image dataset of 20 paintings where the category or label is the name of the painting and each category contains 21 replicas of that painting. Our training data would look something like this:

Label               Features vector
Las Meninas         […]
3 de Mayo           […]
Maja desnuda        […]
Noche estrellada    […]
Mona Lisa           […]

A feature vector looks something like this:

  • Starry Night (Vincent Van Gogh):

Pipeline of the image classifier

Having seen how images are represented and how to approach our learning, we need to work out how to split the construction of our painting classifier into steps. One of the most common pipelines is the following, in which we distinguish 5 phases:

Phase 1: Structure our initial dataset

We need to create our categories, each with its images and specific label. In this project it would look like this:

Categories = {Las Meninas, 3 de Mayo, Gioconda, Maja Desnuda, Starry Night, …}

Phase 2: Splitting our dataset (train and test)

Once we have our initial dataset, we must divide it into two parts: a training set and an evaluation (test) set. Our classifier uses the training set to "learn" what each category looks like by making predictions about the input data and then correcting itself when the predictions are wrong. After the classifier has been trained, we can evaluate its performance on the test set.
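The split can be sketched with scikit-learn's train_test_split. The 20 categories with 21 images each come from this project; the 64-dimensional random feature vectors and the 75/25 ratio are assumptions made only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 20 categories x 21 images, each as a 64-dim feature vector.
rng = np.random.default_rng(seed=0)
n_classes, per_class, n_features = 20, 21, 64
X = rng.normal(size=(n_classes * per_class, n_features))
y = np.repeat(np.arange(n_classes), per_class)

# Hold out 25% for evaluation; stratify keeps every artwork in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying matters with small categories like these: without it, a random split could leave a painting with very few test images.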

Algorithms like the Random Forest classifier we use in this project have many configurable parameters which, if set well, help us obtain optimal performance. These parameters are called hyperparameters.

Phase 3: Extract features

Once we have our final data splits, we will need to extract features to quantify and abstract each image. The most common options according to our earlier research are:

  • Color descriptors
  • Histogram of Oriented Gradients (HOG)
  • Local Binary Patterns (LBP) and binary descriptors (BRIEF, ORB, BRISK, FREAK)
  • Local Invariant Descriptors (SIFT, SURF, RootSIFT)

Phase 4: Train our classification model

Given the feature vectors associated with the training data, we can train our classifier. The objective here is for our classifier to learn how to recognize each of the categories in our labelled data. For this we carried out a comparative study of algorithms such as Support Vector Machine and K-Nearest Neighbor.

Phase 5: Evaluate our classifier

Finally, we must evaluate our trained classifier. We present each of the feature vectors in our test set to the classifier and ask it to predict the label of the corresponding image. Then we tabulate the classifier's predictions for each point in the test set.

These predictions are then compared with the ground-truth labels of our test set, which represent what the category really is. From there we can count the predictions that our classifier got right and calculate aggregate reports such as precision, recall and F-measure, which are used to quantify the performance of our classifier.
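As an illustration of this evaluation step, the following sketch computes precision, recall and F-measure with scikit-learn. The ground-truth labels and predictions are made up for the example; they are not results from this project.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical ground-truth labels and classifier predictions for a tiny test set.
y_true = ["Las Meninas", "Las Meninas", "Mona Lisa",
          "Mona Lisa", "Starry Night", "Starry Night"]
y_pred = ["Las Meninas", "Mona Lisa", "Mona Lisa",
          "Mona Lisa", "Starry Night", "Las Meninas"]

# Macro averaging: compute each metric per category, then take the plain mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(round(precision, 3), round(recall, 3), round(f1, 3))

# Per-category breakdown of the same metrics.
print(classification_report(y_true, y_pred))
```

The per-category report is usually the more useful view, because it shows which paintings the classifier confuses with each other.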

Classification Algorithms (SVM and Random Forest)

The machine learning field offers a great diversity of algorithms that we can use to classify paintings. After a series of tests with SVM, Decision Tree, K-Nearest Neighbor and Random Forest, we chose the last one because of the results it gave when classifying 5 pictorial styles. To classify individual artworks, we used SVM.

Random Forest

We have looked in some detail at how this algorithm works. It was introduced to the scientific community by Leo Breiman in his 2001 article, Random Forests. It is one of those algorithms that many scientists still find hard to believe works so well: some say it is an elegant and straightforward way to perform classification, others that it is just multiple decision trees with a touch of randomness that clearly increases their accuracy.

Given the satisfactory results and the good comments about it, this section explains how the algorithm works. Random Forests are a type of ensemble classification method: instead of using a single classifier, as we did in the tests with SVM and K-NN, they use multiple classifiers that are aggregated into one so-called meta-classifier. In our case we will build multiple decision trees in a forest and then use the forest to make predictions.

As you can see in the previous figure, our random forest consists of multiple grouped decision trees. Each decision tree "votes" on what it believes is the final classification. These votes are tabulated by the meta-classifier, and the category with the most votes is the final classification.

We have to refer to Jensen's Inequality to understand a large part of how Random Forests work. Dietterich's seminal work (2000) details the theory of why ensemble methods can generally obtain greater accuracy than a single model alone. This work depends on Jensen's Inequality, which is known as "diversity" or the "ambiguity decomposition" in the machine learning literature.

Jensen's Inequality states that, for a convex error function, the convex combined (averaged) ensemble will have an error that is less than or equal to the average error of the individual models. An individual model may have an error smaller than the average of all the models, but since there is no criterion that we can use to "select" this model, we can at least be sure that the average of all the models will be no worse than selecting an individual model at random.
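A small numerical check can make this statement tangible. The sketch below assumes squared error as the convex error function and uses made-up noisy "models"; it illustrates the inequality itself, not the behaviour of any particular classifier.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical setup: 25 models each predict 1000 targets with independent noise.
targets = rng.normal(size=1000)
predictions = targets + rng.normal(scale=1.0, size=(25, 1000))


def mse(pred, true):
    """Mean squared error, a convex error function."""
    return float(np.mean((pred - true) ** 2))


# Average of the errors of the individual models.
mean_individual_error = float(np.mean([mse(p, targets) for p in predictions]))

# Error of the averaged (ensemble) prediction.
ensemble_error = mse(predictions.mean(axis=0), targets)

print(ensemble_error, mean_individual_error)
```

With independent noise the gap is large; if all models made the same mistakes, the two numbers would coincide, which is why diversity between the models matters.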

Another crucial factor is bootstrapping, or randomization injection. These classifiers train each individual decision tree on a bootstrap sample of the original training data. Bootstrapping is used to improve the accuracy of machine learning algorithms while reducing the risk of overfitting.

In the following figure we simulate the "voting" performed by the trees of our classifier. We pass an input feature vector through each of the decision trees, collect the class label each tree votes for, and then count the votes to obtain the final classification or prediction.

In conclusion, we have used this classifier because it is an ensemble method that consists of multiple decision trees. Ensemble methods such as Random Forests tend to obtain greater accuracy than other classifiers, since they average the results of each individual model. The reason why this averaging works is Jensen's Inequality.

Randomness is introduced in the selection of training data and in the selection of feature columns when training each tree in the forest. As Breiman discussed in his article Random Forests, performing these two levels of random sampling helps to (1) avoid overfitting and (2) generate a more accurate classifier.
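The two levels of randomness and the voting step can be sketched by hand with scikit-learn decision trees. This is a simplified illustration under assumptions of my own (synthetic data, 25 trees, square-root feature subsets), not the training code of this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for image feature vectors (not the artworks dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(seed=0)
n_trees = 25
trees = []
for i in range(n_trees):
    # Randomness level 1: bootstrap sample (draw rows with replacement).
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Randomness level 2: each split considers a random subset of feature columns.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Every tree votes; the most frequent label per test sample wins.
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

forest_acc = float(np.mean(forest_pred == y_test))
print(forest_acc)
```

In practice you would use sklearn.ensemble.RandomForestClassifier, which wraps these steps; note that it combines trees by averaging their predicted class probabilities rather than counting hard votes.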

Support Vector Machine

SVMs are popular because they have quite solid theoretical foundations. The hyperparameters still need to be tuned, but in general, throwing an SVM at a problem is an effective way to quickly obtain a good result. What gave me trouble, however, was tuning those hyperparameters: to obtain an optimal result you need to experiment with them and fit them as closely as possible to the problem. For this I found scikit-learn's GridSearchCV class, which lets you list the values you want to try for each hyperparameter and automates training with every possible combination, saving a lot of time compared with testing different values one by one.
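Here is a minimal sketch of such a grid search. The synthetic data and the candidate values in the grid are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the feature vectors of the artworks.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)

# Candidate hyperparameter values; this grid is illustrative, not tuned.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01],  # ignored by the linear kernel, still enumerated
}

# Tries every combination (2 x 3 x 2 = 12) with 3-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ is an SVC retrained on all the data with the best combination, so it can be used directly for prediction.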

Types of SVM

We will only explain how the linear SVM works.

Linear separability

In order to explain SVMs, we should first start with the concept of linear separability. A set of data is linearly separable if we can draw a straight line that clearly separates all data points in class #1 from all data points belonging to class #2:

Image. Given our decision boundary, I am more confident that the highlighted square is indeed a square, because it is farther away from the decision boundary than the circle is.

Take a few seconds and examine this plot and convince yourself that there is no way to draw a single straight line that cleanly divides the data points, so all blue squares are on one side of the line and all red circles on the other. Since we cannot do that, this is an example of data points that are not linearly separable.

*Note: As we’ll see later in this lesson, we’ll be able to solve this problem using the kernel trick.

In the case of plots A and B, the line used to separate the data is called the separating hyperplane. In 2D space, this is just a simple line. In a 3D space, we end up with a plane. And in spaces with more than 3 dimensions, we have a hyperplane.

Regardless of whether we have a line, plane, or a hyperplane, this separation is our decision boundary, the boundary we use to decide if a data point is a blue square or a red circle. All data points for a given class will lie on one side of the decision boundary, and all data points for the second class on the other.

Keeping this in mind, wouldn’t it be nice if we could construct a classifier where the farther a point is from the decision boundary, the more confident we are about its prediction?
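A linear SVM gives us exactly this kind of distance-based confidence. The sketch below uses six made-up 2D points and scikit-learn's SVC with a linear kernel; the points and the two query locations are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable 2D classes (hypothetical points).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],    # class 0: "squares"
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])   # class 1: "circles"
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# decision_function returns a signed score; dividing by ||w|| turns it into
# the geometric distance from the separating hyperplane.
queries = np.array([[3.0, 3.0],   # close to the decision boundary
                    [0.0, 0.0]])  # far from the decision boundary
scores = clf.decision_function(queries)
distances = scores / np.linalg.norm(clf.coef_)

print(clf.predict(queries), distances)
```

Both queries are assigned to class 0, but the second one lies much farther from the boundary, so we can be more confident about it.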

Image Descriptors

A very important part of our project, if not the main one, is knowing how to extract image descriptors as well and as efficiently as possible. Before diving into the world of descriptors, we will briefly explain some basic computer vision concepts that many professionals overlook: image descriptors, feature descriptors and feature vectors. These terms are very similar, but it is very important to understand the difference between them.

To begin with, a feature vector is simply a list of numbers used to abstractly quantify the content of an image. Feature vectors are passed on to other computer vision tasks, such as training a machine learning classifier to recognize the contents of an image, or comparing feature vectors for similarity when building an image search engine. To extract feature vectors from an image, we can use image descriptors or feature descriptors.

An image descriptor quantifies the complete image and returns one feature vector per image. Image descriptors tend to be simple and intuitive to understand but may lack the ability to distinguish between different objects in the images.

In contrast, a feature descriptor quantifies many regions of an image, returning multiple feature vectors per image. Feature descriptors tend to be much more powerful than simple image descriptors and more robust to changes in rotation, translation and viewpoint of the input image.

A drawback of feature descriptors is that not only do we have to store several feature vectors per image, which increases our storage overhead, but we also need to apply methods such as Bag of Visual Words to take the multiple feature vectors extracted from an image and condense them into a single feature vector. The Bag of Visual Words technique will be explained in another section.
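To give an idea of how many feature vectors are condensed into one, here is a rough Bag of Visual Words sketch. It assumes a k-means vocabulary (a common choice) and uses random vectors in place of real SIFT or SURF descriptors; the vocabulary size of 10 is also an assumption made to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)

# Stand-in local descriptors: each "image" yields a different number of
# 64-dim vectors (random here; real ones would come from SIFT or SURF).
descriptors_per_image = [rng.normal(size=(n, 64)) for n in (120, 80, 200)]

# 1. Build the visual vocabulary by clustering all descriptors together.
vocab_size = 10  # hypothetical; real vocabularies are usually much larger
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))


def bovw_histogram(descriptors):
    """Map each descriptor to its nearest visual word and count occurrences."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()  # normalize so keypoint count does not matter


# 2. Every image becomes one fixed-length vector, whatever its keypoint count.
image_vectors = np.array([bovw_histogram(d) for d in descriptors_per_image])
print(image_vectors.shape)
```

The resulting fixed-length vectors are what a classifier such as an SVM can be trained on.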

Within the descriptors there are several types, but for our case we will use the local invariant descriptors, consisting mainly of:

  • SURF (Speeded Up Robust Features), keypoint detector Fast Hessian
  • SIFT (Scale-Invariant Feature Transform), keypoint detector DoG

We will explain how each of them works with an example run on paintings, and we will explain the reason for our choice of SIFT through results represented in graphs.

Understanding Local features

The most common approach when analyzing an image is to apply an image descriptor to the complete image, which leaves us with a global quantification of the image. However, a global quantification means that every pixel of the image is included in the calculation of the feature vector, which may not always be appropriate.

Suppose for a second that we were tasked with building a computer vision system to automatically identify artworks. Our system would take a photo of an artwork captured from a mobile device such as an iPhone or Android phone, extract features from the image, and then compare the features to a set of artworks in a database.

Can we use HOG to solve this problem? Or LBPs?

The problem with these descriptors is that they are all global image descriptors. If we use them we will end up quantifying image regions that do not interest us, such as people standing in front of the paintings. Including these regions in our calculation can dramatically skew the output feature vector, and we run the risk of not being able to correctly identify the painting.

The solution is to use local features, where we describe only small local areas of the image that are considered interesting instead of the whole image. These regions must be unique, easily compared and carry some kind of semantic meaning in relation to the contents of the image.

*Note: At the highest level, a “feature” is a region of an image that is both unique and easily recognizable.

Keypoint detection and feature extraction

The process of finding and describing interesting regions of an image is broken down into two phases: keypoint detection and feature extraction.

The first phase is to find the “interesting” regions of an image. These regions could be edges, corners, “blobs”, or regions of an image where the pixel intensities are approximately uniform. There are many different algorithms that we’ll study that can find and detect these “interesting” regions — but in all cases, we call these regions keypoints. At the very core, keypoints are simply the (x, y)-coordinates of the interesting, salient regions of an image.

Then for each of our keypoints, we must describe and quantify the region of the image surrounding the keypoint by extracting a feature vector. This process of extracting multiple feature vectors, one for each keypoint, is called feature extraction. Again, there are many different algorithms we can use for feature extraction, and we’ll be studying many of them in this module.

However, up until this point, we have had a one-to-one correspondence between images and feature vectors. For each input image, we would receive one feature vector out. However, now we are inputting an image and receiving multiple feature vectors out. If we have multiple feature vectors for an image, how do we compare them? And how do we know which ones to compare?

As we’ll find out later in this tutorial, the answer is to use either keypoint matching or the bag-of-visual-words model.

Research on SIFT and SURF

First, we explain what SIFT is. The SIFT descriptor is a lot easier to understand than the Difference of Gaussians (DoG) keypoint detector, also proposed by David Lowe in his 1999 ICCV paper, Object recognition from local scale-invariant features. The SIFT feature description algorithm requires a set of input keypoints. Then, for each of the input keypoints, SIFT takes the 16 x 16-pixel region surrounding the center pixel of the keypoint region.

The SURF descriptor is a feature vector extraction technique introduced by Bay et al. in their 2006 ECCV paper. It is very similar to SIFT, but it has two main advantages over SIFT.

  1. The first advantage is that SURF is faster to calculate than SIFT, so it is more suitable for real-time applications.
  2. The second advantage is that the SURF descriptor is only half the size of the SIFT descriptor: SIFT returns a 128-dim feature vector and SURF returns a 64-dim vector.

Having understood its advantages, we set ourselves the goal of understanding why SURF works and why it obtains such satisfactory results. For this we had to understand a keypoint detector called Fast Hessian. Today, the scientific community ends up calling both the SURF image description algorithm and the Fast Hessian keypoint detector "SURF", which can be confusing.

Initially, before SIFT and SURF were introduced, there was a trend for proposed algorithms to include both a keypoint detector and an image descriptor, so sometimes keypoint detectors and image descriptors share the same name.

But how does the Fast Hessian keypoint detector work?

The motivation for Fast Hessian and SURF came from the slowness of DoG and SIFT. Computer vision researchers wanted a faster keypoint detector and image descriptor.

Fast Hessian is based on the same principles as DoG in that keypoints must be repeatable and recognizable at different scales of an image. However, instead of computing the Difference of Gaussians explicitly as DoG does, Bay proposed approximating that step using Haar wavelets and integral images. We will not go into detail about this technique, but the following image shows how it works in theory:

And from the previous result we build a matrix:

A region is marked as a keypoint if the candidate pixel's score is greater than that of all its neighbors in a 3 x 3 x 3 neighborhood. Unlike SIFT, this time we are only interested in maxima, not both maxima and minima. Fast Hessian is the best choice for our application since it is well suited to a real-time approach. In the following image we ran Fast Hessian on Dalí's "The Persistence of Memory" and extracted its keypoints:

Image Fast Hessian Keypoints of “Persistence of memory”

Once these keypoints are extracted, our image description algorithm, SURF, goes through each of the keypoints and divides the keypoint region into 4 x 4 sub-areas, just like in SIFT.

This is the step where SIFT and SURF begin to differ. For each of these 4 x 4 sub-areas, SURF extracts 5 x 5 sample points using Haar wavelets.

For each extracted cell, both the X and Y directions are processed by Haar wavelets. The results are known as d_x and d_y.

Now that we have d_x and d_y, we weight them with a Gaussian kernel, as in SIFT. Results farther from the center of the keypoint contribute less to the final feature vector, while results closer to the center contribute more. Finally, to produce the feature vector for each of the 4 x 4 sub-areas, SURF computes this formula:

Therefore, as we said at the beginning, we have 4 x 4 = 16 sub-areas, each returning a 4-dim vector. These 4-dim feature vectors are concatenated with each other, giving the 16 x 4 = 64-dim SURF feature vector. Below are results with paintings by Velázquez (Las Meninas) and Da Vinci (The Last Supper).

Image SURF keypoints of “Las Meninas”

Image SURF Keypoints of “The Last Supper”
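The way the 64-dim SURF vector is assembled can be sketched with NumPy. In the SURF paper by Bay et al., each sub-area is summarized by the four sums v = (Σ d_x, Σ d_y, Σ |d_x|, Σ |d_y|). The wavelet responses below are random stand-ins, assumed to be already Gaussian-weighted, so this shows only the bookkeeping, not a full SURF implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in Haar wavelet responses: 4 x 4 sub-areas, each with 5 x 5 sample
# points (random here; real ones come from the image around a keypoint).
d_x = rng.normal(size=(4, 4, 5, 5))
d_y = rng.normal(size=(4, 4, 5, 5))

# Per sub-area: v = (sum d_x, sum d_y, sum |d_x|, sum |d_y|) -> 4 values.
sub_vectors = np.stack(
    [
        d_x.sum(axis=(2, 3)),
        d_y.sum(axis=(2, 3)),
        np.abs(d_x).sum(axis=(2, 3)),
        np.abs(d_y).sum(axis=(2, 3)),
    ],
    axis=-1,
)  # shape: (4, 4, 4)

# Concatenate the 16 sub-area vectors into the final 64-dim descriptor.
descriptor = sub_vectors.reshape(-1)
descriptor = descriptor / np.linalg.norm(descriptor)  # scale to unit length
print(descriptor.shape)
```

Scaling to unit length makes the descriptor less sensitive to changes in image contrast.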

Histograms and color histograms for artworks

Among the fundamental concepts in computer vision, histograms deserve special attention. So, what exactly is a histogram? A histogram represents the distribution of pixel intensities (whether color or grayscale) in an image. It can be visualized as a graph (or plot) that gives a high-level intuition of the intensity (pixel value) distribution. We are going to assume an RGB color space in this example, so these pixel values will be in the range of 0 to 255. And now, what is a color histogram? Simply put, a color histogram counts the number of times a given pixel intensity (or range of pixel intensities) occurs in an image. Using a color histogram, we can express the actual distribution or “amount” of each color in an image. The counts for each color/color range are then used as our feature vector.
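A color histogram feature vector can be sketched in a few lines with NumPy. The random image and the choice of 8 bins per channel are assumptions for illustration.

```python
import numpy as np

# Synthetic RGB image standing in for a photo of a painting.
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)

n_bins = 8  # hypothetical bin count: each bin covers 32 intensity levels

# One histogram per channel: how many pixels fall into each intensity range.
channel_hists = [
    np.histogram(image[:, :, c], bins=n_bins, range=(0, 256))[0]
    for c in range(3)
]

# Concatenate and normalize to get a feature vector independent of image size.
feature_vector = np.concatenate(channel_hists).astype(float)
feature_vector /= feature_vector.sum()
print(feature_vector.shape)
```

With OpenCV, cv2.calcHist computes the same kind of counts directly from an image loaded with cv2.imread.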

Image Athens School artwork

Image Histogram of Athens School artwork

We see a sharp peak in the green and red histograms around bin 200. This end of the histogram shows that the men on the right, wearing togas with "green" and "red" tones, contain many pixels in that color range. We also see that red, green and blue pixels are spread across the whole range, except at the end, where blue decreases.

*Note: I’m intentionally revealing which image I used to generate this histogram. I’m simply demonstrating my thought process as I look at a histogram. Being able to interpret and understand the data you are looking at, without necessarily knowing its source, is a good skill to have in computer vision.

Now in the image below we can see a plot of 2D color histograms computed for each combination of the red, green, and blue channels. The first is a 2D color histogram for the Green and Blue channels, the second for Green and Red, and the third for Blue and Red.

Image computing 2d color histograms for each combination of the red, green, and blue channels.
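The three 2D histograms can be computed as follows. This is a sketch with NumPy on a random image; the 32 bins per axis are an assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic RGB image standing in for a photo of a painting.
image = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
red, green, blue = (image[:, :, c].ravel() for c in range(3))

n_bins = 32  # hypothetical; 32 x 32 = 1024 cells per channel pair


def hist_2d(a, b):
    """Count pixels for every (a-intensity, b-intensity) bin combination."""
    counts, _, _ = np.histogram2d(a, b, bins=n_bins, range=[[0, 256], [0, 256]])
    return counts


pairs = {
    "green-blue": hist_2d(green, blue),
    "green-red": hist_2d(green, red),
    "blue-red": hist_2d(blue, red),
}
for name, counts in pairs.items():
    print(name, counts.shape, int(counts.sum()))
```

Each cell answers a question such as "how many pixels have a green value in this range and a blue value in that range?", which a pair of 1D histograms cannot tell us.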

K-Means Clustering

Once we understand how we extract the color histograms thanks to K-means we can group colors and know which the predominant ones in each painting, artist or pictorial style are. In addition to relating a feeling with the colors that we extract from a painting, such as happy or sad.

There are many clustering algorithms in ML, but after some investigation I found that k-means is the most popular, most used and easiest to understand. K-means is an unsupervised learning algorithm (no label/category information is associated with the images/feature vectors); I explained this concept earlier in the Machine learning techniques chapter.

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points. The algorithm is called K-means because it finds “K” unique clusters, where the center of each cluster (the centroid) is the mean of all values in the cluster. The overall goal of k-means is to put similar points in the same cluster and dissimilar points in different clusters. For more information about clustering techniques, follow this link.

Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as k-means clustering, which is implemented in sklearn.cluster.KMeans.

Applications of k-means in computer vision

In my project, the k-means algorithm is also used to extract the dominant colors of artworks, for example the top 5 colors used by Felix Ziem in his 20 most relevant artworks. We can also extract features for a single artwork or for a whole style of art just by changing the dataset.

Image example of artwork of Ziem Felix (Romanticism)

Here I show you just one picture by Felix Ziem, but we can see the same colors in almost all of them. Intuitively we can tell which colors predominate in this author's work, but thanks to K-means we can obtain the exact colors, like the following:

Image Top 4 Colors predominant in Felix Ziem artworks

You may think this is complicated, but with a few lines of Python we get it quickly. We only need to take the raw pixel intensities of the image dataset as our data points, pass them to k-means, and let the clustering algorithm determine the dominant colors. That’s all!
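As a rough illustration of that idea, here is a small NumPy sketch of the k-means loop applied to pixel colors. It is not the code from my repository: in practice you would probably use `sklearn.cluster.KMeans`, and the function name `dominant_colors`, the fixed number of iterations and the simple farthest-point initialization are assumptions made to keep the example short and deterministic.

```python
import numpy as np

def dominant_colors(pixels, k, n_iter=20):
    """Cluster an (N, 3) array of pixel colors and return the k cluster
    centers ordered from the most to the least frequent."""
    pixels = np.asarray(pixels, dtype=float)

    # Simplified initialization (an assumption of this sketch): start from
    # the first pixel, then repeatedly add the pixel farthest from the
    # centroids chosen so far.
    centroids = [pixels[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(pixels - c, axis=1) for c in centroids], axis=0)
        centroids.append(pixels[nearest.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assign every pixel to its closest centroid.
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the pixels assigned to it.
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # keep the old centroid if the cluster is empty
                centroids[j] = members.mean(axis=0)

    counts = np.bincount(labels, minlength=k)
    order = np.argsort(-counts)
    return centroids[order], counts[order]

# Synthetic "image": 30 reddish pixels and 10 bluish pixels.
pixels = np.array([[200, 30, 30]] * 30 + [[20, 20, 220]] * 10)
colors, counts = dominant_colors(pixels, k=2)
```

For a real image loaded as an H x W x 3 array, the pixels can be obtained with `image.reshape(-1, 3)` before calling the function.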

Example of extracting the top 3 colors from a single artwork:

Image Vincent Van Gogh artwork Top 3 [n_clusters] colors (Cafe Terrace)

Image Salvador Dali artwork Top 5 [n_clusters] colors (Persistence of memory)

In addition, according to some articles on color analysis that I read, several modern artists and psychologists have in many cases managed to infer the feelings and emotions behind artists or pictorial styles. Charts like the one below associate colors with feelings and attitudes, and with the colors extracted by K-means we can relate a painting to this table.

Image example plot of color emotions

However, the most popular usage of k-means in computer vision/machine learning is the bag-of-visual words (BOVW) model where we:

  • Extract SIFT/SURF (or other local invariant feature vectors) from a dataset of images.
  • Cluster the SIFT/SURF features to form a “codebook”.
  • Quantize the SIFT/SURF features from each image into a histogram that counts the number of times each “visual word” appears.

Classification with Bag of Visual Words

This section is one of the most important! In the previous sections we explained several key computer vision concepts, which we now need to understand the process behind our image classifier, specifically for artworks. But before explaining the classifier code, I will first explain what Bag of Visual Words means.

What is BOVW?

To explain what this technique consists of, I would like to start with the world of natural language processing (NLP), where the goal is to compare and analyze multiple documents. Each document contains many different words in a certain order. With this technique we ignore the order and simply throw the words in a "bag". Once all the words are inside the bag, we can count the occurrences of each word. In the end, each document becomes a histogram of word counts that can be used as features for an ML process.
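The text version of the idea fits in a few lines of standard-library Python. The function name `bag_of_words`, the vocabulary and the two toy documents are made up for this sketch.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Turn a document into a histogram of word counts, ignoring word order."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "cat", "dog", "sat"]
hist_a = bag_of_words("The cat sat", vocabulary)
hist_b = bag_of_words("the dog sat the dog", vocabulary)
```

Both documents end up as vectors of the same length (one entry per vocabulary word), which is what lets us feed them to a classifier even though the documents have different lengths.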

Visual Words

In the section on image descriptors I explained what they are and how they are generated (SIFT, SURF). If we think of an image as a document of "words" generated by SIFT, we can extend the bag-of-words model to classify images instead of text documents. We can imagine that a SIFT descriptor or "visual word" represents an object in the picture; for example, a descriptor could be the eye of the Mona Lisa.

These SIFT descriptors vary, so we need a grouping method that puts words representing the same thing together. For example, all the features of the Gioconda's eye should go to the same cluster or bin.

However, SIFT features are not as literal as "eye of the Gioconda". We cannot group them according to a human definition, only mathematically. SIFT descriptors are 128-dimensional vectors, so we can simply build a matrix with each SIFT descriptor in our training set as its own row and 128 columns, one for each dimension of the SIFT features. Once this is done, we feed that matrix to a clustering algorithm, such as the K-means explained above.

Image Example Bag Visual Words

Next we go through each individual image, and assign all of its SIFT descriptors to the bin they belong in. All the “eye” SIFT descriptors will be converted from a 128-dimensional SIFT vector to a bin label like “eye” or “Bin number 4”. Finally we make a histogram for each image by summing the number of features for each codeword. For example, with K=3, we might get a total of 1 eye feature, 3 mouth features, and 5 bridge features for image number 1, a different distribution for image number 2, and so on.
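The quantization step can be sketched with NumPy as follows. To keep it readable I use toy 2-dimensional "descriptors" instead of real 128-dimensional SIFT vectors, and a hand-written codebook of K=3 centers instead of a trained K-means model; the name `bow_histogram` is also just an assumption for this example.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count the
    occurrences of every codeword in the image."""
    # Distance from every descriptor to every codeword: shape (n_desc, K).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))

# Toy codebook: bin 0 ("eye"), bin 1 ("mouth"), bin 2 ("bridge").
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

# Toy descriptors of one image, scattered around the three codewords.
descriptors = np.array([
    [0.5, 0.2],
    [9.0, 1.0], [10.5, -0.3], [9.7, 0.4],
    [0.1, 9.0], [1.0, 11.0], [-0.5, 10.2], [0.3, 9.9], [0.0, 8.5],
])
hist = bow_histogram(descriptors, codebook)
```

Here `hist` reproduces the example from the text: 1 "eye" feature, 3 "mouth" features and 5 "bridge" features. A trained scikit-learn K-means model does the nearest-center assignment for you through its `predict` method, which is what the API code later in this post relies on.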

*Note: Remember, this is just a metaphor: real SIFT feature clusters won’t have such a human-meaningful definition.

At this point we have converted images with varying numbers of SIFT features into K features. We can feed the matrix of M observations and K features into a classifier like Random Forest, AdaBoost or SVC as our X, image labels as our y, and it ought to be able to predict image labels from images with some degree of accuracy.

But how can I know how many bins I need? And what is K?

For this artworks classifier, I performed a grid search across a range of K values and compared the classifier scores for each K. The code is on GitHub and references other files in the repo.

*Note: The best way to choose K for K-means depends on your project and on how long a grid search takes; you might want to try different methods.

Prepare our classifier

Now that we understand the theory, we move on to the practical part: training our classifier and obtaining the model that we will use in our mobile application developed in Xamarin. The first thing to do is to understand each of the files that make up our object classifier solution.

Image repository files

We can see that it consists of an images folder and three Python files:

  • py
  • py
  • py

We will not have to modify any of the code; to classify a different kind of images, you only need to change the dataset.

Image Dataset folder


From the project directory, issue the following command to begin training:

 python detector_K_gridsearch.py --train_path images/train

If everything has been set up correctly, python will initialize the training. When training begins, it will look like this:

Image Reading all files from the different artwork folders

After reading all the files, the program starts to extract feature descriptors for each image. The image below shows the example of the Moulin Gallete artwork: for the 21 images we have of this painting, the feature descriptors have now been extracted:

Image Array of descriptors of Moulin Gallete Artwork

When we have finished generating the descriptors of each image, we gather the SIFT descriptors of every group of images, perform the clustering and start creating our codebook (the vocabulary of visual words). The codebook grows with K; thanks to GridSearchCV we can run several tests with different values of K to see which one works best for our bag of visual words. We will try K = 50, K = 150, K = 300 and K = 500 words.

Image Codebook K=50

Image Codebook K=150

Image Codebook K=300

Image Codebook K=500

Image Comparison of GridSearchCV results for each value of “K” between SVM and AdaBoost

At the end of the process, thanks to GridSearchCV, the best model (in this case the SVM, with an accuracy of 0.91) is saved automatically, together with our cluster model. Both files are needed for our API.

Python App on Azure Web App with Flask

I will explain the steps you need to follow to successfully build your Python web app with Flask on Azure:

  1. Read this doc: /es-es/visualstudio/python/publishing-python-web-applications-to-azure-from-visual-studio
  2. Read this doc: /es-es/visualstudio/python/managing-python-on-azure-app-service

These docs will help you deploy to Azure successfully. Below I attach an image of my Flask API solution structure. But first, we need to create a Flask project in Visual Studio.

Image Flask Web Project template

Image API Solution files

As you can see, we have several files, but many of them are only needed to start up our solution in the Azure web app. Notice the folder called “pickle”, which holds the classifier model trained with the SVM algorithm and the cluster model created with the Bag of Visual Words technique and K-means clustering.

Image Python environment in VS17

In addition, from Visual Studio itself we can configure our solution in the “Python Environment” option, choose the Python version we want and add the libraries that matter for our project, such as:

Image requirements.txt file

Do not worry about creating the requirements.txt file, because everything is in the GitHub repository. Next comes a series of images showing the web app in Azure and the options that we must enable before uploading our solution.

Image Azure Python Web App with Flask

Image set up python environment version

Image Add Python extension to our web app

Image Location of python extension with Kudu


As I mentioned before, to upload our solution to Azure correctly we must create a few files, such as runtime.txt. In that file you only need to write the Python version you want to use, for example python-3.4.


Another file is runserver.py, which plays the role of our index.html (the entry point of the app). You only need to reference it in your web.config in this line:

 <add key="WSGI_HANDLER" value="runserver.app"/>

*Note: Remember to write the module name followed by .app (the Flask application object defined in runserver.py).

<?xml version="1.0" encoding="utf-8"?>
<!--
This template is configured to use Python 3.5 on Azure App Service. To use a different version of Python,
or to use a hosting service other than Azure, replace the scriptProcessor path below with the path given
to you by wfastcgi-enable or your provider.

For Python 2.7 on Azure App Service, the path is "D:\home\Python27\python.exe|D:\home\Python27\wfastcgi.py"

The WSGI_HANDLER variable should be an importable variable or function (if followed by '()') that returns
your WSGI object.

See https://aka.ms/PythonOnAppService for more information.
-->
<configuration>
  <appSettings>
    <add key="PYTHONPATH" value="D:\home\site\wwwroot"/>
    <!-- The handler here is specific to Bottle; other frameworks vary. -->
    <add key="WSGI_HANDLER" value="runserver.app"/>
    <add key="WSGI_LOG" value="D:\home\LogFiles\wfastcgi.log"/>
  </appSettings>
  <system.webServer>
    <httpErrors errorMode="Detailed"></httpErrors>
    <handlers>
      <add name="PythonHandler" path="*" verb="*" modules="FastCgiModule"
           scriptProcessor="D:\home\Python362x64\python.exe|D:\home\Python362x64\wfastcgi.py"
           resourceType="Unspecified" requireAccess="Script"/>
    </handlers>
  </system.webServer>
</configuration>


Finally, you only need to add another web.config in ProyectNameFolder/static/web.config with the following content:

<?xml version="1.0" encoding="utf-8"?>
<!--
This template removes any existing handler so that the default handlers will be used for this directory.
Only handlers added by the other web.config templates, or with the name set to PythonHandler, are removed.

See https://aka.ms/PythonOnAppService for more information.
-->
<configuration>
  <system.webServer>
    <handlers>
      <remove name="PythonHandler"/>
    </handlers>
  </system.webServer>
</configuration>

When all these files are correct in VS, publish the project, then go to your Azure web app, open Development Tools/Extensions and install the Python environment that you would like to use.

Then go to Kudu. The second link above explains very well how to finish the deployment: you need to install all the requirements listed in your requirements.txt in your web app via Kudu. You only need to locate the Python environment that you previously installed in your web app; in my case the location in Kudu is:


In these lines of code, I import the libraries used by my solution:

from flask import Flask, request, render_template, jsonify
from werkzeug.utils import secure_filename
import logging
import sys
import cv2
import numpy as np
from sklearn.externals import joblib  # on scikit-learn >= 0.23 use "import joblib" instead
import os
import pickle
import json

Here I declare variables, the paths to my models (SVM and cluster model) and the log messages shown during execution:

app = Flask(__name__)
ALLOWED_EXTENSIONS = ['png', 'jpg', 'jpeg']
APP_DIR = os.path.dirname(os.path.realpath(__file__))
PICKLE_DIR = os.path.abspath(os.path.join(APP_DIR, './pickles/'))
LOG_PATH = os.path.abspath(os.path.join(APP_DIR, '../art_app.log'))
# NOTE: MAX_PIXEL_DIM, used by the resize code below, must also be defined here;
# its value is not shown in this post.
logging.basicConfig(filename=LOG_PATH, level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]')
# Log only after the app object exists and logging is configured.
app.logger.info('\n\n* * *\n\nOpenCV version is %s. should be at least 3.1.0, with nonfree installed.' % cv2.__version__)

In these lines I load my models from their respective paths in the solution:

# TODO: make pickles for kmeans and single best classifier.
# The path parts are joined separately so the code also works outside Windows.
filename = os.path.join(PICKLE_DIR, 'svc', 'svc.pickle')
with open(filename, 'rb') as f:
    model_svc = pickle.load(f, encoding='latin1')
filename = os.path.join(PICKLE_DIR, 'cluster_model', 'cluster_model.pickle')
with open(filename, 'rb') as f:
    clust_model = pickle.load(f, encoding='latin1')
cluster_model = clust_model
clf = model_svc

Function that checks the extension of the files sent to my API:

def allowed_file(filename):
    # Accept only file names whose extension is listed in ALLOWED_EXTENSIONS.
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

# upload image with curl using:
# curl -F 'file=@/home/' ''

Function to resize the input files:

def img_resize(img):
    height, width, _ = img.shape
    if height > width:
        # too tall
        resize_ratio = float(MAX_PIXEL_DIM)/height
    else:
        # too wide, or a square which is too big
        resize_ratio = float(MAX_PIXEL_DIM)/width

    # cv2.resize expects the target size as (width, height)
    dim = (int(resize_ratio*width), int(resize_ratio*height))
    resized = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)
    app.logger.debug('resized to %s' % str(resized.shape))
    return resized

This is the most complex function: once we have the file and have resized it, we extract the features of the picture, run them through our cluster model, and obtain the vector that we pass to the prediction function, which is in charge of running inference on our SVM model:

def img_to_vect(img_np):
    """
    Given an image (as a NumPy array) and a trained clustering model (eg KMeans),
    generates a feature vector representing that image.
    Useful for processing new images for a classifier prediction.
    """
    # img = read_image(img_path)
    height, width, _ = img_np.shape
    app.logger.debug('Color image size - H:%i, W:%i' % (height, width))
    if height > MAX_PIXEL_DIM or width > MAX_PIXEL_DIM:
        img_np = img_resize(img_np)
    gray = cv2.cvtColor(img_np, cv2.COLOR_BGR2GRAY)
    sift = cv2.xfeatures2d.SIFT_create()
    kp, desc = sift.detectAndCompute(gray, None)

    # map every SIFT descriptor to its visual word, then count the words
    clustered_desc = cluster_model.predict(desc)
    img_bow_hist = np.bincount(clustered_desc, minlength=cluster_model.n_clusters)

    # reshape to an array containing 1 array: array[[1,2,3]]
    # to make sklearn happy (it doesn't like 1d arrays as data!)
    return img_bow_hist.reshape(1,-1)

Finally, the prediction function takes the result of img_to_vect, runs inference on our SVM model and returns the result to the client:

def prediction(img_str):
    # np.frombuffer replaces the deprecated np.fromstring for binary data
    nparr = np.frombuffer(img_str, np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    # convert to K-vector of codeword frequencies
    img_vect = img_to_vect(img_np)
    prediction = clf.predict(img_vect)
    return prediction[0]

And here is our only API endpoint:

@app.route('/predict', methods=['POST'])
def home():
    if request.method == 'POST':
        f = request.files['file']
        if f and allowed_file(f.filename):
            filename = secure_filename(f.filename)
            app.logger.debug('got file called %s' % filename)
            lb = prediction(f.read())
            print(lb)
            json_result = json.dumps(lb.decode("utf-8"))
            return json_result
        return 'Error. Something went wrong.'
    return render_template('img_upload.jnj')

if __name__ == "__main__":
    app.run()  # arguments are not shown in the original post; Flask defaults assumed

Demo with Postman (Azure Web App and local):
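If you prefer to test the endpoint from Python instead of Postman, the request can be sketched with the standard library alone. This is only an illustration: the helper name `build_upload_request` is made up, the URL is a hypothetical local address (replace it with your own web app URL), and the bytes are a placeholder instead of a real image. The only things taken from the API code above are the `/predict` route and the `file` form field.

```python
import uuid
import urllib.request

def build_upload_request(url, image_bytes, filename="image.jpg"):
    """Build (but do not send) a multipart/form-data POST with a 'file' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        b"--" + boundary.encode() + b"\r\n",
        b'Content-Disposition: form-data; name="file"; filename="'
        + filename.encode() + b'"\r\n',
        b"Content-Type: image/jpeg\r\n\r\n",
        image_bytes,
        b"\r\n--" + boundary.encode() + b"--\r\n",
    ])
    headers = {"Content-Type": "multipart/form-data; boundary=" + boundary}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# Hypothetical local URL and placeholder bytes, for illustration only.
req = build_upload_request("http://localhost:5000/predict", b"fake image bytes")

# To actually send it against a running API:
# print(urllib.request.urlopen(req).read())
```

With a real image you would pass the bytes read from the file, for example `open(path, "rb").read()`, and the response body would be the JSON string returned by the endpoint.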

Create the mobile app on Azure Mobile App with Xamarin

The Xamarin app I developed has not been updated to .NET Standard, but I don't think the PCL (Portable Class Library) strategy will cause any problems 😊. The solution is very simple: we have a single user interface page, to which we add the controls needed for the buttons, which are: pick an image from the camera roll, take a photo, and classify.

<?xml version="1.0" encoding="utf-8" ?>
<ContentPage xmlns="http://xamarin.com/schemas/2014/forms"
             xmlns:x="http://schemas.microsoft.com/winfx/2009/xaml"
             xmlns:local="clr-namespace:CustomVision"
             x:Class="CustomVision.MainPage">
    <StackLayout Padding="20">
        <Image x:Name="Img" Source="ic_launcher.png" MinimumHeightRequest="100" MinimumWidthRequest="100" />
        <Button Text="Elegir imagen" Clicked="ElegirImage" />
        <Button Text="Sacar foto" Clicked="TomarFoto" />
        <Button Text="Clasificar" Clicked="Clasificar" />
        <Label x:Name="ResponseLabel" />
        <ProgressBar x:Name="Accuracy" HeightRequest="20"/>
    </StackLayout>
</ContentPage>

In addition, in the code-behind of this same XAML page we have the event handlers for each button:

using Newtonsoft.Json;
using Plugin.Media;
using Plugin.Media.Abstractions;
using System;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using Xamarin.Forms;

namespace CustomVision
{
    public partial class MainPage : ContentPage
    {
        public const string ServiceApiUrl = "https://YOUR_WEB_APP_URL ";
        private MediaFile _foto = null;

        public MainPage()
        {
            // Body not shown in the original listing; every XAML page needs this call.
            InitializeComponent();
        }

        private async void ElegirImage(object sender, EventArgs e)
        {
            await CrossMedia.Current.Initialize();
            _foto = await Plugin.Media.CrossMedia.Current.PickPhotoAsync(new PickMediaOptions());
            Img.Source = FileImageSource.FromFile(_foto.Path);
        }

        private async void TomarFoto(object sender, EventArgs e)
        {
            await CrossMedia.Current.Initialize();
            if (!CrossMedia.Current.IsCameraAvailable || !CrossMedia.Current.IsTakePhotoSupported)
            {
                return; // body not shown in the original listing; early return assumed
            }
            var foto = await CrossMedia.Current.TakePhotoAsync(new StoreCameraMediaOptions()
            {
                PhotoSize = PhotoSize.Custom,
                CustomPhotoSize = 10,
                CompressionQuality = 92,
                Name = "image.jpg"
            });
            _foto = foto;
            if (_foto == null)
            {
                return; // assumed, as above
            }
            Img.Source = FileImageSource.FromFile(_foto.Path);
        }

        private async void Clasificar(object sender, EventArgs e)
        {
            using (Acr.UserDialogs.UserDialogs.Instance.Loading("Clasificando..."))
            {
                if (_foto == null) return;

                // Send the photo as multipart/form-data in the "file" field expected by the API.
                var httpClient = new HttpClient();
                var url = ServiceApiUrl;
                var requestContent = new MultipartFormDataContent();
                var content = new StreamContent(_foto.GetStream());
                content.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpg");
                requestContent.Add(content, "file", "image.jpg");

                var response = await httpClient.PostAsync(url, requestContent);
                if (!response.IsSuccessStatusCode)
                {
                    Acr.UserDialogs.UserDialogs.Instance.Toast("Hubo un error en la deteccion...");
                    return; // assumed: stop when the request fails
                }

                var json = await response.Content.ReadAsStringAsync();
                var prediction = JsonConvert.DeserializeObject<string>(json);
                if (prediction == null)
                {
                    Acr.UserDialogs.UserDialogs.Instance.Toast("Image no reconocida.");
                    return; // assumed: nothing to show
                }
                ResponseLabel.Text = $"{prediction}";
                //Accuracy.Progress = p.Probability;
            }
            Acr.UserDialogs.UserDialogs.Instance.Toast("Clasificacion terminada...");
        }
    }
}


Demo Xamarin Forms App

Image Test mobile app on iOS 11.2 (iPhone 6 Plus)

Image Test mobile app on Android 7.1 Nougat or 8.0 Oreo (Nexus -API 25-26)

Next Post!

Well, we have finished the third post. Honestly, I thought this post was going to be shorter, but while writing it I realized that more information was needed to understand key concepts when carrying out computer vision projects or solutions. I hope these explanations and demos have been helpful, and that they have shown you how beautiful the field of computer vision is, with its possibility of manipulating images and extracting information from them.

In the next post, we will compare TensorFlow and CNTK: the results obtained, management and usability, points of interest, and final conclusions about the research carried out for all the posts.

Kind regards,
Alexander González (@GlezGlez96)
Microsoft Student Partner