Demonstrating how simple ML and text-mining techniques can be applied to make predictions and draw out allied characteristics
This write-up demonstrates a simple ML algorithm that can pull out the characteristic components of the data to predict the family to which it belongs. In this particular example, the data set contains a list of ids, ingredients and dishes, with 20 types of dish in total. The data scientist is attempting to predict the dish from the available ingredients. The same analysis could be extended to predict the family of a crop disease from the individual characteristics of the disease, such as weather type, soil alkalinity, pesticide application and seasonality. You could also feed the same model sensor data (e.g. ADC readings) and predict the family of an engine defect.
The dataset is derived from kaggle.com (specifically, from one of its competitions: https://www.kaggle.com/c/whats-cooking/). The competition was to apply text-mining ML algorithms to predict the dish from the individual ingredients used in its preparation.
Why am I showcasing this? The problem may find application in the crop-sciences domain or in engine-problem classification, since the analyst has to make a similar kind of prediction of the crop disease from component factors such as weather conditions, soil acidity levels, fumigation, seasonality and others. In the case of engine-problem detection, the sensor data set may be modelled to classify the problem class.
I essentially employed text mining, XGBoost and ensemble modelling to get the final output. I also tried Naive Bayes algorithms, but with little success, as their MSE was far too high compared to that of the XGBoost ensemble model.
DATA-SOURCE:
https://www.kaggle.com/c/whats-cooking/data
In the dataset, we include the recipe id, the type of dish, and the list of ingredients of each recipe (of variable length). The data is stored in JSON format. An example of a recipe node in train.json:
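A node looks roughly like the following (the values are illustrative; note that the raw Kaggle file stores the dish type under the key cuisine):

{
  "id": 24717,
  "cuisine": "indian",
  "ingredients": ["tumeric", "vegetable stock", "tomatoes", "garam masala", "naan"]
}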
In the test file test.json, the format of a recipe is the same as in train.json, except that the dish type is removed, since it is the target variable you are going to predict.
File descriptions
- train.json - the training set containing the recipe id, the type of dish, and the list of ingredients
- test.json - the test set containing the recipe id and the list of ingredients
PRE-REQUISITES:
Yeah! You can already smell the process needed to get the prediction kick-started. An understanding of the points below helps to comprehend the solution holistically:
- Comfortable with statistics (and error comprehension) of models
- Data-wrangling techniques
- Basic comprehension of Machine-Learning tools and techniques
- Basic R-Console commands
IMPORTING AND COMBINING DATA SET:
Since the data set is in JSON format, I have used the ‘jsonlite’ package to import the data into R.
setwd('C:/Users/Mrinal/Desktop/Kaggle')
install.packages('jsonlite')
library(jsonlite)
train <- fromJSON("train.json")
test <- fromJSON("test.json")
Let’s combine the ‘Train’ and the ‘Test’ data sets to make the cleansing process less painful, as sketched below.
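A minimal sketch of that step, assuming the raw target column cuisine is renamed to dish (the name used throughout this write-up), the missing target in test is padded with NA, and each recipe's ingredient list is collapsed into a single string:

# the raw Kaggle file names the target 'cuisine'; rename it to match this write-up
names(train)[names(train) == 'cuisine'] <- 'dish'

# add a placeholder target so both sets have the same columns
test$dish <- NA

# collapse each recipe's ingredient list into one space-separated string
train$ingredients <- sapply(train$ingredients, paste, collapse = ' ')
test$ingredients <- sapply(test$ingredients, paste, collapse = ' ')

# stack train and test for one-pass cleaning
combi <- rbind(train, test)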
PRE-PROCESSING (TM Package)
Here are the steps used to clean the list of ingredients. I’ve used the tm package for text mining:
#install the tm package for text mining
install.packages('tm')
library(tm)
#create corpus
corpus <- Corpus(VectorSource(combi$ingredients))
Convert the text to lowercase and strip it for mining purposes:
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords('english')))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
Convert the text into a plain text document. This helps in pre-processing the documents as text documents:
corpus <- tm_map(corpus, PlainTextDocument)
> #document matrix
> frequencies <- DocumentTermMatrix(corpus)
Warning message:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
> frequencies
A document-term matrix (49718 documents, 2756 terms)
Non-/sparse entries: 926350/136096458
Sparsity           : 99%
Maximal term length: 18
Weighting          : term frequency (tf)
DATA EXPLORATION
Let's check the frequency distribution of the terms to start with. I'll then remove the terms with frequency less than 3, since they don't matter much anyway:
# frequency of the terms
freq <- colSums(as.matrix(frequencies))
length(freq)
ord <- order(freq)
ord

# to export the matrix
sample <- as.matrix(frequencies)
dim(sample)
write.csv(sample, file = 'matrix_freq.csv')

# check the most and least frequent words
freq[head(ord)]
freq[tail(ord)]

# check the table of frequencies (30 lowest and 30 highest)
head(table(freq), 30)
tail(table(freq), 30)
sparse_dt <- removeSparseTerms(frequencies, 1 - 3/nrow(frequencies))
dim(sparse_dt)
[1] 49718 2013
Let’s visualize the data now. But first, we’ll create a data frame.
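A minimal sketch of that step, assuming the pruned matrix sparse_dt created above; the resulting data frame newsparse is the object referenced in the later steps:

# convert the pruned document-term matrix into a data frame
newsparse <- as.data.frame(as.matrix(sparse_dt))

# quick look at the 20 most frequent terms
top_terms <- sort(colSums(newsparse), decreasing = TRUE)[1:20]
barplot(top_terms, las = 2, cex.names = 0.7, main = 'Most frequent terms')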
We can also create a word cloud to check the most frequent terms. It is easy to build and gives a better understanding of the ingredients in this data. For this, I’ve used the ‘wordcloud’ package.
#create wordcloud
install.packages('wordcloud')
library(wordcloud)
#plot 300 most used words
wordcloud(names(freq), freq, max.words = 300, scale = c(6, .1))
Let’s make final structural changes in the data:
> #check if all words are appropriate
> colnames(newsparse) <- make.names(colnames(newsparse))
> #check for the dominant dependent variable
> table(train$dish)
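With the column names cleaned, here is a minimal sketch of how the target can be reattached and the combined data split back into the modelling sets used below (my_datatrain / my_datatest are the object names referenced in the XGBoost step; the test slice carries no dish labels):

# reattach the target and split combi back into its train / test portions
newsparse$dish <- as.factor(combi$dish)
my_datatrain <- newsparse[1:nrow(train), ]
my_datatest  <- newsparse[-(1:nrow(train)), ]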
BOOSTING ML MODEL TO PREDICT THE DISH
To know more about boosting, you can refer to this introduction.
Why boosting? It works well on sparse matrices. A sparse matrix is a matrix in which a large number of the elements are zero; remember, most of the elements here are zero. Since I have a sparse matrix, I expected boosting to give good results.
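As a quick check of that claim, here is a small sketch (assuming the pruned document-term matrix sparse_dt created earlier) that measures how sparse the ingredient matrix actually is:

# share of zero entries in the document-term matrix; $v holds only the
# non-zero counts, so this should reproduce the ~99% sparsity reported above
1 - length(sparse_dt$v) / (nrow(sparse_dt) * ncol(sparse_dt))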
Let's start with the XGBoost model:
library(xgboost)
library(Matrix)
Now, I’ve created a sparse matrix from the train data set using xgb.DMatrix. I’ve kept the set of independent variables and removed the dependent variable.
# creating the matrix for training the model
mymodel_train <- xgb.DMatrix(Matrix(data.matrix(my_datatrain[,!colnames(my_datatrain) %in% c('dish')])), label = as.numeric(my_datatrain$dish)-1)
I’ve created a sparse matrix for the test data set too. This is done to create a watchlist.
# advanced data set preparation
mymodel_dtest <- xgb.DMatrix(Matrix(data.matrix(my_datatest[,!colnames(my_datatest) %in% c('dish')])))
watchlist <- list(train = mymodel_train, test = mymodel_dtest)
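The trained booster itself isn't shown in the snippet above, so here is a minimal sketch of how the first model could be fitted (the hyper-parameter values are illustrative, not the exact ones used for the final submission):

# train the first multi-class boosted model on the 20 dish classes
xgbmodel <- xgb.train(
  params = list(objective = "multi:softmax",  # predict the class label directly
                num_class = 20,               # 20 dish types in the data
                eta = 0.3,
                max_depth = 25),
  data = mymodel_train,
  nrounds = 250,
  watchlist = list(train = mymodel_train)     # the test DMatrix has no labels, so monitor train error only
)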
#predict 1
xgbmodel1.predict <- predict(xgbmodel, newdata = data.matrix(my_datatest[, !colnames(my_datatest) %in% c('dish')]))
xgbmodel1.predict.text <- levels(my_datatrain$dish)[xgbmodel1.predict + 1]
# Repeat and create another 2 xgboost models for ensemble
#data frame for predict 1
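That data frame isn't shown in the original snippet; a minimal sketch, using the illustrative name submit_match1 that is reused in the ensembling step below:

# predictions of the first model, keyed by recipe id
submit_match1 <- data.frame(id = test$id, dish = xgbmodel1.predict.text)

# xgbmodel2 / xgbmodel3 are trained the same way (e.g. with different nrounds
# or max_depth) and give submit_match2 and submit_match3 analogously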
Let’s check the accuracy of the models (this assumes dish labels are available for the rows being scored, e.g. a held-out slice of the training data):
sum(diag(table(my_datatest$dish, xgbmodel1.predict)))/nrow(my_datatest)
....
....
The simple key is ensembling. I now have three data frames of predictions, one per model, so I can easily ensemble them.
# ensembling
submit_match3$dish1 <- submit_match1$dish
submit_match3$dish2 <- submit_match2$dish
...
Use the Mode function to extract the predicted value with the highest frequency per id:
# function to find the maximum value row wise
Mode <- function(x) {
  u <- unique(x)
  u[which.max(tabulate(match(x, u)))]
}
# sanity check: Mode across the three dish columns for the first few rows
v <- apply(head(submit_match3[, c('dish', 'dish1', 'dish2')]), 1, Mode)

# most frequent prediction per id across all three models
y <- apply(submit_match3[, c('dish', 'dish1', 'dish2')], 1, Mode)

final_output <- data.frame(id = submit_match3$id, dish = y)
#Final output stages
library(data.table)
data.table(final_output)
#Writing to CSV file
write.csv(final_output, 'ensemble_output.csv', row.names = FALSE)