library(shiny)
library(shinydashboard)
library(tidyverse)
library(caret)
library(parallel)
library(doParallel)
library(MASS)
library(e1071)
load("./caretprojectv2.RData")
################# FUNCTIONS #####################
### functions
# Plot a 2-D dataset coloured by class, overlaid with a model's decision
# regions and decision boundary.
#
# model        : fitted classifier with a predict() method.
# data         : data.frame whose FIRST TWO columns are the x/y axes;
#                `class` names the column holding the true class labels.
# class        : column name (or index) of the class labels; NULL -> one class.
# predict_type : `type` argument forwarded to predict() (model dependent).
# resolution   : number of grid points per axis for the background grid.
# showgrid     : if TRUE, draw the prediction grid and the boundary contour.
# ...          : further arguments passed on to plot().
#
# Returns, invisibly, the resolution x resolution matrix of predicted
# integer class codes over the grid.
decisionplot_basic <- function(model, data, class = NULL, predict_type = "class",
                               resolution = 100, showgrid = TRUE, ...) {
  if (!is.null(class)) cl <- data[, class] else cl <- 1
  data <- data[, 1:2]
  k <- length(unique(cl))
  plot(data, col = as.integer(cl) + 1L, pch = as.integer(cl) + 15L, ..., bty = "n")
  # Build a regular grid spanning the observed range of both variables.
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  # x varies slowest, y fastest; this ordering matches the byrow = TRUE
  # matrix below so that z[i, j] is the prediction at (xs[i], ys[j]),
  # which is exactly what contour() expects.
  # (Fixed: original used the partially matched argument `time =`;
  # spelled out as `times =` — same behavior, no partial matching.)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)
  # How class labels come back from predict() is not consistent between
  # model types; coercing via matrix -> factor yields integer class codes
  # in all the common cases.
  p <- predict(model, g, type = predict_type)
  p <- as.matrix(p)
  p <- as.factor(p)
  if (showgrid) points(g, col = as.integer(p) + 1L, pch = ".")
  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  # Decision boundary: contour lines halfway between consecutive class codes.
  # NOTE(review): the contour is also gated on showgrid, so showgrid = FALSE
  # suppresses the boundary too — confirm this is intended.
  if (showgrid) {
    contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
            lwd = 2, levels = (1:(k - 1)) + .5)
  }
  invisible(z)
}
# Names of the shared UI pieces defined below (inputs and explanatory text).
fields <- c(
  "explanationtext",
  "trainingset",
  "resamplingmethod",
  "resamplingmethodhelp",
  "groupsize"
)
# Introductory blurb for the SVM demo.
# Fixed: `align = "justify"` was previously passed to HTML(), whose `...`
# arguments are concatenated into the text — so the literal word "justify"
# was appended to the displayed string. Wrapping in p() applies the
# alignment as an HTML attribute instead.
explanationtext <- p(
  HTML(
    "This is a basic example to demonstrate machine learning with the Support Vector Machine (SVM) algorithm.
It simulates data of two groups, in this case red and white wines. The top plot displays the raw simulated data
with sweetness on the x-axis and fullness on the y-axis. We use SVM to classify our data in two groups.
This will help us to classify new data. In the bottom plot you can see the decision boundary. New incoming data
in the red part would be classified as a red wine, and data in the green part would be classified as a white wine.
Play around with the settings to see which options influence the results of the machine learning algorithm."
  ),
  align = "justify"
)
# Title/description pairs for the three resampling methods. The server is
# expected to pick one pair and render it via output$restitle / output$restext
# in the "Resampling" tab.
bootstitle <- "Bootstrapping"
bootstext <- "When bootstrapping, a random sample with replacement is drawn from the training set and trained. This procedure is repeated several times, hence allowing us to generates information about the precision, such as accuracy and standard deviation of accuracy."
repeatedcvtitle <- "Repeated k-Fold cross-validation"
repeatedcvtext <- "In k-fold cross validation, the training set is partitioned into k-samples; for example, in 10 equally sized samples. The k - 1 number of samples, in this case 10, are selected and used for training. The model is then validated on the remaining sample. This is done until every sample has also been the testing sample; in our case, that counts 10 times. The method is called 'k-fold' because of the k number of samples, and 'repeated' because the whole procedure is performed several times."
loocvtitle <- "Leave-One-Out cross-validation"
loocvtext <- "LOOCV is a special case of k-Fold Cross Validation. In this special case, the model is trained on all observations of the training set except one. Afterwards, it is tested on the left-out observation. This procedure is repeated for all observations."
# Slider controlling the simulated sample size per wine group (3-150).
groupsize <- sliderInput(
  inputId = "groupsize",
  label = "Choose sample size per group:",
  min = 3,
  max = 150,
  value = 50,
  step = 1
)
# Radio buttons selecting the train/test split ratio for the demo.
trainingset <- radioButtons(
  inputId = "trainingset",
  label = "How large should the training set be?",
  choices = c("50/50", "66/33", "80/20"),
  inline = TRUE
)
# Radio buttons selecting the resampling strategy used during training.
resamplingmethod <- radioButtons(
  inputId = "resamplingmethod",
  label = "Choose a resampling method for the training data:",
  choices = c(
    "Bootstrapping" = "Bootstrapping",
    "Repeated cross-validation" = "Repeated cross-validation",
    "LOOCV" = "LOOCV"
  )
)
################# TEXT #########################
# Explanatory HTML blocks shown in the "Introduction" tab.
# General introduction to machine learning (Basic sub-tab).
machine_learning_1 <- HTML("<p>Machine learning is a name for algorithms that use computational methods to acquire information directly from the data without relying on a predetermined equation as a model -- that is why it is called 'learning'. These algorithms adaptively improve their performance as the number of samples available for learning increases. Simply put, the machine learning algorithms will find natural patterns within the data, get insights about the patterns, and make the predictions about the nature of new data when tested.
Based on how much human input they need for learning the patterns of the data, we can make a distinction between two machine learning techniques: supervised learning and unsupervised learning.</p>")
# Supervised learning explainer (Supervised sub-tab).
machine_learning_2 <- HTML("<p><strong>Supervised learning</strong> refers to finding patterns in the input data, and predicting the patterns of the output data. The algorithms from this family of techniques always need to be provided with a <strong>labeled dataset</strong> from which they can learn the interactions between the parameters and labels. Such dataset is called <strong>training dataset</strong>. For example, if we want to predict whether it will rain in Glasgow on Saturday, we will be using machine learning to solve a <strong>classification problem</strong>. For that, we will need a dataset which contains information on whether it did or didn't rain on a certain day in the last 10 years. The machine will then connect these labels, 'raining' and 'non-raining', to the day of the year, and based on that predict whether we will have rain on Saturday. In this case, we are making a <strong>binary classification</strong>, because we are making a prediction between two scenarios (classes). If we would make a dataset with the labels 'raining', 'cloudy', 'sunny', and 'snowing', then we would be making a <strong>multi-class prediction</strong>. <br/> <br/>
Alternatively, we could be interested in predicting the temperature in Glasgow on Sunday, based on the information about the temperature on the given day during the last ten years. This would be a problem of <strong>regression</strong>, because we would be predicting a continuous outcome. </p>")
# Unsupervised learning explainer (Unsupervised sub-tab).
# Fixed garbled user-facing prose: "The will search" -> "It will search",
# "generate its own labels" joined with "and", "an information" -> "information".
machine_learning_3 <- HTML("<p>This simplified example of weather forecasting based on one predictor illustrates the basic logic of machine learning, but definitely does not show its full potential. Machine learning becomes especially useful when dealing with high-dimensional datasets, sometimes containing thousands and thousands of predictors, because we don't need to explicitly define the relationship between the predictors; the machine will learn the patterns directly from the data. <br/><br/>
Oftentimes we may not have a labelled dataset available. Then, to find patterns, we may use <strong>unsupervised learning</strong>. It will search through a new, unlabeled dataset, and based on patterns of interactions between datasets define distinct clusters, and generate its own labels. For example, if we collect the data about the temperature in Glasgow on random days throughout 10 years, but don't have information about the day on which the temperature was measured, machine learning will help us separate this dataset into four distinct categories with varying temperature ranges: spring, summer (if such a thing exists in Scotland), autumn, and winter. This is referred to as cluster analysis, and is only one of many types of <strong>unsupervised learning</strong>.
</p>")
# Explainer for train/test splitting ("Split the data" sub-tab).
split_data <- HTML("The goal of machine learning is to train the data to predict future data. To do so, we want to first validate the model and check how accurately it can give us information about the new data. However, the information about the new data is not yet known; if it was known, we wouldn't need the machine in the first place! This is why we usually split the existing, labeled dataset during the process of training and validating the model. This allows us to get an estimate of the accuracy with which we will be predicting all the new, unknown, unlabeled data. <br/><br/>
The usual approach is to split the dataset in two parts, hence forming a <strong>training set</strong> and <strong>test set</strong>. One of the common ways are to apply the Pareto principle and use 80% of the dataset for training, and 20% for testing. Alternatively, ratios of 66/33 or 50/50 are also sometimes used. The interactive graph shows how the <strong>decision boundary</strong> of the model changes when we split our example dataset in different ratios. ")
# Neural-network explainer, plain-English version ("Plain English" tab).
neural_network1 <- p("we want to have a generic model that is adapting to training data. Where creating networks by hand is too expensive and single neurons are not able to solve complex tasks, a Multilayer Perceptron (MLP) comes into play. A MLP can be seen as a logistic regression classifier where the input is is modified by a trained non-linear transformation. This transformation alters the data that it becomes linearly separable. The transformation layer on the intermediate level is referred to as a hidden layer. A single layer is already enough to form an appropriate approximator. However, as it is done in deep learning, substantial benefits can be gained when multiple levels of hidden layers are used. However, the most fundamental idea is that standard perceptrons calculate a discontinuous function: x = fstep(w0+[w,x])
")
# Neural-network explainer, technical version ("Jargon" tab).
neural_network2 <- p("Standard perceptrons calculate a discontinuous function x = fstep(w0+[w,x]) whereas more advanced models with multiple perceptrons calculate a smoothed function of this: x = flog(w0+[w, x]) with flog as a logistic function.
Very complex functions can be calculated by adding multiple neurons. A MLP can be seen as a finite directed acyclic graph.
Nodes that are not target of any input connection are called input neurons. For every dimension n that should be calculated, n number of input neurons are required.
Nodes that are no source of any connection are called output neurons. The number of output neurons depends on the target values that are asked to be gained.
All nodes that are neither input nor output neurons are called hidden neurons and can be organized in layers. Most MLP have a connection structure that connects all neurons of one layer to every neuron in the next layer. All connections are weighted with a real number. The weight of the connection i to j and is referred to as wji.
Each output and hidden neurons have a specific bias. The bias weight of neuron i is referred to as wi0
")
# kNN explainer, plain-English version ("Plain English" tab).
knn_dummies <- "The K-nearest neighbour algorithm (kNN) can be used both for classification and regression. In the case of classification, we can use a labeled data set to train the model to predict the classes of the data points in the test set. Interestingly, no explicit training step is required because the algorithm uses the whole training set into account when predicting the class membership of the test data point; this is referred to as lazy learning.
The algorithm makes the prediction by searching through the entire training set for the k number of similar instances (so-called neighbours), and based on their class membership predicts the class of the new data point.
The value for k can be found by algorithm tuning, but it is recommended to try many different values for k and see what works best for your problem. We also use continuous weights called kernels to reflect the distance of a training point and the corresponding prediction point. Instead of using simple binary (i.e., neighbour/not-neighbour) distinctions, continuous kernels can capture the 'degree of neighbourship'). Basically, the closer two data points are, the greater the kernel value.
You can check the effect of different values of k on accuracy for different kernels in the graph. "
# kNN explainer, technical version ("Jargon" tab).
# Fixed in the displayed text: mis-encoded minus sign in the Euclidean
# distance formula ("xj ' xij" -> "xj - xij"), "Euclidian" -> "Euclidean",
# and two grammar slips ("we risk of the", "we will you have").
# NOTE(review): "(??)" and "[third name]" look like unfinished placeholders
# (probably a symbol such as sigma and a kernel name) — needs author input,
# left as-is.
knn_smarties <- "The k- nearest neighbours (kNN) is a non-parametric method used for both classification and regression problem. The input consists of the k closest training examples in the feature space, and the output is either class membership or property value for the object, depending on problem type. In classification, the object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. k is a positive integer, and typically small; for example, when k = 1, the object is simply assigned to the class of that single nearest neighbor.
Another parameter that can be manipulated when setting up the model is the type and width of the kernel, which reflects the distance between the training data point and the corresponding test point. The commonly used kernels are Gaussian, regularized, and [third name] kernel.
An important feature of kernels is the kernel width (??). If the kernel width is small, the model will average over fewer points, and the bias will be smaller since closer distances are used; however, in this scenario we risk the predictions being too bumpy. On the other side, if the kernel width is large, the model will be averaging over larger points and we will have a larger bias since further apart distances are used; hence, we risk over-smoothing the parameter space. Therefore, it is important to consider the 'right' trade-off depending on your model.
For determining the optimal k instances, the most commonly used method is the Euclidean distance, the square root of the sum of the squared differences between a new point (x) and an existing point (xi).
EuclideanDistance(x, xi) = sqrt( sum( (xj - xij)^2 ) )
The graph shows the effect of different k values and three most commonly used kernels on model accuracy."
# SVM explainer, plain-English version ("Plain English" tab).
# NOTE(review): several apostrophes below ("x1'", "w'", "'.") appear to be
# mis-encoded math symbols (vector arrows / ellipsis / element-of) from the
# original document — the intended notation needs author confirmation.
svm_dummies <- "SVMs are supervised learning models that can be used both for regression and classification. In a training set, where all datapoints can clearly be distinguished between two groups, a SVM builds a model on this data (training set) that can be used for new data, to predict the assignment to one category or another. Two parallel hyperplanes can be selected that divide the two groups of each other. The goal is to maximize the distance between them. The distance of the both hyperplanes is the margin. The maximum-margin hyperplane is the hyperplane that lies halfway between them.
In a dataset, (x1',y2), '., (x',yn), we want to find the maximum-margin hyperplane that divides the group of points (with groups that are either yi = 1 or yi = -1) into a maximised distance between the hyperplane and the nearest point xi' from either group. Every hyperplane can be written down as w'*x'-b=0, where w' describes the normal vector to the hyperplane. Therefore, the equation can only have the outcomes w'*x'-b=1 and w'*x'-b=-1 that determine the class above or below the decision boundary."
# SVM explainer, technical version ("Jargon" tab).
# NOTE(review): "x ' R n" is probably a mangled "x is an element of R^n", and
# "??" a mangled Greek letter (likely theta/alpha) — confirm with the authors.
svm_smarties <- "Given a dataset of 1000 photographs of pandas and sloths, you can digitize them into 500x500 pixels. So you have x ' R n with n=250.000. Given another picture, you want to assess whether it is a panda or a sloth (Assumed supervised learning between two outcomes).
Given your input/output sets (X,Y) and the training set (x1,y1)...,(xi,yi) and your previous x ' X you want to generalize to a suitable y ' Y
You then train a classifier with y= f(x,??), where ?? are your parameters of interest of the function.
A proper training function for f(x,??) is crucial. So you choose a function that suits well. Eg. by choosing: Rempirical_risk(??)= (1/m)*L*(f(xi,??),yi) for your Training error where L is your zero-one loss function, you can try to minimize the risk by setting
R(??)=' L(f(x,??),y)dP(x,y) for your Test error where P(x,y) is the distribution of x and y."
# Random-forest explainer, plain-English version ("Plain English" tab).
rf_dummies <- "Random forest is one of the most popular classification algorithms and can be used for both classification and regression. It is based on a concept of a decision tree - and that's what makes the algorithm name very witty. In a decision tree, we go through a flowchart of questions, and depending on the answer to each question, we keep adding up new ones. In that way, we predict the outcome based on the pathway chosen through the nodes of the tree. For example, imagine you want to predict if you are going to pass your next exam. You may start asking yourself if you've spent enough hours studying. If that number of is less than, say, 20 hours, you might ask how much you knew beforehand, and follow-up by questioning the teaching quality, or how difficult was the exam in the previous years. After several questions, you will have a class prediction for your exam: pass or fail. By adding additional trees, you can see how many of them predict a fail and how many predict a pass. Since these are smart, machine trees, they will take into account a number of random features and their interactions. The more trees you put in your forest, the more opinions you will have, and the more accurate your final prediction will be. You can check our graph to see how the number of trees relates to the accuracy of the model prediction."
# Random-forest explainer, technical version ("Jargon" tab).
rf_smarties <- "For every kth tree, a random and independent vector k is generated. Every tree grows by using the training set with the help of the classifier h(x,k) where x is the input vector. After enough trees have been generated, the collection can be called a random forest. The random forest is still a classifier that consist of a subcollection off tree-classifiers {h(x,k),k=1,...,n} where every tree votes for its self-generated prediction.
In an ensemble of tree classifiers h1(x),...,hk(x) and a random training set from the vector distribution Y,X, the margin measures to what extent the mean number of votes of X,Y for the correct class exceeds the mean of any other class. Therefore: The larger the margin, the better the classification. In mathematical terms the margin can defined as following:
margin(X,Y)=meank*I(hk(X)=Y)-max(j/=Y)*meank*I(hk(X)=j)
Remarkable is that random forest do not overfit the data when additional trees are added. However, a limit of the predicted generalization error will be reached."
# Description of the wine-quality dataset ("Our Dataset" sub-tab).
# Fixed displayed text: duplicated "shiny Shiny" and "Sommalier" -> "Sommelier".
# NOTE(review): the text refers to a "Machine Sommelier" tab while the sidebar
# labels that tab "The Winemaker" — confirm which name is intended.
our_dataset <- HTML("<p>In this Shiny app, we will explain the mechanism behind several most common machine learning techniques, illustrate how the decision boundary for classification is estimated, and challenge you to play a game against our Winemaker to create the best-quality wine!
<br/><br/>
The dataset we used is coming from the popular Wine quality dataset (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). It contains information about 11 objective tests of red wine samples (e.g., pH values, alcohol percentage) used as inputs, and expert evaluations of wine quality used as outputs. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
<br/><br/>
The datasets can be viewed in the context of either classification or regression tasks; in this app, we are treating it with classification algorithms. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines.
<br/><br/>
We slightly modified the dataset to adjust it for the purposes of this app, and used 4898 examples in total. The 11 input variables are the following:<br/>
1 - acidity <br/>
2 - stabilization<br/>
3 - clarification<br/>
4 - sugars<br/>
5 - chlorides<br/>
6 - sulfites<br/>
7 - yeast<br/>
8 - density<br/>
9 - pH<br/>
10 - sulphates<br/>
11 - alcohol<br/>
<br/>
The models we described in the following section are all trained on this dataset. The game you can play under the Machine Sommelier tab is also using the algorithm trained on this dataset. However, since the multidimensional data sets such as this one cannot be mapped onto the 2D space for visualization, we could not use it to represent the decision boundaries. Therefore, for representation of decision boundaries, we used the simplified models with only two variables from the dataset.</p>")
why_sample <- p("The validation set and test set are not always separated, as we can already read out the accuracy of the model during the cross-validation. However, the error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model by repeatedly estimating the parameters. Therefore, it is highly recommendable to estimate the accuracy on a separate test set, and after that NOT tune the model any further to preserve the integrity of the model.")
# Explainer for the train / cross-validate / test phases ("Cross-Validation"
# box). Fixed displayed text: missing word in "a set of used only" ->
# "a set of data used only".
cross_validation <- HTML("<p>The training set is used for the first two phases of machine learning:
<br/>
1. In the <strong>training phase</strong>, you fit ('train') your model on the data from your training set by pairing the input with expected output; <br/>
2. In the <strong>cross-validation phase</strong>, you are estimating how well your model has been trained and estimating model properties (e.g., classification errors). <br/>
These two steps are often performed together and commonly repeated multiple times to obtain an average, depending on which approach to cross-validation you choose (see Resampling sub-heading).
<br/><br/>
Once the parameters are estimated and model has been trained and validated, you can apply it to your test set in the <strong>test phase</strong>. In this phase, we apply our trained model to a set of data used only to assess the final performance of a fully-trained classifier. The performance of the model in this phase reflects the performance we can expect if we apply it to the real-world data.</p>
")
resampling_ex <- p("During the training and validation phases, we are cross-validating our model by splitting the dataset in different ways and checking how well the model trained on one part of the dataset predicts the other part of the dataset. Instead of just splitting the dataset in two and making only one prediction, we usually resample many times. In the graph you can see how the model decision boundary behaves when using some of the most common resampling methods. ")
# header --------------------
# Dashboard title bar shown at the top of every page.
header <- dashboardHeader(title = "Machine Learning Algorithms")
# sidebar -------------------
# Navigation menu; each tabName must match a tabItem in the dashboard body.
sidebar <- dashboardSidebar(
sidebarMenu(
menuItem("Introduction", tabName = "intro"),
menuItem("Our Algorithms", tabName = "algorithms"),
menuItem("Compare the models!", tabName = "compare"),
menuItem("The Winemaker", tabName = "game")
)
)
# body ----------------
# Dashboard body: the four tabs referenced by the sidebar menu
# ("intro", "algorithms", "compare", "game").
# Fixes relative to the original:
#  * conditionalPanel for Model 2's Random Forest tested
#    `input.selection1 == 'Random Forest'` — corrected to
#    `input.selection2 == 'rf'` (matches the selectInput values and the
#    parallel Model 1 panel).
#  * "64/33" split labels corrected to "66/33" (the `oneThird` value, as
#    used by the KNN tab and the `trainingset` radio buttons).
#  * print("...") wrappers in tabPanels removed — they printed to the
#    console at UI-build time; plain strings render identically (as the
#    KNN tab already does).
#  * stray global assignments inside UI calls (k.input1 <-, k.input2 <-,
#    slider_n.input <-, slider_quality.n.input <-) removed; the widgets
#    are simply passed as arguments.
#  * solidHeader = T spelled out as TRUE.
body <- dashboardBody(
  # NOTE(review): relative path climbs three directories; confirm it resolves
  # from the app's working directory when deployed.
  includeScript("../../../Matomo-tquant.js"),
  tabItems(
    # ---------- Introduction ----------
    tabItem(tabName = "intro",
      fluidPage(
        tabsetPanel(
          tabPanel("Basic",
            box(title = "What is machine learning",
              width = 6,
              status = "success",
              solidHeader = TRUE,
              machine_learning_1)
          ),
          tabPanel("Supervised",
            box(title = "Supervised Learning",
              width = 6,
              status = "success",
              solidHeader = TRUE,
              machine_learning_2
            )
          ),
          tabPanel("Unsupervised",
            box(title = "Unsupervised Learning",
              width = 6,
              status = "success",
              solidHeader = TRUE,
              machine_learning_3
            )
          ),
          tabPanel("Split the data",
            fluidPage(
              column(width = 4,
                box(title = "Splitting the dataset",
                  width = 12,
                  status = "success",
                  solidHeader = TRUE,
                  collapsible = TRUE,
                  collapsed = TRUE,
                  split_data
                ),
                box(title = "Cross-Validation",
                  width = 12,
                  status = "success",
                  solidHeader = TRUE,
                  collapsible = TRUE,
                  collapsed = TRUE,
                  cross_validation),
                box(title = "Why separate train and test sets?",
                  status = "success",
                  solidHeader = TRUE,
                  width = 12,
                  collapsible = TRUE,
                  collapsed = TRUE,
                  why_sample
                ),
                # Shared input widgets defined at the top of the file.
                box(
                  width = 12,
                  groupsize,
                  trainingset
                )
              ),
              column(width = 8,
                box(
                  width = 12,
                  plotOutput("descplot"),
                  plotOutput("machplot"),
                  tableOutput("results")
                )
              )
            )
          ),
          tabPanel("Resampling",
            box(title = "Resampling",
              status = "success",
              solidHeader = TRUE,
              width = 4,
              resampling_ex,
              resamplingmethod,
              h4(textOutput("restitle")),
              p(textOutput("restext"), align = "justify")
            ),
            box(
              width = 8,
              plotOutput("descplot_2"),
              plotOutput("machplot_2"),
              tableOutput("results_2")
            )
          ),
          tabPanel("Our Dataset",
            box(title = "Our Dataset",
              status = "success",
              solidHeader = TRUE,
              our_dataset),
            box(title = "Plot",
              status = "success",
              solidHeader = TRUE,
              plotOutput("box1"))
          )
        )
      )
    ),
    # ---------- Algorithm explanations ----------
    tabItem(tabName = "algorithms",
      fluidPage(
        tabsetPanel(
          tabPanel("K-Nearest Neighbour",
            tabBox(
              width = 4,
              selected = "Plain English",
              tabPanel("Plain English", knn_dummies),
              tabPanel("Jargon", knn_smarties),
              tabPanel("Example",
                "Test data details:",
                tableOutput("testData_knn"),
                "Test data quality prediction:",
                verbatimTextOutput("testQuality_knn"),
                "Test data real quality:",
                verbatimTextOutput("realQuality_knn"),
                actionButton("randomize_knn", "Randomize data!")
              )
            ),
            box(
              title = "Plot",
              width = 8,
              status = "success",
              solidHeader = TRUE,
              radioButtons("vsize_knn", "Select the training sample size:",
                c("Pareto (80/20)" = "eightyTwenty",
                  "66/33" = "oneThird",
                  "50/50" = "halfHalf"),
                inline = TRUE
              ),
              checkboxGroupInput("vkernel_knn", "Select kernel(s):",
                choiceNames = list("Rectangular", "Gaussian", "Cosine"),
                choiceValues = list("rectangular", "gaussian", "cos"),
                selected = "rectangular", inline = TRUE),
              plotOutput("plot_knn")
            )
          ),
          tabPanel("Support Vector Machines",
            tabBox(
              width = 4,
              selected = "Plain English",
              tabPanel("Plain English", svm_dummies),
              tabPanel("Jargon", svm_smarties),
              tabPanel("Example",
                "Test data details:",
                tableOutput("testData_svm"),
                "Test data quality prediction:",
                verbatimTextOutput("testQuality_svm"),
                "Test data real quality:",
                verbatimTextOutput("realQuality_svm"),
                actionButton("randomize_svm", "Randomize data!")
              )
            ),
            box(
              title = "Plot",
              width = 8,
              status = "success",
              solidHeader = TRUE,
              radioButtons("vsize_svm", "Select the training sample size:",
                c("80/20" = "eightyTwenty",
                  "66/33" = "oneThird",
                  "50/50" = "halfHalf"),
                inline = TRUE
              ),
              checkboxGroupInput("vc_svm", "Select Cost value(s):",
                choiceNames = list("2", "4", "6", "8", "10"),
                choiceValues = list("2", "4", "6", "8", "10"),
                selected = "2", inline = TRUE
              ),
              plotOutput("plot_svm")
            )
          ),
          tabPanel("Random Forest",
            tabBox(
              width = 4,
              selected = "Plain English",
              tabPanel("Plain English", rf_dummies),
              tabPanel("Jargon", rf_smarties),
              tabPanel("Example",
                "Test data details:",
                tableOutput("testData_rf"),
                "Test data quality prediction:",
                verbatimTextOutput("testQuality_rf"),
                "Test data real quality:",
                verbatimTextOutput("realQuality_rf"),
                actionButton("randomize_rf", "Randomize data!")
              )
            ),
            box(
              title = "Plot",
              width = 8,
              status = "success",
              solidHeader = TRUE,
              radioButtons("vsize_rf", "Select the training sample size:",
                c("80/20" = "eightyTwenty",
                  "66/33" = "oneThird",
                  "50/50" = "halfHalf"),
                inline = TRUE),
              plotOutput("plot_rf")
            )
          ),
          tabPanel("Neural Network",
            tabBox(
              width = 4,
              selected = "Plain English",
              tabPanel("Plain English", neural_network1),
              tabPanel("Jargon", neural_network2),
              tabPanel("Example",
                "Test data details:",
                tableOutput("testData_nn"),
                "Test data quality prediction:",
                verbatimTextOutput("testQuality_nn"),
                "Test data real quality:",
                verbatimTextOutput("realQuality_nn"),
                actionButton("randomize_nn", "Randomize data!"))
            ),
            box(
              title = "Plot",
              width = 8,
              status = "success",
              solidHeader = TRUE,
              radioButtons("vsize_nn", "Select the training sample size:",
                c("80/20" = "eightyTwenty",
                  "66/33" = "oneThird",
                  "50/50" = "halfHalf"),
                inline = TRUE
              ),
              checkboxGroupInput("vl3_nn", "Select Layer 3 perceptron number(s):",
                choiceNames = list("5", "6", "7", "8", "9"),
                choiceValues = list("5", "6", "7", "8", "9"),
                selected = "5", inline = TRUE
              ),
              plotOutput("plot_nn")
            )
          )
        )
      )
    ),
    # ---------- Side-by-side model comparison ----------
    tabItem(tabName = "compare",
      fluidRow(
        box(
          title = "Model 1",
          width = 4,
          status = "success",
          solidHeader = TRUE,
          selectInput(
            "selection1",
            "Choose the ML method",
            c("Linear Discriminant Analysis" = "lda",
              "Logistic Regression" = "regLogistic",
              "Multi-layer Perceptron" = "mlp",
              "K-nearest Neighbours" = "kknn",
              "Random Forest" = "rf",
              "Support Vector Machine" = "svmRadial"),
            selected = "lda"
          ),
          # Per-method tuning controls, shown/hidden client-side.
          conditionalPanel(
            condition = "input.selection1 == 'lda'",
            h3("No tuning parameters available")
          ),
          conditionalPanel(
            condition = "input.selection1 == 'regLogistic'",
            h3("No tuning parameters available")
          ),
          conditionalPanel(condition = "input.selection1 == 'mlp'",
            h3("Multi-layer Perceptron")
          ),
          conditionalPanel(condition = "input.selection1 == 'kknn'",
            h3("K-nearest neighbours"),
            sliderInput(
              "k.input1",
              "Choose k?",
              min = 1, max = 20, step = 1, value = 1
            )
          ),
          conditionalPanel(condition = "input.selection1 == 'rf'",
            h3("Random Forest")
          )
        ),
        box(
          title = "Decision Boundaries",
          width = 8,
          status = "primary",
          solidHeader = TRUE,
          plotOutput(outputId = "decisionPlot")
        ),
        box(
          title = "Model 2",
          width = 4,
          status = "success",
          solidHeader = TRUE,
          selectInput(
            "selection2",
            "Choose the ML method",
            c("Linear Discriminant Analysis" = "lda",
              "Logistic Regression" = "regLogistic",
              "Multi-layer Perceptron" = "mlp",
              "K-nearest Neighbours" = "kknn",
              "Random Forest" = "rf",
              "Support Vector Machine" = "svmRadial"),
            selected = "lda"
          ),
          conditionalPanel(
            condition = "input.selection2 == 'lda'",
            h3("No tuning parameters available")
          ),
          conditionalPanel(
            condition = "input.selection2 == 'regLogistic'",
            h3("No tuning parameters available")
          ),
          conditionalPanel(condition = "input.selection2 == 'mlp'",
            h3("Multi-layer Perceptron")
          ),
          conditionalPanel(condition = "input.selection2 == 'kknn'",
            h3("K-nearest neighbours"),
            sliderInput(
              "k.input2",
              "Choose k?",
              min = 1, max = 20, step = 1, value = 1
            )
          ),
          # Fixed: originally tested input.selection1 == 'Random Forest',
          # which never matched (wrong input and wrong value).
          conditionalPanel(condition = "input.selection2 == 'rf'",
            h3("Random Forest")
          )
        ),
        box(
          title = "General data and sampling parameters",
          width = 8,
          status = "primary",
          solidHeader = TRUE,
          sliderInput(
            "n",
            "Pick the sample size of the training set",
            min = 200, max = 450, step = 1, value = 250
          ),
          sliderInput(
            "quality.n",
            "Pick the number of training labels",
            min = 2, max = 5, step = 1, value = 2
          )
        )
      )
    ),
    # ---------- The Winemaker game ----------
    tabItem(tabName = "game",
      titlePanel("The Winemaker"),
      tags$head(
        tags$style(type = "text/css", "
label.control-label,
.selectize-control.single{ display: table-cell; text-align: left; vertical-align: middle; }
.form-inline { display: table-row;}
")
      ),
      sidebarLayout(
        sidebarPanel(
          p("Make the best wine that fits the taste of our machine-learning model!
Enter the variables below, and click on Test button.
Your wine will be judged on a scale of 1 (worst) to 7 (best)."),
          # Eleven wine attributes, split over two columns; ranges mirror the
          # wine-quality dataset the model was trained on.
          column(6, style = "padding-top:40px",
            sliderInput("var01", label = "Fixed acidity", 3.5, 12, 15.5/2, step = 0.05),
            sliderInput("var02", label = "Volatile acidity", 0, 1, 0.5, step = 0.05),
            sliderInput("var03", label = "Citric acid", 0, 1.8, 0.9, step = 0.05),
            sliderInput("var04", label = "Residual sugar", 0.6, 24, 12, step = 0.05),
            sliderInput("var05", label = "Chlorides", 0.01, 0.25, 0.1, step = 0.005),
            sliderInput("var06", label = "Free SO2", 3, 131, 134/2, step = 0.5)
          ),
          column(6, style = "padding-top:40px",
            sliderInput("var07", label = "Total SO2", 18, 315, 333/2, step = 0.2),
            sliderInput("var08", label = "Density", 0.98, 1.05, 0.99, step = 0.005),
            sliderInput("var09", label = "pH", 2.7, 3.9, 3.3, step = 0.01),
            sliderInput("var10", label = "Sulphates", 0.26, 1.05, 0.8, step = 0.05),
            sliderInput("var11", label = "Alcohol", 8, 14.1, 11, step = 0.05),
            actionButton("submit2", "Test", style = "padding: 20px; padding-left: 90px; padding-right: 90px; font-size: 200%; horizontal-align: middle")
          )
        ),
        mainPanel(
          h3("Your wine's quality:"),
          verbatimTextOutput("quality"),
          imageOutput("wineglass")
        )
      )
    )
  )
)
# Assemble the full dashboard page from its three components.
ui <- dashboardPage(
  header = header,
  sidebar = sidebar,
  body = body
)
# ------------------ SERVER -------------------- #
server <- function(input, output, session){
############ kNN ################
# Reactive accessors for the kNN tab's controls: training/test split size
# and the set of kernels to display.
vsizeget_knn <- reactive(input$vsize_knn)
vkernelget_knn <- reactive(input$vkernel_knn)
# Pre-trained kknn tuning results (loaded from caretprojectv2.RData),
# restricted to Minkowski distance == 2 (Euclidean) for plotting.
knETdf <- kknnET.train$results[kknnET.train$results$distance == 2, ]
knOTdf <- kknnOT.train$results[kknnOT.train$results$distance == 2, ]
knHHdf <- kknnHH.train$results[kknnHH.train$results$distance == 2, ]
# Initial test observation and its predicted quality (shown before the
# user presses the randomize button).
testData_knn <- eightTwoTest.white[1, ]
testQuality_knn <- predict(kknnET.train, testData_knn)
# Sample a random test-set wine, predict its quality, and display the
# features, the prediction and the true quality.  The outputs are
# (re)registered inside the observer so they capture the fresh sample.
observeEvent(input$randomize_knn, {
  testData_knn <- eightTwoTest.white[sample(seq_len(nrow(eightTwoTest.white)), 1), ]
  testQuality_knn <- predict(kknnET.train, testData_knn)
  cat("Button pressed!")
  output$testData_knn <- renderTable(gather(testData_knn, "Feature", "Value"))
  output$testQuality_knn <- renderText(testQuality_knn)
  output$realQuality_knn <- renderText(testData_knn$quality)
})
# Accuracy-vs-kmax curves for the selected kernels.  The per-branch test
# predictions in the original were dead code (never used in the plot) and
# have been removed.  Kernels are filtered with %in%, which handles any
# number of selections without generating NA comparisons.
output$plot_knn <- renderPlot({
  df <- switch(vsizeget_knn(),
    eightyTwenty = knETdf,
    oneThird     = knOTdf,
    halfHalf     = knHHdf
  )
  if (!is.null(df)) {
    ggplot(data = df, aes(x = kmax, y = Accuracy)) +
      geom_line(data = subset(df, kernel %in% vkernelget_knn()),
                aes(colour = kernel)) +
      ylim(0.5080, 0.6510)
  }
})
########## SVM #################
# Reactive accessors for the SVM tab: split size and the cost values (C)
# to display.
vsizeget_svm <- reactive(input$vsize_svm)
vcget_svm <- reactive(input$vc_svm)
# Pre-trained SVM tuning results for each training/test split.
svmETdf <- svmET.train$results
svmOTdf <- svmOT.train$results
svmHHdf <- svmHH.train$results
# Initial test observation and its predicted quality.
testData_svm <- eightTwoTest.white[1, ]
testQuality_svm <- predict(svmET.train, testData_svm)
# Sample a random test-set wine and display predicted vs. real quality.
observeEvent(input$randomize_svm, {
  testData_svm <- eightTwoTest.white[sample(seq_len(nrow(eightTwoTest.white)), 1), ]
  testQuality_svm <- predict(svmET.train, testData_svm)
  cat("Button pressed!")
  print(testData_svm)
  print(testQuality_svm)
  output$testData_svm <- renderTable(gather(testData_svm, "Feature", "Value"))
  output$testQuality_svm <- renderText(testQuality_svm)
  output$realQuality_svm <- renderText(testData_svm$quality)
})
# Accuracy vs. sigma, one line + points per selected cost C.  The original
# stacked five near-identical layers (one per possible selection slot);
# a single layer filtered with %in% and grouped by C draws the same
# curves.  The eightyTwenty panel keeps its distinct lower y-limit.
output$plot_svm <- renderPlot({
  plot_for <- function(df, ymin) {
    sel <- subset(df, C %in% vcget_svm())
    ggplot(data = df, aes(x = sigma, y = Accuracy, color = C)) +
      geom_line(data = sel, aes(group = C)) +
      geom_point(data = sel) +
      ylim(ymin, 0.6510)
  }
  if (vsizeget_svm() == "eightyTwenty") {
    plot_for(svmETdf, 0.6080)
  } else if (vsizeget_svm() == "oneThird") {
    plot_for(svmOTdf, 0.5080)
  } else if (vsizeget_svm() == "halfHalf") {
    plot_for(svmHHdf, 0.5080)
  }
})
######## NEURAL NETWORK #######
# Reactive accessors for the neural-network tab: split size and the
# layer-3 sizes to display.
vsizeget <- reactive(input$vsize_nn)
vl3get <- reactive(input$vl3_nn)
# Pre-trained MLP tuning results, restricted to layer1 == 5.
# BUG FIX: the OT and HH frames were previously subset with
# mlpET.train's layer1 column (copy-paste error); each frame is now
# filtered by its own results.
mlpETdf <- mlpET.train$results[mlpET.train$results$layer1 == 5, ]
mlpOTdf <- mlpOT.train$results[mlpOT.train$results$layer1 == 5, ]
mlpHHdf <- mlpHH.train$results[mlpHH.train$results$layer1 == 5, ]
# Initial test observation and its predicted quality.
testData_nn <- eightTwoTest.white[1, ]
testQuality_nn <- predict(mlpET.train, testData_nn)
# Sample a random test-set wine and display predicted vs. real quality.
observeEvent(input$randomize_nn, {
  testData_nn <- eightTwoTest.white[sample(seq_len(nrow(eightTwoTest.white)), 1), ]
  testQuality_nn <- predict(mlpET.train, testData_nn)
  cat("Button pressed!")
  print(testData_nn)
  print(testQuality_nn)
  output$testData_nn <- renderTable(gather(testData_nn, "Feature", "Value"))
  output$testQuality_nn <- renderText(testQuality_nn)
  output$realQuality_nn <- renderText(testData_nn$quality)
})
# Accuracy vs. layer2 size, one line + points per selected layer-3 size.
# A single %in%-filtered layer grouped by layer3 replaces the original
# five duplicated layers.
output$plot_nn <- renderPlot({
  df <- switch(vsizeget(),
    eightyTwenty = mlpETdf,
    oneThird     = mlpOTdf,
    halfHalf     = mlpHHdf
  )
  if (!is.null(df)) {
    sel <- subset(df, layer3 %in% vl3get())
    ggplot(data = df, aes(x = layer2, y = Accuracy, color = layer3)) +
      geom_line(data = sel, aes(group = layer3)) +
      geom_point(data = sel) +
      ylim(0.525, 0.56)
  }
})
###### RANDOM FOREST ########
# Reactive accessors for the random-forest tab.  vkernelget_rf is kept
# for interface stability even though the RF plot has no kernel option.
vsizeget_rf <- reactive(input$vsize_rf)
vkernelget_rf <- reactive(input$vkernel_rf)
# Pre-trained random-forest tuning results per training/test split.
rfETdf <- rfET.train$results
rfOTdf <- rfOT.train$results
rfHHdf <- rfHH.train$results
# Initial test observation and its predicted quality.
testData_rf <- eightTwoTest.white[1, ]
testQuality_rf <- predict(rfET.train, testData_rf)
# Sample a random test-set wine and display predicted vs. real quality.
observeEvent(input$randomize_rf, {
  testData_rf <- eightTwoTest.white[sample(seq_len(nrow(eightTwoTest.white)), 1), ]
  testQuality_rf <- predict(rfET.train, testData_rf)
  cat("Button pressed!")
  print(testData_rf)
  print(testQuality_rf)
  # "Value" capitalised for consistency with the other algorithm tabs.
  output$testData_rf <- renderTable(gather(testData_rf, "Feature", "Value"))
  output$testQuality_rf <- renderText(testQuality_rf)
  output$realQuality_rf <- renderText(testData_rf$quality)
})
# Accuracy vs. mtry.  The per-branch prediction code in the original was
# dead (results never used in the plot; one assignment even targeted a
# misspelled variable `testData`) and has been removed.
output$plot_rf <- renderPlot({
  df <- switch(vsizeget_rf(),
    eightyTwenty = rfETdf,
    oneThird     = rfOTdf,
    halfHalf     = rfHHdf
  )
  if (!is.null(df)) {
    ggplot(data = df, aes(x = mtry, y = Accuracy)) +
      geom_line(data = df) +
      ylim(0.60, 0.67)
  }
})
####### GAME ########
# Input ids of the eleven wine-feature sliders, in the column order of
# the training data.
varnames <- c("var01", "var02", "var03", "var04", "var05", "var06",
              "var07", "var08", "var09", "var10", "var11")
# Template row for the user's custom wine; quality is zeroed until a
# prediction is made.
userData <- eightTwoTest.white[1, ]
userData$quality <- 0
qchar <- "0"
# Current slider values, in the same order as varnames.
vargetter <- reactive(c(
  input$var01, input$var02, input$var03, input$var04, input$var05,
  input$var06, input$var07, input$var08, input$var09, input$var10,
  input$var11
))
# Predicted quality class -> wine-glass image file.  "0" is the
# pre-prediction placeholder; classes "8" and "9" share the top image.
wine_images <- c("0" = "wine_1.png", "3" = "wine_2.png", "4" = "wine_3.png",
                 "5" = "wine_4.png", "6" = "wine_5.png", "7" = "wine_6.png",
                 "8" = "wine_7.png", "9" = "wine_7.png")
# On "Test": copy the slider values into the feature columns, predict the
# quality with the pre-trained random forest, and show a matching image.
observeEvent(input$submit2, {
  varList <- vargetter()
  # Copy slider values into the first eleven feature columns.  (The
  # original also assigned userData[1] separately before the loop, which
  # was redundant.)
  for (i in seq_along(varnames)) {
    userData[i] <- varList[i]
  }
  # Displayed quality: the predicted class label shifted by -2 (the
  # model's factor labels are offset from the displayed scale).
  output$quality <- renderText(
    print(as.integer(as.character(predict(rfET.train, userData))) - 2)
  )
  qchar <- as.character(predict(rfET.train, userData))
  print(qchar)
  # Image lookup replaces the original eight-branch if-chain; unknown
  # classes yield NULL, matching the original fall-through behavior.
  output$wineglass <- renderImage({
    print(qchar)
    img <- wine_images[qchar]
    if (is.na(img)) {
      return(NULL)
    }
    list(
      src = paste0("./images/", unname(img)),
      contentType = "image/png",
      alt = NULL
    )
  }, deleteFile = FALSE)
})
########### INTRO ################
# Explanatory text for the resampling method chosen on the intro tab.
output$restext <- renderText({
  chosen <- input$resamplingmethod
  if (chosen == "Bootstrapping") {
    txt <- bootstext
  } else if (chosen == "Repeated cross-validation") {
    txt <- repeatedcvtext
  } else if (chosen == "LOOCV") {
    txt <- loocvtext
  }
  txt
})
# Title matching the chosen resampling method.
output$restitle <- renderText({
  chosen <- input$resamplingmethod
  if (chosen == "Bootstrapping") {
    ttl <- bootstitle
  } else if (chosen == "Repeated cross-validation") {
    ttl <- repeatedcvtitle
  } else if (chosen == "LOOCV") {
    ttl <- loocvtitle
  }
  ttl
})
# Top plot of the intro tab: scatter plot of the two simulated wine
# groups (red = group 1, green = group 2).
output$descplot <- renderPlot({
  n <- reactive(input$groupsize)
  # Fixed seed: the same data is regenerated on every redraw.
  set.seed(1)
  group1 <- mvrnorm(n = n(), mu = c(2,6), Sigma = matrix(c(1, -0.3, -0.3, 1), nrow = 2), empirical = TRUE)
  group2 <- mvrnorm(n = n(), mu = c(4,4), Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2), empirical = TRUE)
  # (The combined data frame built in the original was never used for
  # this plot and has been removed.)
  plot(group1, pch = 15, col = "red", bty = "n",
       xlim = range(group1[,1], group2[,1]),
       ylim = range(group1[,2], group2[,2]),
       xlab = "Sweetness", ylab = "Fullness", main = "Wine")
  points(group2, pch = 16, col = "green")
  # BUG FIX: legend symbols now match the plotted symbols (15 = square
  # for red wine, 16 = circle for white wine; was c(16, 17)).
  legend('topright', legend = c("Red Wine", "White Wine"),
         col = c("red","green"), pch = c(15, 16), bty = "n")
})
# Duplicate of output$descplot for a second tab (Shiny outputs cannot be
# reused across tabs).
output$descplot_2 <- renderPlot({
  n <- reactive(input$groupsize)
  # Fixed seed: the same data is regenerated on every redraw.
  set.seed(1)
  group1 <- mvrnorm(n = n(), mu = c(2,6), Sigma = matrix(c(1, -0.3, -0.3, 1), nrow = 2), empirical = TRUE)
  group2 <- mvrnorm(n = n(), mu = c(4,4), Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2), empirical = TRUE)
  # (The combined data frame built in the original was never used for
  # this plot and has been removed.)
  plot(group1, pch = 15, col = "red", bty = "n",
       xlim = range(group1[,1], group2[,1]),
       ylim = range(group1[,2], group2[,2]),
       xlab = "Sweetness", ylab = "Fullness", main = "Wine")
  points(group2, pch = 16, col = "green")
  # BUG FIX: legend symbols now match the plotted symbols (15 = square
  # for red wine, 16 = circle for white wine; was c(16, 17)).
  legend('topright', legend = c("Red Wine", "White Wine"),
         col = c("red","green"), pch = c(15, 16), bty = "n")
})
# Bottom plot of the intro tab: simulate the two wine groups, train an
# SVM with the user-selected split/resampling settings, and draw its
# decision boundary with decisionplot_basic().
output$machplot <- renderPlot({
# Number of observations per group (slider).
n <- reactive(input$groupsize)
# Fixed seed: the same data is regenerated on every redraw.
set.seed(1)
# Two bivariate-normal groups; empirical = TRUE forces the sample moments
# to match mu/Sigma exactly.
group1 <- mvrnorm(n = n(), mu = c(2,6), Sigma = matrix(c(1, -0.3, -0.3, 1), nrow = 2), empirical = TRUE)
group2 <- mvrnorm(n = n(), mu = c(4,4), Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2), empirical = TRUE)
# Class labels: 1 = red wine, 2 = white wine.
dummy <- as.factor(c(rep(1, n()), rep(2, n())))
data <- rbind(group1, group2)
data <- as.data.frame(cbind(data,dummy))
colnames(data) <- c("Sweetness", "Fullness", "group")
# Training fraction chosen via the "trainingset" control.
trainn <- reactive(
if(input$trainingset == "50/50") {trainn <- 0.5}
else if (input$trainingset == "66/33") {trainn <- 0.66}
else if (input$trainingset == "80/20") {trainn <- 0.8}
)
# caret resampling method chosen via the "resamplingmethod" control.
resmeth <- reactive(
if(input$resamplingmethod == "Bootstrapping") {resmeth <- "boot"}
else if(input$resamplingmethod == "Repeated cross-validation") {resmeth <- "repeatedcv"}
else if(input$resamplingmethod == "LOOCV") {resmeth <- "LOOCV"}
)
#subset data
group <- as.factor(data[,3])
selected.data <- cbind(data[,1], data[,2])
##Create Dummy Variables to remove factor features.
dv <- dummyVars( ~., data = selected.data)
##Read dummy variables into new data frame.
dv.data <- predict(dv, newdata = selected.data)
##Make Preprocessed object.
pp <- preProcess(dv.data, method = 'knnImpute')
##Create object with preprocessed data
impute.data <- predict(pp, dv.data)
# NOTE(review): impute.data is computed but never used below -- the model
# is trained on the raw dv.data.  Presumably the imputed data was
# intended here; confirm before changing, as it would alter the plot.
##Load imputed data into data frame.
imputed.data <- data.frame(dv.data)
imputed.data$group <- as.factor(group)
# Split into training and test sets at the chosen fraction.
train.model.ind <- createDataPartition(imputed.data$group, p = trainn(), list = FALSE)
### 33/66, 50/50, 80/20
train.model <- imputed.data[train.model.ind,]
test.model <- imputed.data[-train.model.ind,]
tc <- trainControl(allowParallel = TRUE, method = resmeth())
### svm plot
model <- train(group ~ .,data=train.model, method="svmRadial", trControl=tc, preProcess=c("center","scale"))
decisionplot_basic(model, train.model, class = "group", main = "SVM (radial kernel)",predict_type="raw",
xlab = "Sweetness", ylab = "Fullness")
})
# Duplicate of output$machplot for a second tab (Shiny outputs cannot be
# reused across tabs): simulate the groups, train an SVM, draw the
# decision boundary.
output$machplot_2 <- renderPlot({
# Number of observations per group (slider).
n <- reactive(input$groupsize)
# Fixed seed: the same data is regenerated on every redraw.
set.seed(1)
group1 <- mvrnorm(n = n(), mu = c(2,6), Sigma = matrix(c(1, -0.3, -0.3, 1), nrow = 2), empirical = TRUE)
group2 <- mvrnorm(n = n(), mu = c(4,4), Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2), empirical = TRUE)
# Class labels: 1 = red wine, 2 = white wine.
dummy <- as.factor(c(rep(1, n()), rep(2, n())))
data <- rbind(group1, group2)
data <- as.data.frame(cbind(data,dummy))
colnames(data) <- c("Sweetness", "Fullness", "group")
# Training fraction chosen via the "trainingset" control.
trainn <- reactive(
if(input$trainingset == "50/50") {trainn <- 0.5}
else if (input$trainingset == "66/33") {trainn <- 0.66}
else if (input$trainingset == "80/20") {trainn <- 0.8}
)
# caret resampling method chosen via the "resamplingmethod" control.
resmeth <- reactive(
if(input$resamplingmethod == "Bootstrapping") {resmeth <- "boot"}
else if(input$resamplingmethod == "Repeated cross-validation") {resmeth <- "repeatedcv"}
else if(input$resamplingmethod == "LOOCV") {resmeth <- "LOOCV"}
)
#subset data
group <- as.factor(data[,3])
selected.data <- cbind(data[,1], data[,2])
##Create Dummy Variables to remove factor features.
dv <- dummyVars( ~., data = selected.data)
##Read dummy variables into new data frame.
dv.data <- predict(dv, newdata = selected.data)
##Make Preprocessed object.
pp <- preProcess(dv.data, method = 'knnImpute')
##Create object with preprocessed data
impute.data <- predict(pp, dv.data)
# NOTE(review): impute.data is computed but never used below -- the model
# is trained on the raw dv.data.  Confirm before changing.
##Load imputed data into data frame.
imputed.data <- data.frame(dv.data)
imputed.data$group <- as.factor(group)
# Split into training and test sets at the chosen fraction.
train.model.ind <- createDataPartition(imputed.data$group, p = trainn(), list = FALSE)
### 33/66, 50/50, 80/20
train.model <- imputed.data[train.model.ind,]
test.model <- imputed.data[-train.model.ind,]
tc <- trainControl(allowParallel = TRUE, method = resmeth())
### svm plot
model <- train(group ~ .,data=train.model, method="svmRadial", trControl=tc, preProcess=c("center","scale"))
decisionplot_basic(model, train.model, class = "group", main = "SVM (radial kernel)",predict_type="raw",
xlab = "Sweetness", ylab = "Fullness")
})
# Tuning-results table for the intro tab: repeats the same simulation and
# SVM fit as output$machplot, then shows caret's resampling results.
output$results <- renderTable({
# Number of observations per group (slider).
n <- reactive(input$groupsize)
# Fixed seed: the same data is regenerated on every redraw.
set.seed(1)
group1 <- mvrnorm(n = n(), mu = c(2,6), Sigma = matrix(c(1, -0.3, -0.3, 1), nrow = 2), empirical = TRUE)
group2 <- mvrnorm(n = n(), mu = c(4,4), Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2), empirical = TRUE)
# Class labels: 1 = red wine, 2 = white wine.
dummy <- as.factor(c(rep(1, n()), rep(2, n())))
data <- rbind(group1, group2)
data <- as.data.frame(cbind(data,dummy))
colnames(data) <- c("Sweetness", "Fullness", "group")
# Training fraction chosen via the "trainingset" control.
trainn <- reactive(
if(input$trainingset == "50/50") {trainn <- 0.5}
else if (input$trainingset == "66/33") {trainn <- 0.66}
else if (input$trainingset == "80/20") {trainn <- 0.8}
)
# caret resampling method chosen via the "resamplingmethod" control.
resmeth <- reactive(
if(input$resamplingmethod == "Bootstrapping") {resmeth <- "boot"}
else if(input$resamplingmethod == "Repeated cross-validation") {resmeth <- "repeatedcv"}
else if(input$resamplingmethod == "LOOCV") {resmeth <- "LOOCV"}
)
#subset data
group <- as.factor(data[,3])
selected.data <- cbind(data[,1], data[,2])
##Create Dummy Variables to remove factor features.
dv <- dummyVars( ~., data = selected.data)
##Read dummy variables into new data frame.
dv.data <- predict(dv, newdata = selected.data)
##Make Preprocessed object.
pp <- preProcess(dv.data, method = 'knnImpute')
##Create object with preprocessed data
impute.data <- predict(pp, dv.data)
# NOTE(review): impute.data is unused below; the model is trained on the
# raw dv.data.  Confirm before changing.
##Load imputed data into data frame.
imputed.data <- data.frame(dv.data)
imputed.data$group <- as.factor(group)
train.model.ind <- createDataPartition(imputed.data$group, p = trainn(), list = FALSE)
### 33/66, 50/50, 80/20
train.model <- imputed.data[train.model.ind,]
test.model <- imputed.data[-train.model.ind,]
tc <- trainControl(allowParallel = TRUE, method = resmeth())
model <- train(group ~ .,data=train.model, method="svmRadial", trControl=tc, preProcess=c("center","scale"))
# print() returns its argument, so the results frame is also the value
# rendered by renderTable.
print(model$results)
})
# Duplicate of output$results for a second tab (Shiny outputs cannot be
# reused across tabs).
output$results_2 <- renderTable({
# Number of observations per group (slider).
n <- reactive(input$groupsize)
# Fixed seed: the same data is regenerated on every redraw.
set.seed(1)
group1 <- mvrnorm(n = n(), mu = c(2,6), Sigma = matrix(c(1, -0.3, -0.3, 1), nrow = 2), empirical = TRUE)
group2 <- mvrnorm(n = n(), mu = c(4,4), Sigma = matrix(c(1, 0.3, 0.3, 1), nrow = 2), empirical = TRUE)
# Class labels: 1 = red wine, 2 = white wine.
dummy <- as.factor(c(rep(1, n()), rep(2, n())))
data <- rbind(group1, group2)
data <- as.data.frame(cbind(data,dummy))
colnames(data) <- c("Sweetness", "Fullness", "group")
# Training fraction chosen via the "trainingset" control.
trainn <- reactive(
if(input$trainingset == "50/50") {trainn <- 0.5}
else if (input$trainingset == "66/33") {trainn <- 0.66}
else if (input$trainingset == "80/20") {trainn <- 0.8}
)
# caret resampling method chosen via the "resamplingmethod" control.
resmeth <- reactive(
if(input$resamplingmethod == "Bootstrapping") {resmeth <- "boot"}
else if(input$resamplingmethod == "Repeated cross-validation") {resmeth <- "repeatedcv"}
else if(input$resamplingmethod == "LOOCV") {resmeth <- "LOOCV"}
)
#subset data
group <- as.factor(data[,3])
selected.data <- cbind(data[,1], data[,2])
##Create Dummy Variables to remove factor features.
dv <- dummyVars( ~., data = selected.data)
##Read dummy variables into new data frame.
dv.data <- predict(dv, newdata = selected.data)
##Make Preprocessed object.
pp <- preProcess(dv.data, method = 'knnImpute')
##Create object with preprocessed data
impute.data <- predict(pp, dv.data)
# NOTE(review): impute.data is unused below; the model is trained on the
# raw dv.data.  Confirm before changing.
##Load imputed data into data frame.
imputed.data <- data.frame(dv.data)
imputed.data$group <- as.factor(group)
train.model.ind <- createDataPartition(imputed.data$group, p = trainn(), list = FALSE)
### 33/66, 50/50, 80/20
train.model <- imputed.data[train.model.ind,]
test.model <- imputed.data[-train.model.ind,]
tc <- trainControl(allowParallel = TRUE, method = resmeth())
model <- train(group ~ .,data=train.model, method="svmRadial", trControl=tc, preProcess=c("center","scale"))
# print() returns its argument, so the results frame is also the value
# rendered by renderTable.
print(model$results)
})
# Boxplot of all raw white-wine features (data-overview tab).
output$box1 <- renderPlot(boxplot(white.raw))
###### COMPARISON #####
# Default kknn tuning grids (kmax = 1) for the comparison tab.  When the
# user selects "kknn", these are shadowed inside output$decisionPlot by
# grids built from the chosen k sliders.
kknn.grid1 <- expand.grid(kmax = 1, distance = c(1,2),
kernel = c("rectangular", "gaussian", "cos"))
kknn.grid2 <- expand.grid(kmax = 1, distance = c(1,2),
kernel = c("rectangular", "gaussian", "cos"))
# Plot `data` coloured by class and overlay the decision boundary of
# `model` as a red contour, evaluated on a resolution x resolution grid
# spanning the first two columns of `data`.
#
# model        fitted model accepted by predict()
# data         data frame; columns 1-2 are the plotting axes
# class        name of the class column in `data` (NULL -> single class)
# predict_type `type` argument forwarded to predict()
# resolution   number of grid points per axis
# showgrid     also draw the predicted class of every grid point
# ...          further arguments passed to plot()
#
# Returns, invisibly, the resolution x resolution matrix of predicted
# class codes.
decisionplot <- function(model, data, class = NULL, predict_type = "class",
                         resolution = 100, showgrid = TRUE, ...) {
  if (!is.null(class)) cl <- data[, class] else cl <- 1
  data <- data[, 1:2]
  k <- length(unique(cl))
  plot(data, col = as.integer(cl) + 1L, pch = as.integer(cl) + 1L, ...)
  # Build an evaluation grid spanning the data range.
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  # FIX: `times` spelled out (was the partially matched `time = `).
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)
  # Coerce predict() output through matrix/factor because its shape is
  # not consistent across model types.
  p <- predict(model, g, type = predict_type)
  p <- as.matrix(p)
  p <- as.factor(p)
  if (showgrid) points(g, col = as.integer(p) + 1L, pch = ".")
  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  # Boundary between adjacent class codes, drawn in red.
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
          lwd = 2, levels = (1:(k - 1)) + .5, col = "red")
  invisible(z)
}
# Overlay the decision boundary of `model` as a blue contour on an
# EXISTING plot (companion to decisionplot(), which creates the plot).
#
# Arguments mirror decisionplot(); `...` is accepted for call-site
# symmetry but unused.  Returns, invisibly, the resolution x resolution
# matrix of predicted class codes.
addContour <- function(model, data, class = NULL, predict_type = "class",
                       resolution = 100, showgrid = TRUE, ...) {
  if (!is.null(class)) cl <- data[, class] else cl <- 1
  data <- data[, 1:2]
  k <- length(unique(cl))
  # Build an evaluation grid spanning the data range.
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  # FIX: `times` spelled out (was the partially matched `time = `).
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)
  # Coerce predict() output through matrix/factor because its shape is
  # not consistent across model types.
  p <- predict(model, g, type = predict_type)
  p <- as.matrix(p)
  p <- as.factor(p)
  if (showgrid) points(g, col = as.integer(p) + 1L, pch = ".")
  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  # Boundary between adjacent class codes, drawn in blue.
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
          lwd = 2, levels = (1:(k - 1)) + .5, col = "blue")
  invisible(z)
}
# Seed for the comparison-tab model fits below.
set.seed(1000)
# decisionplot ------------------------------------------------------------
# Raw white-wine data for the comparison tab; only alcohol, pH and
# quality are kept, with quality restricted to the interval [2, 7).
white.url <- "winequality-white_alt.csv"
white.raw <- read.csv(white.url, header = TRUE, sep = ";")
white.new <- white.raw[,c("alcohol","pH","quality")]
white.new <- white.new[white.new$quality >= 2,]
white.new <- white.new[white.new$quality < 7,]
# Comparison tab: fit two user-selected models on the same subsample of
# the wine data and overlay their decision boundaries (model 1 in red via
# decisionplot(), model 2 in blue via addContour()).
output$decisionPlot <- renderPlot({
  N <- reactive(input$n)
  quality.n <- reactive(input$quality.n)
  # Restrict quality range and sample size according to the sliders.
  white.new <- white.new[white.new$quality >= quality.n(), ]
  white.new <- white.new[white.new$quality < 7, ]
  white.new <- white.new[1:N(), ]
  white.new$quality <- as.factor(white.new$quality)
  x <- na.exclude(white.new)
  # Cheap resampling (2-fold CV, one repeat): speed over accuracy here.
  t.ctrl <- trainControl(method = "repeatedcv", number = 2, repeats = 1)
  getK1 <- reactive(input$k.input1)
  getK2 <- reactive(input$k.input2)
  model1 <- reactive(input$selection1)
  model2 <- reactive(input$selection2)
  costLogReg1 <- reactive(input$regLog.cost.input1)
  costLogReg2 <- reactive(input$regLog.cost.input2)
  lossLogReg1 <- reactive(input$regLog.loss.input1)
  lossLogReg2 <- reactive(input$regLog.loss.input2)
  plot(x[, 1:2], col = x[, 3])
  # Per-model kknn tuning grids built from the chosen k values; these
  # shadow the file-level defaults.
  if (model1() == "kknn") {
    kknn.grid1 <- expand.grid(kmax = getK1(), distance = c(1, 2),
                              kernel = c("rectangular", "gaussian", "cos"))
  }
  if (model2() == "kknn") {
    kknn.grid2 <- expand.grid(kmax = getK2(), distance = c(1, 2),
                              kernel = c("rectangular", "gaussian", "cos"))
  }
  if (model1() == "kknn") {
    modelPlot1 <- train(quality ~ ., data = x, method = "kknn",
                        trControl = t.ctrl, tuneGrid = kknn.grid1,
                        preProcess = c("center", "scale"))
  } else {
    modelPlot1 <- train(quality ~ ., data = x, method = model1(),
                        trControl = t.ctrl, preProcess = c("center", "scale"))
  }
  if (model2() == "kknn") {
    # BUG FIX: model 2 previously trained with kknn.grid1, silently
    # ignoring the second k slider; it now uses kknn.grid2.
    modelPlot2 <- train(quality ~ ., data = x, method = "kknn",
                        trControl = t.ctrl, tuneGrid = kknn.grid2,
                        preProcess = c("center", "scale"))
  } else {
    modelPlot2 <- train(quality ~ ., data = x, method = model2(),
                        trControl = t.ctrl, preProcess = c("center", "scale"))
  }
  decisionplot(modelPlot1, x, class = "quality", main = model1(),
               predict_type = "raw", showgrid = FALSE)
  addContour(modelPlot2, x, class = "quality", main = model2(),
             predict_type = "raw", showgrid = FALSE)
})
}
shinyApp(ui, server)