
Logistic Regression: Classification of Wine Quality


In the previous post, we trained DynaML’s feed forward neural networks on the wine quality data set. Let’s see how single layer feed forward neural networks compare to a simple logistic regression model trained using gradient descent. The TestLogisticWineQuality program in the examples package does precisely that (check out the source code below).
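
For reference, gradient descent for L2-regularized logistic regression updates the model weights as shown below, where $\eta$ corresponds to the stepSize argument and $\lambda$ to the regularization argument in the calls that follow (DynaML’s exact internal parameterization may differ slightly):

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \left( \nabla_{\mathbf{w}} L(\mathbf{w}) + \lambda \mathbf{w} \right)$$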

Red Wine

TestLogisticWineQuality(stepSize = 0.2, maxIt = 120,
mini = 1.0, training = 800,
test = 800, regularization = 0.2,
wineType = "red")
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:21:57 INFO BinaryClassificationMetrics: ============================
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Accuracy: 0.8475
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Area under ROC: 0.7968417788802267
16/04/01 15:21:57 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7493563745371187

[Figure: ROC curve, red wine]

[Figure: F-measure, red wine]

White Wine

TestLogisticWineQuality(stepSize = 0.26, maxIt = 300,
mini = 1.0, training = 3800,
test = 1000, regularization = 0.0,
wineType = "white")
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 15:27:17 INFO BinaryClassificationMetrics: ============================
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Accuracy: 0.829
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Area under ROC: 0.7184782682020251
16/04/01 15:27:17 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7182203962483446

[Figure: ROC curve, white wine]

[Figure: F-measure, white wine]

Comparison with Neural Networks

Considering that a simple logistic regression model performs quite well on the data, and that logistic regression is equivalent to a single-perceptron neural network model, we can train a neural network with zero hidden layers using the TestNNWineQuality program.
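
The equivalence holds because logistic regression models the class probability as a sigmoid applied to a linear combination of the inputs, which is exactly the computation performed by a single output neuron with a sigmoid activation:

$$P(y = 1 \,|\, \mathbf{x}) = \sigma\left(\mathbf{w}^\top \mathbf{x} + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$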

TestNNWineQuality(0, List(), List("tansig"), stepSize = 0.2, maxIt = 120, 
mini = 1.0, alpha = 0.0, training = 1200, test = 400, regularization = 0.0, 
wineType = "red")
16/04/01 14:04:34 INFO BinaryClassificationMetrics: Classification Model Performance
16/04/01 14:04:34 INFO BinaryClassificationMetrics: ============================
16/04/01 14:04:34 INFO BinaryClassificationMetrics: Accuracy: 0.895
16/04/01 14:04:34 INFO BinaryClassificationMetrics: Area under ROC: 0.8209578913532626
16/04/01 14:04:34 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7975192758967482

This gives performance in the same ballpark as the logistic regression model; note, however, that we used a larger training set and a hyperbolic tangent activation function.

Source Code

Neural Networks: Classification of Wine Quality


The wine quality data set is a common benchmark for classification models. Here we use the DynaML Scala machine learning environment to train classifiers that distinguish ‘good’ wine from ‘bad’ wine. A short listing of the data attributes/columns is given below. The UCI archive has two files in the wine quality data set, namely winequality-red.csv and winequality-white.csv. We train two separate classification models, one for red wine and one for white.

Wine: Representative Image

Data Set

Inputs:

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Output (based on sensory data):

  1. quality (score between 0 and 10)

Data Output Preprocessing

The wine quality target variable takes integer values from 0 to 10. We first convert it into a binary class variable by setting the quality to ‘good’ (encoded by the value 1.0) if the numerical value is greater than 6 and ‘bad’ (encoded by the value -1.0) otherwise.
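
A minimal sketch of this encoding step in Scala is given below; the function name is illustrative and not part of DynaML’s API.

// Illustrative encoding of the quality score into a binary target:
// scores above 6 are 'good' (1.0), the rest are 'bad' (-1.0).
def encodeQuality(quality: Int): Double =
  if (quality > 6) 1.0 else -1.0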

Wine Quality: Neural Network Experiment

The TestNNWineQuality program in the DynaML examples package contains all the required code for model building and testing, see the gist below for more details.

Red Wine

TestNNWineQuality(hidden = 1, nCounts = List(2),
acts = List("linear", "tansig"), stepSize = 0.2, maxIt = 130,
mini = 1.0, alpha = 0.0,
training = 1000, test = 600,
regularization = 0.001,
wineType = "red")
16/03/30 18:59:38 INFO BinaryClassificationMetrics: Classification Model Performance: red wine
16/03/30 18:59:38 INFO BinaryClassificationMetrics: ============================
16/03/30 18:59:38 INFO BinaryClassificationMetrics: Accuracy: 0.8566666666666667
16/03/30 18:59:38 INFO BinaryClassificationMetrics: Area under ROC: 0.7782440503121889
16/03/30 18:59:38 INFO BinaryClassificationMetrics: Maximum F Measure: 0.755966787057378

[Figure: ROC curve, red wine]

[Figure: F-measure, red wine]

White Wine

TestNNWineQuality(hidden = 1, nCounts = List(3),
acts = List("linear", "tansig"), stepSize = 0.16, maxIt = 100,
mini = 1.0, alpha = 0.0,
training = 1500, test = 3000,
regularization = 0.001,
wineType = "white")

[Figure: ROC curve, white wine]

[Figure: F-measure, white wine]

16/03/30 18:49:58 INFO BinaryClassificationMetrics: Classification Model Performance: white wine
16/03/30 18:49:58 INFO BinaryClassificationMetrics: ============================
16/03/30 18:49:58 INFO BinaryClassificationMetrics: Accuracy: 0.8096666666666666
16/03/30 18:49:58 INFO BinaryClassificationMetrics: Area under ROC: 0.7784814672924049
16/03/30 18:49:58 INFO BinaryClassificationMetrics: Maximum F Measure: 0.7570286230962675

Source Code

System Identification using Gaussian Processes: Abbott Power Plant, Champaign, Illinois


In this post, we use the DynaML Scala machine learning environment to train Gaussian Process models to analyse time series data taken from a coal power plant.


Abbott: Representative Image


The Data Set

From the DaISy system identification database, we download the Abbott power plant data. The data characteristics are summarized below.

Description: The data comes from a model of a Steam Generator at Abbott Power Plant in Champaign IL.

Sampling Time: 3 sec

Number of Samples: 9600

Inputs:

  1. Fuel (scaled 0-1)
  2. Air (scaled 0-1)
  3. Reference level (inches)
  4. Disturbance defined by the load level

Outputs:

  5. Drum pressure (PSI)
  6. Excess oxygen in exhaust gases (%)
  7. Level of water in the drum
  8. Steam flow (Kg/s)

Nonlinear AutoRegressive with eXogenous inputs (NARX)

A candidate output signal $y_t$ is modeled as a function of its own previous values and the exogenous inputs $\mathbf{u}_t$:

$$y_t = f(y_{t-1}, \ldots, y_{t-p}, \mathbf{u}_{t-1}, \ldots, \mathbf{u}_{t-p}) + \epsilon_t$$


Gaussian Processes

Gaussian Processes are powerful non-parametric methods for solving regression and classification problems. They are based on a structural assumption about the finite dimensional distributions over spaces of functions, as shown in the equations below.

Formulation

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x}')\right)$$

$$\left(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)\right) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma), \qquad \boldsymbol{\mu}_i = m(\mathbf{x}_i), \; \Sigma_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$$

Posterior Predictive Distribution

In the presence of training data $(X, \mathbf{y})$, one may calculate, using Bayes’ theorem, the posterior predictive distribution over the test outputs $\mathbf{f}_*$, assuming $X_*$, the test inputs, are known:

$$\mathbf{f}_* \,|\, X, \mathbf{y}, X_* \sim \mathcal{N}\left(\bar{\mathbf{f}}_*, \mathrm{cov}(\mathbf{f}_*)\right)$$

$$\bar{\mathbf{f}}_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} \mathbf{y}$$

$$\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*)$$


For an in-depth treatment of Gaussian Processes, refer to the book Gaussian Processes for Machine Learning by Rasmussen and Williams.


Modelling Power Plant Outputs

Drum Pressure (PSI)

AbottPowerPlant(new PolynomialKernel(2, 0.49), new DiracKernel(0.09),
opt = Map("globalOpt" -> "GS", "grid" -> "4", "step" -> "0.004"),
num_training = 200, num_test = 1000, deltaT = 2, column = 5)


[Figure: GP model predictions, drum pressure]


Excess Oxygen in exhaust gases (as %)

AbottPowerPlant(new PolynomialKernel(2, 0.49), new DiracKernel(0.09),
opt = Map("globalOpt" -> "GS", "grid" -> "4", "step" -> "0.004"),
num_training = 200, num_test = 1000, deltaT = 2, column = 6)


[Figure: GP model predictions, excess oxygen]


Level of water in the drum

AbottPowerPlant(new PolynomialKernel(2, 0.49), new DiracKernel(0.09),
opt = Map("globalOpt" -> "GS", "grid" -> "4", "step" -> "0.004"),
num_training = 200, num_test = 1000, deltaT = 2, column = 7)


[Figure: GP model predictions, drum water level]


Steam Flow (Kg/s)

AbottPowerPlant(new PolynomialKernel(2, 0.49), new DiracKernel(0.09),
opt = Map("globalOpt" -> "GS", "grid" -> "4", "step" -> "0.004"),
num_training = 200, num_test = 1000,
deltaT = 2, column = 8)


[Figure: GP model predictions, steam flow]


Source Code

Below is the example program as a GitHub gist; to view the original program in DynaML, click here.

System Identification using Gaussian Processes: Santa Fe Laser Data Set


System Identification

For a short introduction to system identification and some common models, refer to this previous post. Below I give a short tour of the Santa Fe Laser example which ships with the DynaML machine learning library.

Santa Fe Laser Generated Data

[Figure: Santa Fe laser intensity data]

The Santa Fe laser data is a standard benchmark data set in system identification, and it serves as a good starting point for exploring time series models: it records only one observable (laser intensity), has little noise, and is generated by a known physical dynamical process. A more detailed explanation is given below.

The measurements were made on an 81.5-micron 14NH3 cw (FIR) laser, pumped optically by the P(13) line of an N2O laser via the vibrational aQ(8,7) NH3 transition. The basic laser setup can be found in Ref. 1. The intensity data was recorded by a LeCroy oscilloscope. No further processing happened. The experimental signal to noise ratio was about 300 which means slightly under the half bit uncertainty of the analog to digital conversion. The data is a cross-cut through periodic to chaotic intensity pulsations of the laser. Chaotic pulsations more or less follow the theoretical Lorenz model (see References) of a two level system.

Source

Santa Fe Laser: NAR model

The data set is unidimensional, so we can only train a Nonlinear Auto-Regressive (NAR) model for the laser intensity. Choosing an auto-regressive order of 2 (the deltaT argument in the calls below), we train two candidate NAR models.
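
With order 2 the model takes the form

$$y_t = f(y_{t-1}, y_{t-2}) + \epsilon_t$$

where the unknown function $f$ is given a Gaussian Process prior.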

Choice of Kernel Function

For this problem we build models based on two kernels.
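
For reference, the two covariance functions have the following standard forms, where $l$ is the bandwidth of the RBF kernel and $H$ the Hurst exponent of the FBM kernel; these correspond to the constructor arguments RBFKernel(1.5) and FBMKernel(1.1) below, assuming DynaML uses the standard parameterizations:

$$K_{rbf}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 l^2}\right)$$

$$K_{fbm}(\mathbf{x}, \mathbf{x}') = \frac{1}{2}\left(\|\mathbf{x}\|^{2H} + \|\mathbf{x}'\|^{2H} - \|\mathbf{x} - \mathbf{x}'\|^{2H}\right)$$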

  • Radial Basis Function (RBF):
SantaFeLaser(new RBFKernel(1.5), new DiracKernel(1.0),
opt = Map("globalOpt" -> "GS", "grid" -> "9", "step" -> "0.1"),
num_training = 200, num_test = 500, deltaT = 2)
16/03/07 22:03:10 INFO RegressionMetrics: Regression Model Performance: Laser Intensity
16/03/07 22:03:10 INFO RegressionMetrics: ============================
16/03/07 22:03:10 INFO RegressionMetrics: MAE: 10.919757407593648
16/03/07 22:03:10 INFO RegressionMetrics: RMSE: 18.527723082632765
16/03/07 22:03:10 INFO RegressionMetrics: RMSLE: 0.41343485025397475
16/03/07 22:03:10 INFO RegressionMetrics: R^2: 0.8550953005807426
16/03/07 22:03:10 INFO RegressionMetrics: Corr. Coefficient: 0.928916961722154
16/03/07 22:03:10 INFO RegressionMetrics: Model Yield: 0.6597256758459964
16/03/07 22:03:10 INFO RegressionMetrics: Std Dev of Residuals: 18.1615168822832

[Figures: model predictions, RBF kernel]

  • Fractional Brownian Field (FBM):
DynaML>SantaFeLaser(new FBMKernel(1.1), new DiracKernel(1.0),
opt = Map("globalOpt" -> "GS", "grid" -> "10", "step" -> "0.1"),
num_training = 200, num_test = 500, deltaT = 2)
16/03/07 22:07:46 INFO RegressionMetrics: Regression Model Performance: Laser Intensity
16/03/07 22:07:46 INFO RegressionMetrics: ============================
16/03/07 22:07:46 INFO RegressionMetrics: MAE: 8.466099689528546
16/03/07 22:07:46 INFO RegressionMetrics: RMSE: 13.523138654434868
16/03/07 22:07:46 INFO RegressionMetrics: RMSLE: 0.38303731310173433
16/03/07 22:07:46 INFO RegressionMetrics: R^2: 0.9228042537204268
16/03/07 22:07:46 INFO RegressionMetrics: Corr. Coefficient: 0.964525269647539
16/03/07 22:07:46 INFO RegressionMetrics: Model Yield: 0.7656581073289345
16/03/07 22:07:46 INFO RegressionMetrics: Std Dev of Residuals: 14.742253950108552

[Figures: model predictions, FBM kernel]

Source Code

System Identification using LSSVMs: Pont-sur-Sambre Power Plant

System Identification

System identification refers to the process of learning a predictive model for a given dynamic system, i.e. a system whose dynamics evolve with time. The most important aspect of these models is their structure; the following are the common model structures for discretely sampled time dependent systems.

Nonlinear AutoRegressive (NAR)

The signal $y_t$ is modeled as a function of its own previous values:

$$y_t = f(y_{t-1}, \ldots, y_{t-p}) + \epsilon_t$$

Nonlinear AutoRegressive with eXogenous inputs (NARX)

The signal $y_t$ is modeled as a function of its own previous values and the exogenous inputs $\mathbf{u}_t$:

$$y_t = f(y_{t-1}, \ldots, y_{t-p}, \mathbf{u}_{t-1}, \ldots, \mathbf{u}_{t-p}) + \epsilon_t$$


DaISy is a database of (artificial and real world) dynamic systems maintained by the STADIUS research group at KU Leuven. In this post, we will work with the power plant data set listed on the DaISy home page. Using DynaML, which comes preloaded with the power plant data, we will train LSSVM models to predict the various output indicators of the power plant in question.

Pont-sur-Sambre: Representative Image

Attributes

Instances: 200

Inputs:

  1. Gas flow
  2. Turbine valves opening
  3. Super heater spray flow
  4. Gas dampers
  5. Air flow

Outputs:

  1. Steam pressure
  2. Main steam temperature
  3. Reheat steam temperature

System Modelling

An LSSVM NARX model of a chosen autoregressive order (the deltaT argument in the calls below) is used to model the plant output data. An LSSVM model builds a predictor of the following form:

$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b$$

which is the result of solving the following linear system:

$$\begin{bmatrix} 0 & \mathbf{1}^\top \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix}$$

Here the matrix $\Omega$ is constructed from the training data using a kernel function $K$, with $\Omega_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$.
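
As an illustration (a minimal sketch using the Breeze linear algebra library, not DynaML’s internal solver), the system above can be assembled and solved as follows:

import breeze.linalg._

// Solve the LSSVM linear system for (b, alpha), given the kernel
// Gram matrix `omega`, the targets `y` and the regularization
// parameter `gamma`. Purely illustrative.
def solveLSSVM(omega: DenseMatrix[Double],
               y: DenseVector[Double],
               gamma: Double): (Double, DenseVector[Double]) = {
  val n = y.length
  // Assemble the bordered block matrix [[0, 1^T], [1, Omega + I/gamma]]
  val a = DenseMatrix.zeros[Double](n + 1, n + 1)
  for (i <- 1 to n) { a(0, i) = 1.0; a(i, 0) = 1.0 }
  for (i <- 1 to n; j <- 1 to n)
    a(i, j) = omega(i - 1, j - 1) + (if (i == j) 1.0 / gamma else 0.0)
  // Right hand side is [0, y]
  val rhs = DenseVector.vertcat(DenseVector(0.0), y)
  val sol = a \ rhs
  (sol(0), sol.slice(1, n + 1))
}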

Choice of Kernel Function

For this problem we choose a polynomial kernel.
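
In its standard form the polynomial kernel is

$$K(\mathbf{x}, \mathbf{x}') = \left(\mathbf{x}^\top \mathbf{x}' + c\right)^d$$

where, assuming DynaML follows this common parameterization, the degree $d$ and offset $c$ correspond to the two arguments of PolynomialKernel(2, 0.5) in the calls below.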

Steam Pressure

DynaML>DaisyPowerPlant(new PolynomialKernel(2, 0.5),
opt = Map("regularization" -> "2.5", "globalOpt" -> "GS",
"grid" -> "4", "step" -> "0.1"),
num_training = 100, deltaT = 2,
column = 6)
16/03/04 17:13:43 INFO RegressionMetrics: Regression Model Performance: steam pressure
16/03/04 17:13:43 INFO RegressionMetrics: ============================
16/03/04 17:13:43 INFO RegressionMetrics: MAE: 82.12740530161123
16/03/04 17:13:43 INFO RegressionMetrics: RMSE: 104.39251587470388
16/03/04 17:13:43 INFO RegressionMetrics: RMSLE: 0.9660077848586197
16/03/04 17:13:43 INFO RegressionMetrics: R^2: 0.8395534877128238
16/03/04 17:13:43 INFO RegressionMetrics: Corr. Coefficient: 0.9311734118932473
16/03/04 17:13:43 INFO RegressionMetrics: Model Yield: 0.6288000962818303
16/03/04 17:13:43 INFO RegressionMetrics: Std Dev of Residuals: 87.82754320038951

[Figures: steam pressure predictions]

Reheat Steam Temperature

DaisyPowerPlant(new PolynomialKernel(2, 1.5),
opt = Map("regularization" -> "2.5", "globalOpt" -> "GS",
"grid" -> "4", "step" -> "0.1"), num_training = 150,
deltaT = 1, column = 8)
16/03/04 16:50:42 INFO RegressionMetrics: Regression Model Performance: reheat steam temperature
16/03/04 16:50:42 INFO RegressionMetrics: ============================
16/03/04 16:50:42 INFO RegressionMetrics: MAE: 124.60921194767073
16/03/04 16:50:42 INFO RegressionMetrics: RMSE: 137.33314302068544
16/03/04 16:50:42 INFO RegressionMetrics: RMSLE: 0.5275727128626408
16/03/04 16:50:42 INFO RegressionMetrics: R^2: 0.8247581957573777
16/03/04 16:50:42 INFO RegressionMetrics: Corr. Coefficient: 0.9744133881055823
16/03/04 16:50:42 INFO RegressionMetrics: Model Yield: 0.7871288689840381
16/03/04 16:50:42 INFO RegressionMetrics: Std Dev of Residuals: 111.86852905896446

[Figures: reheat steam temperature predictions]


Source Code

Below is the example program as a GitHub gist; to view the original program in DynaML, click here.

Boston Housing Data: Gaussian Process Regression Models


Boston Housing Data

Boston: Representative Image

The Housing data set is a popular regression benchmark hosted on the UCI Machine Learning Repository. It contains 506 records consisting of multivariate data attributes for various real estate zones and their housing price indices. The task is to learn a regression model that can predict the price index or range. In this blog post, I use the DynaML machine learning library to train GP models.

The following meta-data is taken directly from the UCI repository, with the final column indicating the property value.

Attribute Information:

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV: Median value of owner-occupied homes in $1000’s

Gaussian Process

Given the small data volume (~500 instances) relative to the dimensionality of the data (13 attributes), it makes sense to try smoothing or non-parametric models for the unknown price function. For a detailed introduction to Gaussian Processes, refer to the famous book by Rasmussen and Williams. For a succinct introduction, you can also refer to the DynaML wiki pages.

Modelling Experiments

In the examples folder of the DynaML repository, a program called TestGPHousing.scala can be used to test GP models with various kernels; a typical call to TestGPHousing looks like this:

TestGPHousing(kernel = new ..., noise = new ..., grid = 10,
step = 0.03, globalOpt = "GS", trainFraction = 0.45)

Kernels

FBM kernel with Gaussian Covariate noise

DynaML>TestGPHousing(kernel = new FBMKernel(0.55),
noise = new SEKernel(1.5, 1.5), grid = 5,
step = 0.03, globalOpt = "GS", trainFraction = 0.45)
16/03/03 20:17:42 INFO GridSearch: Optimum value of energy is: 246.38482492249904
Configuration: Map(hurst -> 0.52, bandwidth -> 1.35, amplitude -> 1.35)
16/03/03 20:17:42 INFO SVMKernel$: Constructing kernel matrix.
16/03/03 20:17:42 INFO SVMKernel$: Dimension: 227 x 227
16/03/03 20:17:43 INFO GPRegression: Generating error bars
16/03/03 20:17:43 INFO RegressionMetrics: Regression Model Performance: MEDV
16/03/03 20:17:43 INFO RegressionMetrics: ============================
16/03/03 20:17:43 INFO RegressionMetrics: MAE: 5.804371810611489
16/03/03 20:17:43 INFO RegressionMetrics: RMSE: 7.676433880135313
16/03/03 20:17:43 INFO RegressionMetrics: RMSLE: 0.4108750385573816
16/03/03 20:17:43 INFO RegressionMetrics: R^2: 0.3713246243782846
16/03/03 20:17:43 INFO RegressionMetrics: Corr. Coefficient: 0.7700074003860581
16/03/03 20:17:43 INFO RegressionMetrics: Model Yield: 0.7243148481557278
16/03/03 20:17:43 INFO RegressionMetrics: Std Dev of Residuals: 6.289145946687416


[Figure: FBM kernel predictions, MEDV]



Composite FBM + Laplacian Kernel with Uncorrelated Gaussian Noise
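
Since the sum of two positive semi-definite kernels is itself a valid kernel, kernels can be composed additively, as the + operator in the call below does. The Laplacian kernel, assuming the common parameterization, has the form

$$K_{lap}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_1}{\beta}\right)$$

with $\beta$ appearing as the beta hyper-parameter in the grid search output.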

DynaML>TestGPHousing(kernel = new FBMKernel(0.55) +
new LaplacianKernel(2.5), noise = new RBFKernel(1.5),
grid = 5, step = 0.03, globalOpt = "GS", trainFraction = 0.45)
16/03/03 20:45:41 INFO GridSearch: Optimum value of energy is: 278.1603309851301
Configuration: Map(hurst -> 0.4, beta -> 2.35, bandwidth -> 1.35)
16/03/03 20:45:41 INFO SVMKernel$: Constructing kernel matrix.
16/03/03 20:45:42 INFO GPRegression: Generating error bars
16/03/03 20:45:42 INFO RegressionMetrics: Regression Model Performance: MEDV
16/03/03 20:45:42 INFO RegressionMetrics: ============================
16/03/03 20:45:42 INFO RegressionMetrics: MAE: 5.800070254265218
16/03/03 20:45:42 INFO RegressionMetrics: RMSE: 7.739266267762397
16/03/03 20:45:42 INFO RegressionMetrics: RMSLE: 0.4150438478412412
16/03/03 20:45:42 INFO RegressionMetrics: R^2: 0.3609909626630624
16/03/03 20:45:42 INFO RegressionMetrics: Corr. Coefficient: 0.7633838930006132
16/03/03 20:45:42 INFO RegressionMetrics: Model Yield: 0.7341944950376289
16/03/03 20:45:42 INFO RegressionMetrics: Std Dev of Residuals: 6.287519509352036


[Figure: FBM + Laplacian kernel predictions, MEDV]


Source Code

Below is the example program as a GitHub gist; to view the original program in DynaML, click here.

Master’s Thesis: ESAT, KU Leuven

Fixed Size Least Squares Support Vector Machines: A Scala based programming framework for Large Scale Classification

Abstract

We propose FS-Scala, a flexible and modular Scala based implementation of the Fixed Size Least Squares Support Vector Machine (FS-LSSVM) for large data sets. The framework consists of a set of modules for (gradient and gradient free) optimization, model representation, kernel functions and evaluation of FS-LSSVM models.

A kernel based Fixed-Size Least Squares Support Vector Machine (FS-LSSVM) model is implemented in the proposed framework, while heavily employing distributed MapReduce via the parallel computing capabilities of Apache Spark. Global optimization routines like Coupled Simulated Annealing (CSA) and Grid Search are implemented and used to tune the hyper-parameters of the FS-LSSVM model.

Finally, we carry out experiments on benchmark data sets like Forest Cover Type, Magic Gamma and Adult, recording the performance and tuning time of various kernel based FS-LSSVM models.

FS-LSSVM: Formulation

The fixed-size formulation works in the primal, replacing the exact feature map by a finite-dimensional approximation $\hat{\phi}(\cdot)$ obtained via the Nyström method, and solving a ridge-regression style problem:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\mathbf{w}^\top \mathbf{w} + \frac{\gamma}{2} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \hat{\phi}(\mathbf{x}_i) - b\right)^2$$

The solution of which is given by:

$$\begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} = \left(\hat{\Phi}_e^\top \hat{\Phi}_e + \frac{1}{\gamma} I\right)^{-1} \hat{\Phi}_e^\top \mathbf{y}$$

where $\hat{\Phi}_e$ is the matrix whose $i$-th row is $\left[\hat{\phi}(\mathbf{x}_i)^\top \;\; 1\right]$.

Citation

@mastersthesis{chandorkar2015,
    author = {Chandorkar, M. H.},
    title = {Fixed Size Least Squares Support Vector Machines:
	A Scala based programming framework for Large Scale Classification},
    school = {Katholieke Universiteit Leuven},
    year = {2015}
}

FS-Scala can be found here.