Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:
- Fit a general linear model to some data
- Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
- Apply this formula to new data and see how well the predictor does
It turns out, it’s not that simple to actually implement. Fitting a general linear model in R produces coefficients in a vector.
model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset) coef(model)
(Intercept) param1 param2 0.435395087 -0.093295388 0.148154339 param3 param2:param3 0.024399530 0.021100300
There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:
plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")
I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.
coefapply <- function(coefficients, row) { result <- 0 for (i in 1:length(coefficients)) { subresult <- as.numeric(coefficients[i]) if (!is.na(subresult)) { name <- names(coefficients[i]) if (name != "(Intercept)") { subnames <- strsplit(name, ":", fixed=TRUE)[[1]] for (n in subnames) { subresult <- subresult * as.numeric(row[n]) } } result <- result + subresult } } return(result) } calculate_scores <- function(data, coefficients) { scores <- vector(mode="numeric", length=nrow(data)) for (i in 1:nrow(data)) { row <- data[i,] scores[i] <- coefapply(coefficients, row) } return(scores) }
Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:
model_coef <- coef(model) # Test if our scores are the same values as the model's fitted values training_scores <- calculate_scores(model_coef, trainingset) sum((training_scores - model$fitted.values) < 0.000000000001) / length(scores) # Calculate scores for our test set and plot them test_scores <- calculate_scores(model_coef, testset) plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")
It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.