<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>get started | torch for R</title>
    <link>/start/</link>
      <atom:link href="/start/index.xml" rel="self" type="application/rss+xml" />
    <description>get started</description>
    <generator>Hugo -- gohugo.io</generator><language>en-us</language>
    <item>
      <title>Guess the correlation</title>
      <link>/start/guess_the_correlation/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/start/guess_the_correlation/</guid>
      <description>

&lt;h1 id=&#34;get-the-packages&#34;&gt;Get the packages&lt;/h1&gt;

&lt;p&gt;To use &lt;code&gt;torch&lt;/code&gt;, you first need to install it. The same goes for its high-level wrapper, &lt;a href=&#34;https://mlverse.github.io/luz/index.html&#34; target=&#34;_blank&#34;&gt;luz&lt;/a&gt;. While &lt;code&gt;torch&lt;/code&gt; provides all the basic functionality, &lt;code&gt;luz&lt;/code&gt; adds a declarative, concise API that lets you train a network in a few lines of code.&lt;/p&gt;

&lt;p&gt;To get the respective CRAN versions, do&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;install.packages(&amp;quot;torch&amp;quot;)
install.packages(&amp;quot;luz&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Does it work? Here&amp;rsquo;s a quick test:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)
library(luz)
torch_tensor(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
 1
[ CPUFloatType{1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, while &lt;code&gt;torch&lt;/code&gt; contains all the core functionality, and &lt;code&gt;luz&lt;/code&gt;, the training logic, there is a whole ecosystem built around them.&lt;/p&gt;

&lt;p&gt;Notably, &lt;code&gt;torchvision&lt;/code&gt; is essential for image-processing tasks. In this example, we don&amp;rsquo;t use it much overtly; it plays a more prominent role behind the scenes. Let&amp;rsquo;s get it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;install.packages(&amp;quot;torchvision&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torchvision)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, we install &lt;code&gt;torchdatasets&lt;/code&gt;, which wraps datasets in a convenient format, rendering them immediately usable from &lt;code&gt;torch&lt;/code&gt;. We&amp;rsquo;re going to use one of the datasets it provides.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;install.packages(&amp;quot;torchdatasets&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torchdatasets)
&lt;/code&gt;&lt;/pre&gt;

&lt;h1 id=&#34;get-the-dataset&#34;&gt;Get the dataset&lt;/h1&gt;

&lt;p&gt;&amp;ldquo;Guess the correlation&amp;rdquo; is a fun dataset that tasks one &amp;ndash; a person, if they feel like it, or a program, if we train it &amp;ndash; with estimating the (linear) correlation between two variables displayed in a scatterplot.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;torchdatasets&lt;/code&gt; will download, unpack, and preprocess it for us.&lt;/p&gt;

&lt;p&gt;The training set is huge: it has 150,000 observations. For instruction purposes, we don&amp;rsquo;t really need that much data &amp;ndash; we&amp;rsquo;ll restrict ourselves to small subsets for each of the training, validation, and test sets.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_indices &amp;lt;- 1:10000
val_indices &amp;lt;- 10001:15000
test_indices &amp;lt;- 15001:20000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The following snippet does three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;download and unpack the dataset,&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;do some custom preprocessing on the images (on top of what is already done by default) &amp;ndash; more on that soon&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;take just the first 10000 observations and put them in a &lt;code&gt;torch&lt;/code&gt; &lt;code&gt;Dataset&lt;/code&gt; object named &lt;code&gt;train_ds&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;add_channel_dim &amp;lt;- function(img) img$unsqueeze(1)
crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)

root &amp;lt;- file.path(tempdir(), &amp;quot;correlation&amp;quot;)

train_ds &amp;lt;- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    # don&#39;t take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
  )
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While we&amp;rsquo;re at it, let&amp;rsquo;s do the same for the validation and test sets. We don&amp;rsquo;t need to download again, as we&amp;rsquo;re building on the same underlying data; we just pick different observations.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;valid_ds &amp;lt;- guess_the_correlation_dataset(
    root = root,
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    indexes = val_indices,
    download = FALSE
  )

test_ds &amp;lt;- guess_the_correlation_dataset(
    root = root,
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    indexes = test_indices,
    download = FALSE
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s double-check that we got what we wanted. How many items are there in each set?&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;length(train_ds)
length(valid_ds)
length(test_ds)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 10000
[1] 5000
[1] 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And what does a single observation look like? Here is the first one:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_ds[1]
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;$x
torch_tensor
(1,.,.) = 
 Columns 1 to 9  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{1,130,130} ]

$y
torch_tensor
-0.45781
[ CPUFloatType{} ]

$id
[1] &amp;quot;arjskzyc&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;rsquo;s a list of three items, the last of which we&amp;rsquo;re not interested in for our purposes.&lt;/p&gt;

&lt;p&gt;The second, a scalar tensor, is the correlation value &amp;ndash; the thing we want the network to learn. The first, &lt;code&gt;x&lt;/code&gt;, is the scatterplot: a tensor representing a 130 x 130 image. But wait &amp;ndash; what is that &lt;code&gt;1&lt;/code&gt; in the shape output?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[ CPUFloatType{1,130,130} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This really is a three-dimensional tensor! The first dimension holds different &lt;em&gt;channels&lt;/em&gt; &amp;ndash; or the single channel, if the image has but one. In fact, the reason &lt;code&gt;x&lt;/code&gt; came in this format is that we requested it, here:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;add_channel_dim &amp;lt;- function(img) img$unsqueeze(1)

train_ds &amp;lt;- guess_the_correlation_dataset(
    # ...
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    # ...
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;add_channel_dim()&lt;/code&gt; was passed in as a custom transformation, to be applied to every item of the dataset. It calls one of &lt;code&gt;torch&lt;/code&gt;&amp;rsquo;s many tensor operations, &lt;code&gt;unsqueeze()&lt;/code&gt;, which adds a singleton dimension at a requested position.&lt;/p&gt;
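
&lt;p&gt;To make this concrete, here is a minimal standalone illustration (not part of the pipeline) &amp;ndash; we create a 2-d tensor and add a channels dimension in front:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;t &amp;lt;- torch_rand(130, 130) # a 2-d tensor, like a single grayscale image
dim(t)
dim(t$unsqueeze(1))       # singleton dimension added in position 1
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 130 130
[1]   1 130 130
&lt;/code&gt;&lt;/pre&gt;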

&lt;p&gt;How about the second custom transformation?&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, we crop the image, cutting off the axes and labels on the left and bottom. These image regions don&amp;rsquo;t contribute any distinctive information, and having the images be smaller saves memory.&lt;/p&gt;

&lt;h1 id=&#34;work-with-batches&#34;&gt;Work with batches&lt;/h1&gt;

&lt;p&gt;Now, we&amp;rsquo;ve done so much work already, but you haven&amp;rsquo;t actually &lt;em&gt;seen&lt;/em&gt; any of the scatterplots yet! The reason we&amp;rsquo;ve been waiting until now is that we want to show a bunch of them at a time, and for that, we need to know how to handle &lt;em&gt;batches&lt;/em&gt; of data.&lt;/p&gt;

&lt;p&gt;So let&amp;rsquo;s create a &lt;code&gt;DataLoader&lt;/code&gt; object from the training set. We&amp;rsquo;ll soon use it to train the model, but right now, we&amp;rsquo;ll just plot the first batch.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;DataLoader&lt;/code&gt; needs to know where to get the data &amp;ndash; namely, from the &lt;code&gt;Dataset&lt;/code&gt; it is passed &amp;ndash; as well as how many items should go into a batch. Optionally, it can return data in random order (&lt;code&gt;shuffle = TRUE&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_dl &amp;lt;- dataloader(train_ds, batch_size = 64, shuffle = TRUE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Like a &lt;code&gt;Dataset&lt;/code&gt;, we can query a &lt;code&gt;DataLoader&lt;/code&gt; for its length. For the &lt;code&gt;Dataset&lt;/code&gt;, this meant the number of items; for a &lt;code&gt;DataLoader&lt;/code&gt;, it means the number of batches:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;length(train_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 157
&lt;/code&gt;&lt;/pre&gt;
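
&lt;p&gt;That number follows directly from the sizes involved: 10,000 observations at a batch size of 64 yield 156 full batches plus one smaller one, so the count is rounded up:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;ceiling(length(train_ds) / 64)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 157
&lt;/code&gt;&lt;/pre&gt;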

&lt;p&gt;To access the first batch, we create an iterator from the &lt;code&gt;DataLoader&lt;/code&gt; and ask it for the first batch. Even if we weren&amp;rsquo;t going to plot, this is a useful check that the dimensions look ok:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;batch &amp;lt;- dataloader_make_iter(train_dl) %&amp;gt;% dataloader_next()

dim(batch$x)
dim(batch$y)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1]  64   1 130 130
[1] 64
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And plot! Note how we first remove the &lt;em&gt;channels&lt;/em&gt; dimension &amp;ndash; &lt;code&gt;as.raster()&lt;/code&gt; wouldn&amp;rsquo;t like it &amp;ndash; and then convert the tensor to R for further processing:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;par(mfrow = c(8,8), mar = rep(0, 4))

images &amp;lt;- as.array(batch$x$squeeze(2))

images %&amp;gt;%
  purrr::array_tree(1) %&amp;gt;%
  purrr::map(as.raster) %&amp;gt;%
  purrr::iwalk(~{plot(.x)})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;correlations.png&#34; width=&#34;80%&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Want to try your skill at guessing these? Here is the corresponding ground truth:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;batch$y %&amp;gt;% as.numeric() %&amp;gt;% round(digits = 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] -0.29  0.58 -0.57  0.56  0.10  0.09  0.21  0.45 -0.24  0.65  0.70  0.40  0.71  0.20  0.07  0.66  0.65 -0.56  0.73
[20] -0.40 -0.18 -0.42 -0.46 -0.45 -0.77  0.09 -0.19  0.40 -0.70 -0.04 -0.16 -0.13 -0.18  0.01  0.25  0.54  0.21  0.28
[39]  0.49  0.86 -0.70  0.51  0.47 -0.46  0.88  0.00  0.24  0.28  0.28 -0.04 -0.74  0.43  0.74  0.01 -0.21  0.66 -0.45
[58] -0.44  0.50 -0.69 -0.65 -0.66 -0.55 -0.53
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, just as they got their own &lt;code&gt;Dataset&lt;/code&gt; objects, test and validation data each need their own &lt;code&gt;DataLoader&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;valid_dl &amp;lt;- dataloader(valid_ds, batch_size = 64)
length(valid_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 79
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;test_dl &amp;lt;- dataloader(test_ds, batch_size = 64)
length(test_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 79
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And we&amp;rsquo;re ready to create the model!&lt;/p&gt;

&lt;h1 id=&#34;create-the-model&#34;&gt;Create the model&lt;/h1&gt;

&lt;p&gt;Let&amp;rsquo;s first see what we&amp;rsquo;re trying to accomplish. Our input data are images; normally this means we&amp;rsquo;ll work with some kind of convolutional neural network (CNN). In &lt;code&gt;torch&lt;/code&gt;, a neural network is a &lt;code&gt;module&lt;/code&gt;: a container for more granular &lt;code&gt;modules&lt;/code&gt;, which themselves may be built up of yet more fine-grained &lt;code&gt;modules&lt;/code&gt;. While in theory, this kind of compositionality is unlimited, in our example there are just two levels: a top-level &lt;code&gt;module&lt;/code&gt; representing the &lt;em&gt;model&lt;/em&gt;, and &lt;em&gt;submodules&lt;/em&gt; that, in other frameworks, would be called &lt;em&gt;layers&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The overall model is created by a call to &lt;code&gt;nn_module()&lt;/code&gt;. This instantiates an &lt;code&gt;nn_Module&lt;/code&gt;, an R6 class that knows how to act as a neural network. This object can have any number of methods, but two are essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;initialize()&lt;/code&gt;, the place to instantiate any &lt;em&gt;submodules&lt;/em&gt;; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;forward()&lt;/code&gt;, the place to define what should happen when this module is called.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;code&gt;initialize()&lt;/code&gt;, we instantiate five submodules &amp;ndash; three convolutional layers and two linear ones:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# zooming in on just initialize() - don&#39;t run standalone

net &amp;lt;- nn_module(
  
  # ...
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
 # ...
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The convolutional (often abbreviated &lt;em&gt;conv&lt;/em&gt;) layers each apply a filter (or: kernel) of size 3 x 3. This filter slides over the image and computes local aggregates. And there is not just a single filter per layer; there are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;32 of them in the first conv layer,&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;64 in the second, and&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;128 in the third.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The filters are trained to pick up informative spatial features, features that will be able to tell us something about the image.&lt;/p&gt;

&lt;p&gt;In addition to the three conv layers, we have two linear ones. These are the prototypical neural network layers: they get input from all units in the previous layer, combine the individual contributions as they see fit, and send their own results on to all units in the next layer. The first linear layer acts on the features received from the last conv layer; it consists of 128 units. The second is the output layer: it emits a single numeric value, representing the guess our network makes about the size of the correlation.&lt;/p&gt;
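
&lt;p&gt;Where does the &lt;code&gt;14 * 14 * 128&lt;/code&gt; in &lt;code&gt;fc1&lt;/code&gt; come from? As we&amp;rsquo;ll see in &lt;code&gt;forward()&lt;/code&gt;, each conv layer (kernel size 3, no padding) shrinks each spatial dimension by two, and each pooling step halves it, rounding down. A quick back-of-the-envelope trace in plain R (a standalone illustration, not part of the model):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;size &amp;lt;- 130
for (i in 1:3) {
  size &amp;lt;- size - 2        # conv with kernel_size = 3, no padding
  size &amp;lt;- floor(size / 2) # nnf_avg_pool2d(2)
}
size
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 14
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With 128 channels, that leaves 14 * 14 * 128 = 25088 input features for the first linear layer.&lt;/p&gt;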

&lt;p&gt;Now, while &lt;code&gt;initialize()&lt;/code&gt; defines the layers, &lt;code&gt;forward()&lt;/code&gt; specifies the order in which to call them &amp;ndash; and what to do &amp;ldquo;in between&amp;rdquo;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# zooming in on just forward() - don&#39;t run standalone

net &amp;lt;- nn_module(
  
 # ...
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What are these things that happen &amp;ldquo;in between&amp;rdquo;? In fact, they are of different types.&lt;/p&gt;

&lt;p&gt;Firstly, we have &lt;code&gt;nnf_relu()&lt;/code&gt;, called four times: after each of the conv layers and after the first linear layer. This is a so-called activation function &amp;ndash; a function that takes the raw results computed by a layer and performs some operation on them. In the case of &lt;code&gt;nnf_relu()&lt;/code&gt; (ReLU &amp;ndash; Rectified Linear Unit), it leaves positive values alone while setting negative ones to 0. You&amp;rsquo;ll encounter additional activation functions as you continue your &lt;code&gt;torch&lt;/code&gt; journey, but ReLU is among the most widely used ones today.&lt;/p&gt;
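
&lt;p&gt;A one-liner makes the behavior concrete:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;nnf_relu(torch_tensor(c(-2, 0, 3)))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
 0
 0
 3
[ CPUFloatType{3} ]
&lt;/code&gt;&lt;/pre&gt;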

&lt;p&gt;Secondly, we have &lt;code&gt;nnf_avg_pool2d(2)&lt;/code&gt;, called after each conv layer. This function downsizes the image, replacing each 2 x 2 patch of pixels by its average. So while we&amp;rsquo;re going &lt;em&gt;up&lt;/em&gt; in the number of channels (from 1 via 32 and 64 to 128), we &lt;em&gt;decrease&lt;/em&gt; spatial resolution.&lt;/p&gt;

&lt;p&gt;Thirdly, there is &lt;code&gt;torch_flatten()&lt;/code&gt;. This one doesn&amp;rsquo;t compute anything &amp;ndash; it just reshapes its input, going &amp;ndash; in this case &amp;ndash; from the four-dimensional structure delivered by the third conv layer (after pooling) to the two-dimensional one expected by the first linear layer.&lt;/p&gt;
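
&lt;p&gt;To see just the reshaping step in isolation: after the last pooling operation, a batch has shape 64 x 128 x 14 x 14, and &lt;code&gt;torch_flatten(start_dim = 2)&lt;/code&gt; merges everything from the second dimension onward (a standalone illustration on random data):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;x &amp;lt;- torch_randn(64, 128, 14, 14)
dim(torch_flatten(x, start_dim = 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1]    64 25088
&lt;/code&gt;&lt;/pre&gt;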

&lt;p&gt;Now, here is the complete model creation code:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_manual_seed(777)

net &amp;lt;- nn_module(
  
  &amp;quot;corr-cnn&amp;quot;,
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even before training, we can call the model on a batch of data &amp;ndash; this immediately tells us if we got all shapes matching up:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;model &amp;lt;- net()
model(batch$x)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
0.01 *
-2.8979
 -2.8873
 -2.8699
 -2.9787
 -2.8223
 -3.0255
 -3.1181
 -3.0603
 -3.0520
 -2.8242
 -3.0000
 -2.9150
 -2.9497
 -2.7662
 -2.7980
 -2.9540
 -2.8548
 -2.7927
 -3.0426
 -2.9540
 -2.8846
 -2.8008
 -2.8966
 -2.8358
 -2.9266
 -2.9022
 -2.8667
 -2.8716
 -2.7371
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{64,1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After all that hard work, training the model with &lt;code&gt;luz&lt;/code&gt; is a breeze.&lt;/p&gt;

&lt;h1 id=&#34;train-the-network&#34;&gt;Train the network&lt;/h1&gt;

&lt;p&gt;What happens when you train a neural network? &lt;em&gt;Conceptually&lt;/em&gt;, the following has to happen for every batch. (Wait &amp;ndash; don&amp;rsquo;t execute these lines :-) You&amp;rsquo;ll see &lt;code&gt;luz&lt;/code&gt; taking care of it for you in a minute.)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Run the model on the input, to obtain its current predictions:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;output &amp;lt;- model(b$x)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Calculate the &lt;em&gt;loss&lt;/em&gt;, a measure of divergence between model estimate and ground truth:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;loss &amp;lt;- nnf_mse_loss(output, b$y$unsqueeze(2))
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Have that loss &lt;em&gt;propagate back&lt;/em&gt; through the network, causing gradients to be computed for all parameters:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;loss$backward()
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Ask the optimizer to update the parameters accordingly:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;optimizer$step()
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;
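
&lt;p&gt;Assembled into a complete loop &amp;ndash; with the two pieces not shown above, creating the optimizer and zeroing out gradients left over from the previous batch &amp;ndash; this would look something like the following. (Again, don&amp;rsquo;t run this; it&amp;rsquo;s a sketch of what &lt;code&gt;luz&lt;/code&gt; will do for us.)&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;model &amp;lt;- net()
optimizer &amp;lt;- optim_adam(model$parameters)

for (epoch in 1:10) {
  coro::loop(for (b in train_dl) {
    optimizer$zero_grad()                          # reset gradients
    output &amp;lt;- model(b$x)                           # 1. forward pass
    loss &amp;lt;- nnf_mse_loss(output, b$y$unsqueeze(2)) # 2. compute loss
    loss$backward()                                # 3. backpropagate
    optimizer$step()                               # 4. update parameters
  })
}
&lt;/code&gt;&lt;/pre&gt;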

&lt;p&gt;Fortunately, with &lt;code&gt;luz&lt;/code&gt;, we don&amp;rsquo;t have to write the training loop ourselves! All this is taken care of by a pair of functions: &lt;code&gt;setup()&lt;/code&gt; and &lt;code&gt;fit()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;setup()&lt;/code&gt;, we decide which loss function and which optimization algorithm to use. For regression problems, the most popular loss is mean squared error: &lt;code&gt;nn_mse_loss()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Among optimization algorithms (&amp;ldquo;optimizers&amp;rdquo;), one of the most popular is Adam (&lt;code&gt;optim_adam()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setup()&lt;/code&gt; is called on the model definition, like so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then &lt;code&gt;fit()&lt;/code&gt; is used to pass the training data loader, the number of epochs to train for, and optionally, the validation data loader. After every epoch, the model is run on the validation data, in &amp;ldquo;test mode&amp;rdquo; (no parameter updates involved). That way, you immediately see whether you&amp;rsquo;re overfitting to the training set. Here are both calls together &amp;ndash; everything we need to start training:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  ) %&amp;gt;%
  fit(train_dl, epochs = 10, valid_data = valid_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, the network has made good progress &amp;ndash; on both the training and validation sets. How about the test set? And how good a fit are the inferred correlations?&lt;/p&gt;

&lt;h1 id=&#34;evaluate-performance&#34;&gt;Evaluate performance&lt;/h1&gt;

&lt;p&gt;We use &lt;code&gt;luz::predict()&lt;/code&gt; to get predictions on the test set:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;preds &amp;lt;- predict(fitted, test_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;How do predictions and ground truth line up? Well, since all this has been about scatterplots, why not create one to investigate that?&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;preds &amp;lt;- preds$to(device = &amp;quot;cpu&amp;quot;)$squeeze() %&amp;gt;% as.numeric()
test_dl &amp;lt;- dataloader(test_ds, batch_size = 5000)
targets &amp;lt;- (test_dl %&amp;gt;% dataloader_make_iter() %&amp;gt;% dataloader_next())$y %&amp;gt;% as.numeric()

df &amp;lt;- data.frame(preds = preds, targets = targets)

library(ggplot2)

ggplot(df, aes(x = targets, y = preds)) +
  geom_point(size = 0.1) +
  theme_classic() +
  xlab(&amp;quot;true correlations&amp;quot;) +
  ylab(&amp;quot;model predictions&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;scatter.png&#34; width=&#34;80%&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Want to guess the correlation &amp;hellip;?&lt;/p&gt;

&lt;p&gt;So that&amp;rsquo;s it &amp;ndash; you&amp;rsquo;ve seen the complete workflow end-to-end, from data loading to model evaluation. The next tutorial asks a few &lt;em&gt;what if?&lt;/em&gt; questions &amp;ndash; e.g., what if I don&amp;rsquo;t want to predict a numerical output? &amp;ndash; and offers some ideas for experimentation.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>What if? Experiments and adaptations</title>
      <link>/start/what_if/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/start/what_if/</guid>
      <description>

&lt;p&gt;What is it that we&amp;rsquo;ve done in the previous tutorial? Put abstractly, we&amp;rsquo;ve trained a network to &lt;em&gt;take in images&lt;/em&gt; and &lt;em&gt;output&lt;/em&gt; a &lt;em&gt;continuous&lt;/em&gt; numerical value.&lt;/p&gt;

&lt;p&gt;In the process, we&amp;rsquo;ve made decisions all the time &amp;ndash; what, and how many, layers to use; how to calculate the loss; what optimization algorithm to apply; how long to train; and more. We can&amp;rsquo;t go into all of them here, and we can&amp;rsquo;t go into great detail. But the good thing is: with deep learning, you can always experiment and find out. (In fact, more often than not, experimenting is the only way to find out!)&lt;/p&gt;

&lt;p&gt;So this page is basically an invitation to try out things for yourself.&lt;/p&gt;

&lt;h1 id=&#34;what-if-we-were-working-with-a-different-kind-of-data-not-images&#34;&gt;What if &amp;hellip; we were working with a different kind of data &amp;ndash; not images?&lt;/h1&gt;

&lt;p&gt;With deep learning, the type of input data decides the type of architecture we use &amp;ndash; or architectures. (Quick note: by architecture, I mean something more like a family than a specific model. For example, convolutional neural networks (CNNs) would be one; or Long Short-Term Memory (LSTM) networks; or Transformers.)&lt;/p&gt;

&lt;p&gt;Sometimes there are several established architectures for a problem; sometimes there&amp;rsquo;s one most prominent family. Even in the latter case though, there is no rule you &lt;em&gt;have&lt;/em&gt; to use it.&lt;/p&gt;

&lt;p&gt;For example, take our scatterplot images. The canonical architecture in image recognition is the CNN. &lt;em&gt;But&lt;/em&gt; you could still work on image data using nothing but linear layers. Depending on the task, this may or may not work so well.&lt;/p&gt;

&lt;p&gt;So why not give it a try? There are three places you have to modify: the dataset, the model, and the line that calculates the loss.&lt;/p&gt;

&lt;p&gt;The model&amp;rsquo;s first linear layer is going to deal with the image input. Being a linear layer, it wants to be presented with a flat vector of numbers. So where the previous dataset took two-dimensional inputs and added a &lt;code&gt;channels&lt;/code&gt; dimension, the new one, on the contrary, flattens the 2-d matrix into a 1-d vector:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)

root &amp;lt;- file.path(tempdir(), &amp;quot;correlation&amp;quot;)

# change valid_ds and test_ds analogously
train_ds &amp;lt;- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %&amp;gt;% torch_flatten(),
    # don&#39;t take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The model now consists of all linear layers:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_manual_seed(777)

net &amp;lt;- nn_module(
  
  &amp;quot;corr-mlp&amp;quot;,
  
  initialize = function() {
    
    self$fc1 &amp;lt;- nn_linear(in_features = 130 * 130, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 256)
    self$fc3 &amp;lt;- nn_linear(in_features = 256, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc3() 
  }
)

model &amp;lt;- net()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Compared to the convnet, how well does this work? You will find it performs a lot worse. In a way, this is no surprise &amp;ndash; it&amp;rsquo;s not for nothing that we use convolutional architectures with images. However, the extent to which a convnet outperforms a linear model is still input- and task-dependent. Were you to run an analogous comparison for MNIST digit classification (the &lt;code&gt;mnist_dataset()&lt;/code&gt; that comes with &lt;code&gt;torch&lt;/code&gt;) you&amp;rsquo;d find that a linear model is able to achieve sensible results.&lt;/p&gt;

&lt;h1 id=&#34;what-if-we-wanted-to-classify-the-images-not-predict-a-continuous-target&#34;&gt;What if &amp;hellip; we wanted to &lt;em&gt;classify&lt;/em&gt; the images, not predict a continuous target?&lt;/h1&gt;

&lt;p&gt;Assume we had the same input data as before, but now we just care if there&amp;rsquo;s a substantial correlation or not. Let&amp;rsquo;s say we&amp;rsquo;re interested in whether its magnitude is below or above 0.5.&lt;/p&gt;

&lt;p&gt;This time, we only have to make modifications in two places: the dataset and the loss. The dataset now binarizes the target according to our new requirements, passing in a &lt;code&gt;target_transform&lt;/code&gt; in addition to the &lt;code&gt;transform&lt;/code&gt; destined for the image:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;add_channel_dim &amp;lt;- function(img) img$unsqueeze(1)
crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)
binarize &amp;lt;- function(tensor) torch_round(torch_abs(tensor))

root &amp;lt;- file.path(tempfile(), &amp;quot;correlation&amp;quot;)

# same for validation set and test set
train_ds &amp;lt;- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    # binarize target data
    target_transform = binarize,
    # don&#39;t take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now that we want the network to output a &lt;code&gt;0&lt;/code&gt; or a &lt;code&gt;1&lt;/code&gt; instead of a continuous value, we need to use a different loss function. &lt;code&gt;nnf_binary_cross_entropy_with_logits()&lt;/code&gt; takes the raw network output (the &amp;ldquo;logits&amp;rdquo;), applies the sigmoid internally in a numerically stable way, and computes cross entropy between the result and the targets. (If you&amp;rsquo;re thinking, &amp;ldquo;where is the sigmoid, shouldn&amp;rsquo;t we have had the network apply a sigmoid activation in the end?&amp;rdquo; &amp;ndash; it&amp;rsquo;s not needed, precisely because the loss function applies it itself.)&lt;/p&gt;
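&lt;p&gt;To make that relationship concrete, here is a quick sketch: the two computations below agree up to numerical error, but the &lt;code&gt;_with_logits&lt;/code&gt; variant is the numerically stable one.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

logits &amp;lt;- torch_randn(4, 1)
targets &amp;lt;- torch_rand(4, 1)$round()

# sigmoid applied inside the loss ...
nnf_binary_cross_entropy_with_logits(logits, targets)

# ... is equivalent to applying it ourselves first
nnf_binary_cross_entropy(torch_sigmoid(logits), targets)
&lt;/code&gt;&lt;/pre&gt;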

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_binary_cross_entropy_with_logits(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And again, loss decreases. But now that we&amp;rsquo;re using cross entropy instead of mean squared error, it is a lot more difficult to get an impression of how well this really works! To find out, why don&amp;rsquo;t you check out predictions on the test set?&lt;/p&gt;
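&lt;p&gt;For instance, you could obtain class predictions like this (a sketch; it assumes a &lt;code&gt;test_dl&lt;/code&gt; dataloader built analogously to &lt;code&gt;train_dl&lt;/code&gt;, and the &lt;code&gt;fitted&lt;/code&gt; object returned by luz):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# raw logits for the test set
preds &amp;lt;- predict(fitted, test_dl)

# convert to probabilities, then threshold at 0.5
pred_class &amp;lt;- (torch_sigmoid(preds) &amp;gt; 0.5)$to(dtype = torch_int())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Comparing &lt;code&gt;pred_class&lt;/code&gt; against the binarized targets then gives you an accuracy that is much easier to interpret than the raw loss value.&lt;/p&gt;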

&lt;h1 id=&#34;what-if-we-made-changes-to-the-optimization-routine&#34;&gt;What if &amp;hellip; we made changes to the optimization routine?&lt;/h1&gt;

&lt;p&gt;Thankfully, &lt;code&gt;torch&lt;/code&gt; takes care of all gradient computations for us, and unless we&amp;rsquo;re implementing custom operations, we don&amp;rsquo;t normally need to think about this. However, the way these gradients are being made use of is something we can influence. Optimizers differ in how they compute weight updates, and choosing a different algorithm may make a significant difference.&lt;/p&gt;

&lt;p&gt;Truth be told, though, this is mostly a matter of experimentation. The &lt;a href=&#34;https://torch.mlverse.org/docs/reference/optim_adam.html&#34; target=&#34;_blank&#34;&gt;Adam&lt;/a&gt; algorithm used here is among the most-established ones; however you could try a few others for comparison: for example, &lt;a href=&#34;https://torch.mlverse.org/docs/reference/optim_sgd.html&#34; target=&#34;_blank&#34;&gt;SGD&lt;/a&gt; or &lt;a href=&#34;https://torch.mlverse.org/docs/reference/optim_rmsprop.html&#34; target=&#34;_blank&#34;&gt;RMSProp&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition to trying different optimizers, you can experiment with how they&amp;rsquo;re configured. Different optimizers have different tuning knobs, but most of them have one in common: the learning rate, a parameter indicating how big a step to take in optimization. Change the learning rate to a higher or lower value and find out how this affects optimization performance.&lt;/p&gt;
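&lt;p&gt;With luz, both experiments are a small change to the pipeline. A sketch, assuming the regression setup from above (the learning rate of &lt;code&gt;0.001&lt;/code&gt; is just an arbitrary starting point):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    # swap in a different optimization algorithm ...
    optimizer = optim_sgd
  ) %&amp;gt;%
  # ... and configure its learning rate
  set_opt_hparams(lr = 0.001) %&amp;gt;%
  fit(train_dl, epochs = 10, valid_data = valid_dl)
&lt;/code&gt;&lt;/pre&gt;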

&lt;p&gt;Speaking of learning rates, &lt;code&gt;torch&lt;/code&gt; has learning rate schedulers that allow you to change learning rates over time. For example, &lt;a href=&#34;https://torch.mlverse.org/docs/reference/lr_step.html&#34; target=&#34;_blank&#34;&gt;lr_step()&lt;/a&gt; allows you to shrink it, by some degree, every configurable number of steps. If you&amp;rsquo;re interested in pursuing this topic, a current best-practice approach to handling learning rates is illustrated &lt;a href=&#34;https://blogs.rstudio.com/ai/posts/2020-10-19-torch-image-classification/#training&#34; target=&#34;_blank&#34;&gt;in this post&lt;/a&gt;.&lt;/p&gt;
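&lt;p&gt;With luz, a scheduler is added as a callback. A sketch (the &lt;code&gt;step_size&lt;/code&gt; and &lt;code&gt;gamma&lt;/code&gt; values are arbitrary, and worth experimenting with):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  ) %&amp;gt;%
  fit(
    train_dl,
    epochs = 10,
    valid_data = valid_dl,
    callbacks = list(
      # multiply the learning rate by 0.9 after every epoch
      luz_callback_lr_scheduler(lr_step, step_size = 1, gamma = 0.9)
    )
  )
&lt;/code&gt;&lt;/pre&gt;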

&lt;h1 id=&#34;what-if-we-made-the-network-bigger-or-trained-it-for-a-longer-time&#34;&gt;What if &amp;hellip; we made the network bigger or trained it for a longer time?&lt;/h1&gt;

&lt;p&gt;If you make a network &amp;ldquo;bigger&amp;rdquo;, increasing the number of parameters (for a linear layer, via &lt;code&gt;out_features&lt;/code&gt;; for a convolutional one, via &lt;code&gt;out_channels&lt;/code&gt;), in theory it gets more powerful. Analogously, if you give it more time to train, it may arrive at better results. However, depending on the task, you may or may not see improvements &amp;ndash; again, the only way to know is to try.&lt;/p&gt;

&lt;p&gt;And there is something else to think about. If something you do improves performance on the training set, does it generalize to the test set? As in machine learning in general, in deep learning one needs to be wary of overfitting. But what are countermeasures you could take?&lt;/p&gt;

&lt;p&gt;Before thinking of anything technical, you&amp;rsquo;d always want to think through what you know about the data and the underlying context. Analytically, what could cause the training and the test data to come from different distributions? Is there a way to have these distributions become more similar?&lt;/p&gt;

&lt;p&gt;The next thing, then, is not quite technical either. If there&amp;rsquo;s no compelling reason to assume that the test data will be systematically different, it&amp;rsquo;s just: the more data the better. This is why in our example task, we don&amp;rsquo;t see much overfitting &amp;ndash; the dataset is gigantic (and we&amp;rsquo;ve been using but a tiny fraction!).&lt;/p&gt;

&lt;p&gt;If getting more data is not an option, we can add regularization. In deep learning, the most popular ways of doing this are &lt;em&gt;dropout&lt;/em&gt; and &lt;em&gt;batch normalization&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Dropout adds random noise by stochastically removing units during training, making the net more robust to presence/absence of individual features. In our example, you could add dropout as follows. (Here &lt;code&gt;p&lt;/code&gt; passed to &lt;code&gt;nnf_dropout()&lt;/code&gt; is the dropout probability. Not surprisingly, this, again, is a hyperparameter you&amp;rsquo;ll want to experiment with.)&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;net &amp;lt;- nn_module(
  
  &amp;quot;corr-cnn&amp;quot;,
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Batch normalization is less well understood theoretically, but can be extremely effective in some cases. Besides acting as a regularizer, it also stabilizes training and may allow for using higher learning rates.&lt;/p&gt;

&lt;p&gt;With batch normalization, our network could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;net &amp;lt;- nn_module(
  
  &amp;quot;corr-cnn&amp;quot;,
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$bn1 &amp;lt;- nn_batch_norm2d(num_features = 32)
    self$bn2 &amp;lt;- nn_batch_norm2d(num_features = 64)
    self$bn3 &amp;lt;- nn_batch_norm2d(num_features = 128)
    self$bn4 &amp;lt;- nn_batch_norm1d(num_features = 128)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn1() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn2() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn3() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn4() %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you&amp;rsquo;ve found that a regularizing measure works &amp;ndash; meaning, performance on the validation set is similar to that on the training set, or maybe even better &amp;ndash; you can go back and add more capacity to the network: add more layers, train for a longer time, etc. Maybe you&amp;rsquo;ll arrive at better performance overall!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Create your own Dataset</title>
      <link>/start/custom_dataset/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/start/custom_dataset/</guid>
      <description>

&lt;p&gt;Unless the data you&amp;rsquo;re working with comes with some package in the &lt;code&gt;torch&lt;/code&gt; ecosystem, you&amp;rsquo;ll need to wrap it in a &lt;code&gt;Dataset&lt;/code&gt;.&lt;/p&gt;

&lt;h1 id=&#34;torch-dataset-objects&#34;&gt;&lt;code&gt;torch&lt;/code&gt; &lt;code&gt;Dataset&lt;/code&gt; objects&lt;/h1&gt;

&lt;p&gt;A &lt;code&gt;Dataset&lt;/code&gt; is an R6 object that knows how to iterate over data. This is because it acts as a supplier to a &lt;code&gt;DataLoader&lt;/code&gt;, which will ask it to return some number of items.&lt;/p&gt;

&lt;p&gt;(How many? That depends on the batch size &amp;ndash; but batch sizes are handled by the &lt;code&gt;DataLoader&lt;/code&gt;, so the &lt;code&gt;Dataset&lt;/code&gt; needn&amp;rsquo;t be concerned about that. All it has to know is what to do when asked for, e.g., item no. 7.)&lt;/p&gt;

&lt;p&gt;While a &lt;code&gt;Dataset&lt;/code&gt; may have any number of methods &amp;ndash; each responsible for some aspect of pre-processing logic, for example &amp;ndash; just three methods are required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;initialize()&lt;/code&gt; , to pre-process and store the data;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;.getitem(i)&lt;/code&gt;, to pick the item at position &lt;code&gt;i&lt;/code&gt;, and&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;.length()&lt;/code&gt;, to indicate to the &lt;code&gt;DataLoader&lt;/code&gt; how many items it has.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
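&lt;p&gt;In skeleton form, that contract looks like this (a minimal sketch; the name &lt;code&gt;minimal_ds&lt;/code&gt; is made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

minimal_ds &amp;lt;- dataset(
  name = &amp;quot;minimal_ds&amp;quot;,
  # store the data
  initialize = function(x) {
    self$x &amp;lt;- torch_tensor(x)
  },
  # return item no. i
  .getitem = function(i) {
    self$x[i]
  },
  # report how many items there are
  .length = function() {
    self$x$size()[[1]]
  }
)

ds &amp;lt;- minimal_ds(c(1, 2, 3))
length(ds) # 3
ds[2]      # a one-element float tensor holding 2
&lt;/code&gt;&lt;/pre&gt;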

&lt;p&gt;Let&amp;rsquo;s see an example.&lt;/p&gt;

&lt;h1 id=&#34;penguins&#34;&gt;Penguins&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;penguins&lt;/code&gt; is a very nice dataset that lives in the &lt;code&gt;palmerpenguins&lt;/code&gt; package.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(dplyr)
library(palmerpenguins)

penguins %&amp;gt;% glimpse()
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;Rows: 344
Columns: 8
$ species           &amp;lt;fct&amp;gt; Adelie, Adelie, Adelie, Adelie, Adelie, Adelie…
$ island            &amp;lt;fct&amp;gt; Torgersen, Torgersen, Torgersen, Torgersen…
$ bill_length_mm    &amp;lt;dbl&amp;gt; 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2,…
$ bill_depth_mm     &amp;lt;dbl&amp;gt; 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6,…
$ flipper_length_mm &amp;lt;int&amp;gt; 181, 186, 195, NA, 193, 190, 181, 195, 193,…
$ body_mass_g       &amp;lt;int&amp;gt; 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675…
$ sex               &amp;lt;fct&amp;gt; male, female, female, NA, female, male, female…
$ year              &amp;lt;int&amp;gt; 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are three species, and we&amp;rsquo;ll infer them making use of all available information: &amp;ldquo;biometrics&amp;rdquo; like &lt;code&gt;bill_length_mm&lt;/code&gt;, geographic indicators like the &lt;code&gt;island&lt;/code&gt; the penguins inhabit, and more.&lt;/p&gt;

&lt;p&gt;Predictors are of two different types, categorical and continuous.&lt;/p&gt;

&lt;p&gt;Continuous features, of R type &lt;code&gt;double&lt;/code&gt;, may be fed to &lt;code&gt;torch&lt;/code&gt; without further ado. We just directly use them to initialize a &lt;code&gt;torch&lt;/code&gt; tensor, which will be of type &lt;code&gt;Float&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)
torch_tensor(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
 1
[ CPUFloatType{1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;rsquo;s different with categorical data though. Firstly, &lt;code&gt;torch&lt;/code&gt; needs all data to be in numerical form, so vectors of type &lt;code&gt;character&lt;/code&gt; need to become factors &amp;ndash; which can then be treated as numeric via level extraction. In the &lt;code&gt;penguins&lt;/code&gt; dataset, &lt;code&gt;island&lt;/code&gt;, &lt;code&gt;sex&lt;/code&gt; , as well as the target column, &lt;code&gt;species&lt;/code&gt;, are factors already. So can we just do an &lt;code&gt;as.numeric()&lt;/code&gt; and that&amp;rsquo;s it?&lt;/p&gt;

&lt;p&gt;Not quite: We also need to reflect on the semantic side of things.&lt;/p&gt;

&lt;h1 id=&#34;categorical-data-in-deep-learning&#34;&gt;Categorical data in deep learning&lt;/h1&gt;

&lt;p&gt;If we just replace islands &lt;em&gt;Biscoe&lt;/em&gt;, &lt;em&gt;Dream&lt;/em&gt;, and &lt;em&gt;Torgersen&lt;/em&gt; by numbers 1, 2, and 3, we present them to the network as interval data, which of course they&amp;rsquo;re not.&lt;/p&gt;

&lt;p&gt;We have two options: transform them to one-hot vectors, where e.g. &lt;em&gt;Biscoe&lt;/em&gt; would be &lt;code&gt;0,0,1&lt;/code&gt;, &lt;em&gt;Dream&lt;/em&gt; &lt;code&gt;0,1,0&lt;/code&gt;, and &lt;em&gt;Torgersen&lt;/em&gt;, &lt;code&gt;1,0,0&lt;/code&gt;, or leave them as they are, but have the network map each discrete value to a multidimensional, continuous representation. The latter is called embedding, and it often helps networks make sense of discrete data.&lt;/p&gt;
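&lt;p&gt;To get a feel for what an embedding layer does, here is a tiny sketch. The weights are randomly initialized (and learned during training), so the actual values will differ on every run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

# map each of 3 islands to a 2-dimensional continuous vector
emb &amp;lt;- nn_embedding(num_embeddings = 3, embedding_dim = 2)

islands &amp;lt;- torch_tensor(c(1, 2, 3), dtype = torch_long())
emb(islands) # a 3 x 2 tensor of learnable representations
&lt;/code&gt;&lt;/pre&gt;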

&lt;p&gt;Embedding modules expect their inputs to be of type &lt;code&gt;Long&lt;/code&gt;. A tensor created from an R value will have the correct type if we make sure it&amp;rsquo;s an &lt;code&gt;integer&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_tensor(as.integer(as.numeric(as.factor(&amp;quot;one&amp;quot;))))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_tensor
 1
[ CPULongType{1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, let&amp;rsquo;s create a dataset for penguins.&lt;/p&gt;

&lt;h1 id=&#34;a-dataset-for-penguins&#34;&gt;A dataset for penguins&lt;/h1&gt;

&lt;p&gt;In &lt;code&gt;initialize()&lt;/code&gt;, we convert the data as planned and store them for later delivery. Like the categorical input features, the target, &lt;code&gt;species&lt;/code&gt;, is discrete, and thus converted to &lt;code&gt;torch&lt;/code&gt; &lt;code&gt;Long&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;penguins_dataset &amp;lt;- dataset(
  
  name = &amp;quot;penguins_dataset&amp;quot;,
  
  initialize = function(df) {
    
    df &amp;lt;- na.omit(df) 
    
    # continuous input data (x_cont)   
    x_cont &amp;lt;- df[ , c(&amp;quot;bill_length_mm&amp;quot;, &amp;quot;bill_depth_mm&amp;quot;, &amp;quot;flipper_length_mm&amp;quot;, &amp;quot;body_mass_g&amp;quot;, &amp;quot;year&amp;quot;)] %&amp;gt;%
      as.matrix()
    self$x_cont &amp;lt;- torch_tensor(x_cont)
    
    # categorical input data (x_cat)
    x_cat &amp;lt;- df[ , c(&amp;quot;island&amp;quot;, &amp;quot;sex&amp;quot;)]
    x_cat$island &amp;lt;- as.integer(x_cat$island)
    x_cat$sex &amp;lt;- as.integer(x_cat$sex)
    self$x_cat &amp;lt;- as.matrix(x_cat) %&amp;gt;% torch_tensor()

    # target data (y)
    species &amp;lt;- as.integer(df$species)
    self$y &amp;lt;- torch_tensor(species)
    
  },
  
  .getitem = function(i) {
     list(x_cont = self$x_cont[i, ], x_cat = self$x_cat[i, ], y = self$y[i])
    
  },
  
  .length = function() {
    self$y$size()[[1]]
  }
 
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Unlike &lt;code&gt;initialize&lt;/code&gt;, &lt;code&gt;.getitem(i)&lt;/code&gt; and &lt;code&gt;.length()&lt;/code&gt; are just one-liners.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s see if this behaves like we want it to. We randomly split the data into training and validation sets and query their respective lengths:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_indices &amp;lt;- sample(1:nrow(penguins), 250)

train_ds &amp;lt;- penguins_dataset(penguins[train_indices, ])
valid_ds &amp;lt;- penguins_dataset(penguins[setdiff(1:nrow(penguins), train_indices), ])

length(train_ds)
length(valid_ds)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 242
[1] 91
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can index into &lt;code&gt;Dataset&lt;/code&gt;s in an R-like way:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_ds[1]
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;$x_cont
torch_tensor
   45.2000
   16.4000
  223.0000
 5950.0000
 2008.0000
[ CPUFloatType{5} ]

$x_cat
torch_tensor
 1
 2
[ CPULongType{2} ]

$y
torch_tensor
3
[ CPULongType{} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From here on, everything proceeds like in the first tutorial: We use the &lt;code&gt;Dataset&lt;/code&gt;s to instantiate &lt;code&gt;DataLoader&lt;/code&gt;s&amp;hellip;&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_dl &amp;lt;- train_ds %&amp;gt;% dataloader(batch_size = 16, shuffle = TRUE)

valid_dl &amp;lt;- valid_ds %&amp;gt;% dataloader(batch_size = 16, shuffle = FALSE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&amp;hellip; and then, create and train the network. The network will look pretty different now though: most notably, you&amp;rsquo;ll see embeddings at work.&lt;/p&gt;

&lt;h1 id=&#34;classifying-penguins-the-network&#34;&gt;Classifying penguins &amp;ndash; the network&lt;/h1&gt;

&lt;p&gt;We just heard that embedding layers work with a datatype that&amp;rsquo;s different from most other neural network layers. It is therefore convenient to have them work in a space of their own, that is, put them into a dedicated container.&lt;/p&gt;

&lt;p&gt;Here we define a specialized module that has one embedding layer for every categorical feature. It gets passed the cardinalities of the respective features, and creates an &lt;code&gt;nn_embedding()&lt;/code&gt; for each of them.&lt;/p&gt;

&lt;p&gt;When called, it iterates over its submodules, lets them do their work, and returns the concatenated output.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;embedding_module &amp;lt;- nn_module(
  
  initialize = function(cardinalities) {
    
    self$embeddings = nn_module_list(lapply(cardinalities, function(x) nn_embedding(num_embeddings = x, embedding_dim = ceiling(x/2))))
    
  },
  
  forward = function(x) {
    
    embedded &amp;lt;- vector(mode = &amp;quot;list&amp;quot;, length = length(self$embeddings))
    for (i in 1:length(self$embeddings)) {
      embedded[[i]] &amp;lt;- self$embeddings[[i]](x[ , i])
    }
    
    torch_cat(embedded, dim = 2)
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The top-level module has three submodules: said &lt;code&gt;embedding_module&lt;/code&gt; and two linear layers.&lt;/p&gt;

&lt;p&gt;The first linear layer takes the output from &lt;code&gt;embedding_module&lt;/code&gt; , computes an affine transformation as it sees fit, and passes its result to the output layer. &lt;code&gt;output&lt;/code&gt; then has three units, one for every possible target class.&lt;/p&gt;

&lt;p&gt;The activation function we apply to the raw aggregation, &lt;code&gt;nnf_log_softmax()&lt;/code&gt;, composes two operations: the &lt;code&gt;softmax&lt;/code&gt; normalization ubiquitous in deep learning, and taking the logarithm. That way, we end up with the format expected by &lt;code&gt;nnf_nll_loss()&lt;/code&gt;, the loss function that computes the negative log likelihood (NLL) loss between inputs and targets.&lt;/p&gt;
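&lt;p&gt;The composition can be verified directly. A sketch: up to numerical error, &lt;code&gt;nnf_cross_entropy()&lt;/code&gt;, which bundles both steps, yields the same result.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

scores &amp;lt;- torch_randn(4, 3) # raw network output, one row per sample
targets &amp;lt;- torch_tensor(c(1, 3, 2, 1), dtype = torch_long())

# log-softmax followed by negative log likelihood ...
nnf_nll_loss(nnf_log_softmax(scores, dim = 2), targets)

# ... equals cross entropy computed on the raw scores
nnf_cross_entropy(scores, targets)
&lt;/code&gt;&lt;/pre&gt;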

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;net &amp;lt;- nn_module(
  &amp;quot;penguin_net&amp;quot;,

  initialize = function(cardinalities,
                        n_cont,
                        fc_dim,
                        output_dim) {
    
    self$embedder &amp;lt;- embedding_module(cardinalities)
    self$fc1 &amp;lt;- nn_linear(sum(purrr::map(cardinalities, function(x) ceiling(x/2)) %&amp;gt;% unlist()) + n_cont, fc_dim)
    self$output &amp;lt;- nn_linear(fc_dim, output_dim)
    
  },

  forward = function(x_cont, x_cat) {
    
    embedded &amp;lt;- self$embedder(x_cat)
    
    all &amp;lt;- torch_cat(list(embedded, x_cont$to(dtype = torch_float())), dim = 2)
    
    all %&amp;gt;% self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$output() %&amp;gt;%
      nnf_log_softmax(dim = 2)
    
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s instantiate the top-level module:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;model &amp;lt;- net(
  cardinalities = c(length(levels(penguins$island)), length(levels(penguins$sex))),
  n_cont = 5,
  fc_dim = 32,
  output_dim = 3
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And we&amp;rsquo;re ready for training!&lt;/p&gt;

&lt;h1 id=&#34;model-training&#34;&gt;Model training&lt;/h1&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;optimizer &amp;lt;- optim_adam(model$parameters, lr = 0.01)

for (epoch in 1:20) {

  model$train()
  train_losses &amp;lt;- c()  

  coro::loop(for (b in train_dl) {
    
    optimizer$zero_grad()
    output &amp;lt;- model(b$x_cont, b$x_cat)
    loss &amp;lt;- nnf_nll_loss(output, b$y)
    
    loss$backward()
    optimizer$step()
    
    train_losses &amp;lt;- c(train_losses, loss$item())
    
  })

  model$eval()
  valid_losses &amp;lt;- c()

  coro::loop(for (b in valid_dl) {
    
    output &amp;lt;- model(b$x_cont, b$x_cat)
    loss &amp;lt;- nnf_nll_loss(output, b$y)
    valid_losses &amp;lt;- c(valid_losses, loss$item())
    
  })

  cat(sprintf(&amp;quot;Loss at epoch %d: training: %3.3f, validation: %3.3f\n&amp;quot;, epoch, mean(train_losses), mean(valid_losses)))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;Loss at epoch 1: training: 34.962, validation: 4.354
Loss at epoch 2: training: 8.207, validation: 14.512
Loss at epoch 3: training: 7.804, validation: 2.820
Loss at epoch 4: training: 5.998, validation: 8.525
Loss at epoch 5: training: 8.293, validation: 5.594
Loss at epoch 6: training: 6.375, validation: 4.540
Loss at epoch 7: training: 7.478, validation: 2.120
Loss at epoch 8: training: 3.470, validation: 3.508
Loss at epoch 9: training: 12.155, validation: 4.266
Loss at epoch 10: training: 10.168, validation: 4.285
Loss at epoch 11: training: 5.963, validation: 1.888
Loss at epoch 12: training: 3.035, validation: 2.454
Loss at epoch 13: training: 1.993, validation: 1.185
Loss at epoch 14: training: 2.454, validation: 2.200
Loss at epoch 15: training: 1.641, validation: 0.588
Loss at epoch 16: training: 0.996, validation: 1.959
Loss at epoch 17: training: 0.912, validation: 0.674
Loss at epoch 18: training: 1.517, validation: 0.487
Loss at epoch 19: training: 1.569, validation: 1.202
Loss at epoch 20: training: 0.735, validation: 1.313
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
  </channel>
</rss>
