<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>get started | torch for R</title>
    <link>/start/</link>
      <atom:link href="/start/index.xml" rel="self" type="application/rss+xml" />
    <description>get started</description>
    <generator>Hugo -- gohugo.io</generator><language>en-us</language>
    <item>
      <title>Guess the correlation</title>
      <link>/start/guess_the_correlation/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/start/guess_the_correlation/</guid>
      <description>

&lt;h1 id=&#34;get-the-packages&#34;&gt;Get the packages&lt;/h1&gt;

&lt;p&gt;To use &lt;code&gt;torch&lt;/code&gt;, you first need to install it. The same goes for its high-level wrapper, &lt;a href=&#34;https://mlverse.github.io/luz/index.html&#34; target=&#34;_blank&#34;&gt;luz&lt;/a&gt;. While &lt;code&gt;torch&lt;/code&gt; provides all the basic functionality, &lt;code&gt;luz&lt;/code&gt; adds a declarative, concise API that lets you train a network in a few lines of code.&lt;/p&gt;

&lt;p&gt;To get the respective CRAN versions, do&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;install.packages(&amp;quot;torch&amp;quot;)
install.packages(&amp;quot;luz&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Does it work? Here&amp;rsquo;s a quick test:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)
library(luz)
torch_tensor(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
 1
[ CPUFloatType{1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, while &lt;code&gt;torch&lt;/code&gt; contains all the core functionality, and &lt;code&gt;luz&lt;/code&gt;, the training logic, there is a whole ecosystem built around them.&lt;/p&gt;

&lt;p&gt;Notably, &lt;code&gt;torchvision&lt;/code&gt; is essential for image-processing tasks. In this example, we don&amp;rsquo;t use it much overtly; it plays a more prominent role behind the scenes. Let&amp;rsquo;s get it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;install.packages(&amp;quot;torchvision&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torchvision)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, we install &lt;code&gt;torchdatasets&lt;/code&gt;, which wraps datasets in a convenient format, rendering them immediately usable from &lt;code&gt;torch&lt;/code&gt;. We&amp;rsquo;re going to use one of the datasets it provides.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;install.packages(&amp;quot;torchdatasets&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torchdatasets)
&lt;/code&gt;&lt;/pre&gt;

&lt;h1 id=&#34;get-the-dataset&#34;&gt;Get the dataset&lt;/h1&gt;

&lt;p&gt;&amp;ldquo;Guess the correlation&amp;rdquo; is a fun dataset that tasks one &amp;ndash; a person, if they feel like it, or a program, if we train it &amp;ndash; with estimating the (linear) correlation between two variables displayed in a scatterplot.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;torchdatasets&lt;/code&gt; will download, unpack, and preprocess it for us.&lt;/p&gt;

&lt;p&gt;The training set is huge: it has 150,000 observations. For instruction purposes, we don&amp;rsquo;t really need that much data &amp;ndash; we&amp;rsquo;ll restrict ourselves to small subsets for each of the training, validation, and test sets.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_indices &amp;lt;- 1:10000
val_indices &amp;lt;- 10001:15000
test_indices &amp;lt;- 15001:20000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The following snippet does three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;download and unpack the dataset,&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;do some custom preprocessing on the images (on top of what is already done by default) &amp;ndash; more on that soon&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;take just the first 10000 observations and put them in a &lt;code&gt;torch&lt;/code&gt; &lt;code&gt;Dataset&lt;/code&gt; object named &lt;code&gt;train_ds&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;add_channel_dim &amp;lt;- function(img) img$unsqueeze(1)
crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)

root &amp;lt;- file.path(tempdir(), &amp;quot;correlation&amp;quot;)

train_ds &amp;lt;- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    # don&#39;t take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
  )
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While we&amp;rsquo;re at it, let&amp;rsquo;s do the same for the validation and test sets. We don&amp;rsquo;t need to download again, as we&amp;rsquo;re building on the same underlying data; we just pick different observations.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;valid_ds &amp;lt;- guess_the_correlation_dataset(
    root = root,
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    indexes = val_indices,
    download = FALSE
  )

test_ds &amp;lt;- guess_the_correlation_dataset(
    root = root,
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    indexes = test_indices,
    download = FALSE
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s double-check that we got what we wanted. How many items are there in each set?&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;length(train_ds)
length(valid_ds)
length(test_ds)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 10000
[1] 5000
[1] 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And what does a single observation look like? Here is the first one:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_ds[1]
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;$x
torch_tensor
(1,.,.) = 
 Columns 1 to 9  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{1,130,130} ]

$y
torch_tensor
-0.45781
[ CPUFloatType{} ]

$id
[1] &amp;quot;arjskzyc&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;rsquo;s a list of three items, the last of which we&amp;rsquo;re not interested in for our purposes.&lt;/p&gt;

&lt;p&gt;The second, a scalar tensor, is the correlation value &amp;ndash; the thing we want the network to learn. The first, &lt;code&gt;x&lt;/code&gt;, is the scatterplot: a tensor representing a 130 x 130 image. But wait &amp;ndash; what is that &lt;code&gt;1&lt;/code&gt; in the shape output?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[ CPUFloatType{1,130,130} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This really is a three-dimensional tensor! The first dimension holds different &lt;em&gt;channels&lt;/em&gt; &amp;ndash; or the single channel, if the image has but one. In fact, the reason &lt;code&gt;x&lt;/code&gt; came in this format is that we requested it, here:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;add_channel_dim &amp;lt;- function(img) img$unsqueeze(1)

train_ds &amp;lt;- guess_the_correlation_dataset(
    # ...
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    # ...
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;add_channel_dim()&lt;/code&gt; was passed in as a custom transformation, to be applied to every item of the dataset. It calls one of &lt;code&gt;torch&lt;/code&gt;&amp;rsquo;s many tensor operations, &lt;code&gt;unsqueeze()&lt;/code&gt;, which adds a singleton dimension at a requested position.&lt;/p&gt;
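
&lt;p&gt;To make this concrete, here is a minimal standalone illustration (not part of the pipeline) &amp;ndash; we create a 2-d tensor and add a channels dimension in front:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;t &amp;lt;- torch_rand(130, 130) # a 2-d tensor, like a single grayscale image
dim(t)
dim(t$unsqueeze(1))       # singleton dimension added in position 1
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 130 130
[1]   1 130 130
&lt;/code&gt;&lt;/pre&gt;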

&lt;p&gt;How about the second custom transformation?&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, we crop the image, cutting off the axes and labels on the left and bottom. These image regions don&amp;rsquo;t contribute any distinctive information, and having the images be smaller saves memory.&lt;/p&gt;

&lt;h1 id=&#34;work-with-batches&#34;&gt;Work with batches&lt;/h1&gt;

&lt;p&gt;Now, we&amp;rsquo;ve done so much work already, but you haven&amp;rsquo;t actually &lt;em&gt;seen&lt;/em&gt; any of the scatterplots yet! The reason we&amp;rsquo;ve been waiting until now is that we want to show a bunch of them at a time, and for that, we need to know how to handle &lt;em&gt;batches&lt;/em&gt; of data.&lt;/p&gt;

&lt;p&gt;So let&amp;rsquo;s create a &lt;code&gt;DataLoader&lt;/code&gt; object from the training set. We&amp;rsquo;ll soon use it to train the model, but right now, we&amp;rsquo;ll just plot the first batch.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;DataLoader&lt;/code&gt; needs to know where to get the data &amp;ndash; namely, from the &lt;code&gt;Dataset&lt;/code&gt; it is passed &amp;ndash; as well as how many items should go into a batch. Optionally, it can return data in random order (&lt;code&gt;shuffle = TRUE&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_dl &amp;lt;- dataloader(train_ds, batch_size = 64, shuffle = TRUE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Like a &lt;code&gt;Dataset&lt;/code&gt;, we can query a &lt;code&gt;DataLoader&lt;/code&gt; for its length. For the &lt;code&gt;Dataset&lt;/code&gt;, this meant the number of items; for a &lt;code&gt;DataLoader&lt;/code&gt;, it means the number of batches:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;length(train_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 157
&lt;/code&gt;&lt;/pre&gt;
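
&lt;p&gt;That number follows directly from the sizes involved: 10,000 observations at a batch size of 64 yield 156 full batches plus one smaller one, so the count is rounded up:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;ceiling(length(train_ds) / 64)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 157
&lt;/code&gt;&lt;/pre&gt;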

&lt;p&gt;To access the first batch, we create an iterator from the &lt;code&gt;DataLoader&lt;/code&gt; and ask it for the first batch. Even if we weren&amp;rsquo;t going to plot, this is a useful check that the dimensions look ok:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;batch &amp;lt;- dataloader_make_iter(train_dl) %&amp;gt;% dataloader_next()

dim(batch$x)
dim(batch$y)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1]  64   1 130 130
[1] 64
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And plot! Note how we first remove the &lt;em&gt;channels&lt;/em&gt; dimension &amp;ndash; &lt;code&gt;as.raster()&lt;/code&gt; wouldn&amp;rsquo;t like it &amp;ndash; and then convert the tensor to R for further processing:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;par(mfrow = c(8,8), mar = rep(0, 4))

images &amp;lt;- as.array(batch$x$squeeze(2))

images %&amp;gt;%
  purrr::array_tree(1) %&amp;gt;%
  purrr::map(as.raster) %&amp;gt;%
  purrr::iwalk(~{plot(.x)})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;correlations.png&#34; width=&#34;80%&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Want to try your skill at guessing these? Here is the corresponding ground truth:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;batch$y %&amp;gt;% as.numeric() %&amp;gt;% round(digits = 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] -0.29  0.58 -0.57  0.56  0.10  0.09  0.21  0.45 -0.24  0.65  0.70  0.40  0.71  0.20  0.07  0.66  0.65 -0.56  0.73
[20] -0.40 -0.18 -0.42 -0.46 -0.45 -0.77  0.09 -0.19  0.40 -0.70 -0.04 -0.16 -0.13 -0.18  0.01  0.25  0.54  0.21  0.28
[39]  0.49  0.86 -0.70  0.51  0.47 -0.46  0.88  0.00  0.24  0.28  0.28 -0.04 -0.74  0.43  0.74  0.01 -0.21  0.66 -0.45
[58] -0.44  0.50 -0.69 -0.65 -0.66 -0.55 -0.53
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, just as they got their own &lt;code&gt;Dataset&lt;/code&gt; objects, test and validation data each need their own &lt;code&gt;DataLoader&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;valid_dl &amp;lt;- dataloader(valid_ds, batch_size = 64)
length(valid_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 79
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;test_dl &amp;lt;- dataloader(test_ds, batch_size = 64)
length(test_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 79
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And we&amp;rsquo;re ready to create the model!&lt;/p&gt;

&lt;h1 id=&#34;create-the-model&#34;&gt;Create the model&lt;/h1&gt;

&lt;p&gt;Let&amp;rsquo;s first see what we&amp;rsquo;re trying to accomplish. Our input data are images; normally this means we&amp;rsquo;ll work with some kind of convolutional neural network (CNN). In &lt;code&gt;torch&lt;/code&gt;, a neural network is a &lt;code&gt;module&lt;/code&gt;: a container for more granular &lt;code&gt;modules&lt;/code&gt;, which themselves may be built up of yet more fine-grained &lt;code&gt;modules&lt;/code&gt;. While in theory, this kind of compositionality is unlimited, in our example there are just two levels: a top-level &lt;code&gt;module&lt;/code&gt; representing the &lt;em&gt;model&lt;/em&gt;, and &lt;em&gt;submodules&lt;/em&gt; that, in other frameworks, would be called &lt;em&gt;layers&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The overall model is created by a call to &lt;code&gt;nn_module()&lt;/code&gt;. This instantiates an &lt;code&gt;nn_Module&lt;/code&gt;, an R6 class that knows how to act as a neural network. This object can have any number of methods, but two are essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;initialize()&lt;/code&gt;, the place to instantiate any &lt;em&gt;submodules&lt;/em&gt;; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;forward()&lt;/code&gt;, the place to define what should happen when this module is called.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;code&gt;initialize()&lt;/code&gt;, we instantiate five submodules &amp;ndash; three convolutional layers and two linear ones:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# zooming in on just initialize() - don&#39;t run standalone

net &amp;lt;- nn_module(
  
  # ...
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
 # ...
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The convolutional (often abbreviated &lt;em&gt;conv&lt;/em&gt;) layers each apply a filter (or: kernel) of size 3 x 3. This filter slides over the image and computes local aggregates. And there is not just a single filter per layer; there are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;32 of them in the first conv layer,&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;64 in the second, and&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;128 in the third.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The filters are trained to pick up informative spatial features, features that will be able to tell us something about the image.&lt;/p&gt;

&lt;p&gt;In addition to the three conv layers, we have two linear ones. These are the prototypical neural network layers: they get input from all units in the previous layer, combine the individual contributions as they see fit, and send their own results on to all units in the next layer. The first linear layer acts on the features received from the last conv layer; it consists of 128 units. The second is the output layer: it emits a single numeric value, representing the guess our network makes about the size of the correlation.&lt;/p&gt;
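
&lt;p&gt;Where does the &lt;code&gt;14 * 14 * 128&lt;/code&gt; in &lt;code&gt;fc1&lt;/code&gt; come from? As we&amp;rsquo;ll see in &lt;code&gt;forward()&lt;/code&gt;, each conv layer (kernel size 3, no padding) shrinks each spatial dimension by two, and each pooling step halves it, rounding down. A quick back-of-the-envelope trace in plain R (a standalone illustration, not part of the model):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;size &amp;lt;- 130
for (i in 1:3) {
  size &amp;lt;- size - 2        # conv with kernel_size = 3, no padding
  size &amp;lt;- floor(size / 2) # nnf_avg_pool2d(2)
}
size
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 14
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With 128 channels, that leaves 14 * 14 * 128 = 25088 input features for the first linear layer.&lt;/p&gt;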

&lt;p&gt;Now, while &lt;code&gt;initialize()&lt;/code&gt; defines the layers, &lt;code&gt;forward()&lt;/code&gt; specifies the order in which to call them &amp;ndash; and what to do &amp;ldquo;in between&amp;rdquo;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# zooming in on just forward() - don&#39;t run standalone

net &amp;lt;- nn_module(
  
 # ...
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What are these things that happen &amp;ldquo;in between&amp;rdquo;? In fact, they are of different types.&lt;/p&gt;

&lt;p&gt;Firstly, we have &lt;code&gt;nnf_relu()&lt;/code&gt;, called four times: after each of the conv layers and after the first linear layer. This is a so-called activation function &amp;ndash; a function that takes the raw results computed by a layer and performs some operation on them. In the case of &lt;code&gt;nnf_relu()&lt;/code&gt; (ReLU &amp;ndash; Rectified Linear Unit), it leaves positive values alone while setting negative ones to 0. You&amp;rsquo;ll encounter additional activation functions as you continue your &lt;code&gt;torch&lt;/code&gt; journey, but ReLU is among the most widely used ones today.&lt;/p&gt;
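
&lt;p&gt;A one-liner makes the behavior concrete:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;nnf_relu(torch_tensor(c(-2, 0, 3)))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
 0
 0
 3
[ CPUFloatType{3} ]
&lt;/code&gt;&lt;/pre&gt;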

&lt;p&gt;Secondly, we have &lt;code&gt;nnf_avg_pool2d(2)&lt;/code&gt;, called after each conv layer. This function downsizes the image, replacing each 2 x 2 patch of pixels by its average. So while we&amp;rsquo;re going &lt;em&gt;up&lt;/em&gt; in the number of channels (from 1 via 32 and 64 to 128), we &lt;em&gt;decrease&lt;/em&gt; spatial resolution.&lt;/p&gt;

&lt;p&gt;Thirdly, there is &lt;code&gt;torch_flatten()&lt;/code&gt;. This one doesn&amp;rsquo;t compute anything &amp;ndash; it just reshapes its input, going &amp;ndash; in this case &amp;ndash; from the four-dimensional structure delivered by the third conv layer (after pooling) to the two-dimensional one expected by the first linear layer.&lt;/p&gt;
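
&lt;p&gt;To see just the reshaping step in isolation: after the last pooling operation, a batch has shape 64 x 128 x 14 x 14, and &lt;code&gt;torch_flatten(start_dim = 2)&lt;/code&gt; merges everything from the second dimension onward (a standalone illustration on random data):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;x &amp;lt;- torch_randn(64, 128, 14, 14)
dim(torch_flatten(x, start_dim = 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1]    64 25088
&lt;/code&gt;&lt;/pre&gt;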

&lt;p&gt;Now, here is the complete model creation code:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_manual_seed(777)

net &amp;lt;- nn_module(
  
  &amp;quot;corr-cnn&amp;quot;,
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even before training, we can call the model on a batch of data &amp;ndash; this immediately tells us if we got all shapes matching up:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;model &amp;lt;- net()
model(batch$x)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
0.01 *
-2.8979
 -2.8873
 -2.8699
 -2.9787
 -2.8223
 -3.0255
 -3.1181
 -3.0603
 -3.0520
 -2.8242
 -3.0000
 -2.9150
 -2.9497
 -2.7662
 -2.7980
 -2.9540
 -2.8548
 -2.7927
 -3.0426
 -2.9540
 -2.8846
 -2.8008
 -2.8966
 -2.8358
 -2.9266
 -2.9022
 -2.8667
 -2.8716
 -2.7371
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{64,1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After all that hard work, training the model with &lt;code&gt;luz&lt;/code&gt; is a breeze.&lt;/p&gt;

&lt;h1 id=&#34;train-the-network&#34;&gt;Train the network&lt;/h1&gt;

&lt;p&gt;What happens when you train a neural network? &lt;em&gt;Conceptually&lt;/em&gt;, the following has to happen for every batch. (Wait &amp;ndash; don&amp;rsquo;t execute these lines :-) You&amp;rsquo;ll see &lt;code&gt;luz&lt;/code&gt; taking care of it for you in a minute.)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Run the model on the input, to obtain its current predictions:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;output &amp;lt;- model(b$x)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Calculate the &lt;em&gt;loss&lt;/em&gt;, a measure of divergence between model estimate and ground truth:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;loss &amp;lt;- nnf_mse_loss(output, b$y$unsqueeze(2))
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Have that loss &lt;em&gt;propagate back&lt;/em&gt; through the network, causing gradients to be computed for all parameters:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;loss$backward()
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Ask the optimizer to update the parameters accordingly:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;optimizer$step()
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;
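
&lt;p&gt;Assembled into a complete loop &amp;ndash; with the two pieces not shown above, creating the optimizer and zeroing out gradients left over from the previous batch &amp;ndash; this would look something like the following. (Again, don&amp;rsquo;t run this; it&amp;rsquo;s a sketch of what &lt;code&gt;luz&lt;/code&gt; will do for us.)&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;model &amp;lt;- net()
optimizer &amp;lt;- optim_adam(model$parameters)

for (epoch in 1:10) {
  coro::loop(for (b in train_dl) {
    optimizer$zero_grad()                          # reset gradients
    output &amp;lt;- model(b$x)                           # 1. forward pass
    loss &amp;lt;- nnf_mse_loss(output, b$y$unsqueeze(2)) # 2. compute loss
    loss$backward()                                # 3. backpropagate
    optimizer$step()                               # 4. update parameters
  })
}
&lt;/code&gt;&lt;/pre&gt;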

&lt;p&gt;Fortunately, with &lt;code&gt;luz&lt;/code&gt;, we don&amp;rsquo;t have to write the training loop ourselves! All this is taken care of by a pair of functions: &lt;code&gt;setup()&lt;/code&gt; and &lt;code&gt;fit()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;setup()&lt;/code&gt;, we decide which loss function and which optimization algorithm to use. For regression problems, the most popular loss is mean squared error: &lt;code&gt;nn_mse_loss()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Among optimization algorithms (&amp;ldquo;optimizers&amp;rdquo;), one of the most popular is Adam (&lt;code&gt;optim_adam()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setup()&lt;/code&gt; is called on the model definition, like so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then &lt;code&gt;fit()&lt;/code&gt; is used to pass the training data loader, the number of epochs to train for, and optionally, the validation data loader. After every epoch, the model is run on the validation data, in &amp;ldquo;test mode&amp;rdquo; (no parameter updates involved). That way, you immediately see whether you&amp;rsquo;re overfitting to the training set. Here are both calls together &amp;ndash; everything we need to start training:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  ) %&amp;gt;%
  fit(train_dl, epochs = 10, valid_data = valid_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, the network has made good progress &amp;ndash; on both the training and validation sets. How about the test set? And how good a fit are the inferred correlations?&lt;/p&gt;

&lt;h1 id=&#34;evaluate-performance&#34;&gt;Evaluate performance&lt;/h1&gt;

&lt;p&gt;We use &lt;code&gt;luz::predict()&lt;/code&gt; to get predictions on the test set:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;preds &amp;lt;- predict(fitted, test_dl)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;How do predictions and ground truth line up? Well, since all this has been about scatterplots, why not create one to investigate that?&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;preds &amp;lt;- preds$to(device = &amp;quot;cpu&amp;quot;)$squeeze() %&amp;gt;% as.numeric()
test_dl &amp;lt;- dataloader(test_ds, batch_size = 5000)
targets &amp;lt;- (test_dl %&amp;gt;% dataloader_make_iter() %&amp;gt;% dataloader_next())$y %&amp;gt;% as.numeric()

df &amp;lt;- data.frame(preds = preds, targets = targets)

library(ggplot2)

ggplot(df, aes(x = targets, y = preds)) +
  geom_point(size = 0.1) +
  theme_classic() +
  xlab(&amp;quot;true correlations&amp;quot;) +
  ylab(&amp;quot;model predictions&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;scatter.png&#34; width=&#34;80%&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Want to guess the correlation &amp;hellip;?&lt;/p&gt;

&lt;p&gt;So that&amp;rsquo;s it &amp;ndash; you&amp;rsquo;ve seen the complete workflow end-to-end, from data loading to model evaluation. The next tutorial asks a few &lt;em&gt;what if?&lt;/em&gt; questions &amp;ndash; e.g., what if I don&amp;rsquo;t want to predict a numerical output? &amp;ndash; and offers some ideas for experimentation.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>What if? Experiments and adaptations</title>
      <link>/start/what_if/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/start/what_if/</guid>
      <description>

&lt;p&gt;What is it that we&amp;rsquo;ve done in the previous tutorial? Put abstractly, we&amp;rsquo;ve trained a network to &lt;em&gt;take in images&lt;/em&gt; and &lt;em&gt;output&lt;/em&gt; a &lt;em&gt;continuous&lt;/em&gt; numerical value.&lt;/p&gt;

&lt;p&gt;In the process, we&amp;rsquo;ve made decisions all the time &amp;ndash; what, and how many, layers to use; how to calculate the loss; what optimization algorithm to apply; how long to train; and more. We can&amp;rsquo;t go into all of them here, and we can&amp;rsquo;t go into great detail. But the good thing is: with deep learning, you can always experiment and find out. (In fact, more often than not, experimenting is the only way to find out!)&lt;/p&gt;

&lt;p&gt;So this page is basically an invitation to try out things for yourself.&lt;/p&gt;

&lt;h1 id=&#34;what-if-we-were-working-with-a-different-kind-of-data-not-images&#34;&gt;What if &amp;hellip; we were working with a different kind of data &amp;ndash; not images?&lt;/h1&gt;

&lt;p&gt;With deep learning, the type of input data decides the type of architecture we use &amp;ndash; or architectures. (Quick note: by architecture, I mean something more like a family than a specific model. For example, convolutional neural networks (CNNs) would be one; or Long Short-Term Memory (LSTM) networks; or Transformers.)&lt;/p&gt;

&lt;p&gt;Sometimes there are several established architectures for a problem; sometimes there&amp;rsquo;s one most prominent family. Even in the latter case though, there is no rule you &lt;em&gt;have&lt;/em&gt; to use it.&lt;/p&gt;

&lt;p&gt;For example, take our scatterplot images. The canonical architecture in image recognition is the CNN. &lt;em&gt;But&lt;/em&gt; you could still work on image data using nothing but linear layers. Depending on the task, this may or may not work so well.&lt;/p&gt;

&lt;p&gt;So why not give it a try? There are three places you have to modify: the dataset, the model, and the line that calculates the loss.&lt;/p&gt;

&lt;p&gt;The model&amp;rsquo;s first linear layer is going to deal with the image input. Being a linear layer, it wants to be presented with a flat vector of numbers. So where the previous dataset took two-dimensional inputs and added a &lt;code&gt;channels&lt;/code&gt; dimension, the new one, on the contrary, flattens the 2-d matrix into a 1-d vector:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)

root &amp;lt;- file.path(tempdir(), &amp;quot;correlation&amp;quot;)

# change valid_ds and test_ds analogously
train_ds &amp;lt;- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %&amp;gt;% torch_flatten(),
    # don&#39;t take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The model now consists of all linear layers:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_manual_seed(777)

net &amp;lt;- nn_module(
  
  &amp;quot;corr-mlp&amp;quot;,
  
  initialize = function() {
    
    self$fc1 &amp;lt;- nn_linear(in_features = 130 * 130, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 256)
    self$fc3 &amp;lt;- nn_linear(in_features = 256, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      
      self$fc3() 
  }
)

model &amp;lt;- net()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Compared to the convnet, how well does this work? You will find it performs a lot worse. In a way, this is no surprise &amp;ndash; it&amp;rsquo;s not for nothing that we use convolutional architectures with images. However, the extent to which a convnet outperforms a linear model is still input- and task-dependent. Were you to run an analogous comparison for MNIST digit classification (the &lt;code&gt;mnist_dataset()&lt;/code&gt; that comes with &lt;code&gt;torch&lt;/code&gt;) you&amp;rsquo;d find that a linear model is able to achieve sensible results.&lt;/p&gt;

&lt;h1 id=&#34;what-if-we-wanted-to-classify-the-images-not-predict-a-continuous-target&#34;&gt;What if &amp;hellip; we wanted to &lt;em&gt;classify&lt;/em&gt; the images, not predict a continuous target?&lt;/h1&gt;

&lt;p&gt;Assume we had the same input data as before, but now we just care if there&amp;rsquo;s a substantial correlation or not. Let&amp;rsquo;s say we&amp;rsquo;re interested in whether its magnitude is below or above 0.5.&lt;/p&gt;

&lt;p&gt;This time, we only have to make modifications in two places: the dataset and the loss. The dataset now binarizes the target according to our new requirements, passing in a &lt;code&gt;target_transform&lt;/code&gt; in addition to the &lt;code&gt;transform&lt;/code&gt; destined for the image:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;add_channel_dim &amp;lt;- function(img) img$unsqueeze(1)
crop_axes &amp;lt;- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)
binarize &amp;lt;- function(tensor) torch_round(torch_abs(tensor))

root &amp;lt;- file.path(tempfile(), &amp;quot;correlation&amp;quot;)

# same for validation set and test set
train_ds &amp;lt;- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %&amp;gt;% add_channel_dim(),
    # binarize target data
    target_transform = binarize,
    # don&#39;t take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now that we want the network to output a &lt;code&gt;0&lt;/code&gt; or a &lt;code&gt;1&lt;/code&gt; instead of a continuous value, we need to use a different loss function. &lt;code&gt;nnf_binary_cross_entropy_with_logits()&lt;/code&gt; takes the raw network output (the &amp;ldquo;logits&amp;rdquo;), applies the sigmoid internally in a numerically stable way, and computes cross entropy between the result and the targets. (If you&amp;rsquo;re thinking, &amp;ldquo;where is the sigmoid, shouldn&amp;rsquo;t we have had the network apply a sigmoid activation in the end?&amp;rdquo; &amp;ndash; it&amp;rsquo;s not needed, precisely because the loss function applies it itself.)&lt;/p&gt;
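&lt;p&gt;To make that relationship concrete, here is a quick sketch: the two computations below agree up to numerical error, but the &lt;code&gt;_with_logits&lt;/code&gt; variant is the numerically stable one.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

logits &amp;lt;- torch_randn(4, 1)
targets &amp;lt;- torch_rand(4, 1)$round()

# sigmoid applied inside the loss ...
nnf_binary_cross_entropy_with_logits(logits, targets)

# ... is equivalent to applying it ourselves first
nnf_binary_cross_entropy(torch_sigmoid(logits), targets)
&lt;/code&gt;&lt;/pre&gt;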

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_binary_cross_entropy_with_logits(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And again, loss decreases. But now that we&amp;rsquo;re using cross entropy instead of mean squared error, it is a lot more difficult to get an impression of how well this really works! To find out, why don&amp;rsquo;t you check out predictions on the test set?&lt;/p&gt;
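&lt;p&gt;For instance, you could obtain class predictions like this (a sketch; it assumes a &lt;code&gt;test_dl&lt;/code&gt; dataloader built analogously to &lt;code&gt;train_dl&lt;/code&gt;, and the &lt;code&gt;fitted&lt;/code&gt; object returned by luz):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# raw logits for the test set
preds &amp;lt;- predict(fitted, test_dl)

# convert to probabilities, then threshold at 0.5
pred_class &amp;lt;- (torch_sigmoid(preds) &amp;gt; 0.5)$to(dtype = torch_int())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Comparing &lt;code&gt;pred_class&lt;/code&gt; against the binarized targets then gives you an accuracy that is much easier to interpret than the raw loss value.&lt;/p&gt;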

&lt;h1 id=&#34;what-if-we-made-changes-to-the-optimization-routine&#34;&gt;What if &amp;hellip; we made changes to the optimization routine?&lt;/h1&gt;

&lt;p&gt;Thankfully, &lt;code&gt;torch&lt;/code&gt; takes care of all gradient computations for us, and unless we&amp;rsquo;re implementing custom operations, we don&amp;rsquo;t normally need to think about this. However, the way these gradients are being made use of is something we can influence. Optimizers differ in how they compute weight updates, and choosing a different algorithm may make a significant difference.&lt;/p&gt;

&lt;p&gt;Truth be told, though, this is mostly a matter of experimentation. The &lt;a href=&#34;https://torch.mlverse.org/docs/reference/optim_adam.html&#34; target=&#34;_blank&#34;&gt;Adam&lt;/a&gt; algorithm used here is among the most-established ones; however you could try a few others for comparison: for example, &lt;a href=&#34;https://torch.mlverse.org/docs/reference/optim_sgd.html&#34; target=&#34;_blank&#34;&gt;SGD&lt;/a&gt; or &lt;a href=&#34;https://torch.mlverse.org/docs/reference/optim_rmsprop.html&#34; target=&#34;_blank&#34;&gt;RMSProp&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition to trying different optimizers, you can experiment with how they&amp;rsquo;re configured. Different optimizers have different tuning knobs, but most of them have one in common: the learning rate, a parameter indicating how big a step to take in optimization. Change the learning rate to a higher or lower value and find out how this affects optimization performance.&lt;/p&gt;
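&lt;p&gt;With luz, both experiments are a small change to the pipeline. A sketch, assuming the regression setup from above (the learning rate of &lt;code&gt;0.001&lt;/code&gt; is just an arbitrary starting point):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    # swap in a different optimization algorithm ...
    optimizer = optim_sgd
  ) %&amp;gt;%
  # ... and configure its learning rate
  set_opt_hparams(lr = 0.001) %&amp;gt;%
  fit(train_dl, epochs = 10, valid_data = valid_dl)
&lt;/code&gt;&lt;/pre&gt;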

&lt;p&gt;Speaking of learning rates, &lt;code&gt;torch&lt;/code&gt; has learning rate schedulers that allow you to change learning rates over time. For example, &lt;a href=&#34;https://torch.mlverse.org/docs/reference/lr_step.html&#34; target=&#34;_blank&#34;&gt;lr_step()&lt;/a&gt; allows you to shrink it, by some degree, every configurable number of steps. If you&amp;rsquo;re interested in pursuing this topic, a current best-practice approach to handling learning rates is illustrated &lt;a href=&#34;https://blogs.rstudio.com/ai/posts/2020-10-19-torch-image-classification/#training&#34; target=&#34;_blank&#34;&gt;in this post&lt;/a&gt;.&lt;/p&gt;
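&lt;p&gt;With luz, a scheduler is added as a callback. A sketch (the &lt;code&gt;step_size&lt;/code&gt; and &lt;code&gt;gamma&lt;/code&gt; values are arbitrary, and worth experimenting with):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fitted &amp;lt;- net %&amp;gt;%
  setup(
    loss = function(y_hat, y_true) nnf_mse_loss(y_hat, y_true$unsqueeze(2)),
    optimizer = optim_adam
  ) %&amp;gt;%
  fit(
    train_dl,
    epochs = 10,
    valid_data = valid_dl,
    callbacks = list(
      # multiply the learning rate by 0.9 after every epoch
      luz_callback_lr_scheduler(lr_step, step_size = 1, gamma = 0.9)
    )
  )
&lt;/code&gt;&lt;/pre&gt;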

&lt;h1 id=&#34;what-if-we-made-the-network-bigger-or-trained-it-for-a-longer-time&#34;&gt;What if &amp;hellip; we made the network bigger or trained it for a longer time?&lt;/h1&gt;

&lt;p&gt;If you make a network &amp;ldquo;bigger&amp;rdquo;, increasing the number of parameters (for a linear layer, via &lt;code&gt;out_features&lt;/code&gt;; for a convolutional one, via &lt;code&gt;out_channels&lt;/code&gt;), in theory it gets more powerful. Analogously, if you give it more time to train, it may arrive at better results. However, depending on the task, you may or may not see improvements &amp;ndash; again, the only way to know is to try.&lt;/p&gt;

&lt;p&gt;And there is something else to think about. If something you do improves performance on the training set, does it generalize to the test set? As in machine learning in general, in deep learning one needs to be wary of overfitting. But what are countermeasures you could take?&lt;/p&gt;

&lt;p&gt;Before thinking of anything technical, you&amp;rsquo;d always want to think through what you know about the data and the underlying context. Analytically, what could cause the training and the test data to come from different distributions? Is there a way to have these distributions become more similar?&lt;/p&gt;

&lt;p&gt;The next thing, then, is not quite technical either. If there&amp;rsquo;s no compelling reason to assume that the test data will be systematically different, it&amp;rsquo;s just: the more data the better. This is why in our example task, we don&amp;rsquo;t see much overfitting &amp;ndash; the dataset is gigantic (and we&amp;rsquo;ve been using but a tiny fraction!).&lt;/p&gt;

&lt;p&gt;If getting more data is not an option, we can add regularization. In deep learning, the most popular ways of doing this are &lt;em&gt;dropout&lt;/em&gt; and &lt;em&gt;batch normalization&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Dropout adds random noise by stochastically removing units during training, making the net more robust to presence/absence of individual features. In our example, you could add dropout as follows. (Here &lt;code&gt;p&lt;/code&gt; passed to &lt;code&gt;nnf_dropout()&lt;/code&gt; is the dropout probability. Not surprisingly, this, again, is a hyperparameter you&amp;rsquo;ll want to experiment with.)&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;net &amp;lt;- nn_module(
  
  &amp;quot;corr-cnn&amp;quot;,
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Batch normalization is less well understood theoretically, but can be extremely effective in some cases. Besides acting as a regularizer, it also stabilizes training and may allow for using higher learning rates.&lt;/p&gt;

&lt;p&gt;With batch normalization, our network could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;net &amp;lt;- nn_module(
  
  &amp;quot;corr-cnn&amp;quot;,
  
  initialize = function() {
    
    self$conv1 &amp;lt;- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 &amp;lt;- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 &amp;lt;- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$bn1 &amp;lt;- nn_batch_norm2d(num_features = 32)
    self$bn2 &amp;lt;- nn_batch_norm2d(num_features = 64)
    self$bn3 &amp;lt;- nn_batch_norm2d(num_features = 128)
    self$bn4 &amp;lt;- nn_batch_norm1d(num_features = 128)
    
    self$fc1 &amp;lt;- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 &amp;lt;- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %&amp;gt;% 
      self$conv1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn1() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv2() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn2() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      self$conv3() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn3() %&amp;gt;%
      nnf_avg_pool2d(2) %&amp;gt;%
      nnf_dropout(p = 0.2) %&amp;gt;%
      
      torch_flatten(start_dim = 2) %&amp;gt;%
      self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$bn4() %&amp;gt;%
      
      self$fc2()
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you&amp;rsquo;ve found that a regularizing measure works &amp;ndash; meaning, performance on the validation set is similar to that on the training set, or maybe even better &amp;ndash; you can go back and add more capacity to the network: add more layers, train for a longer time, etc. Maybe you&amp;rsquo;ll arrive at better performance overall!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Create your own Dataset</title>
      <link>/start/custom_dataset/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/start/custom_dataset/</guid>
      <description>

&lt;p&gt;Unless the data you&amp;rsquo;re working with comes with some package in the &lt;code&gt;torch&lt;/code&gt; ecosystem, you&amp;rsquo;ll need to wrap it in a &lt;code&gt;Dataset&lt;/code&gt;.&lt;/p&gt;

&lt;h1 id=&#34;torch-dataset-objects&#34;&gt;&lt;code&gt;torch&lt;/code&gt; &lt;code&gt;Dataset&lt;/code&gt; objects&lt;/h1&gt;

&lt;p&gt;A &lt;code&gt;Dataset&lt;/code&gt; is an R6 object that knows how to iterate over data. This is because it acts as a supplier to a &lt;code&gt;DataLoader&lt;/code&gt;, which will ask it to return some number of items.&lt;/p&gt;

&lt;p&gt;(How many? That depends on the batch size &amp;ndash; but batch sizes are handled by the &lt;code&gt;DataLoader&lt;/code&gt;, so the &lt;code&gt;Dataset&lt;/code&gt; needn&amp;rsquo;t be concerned about that. All it has to know is what to do when asked for, e.g., item no. 7.)&lt;/p&gt;

&lt;p&gt;While a &lt;code&gt;Dataset&lt;/code&gt; may have any number of methods &amp;ndash; each responsible for some aspect of pre-processing logic, for example &amp;ndash; just three methods are required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;initialize()&lt;/code&gt; , to pre-process and store the data;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;.getitem(i)&lt;/code&gt;, to pick the item at position &lt;code&gt;i&lt;/code&gt;, and&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;.length()&lt;/code&gt;, to indicate to the &lt;code&gt;DataLoader&lt;/code&gt; how many items it has.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
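&lt;p&gt;In skeleton form, that contract looks like this (a minimal sketch; the name &lt;code&gt;minimal_ds&lt;/code&gt; is made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

minimal_ds &amp;lt;- dataset(
  name = &amp;quot;minimal_ds&amp;quot;,
  # store the data
  initialize = function(x) {
    self$x &amp;lt;- torch_tensor(x)
  },
  # return item no. i
  .getitem = function(i) {
    self$x[i]
  },
  # report how many items there are
  .length = function() {
    self$x$size()[[1]]
  }
)

ds &amp;lt;- minimal_ds(c(1, 2, 3))
length(ds) # 3
ds[2]      # a one-element float tensor holding 2
&lt;/code&gt;&lt;/pre&gt;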

&lt;p&gt;Let&amp;rsquo;s see an example.&lt;/p&gt;

&lt;h1 id=&#34;penguins&#34;&gt;Penguins&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;penguins&lt;/code&gt; is a very nice dataset that lives in the &lt;code&gt;palmerpenguins&lt;/code&gt; package.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(dplyr)
library(palmerpenguins)

penguins %&amp;gt;% glimpse()
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;Rows: 344
Columns: 8
$ species           &amp;lt;fct&amp;gt; Adelie, Adelie, Adelie, Adelie, Adelie, Adelie…
$ island            &amp;lt;fct&amp;gt; Torgersen, Torgersen, Torgersen, Torgersen…
$ bill_length_mm    &amp;lt;dbl&amp;gt; 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2,…
$ bill_depth_mm     &amp;lt;dbl&amp;gt; 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6,…
$ flipper_length_mm &amp;lt;int&amp;gt; 181, 186, 195, NA, 193, 190, 181, 195, 193,…
$ body_mass_g       &amp;lt;int&amp;gt; 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675…
$ sex               &amp;lt;fct&amp;gt; male, female, female, NA, female, male, female…
$ year              &amp;lt;int&amp;gt; 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are three species, and we&amp;rsquo;ll infer them making use of all available information: &amp;ldquo;biometrics&amp;rdquo; like &lt;code&gt;bill_length_mm&lt;/code&gt;, geographic indicators like the &lt;code&gt;island&lt;/code&gt; the penguins inhabit, and more.&lt;/p&gt;

&lt;p&gt;Predictors are of two different types, categorical and continuous.&lt;/p&gt;

&lt;p&gt;Continuous features, of R type &lt;code&gt;double&lt;/code&gt;, may be fed to &lt;code&gt;torch&lt;/code&gt; without further ado. We just directly use them to initialize a &lt;code&gt;torch&lt;/code&gt; tensor, which will be of type &lt;code&gt;Float&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)
torch_tensor(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;torch_tensor
 1
[ CPUFloatType{1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It&amp;rsquo;s different with categorical data though. Firstly, &lt;code&gt;torch&lt;/code&gt; needs all data to be in numerical form, so vectors of type &lt;code&gt;character&lt;/code&gt; need to become factors &amp;ndash; which can then be treated as numeric via level extraction. In the &lt;code&gt;penguins&lt;/code&gt; dataset, &lt;code&gt;island&lt;/code&gt;, &lt;code&gt;sex&lt;/code&gt; , as well as the target column, &lt;code&gt;species&lt;/code&gt;, are factors already. So can we just do an &lt;code&gt;as.numeric()&lt;/code&gt; and that&amp;rsquo;s it?&lt;/p&gt;

&lt;p&gt;Not quite: We also need to reflect on the semantic side of things.&lt;/p&gt;

&lt;h1 id=&#34;categorical-data-in-deep-learning&#34;&gt;Categorical data in deep learning&lt;/h1&gt;

&lt;p&gt;If we just replace islands &lt;em&gt;Biscoe&lt;/em&gt;, &lt;em&gt;Dream&lt;/em&gt;, and &lt;em&gt;Torgersen&lt;/em&gt; by numbers 1, 2, and 3, we present them to the network as interval data, which of course they&amp;rsquo;re not.&lt;/p&gt;

&lt;p&gt;We have two options: transform them to one-hot vectors, where e.g. &lt;em&gt;Biscoe&lt;/em&gt; would be &lt;code&gt;0,0,1&lt;/code&gt;, &lt;em&gt;Dream&lt;/em&gt; &lt;code&gt;0,1,0&lt;/code&gt;, and &lt;em&gt;Torgersen&lt;/em&gt;, &lt;code&gt;1,0,0&lt;/code&gt;, or leave them as they are, but have the network map each discrete value to a multidimensional, continuous representation. The latter is called embedding, and it often helps networks make sense of discrete data.&lt;/p&gt;
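&lt;p&gt;To get a feel for what an embedding layer does, here is a tiny sketch. The weights are randomly initialized (and learned during training), so the actual values will differ on every run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

# map each of 3 islands to a 2-dimensional continuous vector
emb &amp;lt;- nn_embedding(num_embeddings = 3, embedding_dim = 2)

islands &amp;lt;- torch_tensor(c(1, 2, 3), dtype = torch_long())
emb(islands) # a 3 x 2 tensor of learnable representations
&lt;/code&gt;&lt;/pre&gt;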

&lt;p&gt;Embedding modules expect their inputs to be of type &lt;code&gt;Long&lt;/code&gt;. A tensor created from an R value will have the correct type if we make sure it&amp;rsquo;s an &lt;code&gt;integer&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_tensor(as.integer(as.numeric(as.factor(&amp;quot;one&amp;quot;))))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;torch_tensor
 1
[ CPULongType{1} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, let&amp;rsquo;s create a dataset for penguins.&lt;/p&gt;

&lt;h1 id=&#34;a-dataset-for-penguins&#34;&gt;A dataset for penguins&lt;/h1&gt;

&lt;p&gt;In &lt;code&gt;initialize()&lt;/code&gt;, we convert the data as planned and store them for later delivery. Like the categorical input features, the target, &lt;code&gt;species&lt;/code&gt;, is discrete, and thus converted to &lt;code&gt;torch&lt;/code&gt; &lt;code&gt;Long&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;penguins_dataset &amp;lt;- dataset(
  
  name = &amp;quot;penguins_dataset&amp;quot;,
  
  initialize = function(df) {
    
    df &amp;lt;- na.omit(df) 
    
    # continuous input data (x_cont)   
    x_cont &amp;lt;- df[ , c(&amp;quot;bill_length_mm&amp;quot;, &amp;quot;bill_depth_mm&amp;quot;, &amp;quot;flipper_length_mm&amp;quot;, &amp;quot;body_mass_g&amp;quot;, &amp;quot;year&amp;quot;)] %&amp;gt;%
      as.matrix()
    self$x_cont &amp;lt;- torch_tensor(x_cont)
    
    # categorical input data (x_cat)
    x_cat &amp;lt;- df[ , c(&amp;quot;island&amp;quot;, &amp;quot;sex&amp;quot;)]
    x_cat$island &amp;lt;- as.integer(x_cat$island)
    x_cat$sex &amp;lt;- as.integer(x_cat$sex)
    self$x_cat &amp;lt;- as.matrix(x_cat) %&amp;gt;% torch_tensor()

    # target data (y)
    species &amp;lt;- as.integer(df$species)
    self$y &amp;lt;- torch_tensor(species)
    
  },
  
  .getitem = function(i) {
     list(x_cont = self$x_cont[i, ], x_cat = self$x_cat[i, ], y = self$y[i])
    
  },
  
  .length = function() {
    self$y$size()[[1]]
  }
 
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Unlike &lt;code&gt;initialize&lt;/code&gt;, &lt;code&gt;.getitem(i)&lt;/code&gt; and &lt;code&gt;.length()&lt;/code&gt; are just one-liners.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s see if this behaves like we want it to. We randomly split the data into training and validation sets and query their respective lengths:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_indices &amp;lt;- sample(1:nrow(penguins), 250)

train_ds &amp;lt;- penguins_dataset(penguins[train_indices, ])
valid_ds &amp;lt;- penguins_dataset(penguins[setdiff(1:nrow(penguins), train_indices), ])

length(train_ds)
length(valid_ds)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;[1] 242
[1] 91
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can index into &lt;code&gt;Dataset&lt;/code&gt;s in an R-like way:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_ds[1]
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;$x_cont
torch_tensor
   45.2000
   16.4000
  223.0000
 5950.0000
 2008.0000
[ CPUFloatType{5} ]

$x_cat
torch_tensor
 1
 2
[ CPULongType{2} ]

$y
torch_tensor
3
[ CPULongType{} ]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From here on, everything proceeds like in the first tutorial: We use the &lt;code&gt;Dataset&lt;/code&gt;s to instantiate &lt;code&gt;DataLoader&lt;/code&gt;s&amp;hellip;&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;train_dl &amp;lt;- train_ds %&amp;gt;% dataloader(batch_size = 16, shuffle = TRUE)

valid_dl &amp;lt;- valid_ds %&amp;gt;% dataloader(batch_size = 16, shuffle = FALSE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&amp;hellip; and then, create and train the network. The network will look pretty different now though: most notably, you&amp;rsquo;ll see embeddings at work.&lt;/p&gt;

&lt;h1 id=&#34;classifying-penguins-the-network&#34;&gt;Classifying penguins &amp;ndash; the network&lt;/h1&gt;

&lt;p&gt;We just heard that embedding layers work with a datatype that&amp;rsquo;s different from most other neural network layers. It is therefore convenient to have them work in a space of their own, that is, put them into a dedicated container.&lt;/p&gt;

&lt;p&gt;Here we define a specialized module that has one embedding layer for every categorical feature. It gets passed the cardinalities of the respective features, and creates an &lt;code&gt;nn_embedding()&lt;/code&gt; for each of them.&lt;/p&gt;

&lt;p&gt;When called, it iterates over its submodules, lets them do their work, and returns the concatenated output.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;embedding_module &amp;lt;- nn_module(
  
  initialize = function(cardinalities) {
    
    self$embeddings = nn_module_list(lapply(cardinalities, function(x) nn_embedding(num_embeddings = x, embedding_dim = ceiling(x/2))))
    
  },
  
  forward = function(x) {
    
    embedded &amp;lt;- vector(mode = &amp;quot;list&amp;quot;, length = length(self$embeddings))
    for (i in 1:length(self$embeddings)) {
      embedded[[i]] &amp;lt;- self$embeddings[[i]](x[ , i])
    }
    
    torch_cat(embedded, dim = 2)
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The top-level module has three submodules: said &lt;code&gt;embedding_module&lt;/code&gt; and two linear layers.&lt;/p&gt;

&lt;p&gt;The first linear layer takes the output from &lt;code&gt;embedding_module&lt;/code&gt; , computes an affine transformation as it sees fit, and passes its result to the output layer. &lt;code&gt;output&lt;/code&gt; then has three units, one for every possible target class.&lt;/p&gt;

&lt;p&gt;The activation function we apply to the raw aggregation, &lt;code&gt;nnf_log_softmax()&lt;/code&gt;, composes two operations: the &lt;code&gt;softmax&lt;/code&gt; normalization ubiquitous in deep learning, and taking the logarithm. That way, we end up with the format expected by &lt;code&gt;nnf_nll_loss()&lt;/code&gt;, the loss function that computes the negative log likelihood (NLL) loss between inputs and targets.&lt;/p&gt;
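&lt;p&gt;The composition can be verified directly. A sketch: up to numerical error, &lt;code&gt;nnf_cross_entropy()&lt;/code&gt;, which bundles both steps, yields the same result.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(torch)

scores &amp;lt;- torch_randn(4, 3) # raw network output, one row per sample
targets &amp;lt;- torch_tensor(c(1, 3, 2, 1), dtype = torch_long())

# log-softmax followed by negative log likelihood ...
nnf_nll_loss(nnf_log_softmax(scores, dim = 2), targets)

# ... equals cross entropy computed on the raw scores
nnf_cross_entropy(scores, targets)
&lt;/code&gt;&lt;/pre&gt;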

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;net &amp;lt;- nn_module(
  &amp;quot;penguin_net&amp;quot;,

  initialize = function(cardinalities,
                        n_cont,
                        fc_dim,
                        output_dim) {
    
    self$embedder &amp;lt;- embedding_module(cardinalities)
    self$fc1 &amp;lt;- nn_linear(sum(purrr::map(cardinalities, function(x) ceiling(x/2)) %&amp;gt;% unlist()) + n_cont, fc_dim)
    self$output &amp;lt;- nn_linear(fc_dim, output_dim)
    
  },

  forward = function(x_cont, x_cat) {
    
    embedded &amp;lt;- self$embedder(x_cat)
    
    all &amp;lt;- torch_cat(list(embedded, x_cont$to(dtype = torch_float())), dim = 2)
    
    all %&amp;gt;% self$fc1() %&amp;gt;%
      nnf_relu() %&amp;gt;%
      self$output() %&amp;gt;%
      nnf_log_softmax(dim = 2)
    
  }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s instantiate the top-level module:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;model &amp;lt;- net(
  cardinalities = c(length(levels(penguins$island)), length(levels(penguins$sex))),
  n_cont = 5,
  fc_dim = 32,
  output_dim = 3
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And we&amp;rsquo;re ready for training!&lt;/p&gt;

&lt;h1 id=&#34;model-training&#34;&gt;Model training&lt;/h1&gt;

&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;optimizer &amp;lt;- optim_adam(model$parameters, lr = 0.01)

for (epoch in 1:20) {

  model$train()
  train_losses &amp;lt;- c()  

  coro::loop(for (b in train_dl) {
    
    optimizer$zero_grad()
    output &amp;lt;- model(b$x_cont, b$x_cat)
    loss &amp;lt;- nnf_nll_loss(output, b$y)
    
    loss$backward()
    optimizer$step()
    
    train_losses &amp;lt;- c(train_losses, loss$item())
    
  })

  model$eval()
  valid_losses &amp;lt;- c()

  coro::loop(for (b in valid_dl) {
    
    output &amp;lt;- model(b$x_cont, b$x_cat)
    loss &amp;lt;- nnf_nll_loss(output, b$y)
    valid_losses &amp;lt;- c(valid_losses, loss$item())
    
  })

  cat(sprintf(&amp;quot;Loss at epoch %d: training: %3.3f, validation: %3.3f\n&amp;quot;, epoch, mean(train_losses), mean(valid_losses)))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;Loss at epoch 1: training: 34.962, validation: 4.354
Loss at epoch 2: training: 8.207, validation: 14.512
Loss at epoch 3: training: 7.804, validation: 2.820
Loss at epoch 4: training: 5.998, validation: 8.525
Loss at epoch 5: training: 8.293, validation: 5.594
Loss at epoch 6: training: 6.375, validation: 4.540
Loss at epoch 7: training: 7.478, validation: 2.120
Loss at epoch 8: training: 3.470, validation: 3.508
Loss at epoch 9: training: 12.155, validation: 4.266
Loss at epoch 10: training: 10.168, validation: 4.285
Loss at epoch 11: training: 5.963, validation: 1.888
Loss at epoch 12: training: 3.035, validation: 2.454
Loss at epoch 13: training: 1.993, validation: 1.185
Loss at epoch 14: training: 2.454, validation: 2.200
Loss at epoch 15: training: 1.641, validation: 0.588
Loss at epoch 16: training: 0.996, validation: 1.959
Loss at epoch 17: training: 0.912, validation: 0.674
Loss at epoch 18: training: 1.517, validation: 0.487
Loss at epoch 19: training: 1.569, validation: 1.202
Loss at epoch 20: training: 0.735, validation: 1.313
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
  </channel>
</rss>
