Day 5 - Why neurons spike: SNN applications - Mihai Petrovici, Emre Neftci, Wolfgang Maass

                  Yesterday's hail (Photo credit: Yves Fregnac)

today's authors: Muhammad Aitsam, Eleni Nisioti, Tobi Delbruck

To spike or not to spike, that seems to be the question for many neuromorphic-computing researchers but was actually not the theme that the speakers chose to address.

Today's discussion may help us in our spiking-existential dread.

Before starting the lecture, we were informed of the proper use of napkins: you may use them to clean yourself but not to draw your arguments during lunch discussions. So you have been warned for the rest of the workshop. Florian Engert confessed to being the one who instigated this kind request from the hotel.

The lecture started with Mihai Petrovici providing illustrations of how spiking neural networks can compute gradients. 

Mihai said that they have two goals in mind: 

  • one is to use ideas from mathematics, electronics, etc. to better understand the brain. For example, it took the development of the electric motor to understand how the flagellum moves
  • the other is finding useful elements from biology to feed into applications (the trick lies in finding what "useful" means)


 We will talk about the general problem of learning. Let's say we have this network on the top:





 Mihai's drawings annotated by us with the names of variables


It could represent either a brain or a chip. The point is that there are neurons that receive inputs from their environment, exchange information with each other and give some output. Learning is adapting the parameters that determine how this network operates in order to reduce an error. The error is calculated as the difference between the output of the network and a target value. This value is provided by a teacher neuron. This is a supervised learning setting. 

This problem can be solved by a general family of algorithms that update each weight by computing the gradient of the loss with respect to that weight. This provably minimizes the error.

The challenge lies in communicating these errors across space and time taking into account memory constraints, energy efficiency and problems that come into play when scaling the problem size.

Mihai then drew a graph with neurons placed on a Cartesian plane, with the x-axis denoting time and the y-axis space. In order to compute these gradients, the neurons need to communicate in both space and time.

 




 

The most classical algorithm for computing these gradients is back-propagation through time (BPTT). Its complexity scales linearly in time and space. But the problem is that, to update a weight, the neuron needs to know the future (the outputs of the neurons between it and the output at later timesteps). For this reason BPTT cannot be applied in an online manner.
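To make concrete why the update needs the future, here is a minimal toy sketch (entirely our own, not from the lecture): BPTT for a one-weight linear "RNN". Note that the credit assigned to the weight at step t is only known once the errors from all later steps have been accumulated, which is exactly what forbids an online update.

```python
# Toy BPTT for h_t = w * h_{t-1} + x_t with loss L = sum_t (h_t - y_t)^2 / 2.
# All names and numbers are ours; this only illustrates the information flow.
import numpy as np

def bptt_grad(w, xs, ys):
    # forward pass: the whole trajectory must be stored (memory grows with T)
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)

    # backward pass: errors flow from the future to the past, so the credit
    # for step t is only available after all later steps have been processed
    dL_dw, dL_dh = 0.0, 0.0
    for t in range(len(xs), 0, -1):
        dL_dh += hs[t] - ys[t - 1]   # local error at step t plus accumulated future error
        dL_dw += dL_dh * hs[t - 1]   # credit to w for its use at step t
        dL_dh *= w                   # propagate the error one step further into the past
    return dL_dw

print(bptt_grad(0.5, np.array([1.0, 0.5, -0.2]), np.array([0.8, 0.9, 0.1])))
```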

Spoiler alert: we will see later that a spiking NN may actually be able to "know the future", essentially by acting like a differentiator that predicts the slope of its collective input.

The following discussion on bio-plausible BPTT (which includes a brief explanation of and references to all of the papers in https://wiki.local.capocaccia.cc/index.php/Fully-local-learning-through-space-and-time) is based on the arXiv paper Ellenberger et al. 2024.

An algorithm that attempts to bypass this problem is RTRL (Real-Time Recurrent Learning). Here we have a three-dimensional table M^i_{jk}, where jk indexes the synapse between two neurons, i indexes another neuron, and the entry is the corresponding partial derivative. This models the influence of any synapse on any neuron. It is a non-local operation and requires vast memory (O(n^3)). Ultimately, every neuron will be influenced somehow by every synapse, so this matrix can be dense. RTRL is the ugly duckling of the back-propagation family: it is as old as BPTT, but no one uses it because of the huge memory requirement.
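To make the memory cost tangible, here is a minimal sketch (our own, with our own variable names) of the RTRL bookkeeping for a small rate-based recurrent network. The tensor M[i, j, k] stands for ∂h_i/∂w_jk, our reading of the influence matrix, and it alone already costs O(n^3) memory.

```python
import numpy as np

n = 4                               # number of neurons
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(n, n))
h = np.zeros(n)                     # rates, h_t = tanh(W h_{t-1} + x_t)
M = np.zeros((n, n, n))             # M[i, j, k] ~ d h_i / d w_jk  ->  O(n^3) memory

for t in range(10):
    x = rng.normal(size=n)
    a = W @ h + x                                   # pre-activations
    # propagate the influence of every synapse on every neuron one step forward
    M_new = np.einsum('il,ljk->ijk', W, M)          # indirect influence through the recurrence
    M_new[np.arange(n), np.arange(n), :] += h       # direct effect of w_jk on neuron j
    M = (1.0 - np.tanh(a) ** 2)[:, None, None] * M_new
    h = np.tanh(a)
    # an online update would now be dL/dw_jk = sum_i error_i * M[i, j, k]
```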

Why call it M? This table is so difficult to compute that Mihai called it "magic". In his paper this matrix is called the influence matrix.

Then Mihai quickly presented Random Feedback Online Learning (RFOL), an attempt at reducing the complexity and addressing the non-locality of RTRL. Let's say you do not want to consider all possible influences across space. Maybe it is good enough to just remember the influences of the immediate pre-synaptic interactions. You can then use a reduced two-dimensional eligibility trace instead of the full three-dimensional influence tensor that RTRL formally requires.

Instead of computing the influence matrix we compute an eligibility trace: the product of the firing rate with φ, the derivative of the firing rate (not with respect to time but with respect to the activation of the neuron). You then pass this product through a low-pass filter. The eligibility trace gives you a record of the past, while the current error relates to the present. Applying this approach to spiking neurons gives you the so-called e-prop algorithm. This algorithm has O(N^2) complexity, which is better than RTRL but still not easily implementable on neuromorphic hardware.
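As a rough sketch of the idea (rate-based, heavily simplified, with our own names, so not the exact e-prop equations), one online update could look like this:

```python
import numpy as np

def eligibility_step(e_trace, r_pre, a_post, err_post, w, lr=1e-3, alpha=0.9):
    """One online update in the spirit of e-prop (simplified sketch).

    e_trace  : per-synapse eligibility trace (pre x post) -- the O(N^2) memory
    r_pre    : presynaptic activity
    a_post   : postsynaptic activations (inputs to the nonlinearity)
    err_post : current error signal at each postsynaptic neuron
    """
    phi_prime = 1.0 - np.tanh(a_post) ** 2            # derivative of the rate w.r.t. activation
    # low-pass filter of (presynaptic activity x postsynaptic sensitivity):
    # a purely local record of how much each synapse could have mattered recently
    e_trace = alpha * e_trace + np.outer(r_pre, phi_prime)
    # weight update: eligibility from the past x error from the present (no future needed)
    w = w + lr * e_trace * err_post[None, :]
    return e_trace, w
```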

We then moved to spikes. Mihai drew the spiking neuron below: we apply an input current that ramps up linearly over time and observe how the membrane potential changes and spikes along the way. What happens if we decrease the slope of the current? Then the first spike happens later. If we reduce the slope even more, the neuron will not fire at all (it's like pushing an oscillator very slowly). So the output of the neuron does not depend only on the input current but also on the rate at which it is changing. Let's see why this matters.
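A quick leaky integrate-and-fire simulation reproduces the blackboard experiment (all parameters here are toy values we picked, not numbers from the talk): a shallower ramp delays the first spike, and a sufficiently shallow ramp never drives the membrane over threshold within the simulated window.

```python
def first_spike_time(slope, tau=0.02, v_th=1.0, dt=1e-4, t_max=2.0):
    """Leaky integrator driven by a current ramp I(t) = slope * t.
    Returns the time of the first spike, or None if threshold is never reached
    within t_max. Toy parameters, chosen by us for illustration."""
    v = 0.0
    for step in range(int(t_max / dt)):
        t = step * dt
        v += dt * (-v / tau + slope * t)   # leaky integration of the ramp input
        if v >= v_th:
            return t
    return None

for slope in (200.0, 50.0, 10.0):          # steep, medium, shallow ramps
    print(slope, first_spike_time(slope))  # first spike comes later, then not at all
```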



Let's see an example that illustrates why the firing rate of the neuron depends on the derivative of the potential: 

You have a small population of neurons, you inject some current and you measure membrane potentials. If you plot the distribution of potentials, it will look like a Gaussian centred around the mean potential. The firing rate depends on the mean of this distribution. Now imagine you shift the distribution up: this will lead to more neurons firing. The speed with which the Gaussian moves over the threshold (which is the same as the rate at which it sweeps the area) is proportional to the firing rate.
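A small numerical check of this picture (entirely our own construction): if the potentials are spread as a Gaussian around a mean that is pushed upwards at some speed, the fraction of neurons crossing threshold per unit time is roughly the density at threshold times that speed, i.e. the population rate tracks the derivative of the mean potential.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, theta, dt = 1_000_000, 1.0, 1.0, 1e-2
offsets = rng.normal(0.0, sigma, n)        # fixed spread of potentials around the mean

for speed in (0.5, 1.0, 2.0):              # how fast the mean potential is pushed up
    below = offsets < theta                # below threshold now (mean = 0)
    above_next = speed * dt + offsets >= theta
    rate = np.sum(below & above_next) / (n * dt)          # measured crossing rate
    density = np.exp(-theta**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    print(speed, rate, speed * density)    # measured rate vs. p(threshold) * du/dt
```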



Mihai then drew another simple illustration to show why, once you know that the firing rate depends on the derivative of the potential, you know that the neuron can know the future and implement back-propagation:

Let's consider our equation r = u + du/dt. If we assume that u is a sinusoid, then we can easily show that r is also a sinusoid, shifted relative to u. The interesting bit is that, instead of being displaced towards the past, it is shifted forward in time. If a neuron can look into the future then it can compute a derivative by interpolating, so it may have all the ingredients it needs to implement back-propagation.
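Spelled out (our notation, taking the membrane time constant to be 1):

```latex
u(t) = \sin(\omega t)
\quad\Rightarrow\quad
r(t) = u + \dot{u}
     = \sin(\omega t) + \omega\cos(\omega t)
     = \sqrt{1+\omega^{2}}\,\sin(\omega t + \varphi),
\qquad \varphi = \arctan(\omega) > 0
```

so, up to a gain, r(t) is proportional to u(t + φ/ω): the output is the input evaluated a little bit in the future.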

We need to interject here a comment by Florian that is well known to all analog circuit designers: the integral is the past, the value is the present and the derivative is the future.

If this is true, we can rely on this ability of spiking neurons to perform supervised learning, rather than continue engineering algorithms.

What could this SNN look like? 


We have two parallel pathways: the forward pathway contains normal, slow neurons. We call them retrospective, as they look back in time. They carry errors. The other pathway is the one with prospective neurons. These carry signals. You need to carry information from one pathway to the other, and you need to do multiplications, so synapses are necessary. The error needs to be used to update the forward pass. This can be done with a fully-local learning rule. This network can compute M, our "magic" operator.


 
This network performs "magic" (or computes our M matrix)

Mihai closed his talk by drawing this diagram, accompanied by intense discussion that we blog authors did not try to follow.

 

The next discussion, led by Emre Neftci, provided a dive into (the really interesting bits of) training Large Language Models (LLMs):

The empirical scaling laws observed for these models may be telling us that neuromorphic computing holds the power to bring large improvements in training and/or inference efficiency.

He began by drawing curves that describe how the accuracy of LLMs scales with their cost. We only care here about the cost of training, which managers view as the bulk of the up-front cost, since inference is much less resource-demanding and the same model can serve any user without further training. From the discussion, however, it was not really clear how training and inference costs relate in practice: if a model costs 100M$ to train, how much is it actually used, and how do the electricity and GPU costs of serving it compare to the training cost? That information is not generally released by the big AI companies.


 
Scaling law of Large Language Models

The two axes are logarithmic. Loss, the vertical axis, is related to the performance you want to achieve in a downstream task (any possible task you would like to be able to solve at inference without re-training/fine-tuning). 

The horizontal axis is the cost/complexity of training. It is the product of the size of the model (N) and the size of the dataset (D). In this space of models (LLMs) there is so much data that a model can never overfit, and generally each sample is only seen once during training. This characteristic makes training LLMs very different from training smaller DNNs on labeled datasets, which most participants are more familiar with. I.e., LLMs are trained for less than one epoch, until the accuracy reaches some level or the money is burned up. 

The different lines in the graph correspond to different models: from top to bottom, we go to larger models, which have progressively lower loss. If we draw the Pareto front of these curves we get the scaling law: a line (which on a log-log scale means a power law) showing that bigger models achieve a better trade-off between loss and cost. 
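As a rough numerical illustration, a common parametric form for such scaling laws is the one fitted in the Chinchilla paper (Hoffmann et al., 2022), L(N, D) = E + A/N^α + B/D^β. The constants below are approximately the published fits; treat the exact numbers as illustrative only.

```python
# Chinchilla-style parametric scaling law; constants are roughly the published fits.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# Larger models trained on more data move down the Pareto front:
for N, D in [(1e9, 20e9), (70e9, 1.4e12), (400e9, 8e12)]:
    print(f"N={N:.0e}, D={D:.0e}, compute ~ 6*N*D = {6*N*D:.1e} FLOPs, loss = {loss(N, D):.2f}")
```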

It is important to note that we are assuming that the quality of the data is good. Garbage in, garbage out is still true for LLMs.

What would happen if we changed the technology used to train these models, to neuromorphic computing for example? Then we could expect to get a different curve, shifted in parallel to the existing one, as sketched by the possible orange curve. If the neuromorphic slope were steeper, thanks to its use of local learning rules or sparsity, the economic benefits of this shift could be significant.

The next speaker was Wolfgang Maass, Professor at Graz University of Technology.

Wolfgang started by laying out works that inspired him in the research plan he will talk about: works by Rodney, Tony Zador and Barabasi that draw links between biological and artificial spiking neural networks. A major influence was the work by (Billeh et al., 2020) on understanding how structural rules affect the functional properties of biological brains.

If we had to summarize Wolfgang's talk in six words, they would be: neuron types, not individual neurons, matter. Biological brains are very complex, and it is hard to know what kind of data we need, and when we have enough of it, to understand how function emerges from structure. Looking at how cell types connect to other cell types may be the best path we have. Compared to only looking at connectomes (how individual neurons connect to each other) and modelling them directly, we may be better off building what Wolfgang calls a probabilistic skeleton of the genetic code. Such a skeleton can be estimated from the fantastic Allen brain atlas dataset, which, in contrast to the Human Brain Project dataset, was completely open from its initial publication.



Let's see how you can train Wolfgang's model to solve a task, such as teaching a robot to locomote (picture a simulation of a four-legged robotic ant that tries to run as quickly as possible).

Wolfgang's model requires you to set a number of neuron types. The neurons are arranged in columns, where the distance between neurons in the same column is negligible. The trainable parameter of the model is a matrix whose rows and columns are neuron types and whose entries give the probability of those types connecting. When deciding whether to connect two neurons, the model combines this probability with the distance between them (how many columns apart they are). 

Thus we have a probabilistic model that tries to minimize distances and, when deciding on connections, only cares about neuron types, not the exact index of the neuron.
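Here is a minimal sketch of how we understood the sampling step; the exact distance dependence was not written down, so the exponential decay and all the names below are our guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_types, n_cols, per_col = 3, 10, 5

# trainable part: base connection probability for every (pre-type, post-type) pair
P_type = rng.uniform(0.0, 0.6, size=(n_types, n_types))

# a neuron only carries a type and a column index -- no individual identity
types = rng.integers(0, n_types, size=n_cols * per_col)
cols = np.repeat(np.arange(n_cols), per_col)

def sample_connectome(decay=1.0):
    """Draw one connectome from the probabilistic skeleton (our toy version)."""
    n = len(types)
    conn = np.zeros((n, n), dtype=bool)
    for a in range(n):
        for b in range(n):
            dist = abs(cols[a] - cols[b])                      # distance in columns
            p = P_type[types[a], types[b]] * np.exp(-decay * dist)
            conn[a, b] = rng.random() < p
    return conn

print(sample_connectome().mean())    # overall connection density of one sample
```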

Despite the simplicity of the model, Wolfgang's experiments showed that training the robot works.

Wolfgang asked us: is there some computation that this model cannot do?

 To answer this question Wolfgang talked to us about Cellular Automata (CAs), a computational model that exemplifies how you can get complex function from a simple model.

A CA is a grid of cells, where each cell is characterized by its state, a discrete value. At each time step, a cell looks at the states of all its neighbors and updates its own state using a rule chosen by the designer. For example: if more than 3 or fewer than 1 of my neighbors are in state 1, then I become 0. You then run the system for some timesteps, updating all cells at each step, and marvel at the patterns you get.

The cool thing is that even such simple rules can give you complex behavior, such as the glider creature discovered in a particular rule set called the Game of Life.
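For those who have never played with one, here is a minimal Game of Life step (the standard rules, our implementation) seeded with the glider; after four updates the whole pattern has moved one cell diagonally.

```python
import numpy as np

def life_step(grid):
    """One synchronous Game of Life update with periodic boundaries."""
    # count live neighbours by summing the 8 shifted copies of the grid
    nbrs = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    # a cell is born with exactly 3 live neighbours, survives with 2 or 3
    return (nbrs == 3) | (grid & ((nbrs == 2) | (nbrs == 3)))

grid = np.zeros((8, 8), dtype=bool)
for y, x in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:   # the classic glider
    grid[y, x] = True

for _ in range(4):
    grid = life_step(grid)
print(grid.astype(int))                                  # the glider, one cell further along
```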


 

Recently, CAs were extended to Neural CAs (NCAs): the rule is not fixed by you but is defined as a neural network that you can train with a target image (Mordvintsev et al., 2020). If you use a gecko as the target image, for example, you will get a grid of cells that has learned not just to make a gecko, but also to regenerate it if you delete some pixels:



Wolfgang then said that you can basically see his model as equivalent to an NCA, which proves his point that there is no constraint on the computation it can perform.

And what could be the benefits of this model?

  • we have redundancy in the model and the ability to regenerate parts of it that are removed. It is therefore more robust
  • optimizing a lower-dimensional representation (genome) and using a generative model of the phenotype (the probabilistic skeleton) introduces compression into the training, which may improve generalisation during inference. This is the Genomic Bottleneck argument of Tony Zador
  • this model can be useful on our path towards self-constructing machines. We can have feedback loops that improve the architecture during development

***

There was a poker session last night, and some people's luck spiked more than others':

The next morning found clear physical evidence of late-night revelry in the lobby, including a kind request from the hotel night man Mattia not to sleep on the couches and not to play loud music at 2 am in the lobby, since there are other guests at the hotel besides the CCNW participants.



Comments

  1. It is worth noting that many real neurons (cortical pyramidal, definitely) both spike and burst: that is, they can produce single spikes, but they can also produce short bursts of spikes that are about 2 ms apart. There have been suggestions that the latter sort are important in synaptic adaptation.


