Jamie Simon, UC Berkeley: On theoretical principles for how neural networks learn and generalize

June 29, 2023

RSS · Spotify · Apple Podcasts · Pocket Casts

Jamie Simon is a 4th-year Ph.D. student at UC Berkeley advised by Mike DeWeese, and also a Research Fellow with us at Generally Intelligent. He uses tools from theoretical physics to build a fundamental understanding of deep neural networks so they can be designed from first principles. In this episode, we discuss reverse-engineering kernels, the conservation of learnability during training, infinite-width neural networks, and much more.

Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.

Some highlights from our conversation

“I do think that the deeper idea of reverse-engineering kernels is powerful and probably holds across architectures. The central message isn’t really: here’s a particular theory of fully-connected networks. The central message is: let’s think about the inductive bias of architectures in kernel space directly, and see if we can do our design work in kernel space instead of in parameter space.”

“At first glance, the idea of an infinite-width neural network as a useful object of study sounds insane. Why should this be a reasonable limit to take? If we want to understand a neural network, which obviously has to be finite to do anything useful, how could we hope to learn anything by making it infinite? That’s especially baffling from the viewpoint of classical statistics, where you hope to find a parsimonious model; you want to wield Occam’s razor like a sword. So it seems baffling at first that this should be useful, but it turns out that a number of breakthrough results, especially around the early part of my PhD, found that some really non-trivial, insightful behavior emerges when you take this infinite-width limit.”
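To make the infinite-width limit concrete: for many architectures, the kernels of infinite-width networks can be computed exactly in closed form. Below is a minimal sketch using the neural-tangents JAX library (our choice here for illustration; the inputs are placeholders), which returns the exact kernel function of an infinite-width fully-connected ReLU network:

```python
import jax.numpy as jnp
from neural_tangents import stax

# Infinite-width 3-layer fully-connected ReLU network.
# stax.serial returns (init_fn, apply_fn, kernel_fn); kernel_fn is the
# closed-form kernel of the infinite-width limit.
_, _, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x_train = jnp.ones((10, 28 * 28))  # placeholder inputs for illustration
x_test = jnp.ones((5, 28 * 28))

# 'ntk' is the neural tangent kernel governing gradient-descent training;
# 'nngp' would give the Bayesian infinite-width (NNGP) kernel instead.
k_train = kernel_fn(x_train, x_train, get='ntk')  # shape (10, 10)
k_test = kernel_fn(x_test, x_train, get='ntk')    # shape (5, 10)
```

Note that the layer widths passed to `stax.Dense` only matter when sampling finite networks with `init_fn`; `kernel_fn` itself is the exact infinite-width limit.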

“In the case of infinite width: if the neural tangent kernel has only trivial, chance-level alignment with the target function of the data, the network won’t generalize on it. But in practice, we see very good alignment between this kernel object and the target function.”

“A question you could ask is: why do convolutional networks do better than fully-connected networks on image data? Well, it turns out their kernels have better alignment with image data.”
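One standard way to quantify this is kernel-target alignment in the sense of Cristianini et al.: the cosine similarity between the kernel matrix K and the target outer product y yᵀ. A minimal sketch (the function name is ours):

```python
import jax.numpy as jnp

def kernel_target_alignment(K, y):
    # A(K, y) = <K, y y^T>_F / (||K||_F * ||y y^T||_F)
    #         = (y^T K y) / (||K||_F * ||y||^2)
    # Values near 1 mean the kernel's dominant eigendirections line up
    # with the target labels y; chance-level alignment is near 0.
    return (y @ K @ y) / (jnp.linalg.norm(K, 'fro') * (y @ y))
```

Evaluating this for a convolutional kernel and a fully-connected kernel on the same image labels is one way to see the gap described in the quote above.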

“Although, interestingly, people have shown that if you take the neural tangent kernel of a network after training, then the real trained network looks a lot as if it had always had its final neural tangent kernel. So you don’t have to worry so much about the kernel’s evolution over time, only about where it ended up.”
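The “final kernel” here is the empirical neural tangent kernel of the trained network: the Gram matrix of the network outputs’ parameter gradients. A minimal sketch of computing it in JAX, assuming a scalar-output `apply_fn(params, x)` that maps a batch of inputs to a batch of outputs (the function names are ours):

```python
import jax
import jax.numpy as jnp

def empirical_ntk(apply_fn, params, x1, x2):
    """Theta[i, j] = <df(x1_i)/dp, df(x2_j)/dp> for scalar-output apply_fn."""
    # Jacobians of the batched outputs with respect to every parameter array.
    j1 = jax.jacobian(lambda p: apply_fn(p, x1))(params)
    j2 = jax.jacobian(lambda p: apply_fn(p, x2))(params)

    def flatten(jac, n):
        # Stack all per-parameter Jacobians into one (n, num_params) matrix.
        leaves = jax.tree_util.tree_leaves(jac)
        return jnp.concatenate([leaf.reshape(n, -1) for leaf in leaves], axis=1)

    return flatten(j1, x1.shape[0]) @ flatten(j2, x2.shape[0]).T
```

Computing this kernel before and after training, and comparing kernel regression with each against the trained network, is roughly the kind of experiment behind the observation in the quote.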

Referenced in this podcast

Thanks to Tessa Hall for editing the podcast.