February 27, 2025
Artificial intelligence is evolving at an unprecedented pace—what does that mean for the future of technology, venture capital, business, and even our understanding of ourselves?
Award-winning journalist and writer Anil Ananthaswamy joins us to discuss his latest book, Why Machines Learn: The Elegant Math Behind Modern AI.
Anil helps us explore the journey and many breakthroughs that have propelled machine learning from simple perceptrons to the sophisticated algorithms shaping today’s AI revolution, powering GPT and other models. The discussion aims to demystify some of the underlying math that powers modern machine learning to help everyone grasp this technology impacting our lives, even if your last math class was in high school.
Anil walks us through the power of scaling laws, the shift from training to inference optimization, and the debate among AI’s pioneers about the road to AGI—should we be concerned, or are we still missing key pieces of the puzzle? The conversation also delves into AI’s philosophical implications—could understanding how machines learn help us better understand ourselves? And what challenges remain before AI systems can truly operate with agency?
If you enjoy this episode, please subscribe and leave us a review on your favorite podcast platform. Sign up for our newsletter at techsurgepodcast.com for exclusive insights and updates on upcoming TechSurge Live Summits.
Read Why Machines Learn, Anil's latest book on the math behind AI
https://www.amazon.com/Why-Machines-Learn-Elegant-Behind/dp/0593185749
Learn more about Anil Ananthaswamy’s work and writing
Watch Anil Ananthaswamy’s TED Talk on AI and intelligence
https://www.ted.com/speakers/anil_ananthaswamy
Discover the MIT Knight Science Journalism Fellowship that shaped Anil’s AI research
https://ksj.mit.edu/
Understand the Perceptron, the foundation of neural networks
https://en.wikipedia.org/wiki/Perceptron
Read the 1986 Nature paper by Rumelhart, Hinton, and Williams that established backpropagation
https://www.nature.com/articles/323533a0
00:00 The Future of AI: Agency and Exponential Growth
01:03 Introduction to Machine Learning and AI Concepts
04:50 The Journey of Writing 'Why Machines Learn'
10:07 Understanding the Perceptron and Its Significance
13:40 The Evolution of Neural Networks and AI Winters
19:35 Backpropagation: The Game Changer in Neural Networks
29:51 Convolutional Neural Networks: Advancements in Image Recognition
35:20 The Impact of Large Datasets on Machine Learning
38:12 Understanding Perceptrons and SVMs
41:39 The Impact of ImageNet on Machine Learning
44:44 The Evolution of GPUs in AI
48:09 Transformers: A New Era in Neural Networks
54:21 The Exponential Growth of AI Models
01:01:21 Diverging Views on AGI: Hinton vs. LeCun
01:05:08 The Common Thread in Anil's Work
01:09:40 Future Disruptions in AI and Machine Learning
Sriram Viswanathan: Today, we have a very special guest, Anil Ananthaswamy, a good friend of mine from quite a while ago with whom I reconnected more recently. Anil is an award-winning journalist and writer whose work delves into some of the deepest questions and fundamental truths we are grappling with in science and technology today. Anil is the author of several books, the latest of which is right here: Why Machines Learn: The Elegant Math Behind Modern AI. Anil, thank you for being here. It's really a pleasure to have you on this podcast.
Anil Ananthaswamy: Oh, thank you, Sriram. It's my pleasure to be on this podcast.
Sriram Viswanathan: It was just serendipity that you and I have known each other, and the last time I saw you was at the TED Talk you gave. As we were discussing before the podcast, you started writing this book in 2020, almost prescient, because the topic has become super hot right now and germane to all the technology buzz and excitement happening in AI, generative AI, AGI, and all these models. This book attempts to peel the onion on the underlying math and how it all comes together. When I read it, what became obvious to me is that the power of building something as sophisticated as what we're all used to today, which almost feels like magic when you interact with GPT or Llama or any of these systems, really rests on some foundational, simple concepts in basic math that we all studied in high school. You say in the book that it doesn't require a PhD in math, just a proficient understanding of basic math from high school. So perhaps that's a good place to start: what drove you to write this book, and what was the desire to answer this question about machine learning, why machines learn, but really look at it from the math standpoint?
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: Because I'm mindful of the fact that, you know, you started it before all this buzz started. So what drove you to do that in the first place?
Anil Ananthaswamy: You mentioned my other book, so I think maybe I can start by giving a little bit of background about how I even came to be thinking about writing about ML.
I used to be a software engineer before I became a journalist. And when I was doing most of my journalism, it was about particle physics, cosmology, even neuroscience, computational neuroscience, quantum mechanics. When I write about those subjects, I interview the researchers and try to understand what they're doing, but I don't have any notion of doing that work myself. That's not my expertise, and I was perfectly happy writing about it. But somewhere around 2015, 2016, I started noticing more and more stories coming my way that had some component of machine learning. And when I would interview the computer scientists, or whoever was doing the work, the software engineer in me got really intrigued. This was finally a topic where I could get my hands dirty, so to say. I could do the software; I could understand what they were saying in a much more visceral sense because of my background. So somewhere in there, the desire to learn the nuts and bolts of this technology grew.
And I got a wonderful opportunity in 2019 when I got a fellowship at MIT, the Knight Science Journalism Fellowship. And you have to do a project as part of that fellowship, and I basically proposed that I would, you know, build some simple deep learning system. And I knew nothing about the topic. So I took that opportunity to go back to school, literally and figuratively.
I ended up at...
Sriram Viswanathan: MIT.
Anil Ananthaswamy: At MIT, the fellowship allows you to take classes at Harvard, at MIT, any number of classes you want. So I was back sitting in CS 101. That's probably not the name of the course they have, but you know what I'm talking about: literally sitting with teenagers, learning Python programming all over again. And then I slowly bootstrapped myself and learned to implement some basic deep neural network stuff, and did my project. At that point, it was just mechanics. I was learning how to code; I could code some simple things and make them work. But at some point, when I was learning the math, I started seeing some of the simplicity and the beauty of the math, the elegance
Sriram Viswanathan: of it, the elegance of it.
Anil Ananthaswamy: And elegance, of course, people are going to argue about whether the math of machine learning is elegant or not; that's subjective. But I found it elegant. I clearly remember the moment when I felt, oh, I want to write something along these lines: it was when I encountered the perceptron convergence proof, Rosenblatt's perceptron. Frank Rosenblatt had designed the perceptron in the late 1950s, and it was a very simple algorithm, a single-layer neural network. Given two kinds of data that are linearly separable in some space, it could find the boundary that divides them. The algorithm was very simple, but if you just looked at the algorithm, it didn't tell you why this thing would work, right? And soon after the algorithm was developed, people started coming up with mathematical proofs of why it would work. And not only would it work: the proof guaranteed that the algorithm would find a solution, if one existed, in finite time. That's a pretty big deal in computer science, anytime you can set bounds on your algorithm and definitively say something about what it's going to do.
Sriram Viswanathan: If I may interject, I just want to make sure we're grounding ourselves here, because you're using a bunch of terms that a lot of our audience may not fully appreciate. So before we get into the specifics of the algorithm itself, can you back up for a second and tell us: when you wrote this book, you titled it Why Machines Learn rather than How Machines Learn. There's obviously a distinction. Why did you choose this title?
Anil Ananthaswamy: It ties in with the story I'm telling you. The how, to me, is the algorithm. If you look at the perceptron algorithm, it just tells you how that thing works, right? It doesn't tell you the rationale for why it should work. The why is hidden in the math.
Sriram Viswanathan: Yeah.
Anil Ananthaswamy: Right. When you look at the proof, it's a very simple proof that just requires a basic understanding of vectors and matrices, basic linear algebra, and suddenly you have this amazing proof that tells you why this thing should work.
Sriram Viswanathan: Which basically classifies a bunch of unstructured data into two categories based on some...
Anil Ananthaswamy: Let's say you have data that belongs to two categories, images of dogs and images of cats, and in some mathematical space they occupy different regions, such that there is a clear boundary between the two. This algorithm will find a boundary. It doesn't guarantee it'll find the optimal boundary, but it will find one boundary of the many that exist. And when I looked at that proof, that was what made me feel, oh, okay, I would love to explain this to my readers.
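To make Rosenblatt's learning rule concrete, here is a minimal sketch in Python with NumPy. The toy data, zero initialization, and epoch cap are illustrative assumptions, not taken from the episode or the book:

```python
import numpy as np

# Toy 2-D data: two linearly separable clusters, labeled -1 and +1.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)   # weights
b = 0.0           # bias

# Rosenblatt's rule: whenever a point lands on the wrong side of the
# boundary, nudge the weights toward (for +1) or away from (for -1) it.
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified point
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:   # no errors left: a separating line has been found
        break

print(w, b)  # the line w.x + b = 0 separates the two clusters
```

The convergence proof guarantees this loop stops after finitely many updates whenever a separating line exists; it says nothing about finding the best such line, which is the gap support vector machines later filled.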
Sriram Viswanathan: Rosenblatt is arguably the starting point of modern-day machine learning, and lots of things have happened since then to transition the sophistication of the math and of the learning algorithms into GPT-4 or DeepSeek today. But if you go back even before, maybe the early 60s, there was lots of work on AI, or what people thought was AI. There were some seminal things that changed. Can you talk about what changed the trajectory of the science, the evolution of the math, and the way people understood the algorithms? Or was there one thing?
Anil Ananthaswamy: There was no single thing, though you can always pick a single thread through this storyline, and one obvious thread is the story of how neural networks developed. But machine learning is not just neural networks, and that, I think, is part of the bigger story about machine learning. When Rosenblatt designed his perceptron algorithm, it was a single-layer neural network. It had only a single layer of artificial neurons. An artificial neuron is simply a computational unit: it takes in inputs, does some computation, and produces an output. And in Rosenblatt's perceptron, you could have just one layer, such that the inputs were coming in and the output was being collected on the other side.
Sriram Viswanathan: And the inputs being unstructured images of dogs and cats, and the output being some classification of these.
Anil Ananthaswamy: Exactly. So the input could be: if you had a 10-by-10 image of a dog or a 10-by-10 image of a cat, you would turn each pixel of that image into an input. So there would be 10 by 10, a hundred inputs, coming in to the single layer. And on the output side, you would have the network telling you zero for cat and one for dog, or something like that. Something very straightforward. But at the time, people didn't know how to train a network that had more than one layer. The moment you put another layer between the input and the output, the training algorithm that Rosenblatt had didn't work, and they didn't know how to get around that at the time.
Sriram Viswanathan: And that's when they gave up? Was that the lull period of early AI?
Anil Ananthaswamy: Yes. In the late 60s there was a very influential and seminal book published by Marvin Minsky and Seymour Papert; in fact, it was called Perceptrons, in honor of Rosenblatt's perceptron. It's a heavily mathematical book, a big tome, a wonderful piece of work. In that book, for instance, they have one of the most elegant proofs of why the perceptron algorithm should converge upon a solution. But they also had another proof, which was that the moment you gave the perceptron a very simple problem, called the XOR problem, it fails. The general idea is that if the two categories of data are not linearly separable, meaning there is no straight line you can draw between the two clusters, then this algorithm fails. And they proved it mathematically. They took a very simple case of that kind of nonlinear problem to show that even this simple thing can't be solved with a single layer.
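A minimal sketch of that failure, under the same illustrative setup as before: run the perceptron rule on XOR-labeled points and it never reaches zero mistakes, because no straight line separates the classes:

```python
import numpy as np

# XOR: opposite corners share a label; no straight line separates them.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

w, b = np.zeros(2), 0.0
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:
        break

print(mistakes)  # stays above 0 no matter how long you run: the updates cycle
```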
Sriram Viswanathan: So with additional layers, you can sort of fit the curve between the two categories.
Anil Ananthaswamy: Except what Marvin Minsky and Seymour Papert did was insinuate, without proving it, that multilayer networks would also not be able to solve the problem. No one proved it mathematically; it was just by association: because the single-layer network couldn't solve it, maybe the multilayer ones couldn't either. There was no proof, but that was enough for a lot of people to become extremely skeptical of the neural network agenda. So in the late sixties and early seventies, interest in neural network research dropped. That was kind of an AI winter, so to say. People stopped working on them. There was other stuff going on; for instance, there's an algorithm called k-nearest neighbors, which almost anyone who's done machine learning will encounter as one of the first, simplest algorithms you can design.
That was developed around then. Also around that time, mid-to-late sixties, there was a lot of work done with Bayesian classifiers, optimal Bayes, naive Bayes classifiers. And even this thing I said earlier, that Rosenblatt's perceptron finds some boundary between two clusters of data but won't find the most optimal one: people were also trying to figure out how to design optimal classifiers. There's a way you can divide two clusters of data such that the predictions you make minimize the error, because no matter what solution you come up with, there's always a chance you'll misclassify new data as this or that. How do you minimize that? So there was a lot of that work going on. But some people had not given up on neural networks. Probably the biggest name, a big name now but not then, is Geoff Hinton. He was doing his PhD at that time, in the late sixties and early seventies.
And he was convinced that neural networks were going to be important. And yet we didn't have a way to train multilayer neural networks.
Sriram Viswanathan: So Rosenblatt comes up with the single-layer perceptron neural network algorithm, and then they realize that more layers create a way for prediction accuracy to improve, and Hinton comes along and says there's a way to actually implement this concept called backpropagation. That makes a big difference in how capable these algorithms become at machine learning. So can you talk about the fundamental aha that Hinton had, enhancing the original Rosenblatt concept?
Anil Ananthaswamy: Before going to Hinton, we should probably mention another seminal neural network result, also from the late fifties. It happened here at Stanford, and this was Bernie Widrow and Ted Hoff.
Sriram Viswanathan: Ted Hoff of Intel fame. Yeah.
Anil Ananthaswamy: There's this wonderful story about how Ted Hoff was looking for an advisor for his thesis. He comes knocking on Bernie Widrow's door one Friday afternoon, this young kid, and Widrow starts explaining some of his work, to say maybe this is the research we could do. And in the course of that discussion, just a preliminary discussion that a potential advisor and a potential grad student are having, they design this thing called the least mean squares algorithm, LMS. I interviewed Bernie Widrow; he must have been in his late 80s or early 90s then, he's still around, an amazing man. He recalled that moment and told me, I wish I had had a camera at that point to take a shot of what we wrote on the blackboard, because I felt like we had discovered the answer to life and everything. It turns out that the algorithm they had invented is a very, very noisy form of something called gradient descent. And gradient descent is the central thing being used to train neural networks today. They had figured out, without using calculus, just algebra, a very noisy version of that algorithm. And they did this on a Friday evening. They wanted to build it in hardware, but the Stanford supply room was closed, so they went to the nearest hardware shop, bought all the things they needed, and over the weekend they soldered it all together. By Monday morning they had the world's first functioning single artificial neuron in hardware working on their desk. And the LMS algorithm is actually the true precursor to the backpropagation algorithm that you mentioned. Hinton's name is associated with backpropagation today, and I'll explain it in a minute, but even he will acknowledge that there were other people before him who had worked out various parts of it, including his postdoc advisor, David Rumelhart, who died at a very young age. If he had been alive today, I think he would be legitimately feted for the backpropagation algorithm, and Hinton would very much acknowledge that.
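Here is a minimal sketch of the Widrow-Hoff LMS idea in Python with NumPy; the synthetic regression data, learning rate, and number of passes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is roughly 3*x1 - 2*x2, plus a little noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)
lr = 0.05  # step size

# LMS: after EACH sample, step against that one sample's error.
# Each step is a noisy estimate of the true gradient of the mean
# squared error: the "very noisy gradient descent" described above.
for epoch in range(5):
    for xi, yi in zip(X, y):
        err = yi - np.dot(w, xi)   # error on this single sample
        w += lr * err * xi         # single-sample gradient step

print(w)  # close to [3, -2]
```

Updating from one sample at a time is essentially what is now called stochastic gradient descent, which is why LMS reads as such a direct precursor.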
Sriram Viswanathan: Just for the benefit of our audience: Ilya Sutskever and others at what is now OpenAI actually came from Hinton's lab at the University of Toronto.
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: So as you point out, backpropagation was a seminal event that took the gradient descent algorithm, which built on the Rosenblatt perceptron. A bunch of other things happened too, but I want to pause here for a second. On the gradient descent algorithm, there's this one picture in your book.
Anil Ananthaswamy: Hmm.
Sriram Viswanathan: Which all of us have probably seen. If you're flying over Asia, especially Vietnam, you'll see these terraced rice paddy fields. And that is a physical representation of how the algorithm works, if you picture farmers trying to get down to the bottom of the hill. Can you talk about gradient descent? What is that algorithm?
Anil Ananthaswamy: Yeah. The way to think about it is: you've got some machine learning model that you're training. You provide it an input, and it produces some output. The model can be anything; it doesn't have to be a neural network. In the beginning, the model's parameters are randomly initialized. You provide it an input, it produces some output. And if you're using this idea called supervised learning, you have training data in the form of input-output pairs, enough human-labeled or human-annotated data that tells you, for this input I should get this output. So in the beginning, if it produces a wrong output, you calculate the error it makes, which is the difference between what it predicts versus what it ought to have predicted. And if you formulate that error in terms of the parameters of the model, in some instances you get a function that is bowl-shaped.
Sriram Viswanathan: Yeah.
Anil Ananthaswamy: The error function is bowl-shaped. Not always, but if there are just two parameters, it'll be a three-dimensional paraboloid. So when the model makes an error, essentially you situate yourself somewhere high up on that paraboloid. And gradient descent is simply this idea that at every instance when you've made an error, you calculate the slope of that surface, and you take a small step in the negative direction of the slope. So you're coming down to the bottom of the bowl.
Sriram Viswanathan: So that you're minimizing the error rate.
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: So that by the time you're at the bottom of the bowl, at the lowest energy state,
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: the model has given you the perfect prediction of what the output needs to be. And conceivably that makes the difference between the right answer and the wrong answer you might get.
Anil Ananthaswamy: Yes. And this is what Bernie Widrow and Ted Hoff had figured out how to do algebraically, in a very noisy version. Noisy simply means that when you're at some location on that surface, if you go down precisely along the gradient, you're doing exact gradient descent. But if you just randomly walk around, hoping to come down, like a drunkard's walk, you'll eventually reach the bottom, but you can wander around a lot. That's what a noisy version would be.
Sriram Viswanathan: And this is why the analogy you had with the paddy fields makes sense. If you're trying to go down the paddy field the fastest way, you make these small steps downward, from one...
Anil Ananthaswamy: From one terrace, you just look around for the steepest way down to the next terrace, so you go down, wander around that terrace, find the next steepest path. And before you know it, you're at the bottom of the valley, right?
Sriram Viswanathan: Right.
Anil Ananthaswamy: And that's exactly what's happening in these algorithms: you formulate your error in terms of the parameters of the model in such a way that the function hopefully has a so-called convex shape. It's kind of weird, because we usually think of convex as bulging upward, but for functions, convex means bowl-shaped. If it's a convex function, then you can use gradient descent and be guaranteed that you will find what's called the global minimum. And literally, that is the basis of how these models are trained.
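A minimal sketch of gradient descent itself on a convex bowl; the quadratic error function, starting point, and step size are illustrative assumptions:

```python
import numpy as np

# A convex "bowl": error(w) = (w1 - 3)^2 + (w2 + 1)^2, minimum at (3, -1).
def gradient(w):
    return 2 * (w - np.array([3.0, -1.0]))

w = np.array([10.0, 10.0])   # start high up on the bowl
lr = 0.1                      # step size

for step in range(100):
    w -= lr * gradient(w)     # small step against the slope

print(w)  # ends at the global minimum (3, -1), guaranteed by convexity
```

Replacing the exact gradient with a per-sample estimate, as in the LMS sketch earlier, gives the noisy, wandering descent described above.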
Sriram Viswanathan: How does that go from there to Hinton's work on backpropagation? Because that's the precursor to where Hinton led.
Anil Ananthaswamy: So this algorithm, as we were saying, works really well if you have just a single layer. But the moment you put another layer between the input and the output, it was really hard to figure out how to propagate your gradient from the outermost layer back toward the input layers. People hadn't quite figured out how to do that. Essentially, when the network makes an error, you have to assign blame across the model's parameters, saying each parameter is responsible for this much of the error. This idea was there in the sixties. People doing research in rocketry, on control systems for controlling rockets in space, had some version of this idea in their algorithms. There were people in economics and other fields grappling with the same issue, because it's not just about training neural networks; it's about any model you're building where you might have this problem. So people had grappled with various versions of it. Rumelhart, Hinton, and Williams, in their 1986 Nature paper, were the first to very clearly show how to use this idea for training so-called deep neural networks, networks with one or more hidden layers between the input and the output. And they were the first to also show, if you use this algorithm, which they called backpropagation, what it is that the network learns. They really brought it all together to say: this is how we train neural networks with more than one hidden layer. And that changed the game. That was a big inflection point. Until then, we didn't even know how to train these networks.
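Here is a minimal sketch of backpropagation training a one-hidden-layer network on XOR, the very problem a single layer provably cannot solve. The layer sizes, sigmoid activations, learning rate, and seed are illustrative assumptions, and results can vary with initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR labels

# One hidden layer of 4 units, randomly initialized.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)       # hidden activations
    out = sigmoid(h @ W2 + b2)     # network output

    # Backward pass: propagate the error from the output layer back,
    # assigning each parameter its share of the "blame".
    d_out = (out - y) * out * (1 - out)     # output-layer error signal
    d_h = (d_out @ W2.T) * h * (1 - h)      # error pushed back to hidden layer

    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0] (varies with seed)
```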
Sriram Viswanathan: So this was a seminal moment for improving the accuracy of prediction. But there were also other substantial advancements in the algorithms themselves. Before we even got to transformers, the convolutional neural network happened.
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: So can you talk about what led to that leap? From backpropagation, they understood how to tweak the weights of the parameters, the blame assignment, as you said, to improve the accuracy of prediction. But then there was another evolution of that, to the convolutional neural network. Just tie that back for us.
Anil Ananthaswamy: Yeah, you're exactly right. The backpropagation algorithm was one inflection point: it allowed more accurate predictions, but it also changed the nature of the problems you could tackle.
Sriram Viswanathan: Right.
Anil Ananthaswamy: Right. Before multilayer neural networks, you could only tackle so-called linear problems. Now, suddenly, you could find nonlinear boundaries to separate or categorize data.
Sriram Viswanathan: Like, what is an example?
Anil Ananthaswamy: For instance, in two dimensions, if you have a bunch of data clustered around the origin, let's say circles, and then triangles scattered in an annular ring around the circles, with a gap between the two, and you want to find a curve that fits that gap: a single-layer neural network will not find it, because it can only find straight lines. But with multilayer neural networks, you can find that circular boundary, and suddenly you're able to separate the inside from the outside.
Sriram Viswanathan: This is profound, because one of our companies is called 5C, and another related company is called White Rabbit. Both of them are AI companies focused on the detection of cancer in mammograms.
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: So if you take a regular mammogram, an image, and you want to predict whether some part of that image resembles a tumor, what you just said is what enables that detection to become accurate, through the use of these nonlinear boundaries.
Anil Ananthaswamy: Absolutely. I would not be surprised if the companies you mentioned are using some form of convolutional neural networks, or maybe even transformers, today. It's a progression that began with the ability to train multilayer neural networks. Hinton did that. And then, somewhere in the nineties, Yann LeCun, who had independently come up with a version of the backpropagation algorithm, got in touch with Hinton, came to work with him in Toronto for a year, and around that time figured out the convolutional neural network architecture.
Sriram Viswanathan: So Yann LeCun, who currently heads Meta's AI work, with Llama and all that, is arguably the person who pushed the CNN.
Anil Ananthaswamy: Yeah. With the CNN, again, there is history there too. There were other people, Japanese researchers, a whole history of people working on various aspects of it. LeCun is credited with putting it all together to demonstrate a working prototype of what's today called LeNet. It was one of the first real applications of the backpropagation algorithm to train this thing called a convolutional neural network. And a convolutional neural network is a particular way of arranging the model parameters, a way in which the interconnections of the neurons are arranged, so that you're better able to pay attention to aspects of an image. For instance, one of the things you want when you're doing image recognition, or anything to do with images, is something called translation invariance. If you have an edge you want to detect, you should be able to detect that edge no matter where it is in the image.
Sriram Viswanathan: Whichever shape it is in, whichever orientation it is in.
Anil Ananthaswamy: Orientation would be rotational invariance; translation invariance is just moving it along one axis. The CNN allows you to architect the structure of your neural network so that it has these invariances. And once you detect edges or curves, you want to be able to compose those things. Let's say you're trying to recognize a cylinder. A cylinder is going to have straight lines, but also some curves. The first layers will recognize the basic features, the edges and the curves. The higher layers will start composing them to recognize more complex shapes, and before you know it, it has recognized a cylinder. And you can think of how something like that would work in the context of a dog or a cat: you recognize edges and curves, start putting those together, and in the higher layers you recognize a cat or a dog.
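A minimal sketch of the convolution at the heart of a CNN; the hand-made edge filter and toy image are illustrative assumptions (real CNNs learn their filters during training):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one small filter across every position of the image.
    Reusing the SAME weights everywhere is what lets a feature be
    detected wherever in the image it appears."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny vertical-edge detector, a hand-made stand-in for the kind of
# filter a CNN's first layer typically learns on its own.
edge_filter = np.array([[1.0, -1.0]])

img = np.zeros((4, 6))
img[:, 3:] = 1.0   # a vertical edge between columns 2 and 3

print(conv2d(img, edge_filter))  # strong response exactly at the edge,
                                 # whichever column the edge sits in
```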
Sriram Viswanathan: So this is a good point to touch on Fei-Fei Li's work at Stanford and...
Anil Ananthaswamy: ImageNet...
Sriram Viswanathan: ...played a big role, because they were, in a parallel universe, trying to solve image recognition using some of these early techniques, and there were tons of data. Can you talk about how the availability of this large dataset pushed the envelope on the whole recognition problem?
Anil Ananthaswamy: Yeah, there was a fair gap between Yann LeCun's early convolutional neural network work and Fei-Fei Li's work, which then led to Hinton, Ilya Sutskever, and Alex Krizhevsky's AlexNet. So let me give your listeners a bit of an idea of what happened in between, because that was actually an amazing time in machine learning. Yann LeCun creates his convolutional neural network, but it didn't take off in the industry, mainly because even he knew that neural networks required a lot of data to train. They were data-hungry things, and also computationally intensive. So while he had created the convolutional neural network, it was a very bespoke sort of implementation. Also, LeCun and a colleague, whose name I forget, had come up with a framework for implementing neural networks, but that framework was not open source; it was proprietary. So anyone else who wanted to implement a neural network would literally have to do a PhD project all over again. And so it kind of...
Sriram Viswanathan: It is ironic that Llama is now led by Yann LeCun, who is probably now the godfather of open source.
Anil Ananthaswamy: Yeah. Probably because of lessons learned from that time, right?
And so there were two things. One, the framework was not openly accessible. The other was that it was extremely data hungry, so people didn't really think it was going to be very useful, because no such data was available to train these networks. Then in the early 90s, another big development happened in machine learning: the development of these things called support vector machines.
SVMs just took off. They were very good at working with low amounts of data. And remember, we talked about Rosenblatt's linear classifier, the perceptron: it found a classification boundary between two clusters of data, but it was not guaranteed to find the optimal boundary. Support vector machines find an optimal boundary, and then they use something really amazing. One of the mathematical pieces in the book that I find most elegant is this idea of taking data that is in low dimensions, which may not have a linear boundary separating the two classes, and pushing the data into higher dimensions; in some suitably high dimension, that data linearly separates.
Sriram Viswanathan: So just give us a sense, in practical terms. I can understand you have a bunch of unstructured data, and you have a linear separation that you can do with what Rosenblatt's perceptron gives you.
Anil Ananthaswamy: Yeah, but the thing is that with the perceptron, if the image is complex enough, let's say cats and dogs, and you just turn the entire image into a long vector, say it's a thousand-pixel image, you turn that into one long vector of a thousand pixels lined up, and you feed that into your neural network, and it has to classify that as zero or one, depending on whether it's a cat or a dog. In thousand-dimensional space, the images of cats and dogs may not separate cleanly, in the sense that you could not possibly draw a hyperplane to separate the two. So the perceptron won't work. That's when you need to project this data from, say, thousand-dimensional space into million-dimensional space.
Sriram Viswanathan: So that's where SVMs, support vector machines and kernel methods, come in.
Anil Ananthaswamy: What support vector machines do is: if there is a gap between two sets of data, they will find a classifier that sits right in the middle of the two, so that when you then make predictions, you're more likely to be right.
Sriram Viswanathan: Yeah.
Anil Ananthaswamy: Right. But if you give a support vector machine data that is not linearly separable, it will also fail. So what they did was combine support vector machines with this amazing idea called kernel methods. Kernel methods project the data from, let's say, a thousand dimensions into a million dimensions, and in the million dimensions there is a gap between the two classes, so the SVM gets to work. And when you project that SVM boundary back into a thousand dimensions, you will find that it's a nonlinear curve. The really important, amazing part about kernel methods is this: in order for SVMs to work, they have to do some linear algebra using the data points. If in a thousand dimensions that costs you a certain amount of computation, moving the data to a million dimensions is going to just blow up the amount of computation you need to do.
Sriram Viswanathan: So the idea would have to be that the computation happens in the lowest dimension.
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: And then it abstracts that away, and you get the accuracy at the higher level, but the lower dimension is where the actual computation happens.
Anil Ananthaswamy: Yeah, exactly. Kernel methods allow you to project the data into high dimensions, but they do the computation in the lower dimensions. So the computational cost stays close to whatever you were doing in thousand-dimensional space, but the actual effect of the computation is as if you had done it in the million-dimensional space. And that became a big deal: throughout the late 90s and early 2000s, support vector machines were ruling the roost in machine learning. If you talk to industry veterans, they'll tell you that so much of the machine learning work at that time was support vector machines and kernel methods combined.
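A minimal sketch of the kernel trick with a degree-2 polynomial kernel; the vectors are illustrative, and the point is only that the cheap low-dimensional computation equals the expensive high-dimensional one:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# Explicit feature map for a degree-2 polynomial: all pairwise products.
# For d inputs this explodes to d^2 dimensions (d = 1,000 gives 1,000,000).
def phi(v):
    return np.outer(v, v).ravel()

in_big_space = np.dot(phi(x), phi(z))   # inner product in the huge space

# The kernel trick: the identical number from a cheap computation
# carried out entirely in the original, low-dimensional space.
via_kernel = np.dot(x, z) ** 2

print(in_big_space, via_kernel)   # both 20.25
```

Because an SVM only ever needs inner products between data points, swapping in the kernel gives it the effect of working in the huge space at the cost of the small one.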
So then, what you were referring to earlier: Fei-Fei Li started this project of saying, okay, we need good data.
Sriram Viswanathan: Fei-Fei Li is the professor at Stanford who started the ImageNet project.
Anil Ananthaswamy: The ImageNet project was this gigantic effort to create a dataset of images that could then be used for image recognition challenges. They literally created the largest curated dataset of images at that point. And it was a very big deal because...
Sriram Viswanathan: This involved having to actually take an image, do the tagging, and then train the algorithm to be able to recognize it.
Anil Ananthaswamy: At that point they were not; the ImageNet project itself was just creating the dataset. It was then open for others to use that dataset to train their classifiers, and it became part of a contest, the ImageNet image recognition challenge. In the first year or two of that contest, neural networks were not in the picture; standard ideas in image recognition were winning the competition. Then, in 2012, the team of Hinton, Ilya Sutskever, and Alex Krizhevsky created AlexNet. And that was the inflection point. There were two things that happened at that point that made it possible for AlexNet to win the competition. One was the availability of the data provided by ImageNet. The other was the use of GPUs. They were among the first teams to recognize that these GPUs, which were actually manufactured for a completely different purpose, could be co-opted for training deep neural networks.
Sriram Viswanathan: I want to put a pause on that, because this is an important point you're bringing up. You've laid out the early history of the development, and there are still a few other key inflections to come, including the 2017 transformer paper that happened at Google. But before we talk about that: each of these things that contributed to the evolution of this area made incremental progress. You alluded to Ted Hoff in the early days, an Intel guy who did the early classifier in hardware. Fast forward all the way, and you made the comment about GPUs. That's a big leap, from Intel as a CPU-centric compute engine to NVIDIA, which makes GPUs, becoming a much better targeted platform for all of this work. Just bridge that for us. What made the GPU a more ideal compute platform for this kind of algorithmic work? Why couldn't a traditional CPU, with its billions of transistors, be used?
Anil Ananthaswamy: Right. This is why I think the math of machine learning is so amazing; it's all tied to the way the math works. Think of what a neural network is doing, forgetting for a moment the nonlinear issues. You have input coming in, which is just a string of numbers; any data you have, whether it's text or images or audio, can be converted into a vector. The vector comes in, and on the output side you might be asking for another image, another snippet of audio, or just one number saying something is zero or one, a dog or a cat, whatever; it doesn't matter. So on the output side you also have a vector. The output vector could have just one number, or a million numbers to represent an image. So you have a vector coming in on the input side, a vector coming out on the output side, and in the middle is a huge matrix. You can think of your entire model as this huge matrix that multiplies the input vector and turns it into an output vector. It's implementing some function.
Sriram Viswanathan: Usually it's a sparse matrix.
Anil Ananthaswamy: Could be. It depends on the data and on the architecture; if it's a neural network, it can be dense. It depends on the problem and the data. But generally there's a matrix in the middle. And it turns out that GPUs were essentially built to update a matrix of numbers: the pixels of an image. They were gaming engines, really optimized to do fast calculations to update a matrix of pixel values.
Sriram Viswanathan: A scene in a game.
Anil Ananthaswamy: Yeah, anything. It's just a screen: you're updating your video screen at a very fast refresh rate. So they were really good at crunching numbers to update this matrix of numbers, and it was sitting right there. It wasn't designed for machine learning, but the underlying hardware was very good at exactly the kind of linear transformations you needed to do. And teams recognized that, Hinton's team in particular. Alex Krizhevsky was a whiz kid at programming GPUs, famous for AlexNet; AlexNet is named after him. And Ilya was the guy who saw the potential, the guy who convinced Hinton that, look, we need to do this, we will be able to crack this problem if we put enough compute on it.
Sriram Viswanathan: And this was early; it was not even the H100s, I assume. These were the early NVIDIA chips.
Anil Ananthaswamy: Yeah.
Sriram Viswanathan: That were being used for gaming.
Anil Ananthaswamy: Yeah, machine learning was nowhere in sight with regard to these GPUs. They were not designed or optimized particularly for these things.
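To make the vector-matrix picture concrete, here is a minimal sketch of a forward pass as matrix multiplies; the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A flattened 10x10 "image" becomes a vector of 100 numbers.
x = rng.random(100)

# Each layer is essentially one big matrix (plus a simple nonlinearity).
W1 = rng.normal(size=(100, 64))
W2 = rng.normal(size=(64, 10))

h = np.maximum(0, x @ W1)   # hidden layer: matrix multiply + ReLU
out = h @ W2                # output layer: another matrix multiply

print(out.shape)  # (10,): vector in, vector out, matrices in between
```

Those dense multiply-accumulate operations are exactly the workload GPUs were already optimized for when updating screens full of pixels.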
Sriram Viswanathan: So let's fast forward from there to the transformer, because that's another big evolution in the nature of the algorithm. What we currently use in GPT is a transformer architecture. Talk about that.
Anil Ananthaswamy: Yeah. Again, maybe for context, let me connect the dots, so that why the transformer came about will make sense. We had AlexNet and other convolutional neural networks. These were so-called feedforward neural networks: the input comes in, the computation proceeds in one direction, and the output comes out. There is no recurrence inside the network; the state of the output is not fed back. But there were other things happening in neural network design, called recurrent neural networks. In particular, there was something called long short-term memory networks, LSTMs. These were being used for what transformers are being used for today. Around 2014, 2015, Google was using LSTMs for machine translation, so-called sequence-to-sequence transformation. You have a sequence of words in one language, let's say English, and it has to be translated into German, a sequence of words in a different language.
And there is no one-to-one correspondence between the input positions and the output positions.
Sriram Viswanathan: And a lot of computation happening in the middle.
Anil Ananthaswamy: A lot of computation happening in the middle. The sequence is coming in one word at a time, and another sequence is coming out on the output side, one word at a time. And the sequence matters, so the network has to remember what has come before, up to a point, so that it can produce the right kind of output. LSTMs, for all the rage they were, had limitations: you could only remember so much of the incoming sequence.
If the sentence you were trying to translate was such that the hundredth word depended on the tenth word, that much information was hard to keep. So it had inherent limitations, and it was also sequential, so there was a performance problem: it was not parallelized.
Sriram Viswanathan: Computationally very expensive.
Anil Ananthaswamy: Yeah. 2017 was when the team at Google Research and Google Brain, the eight authors on that Attention Is All You Need paper, came up with the architecture, which basically says you can parallelize the entire operation of taking a set of input words and producing your output sequence, whatever that might be. They did it for machine translation. That paper is...
Sriram Viswanathan: The whole distinction there is really the parallelism they were able to achieve, versus the sequential nature of doing it with LSTMs before that.
Anil Ananthaswamy: Essentially, the way you can think about what the transformer does is: you start with a set of words, and these words have to be turned into vectors. You embed them into some vector space, so that they have some semantic meaning. And then you keep transforming those vectors, layer by layer; hence the name transformer. The transformation depends on what each word means to the other words. Let's say the word in the tenth position relies on something in the second position: it's paying attention to it, hence the word attention. Different words are paying attention to different parts of the sentence. You keep doing this layer by layer, and at the end of, say, ten transformer layers, you end up with a new set of vectors, a new embedding, so to say. And those can then be turned into words. If you're doing sequence-to-sequence, English to German, the final architecture is more complicated: you have to encode first and then decode. But essentially that's what it's doing.
It's transforming your input vectors into a new set of vectors that now contain the semantic information necessary to decode them into any language you might want. When the 2017 paper came out, they had built a very particular architecture, which required you to encode your data first and then decode it to produce a new sequence. With GPT, for instance, people realized that the same thing could be used for next-token prediction, for generating text. In this case, you start with the same thing: some set of words, convert them into vectors, and keep transforming those vectors so that you get a new set of vectors that has all the semantic information encoded in each position. The final vector represents the word you need to predict, the one that's supposed to follow. Let's say you had ten words and you're trying to predict the eleventh word. The vector you end up with at the very end, somehow, because of the series of transformations, will have all the necessary information for you to say: this is the most likely eleventh word.
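A minimal sketch of the attention step described above, for one layer and one head; the sentence length, embedding size, and random projection matrices are illustrative assumptions:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# 5 words, each embedded as an 8-dimensional vector.
X = rng.normal(size=(5, 8))

# Learned projections (random here) give each word a query, key, value.
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Every word scores every other word: how much should I attend to you?
scores = Q @ K.T / np.sqrt(8)
weights = softmax(scores)      # each row is a distribution over positions

# New vectors: attention-weighted mixtures of all the value vectors.
# Every position is computed at once; nothing here is sequential,
# which is the parallelism advantage over LSTMs discussed above.
out = weights @ V
print(out.shape)   # (5, 8): the transformed embeddings for the next layer
```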
Sriram Viswanathan: This is absolutely remarkable grounding, and I could talk to you for hours delving into this topic. But let's try to converge on some of the more recent battles in the market. Clearly the transformer architecture is evolving, constrained only by the underlying compute power you can throw at it.
Anil Ananthaswamy: Mm-hmm.
Sriram Viswanathan: And as you put more compute power in, you can obviously pre-train a number of these models. We won't get into open models versus closed models and all of that. But the reality is that this progression seems to have surprised everybody as to why it is going at such an exponential rate. Clearly, with Moore's law you can talk about compute power growing at a dramatic scale, but that doesn't seem to translate linearly into the improvements we're seeing in the underlying models and the pre-training that has happened. In fact, Sam Altman just this week said the models are becoming more efficient, something like 1200 times, almost every year. What causes that? Why is it exploding at this pace, in terms of the ability of these models to do tasks that we thought 18 months ago were simply not possible?
Anil Ananthaswamy: I think the way to think about why this might be happening is to think about what the transformer is learning. Every time you give it a set of words, it has to predict, on the output side, the most likely next word to follow. Let's say it has a vocabulary of a thousand words. For each of those thousand words, it has to predict a probability: the probability for word number one, word number two, all the way to word number one thousand. If you look at the transformer as a black box, it's essentially a function that takes in a vector, which is your bunch of words, and on the output side produces a vector for one word, the most likely next word. But on the inside, it has to calculate something called a conditional probability distribution over its entire vocabulary, no matter what input you give it, and then sample from it to produce the next word and say, okay, here's your output. This is an extremely high-dimensional calculation, done over extremely high-dimensional surfaces. And when you're doing computations in high dimensions, you have this problem called the curse of dimensionality.
You really require a lot of data, and you have to estimate a very complex shape. So when the models were small and the data was small, just three years ago, the shape of this probability distribution that had to be calculated was not very accurate, so the predictions were not accurate. But as you started throwing in more compute and more data, it just became better and better: every conceivable aspect of this probability distribution could be calculated. The pace of improvement is also partly a technological thing; they do all sorts of things to fine-tune their models. But in terms of why putting in more compute helps: you're training on more data, but also increasing the size of the model, and the more parameters you have, the better you are able to model this extremely complex, high-dimensional surface. So mathematically, it's not a surprise that they're getting better, because the moment you can correctly estimate all of the conditional probability distributions, then no matter what input you give, the model can calculate the appropriate conditional distribution and spit out the next word.
It's getting better at that. But what has been surprising is that that's all it's been trained to do: next-token prediction, next-word prediction. Nowhere in the training regimen is there any idea of solving a problem, or summarizing a text, or...
Sriram Viswanathan: writing poetry,
Anil Ananthaswamy: whatever, nothing, right?
None of that stuff is there. So you take this base model, this pre-trained model that only does next-word prediction, and you can fine-tune it to do these amazing tasks, just out of this very simple-sounding training procedure. That has surprised people.
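To make that concrete, here is a minimal Python sketch of the softmax-and-sample step Anil describes. The five-word vocabulary and the logits array are invented for illustration; in a real system the transformer itself produces those scores, over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Toy vocabulary; a real model has tens of thousands of tokens.
vocab = ["the", "cat", "sat", "on", "mat"]

def next_token_distribution(logits):
    """Softmax: turn raw scores into a conditional probability
    distribution P(next word | input) over the whole vocabulary."""
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()

# Pretend the network, given the prompt "the cat sat on the",
# produced these raw scores (in reality the transformer computes them).
logits = np.array([0.2, 0.1, 0.3, 0.4, 2.5])

probs = next_token_distribution(logits)
for word, p in zip(vocab, probs):
    print(f"{word:>4}: {p:.3f}")

# Greedy decoding takes the argmax; sampling draws from the distribution.
rng = np.random.default_rng(0)
print("greedy :", vocab[int(np.argmax(probs))])
print("sampled:", rng.choice(vocab, p=probs))
```

Everything downstream, from poetry to summarization, rides on how accurately that one distribution is estimated.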
Sriram Viswanathan: But is there an asymptotic max, or a decay, where this thing slows down? And why would it slow down?
Anil Ananthaswamy: Because with all curves in nature that we see rising exponentially, we're usually only seeing the part where they're rising. Eventually they tail off. It's always an S-curve: it starts slowly, rises steeply, and then ends up plateauing.
We've already seen that happening with large language models. A lot of the scaling laws that were empirically figured out in 2021 applied to the pre-training part, the stuff I just mentioned, and people had already started seeing the law of diminishing returns kicking in on the benefits they were getting.
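As a toy illustration of that S-curve shape, here is a short Python sketch using a logistic function. The constants are arbitrary and not fitted to any real scaling-law data; it only shows how apparently exponential growth eventually plateaus.

```python
import numpy as np

# Toy S-curve: what looks exponential early on eventually plateaus.
# Logistic function f(x) = L / (1 + exp(-k * (x - x0))); the constants
# below are illustrative, not fitted to any real model benchmark.
def logistic(x, L=100.0, k=1.0, x0=5.0):
    return L / (1.0 + np.exp(-k * (x - x0)))

for x in [0, 2, 4, 6, 8, 10]:
    cap = logistic(x)
    print(f"effort={x:2d}  capability={cap:6.2f}")
# Early steps multiply capability severalfold; later steps barely
# move it: the law of diminishing returns kicking in.
```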
Sriram Viswanathan: But is that largely because of the natural way these models are expected to behave, or is it because of a lack of data? One could argue you're running into the peak-data problem: you don't have enough data, which is why you have synthetic data. Right?
Anil Ananthaswamy: Yeah, all of those things play into it: lack of data, lack of good-quality data.
And synthetic data might end up skewing your model in a certain way, because it depends on what is generating that synthetic data. So there are lots of issues there. Essentially, in practical terms, that exponential curve was saturating.
Sriram Viswanathan: I see.
Anil Ananthaswamy: What has happened in the last six months has been very interesting. Instead of putting compute and data efforts into pre-training, we have shifted to providing more compute to the large language models once they are trained. Before, once you had a trained model and you asked it a query, it would just generate a series of tokens through many passes through the network, and that series of tokens, translated into words, represented your answer. If the question involved some sort of logic puzzle, some quote-unquote slow thinking, it wasn't doing that; it was just producing that output. But now we have developed the ability to spend more time generating, say, different versions of the answer, and sifting through those versions to find the right one. So you have moved the responsibility for how much compute you use: more compute is being expended on the inference side now, where the model spends more time generating a variety of tokens and potential chains of thought, so called, as it tries to answer the question. And so the scaling law has shifted. We have entered the regime of a new scaling law, where putting more compute there might give better performance for another stretch of time.
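A minimal sketch of this inference-time scaling idea, in the spirit of best-of-N sampling: generate several candidate answers and keep the best. The `generate` and `score` functions below are hypothetical stand-ins for a real model's sampler and a real verifier or reward model, not any actual API.

```python
import random

# "Spending more compute at inference": sample several candidate
# answers, score each one, and return the best.
def generate(prompt, rng):
    # Placeholder: a real system would sample a chain of thought
    # plus an answer from the language model.
    return f"candidate answer with quality {rng.random():.3f}"

def score(answer):
    # Placeholder verifier: here we just parse the fake quality number.
    return float(answer.rsplit(" ", 1)[-1])

def best_of_n(prompt, n, seed=0):
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# More samples (more inference compute) -> a better expected best answer.
for n in (1, 4, 16):
    print(n, "samples ->", best_of_n("solve this puzzle", n))
```

The new scaling knob is `n`: holding the trained model fixed, performance is bought with more sampling and selection rather than more pre-training.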
Sriram Viswanathan: That leads to a good question, because I think you touched on Yann LeCun and Geoffrey Hinton. They're cut from the same cloth, as it were, but the two of them have diverged pretty dramatically in their beliefs about where this evolution is headed, especially in the context of AGI. Hinton seems to be suggesting that this has to be reined in, otherwise it could really be disastrous, and LeCun doesn't think that. Can you shed some light on the core thesis of the argument on both sides?
Anil Ananthaswamy: You're right. Of the so-called three godfathers of AI, Geoffrey Hinton, Yoshua Bengio, and Yann LeCun, Hinton and Bengio have become very concerned about how we're going to rein in AI or make it safe. They are afraid of runaway processes where we may not be able to curb these AIs, especially agentic AI.
Right now, AIs are still kind of reactive: they're reacting to us, to our prompts and our inputs. But agentic AI is coming, where we're building AIs to be autonomous, and Hinton and Bengio are concerned about where that is going. They have been raising the alarm.
Yann LeCun hasn't gone down the same path. He has, in fact, argued that the current technology we have, LLMs in particular, is not enough to get us to AGI. He is very clear that we need other innovations to bring these AIs close to what we call AGI, artificial general intelligence.
So he's not concerned about the power of these systems, because he thinks this is not the path to AGI. But he also has a different view of safety: he doesn't want to worry about safety until we have built that thing. It's like anticipating a technology we haven't yet developed: how are we going to design safety mechanisms when we don't even know the design of the thing we're aiming for? So he has a different approach, and he's certainly less concerned than Hinton and Bengio. You're right that it's very interesting to watch how their opinions have diverged.
I don't know the inner workings of their minds, as to why they have taken these positions, but in their public pronouncements you can clearly see this difference.
Sriram Viswanathan: One of them is right. I don't see a scenario where both of them are right, obviously.
Anil Ananthaswamy: No, no, both of them could be right, in the sense that we may get the breakthroughs Yann LeCun is talking about that make these AIs truly more AGI-like.
AGI in its purest form is probably very far away, but we are going to get very powerful systems, especially when we make them autonomous, and it doesn't take too much for things to go wrong once you make things autonomous. So we might end up with some mix of the two concerns.
We'll see.
Sriram Viswanathan: Let's reel this back in. In this last section, I want to touch on some of your other books, which so wonderfully get into physics and some of these other areas.
There seems to be a common thread, and I'm trying to figure out what it is, other than saying that I find everything you write about fascinating; if I were not doing my day job, I would be following you around, trying to figure out how you think about things and how you learn and write.
So is there a common thread that drives you to write about these sorts of topics and problem statements?
Anil Ananthaswamy: Absolutely: a deep curiosity and an utter fascination with the natural world. If there's a common thread, it's that these books are all looking at issues of very deep concern to humanity.
The first book, The Edge of Physics, was about cosmology, about the universe: how did this universe come to be, from a physics perspective? The Man Who Wasn't There is about the nature of the human self: who am I? That's a question we ask ourselves all the time.
Sriram Viswanathan: There was a little bit of a spiritual tilt to the way you told the story in that book.
Anil Ananthaswamy: Yes and no, in the sense that I deliberately stayed away from religion. You said spiritual, not religious. Theology has had a lot to say about this question; in fact, all theologies have attempted some form of answer to the question, who am I?
In the prologue and epilogue of the book I do touch upon that, but the core of the book is about the science of this question. It's about neuroscience and neuropsychology; it's looking at the ways in which the self comes apart in conditions like schizophrenia or Alzheimer's, which disrupt the self in different ways. By looking at the different ways in which we come apart, the book builds a thesis about how we might be put together by the brain and body.
Sriram Viswanathan: But these are all key issues that one would have to grapple with as we try to build an AGI system, because a lot of how we think about ourselves, how we learn, and how we interpret the real world is going to get replicated in some form in the systems that we build.
Anil Ananthaswamy: Yes, especially if we have to build AIs that are not as data-hungry as they are right now. Humans consume nowhere close to the amount of data that AI requires to learn.
There is something about the way our brains work that is very efficient: the way the brain models the world outside, how it can predict and act in the world, and how we deal with the world symbolically, the way we reason. There is still something lacking in modern machine learning systems, and if we are to get closer and closer to the way humans function, we will most likely replicate some of these processes. We're going to end up with machines that have a sense of being a machine, and we will give them the agency to go and explore their world in a data-efficient manner, similar to humans. If that happens, these are questions that will need to be answered.
And it's actually quite profound that the answer to the big question of who am I, in the human context, might come when we build machines that are able to do something similar. Because we build them, we understand them better, and that might end up shedding light on what we are.
Sriram Viswanathan: It's almost self-reflection on how we think about ourselves as we build these complex systems. If you had a crystal ball and were to look at the next five to ten years, based on your research into the last 50 or 60 years of key disruptions and innovations, where do you think substantial nonlinear disruptions are likely to occur? I don't mean going from traditional CMOS or silicon to quantum computing or anything like that, but in terms of the software and the machine learning algorithmic work. Are there one or two hard nuts to crack that might fundamentally change the area?
Anil Ananthaswamy: I think so. Neuroscience has a certain way of explaining and understanding how our brains work.
There is a dominant paradigm now which says that our brains work by creating internal models of the world outside, and every time we get incoming information, the brain uses its internal models to infer what might be the external causes of that information. If there's light falling on your eyes, it's not that the brain is constructing everything bottom-up, detecting edges and curves and colors and shapes and then suddenly, aha, I'm looking at a dog. More likely, the brain has already constructed internal models of what a dog might be like, and it's predicting that the light falling on your eyes is most likely caused by this thing that is outside. So there is a kind of reversal of the processing we traditionally think of as happening.
The idea is that control systems like our brains have to build models of the world outside, with suitable abstractions, situate themselves, as agents, as selves, within that world model, and then use all of that to navigate and do whatever humans do. I think the AI folks have to crack that problem: how do you let a machine do exactly that? Navigate its world, build a model of its environment, situate itself within that model as a machine in that model, and do so in a data-efficient manner, where it's not asking for more and more data but is very selective about what data it looks for and uses it to fine-tune its own models. It means giving it agency. So there's this combination: giving a machine agency and making it data-efficient, so it can build its own models and function in the world. When they crack that problem, I think it will be another phase change in AI. And I think that's coming.
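For readers who want a concrete handle on that internal-model idea, here is a toy Python sketch of prediction-error minimization: an agent holds a belief about a hidden cause, predicts its sensory input from that belief, and nudges the belief to shrink the error. The linear generative model and all constants are invented for illustration, and this is not a claim about how the brain actually implements it.

```python
import numpy as np

# Toy "the model predicts its inputs" loop: belief about a hidden
# cause is updated to reduce the gap between predicted and actual
# sensation. All numbers are illustrative only.
rng = np.random.default_rng(42)

true_cause = 3.0      # the hidden state of the world
belief = 0.0          # the agent's internal estimate of that state
gain = 2.0            # generative model: sensation = gain * cause
learning_rate = 0.05

for step in range(20):
    sensation = gain * true_cause + rng.normal(0, 0.1)  # noisy bottom-up input
    prediction = gain * belief                          # top-down prediction
    error = sensation - prediction                      # prediction error
    belief += learning_rate * gain * error              # gradient step on error^2
    if step % 5 == 0:
        print(f"step {step:2d}  belief={belief:.3f}  error={error:+.3f}")

print("final belief:", round(belief, 3), "(true cause is 3.0)")
```

The belief converges toward the true cause without ever observing it directly: the "reversal of processing" in miniature, with perception cast as updating a model rather than assembling features bottom-up.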
Sriram Viswanathan: Would you say it's three years, five years, ten years?
Anil Ananthaswamy: There could be extremely intelligent systems that are just reactive things.
No one is going to argue that an LLM doesn't have some kind of intelligence, but it has no agency; we haven't built that into them. That's what agentic AI is beginning to do, where you have these external scaffoldings around LLMs that act like agentic things.
To me, those are still patchwork things. When you think of the human brain, it's one single unit, and it has evolved by itself. Over 500 million years of evolution, the nervous system has figured out what it needs to do for this body and brain to survive.
Nobody has been training it from outside; nature is doing this on its own. There's no in-principle reason we can't figure out the algorithmic processes that would get machines to this point, except that we don't quite know what those answers are yet.
So, in some way of thinking, these machines that we are building, these machines that we are trying to imagine, are also evolution in action.
Sriram Viswanathan: Yeah, that's for sure. Well, Anil, this has been remarkably fascinating for me personally. As I said early on in this podcast, we could talk about so many topics and subtrees within this general area, but thank you so much for giving us a glimpse of the way you think about it and what you have tried to capture in your current book, Why Machines Learn. For our audience: if you are even slightly curious, and technically open to delving into your high school math and areas of that nature, you will find this book absolutely fascinating, as you will some of the other books Anil has written. So thank you, Anil, for joining the podcast. It was truly a pleasure and an honor to have you on.
Anil Ananthaswamy: Thank you, Sriram, for having me. It was my pleasure entirely.
Sriram Viswanathan: Thanks for tuning in to the TechSurge podcast from Celesta Capital. If you enjoyed this episode, feel free to share it, subscribe or leave us a review on your favorite podcast platform. We'll be back every two weeks with more insights and discussions on all things deep tech.
Thank you very much. Bye for now.