Oral-History:Bernard Widrow

About Bernard Widrow

Bernard Widrow was born in 1929 in Norwich, CT. He graduated MIT in 1951 with an engineering degree and took a research position at MIT's Lincoln Laboratory digital computer lab. He finished his Ph.D. in 1956 at MIT and was subsequently appointed to MIT's faculty. That same year he moved to Stanford, and has remained on Stanford's faculty since that time. He is the recipient of the IEEE's Centennial Medal and the Alexander Graham Bell Medal (1986). His major research interests have been in the fields of pattern recognition, adaptive filters and adaptive controls, bioengineering, adaptive beam-forming, adaptive geophysical imaging, and particularly adaptive neural networks. Widrow is the co-author of two major engineering texts, Adaptive Signal Processing (with S. D. Stearns, 1985), and Adaptive Inverse Control (with E. Walach,1994). He holds fifteen patents and is the author or co-author of over 100 articles. Widrow is a fellow of the IEEE (1976) [fellow award for "contributions to adaptive antenna systems and to the theory of quantization error"] and the AAAS (1980), and he has received the IEEE's Centennial Award, the Alexander Graham Bell Award (1984), and the Neural Networks Pioneer Medal (1991). He has served as associate editor of three IEEE Transactions: Information Sciences, Pattern Recognition, and Circuits, Systems, and Signal Processing.

The bulk of the interview concerns Widrow's explanations of his solutions to several research problems related to signal processing. There is a lengthy discussion of his Ph.D. dissertation, which developed the theory of area sampling as a solution to quantization error. He also describes in depth his work on adaptive FIR filtering, which has applications in modem technology. Finally he discusses his development of an adaptive antenna in the late 1960s.

About the Interview

BERNARD WIDROW: An Interview Conducted by Andrew Goldstein, Center for the History of Electrical Engineering, 14 March 1997

Interview #329 for the Center for the History of Electrical Engineering, The Institute of Electrical and Electronics Engineers, Inc., and Rutgers, The State University of New Jersey

Copyright Statement

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, Rutgers - the State University, 39 Union Street, New Brunswick, NJ 08901-8538 USA. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

It is recommended that this oral history be cited as follows:
Bernard Widrow, an oral history conducted in 1997 by Andrew Goldstein, IEEE History Center, Rutgers University, New Brunswick, NJ, USA

Interview

Interview: Bernard Widrow
Interviewer: Andrew Goldstein
Date: 14 March 1997
Place: Stanford, California

Childhood and education

Goldstein:

Can we begin by talking about your early life, leading up to your education and how you became involved with signal processing?

Widrow:

Yes. I grew up in the small town of Norwich, Connecticut. I was born December 24th, 1929. Christmas Eve.

Goldstein:

As a gift for your parents.

Widrow:

Yes, a little Christmas present I suppose. Boy, what a surprise. As I grew up, my dream and hope was to be able to go to MIT and become an electrical engineer. As a kid, I played with electrical apparatus and radios, and I was completely hooked on electronics. But I didn't think that I had much of a chance at getting into MIT, as it was well known to be very competitive, and here I was coming from nowhere in this little town. But a very distinguished gentleman whom I never met before and never saw again asked me, "Sonny, what do you want to be when you grow up?" I told him, but I said, "I don't think I'm even going to apply to MIT. I could never get in." He said, "That's the wrong idea. You apply. You'll never forgive yourself if you don't. You can do it just as well as anybody else. You go ahead and do that." I said, "Yes sir, I'll do it." So I did. I went to high school close to our home, and I would very frequently come home for lunch. So one day in my senior year of high school, I came home for lunch and my father was waiting for me. He said that the medical doctor at MIT wanted to know about my health. I said to my father, "Well, why on earth would he care about that?" My father said, "You dope. They accepted you!" So off I went to MIT, and studied electrical engineering and electronics.

MIT

Digital Computer Laboratory

Goldstein:

What years were you there?

Widrow:

I began in 1947 and got my undergraduate degree in 1951. When I first went there, I thought that a great life would be to learn all I could about electrical engineering and get a really good job with a big electronics company or a big electrical house like GE or Westinghouse. That was my goal and ambition. But then as I went further, I began to hear people talking about graduate school. That sounded like a good idea. And so, I was able to get a research assistantship in the MIT Digital Computer Laboratory, which at that time was developing a machine called Whirlwind. It was all hand-made, and pieces of it are now in the Smithsonian. I began graduate study as an electronic engineer in the magnetic core memory group.

There was a great rivalry going on between RCA and MIT. A group at RCA, led by a very famous man named Jan Rajchman, was developing core memory, and they competed with the MIT group led by Jay Forrester. This was a competition that everybody got their heart and soul into. We had a slogan, "Hip-Hip-Array, Beat RCA." The array of course referred to the magnetic core memory array. I still have an array up on that shelf, right near the little robot. There was a time when I had the largest magnetic core memory on the planet on my workbench. It stored all of 256 bits. It was just remarkable how well it worked. Then other memories were built that were bigger, and pretty soon it became clear that the magnetic core memory was going to supplant all other forms of random access memory. So RAM was magnetic core until some people working at Fairchild decided to leave the company to develop integrated circuit memory. Nobody believed that you could afford to use a flip-flop to store 1 bit. But this group went off, and this led to the founding of Intel.

Goldstein:

Right.

Magnetic-core memory

Widrow:

So, my experience with computing started with work on memory. I always had a perspective on computers which is sort of a memory's eye view. I look for the memory and see what you have to connect around it. In my work for a master's degree I wanted to improve the signal-to-noise ratio of the sensing signal coming out of core memory. In those days, the materials for making magnetic cores were very poor compared to what they finally evolved into. What was needed was a very square hysteresis loop, and they couldn't get that exactly. The signals coming from the selected core in the memory plane containing the bit that the computer was trying to read became corrupted with noise, and sometimes the signal could be noisy enough to cause an error. I had the idea of driving the Cartesian x and y axis grid lines of the core with currents of two different frequencies, and I chose 10 mhz and 10.5 mhz. The core at the intersection would produce a beat frequency signal at a half a megahertz. I discovered that if the core was in the "zero" state, the beat frequency appeared at a certain phase. But if the core was flipped to the "one" state, then the phase of the beat frequency signal would be flipped 180 degrees. So you have to sense whether the phase was zero or 180 degrees, which would indicate a one or a zero. There were noise, interference at 10 mhz and its harmonics, and at 10.5 mhz and its harmonics. But the signal of interest was at half a mhz. With filtering, I was able to get a nice clean signal, and we were able, in principal, to make the memory planes much bigger. The bigger you made them, you see, the poorer was the signal-to-noise ratio. The more things you create, the more mischief you are going to find.

Goldstein:

Right.

Widrow:

So my master's thesis contained signal processing concepts. It proposed a completely different way to sense the information from a magnetic core.

Goldstein:

I guess in retrospect you can call it signal processing, but you did you use that terminology at the time?

Widrow:

The phrase "signal processing" was not used. But in fact, I was dealing with signals and determining what was happening with nonlinearity and mixing in the frequency domain. All these concepts were well understood. What was new was the way of putting this together to get information from a magnetic core in a memory plane. I was taking a graduate course called Sample Data Systems from Professor William Linvill. Today the same course might have two names. One would be Digital Signal Processing; the other would be Digital Control Systems. The course taught both signals and control. Since the signals were sampled in time, he called this sample data systems. There was something about the course that fascinated me, and Professor Linvill was just about the most popular professor in electrical engineering because of his personality, the nature of what he did, and the way he thought about things. Everybody wanted to do their theses with him. I could only dream about the possibility of joining his group and doing a thesis under his supervision.

What happened is that I discovered the beat frequency principle for magnetic-core memory, presented it to him, and asked him whether he'd be willing to supervise my master's thesis. He said, "Absolutely, this is a wonderful idea." So that’s how I began working with him. The fact that I got the highest grade on the final in his course helped. When I first started out at MIT, I didn't even know what a doctorate was. Well here I was finishing a master's degree and the time came to think about going on for a doctorate I thought, "Why go on for a doctorate?" But then I would hear people talking about doing that, and I just wondered, "could you really do that? Is it possible to do a thing like that, to do research that is really original." I think my first little inkling of the possibility of doing something original was the idea I came up with for increasing the signal-to-noise ratio in the magnetic core memory. So I asked him, and it was my lucky day.

Computing, mathematics

Widrow:

Professor Linvill was interested in signal processing and control. Whatever he was interested in, I would be interested in. My research focus switched from hardware to algorithms and mathematics. All my life up to that point I thought I was going to be a circuit guy. I loved vacuum tubes. They had sort of a glow about them.

Goldstein:

In more ways than one.

Widrow:

Yes. I just loved vacuum tubes, and I really understood vacuum tubes. The transistor was a mystery. The first experiments I did with transistors during my senior year were for a lab course to could choose your own experiments. Two transistors existed at MIT. They had somehow been secreted out of Bell Labs. Somebody told us that if you did something wrong and you burned out one of them, you would be burned out half the supply of transistors for the whole of New England. So we were very careful. We repeated the experiments that were published by Shockley, et al., in the Proceedings of the IRE but we couldn't get any gain of one out of those transistors. But it was obvious what their potential was, since you could make them so small and you didn't have to have a glowing filament. It was just an overwhelming idea. However, I switched from an interest in hardware to algorithms and mathematics. And there wasn’t any such thing as software.

Goldstein:

You mean when you were working Linvill?

Widrow:

Yes. Even then there were people writing programs, though we didn't call it "software." Today you'd say they were programming in machine language. There was no such thing as computer science. It was all electrical engineering. When I first got involved with computing, you could write programs that would probably have a few hundred lines in machine code. A lot of the initial programming was done by setting the binary code for the program in toggle switch registers, so that the initial programming was done with toggle switches. You should have seen the computer. You can now fit thousands of these switches in your wristwatch. At that time, the computer took the whole floor of a building, not to mention the whole basement, which was filled with air conditioning and motor-generator power supplies. It was all bolted to the concrete!

Goldstein:

Right.

Widrow:

I should mention another course I took. I had some roommates, one of them a graduate student at MIT and the other a graduate student in applied mathematics at Harvard named Mark Beran. He was really a mathematician. We were very close friends, and one day he was telling me about a course at Harvard given by David Middleton. I have forgotten what the title was, but today you call it Stochastic Processes or Stochastic Systems, or you might even call Random Signal Processing. Taking both of those courses it turned out to be a very fortunate decision in my case when they became interested in applying the computer to aircraft tracking for air defense.

Goldstein:

Who was behind that?

Widrow:

It was the navy and the Air Force, I think. Originally the sponsors were navy. The purpose was to track enemy bombers and our own interceptors and be able to guide interceptors up to meet the attacking bombers. The problem as it was presented to me was a two-dimensional problem. The computer only knew where these tracks were. We had radars connected to computers in real time and a room full of terminals, with military people looking at those displays of the radar outputs. The computer was very primitive. People had to be very clever to do things with such a computer. So the sky was divided up into a grid, with one mile by one mile regions, and all the computer sensed was that there was a target within that one mile by one mile square. There were squares all over, and you knew that the target would go from one square to the next square to the next square as the flight progressed. The problem was to estimate the target position to within a small fraction of a square, given this crude information about where the target is. This was a sampled data system because the radars were sampled at the radar pulse repetition frequency.

The radar pulses were transmitted periodically, and then periodically you got samples of the flight path. Say you take a piece of graph paper and draw and nice, smooth flight path for an airplane, and then at uniform points in time put a mark on the actual path based on the radar sample. If you don't show anybody that nice, smooth line but only the sequence of grid squares that the airplane was in, one could still reconstruct the smooth track.

Goldstein:

Right.

Quantization

Widrow:

As a result of this, I got very interested in the subject of quantization. This kind of quantization in two dimensions was too difficult a problem. But taking an analog signal and putting it in digital form is a one-dimensional quantization problem. Now if you have an analog signal that is a time function representing something physical in the real world, and you regularly sample the value, and you put those sample values into the digital computer. Then you can process that information, but you have to sample it. It's necessary to sample that signal. Otherwise you can't get it into the computer. But there's another thing that you also have to remember, which is that the computer only processes data that has a certain word length.

Goldstein:

Right. So there is finite precision there.

Widrow:

Yes. It's necessary to round off the value of each sample that you take so that the data fits what the machine needs to do the arithmetic. When you take a signal in the real world, in principal there's an infinite resolution. But you quantize it to the limit of the resolution of the machine. You're concerned with the size of the least significant digit and how much signal that represents. What you're doing is rounding the true value up or down to the center of the quantization box. Now the active rounding off of the signal makes it less accurate than the original signal, so the effect is to add in error. That's called quantization error, or sometimes quantization noise.

Goldstein:

Right.

Widrow:

I became interested in the nature of this quantization noise. I began to spend some time on this problem, and every now and then I'd stop by and see Professor Linvill to talk about what I was thinking about. After about six months of doing this, he said to me, "You know, you're dealing with a problem that is very difficult. This is a nonlinear problem. Superposition doesn't work. If you take two signals sampled at the same time, quantize the samples, add together the quantized signals, compare that with the original signals added together and quantized, you realize that the quantization of the sum is not generally equal to the sum of the quantization. People have been worrying about this and trying to understand it since man first began to use numbers."

He said, "You should really think about whether you think you are going to be able to make progress on this. Don't spend an infinite amount of time on this problem. It could be that it's impossible to do anything with it." But the trouble is that I just sort of fell in love with the problem. This was sort of foolish for a young guy. I wouldn't recommend that someone else do it. I wouldn't recommend it to myself to do it over again. But I couldn't leave that problem, so I kept on. In my office I had a nice blackboard with a chalk tray was at the bottom, and it was down low so that with a swivel chair you could roll up to it and put your feet up on the chalk tray. In that position you were practically tipping over backwards, and trying not to kill yourself, but you could just sit there staring at the blackboard. I drew a picture on the blackboard of a quantizer that I represented by a square box with a big Q representing quantizer. It had an input and an output with the signal coming out being a quantized version of what goes in. So this box was a nonlinear operator, the function of which I could diagram on the blackboard. This function was a staircase function.

Goldstein:

Right.

Widrow:

So this had a nonlinear transfer characteristic through it. I was trying to understand how signals went through this nonlinear operation. I had not really given much thought to probability density functions since I took Middleton's class. Middleton's class had probability density functions and Fourier transforms (although in statistics these are called characteristic functions or generating functions or moment generating functions). Electrical engineers of course, and even in those days, were very familiar with the use of Fourier and LaPlace transforms, and this is what we always referred to as "working in the frequency domain."

Goldstein:

Right.

Widrow:

So we were very familiar with Fourier transforms of signals. But in Middleton's course he dealt with characteristic functions. So I became a characteristic function freak and became very familiar with probability density functions and characteristic functions. I drew a picture on the blackboard in my office that morning of a probability density function. I wish I had taken a picture of it, because I can still see it in my mind. It sort of looked like a Gaussian probability density. I thought that if I knew the probability density going into the quantizer, what would the probability density look like coming out? I'll just draw a Gaussian because it's so easy to draw, but there's nothing about this that says it has to be Gaussian. You have a quantizer with the label Q, and here's the input. And what I called the input was the signal X, and I called the output coming out X'. So if you draw a picture of X' versus X, what you see is a staircase. You can draw a line through the steps, right through the origin, and you get a straight line approximation, where the average gain equals one. So it's more or less proportional and the output is more or less like the input, except not quite. It's hacked up.

This is the probability density of the input X, so we'll call this PDF of X. If we put a little grid here and make this Q/2, little q is the quantum step size, so this is Q/2. This is 3(Q/2), and then on the other side it's -(Q/2), and -3(Q/2), and then 5(Q/2) and so on, so that when the input is between plus orminus half a quantum box, the output is zero. When the input is between half a box and three halves of a box, the output is Q, and so forth. Now if we draw the quantization grid on the X axis, this is 3(Q/2), -3(Q/2), is -5(Q/2), and that's 5(Q/2) and so on. Now you draw vertical lines. But you can't do that everywhere. Here is a special place we are doing it. Look at the area under the probability density function between plus and minus half a quantum. I'll draw some shading to show that area. Whenever the input is between plus or minus half a quanta, the output is zero. Now the output can only take on certain discrete levels, but the input is continuously variable in amplitude. So the output is the area of the probability density between plus or minus half a quantum, that area gets smashed into an impulse. And the area of this impulse this is the probability of the output having the value zero.

Goldstein:

This is like a direct function?

Widrow:

It's exactly that. When the input is between half and three halves of a quanta, the output has a value equal to one quantum. The output probability is represented by an impulse located at Q.

Goldstein:

Right.

Widrow:

That's how you construct the output probability density from the input probability density. The input probability density is smooth. For example, it could have a nice, smooth Gaussian shape. But the output probability density is a string of Dirac delta functions. I looked at that and I said to myself, "This must be a sampling process." You see, what got me thinking along this line was from the experience I had in David Middleton's course at Harvard. They allowed us to ten percent of our courses at Harvard. Now thinking about Professor Linvill's class where we learned about sampling and Nyquist's sampling theory, I said to myself, "Holy smoke. That quantizer is like a sampling device probability. This was something like Nyquist's sampling, but it was doing it on the probability density function of the signal, not on the signal itself." When you converted from analog-to-digital, you were doing both sampling and quantization. When you converted a signal from analog-to-digital, you sampled in time and quantized in amplitude. A signal goes along in time on the horizontal axis, and in amplitude along the vertical axis of a graph. But you are doing two different kinds of things. The sampling in time turns out to be linear. But the quantizing in amplitude is nonlinear. In terms of probability, you could stop worrying about how the signal went through the quantizer and worry instead about how probability goes through. So now the quantizer was a probability density function which was smooth and continuous. The signal coming out, which was the probability density function of the output of the quantizer, was now a uniform string of delta functions. Now you worried only about ordinary Nyquist sampling, which is linear. What happens is that you start out going into the input of the sampler with a continuous smooth signal, and you come out with a string of delta functions. I'll draw you a signal. Let's say this is F of T, and this is time. When you sample it, you sample like that. You see, you are getting delta functions. We were representing the sample data. Linvill, in his doctoral thesis, looked at sampling in the frequency domain. He examined the Fourier and LaPlace transforms of the signal F of T and studied the properties of the samples where they were separated by t-seconds. This is three t. This is minus t, minus two t, and so on. So if you have a sampling, up above is quantization, down below is sampling. For linear sampling, you have a sampling device, F of T goes in, and what comes out is f* of T, and f* of T of the samples. These would be would be delta functions.

Goldstein:

It's a series of values.

Widrow:

It's a series of delta functions. You can represent it as a series of Kronecker deltas or of Dirac deltas. If you are going to take Fourier or LaPlace transforms of those samples, you have to use Dirac. Because if you do it with Kronecker you will get zero. The area of a Kronecker delta is zero, regardless of how big it is. So these impulses have to have area in order for them to have Fourier or LaPlace transforms.

So we were using Dirac delta functions to represent the samples of the signal. This type of sampling device was well understood, and Linvill was teaching this. I combined what he taught and what Middleton taught, and realized that for quantization you are putting probability in and getting a string of impulses coming out instead of putting a signal in and getting a string of impulses coming out. But there's a difference. These impulses have area equal to the amplitude of FFT at the sampling instant. They don't have area equal to the value of the probability density in the middle of the quantum boxes. They have an area equal to the area of the strip. So I gave this a name: area sampling. Area sampling is linear, because taking the area of something is a linear operation. So analyzing a nonlinear device could be done with linear analysis, and it was still a nonlinear device.

I found out was that you can use the sampling theorem to understand quantization. I used a linear theory to understand and explain something that's nonlinear. Now, I was not trying the cure the error. You cannot do that because it is irreversible after quantization. What I was interested in finding was the nature of the distortion, and how the damage related to the number of digits you carried (in terms of accuracy). It's clear that the more digits that you have, the smaller the least significant digit, and the smaller is distortion introduced by quantization. So how finely do you have to quantize in order to get a good representation of this signal? What do you mean by good representation? What is the nature of the distortion introduced by quantization? In sampling theory, there is a well known theorem called the Sampling Theorem that tells you that if you sample at a frequency which is at least twice as high as the highest frequency contained in the signal, that you can perfectly recover the signal from the samples.

Goldstein:

Is that Nyquist's Theorem?

Widrow:

Yes. There's an analogous quantizing theorem that I discovered that says that if you quantize in amplitude finely enough so that a Nyquist's condition on probability density is met relative to the shape of the probability density, then it turns out that the error that you inject, the distortion that you inject every time you quantize, has certain statistical properties. Here you are thinking of the probability density as a signal. Are you quantizing fine enough? Are you sampling that in amplitude fine enough? Do you have fine enough resolution and fine enough accuracy, and do you have enough digits that you are carrying in your digitizing so that you are satisfying a Nyquist's condition on the probability density? We are not talking about the signal; we are talking about the probability density.

Goldstein:

Right.

Widrow:

The quantization error is always bounded between plus minus half a quantum, because if the error tended to be greater than half a quantum, you'd jump to the next box. The question is, within the bounds, how is that error distributed? The answer is that if the quantizing theorem is satisfied that quantization error is uniformly distributed. Not approximately, but perfectly. That means that the error that you have injected, the additive distortion, is itself a signal. We call it a noise, or quantization noise. Since the quantization noise is uniformly distributed, it has zero mean, or the dc level is zero. Because of the uniform distribution, the mean square error of quantization is one-twelfth the square of a quantum box size. That's what I discovered.

Another thing I discovered was that if the quantizing condition is met, interesting relations will exist between the signal being quantized and the quantization noise. If the input signal to the quantizer has a definite level, then the output of the quantizer has a definite level. Wherever the input is on a uniform amplitude scale, it's within some quantum box. It will be quantized at the center of a quantum box. So for a definite input, there is a definite output. Then there is a definite difference, which is the quantization error. So the quantization error or the quantization noise is deterministically related to the signal being quantized. This is the strongest kind of statistical connection. But it turns out that in spite of the deterministic connection, if the conditions for the quantizing theorem are met, the noise of quantization is uncorrelated with the signal being quantized. So it acts like an uncorrelated noise added to the signal. That is the simplest kind of noise you can have. If the noise or interference is correlated with the signal, it is a difficult thing to deal with. But what could be simpler than an additive noise that has zero mean. Without even knowing the signal itself you know that if the condition for the quantizing theorem is met, the noise has zero mean, a mean square of /12 Q square, and it's uncorrelated with the signal being quantized. So I thought that was kind of interesting. Now we know something about the properties of the noise, and all we have to know about the properties of the signal is, whether it has a probability density that satisfies the quantizing theorem. Well, this question about the quantizing theorem is just like the analogous question about the sampling theorem. When you sample the voice of a basso, the frequency content is much lower than that of a soprano. So you might assume that you should run a higher sampling rate with a soprano than a basso. This is true. But what should the sampling rate be? What is the bandwidth of a basso? What is the bandwidth of a soprano? The answer is that it doesn't have a finite bandwidth. The spectrum just gets low enough so that after a while you don't worry about it.

Goldstein:

You mean the higher order harmonics? You just ignore them?

Widrow:

Yes. The higher frequency content is of such small amplitude that it's negligible. You can talk about the bandwidth of a soprano, but it isn't finite. You just sort of say, "Well, I'll draw the line right here." So you look at the Fourier transform of the probability density and do the same thing. The most popular probability density in the universe is Gaussian, right? But say you take a Gaussian probability density and the Fourier transform of it, trying to see whether the spectrum of a Gaussian shape is band limited. The Fourier transform of a Gaussian density is a Gaussian characteristic function. So in the frequency domain of probability density--this is the characteristic function domain--it is Gaussian and therefore not band limited. The spectrum of that probability density goes on forever. But it goes down with e to the minus variable square. So it drops down with a vengeance, so fast that you can have crude quantization and the conditions for satisfying the quantizing theorem are very close to being met. Although they are never perfectly met, they are close to being met even when the quantization grain size is as big as one standard deviation, which is to say very, very rough. You know, within a few quantum boxes you've got practically the entire area of the Gauss. Things are unlikely to happen beyond it. You know, with ten quantum levels, that's ten standard deviations. You've practically got the whole thing. This is very crude quantization. Physically you don't do it like that in most cases. The quantization granularity is generally much finer. Of course the more finely you quantize, the more closely this condition for the quantizing theorem is approximated.

With a Gaussian probability density and crude quantization, where the quantum box size is as big as one standard deviation, if the mean square of the quantization noise is one-twelfth Q2, you are wrong. Because it isn't. The quantizing theorem is not perfectly met. You are making a mistake of something on the order of one part and 1010. In other words, the quantizing theorem is incredibly close to being met. So closely met, that if you say it's met, you are telling a lie to the extent of one part in 1010, and furthermore you will have an error in the mean of something like that. The input to the quantizer could have a dc level or a mean level. And the mean of the output of the quantizer would be the same as the mean of the input because the additive noise has zero mean to within something like one part in 1010 or 1012 of a quantum box size. Essentially, the mean of the output of the quantizer would be the same as the mean of the input. The mean square of the output would be equal to the mean square of the input plus 1/12 Q2, because the additive quantization noise is uncorrelated with the input. You can measure all the moments--the mean, the mean square, the mean cubed, the mean to the fourth, etc.--of the output of the quantizer, and from them you can calculate the corresponding moment values of the quantizer input even though you don't that input anymore; you've already quantized it.

You can take a signal over time and sample it. Let's say you are doing linear sampling with perfect resolution, no quantization. If you sample frequently enough, these samples are going to be correlated with each other over time. Just like if you took samples of the Dow Jones Industrial Average curve over say the last fifty years. If you look at the sample today, it's going to be correlated with what it was yesterday. There will be some differences, but over time it is correlated. The question is, if you take a signal and sample it so that the samples would be correlated over time--and you are digitizing now so that you are quantized--what can you say about the quantization noise? How is it correlated over time? Answering this takes a higher order probability density, because we are talking about aspects of the joint probability between adjacent input samples. If we quantize finely enough for the input joint probability density to satisfy the quantizing theorem in higher dimensions, it turns out that not only is the quantization noise uncorrelated with the signal being quantized, but even though adjacent samples going into the quantizer could be highly correlated with each other, the adjacent quantization noises that correspond are almost perfectly uncorrelated with each other. They would be perfectly uncorrelated if a high-dimensional quantizing theorem were satisfied. So let's take our Gaussian signal and put it into the quantizer and quantize roughly but let the grain size be one standard deviation. If two adjacent samples going into the quantizer are ninety-nine percent correlated, almost identical to each other, the quantization noises corresponding is less than one percent. So the signal going into the quantizer could be highly correlated over time and the additive quantization noise is essentially uncorrelated over time. In other words, it's white noise.

The story is that when you treat quantization, if the quantizing theorem and theorems are satisfied, then the quantization noise is uniformly distributed and uncorrelated with the signal being quantized. It's an additive noise even though the signal coming into the quantizer is highly correlated over time. It's the easiest thing to deal with. Now you can analyze digital systems with quantization. Suppose that you have a quantizer inside a digital filter. You can calculate the quantization noise coming out at the filter output. The whole system now can be treated as a linear system, even though it's nonlinear. Because you take the quantizer out and replace it with a source of additive noise. And everything is linear, so you can examine how the noise propagates through the system that is otherwise linear. As long as the quantizing theorem is satisfied to a good approximation at the quantizer input, and this is usually the case.

Goldstein:

That was your Ph.D. thesis?

Widrow:

Yes. In my opinion, this was the best piece of work I ever did in my whole life. I was twenty-five years old.

Goldstein:

Did you follow how it was picked up?

Widrow:

Well, I wrote three papers on it. One was in the Transactions of the IRE. I wrote another paper in one of the journals of the AIEE, and I presented a paper on another aspect of the subject at a conference called Wescon.

Stanford

Widrow:

Wescon was at the Cow Palace near San Francisco. I was staying in San Francisco, and I had never been in California before. After the conference, I decided to take two weeks and just get a car, and not caring or knowing where I'm going or what I'm doing, just drive around. I rented a car from Hertz. It was a 1957 Chevrolet I remember, because my uncle had one like it, and I started to drive. I was driving south when I saw signs to highway 101. I didn't have a map, you know, I was just driving, and I thought that 101 might be like Highway 1 in the east, and it turns out that it is. It's a north-south that goes basically from Mexico up to Canada.

I got on, and just happened to get on the southbound side. Pretty soon I saw signs to Palo Alto. I said, "Gee, that's where Stanford is." I knew some people at Stanford. I kept on going, until I saw signs to Palo Alto University Avenue. I said "Bingo, that's it." I got off at University Avenue and went right through this beautiful little town, Palo Alto, and then crossed the El Camino going on to the campus down Palm Drive. This is a main road that comes into the campus, lined with palm trees. I mean, this was so tropical, so exotic, it was mind-blowing. I saw this campus and I said, "Boy, what a gorgeous place that would be to spend a career." It turns out that my thesis advisor, Bill Linvill, had an identical twin brother, John Linvill, who was on the EE faculty at Stanford.

Goldstein:

That's a smooth transition.

Widrow:

That's what happened. I developed a new mentor. When I came to Stanford and began working on adaptive systems and neural networks, the other Professor Linvill, John, became very intrigued with it, because he was interested in solid-state electronics and he could see a big area of application for electronic implementation of new kinds of circuits that "learned." So that's what we were building when I first came to Stanford.

Quantization and neural networks; reception and applications

Goldstein:

Are you aware of how your work in quantization found applications?

Widrow:

I published these three papers, and then I did something that would be academically not so smart. I became interested in, even fascinated by, learning systems, and changed the direction of my academic career. I was pretty young and pretty naive but I knew the significance of what I did. I knew it that day when I came into my office and I drew that Gaussian-like density on the blackboard. I knew it as I began to make all these discoveries about the properties of quantization noise, every day I couldn't wait to get to work. I was making more and more discoveries. It was sort of like being an explorer. I knew that no one had ever thought of this before. I just knew it. In my mind it was like going into a cave and discovering prehistoric things that hadn't been seen for thousands of years.

Goldstein:

Well did that make you want to rush out and tell people?

Widrow:

Of course. But then something happened. In the summer of 1956 I completed my thesis, received the doctoral degree, and joined the faculty at MIT. In 1957 or '58, I came out to San Francisco and gave a talk at Wescon on some aspects of that thesis work. That was the first time I gave a talk to a major audience. I was scared to death. The biggest group I have talked to was on the subject of neural nets, and that was at this conference in San Diego June 21 to 24, 1987. There were two thousand people. So that was the biggest group I ever talked to. That talk went very well. I was just in a good mood that night, everybody was in a good mood and all enthusiastic about neural networks. When I got done, I knew I had given a good talk, but I was not prepared to see the whole two thousand people get up and give me a standing ovation.

Goldstein:

That's nice.

Widrow:

Some of my colleagues were there. One, a good friend, said, "Never in my life have I ever seen anything like that at a scientific meeting, where people get up and give a standing ovation to a talk." I went over some parts of the history of neural networks and demonstrated historic things that were built in the '60s that still work.

To answer your question about what happened when I did this. First of all, there was no field of digital signal processing at the time. There was no name for this stuff. We called it sampled data systems. There just was no formal structure of that type, and the field just wasn't established. The number of computers that existed in the whole world at that time you can practically count on the fingers of one hand. Actually it wasn't quite like that. MIT had a computer, it was an IBM. Today you would call it sort of a mainframe. I think it was an IBM 701, and it had a magnetic core memory in it, and it was just like a jewel. It was identical to the core memories that we built, except that all the memory planes were chrome-plated. We used gray wrinkle finish. They made it really pretty.

Artificial intelligence

Widrow:

I had finished the work on quantization, then joined the faculty during the summer vacation. We were in a U.S. Navy-sponsored lab developing all kinds of new circuits for computing. A colleague named Ken Shoulders was working on field effect devices, and his dream was to make integrated circuits. There was no such thing at the time. The first person I ever heard of talking about making active electronic components and the wiring all at the same time was Shoulders.

We had been working on this since 1950, years before integrated circuited. He was thinking of a new device. He was a brilliant guy and "crazy Texan." He didn't like semiconductors, and said they were "no account." He liked field effect devices. He was trying to make a microscopic field effect device. He was kind of a visionary, and he told me that some people were doing this stuff called artificial intelligence. He wanted to find out what it was about. There was a meeting held at Dartmouth College in the summer of 1956, and he said that he'd heard about it, and it was an open meeting. The two of us got into his car and went up Hanover, New Hampshire to attend the meeting. We stayed there for a week. Minsky was there, McCarthy was there. It would be fascinating if we had only taken a video, except that video didn't exist. Maybe if we had just taken photographs of the people who were there. It was an incredible collection of people. They were talking about making a brain-like machine. This thing hit me so hard that I never got over it. This has been a monkey on my back. It's still there.

So when I came back to MIT, what I wanted to work on was artificial intelligence. I wanted to do something to build a brain. I wanted to build a machine that thinks. I stopped working on quantization. I had written three papers, I thought I'd done the job, that's all you have to do. In those days we thought that if everybody at MIT knew what you did, that's all that was necessary. The sun rose and set in Cambridge, Massachusetts.

One thing that was a big surprise when I left MIT and came here to Stanford was that I found out that there is a whole world outside of MIT. It was amazing. There were all kinds of smart people all over the place, thinking of all kinds of things. So during the summer of 1956, I began work on artificial intelligence. If I had been a real smart guy and had known how to play the academic game, I would have milked that thing for the rest of my life. You know, when you discover the whole theory of something reasonably major, you spend your whole life on it. But I was young and foolish, and I was absolutely captivated with the concept of artificial intelligence.

Publications; adaptive filtering

Goldstein:

So what approach did you try?

Widrow:

The new discoveries in quantization theory are amazing. We have discovered new things based on the original bedrock, but now we are applying this to floating point quantization. Now I'm in the process of writing my third book, which is on quantization. One book was on adaptive signal processing, written with Sam Stearns, and the second book with Eugene Walach is called Adaptive Inverse Control, which uses adaptive filtering and signal processing methods to do adaptive control. We developed an entire theory of adaptive control based on adaptive filtering. It's sort of a signal processing approach to control systems.

Goldstein:

Can we talk about how you got involved with adaptive filtering?

Widrow:

Certainly. I'm working on this third book with a young professor, Istvan Kollar, who is teaching at a university in Budapest in Hungary. He was working for ten years on the theory of quantization before I met him. I had always intended to go back and write a book on quantization. You see I had developed this way back then, and when you read books on digital signal processing, on DSP, the books tell you about this quantization, that the noise is white noise and is zero mean and it's uniformly distributed, and the mean square of the noise is one-twelfth Q2. They don't tell you where it came from, and they don't tell you what the conditions are under which this is true and when it's not true. They do it with great fear and trepidation, and no one, no book has discussed the theory, yet the theory has been out there for years. There are very, very few people in the world that understand that theory, but it's so simple. All you have to understand is Nyquist's sampling theory and you know a little bit about probability theory. Anyone who understands digital signal processing with an hour of instruction from Professor Widrow would understand the theory and would relate to things that they already know. How do you take what you already know to solve a new problem? The problem of quantization noise is one that you deal with but you kind of ignore. That's why we have to write this book.

Anybody could have written this book for me, but so far nobody has. Over the years to come many people will rewrite the book for us and add their own tweaks to it. The books that I write always work. I never write things that summarize the field. I only write about things I've done myself. Dr. Kollar has extended the subject, to develop a theory for floating point quantization. Because that's what's done today. It wasn't done way back then when I had done the original work. We found a way to use the ancient work to apply to the new problems and got very similar and amazing results with floating point. We have developed a nice theory of floating point quantization that anybody can understand. It just takes a little more effort to do it.

So, how did I get into adaptive filtering? Well, I came back to MIT after a conference at Dartmouth College, and I spent about six months thinking about what thinking is about; trying to understand thinking. I still hold the same theories today as to how thinking works.

MIT assistant professorshi; digital filters and sample data systems

Goldstein:

Were you now a research assistant at MIT? Did you have your degree?

Widrow:

No, I already had the degree. I made a transformation from research assistant to assistant professor. The year that I graduated was an interesting year. I can remember a couple of guys that got their doctorates at the same time. One was Amar Bose from Bose Loudspeakers. The other was George Turin, who is on the faculty in EE at Berkeley. Another one is Herb Woodson, who stayed at MIT for quite a while and then joined the faculty at UT in Austin, Texas. He's just retired. And I'm trying to think who else. There were twenty of us that got doctorates in EE that year in June of '56.

I think that ten of us out of the twenty joined the faculty and stayed there for a few years. You become an assistant professor, but it was sort of a professor training program. The only one of the ten as far as I know that's still there is Bose. Everybody else went somewhere else, and I came here. Now I've been on this faculty since 1959. I taught at MIT for three years and then came here.

When I came back to MIT after the conference at Dartmouth I began thinking about thinking, and I began thinking about how do you make a machine that can do it. I came to the conclusion that it would be about twenty years before anything of any consequence in engineering would come from an attempt to build a machine that thinks. I knew enough. I was pretty naive, but I knew enough to know that if one were interested in an academic career in engineering then one could not do something that would have a potential payoff only after twenty-five years. That was not going to work if you wanted to stay in academia. Because they have this business of "publish or perish." The thing of it is, you didn't want to publish something unless it was definitive. People would do a doctoral thesis or do research without publishing very much. It was sufficient if people at MIT knew what you were doing. You didn't really have to put it in the IEEE publications. You certainly wouldn't publish anything unless it were very highly developed and very well proven. Today you wouldn't want to do a doctoral thesis unless it was going to be earth-shattering. A young kid starting a doctoral thesis, stepping up to home plate with a bat in his hand, knows that what he's got to hit the home run. Just getting on base is not sufficient.

When I came here, the attitude was quite different. People didn't intend to do earth-shattering things for their doctoral thesis. What they intended to do was just do a piece of work and get it done and go out in the real world and do something. I found that people were doing quite interesting theses. I had an office mate who came to work every day drawing diagrams on the same blackboard, and he stayed there for about two, two and a half years. I never could get an idea of what he wanted to do. He would draw a hyper cube every day, where you put vertices on the hyper cube, and there was something he was thinking about related to logic. He never was able to develop that into a topic and eventually gave up and left. He passed the qualifying exam. He had much higher grades than I did. He was a much smarter guy than I was. But he gave up. There are so many brilliant people that went there to get doctorates. I'm not going to give you their names. They never finished. If they had come to Stanford they would have gotten a doctorate and gotten done and gone out and had a brilliant career.

At MIT, I was thinking of trying to build a machine that would be brain-like, but it was just going to take too long. I'd have to do something that would have some practical value, but I still wanted to work in that field. So I got an idea, going back to sample data systems, of what today we would call a digital filter. A digital filter has parameters, it has coefficients. I got the idea to develop a mechanism that would vary the coefficients from the filter so that the filter could improve itself and become a better filter. In a sense a filter can learn to improve itself by acting as a filter, and I figured out a way to do this. I didn't have a brain, but I had a filter. I did not have a brain learning concepts and learning sophisticated things; I had a filter learning very simple things. But it was learning. It was something we could do even in those days, and it was something practical and something useful and something good for engineering, and that was my business.

Goldstein:

Can you give me a sense of what some of the applications were?

Widrow:

Well, let's back up for a moment, if I may. One of the things that was going on in sampled data systems that was of great interest was Wiener filter theory. About the time that I joined the faculty, it turned out that Bill Linvill was leaving MIT. He went to Washington and took a job in a think tank called the Institute for Defense Analysis. He stayed there for a few years, and then he went to Rand Corporation in Santa Monica, California for a few years before coming to Stanford. Then we had both twin brothers in the same electrical engineering department. John Linvill was chairman of the department, and Bill Linvill was a faculty member before he went and founded his own department. My original mentor became chairman of a new department that he called Engineering Economic Systems. He was always interested in operations research. He did the work in signal processing, sample data systems. That was his thesis. As soon as he did that work, he switched and began using some of the concepts for the analysis and understanding of economic systems and operations research.

My interest was still in electrical engineering rather than economics. About the time that Bill Linvill was leaving, I took over the course that he taught, "Sample Data Systems." I stayed there for three years, and then I had a new office mate, Professor Ron Howard, who is now here at Stanford in this department of engineering economic systems. He took over that course after I did, and then when he left I believe that the course was taken over by Alan Oppenheim. So I was taught all this stuff about Wiener filter theory or a digital version of it. The way I originally learned Wiener filter theory, everything was analog. I learned the analog form from Y. W. Lee, who was on the EE faculty and was a disciple of Wiener. Even Wiener was there, he was a faculty member all this time he was in the math department.

IFAC conference, 1960; self-improving digital filter

Widrow:

The last time I saw Prof. Wiener alive was at the first IFAC conference in Moscow in 1960. They had a hotel that would hold two thousand people, Hotel Vkrania, so all the Americans were there. Doc Draper from the MIT aeronautical engineering department was there also. He flew over on the same airplane with me. He was getting on in years and was very heavyset, and he had some heavy carry-on bags on the airplane. They had no air conditioning and it was June and it was hotter than hell in Moscow. I can still picture the sweat pouring from the man. I said, "Dr. Draper," I introduced myself and I said, "Can I help you?" I was a brand new faculty member in a different department, so he didn't know me. I said, "Can I help you with that bag? Let me help you." I grabbed it and he said, "Oh, thanks. Geez, it's hot." I just grabbed the bag and I helped him get off the airplane.

So there are a lot of well known people that were at that conference. There was one professor who is still alive and still a friend, and that's Professor Yakov Zalmanovich Tsypkin. It turns out that he knew about my thesis work. I gave a paper on my earliest work in adaptive filtering. He was telling me about their work on quantization theory at the Institute of Automata and Telemechanics in Moscow, and I was telling him about my work on adaptive filtering and saying "you really ought to pay attention to this adaptive business." After that, he began working on adaptive filters and control systems. So we worked on similar things over the years. He's probably at least ten years older than I am. He was a well established faculty member in Moscow. He had two doctoral students that were working on quantization theory. They had copies of my thesis and my IRE papers. Somehow or other they had a good spy system. This stuff was all in the open literature. They knew everything about what we were doing. We in turn knew nothing about what they were doing. It was an eye opener.

When I went to that meeting in Moscow, Professor Tipkin was looking for me. He wanted to meet me to talk about quantization. The application for it at that time was pretty limited. It was almost strictly of theoretical interest. If you wanted an analog-to-digital converter you had to make one and connect it in real time to a computer. How many setups do you think there were in the world like that? There may have been two or three. But there was a lot of theory that was worked out then that really didn't come into play for years and years. People were excited about information theory. What the hell for? This stuff was totally useless. It has only become useful in the last ten or twenty years. The field is about fifty years old. It took years before the early stuff had any significance. Nyquist's paper. What the hell difference did that make? But people were interested in it. People are interested in things because they are interested in things, not because they see the application. I was interested in quantization theory because I was just interested in quantization theory. I wasn't worried about what you are going to do with it. There weren't any systems out there that could use it, compared to today.

I wrote those papers on quantization theory and then got some ideas about how you can develop a simple form of learning and use it for adaptive filters. I knew Wiener filter theory in digital form. The guys in our group had developed it. One of the students, Bob Sittler, taught it to me. As far as I know, he worked this stuff out. But he worked it out based on knowledge of the analog filter theory. There are a lot of differences, but when you see one, if you know how to convert your mind set from analog mind to digital, you can figure out how to do these things. Except that there are only a few people on the planet that knew how to go from an analog mind set to a digital mind set.

I was thinking of a form of digital filter. It's called FIR, for finite impulse response. This is a set of delays that are strung together, and you can get at the original signal with different amounts of delay. By taking the signals at the taps--we called it a tapped delay line or transversal filter--and by varying the coefficients with which the signals coming from the various delay taps are weighted, the weighted signals can be summed to give an adaptive output. The output is a weighted version of the present input, the previous input, the input from the time before that and the time before that. This is, in effect, a moving average filter. The window of the filter is the duration of the delay line. What I was doing was varying the weights of the window. I was varying the coefficients or the weights of the filter. It was obvious that by varying these weights and varying the impulse response of the filter, I was controlling the impulse response and the Fourier transform of that, which is the frequency response. So I was varying the impulse response, the frequency response of the filter by adjusting these coefficients. Now how do you adjust them? Wiener introduced into filter theory the idea of making a filter that minimizes mean square error. The Wiener filter theory was based on the idea that you know for a given filter, let's say you know the statistics of the input. Let's say that the input signal would be the Dow Jones Industrial Average over past history. You take samples of it every day at a certain time and determine what that average is today, yesterday, the day before, and the day before that. You are trying to predict, let's say, what it's going to be tomorrow. You might want to predict what it's going to be a week from today. The way you predict with a moving average is by taking the present value and previous values, a certain number of previous values, and weighting and summing them. That gives you a weighted average that will project. What you try to do is to find the weights that will give you the best statistical prediction of what the future is going to be let's say a week from now.

If you are going to do this with the Wiener filter, do it with the best linear least squares filter. Let's say you are going to do it with a digital filter. Once a day you make this forecast. You use the values of present and previous values at that time. What the Wiener filter theory tells you is you need to know the auto-correlation function of the filter input. Once you know that auto-correlation function, you can use Wiener theory that will give you the impulse response of the best linear filter directly

Goldstein:

You used the term "adaptive filter." Does that term show up in Wiener's work?

Widrow:

No. If Wiener had lived to see it, he would have appreciated it, but there is nothing in his work. I took knowledge of digital filtering, FIR filters impulse response, frequency response, and made a filter that would automatically be able to predict. And how do you do that? Well, if you take let's say samples of the Dow Jones average over time, and let's say you want to build a predictor that will predict the week ahead, so you start with data from a week ago and past history before that. You don't give it today's, yesterday's, the day before; give it only a week ago and history earlier than that. What you are trying to do is predict today's Dow average. So you take a filter and make the weights random, and what you are filtering is stuff from a week ago-- earlier history--so you are making a moving average of that. You start out with the weights arbitrary, random, and make a sum, and that makes a prediction. Then you compare the prediction against what actually happened today. The difference between the two is a prediction error. You want to adjust those coefficients and minimize the mean square of that error. So you can measure the error, let history flow, imagine this process ongoing, and you never give the adaptive filter the present input except the inputs starting the history at a week ago and further back in history. You are always asking it to predict, and you are sensing the error. You can take a whole bunch of error samples and average them to find the mean square. Then you can take one of the coefficients, move it, and then re-measure the mean square error. You can take the old data and run it back through again, or you can take fresh data. It's better to take the old data and run it back through again. You'll get something which is more characteristic of the effect of changing the coefficient, that will more efficiently use your data. You use less data to get the results you want. You can get from that the partial derivative of the mean square error with respect to one of the coefficients. You can do that for all the coefficients. If you have ten coefficients, you can get ten partial derivatives of the mean square error with respect to that coefficient. That's a gradient vector. You can vary all the weights in the direction of the negative gradient. Now what I worked out theoretically with Wiener theory and knowledge of digital filters was that the mean square error if the input is statistically stationary, a quadratic function of the weights.

If you have the mean square error as a function of two weights and you draw a picture, the surface is a [inaudible] it's a quadratic function of the weights. Picture a quadratic bowl as a function of two variables. The optimum place is at the bottom of the bowl where the mean square error is minimum. You are trying to go down the bowl. So if you use a gradient, you can follow the negative of the gradient. You don't want to go positive, because if you do you'll go up the bowl instead of down the bowl, which would be not a good idea. You the follow the negative gradient, and that's what we were doing. We were rocking the weights forward and backward. The most elementary idea from freshman calculus.

Goldstein:

Can you tell me what data you were using as you were developing these ideas? I mean, you mentioned in your example Dow Jones data. Is that the sort of thing?

Widrow:

We were not using Dow Jones data. When this was tested with simulation, it was tested on an IBM 701 computer that was at MIT. The data was synthetic and generated to be correlated over time. If the samples of data are not correlated over time, you can't do any predicting. If the samples are correlated over time, then you have a basis for prediction. So the way we would generate the correlated sequence of data is by putting white noise into a digital filter having one pole. This was a recursive digital filter, having feedback with a single delay, with an appropriate coefficient that you can use to adjust the auto-correlation. You can make any kind of statistics you want by taking white noise and putting it through an appropriate filter. But we knew all about that stuff. So we were able to synthesize correlated sequences with all kinds of statistical properties. We knew theoretically what the Wiener solution was, and we can make sure that the adaptive filter would settle and converge on the Wiener solution.

This went along very well, and I could appreciate that this would make a simple filter. There were all kinds of uses for filters such as tone controls on the radio, to cut out the high frequencies if the signal is noisy. If you get too much high frequency noise you are better off to sacrifice the high frequency signal. You get less noise and less signal, but you're better off. Probably ideas like that are what got Wiener into doing a filter theory. You can imagine building these filters. We did have delay lines, but this was so difficult and expensive that it was not something anybody could do. We were interested in it just because it was a theoretical interest.

Lincoln Labs and filters

Goldstein:

Well, I know that people like Charlie Rader were coming up with digital filters at Lincoln Labs. Did you have much contact with him?

Widrow:

I met him. I worked at Lincoln Lab in the summer of 1959 during the transition between MIT and Stanford. I was supposed to begin at Stanford in September after graduation, so I thought I'd do some work during the summer and they gave me a temporary appointment at Lincoln Labs. I was able to find out about Charlie Rader and Price and Green, and some research that they were doing, and I was able to tell them some things about quantization and quantized data, and at the time I was not working in quantization anymore. I was just busy with adaptive filters. So I continued at Lincoln Labs doing work on adaptive filtering, and with the computers that we had you could simulate an adaptive filter. I knew the computers were going to be better, that they were going to get faster, but I wasn't thinking about cheaper. There was just no way at that time one could anticipate what would happen, that you were going to have a personal computer, or that you can have a signal processing chip and you can have a chip that costs a few bucks and you can do adaptive digital programming with it. Who would ever imagine a thing like that? But that wasn't really what was driving us. It was an intellectual interest. And in my mind, it was the beginning. The Linvill philosophy is that you start out with something simple before you start something complicated. You don't start off trying to build a brain. What you try to do is to build a simple learning system. That was the approach I took.

Evolution of adaptive filtering research

Goldstein:

Tell me where did adaptive filtering go from these earlier researches.

Widrow:

We were working on that, and then along came an MIT student at the time, a guy named Dick Mattson, who became my master's student. I told him about adaptive filters and he showed me something that he was interested in. He was interested in a neural element to do logic with a neuron. You can make a neuron by taking logical input signals that are ones and zeros, two levels, and weight and sum them, then see whether the sum exceeds a threshold or not. The output would be one or zero. It's a peculiar way to do logic. By adjusting the coefficients, you can adjust the logic that the system is doing. I became fascinated with adaptive logic. The first paper I ever published on it was called "Adaptive Switching Circuits." In those days you didn't call it logic or logic design, you called it switching theory. Then the name changed from switching to logic. Why? I don't know. But the first course I took on switching theory was taught by Professor Sam Caldwell at MIT. Two of my classmates turned out to be very interesting logicians, Dave Huffman and Ed McClusky.

Goldstein:

I guess one thing I'm confused about is why they're called filters. Because they're not filters in the classical sense.

Widrow:

It's called an FIR filter. A digital filter of the type called finite impulse response.

Goldstein:

I don't see why it's a filter.

Widrow:

Let's take a filter in the engine of your car. What does that filter do? Pull the old one out and it's loaded with garbage. It removes particles that get into the oil of the car.

Goldstein:

Right, and that's the analogy to the classic filter in radio circuits.

Widrow:

So what happens is that oil from the engine comes into the filter, some of the stuff stays in the filter and the rest of it goes out. So there is an input flow of oil into the filter, and there is a flow of oil out. If you are doing noise filtering you might try to find the impulse response of the best analog filter that will do the filtering function in the "best least squares" sense. If you have an input signal which contains signal plus noise, and you don't know what the signal is, you'd like this filter to separate the signal from the noise.

If the spectrum of the signal overlaps the spectrum of the noise, you can't do this perfectly, because if you try to block the noise you are going to block the signal. What you can do is block the signal at each frequency in such a way that the output of the filter is as close as possible to the true signal. But you don't know the true signal. You don't know the signal from the noise. What Wiener said is that you study the statistics of the signal and you know the statistics of the noise. Then what you needed to know was the auto-correlation of the signal, the auto-correlation of the noise, and the cross-correlation between the signal and the noise. If you know that you can figure out the best way to design that filter. Because you know the second order statistics of all the components of the input. So if you know the statistics of the components of the input you can design the filter, and you can design the best filter without knowing what the input signals and noises are.

The adaptive filter does the same thing. But it is not given the statistics of the input signal but rather examples of what the desired response would be. Suppose you have signal and noise coming into the filter. What you would like to come out would be just the signal without the noise. You want to have the best filter that does the best job possible, a filter that minimizes the noise and maximizes the signal. But you don't know the input statistics, so the adaptive filter needs to sample an input that is the same as what you would like the output to be. The adaptive process automatically adjusts the parameters of the filter to minimize the mean square of the error. But in order to have the error, you need to know the output of the filter, which of course you have, because you made the filter. So you know the output of the filter, and the input of the filter flows. If you know the desired response, you can compare the desired output with the actual output, and the difference between the two is the error signal. And you adjust the parameters to minimize the mean square of the error.

Now, you might say if you knew the desired response, if you knew what you wanted to come out of that filter, then you don't need the filter. So the question is, how are you going to get the desired response. How are you going to adapt this filter? There are many, many applications where you can get a desired response by hook or by crook, and the cleverness in applying adaptive filtering to practical problems comes from finding the desired response in which you can get an error signal to use in adapting the weights of the filter--minimize the mean square error. So the error is used by the adaptive process in adjusting the weights. Automatically they self-adjust to minimize the mean square of the error.

Goldstein:

When you started working on that area, can you tell me what the precedent was?

Widrow:

Well, there wasn't any precedent that I knew of. I said to myself, "What can I do to have engineering significance? We can make something that works in the short term. Never mind building a brain."

Goldstein:

So does that mean that you coined term "adaptive filtering"?

Widrow:

I think so.

Goldstein:

So you were telling me before where that work went once you got to Stanford that you got into these mural

Widrow:

I supervised the thesis with Dick Mattson, and what I could see is that the neural structure that he had was the same as the structure I was using for making a finite impulse response filter. The difference was that he took the signal at the summation point. It's a weighted sum, and he ran that through a quantizer, a threshold device so it would give a one as an output. Even though the inputs were binary, the weights were analog because they were continuously variable. The sum was thus continuously variable. He took the continuously variable sum, put that through a threshold device, and the output was a one or a zero. Later this type of device was called a TLU or a Threshold Logic Unit. Now, I had heard of McCullough and Pitts. Mattson's work was strongly related to theirs. But I realized when he showed me that he was interested in doing logic with this type of structure, that the structure he was doing it with was very similar to what I was using for adaptive filters.

I became very interested in what he was doing and agreed to supervise a thesis on that subject. There's a certain type of logic that can be realized with that kind of structure. It's called a linear reciprocal logic function. There was already a literature on that.

McCullough and Pitts were in a different building. They were active at MIT at the same time that I was working on the beginnings of adaptive filters. There was nothing in McCullough and Pitts that had anything to do with changing the coefficients to allow this thing to learn. They simply had a neuron that could learn. They were interested in the kind of processing that the neuron does, and tried to look biologically to see what there is in living systems that behaves like that. When I had first heard about McCullough and Pitts a few years before that, I never would have believed that I would get involved in anything that they were doing.

Pretty soon I began to be interested in adapting neural elements. Let's say you have a truth table. You say, "Well, I'd like a piece of logic that can implement this truth table." The normal way to design it would be to use what I learned in the course on switching theory.

Goldstein:

What is that method taught by Professor Caldwell.

Widrow:

We were doing a lot of it was by the seat of the pants. We were using the theory of Boolean algebra. The logical devices that were used were relay devices, and there was a computer at Harvard that was built out of relays. I became interested in logic that could learn. What you do is feed it. You feed in the input conditions in the truth table. The truth table says, "For this combination of input, I want the output to be a one or a zero." So you have a whole set of control inputs, and you go through a logic net, and the output could be a single output that would either light a light bulb or not. That was the idea on the logic net.

My thinking was, that if you could put in a set of inputs into the weights, you could then adjust the weights to cause the correct output decision for that set of inputs. I was taking a flexible piece of logic and teaching it. But there are only certain logic functions that you can realize with a single neuron, and that's called linear reciprocal logic. But you can get around that by using combinations of neurons in a neural network to get any kind of logic you like.

Goldstein:

So these were all simulations you had running on the computer, where the computer could update the coefficients.

Widrow:

That's correct. So when Mattson joined our little group, we extended the work to apply not only to adaptive sample data systems but also to adaptive switching circuits.

Goldstein:

Do you know something about that transition?

Widrow:

Well, you see, we had been working on adaptive FIR filters at MIT since 1956 or 57. I was working on gradient following algorithms, and I already was able to prove that the mean square error surface was quadratic. It was beautiful, because it has a unique optimum, and doesn't have local optima. Just following the gradient, it takes you right down to the bottom of the quadratic bowl. I began to work on theory of a dynamic learning theory. What aspects control the rate of learning? How fast does it learn? Some of the things that I figured out over a few years turn out to be fundamental in explaining how these adaptive filters behave.

While some of these things were not published in the best places, they did get picked up and had some influence. Besides myself, Rosenblatt at the Cornell Aero Lab was working on what he called a perceptron. When I came to Stanford I had simple adaptive algorithms that were based on elementary calculus; in other words, getting the gradient by moving the coefficients backward and forward to see what effect that has on the mean square error. Later at Stanford we developed a new algorithm that's called least mean square. Today I think it's safe to say that this is the most widely used adaptive algorithm on the planet. It was discovered in 1959, the first year I came to Stanford, working with my first doctoral student, Ted Hoff. He was the second half of the LMS algorithm. He got his doctorate under my supervision, and stayed on as a post doc for a few years. He worked as a post doc in my lab, and was a classmate at MIT of one of my close colleagues, Professor Jim Avery. He was a friend of Bob Noyce's, who was one of the founders of Intel, and Bob was looking for people. It kind of broke my heart, but I thought it was a great opportunity for Ted and I really was happy that he did it. I told him that this looked like a good thing, to go with this new startup company called Intel. He was employee number twelve.

Goldstein:

But you were talking about the control system.

Widrow:

Well, before control you were asking me about applications. The biggest application that exists today is the modem in your computer. Almost all modems in the world use several adaptive filters, and most of them are built as FIR filters, and they filter the signal coming from the telephone line. The telephone line introduces distortion. The bits are transmitted as pulses. You have the positive impulse which is an approximate Dirac delta filter function in a positive direction. It is injected into the telephone line to represent a one, and a negative pulse representing a zero. When you send a pulse, at the receiving end it no longer looks sharp because of distortion. It looks sort of fat. Now if you start sending pulses at very high rates they get broadened out, and they get fatter as they go through the telephone line and they overlap each other. If the telephone line were perfect, it would transmit signals between zero and 3 khz, the frequency response would be flat, and the phase response would be linear. Then, if you applied an impulse at the transmitting end, it would be received at the receiving end as a sine function, which is the sine x over x function. So it has a main peak, and it has zero crossings that are uniformly spaced in time. It turns out that the zero crossings occur exactly at the Nyquist frequency for the bandwidth of the telephone line. So if you transmitted pulses exactly at the Nyquist rate--even though these pulses having gotten blobby and they overlap each other--if you sample the signal at the receiving end exactly at the Nyquist's rate, you can adjust the sampling phase at and the receiving end so that the sample will take place at the peak of one of the sync pulses. All the neighboring pulses that overlap it would have zero crossings. Because with the sync function, the zero crossings are exactly uniform in time. So you're in great shape if that happens. But if the telephone circuit is not flat in its frequency response, then it has got peaks and dips and the phase response is not linear. If you apply an impulse at the transmitting end, it comes out at the receiving end not a sine function but a distorted sine function. The zero crossings occur non-uniformly in time. So there's no way that you can sample to hit the peak of a pulse, while all the neighboring pulses are having zero crossings.

Goldstein:

Right.

Widrow:

So you have interference from neighboring pulses. This is called intersymbol interference. If you can put a filter at the receiving end of the phone line, you can take the frequency response of the telephone channel as is. You can equalize the telephone line so that the filter has a dip at that place and frequency where the response of the telephone line has a peak. When you combine the two together, put the received signal through an equalizer before you try to deal with it. So if you have an equalizer for the telephone line, you can create something very closely approximating an ideal channel with an impulse response that's basically a very close approximation of a sync function. Then you need a filter with a frequency response that is the inverse of that of the telephone line. You want its transfer function to be the reciprocal of the transfer function of the phone line.

The problem is, every phone line has its own personality. There are no two alike. So how do you do this? Well, one way to do it is to have a filter in the receiving end of the telephone channel and have this filter have a frequency response that is adjustable. So how about our FIR filter with variable weights and some kind of an algorithm or a procedure for adjusting the weights so that this thing equalizes the phone line. The problem is, for the weights to learn, where would you get the desired response? If you knew the desired response, you wouldn't need that filter, would you?

Let me tell you what happens now in the high speed transmission system. This happens in the fax machine and the computer modem. When two systems are trying to communicate with each other, the one that's trying to communicate with the other sends out a known sequence of pulses. The machine on the other end knows what those pulses are. The machine at the receiving end adjusts the weights in that filter so that its output agrees with what this is supposed to be during the training sequence. Once it's trained, then you can go on from there.

Goldstein:

The phone line's distortion needs to be stable for that to work.

Widrow:

It's not stable. It varies just a little bit. But there's another method that you can use once the phone line has stabilized, and that is a way of getting the desired response from the actual output response of a system. It sort of yanks itself up by its own bootstraps. I was already working on such learning, not for that application but for a different application.

There was a man at Bell Labs named Robert Lucky. I didn't know Bob's work at the time, but he had published some papers in the Bell System Technical Journal in the middle '60s on what he called "decision directed learning." You use your own binary decision coming out of the equalizer to use as a desired response. With this desired response, he was using algorithms similar to what I had developed for training single neurons and the adaptive filter. He used this in an application to equalize the telephone line. I'm pretty sure that he wasn't really acquainted with what we did, because at the outset he was developing a different method of adapting the weights. The first papers that he published used another method of adapting the weights. Quite a few years later all the stuff coming out of Bell Labs was beginning to use the LMS algorithm, so I guess they found out it just works better. They never would refer to where it came from. They just called it a well known gradient algorithm.

We found out some things. We looked back at Nyquist's' paper on the sampling theorem. He only proved it with sine waves. If you believe in linearity, you have to do some extrapolating and then you got the full blown sampling theory. Shannon actually did the right job on his famous paper the became the foundation of information theory. But Nyquist's name is attached to sampling theory. There were other people that had come closer to it than Nyquist about fifteen to twenty years earlier. We became very interested in sampling theory, so we looked back through the literature to see who did what.

We were not working on telephone equalization at the time that Lucky was doing the equalization work. We were using adaptive filters for another purpose. We were making adaptive antennas. We published the first paper on adaptive antennas in 1967 in the Proceedings of the IEEE. I think it was December 1967. That paper turned out to be a citation classic, yet it came very close to getting rejected. Three reviewers reviewed it. One reviewer said it's okay to publish; another reviewer had a lot of trouble with it, but he said with some fixing it might be publishable, he was sort of on the fence; and the third reviewer really slammed it. He said, "This is just a bad idea. This is not going to work. It's no good." I still remember why he said that, and the problem was that he really didn't appreciate what we were doing. He was obviously an antenna guy who knew a lot about antennas. When I saw his review, I did add some things to the paper to make sure that antenna people wouldn't have the same feeling that he did. The editor later decided to publish a modified version of the paper.

Goldstein:

Can you tell me what the objection was?

Widrow:

We were using weights on a receiving antenna, and he said "Those weights are attenuated. You never put signals from an antenna through attenuators." He said, "You want all the signal you can get," he said, "You are wasting signal if you go through an attenuator." He said, "You are going to make your noise situation worse, and this is a bad thing to do." But actually we were trying to work out a different problem. We weren't working a problem where you are at threshold level and just barely getting the signal. There are a lot of cases where you can get the signal loud and clear, but you are also getting strong directional interference like a jamming signal or an interfering signal that is much stronger than the signal you are looking for. We made an antenna whose directivity pattern would automatically form a main beam in one direction so that anything coming from any other direction would automatically be eliminated as much as possible. So if a jamming signal is coming in from some other direction, you want to get rid of it, because it's not correlated with the desired response. The way this thing gets rid of it is it puts a null in the direction of the jammer. An antenna pattern has main lobes and side lobes. So if there are side lobes sticking in the direction of the jammer, you are going to get a lot of interference. What this does is causes the side lobes to disappear, or it develops a null in the side lobe structure in the direction of the interference, and if the interference moves then it continually adapts. It will keep a null trained in the direction of that jamming signal. If there are multiple jammings, it will make multiple nulls. That's what we were working on while Lucky was working on adaptive equalization.

Goldstein:

Were the problems were fundamentally similar.

Widrow:

You might not expect it, but they are. Here he was trying to equalize the telephone channel, and what we were doing was trying to eliminate noise from an antenna. They were similar just like Wiener demonstrated. A single theory shows how to deal with the problems of statistical prediction, statistical noise reduction, and other problems of statistical signal processing. They can all be put in a common framework. We were working on an adaptive framework that allowed you to develop a learning system that could be used for all kinds of things.

The adaptive antenna allowed you to electronically "train" the antenna in a way that simulated the way you physically steer an ordinary antenna. So if anything came in from that direction, you receive it with a gain of one. But anything that came in from some other direction, you could eliminate it as much possible. The adaptive process for finding weights to adapt the filter was the LMS algorithm. Lucky was using an algorithm similar to LMS but significantly different in his early work on equalization. Although Bell Labs has switched over and begun using this well known gradient algorithm, they don't use the name LMS algorithm that was already in the literature. Maybe they were afraid we had patents on it or something like that, but the fact is we no longer did. The patents that we had on the adaptive neuron and the LMS algorithm ran out before of any of this stuff became useful. So we never worried about it. I was just happy to see it used. If I got a hundredth of a cent for every modem and every application that's out there using an adaptive filter, I would be a very, very wealthy man. But I must tell you quite frankly--and this is my opinion, not my wife's opinion, although I think she'd agree with me--that I feel much more satisfied with the way it worked out. We patented it but we never saw a dime from it. Nothing for an engineer beats seeing something that you did become used and useful. That makes you very happy.

Widrow:

How did you recognize that it was being used there? You don't have access to design specifications of the equipment. Are there papers in the literature that somehow let you know what is being done?

Widrow:

There were papers published in the Bell System Technical Journal. What we were doing was adaptive antennas. We had worked with recorded data, such as sonar data and we were processing it on the computer. Other data we simulated on a computer. But all the work we did with adaptive antennas was simulated. We had built some single neurons when Hoff was still in my lab. Hoff liked to build stuff, so he was the builder. He was the grand designer of all the adaptive circuits. When he left, we stopped building. We never really picked it back up again, so from then on everything was simulated. We had built one neural network called the MADALINE II in about 1963 that had one thousand weights in it. This was parallel, real time hardware. We had it connected to an IBM 1620 computer, which was basically a minicomputer, except it took up a whole room. We knew about this computer because one of the guys in our lab, Paul Low, ultimately became a Vice President of IBM. When he completed his doctoral studies, he went back to IBM and went right up to the top. As a student, he knew the guys in San Jose who had designed the machine, and he knew how to hook things to it. So in 1963 we had adaptive circuits that were connected onto a digital computer like a peripheral device.

Goldstein:

What did you use that system to do?

Widrow:

The army had sponsored a research project on ballistic missile defense, a 1960s version of Star Wars. You have radar data from a cloud of objects coming in, some of which were warheads and some decoys. It was a pattern recognition problem that you did in real time. The problem was to tell the difference between the decoys and the warheads. I have not worked on this problem for years, but I believe that the problem is still not solved. That's part of why Star Wars got canceled. There are some people in the Republican party who still don't know that that problem is not very well in hand. Although I haven't had contact with this for years, from the things I read in the press, I know how difficult that problem is. The trouble is that if you have a system that tells the difference between the warheads and the decoys, and maybe there are ten warheads and a couple hundred decoys, you get most of the warheads, but one or two get through. Then it's good-bye Charlie. Unless you have something that's essentially perfect, why bother?

Goldstein:

Well, thank you very much for the interview.