## Contents

Fumitada Itakura was born in Toyokawa, Aichi prefecture in 1940. He earned undergraduate and graduate degrees at Nagoya University under the supervision of Kanehisa Udagawa and Teruo Fukumura. In 1968, he joined NTT's Electrical Communication Laboratory in Tokyo. He completed his Ph.D. in speech processing in 1972, writing his dissertation on "Speech Analysis and Synthesis based on a Statistical Method." He worked as a Resident Visitor in the Acoustics Research Department of Bell Labs under James Flanagan from 1973 to 1975. Between 1975 and 1981, he researched problems in speech analysis and synthesis based on the Line Spectrum Pair [LSP] method. In 1981, he was appointed as Chief of the Speech and Acoustics Research Section at NTT. He left this position in 1984 to take a professorship in communications theory and signal processing at Nagoya University. His major contributions include theoretical advances involving the application of stationary stochastic processes, linear prediction, and maximum likelihood classification to speech recognition. He patented the PARCOR vocoder in 1969 and helped to design Fujitsu's speech recognition chip in the early 1980s. His awards include the IEEE ASSP Senior Award, 1975, an award from Japan's Ministry of Science and Technology, 1977, and the 1986 Morris N. Liebmann Award (with B. S. Atal). He is a member of the IEEE, the Institute of Electronics and Communication Engineers of Japan, and the Acoustical Society of America.

The interview begins with a description of Itakura's education and early interest in mathematics. Although he lacked sophisticated equipment, he was able to undertake studies of speech. Because of his interest in and knowledge of mathematics, Itakura was able to exploit the findings of mathematicians and translate them into engineering terms. Much of the rest of the interview is concerned with the technical problems involved in speech modeling and speech recognition, and Itakura's application of mathematical and statistical theory to find solutions. He also describes the different approaches to speech recognition research in the United States and Japan.

FUMITADA ITAKURA: An Interview Conducted by Frederik Nebeker, Center for the History of Electrical Engineering, 22 April 1997

Interview #336 for the Center for the History of Electrical Engineering, The Institute of Electrical and Electronics Engineers, Inc., and Rutgers, The State University of New Jersey

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, Rutgers - the State University, 39 Union Street, New Brunswick, NJ 08901-8538 USA. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

It is recommended that this oral history be cited as follows:
Fumitada Itakura, an oral history conducted in 1997 by Frederik Nebeker, IEEE History Center, Rutgers University, New Brunswick, NJ, USA.

## Interview

Interviewer: Frederik Nebeker
Date: 22 April 1997
Place: Munich, Germany

### Childhood, family, and education

Nebeker:

Itakura:

I was born in Toyokawa in the Aichi prefecture in the central part of Japan, about 200 miles west of Tokyo. My father was a teacher in elementary and junior high school. My mother was also a teacher, but she stopped after she had children. My father was an enthusiastic teacher. He got a special education in mathematics because he wanted to become a high school teacher, but due to World War II he had to stay in junior high.

Nebeker:

He stimulated you to learn math and science?

Itakura:

Yes. Also, my uncle was a university professor in mathematics and he visited us once in a while. When I was ten or twelve years old he taught me how to measure the height of Mount Fuji.

Nebeker:

You were interested in mathematics at an early age?

Itakura:

Yes, I liked math. I was also interested in radio.

Nebeker:

Itakura:

Oh, sure. I got several broken radios for parts, even a transmitter.

Nebeker:

Were you an amateur radio operator?

Itakura:

Yes. I continued that hobby up to junior high school.

Nebeker:

So many engineers started with radio. Did you think that you would go into radio engineering?

Itakura:

Yes, my elementary school was located just one mile from Nagoya University’s radio laboratory. I knew several professors at the university because my friend's father was a professor. Once in a while, I visited the laboratory and asked them what was happening in the solar system. When I was a junior high school student, I really wanted to become a researcher.

Nebeker:

I know you went to Nagoya University.

Itakura:

Yes. That was a main reason for that.

Nebeker:

Were you interested in that laboratory in particular?

Itakura:

Actually, I did research at that laboratory when I was a senior at the university.

Nebeker:

You did research as an undergraduate, even before you got your degree?

Itakura:

That is the system in our country. Every student in electrical communication has to do some small research project under the supervision of a professor. For the bachelor's degree we wrote a small thesis of twenty or forty pages.

Nebeker:

That was your work on whistlers. It is interesting that in this very first work you were using statistics.

Itakura:

Yes, I’m very interested in statistical communication, estimation, and detection. But the real phenomenon is very complicated. It is difficult to use established statistical theory for detection in the optimal sense. I built a kind of filter bank and the signal processing was done using analog filters. In order to find some kind of coincidence in the time-frequency pattern, we made digital circuits, but that wasn't easy. We could not use ICs, so I built every circuit using discrete transistors. And the transistors were germanium, not silicon.

Nebeker:

This was in late 1961. It is very interesting that your very first project had band-pass filters.

Itakura:

Yes, and I used the sound spectrogram, so it was designed just for speech analysis.

Nebeker:

You applied it to the whistlers?

Itakura:

Yes. It's an audio frequency phenomenon, even if it is an electromagnetic wave.

Nebeker:

Your objective was to be able to detect the whistlers?

Itakura:

Yes. We analyzed the whistler signal, with time on one axis and frequency on the other. The curve is not sinusoidal because it is associated just with thunder, and the propagation medium is quite complex: the magneto-ionosphere, with electromagnetic propagation in plasma. There is a lot of noise, so it is very difficult to find the parameters of the curve. That was my first project.

Nebeker:

What professor did you work with on this?

Itakura:

A specialist in VLF physics. He also had a background in electrical engineering. Akira Aeye.

Nebeker:

He suggested the problem to you?

Itakura:

Yes, yes.

Nebeker:

You were trying to find a way to characterize this signal?

Itakura:

Yes, it was fairly simple. We subdivided the frequency range and detected the energy in each band. We used some delay, depending on the frequency. We delayed the higher frequency and made a kind of coincidence circuit. By doing this we could detect the parameter of the signal.
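
The delay-and-coincidence scheme described above can be sketched in a few lines of Python. This is only an illustration, not a reconstruction of the original circuit. It assumes the standard whistler dispersion law, in which a component at frequency f arrives at time t0 + D/sqrt(f); that law is not stated in the interview. The band frequencies, the tolerance, the dispersion values, and all function names are hypothetical.

```python
import math
from statistics import median

def coincidence_count(onsets, dispersion, tolerance=0.02):
    """Count bands that line up after frequency-dependent delays.

    onsets maps a band center frequency (Hz) to the time (s) at which
    energy was detected in that band. Assumes arrival = t0 + D / sqrt(f).
    """
    f_min = min(onsets)
    # Higher frequencies arrive first, so they receive the larger delay.
    aligned = [t + dispersion / math.sqrt(f_min) - dispersion / math.sqrt(f)
               for f, t in onsets.items()]
    center = median(aligned)
    return sum(1 for a in aligned if abs(a - center) <= tolerance)

def estimate_dispersion(onsets, candidates, tolerance=0.02):
    """Pick the candidate dispersion whose delays give the most coincidences."""
    return max(candidates, key=lambda d: coincidence_count(onsets, d, tolerance))

# Synthetic whistler: five bands, illustrative dispersion of 60 s*Hz^0.5.
bands_hz = [2000.0, 3000.0, 4000.0, 6000.0, 8000.0]
true_dispersion = 60.0
onsets = {f: 1.0 + true_dispersion / math.sqrt(f) for f in bands_hz}

best = estimate_dispersion(onsets, candidates=[20.0, 40.0, 60.0, 80.0])
print(best)  # 60.0
```

In hardware terms, each candidate dispersion corresponds to one set of shift-register delays, and the count plays the role of the coincidence circuit.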

Nebeker:

Where did the digital circuitry come in?

Itakura:

A digital shift register was used to get the delay.

Nebeker:

And how did the statistics get into this?

Itakura:

It was a heuristic way to detect the whistler. I figured that by making a statistical model of the phenomenon, we could find a matched filter to detect the signal.

Nebeker:

So the statistics were for modeling the phenomenon in order to improve the filter.

Itakura:

Right. But at this time, I was mostly making the prototype of this system. I had no time to do theoretical work, so I concentrated on just building and using the circuit.

Nebeker:

Did you get it working?

Itakura:

Yes, somehow it did work, but for a limited term. Formally it was a one-year project, but I had to do other things besides. So actually speaking, it was a half-year project. I went to the laboratory from morning to midnight, and sometimes past, soldering and measuring and building.

Nebeker:

That is very impressive.

Itakura:

I liked it very much.

Nebeker:

Where did you learn the statistics? Did you take a course?

Itakura:

In undergraduate there was some statistics and probability theory and I learned quite a lot about stochastic processes, but I didn't really understand what it was used for at that time. In 1963 I advanced to the graduate school, and I changed to Professor Kanehisa Udagawa. He was very strong in applied mathematics. When I was an undergraduate I listened to many lectures. One was advanced calculus using the textbook by Wilson (an MIT professor). I also learned electromagnetic theory from him as an undergraduate. We also had special lectures on mathematics for electrical engineering, which included some lectures on stochastic processes.

Nebeker:

You were particularly attracted to mathematics? You wanted to have good training in math?

Itakura:

Yes. I naturally became more interested in the theoretical part. Just building is interesting, but if I do only the heuristic part, I’m not confident I’m doing it right. I became more involved in the mathematical aspect of signal processing.

### Graduate studies

#### Prof. Udagawa's lab

Nebeker:

What was the name of Udagawa's laboratory?

Itakura:

In Japan we call the laboratory by the professor's name. It was Udagawa's laboratory.

Nebeker:

The lab included different types of applied mathematics?

Itakura:

Yes, there were several groups. One was electromagnetic theory analysis, and another was automaton theory. So it was a discrete mathematics application. There were some people working on queueing theory. I belonged to the pattern recognition group.

#### Character recognition

Nebeker:

When you started, your project was to recognize handwritten characters. That was a very ambitious thing to do in 1963, wasn’t it?

Itakura:

Yes. At that time the computer technology in Japan was behind the times.

Nebeker:

It would have been difficult anywhere.

Itakura:

But in the United States they had the IBM 704 and the 709, and even the 7094 computers, which were very, very powerful in comparison to Japanese ones. The computer we used was very small. The main memory was a rotating disk. There was a magnetic core buffer memory. The capacity of the magnetic core memory was just 200 words! [laughter] But I’m interested in that kind of thing for character recognition.

Nebeker:

Did you achieve important results with that work?

Itakura:

Not very important, but some new results. I contributed to the Institute of Electrical Communication Engineering, the Japanese version of the IEEE.

Nebeker:

You used a stochastic model of characters?

Itakura:

Yes. We write characters, but characters are modified by the random nature of writing. I subdivided the character into fourteen-by-fourteen meshes, and quantized each picture element. Using statistical analysis, I found the maximum likelihood decision method for classifying the characters. It is similar to a scheme I used in speech recognition.

Nebeker:

Was this all at the theoretical level, or did you build something?

Itakura:

It was simulation. I had a mathematical framework and we got the statistics from real data from a recognition experiment.

#### Electrocardiograms

Nebeker:

Then you moved on to a project analyzing electrocardiograms.

Itakura:

Yes, that was funny. I worked very hard for one year on character recognition. It came time to summarize, so I wrote a paper for the IEC. Professor Udagawa suggested, “Why don’t you start working in another area?” At that time, the Medical Electronics Society was newly established in Japan and Professor Udagawa was a committee member.

Nebeker:

So he wanted some results from this, too.

Itakura:

That’s right!

Nebeker:

You were trying to classify the electrocardiogram signals.

Itakura:

Right. But at that time we did not even have an A-to-D converter for the computer. It was very difficult to analyze the signal itself, so I concentrated on the rhythmic pattern of the cardiogram. My project was to classify the irregularity of the rhythm of the cardiogram.

Nebeker:

Itakura:

The timing was converted to numerical values.

Nebeker:

So you get a time for each pulse beat and then examine the time series?

Itakura:

Yes. From the electrical communications engineer’s point of view, it is a kind of pulse train statistics.

Nebeker:

Were you successful at finding a statistical characterization of irregular heartbeats? You got some results from that?

Itakura:

Some. I communicated with a medical doctor who was a friend of Professor Udagawa. Once in a while I visited him and got comments on an electrocardiogram, what kind of disease and what situation was involved. We collected a whole bunch of that data, doing computations on a computer similar to what I described before.

Nebeker:

How long did you work on that project?

Itakura:

Approximately one year. After getting some results, I had a chance to present to the International Conference on Medical Electronics held in Tokyo. At that time I was just a beginner; my results were more method-of-analysis than medical. It was difficult to communicate with medical doctors. But I also wrote a paper for the IEC about statistical analysis of random pulse trains, as applied to electrocardiograms, ECGs.

Nebeker:

Do those pulse trains occur in other areas? Is that analysis useful elsewhere?

Itakura:

Yes. After several years I got letters from other fields, for example, from a scientist working in atomic physics. They were interested in a counting process for a certain kind of wave. He asked me for some preprints and also for an extended version of my master's thesis.

#### NTT Electrical Communication Laboratory; speech recognition

Nebeker:

In 1965 Professor Udagawa died suddenly.

Itakura:

Yes, of a heart attack.

Nebeker:

You had to find someone else to work with.

Itakura:

As you know, in the case of a Ph.D. student, the professor is responsible for everything. So I had to find another professor or give up the course and begin working in a company or something. Fortunately, I was asked to come to the NTT Electrical Communication Laboratory as a kind of cooperative student.

Nebeker:

So you continued at Nagoya University, but worked also at NTT’s laboratory?

Itakura:

Right. Dr. Shuzo Saito was a graduate of Nagoya University and he had a friend there, Professor Teruo Fukumura. Fukumura was working in artificial intelligence at that time, but his original interest was in speech and acoustics. So Saito asked Fukumura, “Is there any person who is good for doing speech research at NTT?” So I was recommended by Fukumura to work at NTT.

Nebeker:

That was a rather sudden change of research topic. What was the objective of the speech work being done at NTT? Was it the same kind of vocoder work that Bell Labs was doing?

Itakura:

Dr. Saito was interested in speech recognition, so he told me, “Why don’t you start speech recognition?”

Nebeker:

Was the objective to develop a marketable product for speech recognition?

Itakura:

No, his thinking was quite fundamental, basic. Instead of just building some machine, he said, "Why don’t you work on understanding the basic principles?" So I tried to understand the phenomenon of speech. But at that time, I did not have a good background in speech. Professor Fukumura assisted me; he taught me about speech in a kind of private seminar using Gunnar Fant's Acoustic Theory of Speech Production. It was published in 1960. It was a revolutionary book on speech.

Nebeker:

When did Jim Flanagan’s book come out? The late ‘60s maybe?

Itakura:

The first edition was in the early part of the ‘60s.

Nebeker:

Did you use his book, or wasn’t it available?

Itakura:

At that time I couldn’t get it, only a monograph by Gunnar Fant. Of course, there were some technical papers that appeared in JASA or Transactions of the IEEE. I didn't have enough knowledge, so some kind of monograph like Fant's was needed.

Nebeker:

Had Professor Fukumura himself worked in speech recognition?

Itakura:

His professor was expert in speech quality variation. Ochiai was his name.

Nebeker:

This was the professor of both Fukumura and Saito?

Itakura:

Yes. Professor Ochiai was one of the pioneers in Japan in speech in the pre-computer age. His method was different from that used today, but he was very interested in speech. I had a lecture course with Ochiai also.

Nebeker:

You were using a sonogram in this speech recognition?

Itakura:

Yes, for the first time. I’d wanted to know the sound spectrogram, the form of a sound wave. There is of course an instrument to measure it. At that time, the computer was not so popular in Japan, so I made a plot of the sum of the wave using a fast optical waveform recorder. As silver-coated light sensitive paper moves along, an oscillogram is written on it.

Nebeker:

So you got a kind of recording of the sound.

Itakura:

Yes, but we could not do analog-to-digital conversion at that time. Another way of analysis was to make a sound spectrogram. I recorded several hours of consonants using my own voice. But the sound spectrogram of my voice was quite different from what is shown in a textbook. My pitch is quite high and I have a husky voice. That causes some disturbance on a spectrogram.

Nebeker:

So you weren’t getting nice vowel signals.

Itakura:

I thought the signal itself should be treated as a random process instead of just a deterministic process.

Nebeker:

This was because of your own voice?

Itakura:

But if we could do good analysis of the stochastic or random signal, we could do even better on the more regular signals. So I approached the speech signal from the worst case. And that was my first approach to speech.

Nebeker:

So you modeled the speech signal, as I’m reading here, on a "continuous time stationary stochastic process." You were trying to classify the vowels.

Itakura:

I was quite familiar with the mathematical aspects of stochastic processes. I visited the mathematics department library once or twice a week and looked up journals, such as the AMA's Annals of Mathematical Statistics and the Royal Statistical Society: Series B, and other things like that. I accidentally found a very good article written by J. Hajek in a 1961 Czechoslovakian mathematical journal. It gave the sufficient statistics for a stationary process whose spectrum is autoregressive. That was a very impressive paper, so I tried to apply the theory to speech and classification of vowels. That was my first statistical approach to speech.

The paper says, "Suppose that the order of the spectrum is P. We can get sufficient statistics of that process using the energy of derivatives of the speech waves to the Pth order." That is a very impressive result. Since I am an engineer, I built a cascaded high-order differentiator, and measured the energy for each output of the differentiator. I imagined that it would work, because the mathematics was complete, but it didn't. If we analyze a signal using the differentiator, it gives higher frequency enhancement: 6 decibels per octave for one derivative. Now, if we cascade, for example, [ten of them, giving 60] decibels per octave, that means if we analyze the speech up to the tenth order, it’s a much, much higher frequency enhancement. But in real situations we cannot do that because at the higher frequencies we don’t have meaningful energy.
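
The arithmetic behind the high-frequency enhancement can be checked with a short Python sketch. It assumes ideal differentiators, whose gain is proportional to frequency; a real analog circuit would depart from this at high frequencies. The function names are illustrative only.

```python
import math

def cascade_gain_db_per_octave(order):
    """High-frequency emphasis of `order` ideal differentiators in cascade.

    One ideal differentiator has gain proportional to frequency, so each
    doubling of frequency doubles the gain: 20*log10(2), about 6 dB.
    """
    return order * 20.0 * math.log10(2.0)

def relative_gain_db(order, f_low_hz, f_high_hz):
    """Gain at f_high_hz relative to f_low_hz for the same ideal cascade."""
    return order * 20.0 * math.log10(f_high_hz / f_low_hz)

for order in (1, 10):
    print(order, round(cascade_gain_db_per_octave(order), 1))
# 1 6.0
# 10 60.2
```

Under the same idealization, `relative_gain_db(10, 1e3, 1e6)` gives about 600 dB between 1 kHz and 1 MHz, which suggests why even a weak out-of-band signal could dominate the output of a tenth-order cascade.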

I had a bad experience when I was checking the output of the differentiator. I could hear a radio station! What was happening? There was an AM broadcast station near Nagoya, of course, and it was mixed into the input of the speech terminal and differentiated many times, so it was enhanced. And there was some similarity between a radio receiver and the circuit I was using, so it was detected, and we could hear the broadcast.

Nebeker:

Itakura:

Right.

Nebeker:

You built this circuit to analyze the speech sounds, but you said the results were disappointing.

Itakura:

It did not work at all. The mathematician does not worry about that kind of thing. A signal is a signal to him. Of course, he might consider the problem of noise, but in the case of autoregressive processing, the higher frequency component is missing due to the robust characteristics of the autoregressive process. For him it is not necessary to consider the problem, but in actual engineering, it is very important. Hajek’s paper also mentioned the discrete case, that is, a sampled input signal. I was interested in a discrete version of Hajek’s theory.

Nebeker:

Because you were thinking of using a computer?

Itakura:

Right. At that time my laboratory at NTT purchased a mini-computer for speech research and installed it in our laboratory. It was quite expensive. So I thought that computer should be used for that particular purpose. So I went to the discrete version of Hajek’s theory and applied it. In that case, of course, the signal is assumed to be sampled. That means it is band-limited. Band-limited signals can be sampled, and with sampled data the problem that appears in the continuous case disappears. Everything is very simple. That was my first approach. The title of my first paper on that issue translates to "One Consideration on Optimal Discrimination or Classification of Speech." The first author is me, the second is Fukumura, my professor at the university, and the third is Shuzo Saito, who was the editor of the communications laboratory. This was a first paper; it describes the basic theory of how to apply Hajek’s theory to speech recognition.

Nebeker:

This was 1966?

Itakura:

Yes. If we add twenty-five we get the Western calendar, so it was 1966.

Nebeker:

You presented this just shortly afterwards, in January of 1967, and you’re saying that this was the first detailed description of the maximum likelihood estimation?

Itakura:

No. What you are holding is a very brief description, the manuscript for the oral presentation of ten minutes or so. At that time everything was handwritten. This was my first speech paper; I brought it just for historical reasons.

Nebeker:

Is this the work that contains the maximum likelihood estimation?

Itakura:

No. This is the maximum likelihood classification of the speech spectrum. But of course, to have the maximum likelihood for the classification, we need the maximum likelihood for the function of the signal. So it describes the actual formula for the likelihood. By making a derivative with respect to parameter Alpha 1 to Alpha P, we can get the maximum likelihood for a linear predictive parameter. So the next paper is on that aspect of so-called linear prediction, it is essentially the same topic.
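
For readers unfamiliar with the result, setting the derivatives of the likelihood to zero leads, in a common formulation, to a set of linear equations in the short-time autocorrelation of the signal. The sketch below solves those equations with the textbook Levinson-Durbin recursion. It is an illustration under that assumption, not necessarily the computation used in the papers discussed here, and the function names are hypothetical. The reflection coefficients it returns correspond, up to sign convention, to the partial autocorrelation (PARCOR) coefficients that come up later in the interview.

```python
def autocorrelation(signal, max_lag):
    """Short-time autocorrelation r[0..max_lag] of a list of samples."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations by Levinson-Durbin.

    Returns (a, k, err): predictor coefficients a with
    x[n] ~ a[0]*x[n-1] + ... + a[order-1]*x[n-order],
    reflection coefficients k, and the final prediction error energy.
    """
    a = []
    k_list = []
    err = r[0]
    for m in range(1, order + 1):
        # Correlation left unexplained by the order m-1 predictor.
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / err
        # Order update: combine the old coefficients with their reversal.
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        k_list.append(k)
        err *= (1.0 - k * k)
    return a, k_list, err

# Autocorrelation of a first-order process with coefficient 0.5.
a, k, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(a, k, err)  # [0.5, 0.0] [0.5, 0.0] 0.75
```

The second coefficient comes out as zero because a first-order model already explains this autocorrelation sequence.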

Nebeker:

Linear predictive coding that Manfred Schroeder and others did at Bell Labs.

Itakura:

Right. That was in 1967.

Nebeker:

You were telling me before that the work was entirely independent, that you didn’t know what they were working on.

Itakura:

Yes, although my work did depend on Hajek's mathematical paper.

Nebeker:

Did they also use that paper?

Itakura:

I don’t know. It's a pure mathematics paper.

Nebeker:

It’s impressive that you looked at the mathematical literature.

Itakura:

Visiting the mathematics library is my hobby.

Nebeker:

Have you continued to look at mathematics literature?

Itakura:

Not right now, I have many things to do.

Nebeker:

It's interesting that your work also led to linear predictive coding. Can you explain the difference in the approaches taken by you and Schroeder?

Itakura:

The simplest explanation is that my approach is based on the frequency domain, while Manfred Schroeder's approach is in the time domain. Of course, those two are a kind of dual, and for me it is comfortable to consider them in the Fourier domain. But for most engineers the time domain is more direct because it is just a regular waveform. So my papers are almost always written in the frequency domain. Many people had difficulty understanding them, my boss Saito, for example. He asked, “Now, what is new about that theory?” Of course, the theory was mathematical and not new. But it was new in the way it was applied to speech. He did not like to work on the mathematics, so he said to me, “Why don’t you show a practical result based on your theory?” So I tried to do that.

At that time a newcomer joined us, Mr. Masaki Koda from NTT’s laboratory. Koda was also a graduate of Nagoya University. He joined us as an employee of NTT, whereas I was a cooperative student, a Ph.D. student, from Nagoya University. Saito allocated the speech recognition work to Koda. So I had another chance to work further on vocoding research. I got some results, so I proposed to him that we start vocoding research based on this fundamental idea. But Saito told me, “The real problem of vocoding is pitch detection.” Of course, it is true. He knew some of the research results from AT&T. The vocoder's difficulty in recognizing voice is due to poor pitch detection. As a result, it was very difficult to start the vocoding research.

So I tried to find a new method of pitch detection. At that time the best pitch detection was Michael Noll's cepstral method. But it was very complicated processing at that time. Today it is easy, since we have a good FFT algorithm and good hardware, but at that time it was almost prohibitive for vocoder applications.

Nebeker:

You’re talking about a real time application?

Itakura:

Yes. I conceived a new method of pitch detection using an inverse filter and correlation; we call it “modified correlation.” I did some simulations with the method, and the pitch extraction result was quite promising. I proposed integrating the linear predictive analysis and the new pitch detection method to get a complete vocoding system. It was late 1967, I suppose, when speech was synthesized from that vocoder. I brought that tape to Saito and he was quite astonished, because he knew of the results at Bell and other laboratories worldwide. He recommended that I work further, and from that time on, my main research was on vocoding.

Nebeker:

I know that you presented these results in August of 1968 at the International Congress on Acoustics in Tokyo. At the same session, you heard Manfred Schroeder talk about adaptive predictive coding. Were you surprised to hear that?

Itakura:

Yes. I had the conference index abstract, but didn't notice anything similar to my work. Then when I read the paper I almost fainted, especially over the part about determination of linear predictive parameters for the autoregressive process. But there was a big difference: his method was a kind of adaptive differential PCM, in which the signal is analyzed and the prediction residual signal is quantized and input to the filter. So it is a kind of waveform coding. My paper, on the other hand, was a vocoding system, in which the excitation signal is an impulse string or random noise. So it is completely a vocoding system. Of course, my group's method was a narrow-band application and Manfred’s method was medium band, as high as 8 kilobits per second or something. Our research was focused on 2 or 3 kilobits per second, a different basic concept. But the essential part is very close.

Nebeker:

You say that in making a formal evaluation of the maximum likelihood vocoder you found certain problems that you then worked on.

Itakura:

As you know, linear predictive coding uses a recursive filter for speech synthesis. To reduce bandwidth, we have to quantize the parameters quite coarsely. But if the parameters of a recursive filter are quantized, the resulting filter could be unstable. That is a really big problem. At the conference I asked Manfred Schroeder how many bits were necessary to quantize linear predictive parameters. He said, as I remember it, that the parameter is a slowly varying signal, so it could be sampled very sparsely. That is one point, and of course it is true, but he had not done a simulation of parameter quantization. So he gave no particular number, only that it could be quite low. But I had done some earlier experiments on that aspect. Sometimes 10 bits per parameter are necessary to quantize the recursive filter. That is a very big problem for vocoding. Suppose that we sample the linear predictive parameters every twenty milliseconds, fifty times a second, and suppose that we have ten parameters, and each parameter is quantized with 10 bits. So for each frame we need 100 bits. If we multiply by fifty, that is 5 kilobits per second just for the linear predictive parameters. That is too much for vocoding. So we had to find a better method of quantizing linear predictive parameters. I tried to reduce the number of quantization bits for the LPC parameters, but it was not perfect. I had to think a little more in detail. I went to the new concept of partial autocorrelation, PARCOR. By using those parameters we could quantize parameters to any number of bits. Of course, there might be some degradation of the speech, but no instability.
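
The arithmetic and the stability argument in this answer can be illustrated with a short sketch. It is not Itakura's implementation: the function names, the two-coefficient example filter, the 3-bit uniform quantizer, and the sign convention (prediction = a1*x[n-1] + a2*x[n-2]) are all assumptions chosen for illustration. The sketch uses the textbook step-up and step-down recursions between predictor coefficients and reflection (PARCOR) coefficients, and the standard fact that the synthesis filter is stable exactly when every reflection coefficient has magnitude below 1.

```python
def parameter_bit_rate(frames_per_second, num_parameters, bits_per_parameter):
    """Bits per second spent on the filter parameters alone."""
    return frames_per_second * num_parameters * bits_per_parameter


def quantize_uniform(value, bits, lo, hi):
    """Uniform quantizer on [lo, hi]; returns the midpoint of the chosen cell."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    index = min(levels - 1, max(0, int((value - lo) / step)))
    return lo + (index + 0.5) * step


def predictor_to_reflection(a):
    """Step-down recursion from predictor to reflection coefficients.

    Returns None as soon as some |k| >= 1, which means the recursive
    synthesis filter 1 / (1 - sum_i a_i z^-i) is not stable.
    """
    a = list(a)
    k = []
    while a:
        km = a[-1]
        if abs(km) >= 1.0:
            return None
        k.append(km)
        m = len(a)
        denom = 1.0 - km * km
        a = [(a[i] + km * a[m - 2 - i]) / denom for i in range(m - 1)]
    k.reverse()
    return k


def reflection_to_predictor(k):
    """Step-up recursion from reflection to predictor coefficients."""
    a = []
    for m, km in enumerate(k, start=1):
        a = [a[i] - km * a[m - 2 - i] for i in range(m - 1)] + [km]
    return a


# Hypothetical stable second-order filter with poles near z = 0.98 and z = 0.42.
a_true = [1.4, -0.41]

# Quantizing the predictor coefficients directly (3 bits on [-2, 2]) moves a
# pole onto the unit circle, so the stability check fails.
a_direct = [quantize_uniform(x, 3, -2.0, 2.0) for x in a_true]

# Quantizing the reflection coefficients (3 bits on [-1, 1]) keeps every value
# strictly inside (-1, 1), so the rebuilt filter passes the stability check.
k_true = predictor_to_reflection(a_true)
k_quant = [quantize_uniform(x, 3, -1.0, 1.0) for x in k_true]
a_from_k = reflection_to_predictor(k_quant)
```

With these helpers, `parameter_bit_rate(50, 10, 10)` reproduces the 5 kilobits per second figure quoted above, `predictor_to_reflection(a_direct)` returns `None`, and `predictor_to_reflection(a_from_k)` returns a valid coefficient list. Because the quantizer returns cell midpoints, a reflection coefficient can never reach plus or minus 1, whatever the number of bits.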

Nebeker:

Itakura:

It is a well-known statistical concept.

Nebeker:

That was the first time it was applied to the speech problem?

Itakura:

Right.

Nebeker:

This was what you applied for a patent for in May 1969, and then in July you presented the PARCOR vocoder at the special group meeting of the Acoustical Society of Japan.

Itakura:

Yes, that was a very special day.

Nebeker:

The same day as the moon landing.

Itakura:

Yes. The interesting thing about that is that in Japan we have regular meetings, so every eight months speech scientists and engineers get together to talk and encourage the technology and science of speech. It is informal, with one person designated to talk. Usually twenty or even forty people get together, but that was the day of the more interesting moon landing. Very few attended: the chairman of the committee, myself, Saito, and a few other researchers very close to the work, but they kept coming in and going out. So it was essentially only three people.

Nebeker:

But gradually, people noticed the PARCOR method?

Itakura:

The real appreciation came in 1972, when I brought it to the United States. A meeting on speech communication was held in the Boston area and the Air Force in Cambridge was a supporter of that meeting. I presented there; it was my first visit to the United States.

Nebeker:

This was also the subject of your doctoral dissertation?

Itakura:

Yes, it was on the PARCOR method for a speech and vocoding system.

Nebeker:

You completed your Ph.D. in ’72?

Itakura:

Yes. I should have finished sooner, in 1968, but I was quite interested in the research and had no time to write a big thesis.

Nebeker:

Was any of this put to use by NTT?

Itakura:

To some extent. When I submitted the patent for the PARCOR method, an engineer in the laboratory wanted to have an audio response unit for telecommunications application. At that time, memory was very expensive, so he wanted to reduce the number of bits for storing the voice. Also, the vocoding system can manipulate pitch and duration of the sound, which is a good part of speech synthesis. They wanted to use that method in the real application. In about 1972, we jointly designed a speech synthesis system for an audio response unit. I'll show you this book, The Seventy-Five Year History of the IEICE, published in ’92. I wrote a short article in it on the history of the PARCOR method. And this is the PARCOR analysis and synthesis system. This object over here is a machine I designed and manufactured for speech synthesis. It’s a big machine! Today it would be a very small chip, 2 by 2 millimeters. This is a single-channel prototype. The next year we designed the multi-channel speech synthesizer for telecommunications purposes.

Nebeker:

Did that find application? Was it built and used?

Itakura:

Yes. I don’t know the exact date of the real-time application.

Nebeker:

But the next year, in '71, you built a multi-channel?

Itakura:

Right. It was a huge machine; the DC power supply was 50 amperes.

Nebeker:

It sounds like you were doing very well at the NTT Laboratory and enjoying the work.

Itakura:

Yes, yes. After I presented this material in the United States, Jim Flanagan proposed that I work at Bell Laboratories. My company was quite cooperative and I was sent to Jim Flanagan's acoustics research department at AT&T Bell Laboratories in August of 1973.

Nebeker:

You were paid by NTT?

Itakura:

My basic wage came from NTT, but at that time the wage in Japan was much smaller than in the US, so the difference was paid by AT&T.

Nebeker:

The cost of living was higher in New Jersey.

Itakura:

Right.

Nebeker:

You stayed there for almost two years, until July of ’75?

Itakura:

I was expected to stay only one year, but Jim proposed another year to NTT. I liked working there. My first project was speech recognition. Speech recognition work at the Bell Laboratory was very limited at that time, maybe due to the comments made by John Pierce.

Nebeker:

I’ve heard that, yes.

Itakura:

Because I was just a temporary researcher, I could do almost anything. Jim suggested speech research, which I was glad to do. I had some previous experience on the theoretical aspects of speech recognition. I also applied some mathematical optimization techniques to find a time warping of the speech signal, which is very effective for reducing the temporal variation of speech. So by combining the maximum likelihood method and dynamic time-warping, we could get a speech recognition system. I worked very hard for the first two or three months, and after three months I could demonstrate the speech recognition system using Bell Laboratories’ computer. The result was quite promising; Jim was delighted. I was also very happy.
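
The time-warping idea described here can be sketched as a small dynamic-programming routine. This is a minimal illustration, not the system demonstrated at Bell Laboratories: the function names and the toy templates are hypothetical, the local cost is a plain absolute difference standing in for the likelihood-based distance mentioned in the answer, and no slope constraints are applied to the warping path.

```python
def dtw_distance(reference, test, local_cost=lambda a, b: abs(a - b)):
    """Minimum accumulated cost of aligning two sequences with time warping."""
    n, m = len(reference), len(test)
    inf = float("inf")
    # cost[i][j]: best cost of aligning reference[:i] with test[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_cost(reference[i - 1], test[j - 1])
            # Allow a diagonal step, or stretching either sequence in time.
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    return cost[n][m]


def recognize(templates, utterance):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(templates[word], utterance))


# Toy one-dimensional "feature tracks" standing in for frames of speech features.
templates = {"up": [1, 2, 3, 4], "down": [4, 3, 2, 1]}
slow_up = [1, 1, 2, 3, 3, 4]  # the same contour as "up", spoken more slowly
```

Here `recognize(templates, slow_up)` selects `"up"`, because the warping absorbs the repeated frames at no cost. In a real recognizer each frame would be a vector of spectral parameters and the local cost a spectral distance measure.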

Nebeker:

You think that may have helped change the minds of some people at Bell Labs, that speech recognition was promising?

Itakura:

Possibly.

Nebeker:

You continued to work on speech analysis and synthesis. Can you explain to me the line-spectrum pair method, LSP?

Itakura:

I continued my hobby of visiting mathematics libraries and accidentally found an interesting paper that transformed the autocorrelation function. It didn't say autocorrelation at all, but that a positive definite function could be expanded using a line-spectrum type of transformation. The language was completely mathematical. But if I interpreted it with my engineering approach, the autocorrelation function could be expanded using a minimum line spectrum and the frequency and amplitude combination. That was the so-called LSP theory. So I was quite lucky to find good mathematics; a very good mathematician paved the way for a speech scientist.
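
For readers unfamiliar with line spectrum pairs, the sketch below shows the now-standard textbook construction, which is not necessarily the formulation in the mathematical paper described here. The prediction-error polynomial A(z) = 1 - sum_i a_i z^-i (a convention assumed for this example) is split into a symmetric polynomial P and an antisymmetric polynomial Q, and the angles of their unit-circle roots are the line spectral frequencies. The function names, the grid size, and the example coefficients are hypothetical, and the root search is a simple grid scan with bisection.

```python
import math


def lsp_polynomials(a):
    """Return coefficient lists of P (symmetric) and Q (antisymmetric)."""
    c = [1.0] + [-x for x in a] + [0.0]  # A(z), padded to degree p + 1
    rev = c[::-1]                        # z^-(p+1) * A(1/z)
    p_poly = [u + v for u, v in zip(c, rev)]
    q_poly = [u - v for u, v in zip(c, rev)]
    return p_poly, q_poly


def unit_circle_roots(coeffs, antisymmetric, grid=512):
    """Angles in (0, pi) where the polynomial vanishes on the unit circle.

    After removing a linear phase factor, a symmetric polynomial becomes a
    real cosine sum and an antisymmetric one a real sine sum. The scan
    assumes no root falls exactly on a grid point.
    """
    n_max = len(coeffs) - 1
    trig = math.sin if antisymmetric else math.cos

    def f(w):
        return sum(c * trig(w * (n_max / 2.0 - n)) for n, c in enumerate(coeffs))

    roots = []
    ws = [math.pi * i / grid for i in range(1, grid)]
    for lo, hi in zip(ws, ws[1:]):
        f_lo = f(lo)
        if f_lo * f(hi) < 0.0:
            for _ in range(60):  # bisection on the bracketed sign change
                mid = 0.5 * (lo + hi)
                f_mid = f(mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
    return roots


def line_spectral_frequencies(a):
    """Sorted line spectral frequencies (radians) for predictor coefficients a."""
    p_poly, q_poly = lsp_polynomials(a)
    return sorted(unit_circle_roots(p_poly, False) + unit_circle_roots(q_poly, True))


# Hypothetical stable second-order predictor used only as an example.
lsf = line_spectral_frequencies([1.4, -0.41])
```

For this example the two frequencies satisfy cos(w) = 0.995 (from P) and cos(w) = 0.405 (from Q). For a stable filter the roots of P and Q alternate around the unit circle, and that ordering property is commonly cited as the reason these parameters quantize well at low bit rates.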

Nebeker:

There are many examples in the history of science and engineering of a mathematician pioneering methods that were later reinvented in the engineering context. But because you looked at the mathematical literature, you found the techniques right away.

Itakura:

Right. I know how to interpret the mathematics.

Nebeker:

Were you at Bell Labs when you found that paper?

Itakura:

Yes, and I presented my applications of it to the Acoustical Society of America at a meeting in San Antonio. I had only a small abstract for that meeting that I presented. At that time, most people were not very interested. I talked with Vishnu about that paper, and I also talked with John McClure of BBN. John McClure said, “It is useless,” [laughter] but I was interested in the method so I continued the work back in Japan. I found it very useful for quantizing speech parameters at a very low bit rate.

Nebeker:

So you continued at NTT for quite some time. In 1981 you were appointed Chief of the Speech and Acoustics Research Section. What were the highlights of the period from your return to Japan until you left NTT? Were you working on the LSP method?

Itakura:

LSP was developed for the real application, and we also designed an LSI for speech synthesis, based on the LSP method. That is of interest from the engineering point of view.

Nebeker:

Did these find application in the NTT system?

Itakura:

Yes, not only in NTT, but also companies such as Fujitsu used that chip for speech synthesis on the PC. Of course, today it could be programmed on a Pentium. At that time the PC was not so powerful.

Nebeker:

Were you yourself involved in the design of that chip?

Itakura:

Deeply. Of course, we only did a basic design: what kind of [unintelligible] should be used, what kind of arithmetic method should be used, what kind of D/A converter should be used, and things like that. The details were done by an outside company, Fujitsu.

Nebeker:

Was it Fujitsu who built that? I see. In 1984 you became a professor at Nagoya University. Why did you want to return to an academic setting?

Itakura:

I could have continued working at NTT, but my parents were very near Nagoya. They were getting sick and I am the first son—my brother and sisters lived quite distant from them. I also liked working in academia because I like mathematics and I like to apply mathematical concepts to engineering. But when I moved up to the grade of say, director, at NTT it was almost impossible to do that.

Nebeker:

There’s too little time left for your own research.

Itakura:

Yes, managing. I wanted to research. Of course, at the university we have another distraction: education. So we cannot get everything. But education helps in doing research.

Nebeker:

Especially the graduate students who work with you. How have things gone in the thirteen years that you’ve been at Nagoya? You have an institute there.

Itakura:

Not a big institute, but we regularly have twenty or twenty-three people, including graduate students. It’s a good size for doing research. But of course, there are still limitations, such as time, and we don’t have a budget to work with.

Nebeker:

There isn’t much collaboration between Nagoya and companies?

Itakura:

Many companies help us with the budget, but it’s not big money. Each company contributes to Nagoya University Laboratory ¥1 million a year or so. Five or ten companies do that. It’s substantial money for continuing our research.

Nebeker:

So you’re able to get the equipment you need?

Itakura:

Yes, and equipment is now cheaper, which is very good.

Nebeker:

I know we’re approaching the end of the time you have available, but I really wanted to ask you about the development of the field of digital signal processing in Japan. Has it paralleled the United States? Have there been similar efforts in most areas?

Itakura:

Basically the area of signal processing is very similar, but with a twenty-degree delay. It’s a very small delay.

Nebeker:

Ah, the same waves with the same shift, but a little behind. Is there good communication between the researchers in the United States and Japan?

Itakura:

Oh, yes. Did you notice in the Proceedings of the ICASSP that more than 200 people attended a recent meeting?

Nebeker:

But there are many, many Japanese.

Itakura:

Yes. We also have good communication between countries. For example, NTT and ATR have regular exchange programs with Bell Labs. In the laboratory, we have a foreign student from the United States and another from the United Kingdom. So we have quite a bit of communication.

Nebeker:

Thank you very much for the interview.