Oral-History:Thomas Huang

About Thomas Huang

Thomas Huang was born in Shanghai in 1936. After his family resettled in Taiwan in 1949 he attended National Taiwan University, graduating in 1956. He arrived at MIT and completed his doctorate in 1963, working under Bill Schreiber. He was appointed to the faculty of MIT that same year. He remained at MIT until 1973, when he took a position as an electrical engineering professor and director of the Information and Signal Processing Laboratory at Purdue University. In 1980 he was offered a chaired position in electrical engineering at the University of Illinois at Urbana-Champaign, and has remained there since.

Career milestones include a major role in the development of transform coding in the late 1960s and helping to organize the 1969 image coding conference at MIT. He also helped to develop algorithms for efficient image recognition and for estimating three-dimensional motion from two-dimensional images, and he made contributions to fax compression and MPEG standards. Huang has made fundamental contributions to image processing, pattern recognition, and computer vision, including design and stability testing of multidimensional digital filters; digital holography; compression techniques for documents and images; 3D motion analysis; 3D modeling, analysis, and visualization of human face, hand and body; multimodal human-computer interfaces; and multimedia databases.

Huang is a Fellow of the IEEE (cited for "fundamental contributions to multidimensional filtering and image processing"), the Optical Society of America, the International Association for Pattern Recognition, and SPIE, the International Society for Optical Engineering. His IEEE awards include co-authorship of the Best Paper of the IEEE Acoustics, Speech, and Signal Processing Society (1986), the Technical Achievement Award of the ASSP Society (1987), and the Society Award of the IEEE Signal Processing Society (1991). He served as Associate Editor of the IEEE Transactions on Acoustics, Speech, and Signal Processing from 1984 to 1987.

Huang is author or co-author of over 300 articles and 12 books, including Network Theory (Addison-Wesley, 1971) [with R. R. Parker] and Motion and Structure from Image Sequences (Springer, 1992) [with J. Weng and N. Ahuja].

After a brief overview of his education, Huang describes in detail the development of image coding techniques, from pulse code modulation through transform and wavelet coding, to current approaches such as fractal coding and the MPEG-4 standards for video transmission. He describes these developments from a university-based, basic research perspective. At the end of the interview he extols the interdisciplinary, basic research approach of the Beckman Research Laboratory at University of Illinois, and argues that such institutions are best situated to bring about the next advances in image processing.

About the Interview

THOMAS HUANG: An Interview Conducted by Andrew Goldstein, Center for the History of Electrical Engineering, 20 March 1997

Interview #331 for the Center for the History of Electrical Engineering, The Institute of Electrical and Electronics Engineers, Inc., and Rutgers, The State University of New Jersey

Copyright Statement

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, Rutgers - the State University, 39 Union Street, New Brunswick, NJ 08901-8538 USA. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

It is recommended that this oral history be cited as follows:
Thomas Huang, an oral history conducted in 1997 by Andrew Goldstein, IEEE History Center, Rutgers University, New Brunswick, NJ, USA.

Interview

Interview: Dr. Thomas Huang
Interviewer: Andrew Goldstein
Date: 20 March 1997
Place: Bell Labs, Murray Hill

Goldstein: Dr. Huang, would you start with a brief biographical sketch of your life and education?
Huang: I was born in Shanghai, China in 1936, and in 1949 I left with my parents for Taiwan. I attended the National Taiwan University and graduated in 1956 and spent two years in Reserve Officer Training. I was a radar officer in the Air Force. In 1958, I began graduate school at MIT and received the Doctor of Science degree in 1963. I was on the faculty at MIT from 1963 to 1973, then I was at Purdue for seven years, and in 1980 I joined the University of Illinois at Urbana-Champaign where I still am.
Goldstein: What did you study when you were at MIT?
Huang: When I was in Taiwan, I read some books by Ernie Guillemin at MIT on Network Theory, so I really wanted to go to MIT and study with him. I was the first student from the National Taiwan University who was accepted by MIT. I went there intending to work with Professor Guillemin; not only had I read his books, we had corresponded over some mistakes I found in them. But it turned out he only took a small number of students and really didn’t have a place for me. I ended up working with Peter Elias. At that time, Elias was interested in information theory and coding, so I started to work on image coding. It was all a bit of an accident: I had intended to study network theory, but ended up studying image coding. Then Peter Elias became department head and got too busy, so I changed my advisor to Bill Schreiber. I kept working on image coding with him. I did my M.S. and my Doctor of Science with Bill Schreiber.
Goldstein: Where did image coding fit into the discipline of electrical engineering at that time?
Huang: It was just getting started, it was sort of a part of information theory, at least at MIT. A lot of attention was paid to the statistical aspects of it, but it was also starting to include more of the human perception side, so it was really an interdisciplinary topic, even at that time.
Goldstein: What did you do for your Ph.D.?
Huang: For my M.S., I did image coding using the interpolation technique. It’s an adaptive technique that pays special attention to edges. A lot of work has been done in image coding since then, but I still think my M.S. thesis is quite interesting even today. I actually looked into some of the perception issues for my doctorate, especially the subjective effect of pictorial noise and how it depends on the spectrum of the noise.
Goldstein: Describe more of what your work was like. Was it all theoretical, or were you involved in hardware systems?
Huang: It was mainly algorithms, but image processing was so primitive then that we had to be concerned with equipment as well. We had to build our own scanner for digitizing images and reproducing them. We built one of the first image scanners in our lab using a CRT. I was using a Lincoln Lab prototype computer called the TX-0, probably the first transistorized computer. I had to program in assembly language. After digitizing each image we had to store the data on paper tape with punched holes. We fed that into the computer and then had to write the result on paper tape again. I remember an image of just 240 x 240 pixels took three rolls of tape.
Goldstein: When image processing was in its infancy, where did people publish the results?
Huang: MIT and Bell Labs were the central places doing image processing and compression at that time. A lot of the work from Bell was published in the BSTJ; more generally, I don’t even remember. I guess the IEEE Transactions.
Goldstein: Were there conferences?
Huang: We held the first image coding conference in 1969 at MIT. I organized it along with my colleague Oleh Tretiak, who is now at Drexel University. We called it something like a picture coding symposium. It was the first such meeting and it was very successful. We had people coming from all over the world, including some now-famous names in the field like Hans Musmann from Hanover, Germany. At that time he was a young, young guy. The symposium was sponsored by the IEEE, I think the Boston chapter, and since it was so successful it continued annually or every one and a half years. It’s still going on.
Goldstein: So you were working on algorithms for your system—tell me something about how you scouted-out research problems and framed a solution.
Huang: Actually, the formulation of a problem and the approaches to it are to a large extent by chance. There is really no fixed way. In many cases the goal is clear, you want to compress images. But you get ideas for approaching it from all over: you read someone else's paper, or hear something at a conference.
After my thesis I was still interested in compression, so I worked on both facsimile (binary) and continuous-tone images. The methods for document images at that time were one-dimensional: single scan lines were coded. I tried to extend it to 2-D and I did some interesting work. I found out much later, after contact with people in Japan, that my work on binary document compression had a big impact there. In fact, the standards today for fax compression are based on a proposal from Japan called modified READ. That was based on some of my earlier work. I was really happy to find that out; usually I do things just for fun, but it’s nice to know when it's useful.
Goldstein: You said just now that you worked for fun, but earlier you referred to a practical goal: compressing images. Was there something about its usefulness that made the theoretical problem compelling? Did you want to compress images simply because it was an interesting problem, or did you have applications in mind?
Huang: In the good-old-days it was more fun, but now things have changed. We are more application-driven now, even if our main interest is still in basic research. At that time I was more romantic—I did things just for fun and didn't worry too much about the application.
Goldstein: Tell me more about the one-line scanning for fax images and how you tried to extend it to two dimensions.
Huang: The ideas are fairly simple. First of all, I did some careful statistical analysis of the one-line coding to learn about its performance. Then there are several ways to try and extend it to 2-D. One approach is to assume you have already transmitted a line, then for the next line you basically transmit only the transition point. I did statistical analysis predicting the performance of this method, and it turned out that the theoretical predictions fit very well with the experimental results.
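Editorial illustration (not part of the interview): a minimal sketch of the two-dimensional idea described above, in which a binary scan line is coded by how far its black/white transition points moved relative to the line already transmitted. The function names and the fallback to absolute positions are assumptions made for this example; this is not the modified READ code or Huang's actual scheme.

```python
def transitions(line):
    """Positions where a binary scan line changes color, starting from white (0)."""
    positions = []
    previous = 0
    for i, pixel in enumerate(line):
        if pixel != previous:
            positions.append(i)
            previous = pixel
    return positions


def encode_line(current, reference):
    """Code the current line relative to the already-transmitted reference line."""
    cur, ref = transitions(current), transitions(reference)
    if len(cur) == len(ref):
        # Same number of edges: send only how far each edge moved.
        return ("relative", [c - r for c, r in zip(cur, ref)])
    # Hypothetical fallback for this sketch: send absolute positions (1-D style).
    return ("absolute", cur)


def decode_line(code, reference, width):
    """Rebuild a scan line from its code and the reference line."""
    mode, values = code
    if mode == "relative":
        positions = [r + d for r, d in zip(transitions(reference), values)]
    else:
        positions = values
    edges = set(positions)
    line, color = [], 0
    for i in range(width):
        if i in edges:
            color = 1 - color  # flip color at every transition point
        line.append(color)
    return line


reference = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
current = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
code = encode_line(current, reference)
restored = decode_line(code, reference, len(current))
```

Because neighboring lines of a document tend to have edges in nearly the same places, the relative offsets are mostly small numbers, which is what makes them cheap to code.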
Goldstein: Does that work only for certain classes of images?
Huang: It depends. It turned out that for most images the distribution of the change of the boundary is fairly similar, but with some specialized images they are different. You may have to change your coding method.
Goldstein: So by now we’re into the mid-60s.
Huang: At that time we also worked on coding of continuous tone images. One of the most popular coding methods right now is based on transform coding. Today's standards like JPEG and some aspects of MPEG are based on the discrete cosine transform, which is basically a Fourier transform. You divide the image into blocks, typically 8 by 8, and then do the discrete cosine transform on each one and then quantize the coefficients, and that’s the compressed data. Actually, a number of people way back in the late '60s invented this method; we were one of the teams, the other was at the University of Southern California.
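Editorial illustration (not part of the interview): a minimal sketch of block transform coding as described above, applying an orthonormal 8 by 8 discrete cosine transform to one block and quantizing the coefficients. The single uniform quantizer step and the test block are assumptions for this example; real standards such as JPEG use per-coefficient quantization tables followed by entropy coding.

```python
import math

N = 8  # block size mentioned in the interview


def scale(k, n):
    """Normalization that makes the DCT-II orthonormal."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct_1d(v):
    """Orthonormal DCT-II of a sequence."""
    n = len(v)
    return [
        scale(k, n) * sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def idct_1d(c):
    """Inverse of dct_1d."""
    n = len(c)
    return [
        sum(scale(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for k in range(n))
        for i in range(n)
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct_2d(block):
    """Separable 2-D transform: rows first, then columns."""
    rows = [dct_1d(r) for r in block]
    return transpose([dct_1d(c) for c in transpose(rows)])


def idct_2d(coeffs):
    rows = [idct_1d(r) for r in coeffs]
    return transpose([idct_1d(c) for c in transpose(rows)])


def quantize(coeffs, step):
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    return [[v * step for v in row] for row in levels]


# A smooth block (gentle ramp), typical of natural image regions.
block = [[100 + 2 * x + y for x in range(N)] for y in range(N)]
levels = quantize(dct_2d(block), step=4)  # this is the "compressed data"
restored = idct_2d(dequantize(levels, step=4))

nonzero = sum(1 for row in levels for v in row if v != 0)
max_err = max(abs(restored[y][x] - block[y][x]) for y in range(N) for x in range(N))
```

For a smooth block most of the 64 quantized coefficients come out as zero, so only a handful of numbers need to be sent, at the cost of a small reconstruction error.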
Goldstein: Simultaneous invention?
Huang: Yes. It's not clear that one team can be pointed to as the creator of transform coding, but we are among the few who invented it.
Goldstein: That’s an interesting thread of technological development to look at. Let's start from the beginning, before there was a coding standard, then trace the developments.
Huang: The earliest method for compressing continuous-tone images was actually invented at Bell Labs. That’s the differential PCM.
Goldstein: Pulse-coded modulation.
Huang: Well, pulse code modulation just quantizes each pixel. The differential pulse code modulation, DPCM, quantizes the difference between successive pixels. The idea is that the difference in most cases will be very small, so you don’t need as many bits for that as for the original. This was the method before transform coding.
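Editorial illustration (not part of the interview): a minimal sketch contrasting PCM, which quantizes each pixel on its own, with DPCM, which quantizes the difference from the previous (reconstructed) pixel. The step size and the sample scan line are assumptions for this example.

```python
def pcm_encode(pixels, step):
    """Plain PCM: quantize every pixel independently."""
    return [round(p / step) for p in pixels]


def dpcm_encode(pixels, step):
    """DPCM: quantize the difference between each pixel and the decoder's
    reconstruction of the previous pixel."""
    codes = []
    prediction = 0
    for p in pixels:
        code = round((p - prediction) / step)
        codes.append(code)
        prediction += code * step  # track exactly what the decoder will rebuild
    return codes


def dpcm_decode(codes, step):
    """Accumulate the quantized differences to rebuild the scan line."""
    out = []
    value = 0
    for c in codes:
        value += c * step
        out.append(value)
    return out


scan_line = [100, 102, 103, 103, 105, 110, 111, 111]
pcm_codes = pcm_encode(scan_line, step=2)
dpcm_codes = dpcm_encode(scan_line, step=2)
decoded = dpcm_decode(dpcm_codes, step=2)
```

After the first sample, the DPCM codes are small integers clustered around zero, while the PCM codes stay near 50; the smaller range is what lets DPCM spend fewer bits per pixel.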
Goldstein: When was DPCM invented?
Huang: Bell Labs in the late '50s.
Goldstein: So that was the dominant paradigm just when you were getting involved?
Huang: That’s right, and then transform coding started in the late-60s and dominates even today. Now there are some new approaches, one is wavelet based and the other is fractal coding, which is very interesting.
Goldstein: Let's wait on those, give some more detail on what led you to transform coding.
Huang: Actually, the motivation for transform coding came from a paper by Huang and Schultheiss. It's a landmark theoretical paper on block quantization. Instead of quantizing each pixel one-by-one, the idea is to divide the image into blocks. They then take a block at a time and quantize the pixels simultaneously, taking advantage of the correlation. They show, in some sense, that the optimum transform is the so-called Karhunen-Loeve transform. It decorrelates the samples, and then you quantize each one. But it’s image dependent and very computationally intensive. So we looked into using different transforms; we tried Hadamard and Fourier. It turned out the Fourier Transform will almost decorrelate the samples if the block size is reasonably large. And the advantages are, first, that it is independent of the image, so you can use a fixed transform, and second, that it has fast algorithms, the FFT.
Goldstein: What did you work on next?
Huang: After getting my degree from MIT, I was on the faculty there from 1963 until '73. In addition to compression I started to look into some of the other research problems. One is related to multi-dimensional digital filtering, especially the stability issue. I did some theoretical work on how to test the stability of two-dimensional filters and some of these results are now fairly standard. They appear in textbooks on multi-dimensional digital signal processing.
Goldstein: Did your work on image processing intersect with what was being done on what's now called signal processing, things like filters?
Huang: If you interpret signal processing in a broad sense, it covers almost everything. In a narrow sense, maybe it’s just transform and filtering. I think we should take a broader view than just filtering. With signals there are several different areas: one is representation and compression of a signal, second is enhancement and reconstruction, and third is analysis—detecting a signal and recognizing its features.
Goldstein: I would imagine that in the early days, these areas of inquiry were all fairly close together, people in the field were well-exposed to each of them, but over time, they begin to specialize. Is that your perception?
Huang: Yes, I think that happened in some sense. For example, speech processing became a big area by itself, and image processing got big. But I think these are still part of signal processing, because if you maintain a narrow look at the area of filtering and transform, I think, you are going to run out of problems, unless you get into some deep theoretical issues.
Goldstein: Some say these are the same problems, just with different boundary conditions. Do you agree that image and speech processing are fundamentally similar?
Huang: You're right; one of the more recent trends is multimedia. We are getting into several projects which involve both speech and images and have found some of the speech recognition techniques very useful in image analysis, for example the Hidden Markov Model idea. It’s one of the most common techniques for speech recognition right now and we are trying to use it for hand-gesture recognition.
Goldstein: There’s a convergence now, but I’m wondering about the points of differentiation back in the beginning. If you were specializing in image processing in the '60s, what might you know that a speech processing or digital filtering person would not know? What were the points of differentiation?
Huang: Actually, at that time I don’t think that there was much collaboration between speech and image. People worked separately and didn't really know each other, but by chance some of the techniques in speech and images tended to be similar. For example, in speech you have linear predictive coding and in image you also have predictive coding, but nobody knew.
Goldstein: These techniques were developed independently, reinvention of the wheel?
Huang: To some extent, but now I think people are more aware of each others' work. Something that was happening then and may still be true is that there was a strong connection between image processing and optical signal processing. I did some work trying to make a hologram digitally and in using optical filtering to process the image.
Goldstein: So, it was not digitized at all?
Huang: Using optics you can do linear filtering and Fourier Transform very easily, but the problem is to synthesize the filter, and the most flexible way is to do it digitally. You choose a filter, then write it out on a piece of film and develop it into a transparency and use that as the filter in the optical setup.
Goldstein: When were you doing that?
Huang: That was again in the late-60s.
Goldstein: Did that research lead to anything?
Huang: Much later; I think it was premature at the time. The computational power was not sufficient for digital holography. People started looking into digital holography again recently, but even the supercomputer may not be enough.
Goldstein: Did you abandon it because of insufficient computing power?
Huang: At the university, we really need students to work out our ideas. The choice of research projects also depends, to a large extent, on the interests of the students. I had one very good student working on digital holography, Wai-Hon Lee. He invented some new ways of synthesizing holograms, but after he left I didn’t find another good student interested in the area, so I didn’t pursue it.
Goldstein: Earlier while talking about transform coding you referred to a group that you were with, can you tell me who else was involved?
Huang: At MIT, it was really my students and myself. One of the students who worked on this was John Woods. He is now a professor at RPI. He is very well known in signal processing.
Goldstein: About how big was the group, or was it just you working with one student?
Huang: For transform coding, it was really me and two students, John Woods and Grant Anderson. But I was working in a group which involved several other people including my advisor Bill Schreiber and Oleh Tretiak.
Goldstein: You also mentioned a group over at USC.
Huang: That was two people, Bill Pratt and Harry Andrews. Both of them later left USC and started their own company.
Goldstein: As you began to publicize the ideas of transform coding, what impact did it have on the field?
Huang: I don’t really know how things developed; I didn’t publicize it at all. We published papers, but how people decided to include that material in the standards I don’t really know. It’s also not clear whether it was a good decision; maybe there are better methods.
Goldstein: Earlier we sketched out different eras. You said that new ideas began challenging transform coding, for example, wavelets. Tell me more about that.
Huang: Wavelets happened much later. They are a more recent development, again ideas I borrowed from mathematics. The mathematicians have been studying wavelets for a long time and the physics people have applied it to quantum mechanics. A number of signal processing people realized it had potential for images, not only compression, but more generally the multi-frequency type of representation. So a number of people tried to apply it to images, beginning probably in the late '80s.
The Fourier Transform spreads things all over. If you have a small object in the image, its Fourier transform is spread over the whole frequency domain. So, it has advantages and disadvantages. If you want to search for the object, you cannot do it in frequency domain because it’s spread all over. Also, if you make mistakes or have errors in the frequency domain, they are spread all over the image. On the other hand, the wavelet transform is concentrated in both frequency and in the spatial domain. The transform domain has several different layers with different frequencies in the components. In each layer, the original object remains concentrated, not spread out.
One application for image representation is in retrieval. You want to retrieve images, but in the meantime you want to represent the image in your database in an efficient way. So, you want to compress, but still be able to retrieve different objects. If you use the Fourier Transform you have to decompress before you can search for your object, but if you're in the wavelet representation, you can search for objects directly (although this is not completely done yet). People worked on optimizing both the wavelet and the discrete cosine transform, and it seems that at this time the wavelets give better performance. I think future compression standards will include wavelets.
Goldstein: What developments took place in discrete cosine transform from the late-60s through the 80s?
Huang: I think the only development was that various people tried to come up with faster algorithms for computing the DCT. But in terms of the compression, I don’t think much has happened since the early '70s. It really takes a long time for a technique to be incorporated into commercial devices. Take the fax. The main idea was there way back in the '60s, but not until about ten or fifteen years ago did the fax become popular.
Goldstein: My first guess would be that’s an issue of having adequate cheap computing power.
Huang: I don’t think so. I don’t know the reason.
Goldstein: Your description of wavelets is illuminating. Would you also explain the discrete cosine transform?
Huang: To me, the discrete cosine transform is basically a Fourier Transform. The only difference is that with Fourier, you start with an image where every pixel is real or even positive, but when you take the Fourier Transform, the coefficients are complex numbers, so they are harder to manipulate. The discrete cosine transform is essentially the real part of the Fourier Transform. You are dealing with only real numbers, but the behavior is similar to the Fourier Transform. The Fourier Transform is simply: take a signal and decompose it into a summation of sinusoids at different frequencies.
Goldstein: If an image were encoded this way, what trade-offs would an engineer face?
Huang: For the DCT, one question is what block size to use. Actually, the trade-off is more complicated. Take, for example, the mean-square error. You compare the decoded image with the original by subtracting corresponding pixels, squaring the error, adding them up and so on. It turns out that you’ll get the smallest mean-square error by using the whole image and taking just the transform, instead of dividing it into blocks. But if you are building a real time system, you cannot have such a big transform. In addition to that, the mean-square error is not really a good measure of the subjective quality, because if you have sharp edges in small areas, they won’t show up in the transform of the whole image very strongly. So, even if you have a small mean-square error locally, that method may degrade the sharp edges. In most cases, it’s better to divide the images into blocks, then you can preserve the strong edges. So even subjectively there is a question of trade-off. No one has really done a careful study of that. In the standard JPEG, people use 8 by 8, just from an implementation consideration, because at least until a few years ago they could build only 8 by 8 DCT chips. If they could have built 16 by 16, I think they would have.
Goldstein: You mentioned that 1970s work included tinkering with algorithms to speed up the relationships. That suggests two approaches to development: one is to improve the performance of methods you already have, the other is to invent whole new equations that are conceptually superior. Is that a fair dichotomy? Where was the research focus in the '60s, '70s and after?
Huang: After the invention of transform coding people tried to optimize it in various ways, but I think it's more interesting to look at some other method entirely. My group didn’t do much work after the first invention of the transform coding, we went on to other issues. I tend to work on new things, then when many people come into the field I’ll leave and do something else.
But by the way, I didn’t talk about fractal coding. Fractal coding is very interesting, have you heard about it?
Goldstein: No; start from scratch.
Huang: First, there was differential PCM, next was transform coding, and then came fractal coding, a really novel idea in image coding. After transform coding, nothing really new happened in image coding, people just tried to improve it, at least until Barnsley's idea of fractal coding. Michael Barnsley was a mathematician at Georgia Tech, then he left to form his own company. The wavelet method is still a traditional signal processing approach, it's not that different from transform coding. Fractal coding is completely new.
The idea is, you have an image and you try to find a mathematical function—a system—for which this image is an attractor, what in mathematics they call a fixed point. You have a system for the input. If you can find an input which, when fed into the system, looks exactly the same when it is output, then this input is called a fixed-point of the system. For example, if the system is taking the square root of a positive number, then the number one is a fixed point, but any other number is not a fixed point.
Goldstein: It sounds like an identity element.
Huang: The interesting thing is that for most systems, if the input is not a fixed point, and you repeatedly feed its output back into the system, it will approach a fixed point. With any real, positive number, if you take the square root, and the square root of that, again and again, it will approach one. So, the idea of fractal coding is to take an image and find a system for which this image becomes a fixed point. The description of the system is your compressed data. Once you have this system, you can take any image, feed it in iteratively, and eventually you will get the original image back. You try to make this system simple, like linear transformation. Fractal coding is a completely radical idea.
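Editorial illustration (not part of the interview): a minimal sketch of the fixed-point idea, first with the square-root example given above and then with a toy contractive system whose two parameters play the role of the "compressed data". The affine map and its parameter values are assumptions for this example; actual fractal coders use collections of block-wise affine maps from an image onto itself.

```python
import math


def iterate(system, start, steps):
    """Feed the output of `system` back into it `steps` times."""
    value = start
    for _ in range(steps):
        value = system(value)
    return value


# The interview's example: repeated square roots of any positive number approach 1.
from_large = iterate(math.sqrt, 1000.0, 60)
from_small = iterate(math.sqrt, 0.001, 60)


def make_affine(scale, offset):
    """A simple contractive system x -> scale * x + offset, with |scale| < 1."""
    return lambda x: scale * x + offset


# The two parameters describe the system; its fixed point is
# offset / (1 - scale) = 21 / 0.5 = 42, whatever input we start from.
system = make_affine(0.5, 21.0)
decoded_a = iterate(system, 0.0, 60)
decoded_b = iterate(system, 255.0, 60)
```

Both decodings converge to the same value even though they start from different inputs, which mirrors the statement that any starting image can be fed in iteratively to recover the encoded one.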
Goldstein: When did it emerge?
Huang: It started with Barnsley in the late 1980s; he's working on it even now in his own company. Since Barnsley is a mathematician and devoted to his company and its profits, people in signal processing are not getting much from him anymore.
Goldstein: He’s being proprietary in his research.
Huang: That’s right, but there has been a lot of work in fractal coding. So far, the best fractal coder is not quite as good as the best wavelet, but fractal has some advantages. One is that once you have the data encoded, you can decode it at different resolutions. This would be useful in a heterogeneous system, one with different terminals and resolutions.
Goldstein: What brought you to Purdue?
Huang: As I said earlier, I tend to leave a field when it gets too crowded. I think it’s good to change jobs from time to time, otherwise you become stale. A long time ago, I set ten years as about the time to stay at one place. After ten years at MIT, an opportunity at Purdue came up. They wanted to start a laboratory for information and signal processing and were looking for a director. I went there to head up the lab and also started getting more into image enhancement instead of compression. I gradually became more interested in image restoration, recognition, and so forth.
At Purdue we looked into some of the nonlinear filters, especially the so-called median filters. The median filter is for reducing noise in the image. The conventional way is to replace each point by the local average in the neighborhood to smooth out the noise. But it's not very effective when the noise is a spike type—salt and pepper—which is very common in digital transmission. It turns out that the median filter is much better. For a given point, you take the neighborhood around it, and instead of replacing the middle point by the mean, you replace it with the median gray level. It takes out spikes very easily and also has the nice property of keeping edges sharp. If you use the mean, the edges smear. We looked into this and found a very efficient algorithm. It became very popular; many of the software packages today use this algorithm.
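Editorial illustration (not part of the interview): a minimal sketch comparing a 3 by 3 mean filter with a 3 by 3 median filter on a tiny image that has a sharp edge and one noise spike. This is the straightforward version that recomputes each neighborhood from scratch, not the efficient algorithm mentioned above; the test image and the choice to copy border pixels unchanged are assumptions for this example.

```python
from statistics import mean, median


def filter_image(image, reduce):
    """Apply `reduce` to each 3x3 neighborhood; border pixels are copied unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = reduce(window)
    return out


# A vertical edge between gray levels 10 and 200.
clean = [[10, 10, 10, 200, 200, 200] for _ in range(4)]

# The same image with one "salt" spike of 255 on the dark side.
noisy = [row[:] for row in clean]
noisy[1][1] = 255

by_median = filter_image(noisy, median)
by_mean = filter_image(noisy, mean)
```

The median output equals the clean image: the spike is gone and the edge is untouched. The mean output spreads the spike into its neighbors and turns the step between 10 and 200 into a ramp.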
Goldstein: What else did you do at Purdue?
Huang: We got more into analysis, like recognition or matching different patterns using graph representation and so forth. One of the main reasons I moved to Purdue was that I had a good friend there, King-Sun Fu. He was very good in recognition, not just with images but also speech. We worked together, but unfortunately he passed away from a heart attack.
Goldstein: Was it difficult to set up the laboratory at Purdue? Was there enough research talent there?
Huang: Yes. I think Purdue was and still is a very good place. At that time there were a number of people in image processing and we got a sizable grant from DARPA for start-up.
Goldstein: What were your reporting requirements to DARPA? Was it unrestricted money?
Huang: Oh, no, it was not unrestricted, but DARPA funding supposedly supports high-risk research, basic research. There is some general objective but other than that it’s very flexible.
Goldstein: Did a lot of new laboratories start up at this time? In the beginning there were just a few organizations doing research like this.
Huang: That’s exactly right. Our DARPA grant is one of a number of grants under a program which still exists, although they have changed the name several times. Now it is called Image Understanding. I don't remember what it was called at that time, but they funded a number of universities to research image processing.
Goldstein: Do you remember any of the others?
Huang: Universities who are in the program now include Maryland, MIT, Stanford, Columbia, and the University of Washington.
Goldstein: Apart from DARPA's initiative, do you think schools were interested in launching programs like these, adding special laboratories or new specialties to their engineering schools?
Huang: I think there was a lot of interest among students, but from the university point of view, almost all research is supported by outside funding. Unless there is outside funding, it is not easy to start large-scale labs. Image processing requires quite a bit of equipment.
Goldstein: Sounds like through the '70s, even the '80s, there were still just a small number of centers doing the most exciting work.
Huang: Yes, that’s true. At Purdue I started looking into the problem of estimating 3-D motion from 2-D sequences of images, which has many applications. One is in compression of television images: you want to estimate motion and compensate for it, then take the difference between frames and transmit that. That’s the basic idea among recent standards, but in current standards, the motion estimation is done fairly crudely. I still feel there is a need for better ways of estimating 3-D motion.
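Editorial illustration (not part of the interview): a minimal sketch of motion-compensated frame differencing, using exhaustive block matching over 2-D translations. This is the kind of crude 2-D estimate referred to above, not the 3-D motion estimation Huang worked on; the frame contents, block position, and search range are assumptions for this example.

```python
def crop(frame, top, left, size):
    """Extract a size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))


def estimate_motion(prev, cur, top, left, size, search):
    """Return (dy, dx) such that the block of `prev` at (top + dy, left + dx)
    best matches the block of `cur` at (top, left)."""
    target = crop(cur, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > len(prev) or x + size > len(prev[0]):
                continue  # candidate block would fall outside the frame
            cost = sad(target, crop(prev, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]


def make_frame(obj_top, obj_left, height=8, width=8):
    """Black frame containing one textured 2x2 object."""
    frame = [[0] * width for _ in range(height)]
    for y in range(2):
        for x in range(2):
            frame[obj_top + y][obj_left + x] = 200 + 10 * y + x
    return frame


prev = make_frame(2, 2)
cur = make_frame(3, 4)  # the object moved down 1 pixel and right 2 pixels

dy, dx = estimate_motion(prev, cur, top=2, left=3, size=4, search=2)
target = crop(cur, 2, 3, 4)
predicted = crop(prev, 2 + dy, 3 + dx, 4)
residual = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(target, predicted)]
plain_diff = sad(target, crop(prev, 2, 3, 4))
```

The motion vector points from the current block back to its match in the previous frame, so an object that moved by (+1, +2) gives (-1, -2). With compensation the residual to transmit is all zeros, whereas the uncompensated frame difference is not.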
Goldstein: What are some of the techniques and when did you begin working on them?
Huang: We published several papers after some early work, then many people joined in. No one had really looked into this before, so we spent a long time, a couple of years, just trying to formulate the mathematical problems. We found some very interesting theoretical questions about uniqueness. If you have video of moving objects, how many frames do you need in order to determine the 3-D motion? From 2-D to 3-D, it's basically sort of an underdetermined situation. The question is whether you can get enough information to give you these 3-D motion parameters. So, we did some basic work in uniqueness. If you just write out the equations and try to solve for the parameters from observed images you have non-linear equations which are not easy to deal with. So we came up with some very elegant linear algorithms for doing that. Again, what we do at universities depends a lot on what kind of students we have and I was very fortunate to have Roger Tsai as a student, he’s really good. The two of us worked intensely together for several years. He's at IBM right now. He followed me to Illinois when I moved and finished there. A couple of summers I was away, he followed me and worked very hard, almost twenty-four hours a day. In Lausanne I remember he lived in his office and was haunted by the janitor and her dog; they had to drive him out.
Goldstein: I was wondering about this earlier. Did you do this type of 3-D work because it was a natural extension of what you were already doing? It sounds like there was no specific application you were trying to develop.
Huang: I guess there are several motivations, whether conscious or subconscious. One is to move toward what people call computer vision. One of the main problems in computer vision is to try to get 3-D information from 2-D images. There's an analogy in human vision—we have 2-D images on the retina, but we can deduce 3-D information about shape and motion. One of the main aims of computer vision is to try to come up with computer algorithms which can interpret 3-D in terms of 2-D images, especially shape and motion.
Interestingly enough, motion estimation leads back to compression, because in some scenarios, like video phones, you're looking at a fairly restricted class of scenes: basically you look at a person's face. There is a big need for very low bit-rate video phones. Even now you have video conferencing systems at 64 or 128 kilobits using special ISDN lines. The quality is not that good; even moderate movement causes severe blurring. One idea for a really low bit-rate video phone is to use a 3-D model. You construct a 3-D computer graphic model of a person; the receiving end has this model, so you don’t have to transmit the images in the conventional way. At the transmitting end, you extract only movement information, which hopefully requires only a small number of bits to transmit. At the receiving end, this movement information drives the model and regenerates the sequences. So there is modeling, analysis, and synthesis. Probably the most difficult part is analyzing not just the global head motions, but also facial movements. That's a project we are still working on now.
Goldstein: What are some landmarks in developing a system that will do that?
Huang: There is an international committee called MPEG-4. We have the MPEG-1 and MPEG-2 standards already established, and the next one is MPEG-4. Hopefully the standards will be fixed next year. Originally, the MPEG-4 goal was to achieve very low bit-rate video coding. But gradually it has extended the range of bit-rates that are of interest from the very low, like a couple of kilobits per second, to maybe 1.5 megabits. The emphasis is on multi-functionality: compression methods which are useful for different applications, not just video phoning and teleconferencing, but maybe databases, virtual agents, things like that. One subgroup of MPEG-4 is called Synthetic Natural Hybrid Coding. It is concerned with problems like combining natural scenes with synthetic objects, like synthetic humans. The question is how do you model the human and compress the data describing the human. This combines many areas: signal processing, computer vision, and computer graphics. Computer graphics people have looked into modeling of the human, so some of these techniques could be used. The analysis is really difficult and people are just starting to look at that.
Right now, most of the work is concentrated on the face, although people are starting to look into hand motion and body motion. There are several different groups working on that. We have fairly good face models. This committee has also tried to come up with a list of basic movements, like for the corners of the mouth, the eyebrows. I think the latest list has about sixty-eight different units. By combining these you can create different expressions. If you want to transmit these to the receiving end, a question is how to automatically extract a large portion of these movement-units. Several groups are working intensively on this. We recently developed algorithms for tracking the head and some key points, but are still far from extracting the 68 movement units.
Goldstein: You say MPEG-4 has changed its goals for the standards it is trying to set. How did that happen, who was influential?
Huang: First of all, MPEG meetings are really chaotic. They meet all over the world every six weeks or so. Recent ones were in Israel and Brazil. Only very dedicated companies can attend all of these meetings. I send my students to some, but not all. So, it’s not clear how the decision was made. It depends on who is pushing what. One reason they changed the emphasis is that by using the existing standards in MPEG-1 and 2 and tuning the parameters, they can get results—not great ones, but usable—down to maybe even 26 kilobits for video conferencing. It’s not clear whether it is possible to improve on that in the next couple of years. One hope is the model-based approach, but making that really work seems far in the future.
Goldstein: So none of the systems available today are model based.
Huang: No. The main difficulty is the analysis part, and it's not clear when this will get resolved. For two-way communication, like video phoning and teleconferencing, the model-based approach is still really long term. They want to have the short-term recommendation next year. It’s not clear how the model-based approach can compete with methods which are just tuning-up existing techniques. Instead of that, they’re looking into other applications of the model-based approach which may not require the analysis, like virtual agents or animation. Maybe you can do the analysis by hand; you don’t have the real-time constraint.
Goldstein: How have outside developments influenced work on standards? I’m thinking of HDTV, digital TV, and maybe things from the '70s like picture phones. Have initiatives like those shaped the direction of work in your field?
Huang: I think so. Some major applications drive work in image processing and computer vision today. One is the multi-media database—there's a lot of interest in that. That motivates some of the work in model-based coding and content-based image retrieval. Another is human interaction with a virtual environment, trying to have more natural ways to interface with the computer, using speech and gesture, for example. In one project we are trying to recognize gesture based on video input. Another trend is multi-media, multi-modality, that is, combining image analysis with, for example, speech to interface with the computer.
Goldstein: Let me ask a similar question from another direction: Did external things like the picture phone cause a shake-up in research?
Huang: Well, I don’t really know how society and economy work; the technology for the picture phone was there, but Bell Labs didn't succeed in pushing it into the market. I’m not sure exactly why. I guess the combination of the cost and the need.
Goldstein: It sounds like research in image processing hasn’t been influenced by any commercial venture that uses it; it has proceeded fairly independently.
Huang: It’s influenced in some way, but I don’t think it’s a close coupling, at least in the universities. To some extent compression has been motivated by the facsimile, picture phone, and teleconferencing. But in universities you are working on problems which are more basic, rather than for any particular system.
Goldstein: So, you're never surprised by a commercial system—you knew, "of course they could do that"?
Huang: University researchers create all these tools and results and then when someone wants to start a company, they can take advantage of that.
Goldstein: Let's return to the thread of your career. You were at Purdue and had begun getting involved in 3-D.
Huang: I left Purdue for Illinois in 1980, mainly because I like to move every ten years or so, and seven years is close enough. I have been at Illinois for sixteen, seventeen years now, so I’m overdue. Several times I thought about moving, but about seven years ago, something interesting happened at Illinois. We got a donation from Dr. Arnold Beckman. Beckman graduated from Illinois a long time ago and he donated forty million dollars to start an interdisciplinary research lab. That opened six or seven years ago. It's a very nice set-up. It has people from not only engineering, physics, and computer science, but also cognitive sciences like psychology. The idea is to make these people work together on some interesting, important problems. Also, the trend is to look at big interdisciplinary issues. The solution of a problem may require people from many different disciplines.
Goldstein: That sounds similar to color TV, where they were able to reduce the bandwidth by taking advantage of perceptual effects.
Huang: Yes. So, now we have this very nice infrastructure. I think the Institute has close to one thousand people, about one-hundred professors from more than twenty departments, plus graduate students. We have three major research themes which serve as focal points for interdisciplinary research: biological intelligence; electronic and molecular nanostructure; and the human/computer intelligent interaction. I’m co-chairing this third theme, so I'm involved with planning in research in human/computer interaction. This has been very exciting and has kept me from leaving for now. There is an article on the Beckman Institute in a recent issue of IEEE Spectrum.
Goldstein: Have you kept your emphasis on basic research instead of industrial research while at Beckman, or has that changed some?
Huang: Up until now, most of our research has been supported by government agencies—National Science Foundation, Department of Defense—but we are trying to connect more to industry. We are starting to get fairly large grants from industry.
Goldstein: Do you notice a difference in the kind of problems that you're investigating?
Huang: We try to get grants from industry for basic research instead of short-term work, which is not easy. Research projects need to be defined in ways that are suitable for student theses, so we need some flexibility.
Goldstein: We jumped over your first ten years or so at Illinois. Could you tell me something about what you were doing there, in 1980?
Huang: I started research into 3-D motion at Purdue, but I barely got started before I left. At Illinois, motion estimation was the main topic during the first seven-to-ten years. I started with rigid motion; one approach is to try to track points on the image plane and from that find the 3-D motion. We’d look at tracking lines and so forth. Then we got into non-rigid motion. I looked into the estimation of heart motion and facial motion, as well as turbulent fluid flow.
Goldstein: I'm interested in the relationship of this kind of research to pure mathematics. Would you say that these investigations into estimation used math that was fairly well developed, or was it new?
Huang: They may be using it in different ways, but most of the work in signal processing uses mathematics which is known. Even the fractal work is not really new, it’s just different ways of doing things.
Goldstein: So, there hasn’t been much transfer of innovation from signal processing over to math?
Huang: To a large extent that’s right, although there's some information flowing in the other direction. For example, the wavelet. Most of the wavelet theory in mathematics is on one-dimensional signals. When people look into the application of wavelets to images, there are some issues about two-dimensional wavelets. So, there is some math which has to be resolved.
Goldstein: Can you say something about how work in your area, image processing, has been represented in the IEEE Signal Processing Society?
Huang: I think the compression, enhancement, and reconstruction parts are very well represented, especially now that we have the image processing Transactions. But I think only a small piece of the analysis part is represented in the Signal Processing Society. There is the Transactions on Pattern Analysis and Machine Intelligence, which probably has more of the analysis type of work.
Goldstein: How about a long time ago, back in the '60s, was image processing one of the hot topics in the professional group then?
Huang: Gee, I don’t really remember when image processing became an active area in the IEEE Signal Processing Society. I don’t remember when the technical committee on multi-dimensional signal processing got started. Even when that got started, some of the issues were signal processing in the narrow sense, like filtering, transforms. Later, more people got into image processing, so the name was changed to Image and Multi-Dimensional Signal Processing. The name change is fairly recent, just a few years ago.
Goldstein: The area grew so fast and got kind of crowded. Was that a problem for people in your field?
Huang: Image processing is such a broad field, I don't think crowding is really a problem. For example, a sub-area is medical image processing, and that has grown a lot also.
Goldstein: I want to be sure that I’ve got a good grasp of the major milestones in image processing. The milestones can come in several different categories, like systems. Can you think of a category of milestones that I may be missing?
Huang: I guess one way of looking at this is to look at the different goals: compression, enhancement, restoration, and analysis. In compression, the milestones are fairly clear: differential PCM, transform coding, then wavelets, fractals and now perhaps model-based compression. In the enhancement area, a major milestone of image reconstruction that we have completely ignored in this discussion is computer tomography. That’s really one of the major achievements in image enhancement and reconstruction.
Goldstein: Let’s finish this thought and then go back to that.
Huang: Sure. More recently of course we have magnetic resonance imaging. For analysis, the milestones are harder to say, there are so many different facets. Object recognition in images has been going on for a long time, but it’s hard to pinpoint milestones. Object recognition is maybe too vague—we have to be more specific. I would think OCR, fingerprints, and more recently, many people are getting into face recognition.
Goldstein: Now let’s get back to computer tomography.
Huang: There’s an interesting story there. I told you when I was at MIT, one of my colleagues was Oleh Tretiak, who is now at Drexel, and he and I worked very closely together. He is one of the smartest persons I have ever known, except he spent all his time trying to help other people instead of writing his own papers. We were students together and graduated about the same time in '63 and then stayed on the faculty together. One day in the late '60s he rushed into my office and said, “Oh, I have a new discovery: I can reconstruct 3-D from 2-D slices by using the Fourier Transform." He told me the story and how to do it. It just so happened that a couple of days before I had read a paper in Nature, I think by Francis Crick, describing the same kind of reconstruction, but with respect to molecules rather than computer tomography. I told him I had unfortunately read a paper with ideas which were almost the same, so he was very disappointed. This same mathematical reconstruction technique appeared a number of times in different applications. One was computer tomography, which came a little bit later, and the earlier one was Francis Crick's work on reconstructing molecules from electron-beam images. Then of course it came again in magnetic resonance imaging. So, actually a number of people got Nobel Prizes for applications which are all based on this principle, and Oleh Tretiak missed out by only a few months. This was the mid to late '60s.
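[Editor's illustration: the transcript does not give the details of the method, but Fourier-based reconstruction is commonly explained through the projection-slice theorem: the 1-D Fourier transform of a projection of an object equals a central slice of the object's higher-dimensional Fourier transform. The following minimal sketch checks this for a single projection direction on a small 2-D array. It assumes a discrete setting and an axis-aligned projection, and the array values are arbitrary test data.]

```python
import cmath


def dft1(signal):
    """Naive 1-D discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[x] * cmath.exp(-2j * cmath.pi * k * x / n) for x in range(n))
        for k in range(n)
    ]


def dft2_at(image, ky, kx):
    """One coefficient of the naive 2-D discrete Fourier transform."""
    h, w = len(image), len(image[0])
    return sum(
        image[y][x] * cmath.exp(-2j * cmath.pi * (ky * y / h + kx * x / w))
        for y in range(h)
        for x in range(w)
    )


image = [
    [0, 1, 2, 0],
    [3, 5, 1, 0],
    [0, 2, 4, 1],
    [1, 0, 0, 2],
]

# Projection along the vertical direction: sum each column.
projection = [sum(image[y][x] for y in range(len(image))) for x in range(len(image[0]))]

# Transform of the projection versus the ky = 0 line of the 2-D transform.
slice_from_projection = dft1(projection)
central_slice = [dft2_at(image, 0, kx) for kx in range(len(image[0]))]

max_error = max(abs(a - b) for a, b in zip(slice_from_projection, central_slice))
print("projection:", projection)
print("largest mismatch between the two spectra:", max_error)
```

[Collecting projections from many directions fills in the Fourier domain slice by slice, after which an inverse transform recovers the object. That is the general principle shared by the applications mentioned in the interview.]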
Goldstein: Before computer tomography matured to commercial use, what contributions helped to solidify that technique?
Huang: People tried different algorithms for doing the reconstruction. Then it became more of a development and engineering issue.
Goldstein: Is there anything in your work or in the field that we’ve overlooked?
Huang: One point I want to make again is that the trend now is for different fields to come together to work on interdisciplinary problems. Where I am at Beckman, I see several such interactions. One is signal processing, image processing, and computer vision coming together with computer graphics—in many applications you need analysis as well as synthesis, like in the modeling approach to video compression. The other interdisciplinary effort is in multi-modality, the merging of image analysis with speech in solving problems. So, I think we’ll see more and more of these interdisciplinary efforts. In another example, we are getting into image video databases, so we have to work together with people in computer science, data structure, and information retrieval.
Goldstein: Thank you very much.