3D Acoustic Source Localization


This project was not only the one into which I have invested the most time, but also the most difficult. The premise is simple enough - by figuring out the time delay between a few closely spaced microphones, one can infer the location of the sound. The earliest implementations of this date as far back as the seventies, and there was really very little new thought I had to add to the idea. It was worth it, however, just to see if I could do it.
The first thing I did was to break the problem up into a few different sections, discrete areas that I knew would be difficult and that I would have to tackle individually. The four main sections are a) audio capture, b) calculating the delays, c) calculating the location from the delays, and d) displaying all of the information in an intuitive fashion. Of course, I naturally started backwards - you would think it would be useless to display data that you don't yet have, but the truth is that a robust and simple user interface speeds up debugging and experimenting by leaps and bounds.

User Interface
This part was definitely the one I had the most experience with. In a few hours, I had a nice and simple windowed DirectX application that could display various geometric shapes, textures, and text so that I could craft a nice interface as I started developing the rest of the program. I already had a pretty good idea of how I wanted it to look and act, much of my experience coming from the old computer game Homeworld. The goal here is to display the sources in 3D space as they appear and disappear, in such a fashion as to give the user a good sense of where they are - a similar problem as displaying spaceships (and their precise locations) in 3D. So, I had a flat 25 cm grid in the program with the microphones displayed as black dots to orient the viewer properly. Sources would be displayed as colored dots floating in space, with a line going straight down or up to the plane of the grid, giving the viewer a good sense of the horizontal as well as vertical location of the source. I also added some simple sliders onto the window that allow you to change variables in real-time, most of which I later removed. Lastly, in the top left of the window are four volume displays that show the current sound intensity of the four microphones, to indicate peaking and such. I revised and completed the interface as time went on, naturally, but it was nice to have a good way of displaying information, various debug values, and so on (classical step-by-step debugging is not much good for realtime applications, obviously).

Audio Capture
Luckily for us, this was an area that we (my family) was good at on the hardware end of things. We're all musicians, my dad was in a band, so we have a nice little recording studio set up with four-track (simultaneous) recording to computer (plug: check gallery page for music). The relevant hardware consists of two parts: first, a 24-track Eurodesk mixer to use as a preamplifier and basic EQ for the four microphones, and second, an M-Audio Quattro, a USB device that allows digital capture of four simultaneous audio tracks at 16 bits and 44.1 kHz. Of course, up to that point I had only used a finished recording application to capture audio - now I had to figure out how to do it from C#. The audio interface is via a driver called ASIO (handles the buffering and such, I assume), a supposedly low-latency driver. After downloading some sample code and fiddling around for a bit, I was able to figure out how the buffer callback functions worked and how to read out the data in a timely fashion. As it turns out, there are 100 buffer updates of 441 samples every second, which gives me a maximum of 10 ms for processing if I don't want to have skips in my data. I solved this by essentially double-buffering; the program always puts the new incoming audio into a recording buffer. Then, whenever the processing (separate thread) is done, the program copies data from the recording buffer into the processing buffer. This way, no data is lost or rewritten while processing occurs. When the data is captured, the only thing that actually happends aside from the data being put into a temporary buffer is that the net energy is analyzed for the four volume displays in the top left.

Now, for another footnote about audio capture, we get to the microphones. The original correlation was tested using two condenser mics that were pointed directly at each other; a sound source would be moved between them and that worked out fine. Of course, with four microphones significant coverage is a requirement, and cardioid mics are be insufficient. Omnidirectional microphones, condenser or no, are typically quite expensive. As such, it was a stroke of luck that we found the Behringer ECM8000, an omni mic for audio measurement with a completely flat frequency response, for only $50. Once we had four of those, we were all set (of course, more expensive microphones did work better, but these were sufficient for the desired level of accuracy). At one point, though, I decided that it might be a neat experiment to try and do the whole thing with cheap computer microphones. The company Koss manufactures tabletop omni mics for $10 a piece, so I figured those would be worth trying. They're tiny little condenser mics that need a +5V bias voltage to function (or maybe it's +12V, I forget). Either way, standard phantom power that puts out +48V is definitely no good, so I had to either acquire or build my own preamp. Here's where I made the mistake: I decided to design and build my own 4-mic preamp, knowing nothing about building preamps. It went downhill from there, and I lost about two days in the process. I mean, the preamps worked, and they sounded fine to the human ear, but I have the feeling something was getting garbled that stopped it from working properly. So, back to the Behringer mics it was.

Computing the Delays
If you have two signals that are identical, but time-shifted, cross-correlation is the most straightforward way to go about figuring out the amount of shift. After doing some reading on the relevant math (which I initially knew next to nothing about) and talking to my dad, I determined that the fastest way to perform the cross-correlation would be to use fourier transforms. The underlying principle is the concept of the time domain vs. frequency domain - that is to say, a signal can be represented either as a certain value at multiple points in time (say, 44100 samples to record one second of audio) or as a combination of sine waves with varying amplitudes and phases. So, if you have one second of audio, you can represent that one second of audio either as 44100 samples or the sum of 44100 sine waves, using the same amount of memory. The cross-correlation can be performed in the spatial domain (slow) or the frequency domain (fast). In order to do it in the frequency domain, both signals need to be transformed, cross-correlated (multiplied), and then inverse-transformed to get the delay. The ideal library for performing all of this computation is the FFTW library, for which I actually wrote a wrapper here. That thing is fast.

Of course, it isn't quite that easy. The audio changes by a surprising amount between microphones, probably due to echoes and reverberation in the room. This makes the whole process much more difficult. See, what the cross-correlation actually returns is a 1 dimensional array - the strength of correlation at each possible delay. If the signals were perfectly identical (just time-shifted), there would be one large spike in the graph at the actual delay and then maybe some smaller ones here and there, but the correct one would be rather obvious. The reality, however, is that we get a bunch of spikes and bumps and things where data may or may not line up accidentally. In some cases, the correct delay is not actually the largest value.
To help me understand these issues, I added a realtime display of the correlation and the current spectrum of the sound to the program, as can be seen in the screen captures.
There arew a few steps I take that help significantly in cleaning up the data:
1) Smoothing - over the range of possible delays, I apply a basic smoothing where I blend each value with the two next to it. Once is not enough, though - it occurs around 20-30 times to each new data set.
2) Averaging - I keep track of the past few (ten or so) values and average them, which also improves the signal/noise ratio a bunch.
After those two improvements, the correct delay is typically the largest peak in the majority of the correlations. There is a global "buffer size" that determines the amount of data to correlate at a time, and it's typically set to around 1/6 of a second. Increasing that, as well as increasing the number of correlations to average, increases the reliability of the delay calculation, but it introduces some lag into the whole thing - if a source suddenly appears, it first has to appear in most of the averaged frames before the program actually recognizes it.

At this point, the delays are falling into place fairly well. But there are still a lot of peaks and we want to make it more robust, so there is a neat trick we can perform to do that. We know that there are four microphones, so picking pairs from four gives us six possible delays (d12, d13, d14, d23, d24, d34, where d12 is the delay between mics 1 and 2). Only three are actually necessary to calculate the position, so that gives us a certain amount of redundancy that we can use to great effect. What do we know about a correct source? Well, using a simple algebraic relation, we know that d12 - d23 = d13, or in other words d12 - d23 - d13 = 0. There are three more such relations that should hold for actual sources. If we pick six arbitrary peaks, one from each correlation result, we can check if they are in fact correct or not by computing these relations and seeing how close they are to zero (or, more precisely, how close the sum of their squares is to zero). If there is a correct source to be found, this method essentially finds it flawlessly. I just check the highest three or four peaks from each correlation (that would be 4^6 or 4096 to check, but it actually comes out to be much less because of the algorithm).
The best part of this method is if there are multiple sources. In that case, looking for the largest peaks might yield gibberish, whereas this method separates each one nicely because the phase closure relation holds for all of the sources. Also, this gives us an idea of the approximate error in the delays we find. If the sum of the squares of all four phase closure relations is exactly zero, we know the error is practically nonexistent. A higher value means the delays didn't fit quite as perfectly and thus had more erros. So, after all that nice processing, we have a nice set of delays to work with. The next step, of course, is computing the position.

Localization
I originally thought this part would be a breeze. Some basic triangulation math, right? Probably been done a bunch of times before. Well, turns out all is not so simple. I was indeed able to find a webpage that contained all the relevant math, one that dealt with GPS location or somesuch. After implementing it and testing it, everything looked ok... all seemed to work perfectly fine. However, that was only due to the particular microphone arrangement I had been using - as soon as I moved one of the mics, everything ceased working because of a problem with degenerate solutions. After talking to my dad for a bit, he sat down with a legal pad and came up with a fixed version of the math in a short enough time, which we then plugged into Matlab and tested. Sure enough, it solved the linear solver problems that we had been encountering, and was still fast and robust. I coded it in C#, tested it, and all was working. So far so good.
After extensive use, however, a few nuances gradually emerged. It makes perfect sense that the positions of the microphones affect the accuracy of the entire system - changing an arrangement will shift the distribution of accuracy around, and increasing the size increases the accuracy throughout. One can represent the accuracy as the slope of the transform (x,y,z) <-> (d1,d2,d3) (three delays are enough, since the other three can be described by those three). So, if a large change in d1-d3 corresponds to a small change in x,y,z, we consider that to be 'accurate'. There are so-called deadzones, where the accuracy is very poor, which we found to occur primarily along the lines connecting microphone locations. These are the areas where a small inaccuracy in a delay could actually give a set of delays that have no corresponding x,y,z coordinate. How? Well, first, delays are converted to distances using the speed of sound. Now, say a source is on the line connecting a certain pair of microphones. If the delay is measured inaccurately such that it is slightly larger, we have a delay distance that is larger than the distance between those two mics, which is not possible; the math dies and gives an invalid result. So, the goal of designing a particular microphone arrangement is to shift the dead zones into areas where they are least likely to actually contain sources (such as the ceiling or floor). The areas not around these zones typically have a theoretical accuracy of millimeters and a practical accuracy (due to measurement errors) of inches. See the accuracy image for an example image, courtesy of a Matlab script.

Still not out of the woods yet, though. As we very quickly discovered, most of the delays actually have two solutions, not just one. Well, maybe not most of them, but a bunch of them, at least. After crafting a Matlab script to display the zones of accuracy for a given arrangement (see image), we now had one to differentiate between the areas where there are a) one solution b) two solutions, with the other solution in the floor or ceiling or far away and c) two solutions, with both possible. Using certain configurations, it was actually possible to eliminate (c) entirely, leaving us with one solution that we could be fairly sure was correct.

Conclusions
So, now that everything was put together, tweaked, and improved upon, how well did it fare in the end? Well, it responds fairly well to more random noises, like the rustling of papers or clicks or things like that, sound that can be correlated better, or music, due to the drums and such. After everything is set up properly, though, it can detect one source to an accuracy of a few inches and track it as the source moves. Sometimes it wobbles or jumps a tiny bit, but by and large it's consistent. We tried it for a bit with better microphones, and the reliaility and accuracy improved greatly. It also works in fairly noisy environments, though that depends on the quality of the noise as well. In smaller rooms, such as the one I was primarily using for initial testing, the echoes from the walls can actually hurt the detection quite a lot. Also, I tested it with two sources only briefly, but that seemed to work to a degree; however, that was before many additional improvements to the program. There is a decent amount of potential left in the program - many things that don't work well, completley useless chunks of code or layers of operation that could probably be streamlined. For the time being, though, I'm satisfied enough with the result of my efforts.

Download
You can download the source or the compiled program. The source contains three parts - the main C# program, a C library that handles the fourier transforms (so that I can do the multiplications in C, marginally faster), and a C library that handles the ASIO device (which should not really need modification). The code is terrible - messy, ugly, bloated, probably non-functional in places. The first thing you should do is get rid of the Tracking module entirely, because there is no point in having it around. Then, do whatever the hell you want to it. It's obviously not documented, because I'm a horrible human being, so as always, email me if you have any questions. If you understand the layout of the code, the rest of it is fairly self-documenting - the difficulty is in the structure and not the individual functions. I have included a Word document that explains it fairly well with the source. So, without further ado:

The full source code (to main program, ASIO loader, and FFTW interop)
Matlab scripts (danger: very complicated)

And do send me an email if you have any questions.



Much thanks to my dad on this one. The others I did on my own, but he helped a bunch with the fairly complex math involved and it would certainly have gone nowhere fast without that help.




An example setup


Behringer ECM-8000


Eurodesk Mixer


Failed preamp


Number of solutions analysis


Accuracy calculation


Localization program


Correlation peak example


Spectrum analysis example


 

 

 

Copyright 2008 - Tamas Szalay.