{"id":4235,"date":"2020-05-05T07:05:15","date_gmt":"2020-05-05T11:05:15","guid":{"rendered":"https:\/\/blogs.mathworks.com\/deep-learning\/?p=4235"},"modified":"2021-04-06T15:48:34","modified_gmt":"2021-04-06T19:48:34","slug":"optimization-of-hrtf-models","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/deep-learning\/2020\/05\/05\/optimization-of-hrtf-models\/","title":{"rendered":"Optimization of HRTF Models with Deep Learning"},"content":{"rendered":"<em>Today\u2019s post is from Sunil Bharitkar, \u00a0who leads audio\/speech research in the Artificial Intelligence &amp; Emerging Compute Lab (AIECL) within HP Labs. He will discuss his research using deep learning to model and synthesize head-related transfer functions (HRTF) using MATLAB. This work has been published in an IEEE paper, linked at the bottom of the post.<\/em>\r\n<h6><\/h6>\r\nToday I\u2019d like to discuss my research, which focuses on a new way to model how to synthesize sound from any direction at all angles using deep learning.\r\n<h6><\/h6>\r\n<h2>A crash course in Spatial Audio<\/h2>\r\n<h6><\/h6>\r\nOne important aspect of this research is in the localization of sound. This is studied quite broadly for audio applications and relates to how we as humans hear and understand where sound is coming from. This will vary person to person, but it generally has to do with the delay relative to each ear (below ~800 Hz) and spectral details at the individual ears at frequencies greater than ~ 800 Hz. 
These are primarily the cues that we use to localize sound on a daily basis (see, for example, the <span style=\"text-decoration: underline;\"><a href=\"https:\/\/www.mathworks.com\/help\/audio\/examples\/cocktail-party-source-separation-using-deep-learning-networks.html\">cocktail party effect<\/a><\/span>).\r\n<h6><\/h6>\r\nBelow are figures displaying the <strong>head-related impulse response<\/strong> measured in an anechoic chamber at the entrance of both ear canals of a human subject (left figure), and its Fourier-domain representation, i.e., <strong>the head-related transfer function<\/strong> (right figure), which shows how a human hears sound at both ears from a certain location (e.g., 45 degrees to the left-front, at 0 degrees elevation) across the audible frequencies.\r\n<h6><\/h6>\r\n<table>\r\n<tbody>\r\n<tr>\r\n<td><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-4237 size-large\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig0_left-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" \/><\/td>\r\n<td><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"768\" class=\"alignnone size-large wp-image-4239\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig0_right-1024x768.png\" alt=\"\" \/><\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n<h6><\/h6>\r\nLooking at this plot, you can see that when a sound is played from a direction 45 degrees off center, the sound at the left ear is higher in amplitude than at the right. Also embedded in this plot is the difference in arrival time between the left and right ears, where a difference of less than a millisecond can have an important impact on where we perceive sound. 
Subconsciously, we interpret the location of the sound source from this difference in spectra and delay.\r\n<h6><\/h6>\r\nCompare this with a sound coming from 180 degrees, directly behind a listener:\r\n<h6><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-4259 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig1_chart_resize.png\" alt=\"\" width=\"800\" height=\"600\" \/><\/h6>\r\nThe spectral detail at the left and right ear is nearly identical at all frequencies, since the sound source is substantially equidistant from both ears. The difference in arrival time will also be insignificant for a sound source at 180 degrees. These discrepancies (or lack thereof) are what help us determine where sound is coming from.\r\n<h6><\/h6>\r\nWe are very good at localizing sound at certain frequencies, and not as good at others*. This depends on both the frequency and the location of the sound.\r\n<h6><\/h6>\r\n<em>*It\u2019s interesting to note that humans aren\u2019t very good at determining where sound is coming from at certain angles (e.g., within the cone of confusion). The best way to resolve the confusion is to move our head around to maximize the discrepancies between the left and right ear. I\u2019m sure you\u2019re now curious to try this experiment informally at home with your next beeping fire alarm.<\/em>\r\n<h6><\/h6>\r\nThis research has many applications where localization of sound is critical. One example is video game design, or virtual reality, where the sound must match the video for a truly immersive experience. 
For sound to match video, we must match the expected cues for both ears at desired locations around the user.\r\n<h6><\/h6>\r\nThere are many aspects of this research which make this a challenging problem to solve:\r\n<h6><\/h6>\r\n<ul>\r\n \t<li>Humans are very good at spotting discrepancies in sound, which make the experience feel fake and less than genuine for the user.<\/li>\r\n \t<li>Head-related transfer functions differ from person to person over all angles.<\/li>\r\n \t<li>Each HRTF is direction dependent and varies with angle for any given person.<\/li>\r\n<\/ul>\r\n<h6><\/h6>\r\nFigure 3 shows an example of how HRTFs vary from person to person:\r\n<h6><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-4261 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig2_chart_resize.png\" alt=\"\" width=\"800\" height=\"600\" \/><\/h6>\r\nEach person\u2019s anatomy, and therefore their hearing, is unique. The only way to be 100% confident the sound will be perfect for a person is to measure their individualized head-related transfer function in an anechoic chamber. This is highly impractical, as our goal is minimal setup time for the consumer. This leads us to the central question of my research:\r\n<h6><\/h6>\r\n<em><strong>Can we use deep learning to approximate an HRTF at all angles for a genuine experience for a large number of listeners?<\/strong><\/em>\r\n<h6><\/h6>\r\n<h2>Current state of the art vs. 
our new approach<\/h2>\r\n<h6><\/h6>\r\nThe question you might be asking is \u201cwell, if everyone is different, why not just take the average of all the plots and create an average HRTF?\u201d To that, I say \u201cif you take the average, you\u2019ll just have an average result.\u201d Can deep learning help us improve on the average?\r\n<h6><\/h6>\r\nPrior to our research, the primary method for this kind of modeling was principal component analysis (PCA) of HRTFs over a set of people. In the past, researchers found that 5 or 6 principal components generalize reasonably well over a small test set of approximately 20 subjects ([1], [2], [3]), but we want to generalize over a larger dataset and a larger number of angles.\r\n<h6><\/h6>\r\nWe are going to show a new approach using deep learning: an autoencoder learns a lower-dimensional (latent) representation of HRTFs using nonlinear functions, and a second network (a generalized regression neural network, or GRNN) maps angles to that latent representation. We start with an autoencoder with one hidden layer, and then optimize the number of hidden layers and the spread of the Gaussian RBF in the GRNN by Bayesian optimization against a validation metric (the log-spectral distortion). The next section shows the details of this new approach.\r\n<h6><\/h6>\r\n<h2>New approach<\/h2>\r\n<h6><\/h6>\r\nFor our approach we are using the IRCAM dataset, which consists of 49 subjects with 115 directions of sound per subject. 
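To make the angle-to-latent mapping concrete, here is a minimal sketch of the GRNN idea: a GRNN is essentially a Nadaraya-Watson estimator with a Gaussian RBF kernel, predicting an output as a distance-weighted average of training examples. All names and data below are illustrative, not the authors' code (the post's models were built in MATLAB); a made-up 2-D latent code stands in for the autoencoder's hidden-layer output.

```python
import math

# Hypothetical sketch of a generalized regression neural network (GRNN):
# a Nadaraya-Watson estimator with a Gaussian RBF kernel, mapping a source
# angle (degrees) to a 2-D "latent" vector standing in for the autoencoder's
# hidden-layer code.

def grnn_predict(angle, train_angles, train_latents, spread):
    """Distance-weighted average of training latents (Gaussian RBF weights)."""
    weights = [math.exp(-((angle - a) ** 2) / (2.0 * spread ** 2))
               for a in train_angles]
    total = sum(weights)
    dim = len(train_latents[0])
    return [sum(w * lat[d] for w, lat in zip(weights, train_latents)) / total
            for d in range(dim)]

# Toy training set: three measured angles with made-up latent codes.
angles = [0.0, 45.0, 90.0]
latents = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

# Predict the latent code at an unmeasured angle. The RBF spread (the
# parameter tuned by Bayesian optimization in the post) controls how
# strongly neighbouring measured angles blend.
z = grnn_predict(30.0, angles, latents, spread=20.0)
```

The decoder half of the autoencoder would then reconstruct the full 1024-bin HRTF from such a latent vector; a small spread interpolates tightly around measured angles, while a large one smooths across them.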
We are going to use an autoencoder model and compare it against the principal component analysis model (which is the linearly optimal solution for a given number of PCs), measuring performance objectively with the log-spectral distortion metric.\r\n<h6><\/h6>\r\n<h3>Data setup<\/h3>\r\n<h6><\/h6>\r\nAs I mentioned, the dataset has 49 subjects and 115 angles, and each HRTF is created by computing the FFT over 1024 frequency bins.\r\n\r\nProblem statement: can we find an HRTF representation for each angle that best fits all subjects at that angle? We\u2019re essentially looking for the best possible generalization, over all subjects, for each of the 115 angles.\r\n<ul>\r\n \t<li>We also used hyperparameter tuning (bayesopt) for the deep learning model.<\/li>\r\n<\/ul>\r\nAutoencoder approach:\r\n<ul>\r\n \t<li>We take the entire HRTF dataset (1024 x 5635, i.e., 1024 frequency bins by 49 subjects x 115 angles = 5635 HRTFs) and train the autoencoder. The output of the hidden layer gives a compact representation of the input data. We extract that representation and then train a generalized regression neural network (GRNN) to map the angles to it. We also add jitter, i.e., noise, to the data for each angle and each subject. This helps the network generalize rather than overfit, since we aren\u2019t looking for the perfect answer (this doesn\u2019t exist!), but 
rather the generalization that best fits all test subjects.<\/li>\r\n \t<li>Bayesian optimization was used for:\r\n<ul>\r\n \t<li>The size of the autoencoder network (the number of layers)<\/li>\r\n \t<li>The jitter\/noise variance added to each angle<\/li>\r\n \t<li>The RBF spread for the GRNN<\/li>\r\n<\/ul>\r\n<\/li>\r\n<\/ul>\r\n<h6><\/h6>\r\n<h2>Results<\/h2>\r\n<h6><\/h6>\r\nWe used the log-spectral distortion to evaluate the results:\r\n<h5><img decoding=\"async\" loading=\"lazy\" width=\"547\" height=\"255\" class=\"alignnone size-large wp-image-4245\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig3_equation.png\" alt=\"\" \/><\/h5>\r\nThis formula compares, for a given user at a given angle, the measured magnitude against the network\u2019s response at each frequency bin.\r\n<h6><\/h6>\r\nFor PCA, we found that 10 PCs was a fair comparison to our model (as shown in the figure below), since this captured most of the explained variance of this large dataset.\r\n<h6><\/h6>\r\n<div id=\"attachment_4281\" style=\"width: 610px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-4281\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-4281 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig4_cropped_resized.png\" alt=\"\" width=\"600\" height=\"309\" \/><p id=\"caption-attachment-4281\" class=\"wp-caption-text\">Selection of PCA order (abscissa: no. of PCs, ordinate: explained variance), (a) left ear, (b) right ear<\/p><\/div>\r\n<h6><\/h6>\r\nHere\u2019s a random sampling of test subjects, at given angles, comparing the results of the AE approach vs. the PCA. 
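As a rough illustration of the metric, here is a sketch of log-spectral distortion, assuming its common form: the RMS difference, across frequency bins, between measured and reconstructed log-magnitude responses in dB. The exact expression in the paper may differ, and the magnitudes below are made up for illustration.

```python
import math

# A sketch of the log-spectral distortion (LSD) validation metric, assuming
# its common form: the RMS difference, over frequency bins, between measured
# and reconstructed log-magnitude responses in dB. Toy values, not paper data.

def log_spectral_distortion(h_true, h_est):
    """RMS of the per-bin magnitude error, expressed in dB."""
    total = 0.0
    for ht, he in zip(h_true, h_est):
        d = 20.0 * math.log10(abs(ht)) - 20.0 * math.log10(abs(he))
        total += d * d
    return math.sqrt(total / len(h_true))

# A perfect reconstruction scores 0 dB; a uniform factor-of-2 magnitude
# error scores 20*log10(2), about 6.02 dB.
h = [1.0, 0.8, 0.5, 0.25]
h_half = [x / 2.0 for x in h]
lsd = log_spectral_distortion(h, h_half)
```

Lower values mean the reconstruction tracks the measured HRTF more closely, which is why the figure below colors each subject-angle cell by which model achieved the lower distortion.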
For the most part, you can see significant improvements using the deep learning approach.\r\n<h6><\/h6>\r\n<div id=\"attachment_4307\" style=\"width: 810px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-4307\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-4307 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/05\/Fig5_cropped2-1.png\" alt=\"\" width=\"800\" height=\"431\" \/><p id=\"caption-attachment-4307\" class=\"wp-caption-text\">Example of left-ear HRTF reconstruction using the AE model (green) compared to the true HRTF (blue). The PCA model is used as an anchor and is shown in red.<\/p><\/div>\r\n<h6><\/h6>\r\nAnother nice visualization I created clearly shows the difference between models:\r\n<h6><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-4267 size-full\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig6_cropped.png\" alt=\"\" width=\"446\" height=\"302\" \/><\/h6>\r\n<h6><\/h6>\r\nThe two models are compared using log-spectral distortion (dLS) for each angle and each user. A green box indicates the AE\u2019s dLS is lower than the PCA\u2019s (lower is better), and blue marks where the PCA-based approach outperforms the autoencoder. As you can see, the AE outperforms the PCA for the most part; it is not better at every angle and for every subject, but it is certainly a much better result overall.\r\n<h6><\/h6>\r\n<strong>In conclusion<\/strong>, we showed a new deep learning approach that significantly improves on the state-of-the-art PCA approach in representing the HRTFs of a large subject pool. 
We are continuing to test on much larger datasets and are continuing to see reproducible results, so we are confident this approach produces consistent results.\r\n<h6><\/h6>\r\nIf you would like to learn more on this topic, the link to the entire paper, which received the Outstanding Paper Award from the IEEE, is here: <a href=\"https:\/\/ieeexplore.ieee.org\/document\/8966196\">https:\/\/ieeexplore.ieee.org\/document\/8966196<\/a>.\r\n<h6><\/h6>\r\n<h3>References<\/h3>\r\n<h6><\/h6>\r\n[1] D. Kistler and F. Wightman, \u201cA model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction,\u201d J. Acoust. Soc. Amer., vol. 91(3), 1992, pp. 1637-1647.\r\n<h6><\/h6>\r\n[2] W. Martens, \u201cPrincipal components analysis and resynthesis of spectral cues to perceived direction,\u201d Proc. Intl. Comp. Mus. Conf., 1987, pp. 274-281.\r\n<h6><\/h6>\r\n[3] J. Sodnik, A. Umek, R. Susnik, G. Bobojevic, and S. Tomazic, \u201cRepresentation of Head-related Transfer Functions with Principal Component Analysis,\u201d Acoustics, Nov. 2004.\r\n<h6><\/h6>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/deep-learning\/files\/2020\/04\/Fig0_left-1024x768.png\" onError=\"this.style.display ='none';\" \/><\/div><p>Today\u2019s post is from Sunil Bharitkar, \u00a0who leads audio\/speech research in the Artificial Intelligence &amp; Emerging Compute Lab (AIECL) within HP Labs. He will discuss his research using deep... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/deep-learning\/2020\/05\/05\/optimization-of-hrtf-models\/\">read more >><\/a><\/p>","protected":false},"author":156,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/4235"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/users\/156"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/comments?post=4235"}],"version-history":[{"count":25,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/4235\/revisions"}],"predecessor-version":[{"id":4323,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/posts\/4235\/revisions\/4323"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/media?parent=4235"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/categories?post=4235"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/deep-learning\/wp-json\/wp\/v2\/tags?post=4235"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}