A.I.
[Interview] Hearing is misunderstood

Your AI can’t hear, yet




Cochl is a startup that created sound AI for machine hearing.
Machine hearing replicates the intricate human auditory process, including the brain’s ability to filter repetitive sounds automatically.
Hearing is an active and complex sense, functioning as dynamically as vision in processing the surrounding environment.

Cochl




“Hey, Siri.” Siri responds to my voice. “Alexa!” Alexa does too. So I assume the computer can hear me. But STT (Speech-to-Text), or speech recognition, is not equivalent to AI hearing technology. What is the difference?




Machine hearing refers to technology that mechanically replicates human auditory processes. The way humans perceive sound is incredibly complex. Have you ever noticed a repetitive noise gradually becoming less noticeable? This happens because the brain automatically filters out repetitive sounds without requiring a conscious command, such as saying, “Filter out that noise.” This is just one example of the many functions of auditory perception. At this very moment, your sense of hearing is working as actively as your sense of sight.




With advancements in STT (Speech-to-Text) technology and natural language recognition powered by LLMs (Large Language Models), many people mistakenly believe that computers are capable of “hearing.” However, this is merely an understanding of spoken words, not an actual ability to “hear.” So, are there startups working on true auditory technology? The answer is yes—Cochl, a company operating out of Korea and the Bay Area, is leading the way.




“What does ‘Cochl’ mean?” I asked Yoonchang Han, the CEO of Cochl. “When the company was founded in 2017, it was named Cochlear.ai, inspired by the cochlea,” he explained. “However, in 2020, we shortened it to Cochl due to trademark issues and because the original name was difficult for people to remember.”

So, what exactly is the machine hearing foundation model that Cochl is developing?




Yoonchang Han, the CEO of Cochl.




Giving Machines the Ability to Hear

Listening to Cochl’s explanation of their technology, I realized something surprising: I had never considered that computers could recognize only human speech or music, not other sounds. I had simply assumed that machines, AI, and computers were capable of “hearing.”

Humans gather a wealth of information through sound, though we often fail to notice it because it isn’t as visually apparent. However, computational technology has largely focused on speech recognition, overlooking other auditory elements. Compared to advancements in image analysis or chatbot development, the progress in developing auditory perception for computers has been relatively slow.




At TechCrunch Disrupt 2024, it was only after hearing Cochl’s explanation of their technology that I grasped the distinction. Are speech recognition and sound recognition entirely different?

Machine listening encompasses all sounds humans hear, including speech, music, and environmental noises. In the near future, it is expected that auditory cognitive abilities will be an essential component for computers to achieve human-like capabilities. To integrate diverse sound domains—such as environmental and mechanical sounds, beyond speech and music—into a single AI system, the importance of generalized machine listening technology will only grow. It seems likely that auditory cognitive abilities will play a crucial role in the development of AGI.




Repeated sounds and noise are often filtered out by the brain, which automatically prevents us from consciously recognizing them. Replicating this complex auditory cognitive process in machines must have presented many challenges. What was the most difficult aspect of this?

Machine listening requires AI to understand all sounds, across a wide range of frequencies that overlap in various ways. This posed a significant technical challenge. Moreover, sound data varies depending on the characteristics of microphones, noise levels, and reverberation in recording environments. Establishing standards and implementing these as generalized technologies was challenging, but we are making progress through technological solutions. Researching and developing machine listening technology has reinforced how incredibly advanced the human brain is—truly a high-performance computer.
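
To make that variability concrete, here is a minimal sketch, not Cochl’s actual pipeline, of how recording-condition differences can be simulated during training by degrading a clean waveform with gain changes, reverberation, and added noise (the function and its parameters are purely illustrative):

```python
import numpy as np
from typing import Optional
from scipy.signal import fftconvolve

def augment_waveform(wave: np.ndarray,
                     rir: Optional[np.ndarray] = None,
                     snr_db: float = 20.0,
                     gain_db: float = 0.0) -> np.ndarray:
    """Degrade a clean clip to mimic a different microphone/room/noise setup."""
    out = wave * (10.0 ** (gain_db / 20.0))            # microphone sensitivity / gain
    if rir is not None:                                 # room impulse response -> reverberation
        out = fftconvolve(out, rir, mode="full")[: len(wave)]
    noise = np.random.randn(len(out))                   # broadband background noise
    signal_power = np.mean(out ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    out = out + noise * np.sqrt(noise_power / (np.mean(noise ** 2) + 1e-12))
    return out.astype(np.float32)

# Example: turn a clean 1-second, 16 kHz clip into a quieter, noisier "field" recording.
clean = np.random.randn(16000).astype(np.float32)       # stand-in for a real recording
field_like = augment_waveform(clean, snr_db=10.0, gain_db=-6.0)
```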




Processing such complex and extensive data must have required dedicated algorithm development. I understand that you utilize CNN and RNN models. How is Cochl developing its foundation model?

Until the early 2010s, engineers manually observed and extracted data features based on predefined rules. However, more recently, we have adopted deep learning technologies where computers identify patterns in data, extract features, and even classify results. Through artificial neural network training, we are designing various model architectures and conducting end-to-end research and development for Sound AI technology, including preprocessing, postprocessing, data augmentation, and collection.
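
As a rough illustration of such an end-to-end pipeline, here is a minimal sketch assuming PyTorch and torchaudio; the architecture, layer sizes, and class names are hypothetical stand-ins rather than Cochl’s actual model:

```python
import torch
import torch.nn as nn
import torchaudio

class SoundEventCNN(nn.Module):
    """Toy sound-event classifier: waveform -> log-mel spectrogram -> CNN -> logits."""
    def __init__(self, n_classes: int = 10, sample_rate: int = 16000):
        super().__init__()
        # Preprocessing: turn the raw waveform into a log-mel "image".
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Feature extraction: stacked convolutions learn spectral patterns.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = self.to_db(self.melspec(waveform)).unsqueeze(1)   # (batch, 1, mels, frames)
        x = self.features(x).flatten(1)                        # (batch, 64)
        return self.classifier(x)                              # class logits per clip

# Two 1-second clips at 16 kHz -> logits over 10 hypothetical sound classes.
model = SoundEventCNN()
logits = model(torch.randn(2, 16000))
```

A production system would of course use a far larger model and dataset, but the basic waveform-to-spectrogram-to-classifier flow is the same shape.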




Using AI chatbots like ChatGPT has become second nature to many of us. However, concerns about hallucinations in AI responses continue to grow. Companies aiming to adopt Cochl’s technology are using sound to detect machine anomalies and to establish a social safety net through sound data. With accuracy being the most critical factor, how is Cochl working to enhance it?

Improving accuracy requires careful attention to every step of auditory cognitive development—from data collection and augmentation to model training, evaluation, and deployment. Nothing can be overlooked in this process. To this end, Cochl evaluates and measures accuracy in real-world environments rather than relying solely on laboratory-generated refined data.
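
As a simple illustration of that principle, an evaluation sketch like the following, where `model`, `lab_loader`, and `field_loader` are hypothetical placeholders, compares accuracy on clean laboratory clips with accuracy on real-world field recordings:

```python
import torch

@torch.no_grad()
def accuracy(model, loader) -> float:
    """Fraction of clips whose predicted class matches the label."""
    correct, total = 0, 0
    for waveforms, labels in loader:
        preds = model(waveforms).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# Compare performance on clean lab data with performance on field recordings:
# print(f"lab: {accuracy(model, lab_loader):.3f}  field: {accuracy(model, field_loader):.3f}")
```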




Cochl



AI Between Hearing and Vision

Cochl’s sound AI technology reminded me of a stethoscope. Just as doctors use stethoscopes to examine patients, machine hearing technology can detect mechanical faults and identify incidents like gunshots in everyday life, making it seem like a stethoscope for society as a whole.

When machines break down, the first sign is often a change in sound. Thus, calling it a “stethoscope for machines” seems fitting. The world is full of various machines, each producing distinct sounds, making the development of a universal model challenging. In theory, sound can be used to detect faults in machines, but collecting the necessary data is both time-consuming and costly.




Has Cochl addressed the resource issue?
Cochl has developed a foundation model pre-trained with sound data collected over the past seven years. Thanks to this, there’s no need to gather millions of new data points for each case. With just a small amount of data collected over 2-3 days, fine-tuning can deliver high performance.




How does the fine-tuning process work?
For audio data, additional sounds from the actual site are used. In the preprocessing stage, signal processing techniques are applied to remove unnecessary information, ensuring the deep learning model can recognize patterns more effectively. After this, training proceeds, yielding the desired results within just 2-3 days.
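
Here is a minimal sketch of that fine-tuning idea, assuming a pre-trained PyTorch backbone that outputs fixed-size embeddings; `pretrained_backbone`, `site_loader`, and the embedding size are hypothetical placeholders, not Cochl’s implementation:

```python
import torch
import torch.nn as nn

def fine_tune(pretrained_backbone: nn.Module, site_loader,
              n_site_classes: int, epochs: int = 5, lr: float = 1e-3) -> nn.Module:
    # Freeze the general-purpose features learned from years of collected sound data.
    for p in pretrained_backbone.parameters():
        p.requires_grad = False
    # Train only a small classification head on the few days of on-site recordings.
    head = nn.Linear(64, n_site_classes)        # 64 = assumed backbone embedding size
    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for waveforms, labels in site_loader:   # preprocessed site audio and labels
            embeddings = pretrained_backbone(waveforms)
            loss = loss_fn(head(embeddings), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return nn.Sequential(pretrained_backbone, head)
```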



Recently, the spotlight in AI technology has been on visual technologies. From generating images through text inputs to creating videos with tools like Sora, visual-based innovations dominate the AI field. Even Twelve Labs’ video comprehension technology is rooted in visual processing. Cochl, however, focuses on sound.
Among human cognitive abilities, hearing is second only to vision in importance. Yet computers’ auditory perception abilities remain significantly underdeveloped. If computers could drastically enhance their ability to perceive and understand sound, it could revolutionize daily life, making it far more convenient and efficient. Recognizing this potential, I decided to build AI technologies centered around sound, leveraging my previous research experience. I believed this technology could bring us closer to a more convenient future at a faster pace.




Where does machine hearing currently stand in the broader scope of AI development, on a roadmap like OpenAI’s 5 Steps to AGI?
The ultimate goal of AI technology is to replicate human-like cognitive and reasoning abilities. With that in mind, machine hearing can be said to have just entered the foundational stage.




What changes can we expect in a world where computers and machines gain the ability to hear?
Many tasks currently dependent on human auditory skills could be automated, enabling computers to act “intuitively.” For instance, humans adjust the heating when they hear someone sniffle and assume they’re cold. Similarly, a computer might hear the sounds in a home and determine that children and pets are present, then adjust the environment accordingly. In senior care, computers could identify coughing sounds from older adults and instantly assess their health conditions.
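
Purely as an illustration of how such detections might be wired to actions, the sketch below maps hypothetical sound-event labels to hypothetical callbacks; none of these names come from Cochl’s API:

```python
def handle_sound_event(label: str, confidence: float, actions: dict) -> None:
    """Route a detected sound event to a home-automation action."""
    if confidence < 0.8:                      # ignore low-confidence detections
        return
    if label == "sniffle":
        actions["set_temperature"](22)        # someone sounds cold: warm the room
    elif label in ("dog_bark", "child_laughing"):
        actions["set_profile"]("family_home") # children or pets are present
    elif label == "elderly_cough":
        actions["notify_caregiver"](label)    # flag a possible health issue
```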




In a future powered by Cochl’s advanced auditory technology, society could transform significantly.
In the event of a shooting, the technology could identify the type and location of the firearm in real-time and relay that information to law enforcement. Autonomous vehicles equipped with auditory capabilities could listen for crash sounds to gauge the severity of accidents, prepare for emergencies, or make way for ambulances by recognizing their sirens. Giving computers the ability to hear would open a world of possibilities, enabling them to perform a wide array of tasks.






A World Filled with Sound

The world is brimming with waves, including frequencies that lie beyond the range of human hearing. Cochl’s machine hearing technology transforms these countless sounds into data, enabling computers to make sense of them. On the path to AGI, humanity has gained yet another clue: beyond reasoning abilities, computers must acquire sensory perception.

As Dr. S. F. Singer, a physicist from the University of Maryland, aptly put it, the human being is “the only non-linear, 150-pound servomechanical system which can be mass-produced by unskilled labor.”

2025-01-08
editor
Youngwoo Cha