Modeling Surgical Voice Outcomes with AI
Pure Math Editorial
Sep 21 (Updated: Sep 28)
Applied AI in Healthcare: Modeling Surgical Voice Outcomes Through Data Science and Clinical Collaboration
Executive Summary
When a patient sits across from a surgeon before a laryngeal procedure, one question always comes up: “What will my voice sound like after surgery?”
The surgeon can explain how vocal fold tension will change or that they are targeting the ~250 Hz range, but those explanations rarely satisfy the patient completely. The vocal transformation is more complex than a simple shift in pitch, and to the patient, voice is not mechanical; it is part of their personal identity.
This paper tells the story of how we’re working with a renowned surgeon at a clinic in Kyoto to enable them to answer patients’ most pressing pre-op question, in the form of an audio file approximating what their post-op voice will sound like.
We also reveal how applied AI consulting actually works: embedding in a domain, listening to experts, decomposing complex problems into discrete tasks, and using AI to augment—not replace—human expertise.
1. The Challenge
Patient uncertainty: Voice is central to identity, and the outcomes of voice surgery are difficult to describe.
Clinical limitation: Surgeons understand the anatomy and the procedure itself, but lack the tools to effectively demonstrate to a patient, before the operation, how their voice will be affected.
Clinical need: The clinic needed a credible, data-driven method to help improve pre-op patient counseling. It’s worth mentioning here that patients are under local anesthesia and are asked to make vocalizations during the procedure, which, as we’re sure you can imagine, can be a little disconcerting.
Business Need: Improve patient relationships by offering ground-breaking pre-op tools to facilitate decision-making and improve patients’ overall experience. Enhance the clinic’s ability to attract international partnerships and referrals by making these tools more widely available.
2. Immersion in the Clinical Domain
We were invited to the clinic in Kyoto to attend a surgery and observe the procedure firsthand. During a Type IV Thyroplasty, the surgeon adjusted cartilage tension until the patient’s voice rose into the ~250 Hz range. It became clear to us that predicting outcomes wasn’t just a matter of saying, “Your pitch will increase by 80 Hz.” A credible model would need to capture the full acoustic fingerprint of a voice, including the Fundamental Frequency (represented as F0) and Formants (represented as F1, F2, F3, F4). Formants are resonant frequency bands shaped by the vocal tract (throat, tongue, mouth). Formants define vowel quality and contribute heavily to how a person’s voice sounds.
This experience was invaluable in helping us internalize and frame the problem.
3. Translating Clinical Needs into Data Science
We translated surgical observations into analytic targets:
Measure fundamental frequency (F0) before and after surgery.

Track pre- and post-op formant frequencies (F1–F4), their centroids, and bandwidths, visualized as pre- and post-op spectrograms.

Analyze pre- and post-op spectral intensity shifts across frequency bands.

To create a model that can simulate an “after-surgery” voice, we’d need to quantify these acoustic features across multiple patients, average the changes across surgery types, and then use AI to create functions and code that transform pre-op audio files of a patient’s voice into a simulation of what post-op success would likely sound like.
4. AI-Assisted Acoustic Modeling
To model surgical outcomes at a clinical level, we didn’t start by writing a one-shot prompt asking ChatGPT to “help me transform voice files to make them sound more feminine or masculine.”
The first step was using AI to accelerate our understanding of the mechanics of voice. We determined it was possible to treat the human voice like an instrument: the vocal folds generate sound at a fundamental frequency (F0), and the shape of the throat, tongue, and mouth creates resonant peaks—formants. Our analysis confirmed that the first four formants, combined with F0, are sufficient to parameterize most of what listeners perceive as “voice.”
With that foundation, we reframed the clinical question. Surgeons typically aim to raise F0 into the ~250 Hz range during the procedure; it is effectively the only metric they can target directly. The other vocal qualities that change are the result of physical characteristics the surgeon cannot precisely control. Our task became: extract F0 and formant values from previous pre- and post-operative recordings, compute average transformations for each surgery type, and apply those transformations to new audio samples in order to simulate “after-surgery” outcomes.
Then AI helped us rapidly prototype. Rather than treating LLMs as some kind of oracle, we used them to generate code. Crucially, the quality of that code depends on the precision of the prompting: instead of vague requests for a “magic” solution, we provided mathematically precise instructions.
Example: “Write a function that is passed a list of voice files (all of a single person, for context) and returns a pair of (centroid, bandwidth) values for each of [F0, F1, F2, F3, F4], where F0 is the fundamental frequency and the others are formants.”
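For illustration, here is a minimal sketch of the kind of function such a prompt might yield. It assumes the parselmouth (Praat) library is available; the function name and the choice of mean and standard deviation as the F0 “centroid” and “bandwidth” are ours, not the project’s production code.

```python
# Minimal sketch only: assumes parselmouth (Praat) is installed.
# extract_voice_profile is an illustrative name, not the project's actual code.
import numpy as np
import parselmouth


def extract_voice_profile(voice_files):
    """Return {feature: (centroid_hz, bandwidth_hz)} for F0 and formants F1-F4,
    pooled across all recordings of a single speaker."""
    f0_values = []
    formant_tracks = {i: {"freq": [], "bw": []} for i in range(1, 5)}

    for path in voice_files:
        snd = parselmouth.Sound(path)

        # Fundamental frequency: keep only voiced frames (Praat reports 0 Hz for unvoiced).
        pitch = snd.to_pitch()
        f0 = pitch.selected_array["frequency"]
        f0_values.extend(f0[f0 > 0])

        # Formants F1-F4 sampled at each analysis time.
        formants = snd.to_formant_burg(max_number_of_formants=5)
        for t in formants.xs():
            for i in range(1, 5):
                freq = formants.get_value_at_time(i, t)
                bw = formants.get_bandwidth_at_time(i, t)
                if not np.isnan(freq):
                    formant_tracks[i]["freq"].append(freq)
                    formant_tracks[i]["bw"].append(bw)

    # For F0 we use mean/std as a simple proxy for (centroid, bandwidth).
    profile = {"F0": (float(np.mean(f0_values)), float(np.std(f0_values)))}
    for i in range(1, 5):
        profile[f"F{i}"] = (
            float(np.mean(formant_tracks[i]["freq"])),  # centroid
            float(np.mean(formant_tracks[i]["bw"])),    # average bandwidth
        )
    return profile
```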
Finally, we chained those functions into a complete workflow: measure acoustic parameters, calculate average shifts across surgeries, and apply them as transformation matrices. To test, we first applied it to our own recorded voices, listening to how the transformed samples compared to real post-operative voice samples. Early results were promising. Pitch aligned well with expectations, but the surgeon pointed out that intensity and vowel–consonant balance were just as important as F0. That feedback guided our next round of refinements.
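As a simplified sketch of the “average shifts” step, assuming per-patient profiles like the ones above, the computation might look like the following. Multiplicative ratios are one reasonable choice; the actual transformation matrices used in the project are more involved.

```python
# Simplified sketch: average per-feature shift for one surgery type,
# given paired pre-/post-op profiles from extract_voice_profile() above.
import numpy as np


def average_shifts(patient_pairs):
    """patient_pairs: list of (pre_profile, post_profile) dicts.
    Returns {feature: mean ratio of post-op centroid to pre-op centroid}."""
    ratios = {key: [] for key in ["F0", "F1", "F2", "F3", "F4"]}
    for pre, post in patient_pairs:
        for key in ratios:
            ratios[key].append(post[key][0] / pre[key][0])
    return {key: float(np.mean(vals)) for key, vals in ratios.items()}
```

Applied to the Type IV Thyroplasty data described later, this kind of calculation yields an F0 ratio of roughly 240/162 ≈ 1.48, alongside much smaller shifts for the formants.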
In practice, AI served as an accelerator. It helped us get up to speed on the specific surgery quickly, and it reduced the time to create functions and code from weeks to hours. But it never replaced the data science or clinical expertise. We defined the problem, selected the parameters, and validated the results with clinical experts.
The solution came together through that collaboration: scientific decomposition, AI-assisted research and coding, and expert review and optimization.
Transformation Modeling
For context, there are two types of surgeries. We’re focused on CTA (Cricothyroid Approximation / Isshiki Type IV Thyroplasty) for this example. This type of surgery is often used by transgender people or those who want to improve voice quality affected by medical conditions such as spasmodic dysphonia, paralysis, or other laryngeal disorders.
Audio examples: original sample and transformed (simulated post-op) version.
The transformation is more complex than a simple pitch shift. If you were to take the original voice file and just pitch-shift it upwards, it would sound mousey, as if the speaker had inhaled helium.
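To make that concrete, here is a minimal sketch using the WORLD vocoder via pyworld (our assumption for illustration, not necessarily the clinic’s pipeline) of how pitch and the spectral envelope can be adjusted independently, so the result does not sound “helium-like.” The default ratios are illustrative, derived from the ~162 Hz to ~240 Hz average described below.

```python
# Minimal sketch, not the production pipeline: raise F0 and gently warp the
# spectral envelope (formants) independently using pyworld (assumed installed).
import numpy as np
import pyworld as pw
import soundfile as sf


def simulate_post_op(in_path, out_path, f0_ratio=1.48, formant_ratio=1.05):
    x, fs = sf.read(in_path)
    if x.ndim > 1:                 # mix down to mono if needed
        x = x.mean(axis=1)
    x = x.astype(np.float64)

    # Decompose into pitch (f0), spectral envelope (sp), and aperiodicity (ap).
    f0, t = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0, t, fs)
    ap = pw.d4c(x, f0, t, fs)

    # Raise the fundamental frequency; unvoiced frames (f0 == 0) stay at 0.
    f0_new = f0 * f0_ratio

    # Warp the envelope's frequency axis slightly, shifting formants without
    # dragging them up by the full pitch ratio (which is what sounds "mousey").
    bins = np.arange(sp.shape[1])
    sp_new = np.array([np.interp(bins / formant_ratio, bins, frame) for frame in sp])

    y = pw.synthesize(f0_new, sp_new, ap, fs)
    sf.write(out_path, y, fs)
```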
5. Clinical Validation
After building a working model to simulate post-surgical voice outcomes, the next step was to test it with the person who understands voices best: the surgeon. The goal was to create simulations that a clinician could listen to and say, “Yes, that sounds like the kind of result we typically hear in practice.”
We presented the transformed voice samples side by side with neutral input recordings. The surgeon listened closely, comparing the simulations with their experience across hundreds of patients. As we dialed in each acoustic dimension, the surgeon confirmed that the results sounded progressively closer to reality.
Pitch. The simulations reliably reproduced the expected rise in fundamental frequency (F0) for Type IV Thyroplasty. Where pre-operative averages clustered around ~162 Hz, transformed voices rose to ~240 Hz. This alignment was significant, because it demonstrated that our model was not simply altering voices arbitrarily, but applying changes consistent with documented surgical outcomes.
Resonance. The shifts in resonance were more subtle. The formants—those clusters of frequencies shaped by the vocal tract—remained relatively stable in absolute terms, but small adjustments in their balance had audible effects. Surgeons emphasized that patients do not hear themselves as “feminine” or “masculine” based on pitch alone. Vowel–consonant interactions, driven by formant positioning, often determined whether a voice was perceived as natural. The simulations captured some of this subtlety, reinforcing the importance of going beyond F0 in the model.
Intensity. Perhaps the most revealing feedback concerned vocal intensity. Post-operative voices often showed greater variability in decibels across frequency bands. Surgeons associate this with the effort patients expend as they adapt to new vocal fold tension. Our spectral analyses showed the same effect: broader fluctuations in intensity, particularly in the lower and mid-frequency ranges. Capturing this variability helped the simulations feel more realistic and less like simple “pitch-shifted” audio.
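For readers who want to reproduce this kind of check, a rough sketch using librosa might look like the following; the band edges are illustrative, not the exact ranges used in the project.

```python
# Rough sketch: per-band variability of intensity (dB) over time,
# computed with librosa (assumed available). Band edges are illustrative.
import numpy as np
import librosa

BANDS_HZ = [(0, 500), (500, 1000), (1000, 2000), (2000, 4000)]


def band_intensity_variability(path, n_fft=2048):
    """Return {band: standard deviation of mean band intensity (dB) across frames}."""
    y, sr = librosa.load(path, sr=None)
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=n_fft)), ref=np.max)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

    variability = {}
    for lo, hi in BANDS_HZ:
        mask = (freqs >= lo) & (freqs < hi)
        band_db = S_db[mask].mean(axis=0)   # mean dB per frame within the band
        variability[f"{lo}-{hi} Hz"] = float(band_db.std())
    return variability
```

Comparing these numbers for pre-op recordings against post-op recordings is one way to quantify the broader fluctuations the surgeon described.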
The validation process underscored the value of iteration. Each round of simulations was tested—not against abstract acoustic theory, but against clinical reality. When the surgeon said, “This sounds closer to what we expect,” the model earned credibility. When they flagged missing elements—like intensity patterns—we incorporated that feedback into the next cycle.
Ultimately, clinical validation demonstrated that a data-driven, AI-assisted approach could generate not only charts and metrics but also realistic audio transformations. It is this feedback loop—analysis, simulation, expert review, optimization—that can transform a rapid prototype into a clinically relevant tool.
6. Outcomes and Impact
When patients consider voice surgery, the decision is often fraught with uncertainty and anxiety. A surgeon can explain cartilage adjustments or show average pitch increases, but those abstractions don’t address the experience of speaking with a new voice. By transforming clinical data into audio simulations, we created a solution that bridges the gap between surgical expertise and patient expectations.
For patients, the most immediate outcome is trust. Instead of imagining change through numbers and charts, patients can listen to a before-and-after transformation. Even when framed as simulations rather than guarantees, the ability to hear a likely outcome helps reduce patient anxiety—especially since they’ll be under local anesthesia during the procedure and essentially ‘participating’ by making vocalizations. The consultation becomes more concrete: a voice they can recognize as their own, shifted into the expected post-surgical range. Patients gain confidence, and surgeons gain a clearer way to communicate what “success” may sound like.
For the clinic, the system becomes a differentiator. Many clinics worldwide perform procedures like Type IV Thyroplasty or Wendler Glottoplasty, but few can demonstrate outcomes in such a tangible way. The ability to pair clinical authority with data-driven simulations helps set this already prestigious Kyoto surgicenter apart in a competitive international landscape. It also creates a foundation for new services: pre-surgery counseling supported by acoustic modeling, and post-surgery follow-up where simulated targets can be compared to actual results. These innovations help position the clinic as both scientifically rigorous and technologically forward-looking.
For business leaders, this project demonstrates how applied AI consulting creates value beyond strategy and prototyping. The core process—embedding with domain experts, decomposing ambiguous problems into measurable, scientific processes, using AI to accelerate R&D, and validating outputs with experts—also applies outside of voice surgery. We use the same approach across a range of problems and domains.
7. Lessons for Leaders
The takeaway isn’t that AI can “do it all,” but that AI, combined with data science and domain expertise, can tackle problems previously thought too difficult or too costly to solve. These lessons extend far beyond healthcare.
Applied AI is a scientific process. Domain experts may have the experience but not know what is going on mathematically. In this instance, we approached voice like an instrument: identified variables, tested hypotheses, and validated results with experts. This is not using AI to automate routine tasks; it is clinical problem solving using data science and AI.
Consulting is the key. We build solutions by breaking down complex problems into math and science. As the math gets closer to what we expect, we confirm with the domain experts that the results are “realistic,” i.e., what they would expect to hear. Executives should recognize that success in AI requires structured team collaboration, not individual effort.
Expect iteration, not “one-shot” magic. Our first simulations needed to be refined through human-in-the-loop feedback: pitch shifted as expected, but intensity and resonance felt off. Surgeons provided clinical insights, and each loop improved the model. Leaders need to reset expectations: applied AI succeeds through the power of data science, a consultative approach with experts, and cycles of development and optimization.
Leaders who treat AI-related projects as a scientific process drastically improve their odds of success.
Contact Us to learn more about how we work with clients to translate complex challenges into measurable outcomes.
8. Glossary
Fundamental Frequency (F0): The base frequency of vocal fold vibration, measured in Hertz (Hz). Often perceived as the “pitch” of the voice. Surgeries like Type IV Thyroplasty directly increase F0 to raise pitch.
Formants (F1–F4): Resonant frequency bands shaped by the vocal tract (throat, tongue, mouth). Formants define vowel quality and contribute heavily to how gendered or natural a voice sounds.
Formant Centroid: The central frequency of a formant band. Shifts in centroids affect vowel perception.
Formant Bandwidth: The range of frequencies covered by a formant. Narrow or wide bandwidths influence clarity and resonance.
Spectrogram: A visual representation of sound that plots frequency over time with color indicating intensity. Used to see patterns in F0 and formants before and after surgery.
Spectral Intensity: The loudness (in decibels) of frequencies across the spectrum. Post-surgery voices often show greater variability in spectral intensity, linked to perceived vocal effort.
Type IV Thyroplasty: A laryngeal framework surgery that raises pitch by approximating the cricoid and thyroid cartilages, increasing vocal fold tension. Typically raises F0 from ~162 Hz to ~240 Hz.
Prompt Engineering: Designing inputs to AI systems. In this project, mathematically precise prompts (e.g., “extract formant centroids and bandwidths”) generated functional code, while vague prompts produced weak results.
Transformation Matrix: A mathematical mapping that applies average acoustic changes (e.g., pitch increase, formant shift) from surgical data to a new voice sample, simulating an “after-surgery” outcome.
Validation Loop: An iterative process where simulated outputs are reviewed by surgeons, feedback is incorporated, and the model is refined. Ensures outputs align with clinical reality.
Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries. As with any AI-based project, human oversight is employed throughout the content creation process.

