
Part 1: The Turning Point — When Cloud AI Stopped Making Sense

Updated: May 3


A candid “conversation” between our CEO, Sean Douglas, and Pure Math Editorial. Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries.


Nvidia Tesla T4: Where the Magic Happens

If you’ve ever tried to run serious AI workloads, you may have noticed… it’s not as seamless as the marketing makes it sound.


Between GPU shortages, unpredictable costs, and compliance constraints, what starts as a simple prototype can quickly spiral into a budgeting and infrastructure nightmare.

Unless you have unlimited resources and don’t care about budgeting—which describes exactly none of our clients. Or us.


For us, what began as a routine architecture decision turned into a lesson in the hidden tradeoffs of cloud-based AI—and a case for why, in some situations, owning the hardware just makes more sense.


We’re sharing this in a five-part series because:

  1. We believe it might be genuinely useful for anyone thinking about AI systems in regulated industries, and

  2. Sean’s been having a lot of fun getting his hands ‘dirty’ with the hardware.


This is how we got to a turning point—When Cloud AI Stopped Making Sense—and why we think more firms should start seriously planning their own in-house AI infrastructure.


Pure Math Editorial:

What problem were you trying to solve that led you down this GPU rabbit hole?


Sean:

We’re building a RAG (retrieval-augmented generation) system, both for internal use and for a product we’re developing. The idea is to create an LLM that can understand and interact with millions of financial documents—think analyst meeting notes, pitch decks, DDQs, SEC filings. Stuff like that.


And it’s not just about volume—it’s about how systems actually use that volume.


You’re not processing each document once. You’re chunking them, generating embeddings for each chunk, and storing those in a vector database. During retrieval, you're querying that index—sometimes with multi-turn prompts, sometimes across multiple use cases like summarization, classification, or feature extraction.


Depending on how aggressively you chunk, and how often you reprocess or re-embed, a million documents can easily balloon into hundreds of billions of embeddings.
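For readers who want to see the shape of that loop, here is a minimal sketch of the chunk, embed, index, retrieve cycle, assuming sentence-transformers and FAISS. The model name, chunk sizes, and sample text are illustrative placeholders, not the exact stack we run.

```python
# Minimal chunk -> embed -> index -> retrieve loop (illustrative assumptions throughout).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-width chunking; real pipelines usually split on document structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

documents = ["...analyst meeting notes...", "...DDQ text...", "...10-K excerpt..."]
chunks = [c for doc in documents for c in chunk(doc)]

# Every chunk becomes a vector; this is where the volume multiplies.
vectors = np.asarray(model.encode(chunks, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

# Retrieval: embed the query, pull the nearest chunks, feed them into the LLM prompt.
query = np.asarray(
    model.encode(["What did management say about fee compression?"],
                 normalize_embeddings=True),
    dtype="float32",
)
scores, ids = index.search(query, 5)
context = [chunks[i] for i in ids[0]]
```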

Which is fine. Until you price it out.


Pure Math Editorial:

So you did what most teams do—start with APIs?


Sean:

Yeah, we looked at all the usual suspects—OpenAI, Claude, DeepSeek. Even with open models like DeepSeek, running inference through hosted APIs adds up fast.


Once you start talking about high-volume document processing—millions of chunks, multiple passes, embeddings, feature extraction—you quickly get into five-figure-a-month territory. And we weren’t even in production.


Just getting to a working prototype would’ve cost tens of thousands.
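To make that concrete, here is the kind of back-of-envelope math that gets you there. Every number below is an assumption for illustration, not a quote from any provider's price list.

```python
# Back-of-envelope hosted-API embedding cost; all figures are illustrative assumptions.
docs = 1_000_000
chunks_per_doc = 200             # aggressive chunking of long filings
tokens_per_chunk = 400
reprocess_passes = 3             # re-embedding after pipeline or prompt changes

total_tokens = docs * chunks_per_doc * tokens_per_chunk * reprocess_passes
price_per_million_tokens = 0.10  # assumed embedding rate, USD

embedding_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ${embedding_cost:,.0f} before any generation calls")
# 240,000,000,000 tokens -> $24,000 before any generation calls
```

Shift any of those assumptions upward, add the generation side of the pipeline, and the monthly bill climbs quickly.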


I didn't want to burn budget to start experimenting. So we shifted gears. What if we self-hosted an open model instead? Spin up a GPU server in the cloud, run DeepSeek locally, and manage the processing ourselves.
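The self-hosted route looks roughly like the sketch below, assuming vLLM and a DeepSeek checkpoint from Hugging Face. The model ID and sampling settings are illustrative; the point is that inference runs on hardware you control.

```python
# Sketch of serving an open model yourself instead of calling a hosted API.
# Model ID and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat", dtype="half")
params = SamplingParams(temperature=0.2, max_tokens=512)

prompt = (
    "Summarize the key risks discussed in the following meeting notes:\n\n"
    "<retrieved chunks go here>"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```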

That led us to Google Cloud—and to a whole new set of issues.


Pure Math Editorial:

Let me guess: GPU scarcity?


Sean:

Exactly. We targeted an NVIDIA T4 instance—solid for inference workloads, relatively low power draw, and more than enough for what we needed. But when I tried to spin one up? Nothing. No availability in Japan. Nothing in Taiwan. Not a single U.S. region.
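For context, provisioning one of those instances is a single call along these lines; the zone, machine type, and image are assumptions rather than our exact configuration, and when no T4s are free the request simply fails with a resource-availability error.

```python
# Illustrative attempt to create a T4 instance on Google Cloud via the gcloud CLI.
import subprocess

cmd = [
    "gcloud", "compute", "instances", "create", "t4-inference-box",
    "--zone=asia-northeast1-a",                      # assumed zone
    "--machine-type=n1-standard-8",
    "--accelerator=type=nvidia-tesla-t4,count=1",
    "--maintenance-policy=TERMINATE",                # required for GPU instances
    "--image-family=pytorch-latest-gpu",             # assumed Deep Learning VM image
    "--image-project=deeplearning-platform-release",
    "--boot-disk-size=200GB",
]
subprocess.run(cmd, check=True)  # raises if the zone has no T4 capacity
```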


I’ve been using GPU instances for years, and I’ve never hit that wall. Surprise! GPU demand has exploded across the board. Even the cheaper, preemptible instances—where you submit a job and wait your turn—were either delayed indefinitely or completely unreliable.

And we weren’t just batch processing. We needed a live web server running document search and a front-end interface. That pushed us toward reserved instances—which lock you into a one- or two-year commitment, with upfront payment.


We were staring at thousands of dollars just to try something out. No flexibility, no pause button, no ability to scale down if we changed course.


Pure Math Editorial:

So it wasn’t just about cost—it was about control.


Sean:

You lose control over your schedule, your iteration cycles, and your access to compute. With reserved instances, it doesn’t matter if you're actively using the machine or not—you’re still paying.


That might be fine in production. But in R&D? It’s a terrible setup. You want the freedom to spin something up, run a few jobs, shut it down—and not spend hundreds of dollars a day if you're not using it.

Then there’s the compliance layer. We focus on operating in highly regulated environments—asset management, healthcare. Running sensitive data through external APIs, even with encryption, introduces concerns.


So when you factor in infrastructure costs and other risks, owning the hardware starts to look like a more conservative option than you might think.


Pure Math Editorial:

So that was the breaking point?


Sean:

Yeah. At some point, the math just didn’t work.


APIs were too expensive. Cloud GPUs were either unavailable or locked behind multi-year, prepaid contracts. And we were still dealing with compliance overhead we couldn’t fully control.


So I looked at the numbers and realized: I could buy the exact same GPU—an NVIDIA T4—for around $1,200. One-time cost. No queueing. No usage caps. No API retention policies or region limitations. And for what we needed—RAG pipelines, inference at scale, internal tooling—it would easily handle the load.
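The buy-versus-rent arithmetic is simple enough to sketch. The card price is the figure above; the cloud hourly rate and utilization levels are assumptions for illustration.

```python
# Rough buy-vs-rent break-even for a T4; cloud rate and usage patterns are assumed.
card_price = 1_200              # one-time cost from the text
cloud_rate_per_hour = 0.60      # assumed on-demand T4 instance rate, USD

rnd_hours = 8 * 22              # R&D-style usage: working hours only
always_on_hours = 24 * 30       # live web server, running around the clock

for label, hours in [("R&D hours only", rnd_hours), ("24/7 web server", always_on_hours)]:
    monthly = cloud_rate_per_hour * hours
    print(f"{label}: ~${monthly:.0f}/month, card pays for itself in "
          f"{card_price / monthly:.1f} months")
# R&D hours only: ~$106/month, card pays for itself in 11.4 months
# 24/7 web server: ~$432/month, card pays for itself in 2.8 months
```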

That was the moment it clicked. For our use case—high-volume, high-sensitivity document processing—running things, quite literally, in-house wasn’t just doable. It was smarter.


The Takeaways

What started as a simple budget-related decision quickly became a case study in hidden tradeoffs—between flexibility and control, cost and compliance, convenience and capability.


Not to mention a few trips to the 3D printer—and some real-time education in the dark arts of cooling server-grade GPUs inside consumer desktops.

At one point, we were deep in a Slack thread about why “2-pin” refers to so many different things, why every cable looks the same but has a different name, and whether it’s acceptable to power a blower fan by stripping the connector off a USB charging cable and wiring it in directly.


(It is. The fan spins. The GPU stays cool. Success. Photos and videos when we get to that part of the series 🙂)


But I digress.


Cloud APIs are great for certain uses. But once you begin working with large volumes of documents and sensitive data—especially in finance, healthcare, or other regulated industries—the economics and the risks can shift fast.


You start to ask different questions:

  • Can I turn this off when I’m not using it?

  • Can I guarantee where my data lives?

  • Can I afford to experiment, or am I getting billed just for trying?

  • Can I audit what’s happening under the hood?


For us, the answer was to bring it in-house. Buying a GPU gives us predictable costs, full control over data flow, and a secure foundation for future client-facing work.

In the next post: deciding to order a GPU, and the initial install.


Contact Us to learn how we can help you build LLMs into your organization's day-to-day.


