Part 2: Why Buying a T4 Became the Easiest Option.
- Pure Math Editorial
- May 6
- 6 min read
A candid “conversation” between our CEO, Sean Douglas, and Pure Math Editorial. Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the ways we are leveraging generative AI within our organization and with our clients. Designed for case studies, thought-leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to produce clear, compelling, structured writing that highlights the impact of AI across projects and industries.
Most people interact with large language models through consumer-facing tools like ChatGPT, Claude, Perplexity, or Gemini. You write a prompt, it gives you an answer.
What many first-time users don’t realize is that those chat-based tools are just one kind of interface—an application layer—sitting on top of the underlying large language model (LLM). More typically, the application is connected to several LLMs at once.
When you type in your prompt and the model responds—that’s an API call. It includes both your prompt and the model’s response. And every API call has a cost.
But these systems don’t charge by the prompt or the word—they charge by the token.
A token is a chunk of text. Sometimes it’s a whole word, sometimes it’s just a few letters or punctuation. For example, the word “investment” might be two tokens ("invest" + "ment"), but “internationalization” might break into four ("inter" + "national" + "iz" + "ation"). A sentence like “What is your outlook on the private markets?” might be roughly ten tokens.
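To make that concrete, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer to see how a given string actually splits; the splits in the paragraph above are illustrative, and the exact pieces and counts vary by tokenizer.

```python
# Rough token-counting sketch using OpenAI's open-source tiktoken library.
# The exact splits depend on which encoding/model you pick.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

for text in ["investment", "internationalization",
             "What is your outlook on the private markets?"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```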
Every model charges a different amount for those tokens. Rates range from fractions of a cent per thousand tokens to significantly more, depending on the model. And input (prompt) tokens and output (response) tokens are priced separately, with output tokens typically costing more.
OpenAI’s GPT-4, for instance, might cost $30 to $60 per million tokens, depending on the configuration. Cheaper models like GPT-3.5 might be closer to $1 to $5 per million tokens. But the real cost depends on volume.
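As a back-of-envelope illustration, the cost of a single call is just token counts times rates. The rates below are placeholders in the same ballpark as the figures above, not anyone’s current price list.

```python
# Back-of-envelope API cost estimate. The rates are illustrative placeholders
# in USD per million tokens, not a real provider's price list.
PRICE_PER_MILLION = {
    "bigger-model":  {"input": 30.00, "output": 60.00},
    "smaller-model": {"input": 1.00,  "output": 2.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICE_PER_MILLION[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# e.g. a 2,000-token prompt that gets a 500-token answer
print(estimate_cost("bigger-model", 2_000, 500))   # ~$0.09 per call
print(estimate_cost("smaller-model", 2_000, 500))  # ~$0.003 per call
```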
Consumer-grade users never see this because they’re on flat-rate plans—$20/month for ChatGPT Plus, for example—which give you access to whichever models OpenAI makes available. But what you’re really paying for is a prepackaged token allowance. Once you hit certain thresholds (which OpenAI doesn’t clearly publish), you may see slower response times or usage caps.
That’s fine for individuals writing basic content or code. But in enterprise contexts—where teams are building tools and automated workflows—they’re often making direct API calls to the models, not using the chat interfaces at all.
And that’s when the pricing model changes. You’re no longer ‘chatting’—you’re developing automated systems that do things like:
Break down large PDFs or structured reports into smaller chunks optimized for model input,
Generate embeddings for each chunk to enable fast search and retrieval,
Run summarization or classification tasks across each section of a document,
Extract structured data or features for downstream workflows,
And often reprocess the same content multiple times as models are tuned or updated.
Each of those steps consumes tokens—and when you’re doing it across tens or hundreds of thousands of documents, those token counts add up quickly. You’re now in a world where the cost of experimentation alone can reach tens of thousands of dollars.
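For a sense of how those steps translate into token spend, here is a hedged sketch of the chunk-and-count portion of such a pipeline. The chunk size, document counts, and embedding rate are assumptions for illustration, not measured numbers from our project.

```python
# Hedged sketch: estimating the token footprint of a chunk-and-embed pass
# over a document set. Chunk size, corpus size, and the per-million-token
# embedding rate below are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CHUNK_TOKENS = 500            # assumed chunk size fed to the embedding model
EMBED_RATE_PER_M = 0.10       # assumed USD per million embedding tokens

def chunk(text: str, size: int = CHUNK_TOKENS) -> list[str]:
    """Split text into chunks of roughly `size` tokens."""
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + size]) for i in range(0, len(ids), size)]

def embedding_cost(documents: list[str]) -> float:
    """Estimated cost of embedding every chunk of every document once."""
    total_tokens = sum(len(enc.encode(doc)) for doc in documents)
    return total_tokens * EMBED_RATE_PER_M / 1_000_000

# 100,000 documents at ~10,000 tokens each is a billion tokens,
# or roughly $100 per embedding pass at the assumed rate.
```

The embedding pass is the cheap part. What pushes experimentation into the tens-of-thousands-of-dollars range is running generation models (summarization, classification, extraction) over the same corpus again and again, at per-token rates that can be ten to a hundred times higher, while also paying for the output tokens they produce.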
That’s the point where teams start asking serious questions about infrastructure tradeoffs, cost control, and whether it’s time to start hosting their own models.
Pure Math Editorial:
And that’s when buying a T4 became the easiest option?
Sean:
Yeah. I did the math and realized I could just buy the same GPU we were trying to rent. T4s were going for around $1,200 in Japan. One-time cost. No queue. No usage caps. No region limitations. And for what we were doing—RAG pipelines, document processing, generating embeddings—it would be more than enough.
I checked Amazon Japan and found plenty in stock. In the U.S., it was a totally different story—almost no availability, and where they were available, prices were ridiculous. $4,000, $5,000, even higher in some cases.
But in Japan, I could get one delivered within a couple of weeks. So I ordered it.

Pure Math Editorial:
Once the GPU arrived, what was the first step?
Sean:
I was planning to install the T4 into my personal machine—a consumer-grade gaming and work PC—so I already knew it wasn’t exactly designed for this kind of hardware. But I figured it was close enough, and I was prepared to make a few upgrades to get it there.
While I waited for it to show up, I dealt with the RAM. I didn’t have enough to support the kind of processing we wanted to do, so I ordered 64 gigabytes and installed that first.
I was also thinking about the cooling. The T4 isn’t a massive card, but it’s built for rack-mounted servers. It assumes things like directional airflow and consistent cooling, which a desktop just doesn’t provide by default.
I decided I’d install the T4 and sort out the cooling once I knew for sure how hot it would run.
Pure Math Editorial:
What did the software setup look like?
Sean:
I installed NixOS because that’s what I’m most comfortable with, and I wanted a clean environment I could configure exactly how I needed. But setting up the T4 wasn’t plug-and-play. I had to deal with driver conflicts right away.
The main issue was that my system already had a GPU driving my displays, and now I was adding a second one specifically for compute. But the NVIDIA drivers don’t always handle that split cleanly, especially on Linux.
So I had to configure everything—make sure the display GPU and the compute GPU weren’t stepping on each other, isolate the CUDA environment, and test that the T4 was actually being used for processing and not just sitting idle.
It took a few tries, some research, and some trial and error, but eventually I got the system stable and CUDA running correctly.
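For anyone checking the same thing, a sanity check along these lines (a generic PyTorch sketch, not the exact script used here) confirms the card is visible to CUDA and is actually doing the compute:

```python
# Sanity check: confirm the T4 is visible to CUDA and route work to it
# explicitly, so the display GPU isn't picked up by accident.
import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# Assumption for illustration: the T4 showed up at index 1, with the
# display GPU at index 0. Pin compute to it rather than trusting defaults.
t4 = torch.device("cuda:1")
x = torch.randn(4096, 4096, device=t4)
y = x @ x  # if this runs, the T4 is doing real work
print(y.device, torch.cuda.memory_allocated(t4) / 1e6, "MB allocated")

# Alternatively, launching with CUDA_VISIBLE_DEVICES=1 hides the display
# GPU from CUDA entirely, so cuda:0 inside the process is the T4.
```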
Pure Math Editorial:
Once you had the T4 installed and CUDA running, did everything just work?
Sean:
Hahaha. Not quite. As soon as I started using the GPU for actual processing, it started overheating. Like, within minutes.
At first I thought maybe I’d misconfigured something in the drivers, but then I checked the thermals and saw the temperature climbing fast—into the danger zone almost immediately under load.
Like I said, the T4 doesn’t have its own fan. It’s a passively cooled card, which works fine in a server rack where you’ve got directional airflow moving front-to-back across the board. But in a consumer desktop case, with more diffuse airflow, it just didn’t work.
My case already had decent cooling—multiple fans, open space, a standard layout—but that didn’t matter. The T4 expects forced air straight through the card, and without that, it basically just bakes itself.
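For reference, a minimal way to watch thermals under load is to poll NVML, the interface behind nvidia-smi. This is a generic sketch, assuming the nvidia-ml-py package is installed, not the exact monitoring we ran.

```python
# Poll GPU temperature and utilization once a second via NVML
# (the same interface nvidia-smi uses). Requires the nvidia-ml-py package.
import time
from pynvml import (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetTemperature, nvmlDeviceGetUtilizationRates,
                    NVML_TEMPERATURE_GPU)

nvmlInit()
handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]

while True:
    readings = []
    for i, h in enumerate(handles):
        temp = nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU)  # degrees C
        util = nvmlDeviceGetUtilizationRates(h).gpu               # percent
        readings.append(f"gpu{i}: {temp}C {util}%")
    print("  ".join(readings))
    time.sleep(1)
```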


Pure Math Editorial:
So once you realized the card was overheating, what did you do?
Sean:
I started researching how people were handling this, because I knew I wasn’t the first person to try putting a server-grade GPU into a desktop case.
I figured it was a common problem for anyone trying to run a T4 this way: if you don’t create that airflow yourself, the card just won’t work.
There were a bunch of DIY solutions floating around—everything from CPU liquid cooling to fully custom 3D-printed mounts. That’s when I started going down the rabbit hole to figure out how I could build a cooling setup that would actually work in my system.
The Takeaways
Buying a GPU would give us what the cloud couldn’t for this project: guaranteed access, predictable cost, and full control over our infrastructure.
But that decision came with tradeoffs of its own.
The T4 wasn’t designed for consumer desktops. It doesn’t have onboard cooling. It expects to live in a server rack, not under a desk.
So while the economics made sense, the physical setup became its own engineering problem.
In the next post, we’ll cover how we got the cooling situation handled—a few trips to the 3D printer, a USB-powered blower fan, and a little trial and error to keep our GPU from melting down.
Contact Us to learn how we can help you build LLMs into your organization's day-to-day.