Part 4: What’s Running on Our T4 Now That It Doesn’t Overheat (T4 GPU Inference in Action)
- Pure Math Editorial
With all the talk going on about AI, it still amazes us that so few people are talking about the ‘magic’ of embeddings—except for maybe in more technical circles. Large Language Models (LLMs) are a game-changer for data nerds:)
They enable rapid, large-scale extraction of meaning from unstructured documents and language-based information—think regulatory filings, marketing brochures, press releases, websites. And, it’s not just pulling keywords—it’s context and nuance.
Let that sink in.
More importantly, though, they make that meaning computable. Once the text is embedded, you can run mathematical operations on it—compare documents, measure semantic similarity, cluster by tone or intent, even build queries that return answers based on how something feels, not just what it says.
This isn’t just language anymore. It’s structured data made from unstructured language—numbers you can work with—and sure, you can talk to them too:)
These kinds of operations aren’t feasible with the old tools and methods. You could build a search index or run keyword filters, but you can't ask complex, language-based questions across thousands of documents—and get usable results back in seconds.
Regulatory filings are a good example. They are typically dense, language-heavy, and written in a broad range of formats, by different people, for different reasons. Take ADVs for example. There’s no standard formatting, and no shared vocabulary. Part 2 of Form ADV is basically marketing and sales materials.
They’re not really written to be parsed by machines—or honestly, even most people.
Trying to work with them using traditional tools—like keyword search or tagging systems—usually leads nowhere. You either get too much noise or you miss the one paragraph that actually matters.
That’s where embeddings and vectors come in. Instead of indexing documents by keywords, you break them down into smaller chunks—paragraphs, sections, sometimes even sentences—and encode each chunk as a vector. That vector represents the meaning of the text, not just the words in it. You can then store those vectors in a database and query them later using natural language.
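To make that concrete, here is a minimal sketch of the embed-and-store step in Python. It assumes a local Ollama server listening on its default port; the embedding model name is a placeholder, not necessarily what this build uses.

```python
# Minimal sketch: embed paragraph-level chunks through a local Ollama server.
# Assumes Ollama is running on its default port (11434) and that an embedding
# model such as "nomic-embed-text" has been pulled -- the article doesn't name
# the model, so treat that as a placeholder.
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBED_MODEL = "nomic-embed-text"  # placeholder; swap in whatever model you run

def embed(text: str) -> list[float]:
    """Return the embedding vector for one chunk of text."""
    resp = requests.post(OLLAMA_URL, json={"model": EMBED_MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

chunks = [
    "The adviser focuses on retirement income planning for clients in their 60s.",
    "Fees are charged as a percentage of assets under management.",
]

# In a real pipeline these vectors go into a vector database; a plain list is
# enough to show the shape of the work.
vector_store = [(chunk, embed(chunk)) for chunk in chunks]
```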
This is important! This is a big deal!
Semantic search doesn’t care about matching keywords. It’s able to look underneath. It doesn’t need the word “retirees” to show up in the document. If the text talks about helping people plan their income in their 60s, or optimizing withdrawals from tax-deferred accounts, the model can connect the dots.
It’s not pulling the answer from a spreadsheet. It’s estimating intent—based on language patterns, context, and proximity to similar ideas it’s seen before.
This isn’t magic. It’s data science and engineering. You need infrastructure to process the documents, store the embeddings, and manage how the system retrieves them later. That includes deciding how to chunk the text—where to split, how much to include in each segment, and how to preserve enough context for the chunk to still make sense on its own.
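Chunking itself is mostly plain string handling. Here is a rough sketch of one approach: split on paragraph breaks, merge paragraphs up to a size cap, and carry a small overlap forward so each chunk keeps some surrounding context. The size limits are illustrative, not the values used in this build.

```python
# One possible chunking strategy: split on blank lines (paragraphs), then merge
# short paragraphs until a rough size cap, carrying a little trailing overlap
# between chunks so each one still makes sense on its own. The numbers here are
# illustrative defaults, not the settings used in this build.
def chunk_document(text: str, max_chars: int = 1500, overlap_chars: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap_chars:]  # carry trailing context forward
        current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks
```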
Most of that requires experimentation. Not all document formats behave the same way, and not all LLMs produce the same results. That’s part of what this build is for. We’re not training a model. We’re not trying to do anything too fancy (yet). We’re just trying to process a large number of language-heavy documents in a way that makes them usable for our project.
And now we have the hardware. And it doesn’t overheat.

Pure Math Editorial:
What does that picture show?
Sean:
That’s basically just proof the setup is working. It’s from a tool that shows you what your GPUs are doing. We’ve got two in the machine—my old GeForce RTX 3070 and the new Tesla T4—and what that screenshot shows is the T4 is online, running inference, and not overheating.
At the time I took that, I was running Ollama, which is how we run our local language models.
You can see that it’s using the T4—about 870 megabytes of GPU memory—which means the drivers are installed, CUDA is working, and the machine is actually doing what it’s supposed to be doing.
It’s not just idle. It’s thinking.
You can also see the temperature. The T4 was sitting at 45 degrees Celsius in the first shot, and then 62 a little later. That’s still within a safe range, but it confirms what we’ve been seeing: these server-grade GPUs heat up fast when they’re not in a rack with controlled airflow. This thing was not designed to live under a desk.
But we’ve got the cooling dialed in now, so it’s stable. We’re running real models, on real data, locally. No tokens. No rate limits. No vendor lock-in. Just: here’s your GPU, here’s your model, go.
That’s the takeaway. We’re not just talking about building. This is the point where it’s actually doing the work.
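For readers who want to reproduce that check, the numbers Sean describes (memory in use and temperature per GPU) can be pulled programmatically through NVIDIA’s NVML bindings. The article doesn’t say which monitoring tool produced the screenshot, so treat this as one way to get at the same information, not the exact setup.

```python
# A rough code equivalent of the screenshot: poll each GPU's memory use and
# temperature through NVIDIA's NVML bindings (the nvidia-ml-py / pynvml package).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"{name}: {mem.used / 1024**2:.0f} MiB / {mem.total / 1024**2:.0f} MiB, {temp} °C")
finally:
    pynvml.nvmlShutdown()
```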
Pure Math Editorial:
Okay, so the hardware’s up and running. But what’s the actual goal here? What are you building this system to do?
Sean:
We’re processing hundreds of thousands of pages of financial documents—regulatory filings, fund pitch decks, DDQs, prospectuses.
Some people we’ve talked to think you can just dump these types of docs into a folder and ask ChatGPT to ‘Do Due Diligence’.
That’s not how this works.
Pure Math Editorial:
So how does this work?
Sean:
In simple terms, when we talk about preprocessing documents, what we’re building is a system that thinks about every chunk. It asks a series of questions or prompts about each chunk. Stuff like: What types of investment strategies do they focus on? What’s their AUM? Who are the key decision-makers?
We’re just getting going, so we’re not yet at the point of asking hundreds and hundreds of questions per chunk.
But like I said, you can’t just throw documents in and then do due diligence by asking the model to “do due diligence”.
You actually have to tell the model all of the things that someone actually doing due diligence would look for in the documents.
Pure Math Editorial:
What do you mean by ‘thinks’?
Sean:
Yeah, so “thinks” just means we’re running prompts against each chunk of text—usually that’s a paragraph, or maybe a few paragraphs together depending on how we split it. But yeah, the system is basically going through every chunk and asking specific questions ahead of time. We’re not waiting until someone asks a question and then searching across the raw text. That’s too slow.
Instead, we’re doing the work upfront. We’re asking each chunk: is there anything here about investment strategy? Anything about conflicts of interest? Do they mention fees, or use language that implies they're targeting retirees?
And then we take all those answers—structured responses—and store them in a vector database. So later, when someone asks a question like “Find advisors with over $500 million in assets under advisement and that specialize in alternatives for high-net-worth clients,” the system isn’t guessing. It’s matching against things it’s already thought about.
That’s what makes it fast—and way more accurate than trying to pull from raw text on the fly.
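A stripped-down version of that preprocessing loop might look like the sketch below: a fixed list of questions run against each chunk through the local Ollama API. The model name and the questions are placeholders; the real system asks far more, and far more specific, questions.

```python
# Sketch of the "think about every chunk" step: run a fixed set of questions
# against each chunk through the local Ollama generate endpoint. Model name and
# question list are placeholders, not the production configuration.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # placeholder for whatever model is loaded on the T4

QUESTIONS = [
    "What investment strategies, if any, does this text describe?",
    "Does this text mention fees or a fee structure? Summarize if so.",
    "Does this text suggest the firm targets retirees or retirement income planning?",
]

def ask(chunk: str, question: str) -> str:
    prompt = f"Answer based only on the text below.\n\nText:\n{chunk}\n\nQuestion: {question}"
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"].strip()

def preprocess(chunk: str) -> dict:
    """Structured answers for one chunk, ready to embed and store."""
    return {"chunk": chunk, "answers": {q: ask(chunk, q) for q in QUESTIONS}}
```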
Pure Math Editorial:
So once the system has gone through all those chunks and asked its questions—then what? What happens with those answers?
Sean:
The answers get stored in a vector database. That’s the part that makes it fast later.
Like, let’s say you ask: “Which firms manage alternatives for high-net-worth investors?” Instead of searching through every document from scratch, the system just pulls from what it already knows—because it’s already thought through all those chunks.
The key is: embeddings turn those chunks into numbers. And once it’s numbers, it’s just math. You’re not looping through a list like old software does, saying “Is this the best match? No. Is this the best match? No.” You’re just computing the closest match—like, literally calculating the best answer.
So the heavy lifting happens up front. And later, when someone asks a question, it’s basically instant. You're not asking the model to go figure it out—you’re asking it to retrieve something it’s already figured out.
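That “closest match” step is, at its core, a similarity calculation. Here is a bare-bones version with NumPy, leaving out the approximate-nearest-neighbor indexing a real vector database layers on top:

```python
# What "computing the closest match" boils down to: cosine similarity between
# the query vector and every stored vector, then take the highest score.
import numpy as np

def closest_match(query_vec, stored_vecs, texts):
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(stored_vecs, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(sims))
    return texts[best], float(sims[best])
```

Here `query_vec` would come from embedding the user’s question with the same model used during preprocessing, and `stored_vecs` are the vectors already sitting in the database.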
The Takeaways
We say this a lot. AI isn’t magic.
You have to understand how these systems work to get anything useful out of them consistently. You can’t just dump fifty documents into a folder and ask:
“Suggest to me the best advisors if I’m interested in alternative investments.”
You can try it if you’d like. It might come back and say:
“I’m sorry, I don’t see anything about that in any of these documents.”
The problem isn’t the documents—or the question.
The problem is that no one told the system what to look for.
Knowing the process is one thing. Knowing what matters—what to flag, what to extract, what to ask—is something else entirely. That’s where people with an intimate understanding of an industry and its processes come in.
Up next: Where the expertise shows up. In the prompts. In the logic. In the way systems are designed.

Contact Us to learn how we can help you build LLMs into your organization's day-to-day.
Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries.