Part 3: Fan Install or Why is EVERYTHING Called a 2-pin Connector?
A lot of people still don’t understand what Large Language Models (LLMs) actually do. They think you type in a prompt and the application simply retrieves an answer.
But every time you send a prompt, you're essentially starting from scratch. The prompt doesn't retrieve an answer—it triggers a simulation. The model is generating a response in real time, one word at a time, based on probabilities. What word is most likely to come next? And after that? And the one after that?
To do this, the model takes your prompt, converts it into numbers, and then runs those numbers through layer after layer of weighted rules—millions, sometimes billions of them—to figure out the best possible output.
Here’s an oversimplification of how it breaks down:
Tokenization
The process of breaking the input text into chunks (tokens), which are usually subwords or characters. For example, “internationalization” might be split into “inter”, “national”, “iz”, “ation”.
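For a concrete (if simplified) look at that, here's a small sketch using the open-source tiktoken library, one tokenizer among many; the exact splits depend on which tokenizer a given model uses:

```python
# A minimal tokenization sketch using the tiktoken library (assumed installed).
# Different models use different tokenizers, so the exact splits will vary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by several OpenAI models

text = "internationalization"
token_ids = enc.encode(text)                       # the integer IDs the model actually sees
pieces = [enc.decode([tid]) for tid in token_ids]  # the human-readable chunks behind those IDs

print(token_ids)  # a short list of integers
print(pieces)     # subword chunks; which ones you get depends on the tokenizer
```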
Embeddings & Vectors
An embedding is a way of representing each token as a list of numbers—a vector. You can think of a vector as a kind of digital fingerprint that captures the meaning and context of a word based on how it's been used across billions of examples (when the model was trained). Words that are similar in meaning—like “portfolio” and “allocation”—will have vectors that are close together in space, even if they don’t look alike.
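To make "close together in space" concrete, here's a toy sketch with made-up three-dimensional vectors (real embeddings run to hundreds or thousands of dimensions), using cosine similarity, a common way to measure how closely two vectors point in the same direction:

```python
# A toy illustration of embedding similarity. These three-dimensional vectors are
# made up for the example; real embeddings come from a trained model and are much larger.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

portfolio  = np.array([0.81, 0.12, 0.55])  # hypothetical embedding for "portfolio"
allocation = np.array([0.78, 0.20, 0.58])  # hypothetical embedding for "allocation"
banana     = np.array([0.05, 0.91, 0.02])  # hypothetical embedding for "banana"

print(cosine_similarity(portfolio, allocation))  # high: related meanings sit close together
print(cosine_similarity(portfolio, banana))      # lower: unrelated meanings sit farther apart
```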
These vectors are what the model actually processes. It doesn't "understand" words—it just works with patterns in the numbers. Your prompt gets turned into a series of vectors, and then the model runs those vectors through many layers of calculations to figure out which word should come next. That's what generates a response.
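At the very end of all those calculations, the model has a score for every token in its vocabulary, and those scores get turned into probabilities. Here's a toy sketch of just that last step, with a made-up three-word vocabulary and made-up scores:

```python
# A toy sketch of the final step: turning raw scores (logits) into probabilities
# for each candidate next token. The vocabulary and scores here are made up;
# a real model scores tens of thousands of tokens at every step.
import numpy as np

vocab  = ["stocks", "bonds", "pizza"]
logits = np.array([3.2, 2.9, -1.0])  # hypothetical scores from the model's layers

probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax: scores become probabilities
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")

# The model picks (or samples) a token from this distribution, appends it to the
# sequence, and repeats the whole process for the next token, one word at a time.
```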
The same process holds true if you're preprocessing language-based documents and building knowledge bases for RAG (retrieval-augmented generation) pipelines.
In those cases, you're not ‘chatting’ to generate a response—you’re preparing your documents so a model can use them later. The documents are converted into tokens and embeddings, just like a prompt would be. But instead of feeding them directly into the model, those embeddings are stored in a vector database.
A vector database is a specialized system that can store and search through these numerical representations efficiently. When a user asks a question, the system looks for the chunks of text—now represented as vectors—that are most similar in meaning to the question. Those chunks are then passed to the model as additional context to help generate a better response based on your documents.
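Here's a stripped-down sketch of that retrieval step. It uses a plain Python list instead of a real vector database, and a hypothetical embed() function stands in for whatever embedding model you would actually call:

```python
# A stripped-down sketch of retrieval in a RAG pipeline. A real system would use a
# proper vector database and a real embedding model; embed() below is a hypothetical
# stand-in that returns random vectors, so the ranking here is arbitrary.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in: a real pipeline calls an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)  # 384 dimensions, a common embedding size

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Indexing": split documents into chunks and store each chunk with its vector.
chunks = [
    "Q3 portfolio allocation shifted toward fixed income.",
    "The office coffee machine is out of order.",
    "Equity exposure was reduced by five percent last quarter.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# "Querying": embed the question and pull back the most similar chunks.
question = "How did the portfolio allocation change?"
query_vector = embed(question)
ranked = sorted(index, key=lambda pair: cosine_similarity(query_vector, pair[1]), reverse=True)

# With a real embedding model, the closest-in-meaning chunks come back first;
# those chunks are what get handed to the model as extra context.
for chunk, _ in ranked[:2]:
    print(chunk)
```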
GPUs can process thousands of operations at once, making them ideal for workloads where the same kind of math needs to be done across lots of data. That’s exactly what happens when you’re preparing documents for a RAG system—splitting them into chunks, generating embeddings, comparing vectors, and running similarity searches across high-dimensional spaces.
It’s the same math that powers the chat interfaces—matrix multiplications, attention layers, all of that—but instead of doing it for a single prompt at a time, you're running it across many documents simultaneously. It’s still inference—it’s just done in bulk, across millions of tokens, to prep documents for retrieval later.
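As a rough sketch of what "in bulk" means, here's a toy PyTorch example (assuming PyTorch is installed and a CUDA-capable card like the T4 is available) that pushes a whole batch of vectors through one made-up layer of weights in a single call:

```python
# A toy sketch of batched GPU math with PyTorch (assumed installed). The sizes
# are made up; real models have far more layers and far larger weight matrices.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # the T4, if it's present

batch_of_chunks = torch.randn(512, 768, device=device)  # 512 document chunks, 768 numbers each
layer_weights   = torch.randn(768, 768, device=device)  # one layer's worth of weights

# One matrix multiplication pushes every chunk through the layer at once;
# the GPU runs the underlying operations in parallel rather than one prompt at a time.
outputs = batch_of_chunks @ layer_weights
print(outputs.shape)  # torch.Size([512, 768])
```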
And the GPUs get hot.
Pure Math Editorial:
Once the T4 was installed, what happened when you started using it?
Sean:
It got very, very hot, very quickly. I had a decent cooling setup for the case—fans, heatsinks—so I was curious to see if it would also cool the T4. It didn’t. It just started cooking.
Pure Math Editorial:
So what did you do?
Sean:
I did some research and found a bunch of ways people solved this issue. I went with a cheap electronics fan, the kind of small blower you might find in a radio or a remote-control car, and 3D printed a bracket I found for it online.
Pure Math Editorial:
And it worked?
Sean:
Physically, yeah. Powering it was an issue, though. It came with a 2-pin connector—which sounds specific, but apparently “2-pin connector” means a thousand different things, depending on whether you’re talking about hobby electronics or PC parts, and even then there are multiple types and sizes of each…
Pure Math Editorial:
So how’d you power it?
Sean:
At that point I just stripped an old USB cable with a “2-pin” plug… It didn’t fit, so I cut the plastic casing off, jammed the wires together, and slapped on some electrical tape. I’m no electrician, but I wanted to use my GPU.
Pure Math Editorial:
And it actually runs?
Sean:
It runs. It’s slightly underpowered—USB gives it 5 volts, it wants 12—but it still spins. It displaces enough air to keep the card from overheating.
Pure Math Editorial:
Did you ever find the right adapter?
Sean:
It’s on its way in the mail, but in the meantime I have a working GPU and fan :)
The Takeaways
Getting the T4 installed was one thing. Keeping it cool was another.
This wasn’t exactly a performance build. It was just the fastest way to get a passively cooled T4 working in my computer—well enough to run real inference jobs without throttling or crashing. That meant figuring out airflow, 3D printing a mount, and wiring power manually because none of the connectors matched anything.
It wasn’t elegant, but it worked. And it gave us what we needed: a stable environment to start pushing full-scale document workloads without waiting on cloud GPUs or dealing with quota limits.
Up next in Part 4:
Now that the GPU’s stable and not overheating, we’ll look at what it actually took to start running jobs: document batching, performance tuning, and getting DeepSeek to process data the way we needed.
Contact Us to learn how we can help you build LLMs into your organization's day-to-day.
Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries.