What Happens When You Upload a 93 Page Fund Prospectus to an AI Chatbot?

Pure Math Editorial
Jan 26
10 min read

Our Real-World Example: A 93-Page Fund Prospectus

We uploaded the CAZ Strategic Opportunities Fund Prospectus (93 pages, approximately 408,591 characters) to ChatGPT, Claude, and Perplexity. Like we mentioned in a previous post, the AI doesn't "read" your document the way you do. Instead, it goes through a sophisticated process to break it down, understand it mathematically, and prepare it for answering questions.

Let's break down exactly what happens, step-by-step.

The Big Picture: From Document to Answers (The 5-Step Process)

Think of this like preparing a massive cookbook for quick recipe lookup:

Upload & Text Extraction: The AI "scans" your PDF and pulls out all the text
Chunking: Breaks the document into digestible "recipe cards"
Embedding: Converts each chunk into a unique mathematical "fingerprint"
Storage: Files these fingerprints in a special mathematical library (vector database)
Retrieval: When you ask a question, it finds the most relevant chunks and sends them along with the users prompt to generate an answer

Step 1: Upload & Text Extraction

What you see: You click "upload," select your PDF, and it uploads in seconds.

What's actually happening:

The AI system receives your 93-page PDF (about 1.1 MB) and immediately starts extracting text:

For text-based PDFs (like this prospectus): Extracts the embedded text layer directly—fast and accurate
For scanned/image PDFs: Uses OCR (Optical Character Recognition) to "read" the text from the image—slower and potentially less accurate

The prospectus specifics:

Document size: ~408,591 characters
Estimated word count: ~60,000-70,000 words
Estimated token count: ~100,000-110,000 tokens

What's a token? Think of tokens as "word pieces." The AI doesn't read full words—it breaks language into smaller units. For example:

"investment" = 1 token
"uninvestable" = 2 tokens ("un" + "investable")
"CAZ" = 1 token

Why tokens matter: AI models have strict limits on how much text they can process at once (their "context window"). A prospectus like this is too large for it to read it all at once, which brings us to...

Step 2: Chunking (Breaking It Into Digestible Pieces)

The analogy: Imagine you're creating a reference library from a 93-page manual. You can't hand someone the entire manual every time they have a question. Instead, you cut it into separate index cards, each covering a specific topic.

What's happening technically: The AI splits your 93-page document into smaller "chunks"—overlapping segments of text that preserve context while fitting within processing limits.

How Each Platform Chunks The Prospectus:

ChatGPT

Chunking parameters:pinecone+2

Chunk size: 800 tokens (~600 words, ~1.5-2 pages)
Overlap: 400 tokens (50% overlap between consecutive chunks)
Method: Recursive character splitting (tries to break on paragraphs, then sentences, then words)

For the 93-page prospectus (~110,000 tokens):

Number of chunks created: ~275-300 chunks
Why so many: Each chunk is only 800 tokens, and chunks overlap by 50%, so you need many to cover the full document

Chunk Overlap:

Why overlap matters: Imagine a sentence that says "The management fee is 1.25% annually." If that sentence gets split between two chunks, neither chunk fully captures the fee structure. Overlap ensures critical information isn't lost at boundaries.

Claude

Chunking parameters:developer.nvidia+2

Chunk size: ~800 tokens (similar to ChatGPT)
Overlap: Variable (dynamically adjusted)
Method: Contextual Retrieval—generates explanatory context for each chunk before embedding

For the 93-page prospectus:

Number of chunks created: ~275-300 chunks
Unique feature: Before storing each chunk, Claude reads the full document and adds a brief contextual summary to each chunk

Example transformation:

Original chunk:

"The minimum initial investment for Class A Shares, Class D Shares and Class R Shares of the Fund is $25,000..."

Claude's contextualized version:

"[This chunk describes investor minimums from the CAZ Strategic Opportunities Fund prospectus, Section: Securities Offered] The minimum initial investment for Class A Shares, Class D Shares and Class R Shares of the Fund is $25,000..."

This context helps Claude avoid retrieving irrelevant chunks later.

Perplexity

Chunking parameters:github+1

Chunk size: 512 tokens (~380 words, ~1 page)
Overlap: 125 tokens (~24% overlap)
Method: Sliding window with metadata enrichment

For the 93-page prospectus:

Number of chunks created: ~430-470 chunks
Why more chunks: Smaller chunk size (512 vs. 800 tokens) means more chunks needed

Perplexity's twist: Adds metadata to each chunk (page numbers, section headers, document type) to improve ranking during retrieval.

Step 3: Embedding (Converting Text to Mathematical Fingerprints)

The analogy: Imagine converting every recipe card into a unique barcode that captures its "essence"—not just the words, but the meaning. Similar recipes get similar barcodes.

What's happening technically:

Each text chunk is transformed into an embedding—an array of numbers (a "vector") that represents the chunk's semantic meaning in mathematical space.

Understanding Embeddings

What they look like:

Your chunk: "The Fund invests primarily in private equity funds..."

Becomes: [0.23, -0.15, 0.87, 0.42, ..., -0.61] ← This is a 1,536-number array for ChatGPT

Why this matters:

The numbers capture meaning, not just words. Consider these two phrases:

"The fund focuses on alternative investments"
"The portfolio concentrates on non-traditional assets"

Even though the words are completely different, their embeddings would be very similar because they mean the same thing. The AI places them close together in "mathematical space."

Words with similar meanings cluster together. Distance = similarity.

Embedding Models Used by Each Platform:

ChatGPT

Model: text-embedding-3-largeplatform.openai+2
Dimensions: 256 dimensions (downsized from 3,072 for efficiency)
What this means: Each chunk becomes a list of 256 numbers

For the prospectus:

275-300 chunks × 256 numbers each = ~70,400-76,800 total numbers stored

Claude

Model: Proprietary Anthropic embedding model
Dimensions: Not publicly disclosed (estimated 768-1,536)
Special feature: Embeddings generated from contextualized chunks (with added context)

For the prospectus:

275-300 contextualized chunks embedded into vector space

Perplexity

Model: Proprietary (likely similar to OpenAI or open-source alternatives)
Dimensions: Estimated 768-1,536
Enhancement: Metadata-enriched embeddings

For the prospectus:

430-470 chunks (smaller chunks = more embeddings)

Step 4: Storage (Organizing the Mathematical Library)

The analogy: All those barcoded recipe cards are now filed in a special library where similar recipes automatically sit near each other—even if they use different ingredients or cooking methods.

What's happening technically:

The embeddings are stored in a vector database—a specialized system optimized for finding "nearby" vectors quickly.

How Vector Databases Work:

Unlike regular databases that search for exact matches ("Find rows where Fund_Name = 'CAZ'"), vector databases search for semantic similarity ("Find chunks similar in meaning to my question").

Search methods:platform.openai+2

Cosine similarity: Measures the angle between vectors (most common)
- Close angle = similar meaning
- Used by ChatGPT, Claude, Perplexity
Euclidean distance: Measures straight-line distance in vector space
- Short distance = similar meaning

Indexing for speed:

With 275-470 chunks, the AI could compare your question to every single chunk—but that's slow. Instead, it builds an index that pre-organizes vectors for fast lookup:

HNSW (Hierarchical Navigable Small World): Creates a multi-layer graph connecting similar vectors—like a highway system with exits leading to local roads
IVF (Inverted File Index): Clusters similar vectors into groups, then searches within the relevant group—like organizing books by genre before searching

For the prospectus:

All 275-470 chunk embeddings stored in vector database
Indexed using HNSW or IVF for sub-second retrieval
Takes ~1-3 seconds for initial processing during upload

Step 5: Retrieval (Finding Relevant Chunks When You Ask Questions)

Now the magic happens. You ask: "What are the management fees for Class A shares?"

What's happening behind the scenes:

Sub-Step 5a: Your Question Gets Embedded

Your question goes through the same embedding process as the document chunks:

"What are the management fees for Class A shares?"

→ [0.31, -0.22, 0.74, ..., -0.43] (256 numbers for ChatGPT)

Sub-Step 5b: Vector Similarity Search

The system compares your question's embedding to all stored chunk embeddings and finds the closest matches:

textYour question vector: [0.31, -0.22, 0.74, ...] ↓ Compare Chunk 47 (fee table): [0.29, -0.24, 0.71, ...] ← Very similar! (0.94 similarity) Chunk 48 (Class A details): [0.28, -0.21, 0.73, ...] ← Very similar! (0.91 similarity) Chunk 103 (risk factors): [0.02, -0.61, 0.15, ...] ← Not similar (0.32 similarity)

Sub-Step 5c: Ranking & Selection

The system ranks all chunks by similarity and selects the top matches.

ChatGPT's Retrieval Process:openai+1

Hybrid search approach:

Keyword search (BM25): Finds chunks containing exact words like "management" "fee" "Class A"
Semantic search (embeddings): Finds chunks with similar meaning even if words differ
Combines both: Merges results from keyword + semantic searches
Reranks: Secondary model evaluates which chunks are most relevant
Retrieves: Typically 10-20 chunks selected for answering

For the fee question:

Chunks retrieved: 12-15 chunks (~9,600-12,000 tokens, ~15-20 pages worth of content)
Content: Fee table, Class A description, expense footnotes, examples section
Time: ~300-800 milliseconds for retrieval

Claude's Retrieval Process:anthropic+2

Context window advantage:

Claude 3.5 Sonnet has a 1 million token context window
Your entire 93-page prospectus (~110,000 tokens) could fit with room to spare

Two approaches:

Full document mode: If prospectus < 1M tokens, Claude loads the entire document into context and uses attention mechanisms to focus on relevant sections
Retrieval mode: For larger documents, uses semantic search similar to ChatGPT but with contextualized chunks

For the fee question:

Potential approach: Load full prospectus into context, use retrieval-style attention to identify relevant sections
Chunks retrieved (if using retrieval): 10-15 contextualized chunks
Time: ~400-1,000 milliseconds (slower due to larger context)

Perplexity's Retrieval Process:github+1

Hybrid document-web approach:

Searches uploaded document chunks
Simultaneously searches real-time web for related information
Combines internal + external sources

For the fee question:

Chunks retrieved: 15-20 chunks (smaller chunks = more needed for context)
External search: May also find SEC filings, fund databases, or general fee structure info
Time: ~500-1,200 milliseconds (web search adds latency)

Sub-Step 5d: Answer Generation

Finally, the AI feeds the retrieved chunks + your question into the language model to generate an answer:

The prompt that actually goes to the AI (simplified):

`textContext (retrieved chunks): [Chunk 47]: "ANNUAL FUND EXPENSES - Management Fee: 1.25% for all classes..." [Chunk 48]: "Class A Shares are subject to a sales charge of up to 3.00%..." [Chunk 12]: "Total Annual Fund Operating Expenses: Class A: 3.77%..."

User Question: What are the management fees for Class A shares?

Generate answer:`

The AI reads the context and generates:

"The management fee for Class A shares is 1.25% annually of the Fund's average net assets. Note that this is just the management fee component; total annual fund operating expenses for Class A shares are 3.77% before fee waivers."

Platform-by-Platform Comparison of The CAZ Prospectus

Here's a summary of what actually happens when you upload your 93-page, ~110,000-token CAZ prospectus to each platform:

Aspect	ChatGPT Pro	Claude Pro	Perplexity Pro
Upload processing time	2-5 seconds	3-6 seconds	3-7 seconds
Chunking method	Recursive splitting	Contextual retrieval	Sliding window + metadata
Chunk size	800 tokens	~800 tokens	512 tokens
Total chunks created	~275-300	~275-300 (contextualized)	~430-470
Embedding dimensions	256	~768-1,536 (estimated)	~768-1,536 (estimated)
Storage approach	Vector database	Session memory (unless Files API)	Ephemeral vector store
Retrieval chunks per query	10-20 chunks (~12K tokens)	10-15 chunks OR full doc (1M window)	15-20 chunks
Retrieval method	Keyword + semantic hybrid	Semantic + contextual	Document + web hybrid
Answer generation time	2-4 seconds	3-6 seconds	4-8 seconds
Total query time	~3-5 seconds	~4-7 seconds	~5-9 seconds

Common Questions About Chunking & Embedding

Q: Why not just load the whole document every time?

A: Token limits and cost.

GPT-4's context window: 128,000 tokens
Your prospectus: ~110,000 tokens
Would consume 85% of context window for just the document, leaving little room for conversation history, instructions, or output

Additionally, processing 110,000 tokens costs significantly more than processing 12,000 tokens (10-20 retrieved chunks). Retrieval is ~9x more cost-efficient.

Q: How accurate is retrieval? Could it miss important information?

A: Not perfect—this is the critical limitation.

Measured performance for financial documents:developers.openai+2

Recall@10 (finds relevant info in top 10 chunks): 81-97% depending on optimization
Miss rate: 3-19% of queries retrieve incorrect or incomplete chunks

What this means for your prospectus:

Simple questions ("What's the fund size?"): High accuracy (~95%+)
Complex questions ("Explain all redemption restrictions"): Lower accuracy (~75-85%), higher risk of missed details

Q: What happens if the answer requires information from multiple sections?

A: This is where chunking creates problems.

Example: "What's my net cost for Class A shares accounting for fees and early redemption penalties?"

This requires:

Management fee (1.25%) - Section 1
12b-1 fee (0.60%) - Section 1
Sales charge (3.00%) - Section 2
Early redemption fee (2.00%) - Section 5
Total expense ratio - Section 1

If the retrieval system misses even one relevant chunk, the answer will be incomplete. This is why financial due diligence cannot rely solely on AI outputs—cross-document reasoning across multiple sections is where errors accumulate.

Q: Does the AI remember what chunks it retrieved in previous questions?

A: Partially, through conversation history.

ChatGPT/Claude/Perplexity: Maintain conversation history (previous Q&As), but each new question triggers fresh retrieval
No memory of chunks: The AI doesn't remember "I used Chunk 47 last time"—it re-searches every time
Implication: You might ask two related questions and get answers from completely different chunks, potentially creating inconsistencies

Q: Can I improve accuracy by asking better questions?

A: Yes, significantly.

Poor question: "Tell me about fees"

Too vague → retrieves scattered chunks about various fee types

Better question: "What is the management fee percentage for Class A shares as stated in the Annual Fund Expenses table?"

Specific → retrieves exact relevant chunks

Best question: "According to the Summary of Fees and Expenses section, what is the management fee for Class A shares, and does it differ from other share classes?"

Very specific + section reference → highest retrieval precision

The Critical Takeaway for Due Diligence

Understanding chunking and embedding reveals why AI chatbots aren't reliable for business-critical document analysis:

Chunking breaks context: Your prospectus is artificially divided into 275-470 pieces. Information spanning multiple chunks may not be fully captured.
Retrieval isn't guaranteed: Even with 97% recall, 3% of queries miss relevant information. For a fund prospectus with hundreds of critical details, that's dozens of potential misses.
Semantic search has blind spots: If you ask about "redemption terms" but the document uses "repurchase provisions," the embedding may not recognize them as equivalent despite overlapping meaning.
No verification of completeness: The AI doesn't know what it doesn't know. It answers based on retrieved chunks without checking if other relevant information exists elsewhere in the document.
Cross-chunk reasoning is weak: Questions requiring synthesis across multiple sections (fees + restrictions + timing) are where accuracy degrades most.

For consumer-grade chatbots: These are research assistants, not analysts. Use them for initial exploration, then verify every output manually against the source document.

Conclusion: Under the Hood of AI Document Processing

When you upload your 93-page fund prospectus to ChatGPT, Claude, or Perplexity, a sophisticated pipeline transforms your document into mathematical representations optimized for semantic search. The process—extraction, chunking, embedding, storage, and retrieval—enables AI to "understand" and answer questions about documents far too large to process all at once.

But this very process introduces limitations: chunking fragments context, embeddings approximate meaning imperfectly, and retrieval can miss critical details. For educational purposes or initial research, this technology is transformative. For investment due diligence worth millions? The 3-19% miss rate and cross-document reasoning weaknesses mean human verification remains non-negotiable.

Understanding what happens "under the hood" isn't just technically interesting—it's essential for knowing when to trust AI outputs and when to dig deeper yourself.

Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries. As with any AI-based project, human oversight is employed throughout the content creation process.

What Happens When You Upload a 93 Page Fund Prospectus to an AI Chatbot?

Our Real-World Example: A 93-Page Fund Prospectus

The Big Picture: From Document to Answers (The 5-Step Process)

Step 1: Upload & Text Extraction

Step 2: Chunking (Breaking It Into Digestible Pieces)

How Each Platform Chunks The Prospectus:

ChatGPT

Claude

Perplexity

Step 3: Embedding (Converting Text to Mathematical Fingerprints)

Understanding Embeddings

Embedding Models Used by Each Platform:

ChatGPT

Claude

Perplexity

Step 4: Storage (Organizing the Mathematical Library)

How Vector Databases Work:

Step 5: Retrieval (Finding Relevant Chunks When You Ask Questions)

Sub-Step 5a: Your Question Gets Embedded

Sub-Step 5b: Vector Similarity Search

Sub-Step 5c: Ranking & Selection

ChatGPT's Retrieval Process:openai+1

Claude's Retrieval Process:anthropic+2

Perplexity's Retrieval Process:github+1

Sub-Step 5d: Answer Generation

Platform-by-Platform Comparison of The CAZ Prospectus

Common Questions About Chunking & Embedding

The Critical Takeaway for Due Diligence

Conclusion: Under the Hood of AI Document Processing

Recent Posts

Comments

Subscribe to our newsletter