What Happens When You Upload a 93-Page Fund Prospectus to an AI Chatbot?
- Pure Math Editorial
Our Real-World Example: A 93-Page Fund Prospectus
We uploaded the CAZ Strategic Opportunities Fund Prospectus (93 pages, approximately 408,591 characters) to ChatGPT, Claude, and Perplexity. Like we mentioned in a previous post, the AI doesn't "read" your document the way you do. Instead, it goes through a sophisticated process to break it down, understand it mathematically, and prepare it for answering questions.
Let's break down exactly what happens, step-by-step.
The Big Picture: From Document to Answers (The 5-Step Process)
Think of this like preparing a massive cookbook for quick recipe lookup:
Upload & Text Extraction: The AI "scans" your PDF and pulls out all the text
Chunking: Breaks the document into digestible "recipe cards"
Embedding: Converts each chunk into a unique mathematical "fingerprint"
Storage: Files these fingerprints in a special mathematical library (vector database)
Retrieval: When you ask a question, it finds the most relevant chunks and sends them, along with your prompt, to the model to generate an answer
Step 1: Upload & Text Extraction
What you see: You click "upload," select your PDF, and it uploads in seconds.
What's actually happening:
The AI system receives your 93-page PDF (about 1.1 MB) and immediately starts extracting text:
For text-based PDFs (like this prospectus): Extracts the embedded text layer directly—fast and accurate
For scanned/image PDFs: Uses OCR (Optical Character Recognition) to "read" the text from the image—slower and potentially less accurate
The prospectus specifics:
Document size: ~408,591 characters
Estimated word count: ~60,000-70,000 words
Estimated token count: ~100,000-110,000 tokens
What's a token? Think of tokens as "word pieces." The AI doesn't read full words—it breaks language into smaller units. For example:
"investment" = 1 token
"uninvestable" = 2 tokens ("un" + "investable")
"CAZ" = 1 token
Why tokens matter: AI models have strict limits on how much text they can process at once (their "context window"). A prospectus like this is too large to read all at once, which brings us to...
Step 2: Chunking (Breaking It Into Digestible Pieces)
The analogy: Imagine you're creating a reference library from a 93-page manual. You can't hand someone the entire manual every time they have a question. Instead, you cut it into separate index cards, each covering a specific topic.
What's happening technically: The AI splits your 93-page document into smaller "chunks"—overlapping segments of text that preserve context while fitting within processing limits.
How Each Platform Chunks The Prospectus:
ChatGPT
Chunking parameters:
Chunk size: 800 tokens (~600 words, ~1.5-2 pages)
Overlap: 400 tokens (50% overlap between consecutive chunks)
Method: Recursive character splitting (tries to break on paragraphs, then sentences, then words)
For the 93-page prospectus (~110,000 tokens):
Number of chunks created: ~275-300 chunks
Why so many: Each chunk is only 800 tokens, and chunks overlap by 50%, so you need many to cover the full document
Chunk Overlap:

Why overlap matters: Imagine a sentence that says "The management fee is 1.25% annually." If that sentence gets split between two chunks, neither chunk fully captures the fee structure. Overlap ensures critical information isn't lost at boundaries.
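As a rough illustration, here is what an 800-token chunker with a 400-token (50%) overlap looks like in code. This is a simplified sketch: a production splitter like the recursive method described above also tries to break on paragraph and sentence boundaries rather than at raw token positions.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, chunk_size: int = 800, overlap: int = 400) -> list[str]:
    """Split text into overlapping windows of tokens, then decode each back to text."""
    tokens = enc.encode(text)
    stride = chunk_size - overlap          # 400-token step -> 50% overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# For a ~110,000-token prospectus: about 110,000 / 400 ≈ 275 chunks,
# consistent with the ~275-300 estimate above.
```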
Claude
Chunking parameters:
Chunk size: ~800 tokens (similar to ChatGPT)
Overlap: Variable (dynamically adjusted)
Method: Contextual Retrieval—generates explanatory context for each chunk before embedding
For the 93-page prospectus:
Number of chunks created: ~275-300 chunks
Unique feature: Before storing each chunk, Claude reads the full document and prepends a brief contextual summary to it
Example transformation:
Original chunk:
"The minimum initial investment for Class A Shares, Class D Shares and Class R Shares of the Fund is $25,000..."
Claude's contextualized version:
"[This chunk describes investor minimums from the CAZ Strategic Opportunities Fund prospectus, Section: Securities Offered] The minimum initial investment for Class A Shares, Class D Shares and Class R Shares of the Fund is $25,000..."
This context helps Claude avoid retrieving irrelevant chunks later.
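Anthropic has described this "contextual retrieval" pattern publicly. A minimal sketch of the idea using the Anthropic Python SDK follows; the prompt wording and model choice are illustrative assumptions, not Claude's internal pipeline:

```python
# Sketch of the contextual-retrieval idea: ask a model to write a one-sentence
# situating context for each chunk, then prepend it before embedding.
# Prompt wording and model name are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(full_document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # any small, inexpensive model would do
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": (
                f"<document>{full_document}</document>\n\n"
                f"Here is a chunk from that document:\n<chunk>{chunk}</chunk>\n\n"
                "Write one short sentence situating this chunk within the overall "
                "document to improve search retrieval. Answer with only that sentence."
            ),
        }],
    )
    context = response.content[0].text.strip()
    return f"[{context}] {chunk}"
```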
Perplexity
Chunking parameters:
Chunk size: 512 tokens (~380 words, ~1 page)
Overlap: 125 tokens (~24% overlap)
Method: Sliding window with metadata enrichment
For the 93-page prospectus:
Number of chunks created: ~430-470 chunks
Why more chunks: Smaller chunk size (512 vs. 800 tokens) means more chunks needed
Perplexity's twist: Adds metadata to each chunk (page numbers, section headers, document type) to improve ranking during retrieval.
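Perplexity's exact schema isn't public, but a metadata-enriched chunk record might look something like this (field names are illustrative assumptions):

```python
# Illustrative only: field names are assumptions, not Perplexity's actual schema.
chunk_record = {
    "text": "The minimum initial investment for Class A Shares ... is $25,000",
    "metadata": {
        "source": "CAZ Strategic Opportunities Fund Prospectus",
        "page": 3,                          # hypothetical page number
        "section": "Securities Offered",
        "doc_type": "fund_prospectus",
    },
}
# At query time, metadata such as section headers and page numbers can be used
# to filter candidate chunks or boost their ranking.
```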
Step 3: Embedding (Converting Text to Mathematical Fingerprints)
The analogy: Imagine converting every recipe card into a unique barcode that captures its "essence"—not just the words, but the meaning. Similar recipes get similar barcodes.
What's happening technically:
Each text chunk is transformed into an embedding—an array of numbers (a "vector") that represents the chunk's semantic meaning in mathematical space.
Understanding Embeddings
What they look like:
Your chunk: "The Fund invests primarily in private equity funds..."
Becomes: [0.23, -0.15, 0.87, 0.42, ..., -0.61] ← For ChatGPT, this is a 256-number array (more on dimensions below)

Why this matters:
The numbers capture meaning, not just words. Consider these two phrases:
"The fund focuses on alternative investments"
"The portfolio concentrates on non-traditional assets"
Even though the words are completely different, their embeddings would be very similar because they mean the same thing. The AI places them close together in "mathematical space."

Words with similar meanings cluster together. Distance = similarity.
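In practice this is a single API call. Here's a minimal sketch using OpenAI's embeddings endpoint; the `dimensions` parameter shortens the model's native 3,072-number vector, matching the 256-dimension setup described under ChatGPT below:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

phrases = [
    "The fund focuses on alternative investments",
    "The portfolio concentrates on non-traditional assets",
]

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=phrases,
    dimensions=256,                      # shorten the native 3,072-dimension vector
)

vec_a = response.data[0].embedding       # a list of 256 floats
vec_b = response.data[1].embedding
print(len(vec_a), vec_a[:5])             # similar phrases produce nearby vectors
```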
Embedding Models Used by Each Platform:
ChatGPT
Model: text-embedding-3-large
Dimensions: 256 (downsized from the native 3,072 for efficiency)
What this means: Each chunk becomes a list of 256 numbers
For the prospectus:
275-300 chunks × 256 numbers each = ~70,400-76,800 total numbers stored
Claude
Model: Proprietary Anthropic embedding model
Dimensions: Not publicly disclosed (estimated 768-1,536)
Special feature: Embeddings generated from contextualized chunks (with added context)
For the prospectus:
275-300 contextualized chunks embedded into vector space
Perplexity
Model: Proprietary (likely similar to OpenAI or open-source alternatives)
Dimensions: Estimated 768-1,536
Enhancement: Metadata-enriched embeddings
For the prospectus:
430-470 chunks (smaller chunks = more embeddings)
Step 4: Storage (Organizing the Mathematical Library)
The analogy: All those barcoded recipe cards are now filed in a special library where similar recipes automatically sit near each other—even if they use different ingredients or cooking methods.
What's happening technically:
The embeddings are stored in a vector database—a specialized system optimized for finding "nearby" vectors quickly.

How Vector Databases Work:
Unlike regular databases that search for exact matches ("Find rows where Fund_Name = 'CAZ'"), vector databases search for semantic similarity ("Find chunks similar in meaning to my question").
Search methods:
Cosine similarity: Measures the angle between vectors (most common)
Close angle = similar meaning
Used by ChatGPT, Claude, Perplexity
Euclidean distance: Measures straight-line distance in vector space
Short distance = similar meaning
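Both measures take only a few lines of NumPy. The toy vectors below stand in for real 256-dimension embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

# Toy 4-dimensional vectors standing in for real 256-dimensional embeddings:
question   = [0.31, -0.22, 0.74, 0.10]
fee_chunk  = [0.29, -0.24, 0.71, 0.12]   # similar meaning -> high cosine similarity
risk_chunk = [0.02, -0.61, 0.15, -0.40]  # different topic -> lower similarity

print(cosine_similarity(question, fee_chunk))    # close to 1.0
print(cosine_similarity(question, risk_chunk))   # noticeably lower
print(euclidean_distance(question, fee_chunk))   # small distance
```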
Indexing for speed:
With 275-470 chunks, the AI could compare your question to every single chunk—but that's slow. Instead, it builds an index that pre-organizes vectors for fast lookup:
HNSW (Hierarchical Navigable Small World): Creates a multi-layer graph connecting similar vectors—like a highway system with exits leading to local roads
IVF (Inverted File Index): Clusters similar vectors into groups, then searches within the relevant group—like organizing books by genre before searching
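The consumer platforms don't disclose which index they run, but the open-source FAISS library makes the idea concrete; a minimal HNSW sketch with stand-in vectors:

```python
import numpy as np
import faiss  # open-source vector-index library; illustrative, not necessarily what these platforms use

dim = 256
chunk_vectors = np.random.rand(300, dim).astype("float32")   # stand-in for ~300 chunk embeddings

index = faiss.IndexHNSWFlat(dim, 32)     # HNSW graph with 32 links per node
index.add(chunk_vectors)

query_vector = np.random.rand(1, dim).astype("float32")      # stand-in for the question embedding
distances, chunk_ids = index.search(query_vector, 10)        # the 10 nearest chunks
print(chunk_ids[0])
```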
For the prospectus:
All 275-470 chunk embeddings stored in vector database
Indexed using HNSW or IVF for sub-second retrieval
Takes ~1-3 seconds for initial processing during upload
Step 5: Retrieval (Finding Relevant Chunks When You Ask Questions)
Now the magic happens. You ask: "What are the management fees for Class A shares?"
What's happening behind the scenes:
Sub-Step 5a: Your Question Gets Embedded
Your question goes through the same embedding process as the document chunks:
"What are the management fees for Class A shares?"
→ [0.31, -0.22, 0.74, ..., -0.43] (256 numbers for ChatGPT)
Sub-Step 5b: Vector Similarity Search
The system compares your question's embedding to all stored chunk embeddings and finds the closest matches:
Your question vector: [0.31, -0.22, 0.74, ...]
Compared against every stored chunk:
Chunk 47 (fee table): [0.29, -0.24, 0.71, ...] ← Very similar (0.94 similarity)
Chunk 48 (Class A details): [0.28, -0.21, 0.73, ...] ← Very similar (0.91 similarity)
Chunk 103 (risk factors): [0.02, -0.61, 0.15, ...] ← Not similar (0.32 similarity)
Sub-Step 5c: Ranking & Selection
The system ranks all chunks by similarity and selects the top matches.
ChatGPT's Retrieval Process:
Hybrid search approach:
Keyword search (BM25): Finds chunks containing exact words like "management," "fee," and "Class A"
Semantic search (embeddings): Finds chunks with similar meaning even if words differ
Combines both: Merges results from keyword + semantic searches
Reranks: Secondary model evaluates which chunks are most relevant
Retrieves: Typically 10-20 chunks selected for answering
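A minimal sketch of that hybrid idea, using the open-source rank_bm25 package for keyword scoring; the score normalization, the blending weight, and the absence of a real reranking model are simplifying assumptions, not OpenAI's actual pipeline:

```python
import numpy as np
from rank_bm25 import BM25Okapi   # open-source BM25 implementation

def hybrid_search(chunks, chunk_vectors, query, query_vector, top_k=15, alpha=0.5):
    """chunks: list of chunk texts; chunk_vectors / query_vector: embeddings from Step 3."""
    # 1. Keyword score: BM25 over whitespace-tokenized text
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    keyword_scores = np.array(bm25.get_scores(query.lower().split()))

    # 2. Semantic score: cosine similarity of the query against every chunk embedding
    cv, qv = np.asarray(chunk_vectors), np.asarray(query_vector)
    semantic_scores = cv @ qv / (np.linalg.norm(cv, axis=1) * np.linalg.norm(qv))

    # 3. Normalize both score sets to [0, 1] and blend them
    def normalize(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    combined = alpha * normalize(keyword_scores) + (1 - alpha) * normalize(semantic_scores)

    # 4. Return indices of the best chunks (a production system would rerank these)
    return np.argsort(combined)[::-1][:top_k]
```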
For the fee question:
Chunks retrieved: 12-15 chunks (~9,600-12,000 tokens, roughly 8-10 pages' worth of content)
Content: Fee table, Class A description, expense footnotes, examples section
Time: ~300-800 milliseconds for retrieval
Claude's Retrieval Process:
Context window advantage:
Claude 3.5 Sonnet has a 200,000-token context window
Your entire 93-page prospectus (~110,000 tokens) could fit with room to spare
Two approaches:
Full document mode: If the prospectus fits within the context window, Claude loads the entire document into context and uses attention mechanisms to focus on relevant sections
Retrieval mode: For larger documents, uses semantic search similar to ChatGPT but with contextualized chunks
For the fee question:
Potential approach: Load full prospectus into context, use retrieval-style attention to identify relevant sections
Chunks retrieved (if using retrieval): 10-15 contextualized chunks
Perplexity's Retrieval Process:
Hybrid document-web approach:
Searches uploaded document chunks
Simultaneously searches real-time web for related information
Combines internal + external sources
For the fee question:
Chunks retrieved: 15-20 chunks (smaller chunks = more needed for context)
External search: May also find SEC filings, fund databases, or general fee structure info
Time: ~500-1,200 milliseconds (web search adds latency)
Sub-Step 5d: Answer Generation
Finally, the AI feeds the retrieved chunks + your question into the language model to generate an answer:
The prompt that actually goes to the AI (simplified):
Context (retrieved chunks):
[Chunk 47]: "ANNUAL FUND EXPENSES - Management Fee: 1.25% for all classes..."
[Chunk 48]: "Class A Shares are subject to a sales charge of up to 3.00%..."
[Chunk 12]: "Total Annual Fund Operating Expenses: Class A: 3.77%..."

User Question: What are the management fees for Class A shares?

Generate answer:
The AI reads the context and generates:
"The management fee for Class A shares is 1.25% annually of the Fund's average net assets. Note that this is just the management fee component; total annual fund operating expenses for Class A shares are 3.77% before fee waivers."
Platform-by-Platform Comparison of The CAZ Prospectus
Here's a summary of what actually happens when you upload your 93-page, ~110,000-token CAZ prospectus to each platform:
| Aspect | ChatGPT Pro | Claude Pro | Perplexity Pro |
| --- | --- | --- | --- |
| Upload processing time | 2-5 seconds | 3-6 seconds | 3-7 seconds |
| Chunking method | Recursive splitting | Contextual retrieval | Sliding window + metadata |
| Chunk size | 800 tokens | ~800 tokens | 512 tokens |
| Total chunks created | ~275-300 | ~275-300 (contextualized) | ~430-470 |
| Embedding dimensions | 256 | ~768-1,536 (estimated) | ~768-1,536 (estimated) |
| Storage approach | Vector database | Session memory (unless Files API) | Ephemeral vector store |
| Retrieval chunks per query | 10-20 chunks (~12K tokens) | 10-15 chunks OR full document in context | 15-20 chunks |
| Retrieval method | Keyword + semantic hybrid | Semantic + contextual | Document + web hybrid |
| Answer generation time | 2-4 seconds | 3-6 seconds | 4-8 seconds |
| Total query time | ~3-5 seconds | ~4-7 seconds | ~5-9 seconds |
Common Questions About Chunking & Embedding
Q: Why not just load the whole document every time?
A: Token limits and cost.
GPT-4's context window: 128,000 tokens
Your prospectus: ~110,000 tokens
Would consume 85% of context window for just the document, leaving little room for conversation history, instructions, or output
Additionally, processing 110,000 tokens costs significantly more than processing 12,000 tokens (10-20 retrieved chunks). Retrieval is ~9x more cost-efficient.
Q: How accurate is retrieval? Could it miss important information?
A: Not perfect—this is the critical limitation.
Measured performance for financial documents:
Recall@10 (finds relevant info in top 10 chunks): 81-97% depending on optimization
Miss rate: 3-19% of queries retrieve incorrect or incomplete chunks
What this means for your prospectus:
Simple questions ("What's the fund size?"): High accuracy (~95%+)
Complex questions ("Explain all redemption restrictions"): Lower accuracy (~75-85%), higher risk of missed details
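Recall@k is straightforward to compute if you have an evaluation set of questions labeled with the chunks that actually contain their answers; one common definition in a short sketch:

```python
def recall_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int = 10) -> float:
    """Fraction of the truly relevant chunks that appear among the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical example: the fee answer lives in chunks 47 and 48,
# and retrieval returned these chunk IDs in ranked order.
print(recall_at_k([47, 48, 12, 103, 9], {47, 48}, k=10))   # -> 1.0
```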
Q: What happens if the answer requires information from multiple sections?
A: This is where chunking creates problems.
Example: "What's my net cost for Class A shares accounting for fees and early redemption penalties?"
This requires:
Management fee (1.25%) - Section 1
12b-1 fee (0.60%) - Section 1
Sales charge (3.00%) - Section 2
Early redemption fee (2.00%) - Section 5
Total expense ratio - Section 1
If the retrieval system misses even one relevant chunk, the answer will be incomplete. This is why financial due diligence cannot rely solely on AI outputs—cross-document reasoning across multiple sections is where errors accumulate.
Q: Does the AI remember what chunks it retrieved in previous questions?
A: Partially, through conversation history.
ChatGPT/Claude/Perplexity: Maintain conversation history (previous Q&As), but each new question triggers fresh retrieval
No memory of chunks: The AI doesn't remember "I used Chunk 47 last time"—it re-searches every time
Implication: You might ask two related questions and get answers from completely different chunks, potentially creating inconsistencies
Q: Can I improve accuracy by asking better questions?
A: Yes, significantly.
Poor question: "Tell me about fees"
Too vague → retrieves scattered chunks about various fee types
Better question: "What is the management fee percentage for Class A shares as stated in the Annual Fund Expenses table?"
Specific → retrieves exact relevant chunks
Best question: "According to the Summary of Fees and Expenses section, what is the management fee for Class A shares, and does it differ from other share classes?"
Very specific + section reference → highest retrieval precision
The Critical Takeaway for Due Diligence
Understanding chunking and embedding reveals why AI chatbots aren't reliable for business-critical document analysis:
Chunking breaks context: Your prospectus is artificially divided into 275-470 pieces. Information spanning multiple chunks may not be fully captured.
Retrieval isn't guaranteed: Even with 97% recall, 3% of queries miss relevant information. For a fund prospectus with hundreds of critical details, that's dozens of potential misses.
Semantic search has blind spots: If you ask about "redemption terms" but the document uses "repurchase provisions," the embedding may not recognize them as equivalent despite overlapping meaning.
No verification of completeness: The AI doesn't know what it doesn't know. It answers based on retrieved chunks without checking if other relevant information exists elsewhere in the document.
Cross-chunk reasoning is weak: Questions requiring synthesis across multiple sections (fees + restrictions + timing) are where accuracy degrades most.
For consumer-grade chatbots: These are research assistants, not analysts. Use them for initial exploration, then verify every output manually against the source document.
Conclusion: Under the Hood of AI Document Processing
When you upload your 93-page fund prospectus to ChatGPT, Claude, or Perplexity, a sophisticated pipeline transforms your document into mathematical representations optimized for semantic search. The process—extraction, chunking, embedding, storage, and retrieval—enables AI to "understand" and answer questions about documents far too large to process all at once.
But this very process introduces limitations: chunking fragments context, embeddings approximate meaning imperfectly, and retrieval can miss critical details. For educational purposes or initial research, this technology is transformative. For investment due diligence worth millions? The 3-19% miss rate and cross-document reasoning weaknesses mean human verification remains non-negotiable.
Understanding what happens "under the hood" isn't just technically interesting—it's essential for knowing when to trust AI outputs and when to dig deeper yourself.
Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries. As with any AI-based project, human oversight is employed throughout the content creation process.
