top of page

What Happens When You Upload a 93 Page Fund Prospectus to an AI Chatbot?

  • Writer: Pure Math Editorial
    Pure Math Editorial
  • 4 hours ago
  • 10 min read

Our Real-World Example: A 93-Page Fund Prospectus

We uploaded the CAZ Strategic Opportunities Fund Prospectus (93 pages, approximately 408,591 characters) to ChatGPT, Claude, and Perplexity. Like we mentioned in a previous post, the AI doesn't "read" your document the way you do. Instead, it goes through a sophisticated process to break it down, understand it mathematically, and prepare it for answering questions.


Let's break down exactly what happens, step-by-step.

The Big Picture: From Document to Answers (The 5-Step Process)


Think of this like preparing a massive cookbook for quick recipe lookup:

  1. Upload & Text Extraction: The AI "scans" your PDF and pulls out all the text

  2. Chunking: Breaks the document into digestible "recipe cards"

  3. Embedding: Converts each chunk into a unique mathematical "fingerprint"

  4. Storage: Files these fingerprints in a special mathematical library (vector database)

  5. Retrieval: When you ask a question, it finds the most relevant chunks and sends them along with the users prompt to generate an answer


Step 1: Upload & Text Extraction


What you see: You click "upload," select your PDF, and it uploads in seconds.


What's actually happening:

The AI system receives your 93-page PDF (about 1.1 MB) and immediately starts extracting text:

  • For text-based PDFs (like this prospectus): Extracts the embedded text layer directly—fast and accurate

  • For scanned/image PDFs: Uses OCR (Optical Character Recognition) to "read" the text from the image—slower and potentially less accurate


The prospectus specifics:

  • Document size: ~408,591 characters

  • Estimated word count: ~60,000-70,000 words

  • Estimated token count: ~100,000-110,000 tokens


What's a token? Think of tokens as "word pieces." The AI doesn't read full words—it breaks language into smaller units. For example:

  • "investment" = 1 token

  • "uninvestable" = 2 tokens ("un" + "investable")

  • "CAZ" = 1 token


Why tokens matter: AI models have strict limits on how much text they can process at once (their "context window"). A prospectus like this is too large for it to read it all at once, which brings us to...

Step 2: Chunking (Breaking It Into Digestible Pieces)


The analogy: Imagine you're creating a reference library from a 93-page manual. You can't hand someone the entire manual every time they have a question. Instead, you cut it into separate index cards, each covering a specific topic.


What's happening technically: The AI splits your 93-page document into smaller "chunks"—overlapping segments of text that preserve context while fitting within processing limits.

How Each Platform Chunks The Prospectus:


ChatGPT


Chunking parameters:pinecone+2

  • Chunk size: 800 tokens (~600 words, ~1.5-2 pages)

  • Overlap: 400 tokens (50% overlap between consecutive chunks)

  • Method: Recursive character splitting (tries to break on paragraphs, then sentences, then words)


For the 93-page prospectus (~110,000 tokens):

  • Number of chunks created: ~275-300 chunks

  • Why so many: Each chunk is only 800 tokens, and chunks overlap by 50%, so you need many to cover the full document


Chunk Overlap:


Chunking

Why overlap matters: Imagine a sentence that says "The management fee is 1.25% annually." If that sentence gets split between two chunks, neither chunk fully captures the fee structure. Overlap ensures critical information isn't lost at boundaries.


Claude


Chunking parameters:developer.nvidia+2

  • Chunk size: ~800 tokens (similar to ChatGPT)

  • Overlap: Variable (dynamically adjusted)

  • Method: Contextual Retrieval—generates explanatory context for each chunk before embedding


For the 93-page prospectus:

  • Number of chunks created: ~275-300 chunks

  • Unique feature: Before storing each chunk, Claude reads the full document and adds a brief contextual summary to each chunk


Example transformation:

Original chunk:

"The minimum initial investment for Class A Shares, Class D Shares and Class R Shares of the Fund is $25,000..."

Claude's contextualized version:

"[This chunk describes investor minimums from the CAZ Strategic Opportunities Fund prospectus, Section: Securities Offered] The minimum initial investment for Class A Shares, Class D Shares and Class R Shares of the Fund is $25,000..."

This context helps Claude avoid retrieving irrelevant chunks later.


Perplexity


Chunking parameters:github+1

  • Chunk size: 512 tokens (~380 words, ~1 page)

  • Overlap: 125 tokens (~24% overlap)

  • Method: Sliding window with metadata enrichment


For the 93-page prospectus:

  • Number of chunks created: ~430-470 chunks

  • Why more chunks: Smaller chunk size (512 vs. 800 tokens) means more chunks needed


Perplexity's twist: Adds metadata to each chunk (page numbers, section headers, document type) to improve ranking during retrieval.


Step 3: Embedding (Converting Text to Mathematical Fingerprints)


The analogy: Imagine converting every recipe card into a unique barcode that captures its "essence"—not just the words, but the meaning. Similar recipes get similar barcodes.


What's happening technically:

Each text chunk is transformed into an embedding—an array of numbers (a "vector") that represents the chunk's semantic meaning in mathematical space.


Understanding Embeddings

What they look like:

Your chunk: "The Fund invests primarily in private equity funds..."

Becomes: [0.23, -0.15, 0.87, 0.42, ..., -0.61] ← This is a 1,536-number array for ChatGPT


Embeddings

Why this matters:

The numbers capture meaning, not just words. Consider these two phrases:

  • "The fund focuses on alternative investments"

  • "The portfolio concentrates on non-traditional assets"


Even though the words are completely different, their embeddings would be very similar because they mean the same thing. The AI places them close together in "mathematical space."



Similarity

Words with similar meanings cluster together. Distance = similarity.



Embedding Models Used by Each Platform:


ChatGPT

  • Model: text-embedding-3-largeplatform.openai+2

  • Dimensions: 256 dimensions (downsized from 3,072 for efficiency)

  • What this means: Each chunk becomes a list of 256 numbers


For the prospectus:

  • 275-300 chunks × 256 numbers each = ~70,400-76,800 total numbers stored


Claude

  • Model: Proprietary Anthropic embedding model

  • Dimensions: Not publicly disclosed (estimated 768-1,536)

  • Special feature: Embeddings generated from contextualized chunks (with added context)


For the prospectus:

  • 275-300 contextualized chunks embedded into vector space


Perplexity

  • Model: Proprietary (likely similar to OpenAI or open-source alternatives)

  • Dimensions: Estimated 768-1,536

  • Enhancement: Metadata-enriched embeddings


For the prospectus:

  • 430-470 chunks (smaller chunks = more embeddings)



Step 4: Storage (Organizing the Mathematical Library)


The analogy: All those barcoded recipe cards are now filed in a special library where similar recipes automatically sit near each other—even if they use different ingredients or cooking methods.


What's happening technically:

The embeddings are stored in a vector database—a specialized system optimized for finding "nearby" vectors quickly.



Vector

How Vector Databases Work:

Unlike regular databases that search for exact matches ("Find rows where Fund_Name = 'CAZ'"), vector databases search for semantic similarity ("Find chunks similar in meaning to my question").


Search methods:platform.openai+2

  1. Cosine similarity: Measures the angle between vectors (most common)

    • Close angle = similar meaning

    • Used by ChatGPT, Claude, Perplexity

  2. Euclidean distance: Measures straight-line distance in vector space

    • Short distance = similar meaning


Indexing for speed:

With 275-470 chunks, the AI could compare your question to every single chunk—but that's slow. Instead, it builds an index that pre-organizes vectors for fast lookup:


  • HNSW (Hierarchical Navigable Small World): Creates a multi-layer graph connecting similar vectors—like a highway system with exits leading to local roads

  • IVF (Inverted File Index): Clusters similar vectors into groups, then searches within the relevant group—like organizing books by genre before searching


For the prospectus:

  • All 275-470 chunk embeddings stored in vector database

  • Indexed using HNSW or IVF for sub-second retrieval

  • Takes ~1-3 seconds for initial processing during upload


Step 5: Retrieval (Finding Relevant Chunks When You Ask Questions)


Now the magic happens. You ask: "What are the management fees for Class A shares?"

What's happening behind the scenes:


Sub-Step 5a: Your Question Gets Embedded

Your question goes through the same embedding process as the document chunks:


"What are the management fees for Class A shares?"

→ [0.31, -0.22, 0.74, ..., -0.43] (256 numbers for ChatGPT)


Sub-Step 5b: Vector Similarity Search

The system compares your question's embedding to all stored chunk embeddings and finds the closest matches:


textYour question vector: [0.31, -0.22, 0.74, ...] ↓ Compare Chunk 47 (fee table): [0.29, -0.24, 0.71, ...] ← Very similar! (0.94 similarity) Chunk 48 (Class A details): [0.28, -0.21, 0.73, ...] ← Very similar! (0.91 similarity) Chunk 103 (risk factors): [0.02, -0.61, 0.15, ...] ← Not similar (0.32 similarity)


Sub-Step 5c: Ranking & Selection

The system ranks all chunks by similarity and selects the top matches.


ChatGPT's Retrieval Process:openai+1

Hybrid search approach:

  1. Keyword search (BM25): Finds chunks containing exact words like "management" "fee" "Class A"

  2. Semantic search (embeddings): Finds chunks with similar meaning even if words differ

  3. Combines both: Merges results from keyword + semantic searches

  4. Reranks: Secondary model evaluates which chunks are most relevant

  5. Retrieves: Typically 10-20 chunks selected for answering


For the fee question:

  • Chunks retrieved: 12-15 chunks (~9,600-12,000 tokens, ~15-20 pages worth of content)

  • Content: Fee table, Class A description, expense footnotes, examples section

  • Time: ~300-800 milliseconds for retrieval


Claude's Retrieval Process:anthropic+2

Context window advantage:

  • Claude 3.5 Sonnet has a 1 million token context window

  • Your entire 93-page prospectus (~110,000 tokens) could fit with room to spare


Two approaches:

  1. Full document mode: If prospectus < 1M tokens, Claude loads the entire document into context and uses attention mechanisms to focus on relevant sections

  2. Retrieval mode: For larger documents, uses semantic search similar to ChatGPT but with contextualized chunks


For the fee question:

  • Potential approach: Load full prospectus into context, use retrieval-style attention to identify relevant sections

  • Chunks retrieved (if using retrieval): 10-15 contextualized chunks

  • Time: ~400-1,000 milliseconds (slower due to larger context)


Perplexity's Retrieval Process:github+1

Hybrid document-web approach:

  • Searches uploaded document chunks

  • Simultaneously searches real-time web for related information

  • Combines internal + external sources


For the fee question:

  • Chunks retrieved: 15-20 chunks (smaller chunks = more needed for context)

  • External search: May also find SEC filings, fund databases, or general fee structure info

  • Time: ~500-1,200 milliseconds (web search adds latency)


Sub-Step 5d: Answer Generation

Finally, the AI feeds the retrieved chunks + your question into the language model to generate an answer:


The prompt that actually goes to the AI (simplified):
`textContext (retrieved chunks): [Chunk 47]: "ANNUAL FUND EXPENSES - Management Fee: 1.25% for all classes..." [Chunk 48]: "Class A Shares are subject to a sales charge of up to 3.00%..." [Chunk 12]: "Total Annual Fund Operating Expenses: Class A: 3.77%..."
User Question: What are the management fees for Class A shares?
Generate answer:`

The AI reads the context and generates:

"The management fee for Class A shares is 1.25% annually of the Fund's average net assets. Note that this is just the management fee component; total annual fund operating expenses for Class A shares are 3.77% before fee waivers."

Platform-by-Platform Comparison of The CAZ Prospectus

Here's a summary of what actually happens when you upload your 93-page, ~110,000-token CAZ prospectus to each platform:

Aspect

ChatGPT Pro

Claude Pro

Perplexity Pro

Upload processing time

2-5 seconds

3-6 seconds

3-7 seconds

Chunking method

Recursive splitting

Contextual retrieval

Sliding window + metadata

Chunk size

800 tokens

~800 tokens

512 tokens

Total chunks created

~275-300

~275-300 (contextualized)

~430-470

Embedding dimensions

256

~768-1,536 (estimated)

~768-1,536 (estimated)

Storage approach

Vector database

Session memory (unless Files API)

Ephemeral vector store

Retrieval chunks per query

10-20 chunks (~12K tokens)

10-15 chunks OR full doc (1M window)

15-20 chunks

Retrieval method

Keyword + semantic hybrid

Semantic + contextual

Document + web hybrid

Answer generation time

2-4 seconds

3-6 seconds

4-8 seconds

Total query time

~3-5 seconds

~4-7 seconds

~5-9 seconds


Common Questions About Chunking & Embedding


Q: Why not just load the whole document every time?

A: Token limits and cost.

  • GPT-4's context window: 128,000 tokens

  • Your prospectus: ~110,000 tokens

  • Would consume 85% of context window for just the document, leaving little room for conversation history, instructions, or output


Additionally, processing 110,000 tokens costs significantly more than processing 12,000 tokens (10-20 retrieved chunks). Retrieval is ~9x more cost-efficient.


Q: How accurate is retrieval? Could it miss important information?

A: Not perfect—this is the critical limitation.


Measured performance for financial documents:developers.openai+2

  • Recall@10 (finds relevant info in top 10 chunks): 81-97% depending on optimization

  • Miss rate: 3-19% of queries retrieve incorrect or incomplete chunks


What this means for your prospectus:

  • Simple questions ("What's the fund size?"): High accuracy (~95%+)

  • Complex questions ("Explain all redemption restrictions"): Lower accuracy (~75-85%), higher risk of missed details


Q: What happens if the answer requires information from multiple sections?

A: This is where chunking creates problems.


Example: "What's my net cost for Class A shares accounting for fees and early redemption penalties?"

This requires:

  • Management fee (1.25%) - Section 1

  • 12b-1 fee (0.60%) - Section 1

  • Sales charge (3.00%) - Section 2

  • Early redemption fee (2.00%) - Section 5

  • Total expense ratio - Section 1


If the retrieval system misses even one relevant chunk, the answer will be incomplete. This is why financial due diligence cannot rely solely on AI outputs—cross-document reasoning across multiple sections is where errors accumulate.


Q: Does the AI remember what chunks it retrieved in previous questions?

A: Partially, through conversation history.

  • ChatGPT/Claude/Perplexity: Maintain conversation history (previous Q&As), but each new question triggers fresh retrieval

  • No memory of chunks: The AI doesn't remember "I used Chunk 47 last time"—it re-searches every time

  • Implication: You might ask two related questions and get answers from completely different chunks, potentially creating inconsistencies


Q: Can I improve accuracy by asking better questions?

A: Yes, significantly.


Poor question: "Tell me about fees"

  • Too vague → retrieves scattered chunks about various fee types


Better question: "What is the management fee percentage for Class A shares as stated in the Annual Fund Expenses table?"

  • Specific → retrieves exact relevant chunks


Best question: "According to the Summary of Fees and Expenses section, what is the management fee for Class A shares, and does it differ from other share classes?"

  • Very specific + section reference → highest retrieval precision


The Critical Takeaway for Due Diligence

Understanding chunking and embedding reveals why AI chatbots aren't reliable for business-critical document analysis:


  1. Chunking breaks context: Your prospectus is artificially divided into 275-470 pieces. Information spanning multiple chunks may not be fully captured.

  2. Retrieval isn't guaranteed: Even with 97% recall, 3% of queries miss relevant information. For a fund prospectus with hundreds of critical details, that's dozens of potential misses.

  3. Semantic search has blind spots: If you ask about "redemption terms" but the document uses "repurchase provisions," the embedding may not recognize them as equivalent despite overlapping meaning.

  4. No verification of completeness: The AI doesn't know what it doesn't know. It answers based on retrieved chunks without checking if other relevant information exists elsewhere in the document.

  5. Cross-chunk reasoning is weak: Questions requiring synthesis across multiple sections (fees + restrictions + timing) are where accuracy degrades most.


For consumer-grade chatbots: These are research assistants, not analysts. Use them for initial exploration, then verify every output manually against the source document.

Conclusion: Under the Hood of AI Document Processing

When you upload your 93-page fund prospectus to ChatGPT, Claude, or Perplexity, a sophisticated pipeline transforms your document into mathematical representations optimized for semantic search. The process—extraction, chunking, embedding, storage, and retrieval—enables AI to "understand" and answer questions about documents far too large to process all at once.


But this very process introduces limitations: chunking fragments context, embeddings approximate meaning imperfectly, and retrieval can miss critical details. For educational purposes or initial research, this technology is transformative. For investment due diligence worth millions? The 3-19% miss rate and cross-document reasoning weaknesses mean human verification remains non-negotiable.


Understanding what happens "under the hood" isn't just technically interesting—it's essential for knowing when to trust AI outputs and when to dig deeper yourself.


Pure Math Editorial is an all-purpose virtual writer we created to document and showcase the various ways we are leveraging generative AI within our organization and with our clients. Designed specifically for case studies, thought leadership articles, white papers, blog content, industry reports, and investor communications, it is prompted to ensure clear, compelling, and structured writing that highlights the impact of AI across different projects and industries. As with any AI-based project, human oversight is employed throughout the content creation process.


bottom of page