Some notes from thinking about improving multi-document search. The specific context: say I have Apple's 10-K filings from 2019 to 2024 (six long PDFs) and I want to find trends across them. If I simply chunk them with a deterministic program, similarity search will return chunks that are most likely very similar to each other. We don't necessarily want to discard any of them, and thanks to longer context window models we can somewhat get away with keeping them all.
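To make that retrieval behavior concrete, here is a minimal sketch of deterministic chunking plus similarity search over the parsed filings. The file names, chunk sizes, query, and embedding model are all placeholders, not what the actual system uses.

```python
# Minimal sketch: fixed-size chunking + similarity search over several filings.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Deterministic character-window chunking."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Pretend these came from the parsed 10-K PDFs (one string per year) -- placeholder paths.
filings = {year: open(f"aapl_10k_{year}.txt").read() for year in range(2019, 2025)}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
chunks, labels = [], []
for year, text in filings.items():
    for c in chunk(text):
        chunks.append(c)
        labels.append(year)

emb = model.encode(chunks, normalize_embeddings=True)  # unit vectors, so dot = cosine

query = model.encode(["iPhone revenue trend"], normalize_embeddings=True)[0]
scores = emb @ query
for i in np.argsort(-scores)[:5]:
    # Expect near-duplicate chunks from different years to cluster at the top.
    print(labels[i], round(float(scores[i]), 3), chunks[i][:80].replace("\n", " "))
```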

The problem emerges when I start thinking about the structured data inside the PDFs, such as data tables (earnings, etc.): the company's current system works pretty well at parsing them into text-friendly representations (words, lmfao), but what should we expect if we do both of the following?

  • embed the parsed text tokens from the table
  • let an LLM summarize the parsed text and embed that summary
    • what if we instead let the LLM interpret the table and write out its insights? (that would make it a three-way comparison) it seems straightforward that raw data formatted to resemble a table should be semantically different from natural-language summaries, but how different would they be? (a rough sketch of this comparison is below)
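Here is a minimal sketch of that three-way comparison, assuming an OpenAI-style client. The model names, prompts, and the placeholder table string are illustrative, not the company's actual parser output or pipeline.

```python
# Sketch: raw parsed table text vs. an LLM summary vs. LLM-written insights.
from itertools import combinations
import numpy as np
from openai import OpenAI

client = OpenAI()

def llm(prompt: str, text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": f"{prompt}\n\n{text}"}],
    )
    return resp.choices[0].message.content

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    v = np.array([d.embedding for d in resp.data])
    return v / np.linalg.norm(v, axis=1, keepdims=True)  # unit-normalize

table_text = "Net sales | FY1: 100 | FY2: 110 | FY3: 125"  # placeholder parsed table chunk
summary = llm("Summarize this financial table in plain language.", table_text)
insights = llm("Interpret this financial table and write out the key insights.", table_text)

vecs = embed([table_text, summary, insights])
names = ["raw table", "summary", "insights"]
for (i, a), (j, b) in combinations(enumerate(names), 2):
    print(f"cos({a}, {b}) = {float(vecs[i] @ vecs[j]):.3f}")
```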

This isn't too hard to figure out; the rough steps I should take are:

  • obtain embeddings for the table chunks
  • let the LLM summarize each parsed table chunk
    • also let the LLM interpret it and extract insights
  • obtain embeddings for the summaries and the insights
  • visualize the differences via the embeddings (e.g., pairwise similarities or a 2D projection) or some other technique (sketch after this list).
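And a rough sketch of the visualization step, reusing `vecs` and `names` from the previous sketch; a cosine similarity heatmap plus a PCA projection is just one reasonable choice, not necessarily the final technique.

```python
# Sketch of the visualization step: pairwise cosine similarity heatmap + 2D PCA.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def plot_comparison(vecs: np.ndarray, names: list[str]) -> None:
    sims = vecs @ vecs.T  # rows are unit vectors, so this is cosine similarity

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Left: pairwise cosine similarity heatmap.
    im = ax1.imshow(sims, vmin=0, vmax=1, cmap="viridis")
    ax1.set_xticks(range(len(names)))
    ax1.set_xticklabels(names, rotation=45)
    ax1.set_yticks(range(len(names)))
    ax1.set_yticklabels(names)
    fig.colorbar(im, ax=ax1)
    ax1.set_title("cosine similarity")

    # Right: 2D PCA projection of the embeddings.
    xy = PCA(n_components=2).fit_transform(vecs)
    ax2.scatter(xy[:, 0], xy[:, 1])
    for (x, y), name in zip(xy, names):
        ax2.annotate(name, (x, y))
    ax2.set_title("PCA projection")

    plt.tight_layout()
    plt.show()

# plot_comparison(vecs, names)
```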