IODIS is the “Is it imba or do I suck” video series hosted by professional Starcraft 2 player Harstem. In an IODIS episode, Harstem watches a fan-submitted replay, commentates on it, and judges whether a race/unit is imbalanced or the player simply sucks.

Search Problem Formation

Consider the following case where I can describe what I remember happened during one episode:

“during one of the episodes, Harstem talked about how protoss can just build two buildings using one probe without the probe needing to go into the building, whereas the other two races need to send in two workers, and then the editor slapped an IMBA stamp on and flashed the end screen for the joke.”

This particular query contains visual information (the IMBA stamp, the end screen) that only exists in the video frames and would not be retrievable from a text-only search index. However, since IODIS is a commentary series, most of the important things that happen in a game are mentioned and explained by Harstem even if we only consider the transcript rather than the video content. Natural language queries over transcripts should therefore take us pretty far, and I can expect an information retrieval pipeline to work at least to some extent in recovering the video id from the retrieval base.

In short, I used the transcript data as the search database and my recollection of what happened in an episode as the query. I implemented BM25 in numpy and a two-stage search process using faiss and rerankers hosted on Replicate (which I uploaded!). I tried three queries, and the semantic search did not work as intended.
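To make the two-stage idea concrete, here is a minimal sketch, not the exact project code: it assumes sentence-transformers with “all-MiniLM-L6-v2” as a stand-in embedding model and uses “user/some-reranker” as a placeholder Replicate slug that returns one relevance score per candidate document.

```python
# A minimal sketch of the two-stage search, not the exact project code.
# Assumptions: sentence-transformers for embeddings, a placeholder Replicate
# reranker slug that returns one score per document.
import faiss
import numpy as np
import replicate
from sentence_transformers import SentenceTransformer

chunks = [
    "Harstem points out the probe can warp in two buildings at once.",
    "The marine drop does a lot of damage in the main base.",
]  # in the real pipeline these are transcript chunks tied to video ids

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)

def search(query: str, k_retrieve: int = 50, k_final: int = 5) -> list[str]:
    # Stage 1: dense retrieval over the faiss index
    q = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
    _, ids = index.search(q, min(k_retrieve, len(chunks)))
    candidates = [chunks[i] for i in ids[0]]

    # Stage 2: rerank the candidates with a cross-encoder hosted on Replicate
    scores = replicate.run(
        "user/some-reranker",  # placeholder model slug
        input={"query": query, "documents": candidates},
    )
    order = np.argsort(scores)[::-1][:k_final]
    return [candidates[i] for i in order]
```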

Project Status

My current step is to fix terminology mistakes in the transcript, as a precursor to judging whether I should fine-tune a reranker model.

Data preparation

Harstem (or his editor Hamster) maintains an official public IODIS playlist. It is possible to programmatically obtain the audio-only files via some Python packages; I won’t name them here since they’re easy to find with a Google search. I obtained all the video ids from the playlist and used the distil-whisper/distil-large-v3 model from huggingface to transcribe the audio. From the video data, the fields of interest for information retrieval are:

  • title
  • description
  • text (the transcription)

Fetching this metadata for the videos in a playlist can be done either via the YouTube Data API or some pip packages.
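For the API route, a sketch like the following works, assuming you have an API key and the playlist id at hand (google-api-python-client; the playlist id below is a placeholder, not the real IODIS playlist id):

```python
# Sketch: fetch title and description for every video in a playlist via the
# YouTube Data API v3. Requires `pip install google-api-python-client`.
from googleapiclient.discovery import build

API_KEY = "..."           # your YouTube Data API key
PLAYLIST_ID = "PLxxxxxx"  # placeholder playlist id

youtube = build("youtube", "v3", developerKey=API_KEY)

videos, page_token = [], None
while True:
    response = youtube.playlistItems().list(
        part="snippet",
        playlistId=PLAYLIST_ID,
        maxResults=50,
        pageToken=page_token,
    ).execute()
    for item in response["items"]:
        snippet = item["snippet"]
        videos.append({
            "video_id": snippet["resourceId"]["videoId"],
            "title": snippet["title"],
            "description": snippet["description"],
        })
    page_token = response.get("nextPageToken")
    if not page_token:
        break
```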

Obtaining Transcripts

I ran the automatic-speech-recognition pipeline using the distil-whisper/distil-large-v3 model (yes, I’m taking on the hf dependency; maybe llama.cpp would also work). It does surprisingly well on audio that is usually 20-40 minutes long, especially for a general-purpose speech-to-text model.
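For reference, the transcription step looks roughly like this with the transformers pipeline; the device, dtype, chunk length, and batch size below are illustrative values rather than the exact settings I used.

```python
# Rough shape of the transcription step (hf transformers pipeline).
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu"
)

audio_path = "audio/some_video_id.mp3"  # placeholder path to a downloaded audio file
result = asr(
    audio_path,
    chunk_length_s=25,       # chunked long-form decoding for 20-40 minute episodes
    batch_size=16,
    return_timestamps=True,  # timestamps help jump back to the moment in the video
)
print(result["text"])
```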

Chunking

Chunking and the tweaking of its details are specific to each use case and dataset, which is why many RAG software packages are bloated with hierarchies of classes and dependencies just so they can claim “hrrdrr i support this clever trick!”

I started out with plain sentence-based chunking with a 128-word upper bound and no overlap: I split a video’s transcript on periods and merge consecutive sentences together; once the merged result exceeds 128 words after a whitespace split, I treat it as one chunk and add it to the record. One problem is that I did not overlap my chunks, though I have not verified whether this is hurting retrieval. A minimal sketch of the procedure follows.
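This is my simplified reading of the procedure, not the exact code; real transcripts would need more careful sentence splitting than a bare period split.

```python
def chunk_transcript(transcript: str, max_words: int = 128) -> list[str]:
    """Greedy sentence merging with a word-count upper bound and no overlap.
    A simplified sketch of the chunking procedure described above."""
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)
        # whitespace-split word count of the merged sentences so far
        if len(" ".join(current).split()) > max_words:
            chunks.append(". ".join(current) + ".")
            current = []
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks
```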

Problems with data and my processing

  1. the transcription is full of terminology errors: like most games, Starcraft 2 has a lot of domain-specific terminology, and Whisper could not pick it up. While fixing the terms in place by hand did not help the system retrieve the chunks relevant to the top-priority queries, a fair next attempt is to curate a terminology dictionary and pass both the dictionary and the chunk under review to a language model, asking it to proofread and correct any term errors (see the sketch after this list).
    1. for a language model to do what I plan for it to do, it should have the following capabilities, all of which have become the norm at the time of writing (2024):
      1. in-context learning (a classic that I have grown to appreciate)
      2. a long context window (usually above 8192 tokens)
      3. returning a valid JSON object
    2. god i love progress in ai
  2. the chunk sizes can be tweaked a bit, but that should not be the priority right now
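Here is a sketch of what the terminology-fixing step could look like, assuming an OpenAI-compatible chat API with JSON mode; the model name and the dictionary entries are placeholders, not the actual curated list.

```python
import json
from openai import OpenAI

# Sketch of the planned terminology fix, assuming an OpenAI-compatible chat API
# with JSON mode. Model name and dictionary entries are placeholders.
client = OpenAI()

TERMINOLOGY = {
    "stalker": "Protoss ranged unit, often mis-heard by ASR",
    "oracle": "Protoss harassment unit",
    # ... curated SC2 terms and their common mis-transcriptions
}

def fix_terminology(chunk: str) -> dict:
    prompt = (
        "You proofread StarCraft 2 video transcripts. Using the terminology "
        "dictionary below, find words in the chunk that are likely "
        "mis-transcribed game terms and propose corrections.\n\n"
        f"Dictionary:\n{json.dumps(TERMINOLOGY, indent=2)}\n\n"
        f"Chunk:\n{chunk}\n\n"
        'Return a JSON object: {"corrections": [{"original": ..., "fixed": ...}]}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```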

Project Background

As mentioned, I’m a fan of this series, to the point where I carry around minor, noisy memories like:

“during one of the episodes, Harstem talked about how protoss can just build two buildings using one probe without the probe needing to go into the building, whereas the other two races need to send in two workers, and then the editor slapped an IMBA stamp on and flashed the end screen for the joke.”

Now, if you play SC2 and watch IODIS, I hope you share the same obsession over this kind of funny moment. But a “stuff i quote too much” playlist isn’t exactly tractable here, since this behavior of mine generalizes beyond SC2, and there’s no way to attach custom metadata when saving a YouTube video to a playlist. What if I randomly remembered this moment and wanted to rewatch it, but couldn’t remember the video title?

Turning this into a search problem

The intuition behind solving the above situation via search is as follows: IODIS is fundamentally a casting/commentary series, which means that whatever happens during the game is very likely stated or explained by Harstem. So if we transcribe the video, the result is not merely a “transcript” of what he said, but also a good representation of what happened during the game and the episode, including things like the example above.

Given this premise, it is reasonable to expect an information retrieval system built on the transcript data to work. I also figured it would be a nice exercise to write BM25 myself and use it, considering I had not personally worked with any lexical search algorithms before.
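For reference, the Okapi BM25 formula maps to numpy fairly directly. The sketch below follows the standard formulation (k1 = 1.5, b = 0.75) and is not necessarily the exact code I wrote.

```python
import numpy as np
from collections import Counter

# Compact Okapi BM25 in numpy; a sketch, not the exact project implementation.
class BM25:
    def __init__(self, docs: list[list[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_lens = np.array([len(d) for d in docs], dtype=np.float64)
        self.avgdl = self.doc_lens.mean()
        self.vocab = {t: i for i, t in enumerate({t for d in docs for t in d})}
        # term-frequency matrix: docs x vocab
        self.tf = np.zeros((len(docs), len(self.vocab)))
        for di, d in enumerate(docs):
            for term, count in Counter(d).items():
                self.tf[di, self.vocab[term]] = count
        df = (self.tf > 0).sum(axis=0)          # document frequency per term
        n = len(docs)
        self.idf = np.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query: list[str]) -> np.ndarray:
        ids = [self.vocab[t] for t in query if t in self.vocab]
        tf = self.tf[:, ids]                     # docs x query_terms
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens / self.avgdl)
        scores = self.idf[ids] * (tf * (self.k1 + 1)) / (tf + norm[:, None])
        return scores.sum(axis=1)                # one score per document

# Toy usage
docs = [t.lower().split() for t in
        ["probe builds two buildings", "scv enters the building"]]
bm25 = BM25(docs)
print(bm25.score("probe building".split()))
```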