Overview
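A two-stage information retrieval API that takes a podcast title and transcript and returns the 5 most relevant highlight sentences. Stage 1 ranks sentences lexically with BM25, and Stage 2 re-ranks the best candidates with dense sentence embeddings, so the final highlights reflect both exact term overlap and semantic similarity to the title. The output is an extractive summary: every highlight is a sentence taken verbatim from the transcript. The design is inspired by Spotify’s podcast segment retrieval research (TREC 2020).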
Architecture
Stage 1 — BM25 Lexical Ranking (Custom Implementation)
A from-scratch implementation of the Okapi BM25 algorithm, with no external BM25 library. Each transcript sentence is treated as a document and scored against the podcast title (the query) using term frequency, inverse document frequency, and document-length normalization. The 10 highest-scoring sentences are passed to Stage 2 as candidates.
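The scoring described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (tokenize, bm25_scores, top_candidates), the parameter values k1=1.5 and b=0.75, the regex tokenizer standing in for NLTK, and the non-negative IDF variant are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in for the NLTK tokenizer (assumption).
    return re.findall(r"[a-z0-9']+", text.lower())


def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Return one BM25 score per sentence for the given query."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    if n == 0:
        return []
    # Average sentence length; fall back to 1.0 if every sentence is empty.
    avgdl = (sum(len(d) for d in docs) / n) or 1.0

    # Document frequency: number of sentences that contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    # Repeated query terms are counted once (a simplifying choice).
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average sentences are penalized.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, which keeps it non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def top_candidates(query, sentences, k=10):
    """Return the k highest-scoring (sentence, score) pairs, best first."""
    scores = bm25_scores(query, sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [(sentences[i], scores[i]) for i in ranked[:k]]


if __name__ == "__main__":
    transcript = [
        "Sleep science explains why we dream.",
        "Coffee is popular worldwide.",
        "We talk about sleep.",
    ]
    for sentence, score in top_candidates("sleep science", transcript, k=2):
        print(f"{score:.3f}  {sentence}")
```

Sentences that share no term with the title score exactly 0, which is the gap the semantic stage is meant to cover.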
Stage 2 — MSMARCO DistilBERT Semantic Re-ranking
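Uses the msmarco-distilbert-base-tas-b model, loaded through sentence-transformers, to encode the title and the candidate sentences into dense vector embeddings, then re-ranks the candidates by cosine similarity to the title embedding. This captures semantic matches, such as paraphrases and synonyms, that BM25’s exact-term matching misses. The 5 highest-ranked sentences are returned as the highlights.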
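A minimal sketch of the re-ranking step is shown below, assuming the embedding model is passed in as a function. The names rerank, cosine, and encode are hypothetical and not taken from the project. Keeping the encoder as a parameter means the ranking logic can be exercised without downloading the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Treat an all-zero embedding as unrelated.
    return dot / (norm_u * norm_v)


def rerank(title, candidates, encode, top_k=5):
    """Re-rank candidate sentences by similarity to the title.

    `encode` is any function mapping a list of strings to a list of
    vectors (hypothetical interface, chosen for this sketch).
    Returns up to top_k (sentence, score) pairs, best first.
    """
    candidates = list(candidates)
    if not candidates:
        return []
    title_vec = encode([title])[0]
    cand_vecs = encode(candidates)
    scored = [(cosine(title_vec, vec), sent)
              for vec, sent in zip(cand_vecs, candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(sent, score) for score, sent in scored[:top_k]]


# Plugging in the real model (requires sentence-transformers and a download):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("msmarco-distilbert-base-tas-b")
#   highlights = rerank(title, bm25_top10, model.encode, top_k=5)
```

Here bm25_top10 stands for the list of Stage 1 candidate sentences. One point worth checking against the actual implementation: the model card for msmarco-distilbert-base-tas-b describes it as tuned for dot-product scoring, so cosine similarity and dot product may order the candidates slightly differently.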
Tech Stack
- API Framework: FastAPI + Uvicorn
- Lexical Ranking: Custom BM25 implementation
- Semantic Re-ranking: MSMARCO DistilBERT (sentence-transformers)
- Tokenization: NLTK
- Containerization: Docker
References
- Spotify Podcast Segment Retrieval — TREC 2020