Podcast Highlight Extractor

Overview

A two-stage information retrieval API that takes a podcast title and transcript and returns the five transcript sentences most relevant to the title as highlights. The system combines lexical ranking with semantic re-ranking to produce an extractive summary, an approach inspired by Spotify’s podcast segment retrieval research (TREC 2020).
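
The retrieve-then-re-rank flow described above can be sketched roughly as follows. This is only an illustration of the control flow, not the project's actual code: the function names, the regex sentence splitter (a stand-in for the NLTK tokenization listed in the tech stack), and the pluggable scorer arguments are all assumptions made for the example.

```python
import re


def split_sentences(transcript):
    # Stand-in sentence splitter (assumption): break after ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", transcript)
    return [p.strip() for p in parts if p.strip()]


def top_by_score(items, scores, k):
    # Indices sorted by descending score; ties keep their original order.
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:k]]


def extract_highlights(title, transcript, lexical_scores, semantic_scores,
                       n_candidates=10, n_highlights=5):
    """Hypothetical two-stage pipeline.

    lexical_scores and semantic_scores are callables taking
    (title, sentences) and returning one float per sentence.
    """
    sentences = split_sentences(transcript)
    # Stage 1: cheap lexical scoring narrows the transcript to a shortlist.
    candidates = top_by_score(sentences, lexical_scores(title, sentences), n_candidates)
    # Stage 2: the semantic scorer only sees the shortlist.
    return top_by_score(candidates, semantic_scores(title, candidates), n_highlights)
```

Because the second-stage scorer only runs on the shortlist, the more expensive model is applied to at most `n_candidates` sentences per request, however long the transcript is.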

Architecture

Stage 1 — BM25 Lexical Ranking (Custom Implementation)

A from-scratch implementation of the Okapi BM25 algorithm, with no external ranking library. Each transcript sentence is treated as a document and scored against the podcast title (the query) using term frequency, inverse document frequency, and length normalization. The top 10 candidates are passed to Stage 2.
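
As a rough illustration of sentence-level BM25 scoring, here is a minimal sketch. It is not the project's implementation: the parameter values `k1=1.5` and `b=0.75` are common defaults assumed here, the IDF uses the always-positive `log(1 + ...)` variant, and the regex tokenizer stands in for NLTK. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Return one BM25 score per sentence, treating each sentence as a document."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every sentence is empty.
    avgdl = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: number of sentences containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    query_terms = set(tokenize(query))  # repeated query terms count once
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            f = tf.get(term, 0)
            if f == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average sentences are penalized.
            norm = f + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * f * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def top_candidates(query, sentences, n=10):
    # Keep the n best-scoring sentences together with their scores.
    scores = bm25_scores(query, sentences)
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [(sentences[i], scores[i]) for i in order[:n]]
```

A sentence that shares no term with the title scores exactly zero, which is the gap the semantic stage is meant to cover for the sentences that do reach the shortlist.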

Stage 2 — MSMARCO DistilBERT Semantic Re-ranking

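Encodes the title and the candidate sentences into dense vector embeddings with msmarco-distilbert-base-tas-b, then re-ranks the candidates by cosine similarity to the title embedding. This captures semantic matches that BM25’s lexical matching misses. The five highest-scoring sentences are returned as highlights.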
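
The re-ranking step itself is small once embeddings are available. The sketch below assumes the embeddings have already been computed and are passed in as plain lists of floats. The function names and the `top_k=5` default are illustrative, not taken from the project.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def rerank(query_vec, candidates, candidate_vecs, top_k=5):
    """Sort candidate sentences by cosine similarity to the query embedding."""
    scored = [
        (sentence, cosine_similarity(query_vec, vec))
        for sentence, vec in zip(candidates, candidate_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In a real pipeline the vectors would presumably come from the sentence-transformers library, for example `SentenceTransformer("msmarco-distilbert-base-tas-b").encode(sentences)`. The model's documentation describes this checkpoint as tuned for dot-product scoring, so it may be worth comparing both similarity functions on real transcripts.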

Tech Stack

API Framework: FastAPI + Uvicorn
Lexical Ranking: Custom BM25 implementation
Semantic Re-ranking: MSMARCO DistilBERT (sentence-transformers)
Tokenization: NLTK
Containerization: Docker

References

Spotify Podcast Segment Retrieval — TREC 2020
