Generating sports highlight videos from broadcast footage presents significant challenges, including accurately identifying key moments and understanding the broader context of the game. Existing methods often fail to capture the full context and dynamics of different sports, resulting in less effective highlight detection. In this paper, we propose a novel framework that leverages large language models (LLMs) to address these challenges. Our approach extracts the audio commentary from broadcast videos, converts it into text with a speech-to-text (STT) model, and segments the transcript into coherent units. In addition, we analyze the audio track to extract volume data, which, combined with the segmented commentary, serves as input to the LLM. The LLM identifies potential highlight moments by considering both the textual and audio context and assigns pseudo labels to these segments. This process enables the generation of highlight videos that encapsulate the most exciting and significant moments of the game. Our method adapts to different sports without extensive retraining. Experimental results demonstrate the effectiveness of our framework in producing high-quality sports highlights, outperforming traditional methods in both accuracy and contextual relevance.
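As a rough illustration of the pipeline described above, the following minimal Python sketch (not the authors' released code) pairs per-window audio volume with transcribed commentary segments and asks an LLM for pseudo labels. The use of librosa for audio loading, the prompt wording, and the `query_llm` helper are assumptions for illustration only.

```python
"""Hypothetical sketch of the volume + commentary pseudo-labeling step."""
import numpy as np
import librosa


def extract_volume(audio_path: str, window_s: float = 5.0) -> list[float]:
    """Return mean RMS volume for consecutive fixed-length windows of the audio track."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    hop = int(window_s * sr)
    return [float(np.sqrt(np.mean(y[i:i + hop] ** 2))) for i in range(0, len(y), hop)]


def build_prompt(segments: list[str], volumes: list[float]) -> str:
    """Combine commentary segments (from an STT model) with volume cues into one LLM prompt."""
    lines = [
        f"[{i}] volume={v:.3f} commentary: {s}"
        for i, (s, v) in enumerate(zip(segments, volumes))
    ]
    return (
        "For each numbered commentary segment from a sports broadcast, answer "
        "HIGHLIGHT or NOT, using both the text and the audio volume cue.\n"
        + "\n".join(lines)
    )


def pseudo_label(segments: list[str], volumes: list[float], query_llm) -> list[str]:
    """query_llm is any callable that sends a prompt to an LLM and returns its text response."""
    response = query_llm(build_prompt(segments, volumes))
    return [line.split()[-1] for line in response.strip().splitlines()]
```

Segments labeled HIGHLIGHT in this sketch would then be mapped back to their timestamps and concatenated to form the highlight video.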