The images, text, and sound in a documentary video arouse audiences’ curiosity. The documentary series “The Firsts in Life” integrates multiple modalities, such as images, text, and sound, to express its theme in a simple manner. Watching a documentary is in effect a process of decoding the symbolic elements of multimodal discourse. After the series was released on CCTV on January 12, 2020, its Douban score reached 9.2, attracting much attention. Based on Zhang Delu’s Synthetic Theoretical Framework of Multimodal Discourse Analysis and Kress & van Leeuwen’s Visual Grammar, this paper analyzes the language mode, para-language mode, bodily mode, and non-bodily mode in the video. Moreover, it explores how the documentary draws on the relationships among symbolic elements such as text, images, and sound to construct its overall meaning.