Computational modeling of human spoken language is an emerging research area in multimedia analysis that spans the textual and acoustic modalities. Multi-modal sentiment analysis is one of the most fundamental tasks in spoken language understanding. In this paper, we propose a novel approach to selecting effective sentiment-relevant words for multi-modal sentiment analysis, focusing on both the textual and acoustic modalities. Unlike the conventional soft attention mechanism, we employ a deep reinforcement learning mechanism to perform sentiment-relevant word selection and completely remove invalid words from each modality. Specifically, we first align the raw text and audio at the word level and extract independent handcrafted features for each modality to yield textual and acoustic word sequences. Second, we establish two collaborative agents that handle the textual and acoustic modalities of spoken language, respectively. On this basis, we formulate sentiment-relevant word selection in a multi-modal setting as a multi-agent sequential decision problem and solve it with a multi-agent reinforcement learning approach. Detailed evaluations of multi-modal sentiment classification and emotion recognition on three benchmark datasets demonstrate that our approach clearly outperforms several competitive baselines.
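The multi-agent word-selection scheme described above can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: it assumes word-aligned feature vectors for each modality, uses a simple Bernoulli-logistic policy per agent in place of a deep network, and replaces the reward from the downstream sentiment classifier with a hand-written stand-in. The feature dimensions, sequence length, and reward shape are all hypothetical.

```python
import math
import random

random.seed(0)

def keep_prob(feature, weights):
    # Logistic keep-probability for one word from its feature vector.
    z = sum(f * w for f, w in zip(feature, weights))
    return 1.0 / (1.0 + math.exp(-z))

def select(features, weights):
    # Sample a binary keep (1) / drop (0) action for every word in a modality.
    probs = [keep_prob(f, weights) for f in features]
    actions = [1 if random.random() < p else 0 for p in probs]
    return actions, probs

def reinforce_update(weights, features, actions, probs, reward, lr=0.1):
    # REINFORCE: grad of log pi for a Bernoulli-logistic policy is (a - p) * f.
    for f, a, p in zip(features, actions, probs):
        for i, x in enumerate(f):
            weights[i] += lr * reward * (a - p) * x
    return weights

# Hypothetical word-aligned features: 6 words, 4-dim vectors per modality.
text_feats = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
audio_feats = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
w_text, w_audio = [0.0] * 4, [0.0] * 4

# Each collaborative agent selects words for its own modality.
actions_text, probs_text = select(text_feats, w_text)
actions_audio, probs_audio = select(audio_feats, w_audio)

# Stand-in shared reward; in the paper it would come from the downstream
# sentiment classifier evaluated on the words kept in both modalities.
reward = -abs(sum(actions_text) - 3) - abs(sum(actions_audio) - 3)

# Both agents are updated with the same reward, coupling their decisions.
w_text = reinforce_update(w_text, text_feats, actions_text, probs_text, reward)
w_audio = reinforce_update(w_audio, audio_feats, actions_audio, probs_audio, reward)
```

The point of the sketch is the structure of the decision problem: two policies act over aligned word sequences, a single reward couples them, and the policy-gradient update pushes each agent toward keep/drop choices that help the shared objective.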