Predicting the popularity of social media videos before they are published is a challenging task, mainly because of the complexity of content distribution networks and the number of factors that play a part in this process. Because solving this task would greatly help media content creators, many successful machine learning methods have been proposed for it. In this work, we change the viewpoint and postulate that what matters is not only the predicted popularity but also, perhaps even more importantly, an understanding of how individual parts of a video influence the final popularity score. To that end, we propose to combine the Grad-CAM visualization method, which highlights the spatial regions relevant to popularity, with a soft self-attention mechanism that weights the relative importance of frames in the time domain. Our preliminary results show that this approach allows for a more intuitive interpretation of the impact of content on video popularity while achieving competitive prediction accuracy.
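
To make the two ingredients concrete, the following is a minimal NumPy sketch, not the architecture used in this work. It assumes per-frame feature vectors are already available (random arrays stand in for CNN outputs), and it assumes an additive scoring function with hypothetical parameters `w` and `v` for the soft attention. The function names, shapes, and sizes are illustrative only. The Grad-CAM part follows the standard formulation: channel weights are the spatially averaged gradients of the score, and the map is the ReLU of the weighted sum of activation maps.

```python
import numpy as np


def soft_attention_pool(frame_features, w, v):
    """Pool per-frame features into one video vector with soft attention.

    frame_features: (T, D) array, one feature vector per frame.
    w: (D, H) and v: (H,) are hypothetical learned parameters of an
    additive scoring function (an assumption, not the paper's exact form).
    Returns the pooled (D,) vector and the (T,) temporal weights.
    """
    scores = np.tanh(frame_features @ w) @ v      # one scalar score per frame
    scores = scores - scores.max()                # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ frame_features             # weighted average over time
    return pooled, weights


def grad_cam_map(activations, gradients):
    """Compute a Grad-CAM map for one frame.

    activations: (K, H, W) feature maps of a convolutional layer.
    gradients:   (K, H, W) gradients of the popularity score with respect
                 to those feature maps (assumed to be computed elsewhere).
    Returns an (H, W) non-negative relevance map.
    """
    alpha = gradients.mean(axis=(1, 2))           # one weight per channel
    cam = np.tensordot(alpha, activations, axes=1)
    return np.maximum(cam, 0.0)                   # ReLU keeps positive evidence


# Illustrative usage with random stand-in data.
rng = np.random.default_rng(0)
T, D, H = 8, 16, 4                                # frames, feature dim, hidden dim
frames = rng.normal(size=(T, D))
w = rng.normal(size=(D, H))
v = rng.normal(size=H)
pooled, weights = soft_attention_pool(frames, w, v)

cam = grad_cam_map(rng.normal(size=(5, 7, 7)), rng.normal(size=(5, 7, 7)))
```

In this sketch, `weights` indicates which frames contribute most to the pooled representation, and `cam` indicates which spatial regions of a single frame support the score. Reading the two together gives the kind of temporal and spatial interpretation described above.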