Automatic processing of emotion information through deep neural networks (DNN) can have great benefits for human-machine interaction. Vice versa, machine learning can profit from concepts known from human information processing (e.g., visual attention). We employed a recurrent DNN incorporating a spatial attention mechanism for facial emotion recognition (FER) and compared the output of the network with results from human experiments. The attention mechanism enabled the network to select relevant face regions to achieve state-of-the-art performance on a FER database containing images from realistic settings. A visual search strategy showing some similarities with human saccading behavior emerged when the model’s perceptive capabilities were restricted. However, the model then failed to form a useful scene representation.