We present a general framework for fusing pre-trained multisensor object detection networks for perception in autonomous cars at an intermediate stage using perspective invariant features. Key innovation is an autoencoder-inspired Transformer module which transforms perspective as well as feature activation layout from one sensor modality to another. Transformed feature maps can be combined with those of a modality-native object detector to enhance performance and reliability through a simple fusion scheme. Our approach is not limited to a specific object detection network architecture or even to specific sensor modalities. We show effectiveness of the proposed scheme through experiments on our own as well as on the KITTI dataset.