Beyond a Pre-Trained Object Detector:
Cross-Modal Textual and Visual Context for Image Captioning