Can convolutional neural networks (CNNs) recognize gestures from a camera for robotic control? We examine this question using a small set of vehicle control gestures (move forward, grab control, no gesture, release control, stop, turn left, and turn right). Deep learning methods typically require large amounts of training data; for image recognition, the widely used ImageNet data set consists of millions of labeled images. We do not expect to collect a similar volume of training data for vehicle control gestures. Our method therefore applies transfer learning: the convolutional layers of the CNN are initialized with weights obtained by training on ImageNet, and the fully connected layers are then trained on a smaller set of gesture data that we collected and labeled. Our data set consists of about 50,000 images recorded at ten frames per second, collected and labeled in less than 15 man-hours. The images contain multiple people in a variety of indoor and outdoor settings. Approximately 4,000 images are held out for testing and contain a person who does not appear in any of the training images. After training, more than 99% of the test images are correctly recognized. We also use the system to control a small unmanned ground vehicle. Finally, we investigate adding a Long Short-Term Memory (LSTM) layer to recognize gestures that require analyzing sequences of images. On this more difficult gesture set, we achieve a recognition rate of approximately 80% using a smaller data set of approximately 26,000 images.
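A minimal sketch of the transfer learning setup described above: pretrained convolutional weights are frozen and only the fully connected head is retrained on the 7-class gesture data. The abstract does not name the backbone or hyperparameters; the VGG-16 choice and learning rate here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a CNN whose convolutional weights were pretrained on ImageNet.
# (Backbone choice is an assumption; the paper does not specify one.)
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the convolutional layers; only the fully connected head will train.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a 7-way gesture classifier
# (move forward, grab, no gesture, release, stop, turn left, turn right).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 7)

# Pass only the trainable (classifier) parameters to the optimizer.
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
```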
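For the sequence gestures, one common way to add an LSTM layer on top of a pretrained CNN is to extract a feature vector per frame and feed the resulting sequence to the LSTM. The sketch below assumes a ResNet-18 backbone, a 7-class output, and a single LSTM layer; none of these details are confirmed by the source, and this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    """Per-frame CNN features fed to an LSTM for sequence-level gestures."""
    def __init__(self, num_classes=7, hidden_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Keep everything up to and including global average pooling (drop the FC head).
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                        # x: (batch, time, 3, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))        # (b*t, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)  # (b, t, 512)
        out, _ = self.lstm(feats)                # (b, t, hidden_dim)
        return self.fc(out[:, -1])               # classify from the last time step

# Example: a batch of 4 clips, 10 frames each, at 224x224 resolution.
logits = CNNLSTM()(torch.randn(4, 10, 3, 224, 224))  # shape (4, 7)
```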