Abstract:
This paper proposes an enhanced YOLOv11 (You Only Look Once) architecture for onboard instance segmentation of railway infrastructure in autonomous train systems. The work addresses the challenge of achieving real-time inference without sacrificing segmentation accuracy. The proposed method incorporates three modifications: a computationally efficient SimSPPF (Simplified Spatial Pyramid Pooling - Fast) module, CBAM (Convolutional Block Attention Module) attention, and optimized scaling parameters. We evaluate 30 model configurations, formed by combining six YOLOv11 variants (nano, small, medium, large, extra-large, and our improved YOLOv11sim) with five input resolutions (from 640 $\times$ 640 to 1920 $\times$ 1920). All models were trained and validated on our new Russian railway infrastructure dataset, which contains 20,000 annotated images covering diverse infrastructure elements and operational scenarios specific to Russian rail networks. Experimental results show that YOLOv11sim achieves a better accuracy-speed trade-off, especially at high resolutions: inference is 15–30% faster than comparable baselines while retaining 92–96% of large-model accuracy. Even in cluttered scenes, the enhanced architecture segments a wide range of elements, from entire track components to small but critical objects such as pickets and signal lights, with precise mask boundaries. By fitting within the compute budget of embedded hardware while delivering the accuracy required by safety-critical railway operations, these results address the core requirements of deployable perception systems for autonomous trains.