Point Cloud Transformers (PCTs) have gained a lot of attention, not only on indoor data but also on large-scale outdoor 3D point clouds, such as those in autonomous driving. However, the vanilla self-attention mechanism in PCTs does not include any explicit prior spatial information about the quantized voxels (or pillars). Recently, the Retentive Network has gained attention in the natural language processing (NLP) domain due to its efficient modelling capability and remarkable performance, enabled by an explicit decay mechanism that incorporates distance-related spatial prior knowledge into the model. As NLP tasks are causal and one-dimensional in nature, the explicit decay is designed to be unidirectional and one-dimensional. However, the pillars in the Bird’s Eye View (BEV) space are two-dimensional without causal properties. In this work, we propose the RetFormer model by introducing a bidirectional and two-dimensional decay mechanism for pillars in PCTs, and we design the novel Multi-Scale Retentive Self-Attention (MSReSA) module. The explicit bidirectional and two-dimensional decay incorporates the 2D spatial distance-related prior information of pillars into the PCT, which significantly improves the modelling capacity of RetFormer. We evaluate our method on the large-scale Waymo and KITTI datasets. RetFormer not only achieves significant performance gains of 2.3 mAP and 3.2 mAP over the PCT-based SST and FlatFormer respectively, and 6.4 mAP over the sparse convolution-based CenterPoint for the small-object pedestrian category on the Waymo Open Dataset, but is also efficient, with a 3.2x speedup over SST, and runs in real time at 69 FPS on an RTX 4090 GPU.
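As a rough illustration of the bidirectional two-dimensional decay described in this abstract, the sketch below builds a symmetric decay matrix over pillar grid coordinates and uses it to weight attention scores. It assumes the decay has the form gamma ** (|x_i - x_j| + |y_i - y_j|), i.e. Manhattan distance in the BEV grid; that form, the function names, and the `gamma` value are illustrative assumptions and are not taken from the RetFormer implementation.

```python
def decay_mask_2d(coords, gamma=0.9):
    """Build a symmetric (bidirectional) decay matrix over 2D pillar coordinates.

    coords: list of (x, y) integer grid positions of the pillars (assumed layout).
    Entry [i][j] is gamma ** manhattan_distance(i, j), so nearby pillars
    keep a weight close to 1 and distant pillars are attenuated.
    """
    mask = []
    for xi, yi in coords:
        row = []
        for xj, yj in coords:
            dist = abs(xi - xj) + abs(yi - yj)  # 2D distance, no causal ordering
            row.append(gamma ** dist)
        mask.append(row)
    return mask


def apply_decay(scores, mask):
    """Element-wise product of attention scores with the decay matrix."""
    return [[s * d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(scores, mask)]


if __name__ == "__main__":
    pillars = [(0, 0), (1, 0), (2, 3)]
    for row in decay_mask_2d(pillars):
        print([round(v, 4) for v in row])
```

Unlike the unidirectional decay used for causal sequences, this matrix is symmetric: pillar i attenuates pillar j exactly as much as j attenuates i.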
3DObjDet-Fusion
SRFDet3D: Sparse Region Fusion based 3D Object Detection
Unlike earlier 3D object detection approaches that formulate hand-crafted dense (in the thousands) object proposals by leveraging anchors on dense feature maps, we formulate np (in the hundreds) learnable sparse object proposals to predict 3D bounding box parameters. The sparse proposals in our approach are not only learnt during training but are also input-dependent, so they represent better object candidates during inference. Leveraging the sparse proposals, we fuse only the sparse regions of multi-modal features, and we propose the Sparse Region Fusion based 3D object Detection (SRFDet3D) network with three main components: an encoder for feature extraction, a region proposal generation module for sparse input-dependent proposals, and a decoder for multi-modal feature fusion and iterative refinement of object proposals. Additionally, for optimal training, we formulate our sparse detector with many-to-one label assignment based on the Optimal Transport Algorithm (OTA). We conduct extensive experiments and analysis on publicly available large-scale autonomous driving datasets: nuScenes, KITTI, and Waymo. Our LiDAR-only SRFDet3D-L network achieves 63.1 mAP and outperforms the state-of-the-art networks on the nuScenes dataset, surpassing the dense detectors on the KITTI and Waymo datasets. Our LiDAR-Camera model SRFDet3D achieves 64.7 mAP with improvements over existing fusion methods.
3DObjDet-LiDAR
DeLiVoTr: Deep and light-weight voxel transformer for 3D object detection
Image-based backbone (feature extraction) networks downsample the feature maps not only to increase the receptive field but also to efficiently detect objects of various scales. Existing feature extraction networks in LiDAR-based 3D object detection follow a similar feature map downsampling to increase the receptive field. However, such downsampling of LiDAR feature maps in large-scale autonomous driving scenarios hinders the detection of small objects, such as pedestrians. To solve this issue, we design an architecture that maintains not only the scale of the feature maps but also the receptive field in the feature extraction network, to aid the efficient detection of small objects. We resort to the attention mechanism to build a sufficient receptive field, and we propose a Deep and Light-weight Voxel Transformer (DeLiVoTr) network with voxel intra- and inter-region transformer modules to extract voxel local and global features respectively. We introduce the DeLiVoTr block, which uses transformations with an expand-and-reduce strategy to vary the width and depth of the network efficiently. This facilitates learning wider and deeper voxel representations and enables the use of not only a smaller dimension for the attention mechanism but also a light-weight feed-forward network, reducing the number of parameters and operations. In addition to model scaling, we employ layer-level scaling of the DeLiVoTr encoder layers for efficient parameter allocation in each encoder layer, instead of a fixed number of parameters as in existing approaches. Leveraging layer-level depth and width scaling, we formulate three variants of the DeLiVoTr network. We conduct extensive experiments and analysis on the large-scale Waymo and KITTI datasets. Our network surpasses state-of-the-art methods for the detection of small objects (pedestrians) with an inference speed of 20.5 FPS.
3DObjDet-LiDAR
DDet3D: Embracing 3D Object Detector with Diffusion
Existing approaches rely on heuristic or learnable object proposals (which must be optimised during training) for 3D object detection. In our approach, we replace the hand-crafted or learnable object proposals with randomly generated object proposals by formulating a new paradigm that employs a diffusion model to detect 3D objects from a set of randomly generated and supervised learning-based object proposals in an autonomous driving application. We propose DDet3D, a diffusion-based 3D object detection framework that formulates 3D object detection as a generative task over the 3D bounding box coordinates in 3D space. To our knowledge, this work is the first to formulate 3D object detection with a denoising diffusion model and to establish that 3D randomly generated and supervised learning-based proposals (different from empirical anchors or learnt queries) are also potential object candidates for 3D object detection. During training, the 3D random noisy boxes are obtained from the 3D ground truth boxes by progressively adding Gaussian noise, and the DDet3D network is trained to reverse the diffusion process. During inference, the DDet3D network iteratively refines the 3D randomly generated and supervised learning-based noisy boxes to predict 3D bounding boxes conditioned on the LiDAR Bird’s Eye View (BEV) features. The advantage of DDet3D is that it decouples the training and inference stages, thus enabling the use of a larger number of proposal boxes or sampling steps during inference to improve accuracy. We conduct extensive experiments and analysis on the nuScenes and KITTI datasets. DDet3D achieves competitive performance compared to well-designed 3D object detectors. Our work serves as a strong baseline to explore and employ more efficient diffusion models for 3D perception tasks.
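The training-time noising step mentioned in this abstract can be sketched with the standard closed-form forward diffusion, x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, applied to each box parameter. This is a minimal sketch under stated assumptions: the linear beta schedule, the 7-value box layout (cx, cy, cz, w, l, h, yaw), and the names `alpha_bar` and `noisy_box` are hypothetical and not confirmed by the DDet3D paper.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t under a linear schedule.

    The linear schedule and its end points are assumptions for illustration.
    """
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noisy_box(box, t, rng, num_steps=1000):
    """Sample a noisy box at step t from a (normalized) ground-truth box.

    box: sequence of box parameters, assumed here to be
         (cx, cy, cz, w, l, h, yaw) scaled to a comparable range.
    """
    a = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in box]


if __name__ == "__main__":
    rng = random.Random(0)
    gt_box = [0.2, -0.1, 0.0, 0.4, 0.9, 0.3, 0.5]
    for step in (0, 500, 999):
        print(step, [round(v, 3) for v in noisy_box(gt_box, step, rng)])
```

At small t the sample stays close to the ground-truth box; at the final step it is almost pure Gaussian noise, which is what allows inference to start from randomly generated boxes.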
3DObjDet-Fusion
DAFDeTr: Deformable Attention Fusion Based 3D Detection Transformer
Gopi Krishna Erabati, and Helder Araujo
In Robotics, Computer Vision and Intelligent Systems , 2024
Existing approaches fuse the LiDAR points and image pixels by hard association, relying on highly accurate calibration matrices. We propose the Deformable Attention Fusion based 3D Detection Transformer (DAFDeTr) to attentively and adaptively fuse the image features with the LiDAR features by soft association using a deformable attention mechanism. Specifically, our detection head consists of two decoders for sequential fusion: a LiDAR decoder and an image decoder, powered by deformable cross-attention, to link the multi-modal features to the 3D object predictions leveraging a sparse set of object queries. The refined object queries from the LiDAR decoder attentively fuse with the corresponding and required image features, establishing a soft association and thereby making our model robust to camera malfunction. We conduct extensive experiments and analysis on the nuScenes and Waymo datasets. Our DAFDeTr-L achieves 63.4 mAP and outperforms well-established networks on the nuScenes dataset, and obtains competitive performance on the Waymo dataset. Our fusion model DAFDeTr achieves 64.6 mAP on the nuScenes dataset. We also extend our model to the 3D tracking task, where it outperforms state-of-the-art methods.
Pose and Depth
Self-supervised monocular pose and depth estimation for wireless capsule endoscopy with transformers
Nahid Nazifi, Helder Araujo, Gopi Krishna Erabati, and Omar Tahri
In Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling , 2024
Wireless Capsule Endoscopy (WCE) is an emerging diagnostic technology to examine the gastrointestinal tract and detect a wide range of diseases and pathologies by capturing images and transferring them remotely. Control over the movement of the capsule is crucial for more accurate detection of the location of the capsule and of potentially diseased areas, and for biopsy and drug delivery. However, WCE faces several challenges, notably the deformable nature of the soft tissues and texture-less surfaces that are subject to strong specular reflections. To address these issues, and since reliable real-time 3D pose estimation is critical for controlling active endoscopic capsule robots, this work proposes a data-driven approach to estimate the pose and depth of a wireless capsule endoscope. Building on recent advances in transformer networks for computer vision tasks, we introduce a Transformer-based architecture that uses the self-attention mechanism to handle the specular reflections and deformable topography of the gastrointestinal tract. This would be a step toward developing fully autonomous capsule endoscopy for more precise diagnostics and treatments.
2023
3DObjDet-LiDAR
Li3DeTr: A LiDAR Based 3D Detection Transformer
Gopi Krishna Erabati, and Helder Araujo
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , 2023
Inspired by recent advances in vision transformers for object detection, we propose Li3DeTr, an end-to-end LiDAR based 3D Detection Transformer for autonomous driving that takes LiDAR point clouds as input and regresses 3D bounding boxes. The LiDAR local and global features are encoded using sparse convolution and multi-scale deformable attention respectively. In the decoder head, firstly, in the novel Li3DeTr cross-attention block, we link the LiDAR global features to 3D predictions leveraging the sparse set of object queries learnt from the data. Secondly, the object query interactions are formulated using multi-head self-attention. Finally, the decoder layer is repeated Ldec times to refine the object queries. Inspired by DETR, we employ a set-to-set loss to train the Li3DeTr network. Without bells and whistles, the Li3DeTr network achieves 61.3% mAP and 67.6% NDS, surpassing the state-of-the-art methods with non-maximum suppression (NMS) on the nuScenes dataset, and it also achieves competitive performance on the KITTI dataset. We also employ knowledge distillation (KD) using a teacher and student model, which slightly improves the performance of our network.
2022
3DObjDet-Fusion
MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving
Gopi Krishna Erabati, and Helder Araujo
In ICPR 2022 workshop on Deep Learning for Visual Detection and Recognition (DLVDR) , Aug 2022
3D object detection is a significant task for autonomous driving. Recently, with the progress of vision transformers, the 2D object detection problem has been treated with a set-to-set loss. Inspired by these approaches to 2D object detection and by DETR3D, an approach for multi-view 3D object detection, we propose MSF3DDETR: a Multi-Sensor Fusion 3D Detection Transformer architecture that fuses image and LiDAR features to improve the detection accuracy. Our end-to-end, single-stage, anchor-free and NMS-free network takes in multi-view images and LiDAR point clouds and predicts 3D bounding boxes. Firstly, we link the object queries learnt from data to the image and LiDAR features using a novel MSF3DDETR cross-attention block. Secondly, the object queries interact with each other in a multi-head self-attention block. Finally, the MSF3DDETR block is repeated L times to refine the object queries. The MSF3DDETR network is trained end-to-end on the nuScenes dataset using Hungarian algorithm-based bipartite matching and a set-to-set loss inspired by DETR. We present both quantitative and qualitative results, which are competitive with the state-of-the-art approaches.
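The bipartite matching behind the set-to-set loss can be illustrated with a small example that pairs each ground-truth box with exactly one prediction so that the total cost is minimal. The sketch below finds that optimum by brute force over permutations, which is only practical for a handful of boxes; the Hungarian algorithm named in the abstract reaches the same optimum in polynomial time. The L1 cost on box centres and all function names are assumptions for illustration, not the MSF3DDETR cost function.

```python
from itertools import permutations


def match_cost(pred, gt):
    """L1 distance between two box centres (an assumed, simplified cost)."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def bipartite_match(preds, gts):
    """Return (pairs, total_cost) for the cheapest one-to-one assignment.

    pairs is a list of (pred_index, gt_index). Requires len(preds) >= len(gts);
    predictions left unmatched would be treated as "no object" by the loss.
    """
    best_cost, best_pairs = float("inf"), []
    # Try every way of choosing one distinct prediction per ground-truth box.
    for pred_ids in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(pred_ids)]
    return best_pairs, best_cost


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 0.0), (10.0, 10.0, 0.0), (5.0, 5.0, 0.0)]
    ground_truth = [(9.0, 10.0, 0.0), (1.0, 0.0, 0.0)]
    print(bipartite_match(predictions, ground_truth))
```

Because every ground-truth box is claimed by a single prediction, duplicate detections are penalised during training, which is what lets such detectors drop NMS at inference.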
2021
Moving Obj Seg
MOSNet: A lightweight Moving Object Segmentation Network for Autonomous Driving
Gopi Krishna Erabati, and Helder Araujo
In RECPAD 2021 - 27th Portuguese Conference on Pattern Recognition , Aug 2021
The ability to segment moving objects such as cars is a crucial element of the visual perception system of autonomous vehicles for safe manoeuvrability. In this paper, we propose a light-weight Moving Object Segmentation Network (MOSNet) that adopts a two-stream architecture to extract appearance and motion features from RGB images and optical flow respectively. The extracted features are fused with the help of a fusion transformer; a Feature Pyramid Network (FPN) head combines feature maps at various scales, which are then bilinearly upsampled to the original dimension of the input to produce a per-pixel class. The network is trained and tested on the publicly available KITTI MOD dataset. The proposed architecture achieves an Intersection over Union (IoU) of 48.89% for moving objects and runs at 50 fps on an RTX 2080 Ti GPU using a ShuffleNetV2 backbone.
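For reference, the Intersection over Union metric reported in this abstract can be computed for a single class from two binary masks as below. This is a minimal sketch: the nested-list mask layout and the function name are assumptions, and returning 0.0 when both masks are empty is one common convention rather than the paper's stated choice.

```python
def mask_iou(pred, target):
    """Intersection over Union of two binary masks of the same shape.

    pred, target: 2D lists of 0/1 values, where 1 marks a moving-object pixel.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention (assumed): empty prediction and empty target give 0.0.
    return intersection / union if union else 0.0


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    labelled = [[1, 0, 0],
                [0, 1, 1]]
    print(mask_iou(predicted, labelled))
```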
2020
2DObjDet
Object Detection in Traffic Scenarios - A Comparison of Traditional and Deep Learning Approaches
Gopi Krishna Erabati, Nuno Gonçalves, and Helder Araujo
In Proceedings of 9th International Conference on Advanced Information Technologies and Applications (ICAITA 2020) , Aug 2020
In the area of computer vision, research on object detection algorithms has grown rapidly, as it is the fundamental step for automation, specifically for self-driving vehicles. This work presents a comparison of traditional and deep learning approaches for the task of object detection in traffic scenarios. A handcrafted feature descriptor, the Histogram of Oriented Gradients (HOG), with a linear Support Vector Machine (SVM) classifier is compared with deep learning approaches such as the Single Shot Detector (SSD) and You Only Look Once (YOLO), in terms of mean Average Precision (mAP) and processing speed. The SSD algorithm is implemented with different backbone architectures (VGG16, MobileNetV2 and ResNeXt50), and the YOLO algorithm with MobileNetV1 and ResNet50, to compare the performance of the approaches. Training is performed on the PASCAL VOC 2007 and 2012 training data, and inference on the PASCAL VOC 2007 test data. We consider five classes relevant to traffic scenarios, namely bicycle, bus, car, motorbike and person, for the calculation of mAP. Both qualitative and quantitative results are presented for comparison. For the task of object detection, the deep learning approaches outperform the traditional approach in both accuracy and speed. This is achieved at the cost of requiring a large amount of data, high computation power and time to train a deep learning approach.
3DObjDet-Fusion
SL3D - Single Look 3D Object Detection based on RGB-D Images
Gopi Krishna Erabati, and Helder Araujo
In 2020 Digital Image Computing: Techniques and Applications (DICTA) , Nov 2020
We present SL3D, a Single Look 3D object detection approach to detect 3D objects from an RGB-D image pair. The approach is a proposal-free, single-stage 3D object detection method for RGB-D images that leverages multi-scale feature fusion of RGB and depth feature maps, and multi-layer predictions. The method takes a pair of RGB and depth images as input and outputs predicted 3D bounding boxes. The SL3D neural network comprises two modules: multi-scale feature fusion and multi-layer prediction. The multi-scale feature fusion module fuses the multi-scale features from the RGB and depth feature maps, which are later used by the multi-layer prediction module for 3D object detection. Each location of a prediction layer is assigned a set of predefined 3D prior boxes to account for the varying shapes of 3D objects. The network regresses the predicted 3D bounding boxes as offsets to the set of 3D prior boxes, and duplicate 3D bounding boxes are removed by applying 3D non-maximum suppression. The network is trained end-to-end on the publicly available SUN RGB-D dataset. The SL3D approach with ResNeXt50 achieves 31.77 mAP on the SUN RGB-D test dataset with an inference speed of approximately 4 fps, and with MobileNetV2 it achieves approximately 15 fps with a reduction of around 2 mAP. The quantitative results show that the proposed method achieves performance competitive with state-of-the-art methods on the SUN RGB-D dataset with near real-time inference speed.
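The 3D non-maximum suppression step mentioned in this abstract can be sketched as a greedy loop over score-sorted boxes. This is a simplified sketch assuming axis-aligned boxes given as (cx, cy, cz, w, l, h) with w, l, h measured along x, y, z; the box layout, the 0.5 threshold, and the function names are illustrative assumptions, and oriented boxes would need a rotated-overlap computation instead.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, l, h)."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        high = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    volume_a = a[3] * a[4] * a[5]
    volume_b = b[3] * b[4] * b[5]
    return intersection / (volume_a + volume_b - intersection)


def nms_3d(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou_3d(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0, 0, 0, 2, 2, 2), (0.1, 0, 0, 2, 2, 2), (5, 5, 5, 2, 2, 2)]
    confidences = [0.9, 0.8, 0.7]
    print(nms_3d(candidates, confidences))
```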
2019
2DObjDet
Dynamic Obstacle Detection in Traffic Environments
Gopi Krishna Erabati, and Helder Araujo
In Proceedings of the 13th International Conference on Distributed Smart Cameras , Trento, Italy, Nov 2019
Research on autonomous vehicles has grown rapidly with the advent of neural networks. Dynamic obstacle detection is a fundamental step for self-driving vehicles in traffic environments. This paper presents a comparison of state-of-the-art object detection techniques, namely Faster R-CNN, YOLO and SSD, on 2D image data. Algorithms for detection in driving must be reliable and robust, and should have real-time performance. The three methods are trained and tested on the PASCAL VOC 2007 and 2012 datasets, and both qualitative and quantitative results are presented. The SSD model can be seen as a trade-off between speed and small object detection. A novel method for object detection using 3D data (RGB and depth) is proposed. The proposed model uses a two-stage architecture in which RGB and depth are processed separately and later fused hierarchically. The model will be trained and tested on an RGB-D dataset in the future.
2016
Microwave
Particle-in-cell simulations of CC-TWT for radar transmitters
Latha Christie, and Gopikrishna Erabati
In 2016 International Symposium on Antennas and Propagation (APSYM) , Dec 2016
The TWT is one of the fundamental components in a radar transmitter, and compared to the helix TWT, the Coupled Cavity TWT (CCTWT) gives the highest power over a moderate bandwidth. A complete simulation of the TWT is very useful in avoiding iterations and failures in the TWT development cycle. In this paper, a CCTWT designed in the X-band frequency range has been simulated using the Eigen mode solver and the particle-in-cell solver of the three-dimensional software package CST Microwave Studio. The slow wave structure, along with the couplers, is designed initially with the equivalent circuit approach and later optimized using the Eigen mode solver of CST MWS. The initial estimate of the total number of cavities and the number of cavities per section was obtained using large signal analysis, and was later optimized using the particle-in-cell solver of CST MWS. The simulation predicted an output power of around 1 kW with 27 cavities.
Microwave
Homogeneous and inhomogeneous coupling structures for coupled cavity TWTS
Analysis and optimization of the performance of the coupling structures of a Coupled Cavity Travelling Wave Tube (CC-TWT), which are used to feed RF power into and extract it from the TWT, is presented. The design of the coupling structures includes the design of the hybrid cavity and the design of the stepped impedance transformer for transforming the impedance of the waveguide to that of the cavity. The stepped impedance transformer can be homogeneous or inhomogeneous, depending on the design of the magnets near the coupling end of the tube. In this paper, the complete design of the coupling structure for a coupled cavity TWT using the hybrid cavities and the stepped impedance transformer, of both the homogeneous and the inhomogeneous type, is presented. The analytical design is compared with results from the 3-Dimensional (3D) electromagnetic simulation software CST Microwave Studio (MWS), and the results are presented.
Microwave
Analysis of H-Plane Discontinuity in a Rectangular Waveguide using Mode Matching Technique
Latha Christie, Payel Mondal, Sritama Dutta, and Gopikrishna Erabati
INROADS- An International Journal of Jaipur National University, Jan 2016
Accurate determination of the S-parameters of an H-plane discontinuity using the Mode Matching Technique is presented. The generalized scattering matrix is obtained from the respective field equations. An H-plane discontinuity operating in the X-band has been considered as a case study. The results obtained using the Mode Matching Technique are compared, in terms of accuracy and simulation time, with the equivalent circuit approach and with the 3-D EM simulation software CST MWS and HFSS, which are based on the Finite Integration Technique and the Finite Element Method respectively. The error between the Mode Matching Technique and CST MWS is found to be less than 1%, with the lowest simulation time.
2015
Microwave
Analysis of Propagation Characteristics of Circular Waveguide Loaded with Dielectric Disks Using Coupled Integral Equation Technique
Latha Christie, Gopikrishna Erabati, and Mita Jana
In 2015 Fifth International Conference on Advances in Computing and Communications (ICACC) , Sep 2015
Coupled Integral Equation Technique (CIET) is presented for the study of propagation characteristics of circular waveguide periodically loaded with dielectric disks operating in TM01d mode. CIET is a combination of Mode Matching Technique (MMT) and Method of Moments. The results are presented for three materials having different dielectric constants and compared with 3D simulation tool CST Studio in terms of simulation time and accuracy.
Microwave
Transverse Focusing Structure for TWTs
Mita Jana, Latha Christie, and Gopikrishna Erabati
In 11th International Conference on Microwaves, Antenna, Propagation and Remote Sensing (ICMARS 2015) , Dec 2015