Research & Publication

Our research interests lie at the intersection of robotics, machine learning, and machine vision. We are interested in developing algorithms for an adaptive perception system based on interactive environment exploration and open-ended learning, which enables robots to learn from past experiences and interact with human users. We have evaluated our work on different robotic platforms, including the PR2, robotic arms, and humanoid robots. An up-to-date list of our publications and the corresponding BibTeX files can be found on our Google Scholar profile. In particular, our research is summarized by the following projects:

Harnessing the Synergy between Pushing, Grasping, and Throwing to Enhance Object Manipulation in Cluttered Scenarios

In this work, we delve into the intricate synergy among non-prehensile actions like pushing, and prehensile actions such as grasping and throwing, within the domain of robotic manipulation. We introduce an innovative approach to learning these synergies by leveraging model-free deep reinforcement learning. The robot's workflow involves detecting the pose of the target object and the basket at each time step, predicting the optimal push configuration to isolate the target object, determining the appropriate grasp configuration, and inferring the necessary parameters for an accurate throw into the basket. This empowers robots to skillfully reconfigure cluttered scenarios through pushing, creating space for collision-free grasping actions. Simultaneously, we integrate throwing behavior, showcasing how this action significantly extends the robot's operational reach.
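
To make the workflow above concrete, the snippet below sketches one perceive-push-grasp-throw decision step. All helper names (detect_poses, push_net, grasp_net, throw_net, etc.) are hypothetical placeholders standing in for the learned components, not the released implementation.

```python
# Hypothetical sketch of the perceive-push-grasp-throw loop described above.
# Helper functions are placeholders for the learned components.

def manipulation_step(observation):
    # 1. Perceive: estimate the pose of the target object and the basket.
    target_pose, basket_pose = detect_poses(observation)

    # 2. If the target is surrounded, predict a push that isolates it.
    if is_cluttered(observation, target_pose):
        push_cfg = push_net(observation, target_pose)        # e.g., (x, y, angle)
        execute_push(push_cfg)
        return "pushed"

    # 3. Otherwise predict a collision-free grasp configuration ...
    grasp_cfg = grasp_net(observation, target_pose)

    # 4. ... and the throw parameters that land the object in the basket.
    throw_params = throw_net(grasp_cfg, target_pose, basket_pose)   # release velocity
    execute_grasp_and_throw(grasp_cfg, throw_params)
    return "thrown"
```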

Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models

Large Language Models (LLMs) have emerged as a new paradigm for embodied reasoning and control, most recently by generating robot policy code that utilizes a custom library of vision and control primitive skills. However, prior work fixes the skill library and steers the LLM with carefully hand-crafted prompt engineering, limiting the agent to a stationary range of addressable tasks. In this work, we introduce LRLL, an LLM-based lifelong learning agent that continuously grows the robot skill library to tackle manipulation tasks of ever-growing complexity. LRLL achieves this with four novel contributions: 1) a soft memory module that allows dynamic storage and retrieval of past experiences to serve as context, 2) a self-guided exploration policy that proposes new tasks in simulation, 3) a skill abstractor that distills recent experiences into new library skills, and 4) a lifelong learning algorithm for enabling human users to bootstrap new skills with minimal online interaction. LRLL continuously transfers knowledge from the memory to the library, building composable, general, and interpretable policies, while bypassing gradient-based optimization, thus relieving the learner from catastrophic forgetting.
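
A rough sketch of how the four contributions fit together in one learning wave is given below. Class and method names (SkillLibrary, SoftMemory, explorer, abstractor, etc.) are illustrative placeholders, not the released API.

```python
# Illustrative sketch of the LRLL loop: memory -> exploration -> abstraction -> library.
# All names are hypothetical placeholders.

library = SkillLibrary(primitive_skills)     # vision/control primitive skills
memory = SoftMemory()                        # dynamic storage of past experiences

for wave in range(num_waves):
    # Self-guided exploration: the LLM proposes new tasks to attempt in simulation.
    tasks = explorer.propose_tasks(library, memory)

    for task in tasks:
        context = memory.retrieve(task)                        # relevant past episodes
        policy_code = llm.write_policy(task, library, context)
        result = simulator.run(policy_code)
        memory.store(task, policy_code, result)

    # Skill abstraction: distill recent successful experiences into new library skills.
    library.add(abstractor.distill(memory.recent_successes()))
```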

Self-supervised Learning for Joint Pushing and Grasping Policies in Highly Cluttered Environments

Robotic systems often face challenges when attempting to grasp a target object due to interference from surrounding items. This paper proposes a Deep Reinforcement Learning (DRL) method that develops joint policies for grasping and pushing, enabling effective manipulation of target objects within untrained, densely cluttered environments. In particular, a dual RL model is introduced, which presents high resilience in handling complicated scenes, reaching an average of 98% task completion in simulation and real-world scenes. To evaluate the proposed method, we conduct comprehensive simulation experiments in three distinct environments: densely packed building blocks, randomly positioned building blocks, and common household objects. Further, real-world tests are conducted using actual robots to confirm the robustness of our approach in various untrained and highly cluttered environments.

Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE

Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations (Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework is capable of generating videos with both desired dynamics and content. Experiments demonstrate the ability of the proposed method in generating highly controllable and visually consistent videos, and its capability of modeling dynamical systems. Overall, this work is a significant step towards developing advanced controllable video generation models that can handle complex and dynamic scenes.
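
The snippet below is a toy illustration of the Neural ODE idea underlying this framework: a small network defines the time derivative of a latent state, and integrating it yields latent codes at arbitrary time points that a decoder could map back to frames. It uses a simple fixed-step Euler integrator and is a generic sketch, not the TiV-ODE architecture.

```python
# Generic Neural-ODE sketch (not the TiV-ODE model itself).
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """dz/dt = f(z, t), parameterized by a small MLP."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, dim))

    def forward(self, z, t):
        return self.net(z)

def integrate(f, z0, timestamps, steps_per_interval=10):
    """Fixed-step Euler integration of the latent trajectory."""
    trajectory, z, t = [z0], z0, timestamps[0]
    for t_next in timestamps[1:]:
        dt = (t_next - t) / steps_per_interval
        for _ in range(steps_per_interval):
            z = z + dt * f(z, t)
            t = t + dt
        trajectory.append(z)
    return torch.stack(trajectory)

f = LatentDynamics(dim=64)
z0 = torch.randn(1, 64)                              # latent code of the image + caption
frame_latents = integrate(f, z0, torch.linspace(0.0, 1.0, 16))   # 16 latent frames
```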

Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition

This paper introduces a lifelong ensemble learning method for few-shot object recognition, combining deep learning and 3D shape descriptors. Designed for service robots in dynamic environments, the approach supports open-ended learning, allowing robots to continuously learn and recognize new object categories. The performance of the approach was evaluated through extensive experiments using both real and synthetic datasets, including a unique synthetic dataset of 27,000 views of 90 household objects. Results highlight the method's effectiveness in online few-shot learning and its superiority over existing open-ended learning models, particularly in lifelong learning scenarios. The approach was also validated in both simulated and real-world robot settings, demonstrating rapid learning capabilities from limited examples.

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. We develop a challenging benchmark based on cluttered indoor scenes and propose a novel end-to-end model (CROG) that learns grasp synthesis directly from image-text pairs.

Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition

The Vision Transformer (ViT) architecture has established its place in the computer vision literature; however, training ViTs for RGB-D object recognition remains an understudied topic. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs to RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. We explore the effectiveness of different depth representations and compare early and late RGB-D fusion strategies. We find that late fusion offers significant benefits for both online and offline 3D object recognition, synthetic-to-real visual domain adaptation, as well as open-ended lifelong object category learning for human-robot interaction applications.
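
As a rough illustration of the two fusion strategies (not the exact training recipe from the paper), the snippet below contrasts early fusion, where RGB and depth are concatenated at the input of a single pretrained ViT, with late fusion, where each modality is encoded separately and the embeddings are combined afterwards. It assumes the timm library and a depth map rendered as a three-channel image.

```python
# Early vs. late RGB-D fusion with a pretrained ViT backbone (illustrative only).
import torch
import timm

rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 3, 224, 224)   # depth rendered as a 3-channel image (e.g., colorized)

# Early fusion: concatenate RGB and depth at the input and let one encoder see both.
early_backbone = timm.create_model("vit_base_patch16_224", pretrained=True,
                                   in_chans=6, num_classes=0)
early_feat = early_backbone(torch.cat([rgb, depth], dim=1))          # (1, 768)

# Late fusion: encode each modality separately and combine the embeddings afterwards.
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
late_feat = torch.cat([vit(rgb), vit(depth)], dim=-1)                # (1, 1536)
```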

Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

In this study, we introduce a hybrid Vision Transformer (ViT) and Convolutional Neural Network (CNN) approach for fine-grained visual classification (FGVC) in robotics, aimed at improving object recognition in environments such as retail and households. To overcome the scarcity of fine-grained 3D data, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach outperforms traditional CNN and ViT models, achieving accuracies of 94.50% and 93.51% on the respective datasets. We have made these FGVC RGB-D datasets publicly available, and our method's successful integration into robotic frameworks indicates its potential for enhancing fine-grained perception in both simulated and real-world settings.

Simultaneous Multi-View Object Grasping and Recognition in Open-Ended Domains

Most state-of-the-art approaches tackle object recognition and grasping as two separate problems, even though both use visual input. Such approaches are not suitable for task-informed grasping, where the robot should recognize a specific object first and then grasp and manipulate it to accomplish a task. In this work, we propose a multi-view deep learning approach to handle simultaneous object grasping and recognition in open-ended domains. In particular, our approach takes multiple views of the object as input and jointly estimates a pixel-wise grasp configuration and a deep scale- and rotation-invariant representation. The obtained representation is then used for open-ended object category learning and recognition. Experimental results on benchmark datasets show that our approach outperforms state-of-the-art methods by a large margin in terms of grasping and recognition.

MVGrasp: Real-Time Multi-View 3D Object Grasping in Highly Cluttered Environments

Service robots are increasingly entering our daily lives. In such dynamic environments, a robot frequently faces piled, packed, or isolated objects. It is therefore necessary for the robot to know how to grasp and manipulate various objects in different situations in order to help humans in everyday tasks. Most state-of-the-art grasping approaches address four degrees-of-freedom (DoF) object grasping, where the robot is forced to grasp objects from above based on grasp synthesis of a given top-down scene. Although such approaches have shown very good performance in predefined industrial settings, they are not suitable for human-centric environments, as the robot will not be able to grasp a range of household objects robustly. In this work, we propose a multi-view deep learning approach to handle robust object grasping in human-centric domains. In particular, our approach takes a partial point cloud of a scene as input and then generates multiple views of the existing objects. The obtained views of each object are used to estimate pixel-wise grasp synthesis for each object.
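
A high-level sketch of this multi-view pipeline is given below. All helper names (segment_objects, render_virtual_views, grasp_network, to_6dof_grasp) are hypothetical placeholders for the actual modules, not the released implementation.

```python
# High-level sketch of the multi-view grasping pipeline (placeholder helpers).
import numpy as np

def plan_grasp(scene_point_cloud):
    candidates = []
    for obj_cloud in segment_objects(scene_point_cloud):
        for view in render_virtual_views(obj_cloud):              # e.g., orthographic views
            quality, angle, width = grasp_network(view)           # pixel-wise grasp maps
            u, v = np.unravel_index(quality.argmax(), quality.shape)
            candidates.append((quality[u, v],
                               to_6dof_grasp(view, u, v, angle[u, v], width[u, v])))
    # Execute the highest-quality grasp found across all objects and views.
    return max(candidates, key=lambda c: c[0])[1]
```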

Self-Imitation Learning by Planning

Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, and it is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. In this work, we address this problem with our proposed approach, self-imitation learning by planning (SILP), where demonstration data are collected automatically by planning on the states visited by the current policy. SILP is inspired by the observation that states successfully visited in the early reinforcement learning stage are collision-free nodes in a graph-search-based motion planner, so we can plan over them and relabel the robot's own trials as demonstrations for policy learning. Thanks to these self-generated demonstrations, we relieve the human operator from the laborious data preparation required by IL and RL methods for solving complex motion planning tasks.
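
The snippet below sketches the relabeling step conceptually: states visited by the current policy become nodes of a graph-based motion planner, and planned paths to the goal are stored as demonstrations. Helper names are placeholders, not the released code.

```python
# Conceptual sketch of SILP's self-demonstration collection (placeholder helpers).

def collect_self_demonstrations(env, policy, demo_buffer, episodes=10):
    for _ in range(episodes):
        states = rollout(env, policy)                    # states visited by the current policy
        nodes = [s for s in states if is_collision_free(s)]
        graph = build_roadmap(nodes)                     # connect nearby collision-free states
        path = shortest_path(graph, start=nodes[0], goal=env.goal_state)
        if path is not None:
            demo_buffer.add(path_to_transitions(path))   # relabel as an expert demonstration
```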

3D_DEN: Open-ended 3D Object Recognition using Dynamically Expandable Networks

Service robots have to work independently and adapt to dynamic changes in their environment in real time. One important aspect of such scenarios is continually learning to recognize new object categories as they become available. This combines two main research problems, namely continual learning and 3D object recognition. Most existing approaches use deep Convolutional Neural Networks (CNNs) and focus on image datasets; a modified approach may be needed for continually learning 3D object categories. A major concern when using CNNs is catastrophic forgetting when the model tries to learn a new task. Despite various proposed solutions to mitigate this problem, such solutions still have downsides, e.g., computational complexity, especially when learning a substantial number of tasks. These downsides can pose major problems in robotic scenarios where real-time response plays an essential role. In this work, we propose a new deep transfer learning approach based on a dynamic architectural method to make robots capable of open-ended learning about new 3D object categories.

OrthographicNet: A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains

We present OrthographicNet, a deep transfer learning-based approach for 3D object recognition in open-ended domains. In particular, OrthographicNet generates a rotation- and scale-invariant global feature for a given object, enabling it to recognize the same or similar objects seen from different perspectives. Experimental results show that our approach yields significant improvements over state-of-the-art approaches concerning scalability, memory usage, and object recognition performance. Moreover, OrthographicNet demonstrates the capability of learning new categories from very few examples on-site. Regarding real-time performance, three real-world demonstrations validate the promising performance of the proposed architecture.

Combining Shape Features with Multiple Color Spaces in Open-Ended 3D Object Recognition

Considering the expansion of robot applications in more complex and dynamic environments, it is evident that it is not possible to pre-program all object categories and anticipate all exceptions in advance. Therefore, robots should have the ability to learn about new object categories in an open-ended fashion while working in the environment. Towards this goal, we propose a deep transfer learning approach to generate a scale- and pose-invariant object representation by considering shape and texture information in multiple color spaces. The obtained global object representation is then fed to an instance-based object category learning and recognition module, where a non-expert human user is in the learning loop and can interactively guide the process of experience acquisition by teaching new object categories or by correcting insufficient or erroneous categories. In this work, shape information encodes the common patterns of all categories, while texture information is used to describe the appearance of each instance in detail. Multiple color space combinations and network architectures are evaluated to find the most descriptive system.

Few-Shot Visual Grounding for Natural Human-Robot Interaction

Natural Human-Robot Interaction (HRI) is one of the key components for service robots to be able to work in human-centric environments. In such dynamic environments, the robot needs to understand the intention of the user to accomplish a task successfully. Towards addressing this point, we propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. Unlike most grounding methods that tackle the challenge using pre-trained object detectors via a two-stepped process, we develop a single stage zero-shot model that is able to provide predictions in unseen data. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets. Experimental results showed that the proposed model performs well in terms of accuracy and speed, while showcasing robustness to variation in the natural language input.

Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition

Despite the recent success of state-of-the-art 3D object recognition approaches, service robots frequently fail to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time responses under changing and unpredictable environmental conditions. Most recent approaches use either shape information only, ignoring the role of color information, or vice versa. Furthermore, they mainly utilize the Ln Minkowski family of distance functions to measure the similarity of two object views, while there are various other distance measures applicable to comparing two object views. In this paper, we explore the importance of shape information, color constancy, color spaces, and various similarity measures in open-ended 3D object recognition.
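
For illustration, a few of the (dis)similarity measures that can be compared in this setting are implemented below for two normalized feature vectors; these are standard formulas, not code from the paper.

```python
# Standard (dis)similarity measures between two object representations p and q
# (e.g., normalized histograms), beyond the Minkowski family.
import numpy as np

def minkowski(p, q, n=2):
    return np.sum(np.abs(p - q) ** n) ** (1.0 / n)      # n=1: Manhattan, n=2: Euclidean

def chi_square(p, q, eps=1e-12):
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def kl_divergence(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def cosine_distance(p, q):
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
```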

Local-HDP: Interactive Open-Ended 3D Object Categorization

We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment over time. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features into high-level conceptual topics for 3D object categorization. However, the efficiency and accuracy of LDA-based approaches depend on the number of topics, which is chosen manually. Moreover, fixing the number of topics for all categories can lead to overfitting or underfitting of the model. In contrast, the proposed Local-HDP can autonomously determine the number of topics for each category. Furthermore, an inference method is proposed that results in a fast posterior approximation.

*This research was done in collaboration with Hamed Ayoobi

The State of Service Robots: Current Bottlenecks in Object Perception and Manipulation

Nowadays, robots are able to recognize various objects and quickly plan a collision-free trajectory to grasp a target object. Despite these successes, robots must still be painstakingly programmed in advance to perform a set of predefined tasks. Besides, in most cases, there is a reliance on large amounts of training data. Therefore, these approaches are still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In this paper, we review advances in service robots from object perception to complex object manipulation and shed light on the current challenges and bottlenecks.

Accelerating Reinforcement Learning for Reaching using Continuous Curriculum Learning

Reinforcement learning has shown great promise for training robot behavior due to its sequential decision-making nature. However, the enormous amount of interactive and informative training data required remains the major stumbling block for progress. In this study, we focus on accelerating reinforcement learning (RL) training and improving performance on multi-goal reaching tasks. Specifically, we propose a precision-based continuous curriculum learning (PCCL) method in which the requirements are gradually adjusted during the training process, instead of fixing the parameters in a static schedule. To this end, we explore various continuous curriculum strategies for controlling the training process. The approach is tested using a Universal Robots UR5e arm in both simulation and real-world scenarios.
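
The core idea is that the precision required to count a reach as successful starts loose and is tightened continuously over training. The sketch below shows one possible continuous schedule; the exponential form and the constants are illustrative, not the exact strategy reported in the paper.

```python
# Illustrative precision-based continuous curriculum (constants are hypothetical).
import numpy as np

def precision_schedule(step, total_steps, eps_start=0.10, eps_final=0.01):
    """Success tolerance (in meters) as a function of training progress."""
    progress = min(step / total_steps, 1.0)
    return eps_final + (eps_start - eps_final) * np.exp(-5.0 * progress)

def reached(gripper_pos, goal_pos, step, total_steps):
    tol = precision_schedule(step, total_steps)
    return np.linalg.norm(np.asarray(gripper_pos) - np.asarray(goal_pos)) <= tol
```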

Look Further to Recognize Better: Learning Shared Topics and Category-Specific Dictionaries for Open-Ended 3D Object Recognition

In human-centric environments, fine-grained object categorization is as essential as basic-level categorization. In this work, each object is represented using a set of general latent topics and category-specific dictionaries. The general topics encode the common patterns of all categories, while the category-specific dictionaries describe the content of each category in detail. Both sets of representations are discovered in an unsupervised fashion and updated incrementally using new object views.

Interactive Open-Ended Learning Approach for Recognizing 3D Object Category and Grasp Affordance Concurrently

This paper presents an interactive open-ended learning approach to recognize multiple objects and their grasp affordances concurrently. This is an important contribution in the field of service robots since no matter how extensive the training data used for batch learning, a robot might always be confronted with an unknown object when operating in human-centric environments. Our approach has two main branches. The first branch is related to open-ended 3D object category learning and recognition. The second branch is associated with learning and recognizing the configuration of grasps in a reasonable amount of time.

Learning to Grasp 3D Objects using Deep Residual U-Nets

In this study, we present a new deep learning approach to detect object affordances for a given 3D object. The method trains a Convolutional Neural Network (CNN) to learn a set of grasping features from RGB-D images. We name our approach Res-U-Net since the network architecture is based on the U-Net structure and residual network-styled blocks. It is devised to be robust and efficient to compute and use. A set of experiments has been performed to assess the grasp success rate of the proposed approach in simulated robotic scenarios. The experiments validate the promising performance of the proposed architecture on the ShapeNetCore dataset and in simulated robot scenarios.

Coping with Context Change in Open-Ended Object Recognition without Explicit Context Information

To deploy a robot in a human-centric environment, it is important that the robot is able to continuously acquire and update object categories while working in the environment. Therefore, autonomous robots must have the ability to continuously execute learning and recognition in a concurrent or interleaved fashion. One of the main challenges in unconstrained human environments is coping with the effects of context change. This paper presents two main contributions: (i) an approach for evaluating open-ended object category learning and recognition methods in multi-context scenarios; and (ii) an evaluation of different object category learning and recognition approaches regarding their ability to cope with the effects of context change. Offline evaluation approaches such as cross-validation do not comply with the simultaneous nature of learning and recognition, so a teaching protocol supporting context change was designed and used in this work for experimental evaluation. Seven learning and recognition approaches were evaluated and compared using the protocol. The best performance, in terms of the number of learned categories, was obtained with a recently proposed local variant of Latent Dirichlet Allocation (LDA), closely followed by a Bag-of-Words (BoW) approach. In terms of adaptability, i.e., coping with context change, the best result was obtained with BoW, immediately followed by the local LDA variant.

Perceiving, Learning, and Recognizing 3D Objects: An Approach to Cognitive Service Robots

This paper proposes a cognitive architecture designed to support concurrent 3D object category learning and recognition in an interactive and open-ended manner. In particular, this cognitive architecture provides automatic perception capabilities that allow robots to detect objects in highly crowded scenes and learn new object categories from the set of accumulated experiences in an incremental and open-ended way. Moreover, it supports constructing the full model of an unknown object in an online manner and predicting the next best view to improve object detection and manipulation performance.

Active Multi-View 6D Object Pose Estimation and Camera Motion Planning in the Crowd

In this project, we developed a novel unsupervised Next-Best-View (NBV) prediction algorithm to improve object detection and manipulation performance. In particular, the ability to predict the NBV point is important for mobile robots performing tasks in everyday environments. In active scenarios, whenever the robot fails to detect or manipulate objects from the current viewpoint, it predicts the next best view position, moves there, and captures a new scene to improve its knowledge of the environment. This can increase object detection and manipulation performance.

Hierarchical Object Representation for Open-Ended Object Category Learning and Recognition (Local LDA)

This paper proposes an open-ended 3D object recognition system which concurrently learns both the object categories and the statistical features for encoding objects. In particular, we propose an extension of Latent Dirichlet Allocation to learn structural semantic features (i.e., topics) from low-level feature co-occurrences for each category independently. Moreover, topics in each category are discovered in an unsupervised fashion and are updated incrementally using new object views. In this way, the advantages of both local hand-crafted and structural semantic features are considered in an efficient way.

GOOD: A Global Orthographic Object Descriptor for 3D Object Recognition and Manipulation

The Global Orthographic Object Descriptor (GOOD) has been designed to be robust, descriptive, and efficient to compute and use. The GOOD descriptor has two outstanding characteristics: (1) it provides a good trade-off among descriptiveness, robustness, computation time, and memory usage; (2) it allows concurrent object recognition and pose estimation for manipulation. The performance of the proposed object descriptor is compared with the main state-of-the-art descriptors. Experimental results show that the overall classification performance obtained with GOOD is comparable to the best performances obtained with the state-of-the-art descriptors. Concerning memory and computation time, GOOD clearly outperforms the other descriptors. The current implementation of the GOOD descriptor supports several functionalities for 3D object recognition and object manipulation.
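
The snippet below is a simplified sketch of the central step: projecting a pose-normalized object point cloud onto three orthogonal planes and turning each projection into a normalized distribution matrix, then concatenating the results. The full descriptor also constructs a local reference frame and orders and disambiguates the projection planes, which is omitted here.

```python
# Simplified, GOOD-like descriptor sketch (reference-frame construction omitted).
import numpy as np

def distribution_matrix(points_2d, bins=15):
    hist, _, _ = np.histogram2d(points_2d[:, 0], points_2d[:, 1], bins=bins)
    return hist / max(hist.sum(), 1.0)                 # normalize to a distribution

def good_like_descriptor(points, bins=15):
    """points: (N, 3) object point cloud expressed in its local reference frame."""
    planes = [points[:, [0, 1]],    # XoY projection
              points[:, [0, 2]],    # XoZ projection
              points[:, [1, 2]]]    # YoZ projection
    return np.concatenate([distribution_matrix(p, bins).ravel() for p in planes])
```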

Towards Lifelong Assistive Robotics: A Tight Coupling between Object Perception and Manipulation

In this work, we propose a cognitive architecture designed to create a tight coupling between perception and manipulation for assistive robots. This is necessary for assistive robots not only to perform manipulation tasks in a reasonable amount of time and in an appropriate manner, but also to robustly adapt to new environments by handling new objects. In particular, this cognitive architecture provides perception capabilities that allow robots to incrementally learn object categories from the set of accumulated experiences and reason about how to perform complex tasks.

Interactive Open-Ended Learning for 3D Object Recognition: An Approach and Experiments

This work presents an efficient approach capable of learning and recognizing object categories in an interactive and open-ended manner. In particular, we focus on two key questions: (1) How can the robot automatically detect, conceptualize, and recognize objects in 3D scenes in an open-ended manner? (2) How can it acquire and use high-level knowledge obtained from interaction with human users, namely when they provide category labels, in order to improve the system's performance?

Learning to Grasp Familiar Objects using Object View Recognition and Template Matching

In this work, interactive object view learning and recognition capabilities are integrated in the process of learning and recognizing grasps. The object view recognition module uses an interactive incremental learning approach to recognize object view labels. The grasp pose learning approach uses local and global visual features of a demonstrated grasp to learn a grasp template associated with the recognized object view. A grasp distance measure based on Mahalanobis distance is used in a grasp template matching approach to recognize an appropriate grasp pose.
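
The snippet below gives a minimal sketch of the template-matching step, assuming each learned template stores the mean and covariance of its associated grasp features; feature extraction and template learning are omitted, and the structure is illustrative rather than the exact implementation.

```python
# Minimal sketch of Mahalanobis-based grasp template matching.
import numpy as np

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def match_template(query_feature, templates):
    """templates: dict mapping template name -> (mean vector, covariance matrix)."""
    return min(templates,
               key=lambda name: mahalanobis(query_feature, *templates[name]))
```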

Humanoid Robots (RoboCup-HL)

After obtaining extensive knowledge about real-time intelligent robotic systems in the Middle Size League, I turned to building humanoid robots and formed two new robotic teams, Persia and BehRobot, to participate in the RoboCup humanoid leagues. We worked on three different types of humanoid robots: kid-size (height 59 cm, weight 4 kg), teen-size (height 93 cm, weight 7 kg), and adult-size (height 155 cm, weight 11.5 kg). We were one of the successful teams in the humanoid leagues and achieved several rankings in national and international competitions.

Middle Size Soccer Robots (RoboCup-MSL)

During the second year of my undergraduate program, I became familiar with the RoboCup competitions. In 2006, I formed a Middle Size League soccer robot team (RoboCup-MSL) named ADRO. We built five player robots and one goalkeeper robot with a similar structure, the goalkeeper being equipped with some additional accessories and sensors. Through this teamwork, I took an active role in developing the robots' software. Furthermore, I worked on the mechanical design of the robots using Autodesk Inventor. We achieved several rankings in national and international RoboCup competitions.

Contact



Dr. Hamidreza Kasaei
Artificial Intelligence Department,
University of Groningen,
Bernoulliborg building,
Nijenborgh 9, 9747 AG Groningen,
The Netherlands.
Office: 340
Tel: +31-50-363-33926
E-mail: hamidreza.kasaei@rug.nl