Research & Publication

Our research interests lie at the intersection of robotics, machine learning, and machine vision. We are interested in developing algorithms for an adaptive perception system based on interactive environment exploration and open-ended learning, which enables robots to learn from past experiences and interact with human users. We have evaluated our work on different robotic platforms, including the PR2, robotic arms, and humanoid robots. An up-to-date list of our publications and the corresponding BibTeX files can be found on our Google Scholar profile. In particular, our research is summarized by the following projects:

Harnessing the Synergy between Pushing, Grasping, and Throwing to Enhance Object Manipulation in Cluttered Scenarios

In this work, we delve into the intricate synergy among non-prehensile actions like pushing, and prehensile actions such as grasping and throwing, within the domain of robotic manipulation. We introduce an innovative approach to learning these synergies by leveraging model-free deep reinforcement learning. The robot's workflow involves detecting the pose of the target object and the basket at each time step, predicting the optimal push configuration to isolate the target object, determining the appropriate grasp configuration, and inferring the necessary parameters for an accurate throw into the basket. This empowers robots to skillfully reconfigure cluttered scenarios through pushing, creating space for collision-free grasping actions. Simultaneously, we integrate throwing behavior, showcasing how this action significantly extends the robot's operational reach.
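
To make the workflow above concrete, the snippet below sketches one perceive-push-grasp-throw decision step. All helper names (detect_poses, push_net, grasp_net, throw_net, etc.) are hypothetical placeholders standing in for the learned components, not the released implementation.

```python
# Hypothetical sketch of the perceive-push-grasp-throw loop described above.
# Helper functions are placeholders for the learned components.

def manipulation_step(observation):
    # 1. Perceive: estimate the pose of the target object and the basket.
    target_pose, basket_pose = detect_poses(observation)

    # 2. If the target is surrounded, predict a push that isolates it.
    if is_cluttered(observation, target_pose):
        push_cfg = push_net(observation, target_pose)        # e.g., (x, y, angle)
        execute_push(push_cfg)
        return "pushed"

    # 3. Otherwise predict a collision-free grasp configuration ...
    grasp_cfg = grasp_net(observation, target_pose)

    # 4. ... and the throw parameters that land the object in the basket.
    throw_params = throw_net(grasp_cfg, target_pose, basket_pose)   # release velocity
    execute_grasp_and_throw(grasp_cfg, throw_params)
    return "thrown"
```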

Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models

Large Language Models (LLMs) have emerged as a new paradigm for embodied reasoning and control, most recently by generating robot policy code that utilizes a custom library of vision and control primitive skills. However, prior work fixes the skill library and steers the LLM with carefully hand-crafted prompt engineering, limiting the agent to a stationary range of addressable tasks. In this work, we introduce LRLL, an LLM-based lifelong learning agent that continuously grows the robot skill library to tackle manipulation tasks of ever-growing complexity. LRLL achieves this with four novel contributions: 1) a soft memory module that allows dynamic storage and retrieval of past experiences to serve as context, 2) a self-guided exploration policy that proposes new tasks in simulation, 3) a skill abstractor that distills recent experiences into new library skills, and 4) a lifelong learning algorithm for enabling human users to bootstrap new skills with minimal online interaction. LRLL continuously transfers knowledge from the memory to the library, building composable, general, and interpretable policies, while bypassing gradient-based optimization, thus relieving the learner from catastrophic forgetting.
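
A rough sketch of how the four contributions fit together in one learning wave is given below. Class and method names (SkillLibrary, SoftMemory, explorer, abstractor, etc.) are illustrative placeholders, not the released API.

```python
# Illustrative sketch of the LRLL loop: memory -> exploration -> abstraction -> library.
# All names are hypothetical placeholders.

library = SkillLibrary(primitive_skills)     # vision/control primitive skills
memory = SoftMemory()                        # dynamic storage of past experiences

for wave in range(num_waves):
    # Self-guided exploration: the LLM proposes new tasks to attempt in simulation.
    tasks = explorer.propose_tasks(library, memory)

    for task in tasks:
        context = memory.retrieve(task)                        # relevant past episodes
        policy_code = llm.write_policy(task, library, context)
        result = simulator.run(policy_code)
        memory.store(task, policy_code, result)

    # Skill abstraction: distill recent successful experiences into new library skills.
    library.add(abstractor.distill(memory.recent_successes()))
```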

Self-supervised Learning for Joint Pushing and Grasping Policies in Highly Cluttered Environments

Robotic systems often face challenges when attempting to grasp a target object due to interference from surrounding items. This paper proposes a Deep Reinforcement Learning (DRL) method that develops joint policies for grasping and pushing, enabling effective manipulation of target objects within untrained, densely cluttered environments. In particular, a dual RL model is introduced, which presents high resilience in handling complicated scenes, reaching an average of 98% task completion in simulation and real-world scenes. To evaluate the proposed method, we conduct comprehensive simulation experiments in three distinct environments: densely packed building blocks, randomly positioned building blocks, and common household objects. Further, real-world tests are conducted using actual robots to confirm the robustness of our approach in various untrained and highly cluttered environments.

Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE

Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations (Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework is capable of generating videos with both desired dynamics and content. Experiments demonstrate the ability of the proposed method in generating highly controllable and visually consistent videos, and its capability of modeling dynamical systems. Overall, this work is a significant step towards developing advanced controllable video generation models that can handle complex and dynamic scenes.
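
The snippet below is a toy illustration of the Neural ODE idea underlying this framework: a small network defines the time derivative of a latent state, and integrating it yields latent codes at arbitrary time points that a decoder could map back to frames. It uses a simple fixed-step Euler integrator and is a generic sketch, not the TiV-ODE architecture.

```python
# Generic Neural-ODE sketch (not the TiV-ODE model itself).
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """dz/dt = f(z, t), parameterized by a small MLP."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, dim))

    def forward(self, z, t):
        return self.net(z)

def integrate(f, z0, timestamps, steps_per_interval=10):
    """Fixed-step Euler integration of the latent trajectory."""
    trajectory, z, t = [z0], z0, timestamps[0]
    for t_next in timestamps[1:]:
        dt = (t_next - t) / steps_per_interval
        for _ in range(steps_per_interval):
            z = z + dt * f(z, t)
            t = t + dt
        trajectory.append(z)
    return torch.stack(trajectory)

f = LatentDynamics(dim=64)
z0 = torch.randn(1, 64)                              # latent code of the image + caption
frame_latents = integrate(f, z0, torch.linspace(0.0, 1.0, 16))   # 16 latent frames
```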

Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition

This paper introduces a lifelong ensemble learning method for few-shot object recognition, combining deep learning and 3D shape descriptors. Designed for service robots in dynamic environments, the approach supports open-ended learning, allowing robots to continuously learn and recognize new object categories. The performance of the approach was evaluated through extensive experiments using both real and synthetic datasets, including a unique synthetic dataset of 27,000 views of 90 household objects. Results highlight the method's effectiveness in online few-shot learning and its superiority over existing open-ended learning models, particularly in lifelong learning scenarios. The approach was also validated in both simulated and real-world robot settings, demonstrating rapid learning capabilities from limited examples.

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. We develop a challenging benchmark based on cluttered indoor scenes and propose a novel end-to-end model (CROG) that learns grasp synthesis directly from image-text pairs.

Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition

The Vision Transformer (ViT) architecture has established its place in the computer vision literature; however, training ViTs for RGB-D object recognition remains an understudied topic. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs to RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. We explore the effectiveness of different depth representations and compare early and late RGB-D fusion strategies. We find that late fusion offers significant benefits for both online and offline 3D object recognition, synthetic-to-real visual domain adaptation, as well as open-ended lifelong object category learning for human-robot interaction applications.
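
As a rough illustration of the two fusion strategies (not the exact training recipe from the paper), the snippet below contrasts early fusion, where RGB and depth are concatenated at the input of a single pretrained ViT, with late fusion, where each modality is encoded separately and the embeddings are combined afterwards. It assumes the timm library and a depth map rendered as a three-channel image.

```python
# Early vs. late RGB-D fusion with a pretrained ViT backbone (illustrative only).
import torch
import timm

rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 3, 224, 224)   # depth rendered as a 3-channel image (e.g., colorized)

# Early fusion: concatenate RGB and depth at the input and let one encoder see both.
early_backbone = timm.create_model("vit_base_patch16_224", pretrained=True,
                                   in_chans=6, num_classes=0)
early_feat = early_backbone(torch.cat([rgb, depth], dim=1))          # (1, 768)

# Late fusion: encode each modality separately and combine the embeddings afterwards.
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
late_feat = torch.cat([vit(rgb), vit(depth)], dim=-1)                # (1, 1536)
```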

Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

In this study, we introduce a hybrid Vision Transformer (ViT) and Convolutional Neural Network (CNN) approach for fine-grained visual classification (FGVC) in robotics, aimed at improving object recognition in environments such as retail and households. To overcome the scarcity of fine-grained 3D data, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach outperforms traditional CNN and ViT models, achieving accuracies of 94.50% and 93.51% on the respective datasets. We have made these FGVC RGB-D datasets publicly available, and our method's successful integration into robotic frameworks indicates its potential for enhancing fine-grained perception in both simulated and real-world settings.

Simultaneous Multi-View Object Grasping and Recognition in Open-Ended Domains

Most state-of-the-art approaches tackle object recognition and grasping as two separate problems, even though both use visual input. Such approaches are not suitable for task-informed grasping, where the robot should recognize a specific object first and then grasp and manipulate it to accomplish a task. In this work, we propose a multi-view deep learning approach to handle simultaneous object grasping and recognition in open-ended domains. In particular, our approach takes multiple views of the object as input and jointly estimates a pixel-wise grasp configuration and a deep scale- and rotation-invariant representation. The obtained representation is then used for open-ended object category learning and recognition. Experimental results on benchmark datasets show that our approach outperforms state-of-the-art methods by a large margin in terms of grasping and recognition.

MVGrasp: Real-Time Multi-View 3D Object Grasping in Highly Cluttered Environments

Service robots are increasingly entering our daily lives. In such dynamic environments, a robot frequently faces piled, packed, or isolated objects. It is therefore necessary for the robot to know how to grasp and manipulate various objects in different situations in order to help humans in everyday tasks. Most state-of-the-art grasping approaches address four degrees-of-freedom (DoF) object grasping, where the robot is forced to grasp objects from above based on grasp synthesis of a given top-down scene. Although such approaches have shown very good performance in predefined industrial settings, they are not suitable for human-centric environments, as the robot will not be able to grasp a range of household objects robustly. In this work, we propose a multi-view deep learning approach to handle robust object grasping in human-centric domains. In particular, our approach takes a partial point cloud of a scene as input and then generates multiple views of the existing objects. The obtained views of each object are used to estimate pixel-wise grasp synthesis for each object.
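
A high-level sketch of this multi-view pipeline is given below. All helper names (segment_objects, render_virtual_views, grasp_network, to_6dof_grasp) are hypothetical placeholders for the actual modules, not the released implementation.

```python
# High-level sketch of the multi-view grasping pipeline (placeholder helpers).
import numpy as np

def plan_grasp(scene_point_cloud):
    candidates = []
    for obj_cloud in segment_objects(scene_point_cloud):
        for view in render_virtual_views(obj_cloud):              # e.g., orthographic views
            quality, angle, width = grasp_network(view)           # pixel-wise grasp maps
            u, v = np.unravel_index(quality.argmax(), quality.shape)
            candidates.append((quality[u, v],
                               to_6dof_grasp(view, u, v, angle[u, v], width[u, v])))
    # Execute the highest-quality grasp found across all objects and views.
    return max(candidates, key=lambda c: c[0])[1]
```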

Self-Imitation Learning by Planning

Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, and it is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. In this work, we address this problem with our proposed approach, self-imitation learning by planning (SILP), where demonstration data are collected automatically by planning on the states visited by the current policy. SILP is inspired by the observation that states successfully visited in the early reinforcement learning stage are collision-free nodes in a graph-search-based motion planner, so we can plan over them and relabel the robot's own trials as demonstrations for policy learning. Thanks to these self-generated demonstrations, we relieve the human operator from the laborious data preparation required by IL and RL methods for solving complex motion planning tasks.
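
The snippet below sketches the relabeling step conceptually: states visited by the current policy become nodes of a graph-based motion planner, and planned paths to the goal are stored as demonstrations. Helper names are placeholders, not the released code.

```python
# Conceptual sketch of SILP's self-demonstration collection (placeholder helpers).

def collect_self_demonstrations(env, policy, demo_buffer, episodes=10):
    for _ in range(episodes):
        states = rollout(env, policy)                    # states visited by the current policy
        nodes = [s for s in states if is_collision_free(s)]
        graph = build_roadmap(nodes)                     # connect nearby collision-free states
        path = shortest_path(graph, start=nodes[0], goal=env.goal_state)
        if path is not None:
            demo_buffer.add(path_to_transitions(path))   # relabel as an expert demonstration
```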

3D_DEN: Open-ended 3D Object Recognition using Dynamically Expandable Networks

Service robots have to work independently and adapt to dynamic changes in their environment in real time. One important aspect of such scenarios is continually learning to recognize new object categories as they become available. This combines two main research problems, namely continual learning and 3D object recognition. Most existing approaches use deep Convolutional Neural Networks (CNNs) and focus on image datasets; a modified approach may be needed for continually learning 3D object categories. A major concern when using CNNs is catastrophic forgetting when the model tries to learn a new task. Despite various proposed solutions to mitigate this problem, such solutions still have downsides, e.g., computational complexity, especially when learning a substantial number of tasks. These downsides can pose major problems in robotic scenarios where real-time response plays an essential role. In this work, we propose a new deep transfer learning approach based on a dynamic architectural method to make robots capable of open-ended learning about new 3D object categories.

OrthographicNet: A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains

We present OrthographicNet, a deep transfer learning-based approach for 3D object recognition in open-ended domains. In particular, OrthographicNet generates a rotation- and scale-invariant global feature for a given object, enabling it to recognize the same or similar objects seen from different perspectives. Experimental results show that our approach yields significant improvements over state-of-the-art approaches concerning scalability, memory usage, and object recognition performance. Moreover, OrthographicNet demonstrates the capability of learning new categories from very few examples on-site. Regarding real-time performance, three real-world demonstrations validate the promising performance of the proposed architecture.

Combining Shape Features with Multiple Color Spaces in Open-Ended 3D Object Recognition

Considering the expansion of robot applications in more complex and dynamic environments, it is evident that it is not possible to pre-program all object categories and anticipate all exceptions in advance. Therefore, robots should have the ability to learn about new object categories in an open-ended fashion while working in the environment. Towards this goal, we propose a deep transfer learning approach to generate a scale- and pose-invariant object representation by considering shape and texture information in multiple color spaces. The obtained global object representation is then fed to an instance-based object category learning and recognition module, where a non-expert human user is in the learning loop and can interactively guide the process of experience acquisition by teaching new object categories or by correcting insufficient or erroneous categories. In this work, shape information encodes the common patterns of all categories, while texture information is used to describe the appearance of each instance in detail. Multiple color space combinations and network architectures are evaluated to find the most descriptive system.

Few-Shot Visual Grounding for Natural Human-Robot Interaction

Natural Human-Robot Interaction (HRI) is one of the key components for service robots to be able to work in human-centric environments. In such dynamic environments, the robot needs to understand the intention of the user to accomplish a task successfully. Towards addressing this point, we propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. Unlike most grounding methods that tackle the challenge using pre-trained object detectors via a two-stepped process, we develop a single stage zero-shot model that is able to provide predictions in unseen data. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets. Experimental results showed that the proposed model performs well in terms of accuracy and speed, while showcasing robustness to variation in the natural language input.

Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition

Despite the recent success of state-of-the-art 3D object recognition approaches, service robots frequently fail to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time responses under changing and unpredictable environmental conditions. Most recent approaches use either shape information only, ignoring the role of color information, or vice versa. Furthermore, they mainly utilize the Ln Minkowski family of distance functions to measure the similarity of two object views, while there are various other distance measures applicable to comparing two object views. In this paper, we explore the importance of shape information, color constancy, color spaces, and various similarity measures in open-ended 3D object recognition.
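
For illustration, a few of the (dis)similarity measures that can be compared in this setting are implemented below for two normalized feature vectors; these are standard formulas, not code from the paper.

```python
# Standard (dis)similarity measures between two object representations p and q
# (e.g., normalized histograms), beyond the Minkowski family.
import numpy as np

def minkowski(p, q, n=2):
    return np.sum(np.abs(p - q) ** n) ** (1.0 / n)      # n=1: Manhattan, n=2: Euclidean

def chi_square(p, q, eps=1e-12):
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def kl_divergence(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def cosine_distance(p, q):
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
```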

Local-HDP: Interactive Open-Ended 3D Object Categorization

We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment over time. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features into high-level conceptual topics for 3D object categorization. However, the efficiency and accuracy of LDA-based approaches depend on the number of topics, which is chosen manually. Moreover, fixing the number of topics for all categories can lead to overfitting or underfitting of the model. In contrast, the proposed Local-HDP can autonomously determine the number of topics for each category. Furthermore, an inference method is proposed that results in a fast posterior approximation.

*This research was done in collaboration with Hamed Ayoobi

The State of Service Robots: Current Bottlenecks in Object Perception and Manipulation

Nowadays, robots are able to recognize various objects and quickly plan a collision-free trajectory to grasp a target object. Despite these successes, robots must still be painstakingly programmed in advance to perform a set of predefined tasks. Besides, in most cases, there is a reliance on large amounts of training data. Therefore, these approaches are still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In this paper, we review advances in service robots from object perception to complex object manipulation and shed light on the current challenges and bottlenecks.

Accelerating Reinforcement Learning for Reaching using Continuous Curriculum Learning

Reinforcement learning has shown great promise for training robot behavior due to its sequential decision-making nature. However, the enormous amount of interactive and informative training data required remains the major stumbling block for progress. In this study, we focus on accelerating reinforcement learning (RL) training and improving performance on multi-goal reaching tasks. Specifically, we propose a precision-based continuous curriculum learning (PCCL) method in which the requirements are gradually adjusted during the training process, instead of fixing the parameters in a static schedule. To this end, we explore various continuous curriculum strategies for controlling the training process. The approach is tested using a Universal Robots UR5e arm in both simulation and real-world scenarios.
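
The core idea is that the precision required to count a reach as successful starts loose and is tightened continuously over training. The sketch below shows one possible continuous schedule; the exponential form and the constants are illustrative, not the exact strategy reported in the paper.

```python
# Illustrative precision-based continuous curriculum (constants are hypothetical).
import numpy as np

def precision_schedule(step, total_steps, eps_start=0.10, eps_final=0.01):
    """Success tolerance (in meters) as a function of training progress."""
    progress = min(step / total_steps, 1.0)
    return eps_final + (eps_start - eps_final) * np.exp(-5.0 * progress)

def reached(gripper_pos, goal_pos, step, total_steps):
    tol = precision_schedule(step, total_steps)
    return np.linalg.norm(np.asarray(gripper_pos) - np.asarray(goal_pos)) <= tol
```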

Look Further to Recognize Better: Learning Shared Topics and Category-Specific Dictionaries for Open-Ended 3D Object Recognition

In human-centric environments, fine-grained object categorization is as essential as basic-level categorization. In this work, each object is represented using a set of general latent topics and category-specific dictionaries. The general topics encode the common patterns of all categories, while the category-specific dictionaries describe the content of each category in detail. Both sets of representations are discovered in an unsupervised fashion and updated incrementally using new object views.

Interactive Open-Ended Learning Approach for Recognizing 3D Object Category and Grasp Affordance Concurrently

This paper presents an interactive open-ended learning approach to recognize multiple objects and their grasp affordances concurrently. This is an important contribution in the field of service robots since no matter how extensive the training data used for batch learning, a robot might always be confronted with an unknown object when operating in human-centric environments. Our approach has two main branches. The first branch is related to open-ended 3D object category learning and recognition. The second branch is associated with learning and recognizing the configuration of grasps in a reasonable amount of time.

Learning to Grasp 3D Objects using Deep Residual U-Nets

In this study, we present a new deep learning approach to detect object affordances for a given 3D object. The method trains a Convolutional Neural Network (CNN) to learn a set of grasping features from RGB-D images. We name our approach Res-U-Net since the network architecture is based on the U-Net structure and residual network-styled blocks. It is devised to be robust and efficient to compute and use. A set of experiments has been performed to assess the grasp success rate of the proposed approach in simulated robotic scenarios. The experiments validate the promising performance of the proposed architecture on the ShapeNetCore dataset and in simulated robot scenarios.

Coping with Context Change in Open-Ended Object Recognition without Explicit Context Information

To deploy a robot in a human-centric environment, it is important that the robot is able to continuously acquire and update object categories while working in the environment. Therefore, autonomous robots must have the ability to continuously execute learning and recognition in a concurrent or interleaved fashion. One of the main challenges in unconstrained human environments is coping with the effects of context change. This paper presents two main contributions: (i) an approach for evaluating open-ended object category learning and recognition methods in multi-context scenarios; and (ii) an evaluation of different object category learning and recognition approaches regarding their ability to cope with the effects of context change. Offline evaluation approaches such as cross-validation do not comply with the simultaneous nature of learning and recognition, so a teaching protocol supporting context change was designed and used in this work for experimental evaluation. Seven learning and recognition approaches were evaluated and compared using the protocol. The best performance, in terms of the number of learned categories, was obtained with a recently proposed local variant of Latent Dirichlet Allocation (LDA), closely followed by a Bag-of-Words (BoW) approach. In terms of adaptability, i.e., coping with context change, the best result was obtained with BoW, immediately followed by the local LDA variant.

Perceiving, Learning, and Recognizing 3D Objects: An Approach to Cognitive Service Robots

This paper proposes a cognitive architecture designed to support concurrent 3D object category learning and recognition in an interactive and open-ended manner. In particular, this cognitive architecture provides automatic perception capabilities that allow robots to detect objects in highly crowded scenes and learn new object categories from the set of accumulated experiences in an incremental and open-ended way. Moreover, it supports constructing the full model of an unknown object in an online manner and predicting the next best view to improve object detection and manipulation performance.

Active Multi-View 6D Object Pose Estimation and Camera Motion Planning in the Crowd

In this project, we developed a novel unsupervised Next-Best-View (NBV) prediction algorithm to improve object detection and manipulation performance. In particular, the ability to predict the NBV point is important for mobile robots performing tasks in everyday environments. In active scenarios, whenever the robot fails to detect or manipulate objects from the current viewpoint, it predicts the next best view position, moves there, and captures a new scene to improve its knowledge of the environment. This can increase object detection and manipulation performance.

Hierarchical Object Representation for Open-Ended Object Category Learning and Recognition (Local LDA)

This paper proposes an open-ended 3D object recognition system which concurrently learns both the object categories and the statistical features for encoding objects. In particular, we propose an extension of Latent Dirichlet Allocation to learn structural semantic features (i.e., topics) from low-level feature co-occurrences for each category independently. Moreover, topics in each category are discovered in an unsupervised fashion and are updated incrementally using new object views. In this way, the advantages of both local hand-crafted and structural semantic features are considered in an efficient way.

GOOD: A Global Orthographic Object Descriptor for 3D Object Recognition and Manipulation

The Global Orthographic Object Descriptor (GOOD) has been designed to be robust, descriptive, and efficient to compute and use. The GOOD descriptor has two outstanding characteristics: (1) it provides a good trade-off among descriptiveness, robustness, computation time, and memory usage; (2) it allows concurrent object recognition and pose estimation for manipulation. The performance of the proposed object descriptor is compared with the main state-of-the-art descriptors. Experimental results show that the overall classification performance obtained with GOOD is comparable to the best performances obtained with the state-of-the-art descriptors. Concerning memory and computation time, GOOD clearly outperforms the other descriptors. The current implementation of the GOOD descriptor supports several functionalities for 3D object recognition and object manipulation.
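
The snippet below is a simplified sketch of the central step: projecting a pose-normalized object point cloud onto three orthogonal planes and turning each projection into a normalized distribution matrix, then concatenating the results. The full descriptor also constructs a local reference frame and orders and disambiguates the projection planes, which is omitted here.

```python
# Simplified, GOOD-like descriptor sketch (reference-frame construction omitted).
import numpy as np

def distribution_matrix(points_2d, bins=15):
    hist, _, _ = np.histogram2d(points_2d[:, 0], points_2d[:, 1], bins=bins)
    return hist / max(hist.sum(), 1.0)                 # normalize to a distribution

def good_like_descriptor(points, bins=15):
    """points: (N, 3) object point cloud expressed in its local reference frame."""
    planes = [points[:, [0, 1]],    # XoY projection
              points[:, [0, 2]],    # XoZ projection
              points[:, [1, 2]]]    # YoZ projection
    return np.concatenate([distribution_matrix(p, bins).ravel() for p in planes])
```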

Towards Lifelong Assistive Robotics: A Tight Coupling between Object Perception and Manipulation

In this work, we propose a cognitive architecture designed to create a tight coupling between perception and manipulation for assistive robots. This is necessary for assistive robots not only to perform manipulation tasks in a reasonable amount of time and in an appropriate manner, but also to robustly adapt to new environments by handling new objects. In particular, this cognitive architecture provides perception capabilities that allow robots to incrementally learn object categories from the set of accumulated experiences and reason about how to perform complex tasks.

Interactive Open-Ended Learning for 3D Object Recognition: An Approach and Experiments

This work presents an efficient approach capable of learning and recognizing object categories in an interactive and open-ended manner. In particular, we focus on two key questions: (1) How can the robot automatically detect, conceptualize, and recognize objects in 3D scenes in an open-ended manner? (2) How can it acquire and use high-level knowledge obtained from interaction with human users, namely when they provide category labels, in order to improve the system's performance?

Learning to Grasp Familiar Objects using Object View Recognition and Template Matching

In this work, interactive object view learning and recognition capabilities are integrated in the process of learning and recognizing grasps. The object view recognition module uses an interactive incremental learning approach to recognize object view labels. The grasp pose learning approach uses local and global visual features of a demonstrated grasp to learn a grasp template associated with the recognized object view. A grasp distance measure based on Mahalanobis distance is used in a grasp template matching approach to recognize an appropriate grasp pose.
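
The snippet below gives a minimal sketch of the template-matching step, assuming each learned template stores the mean and covariance of its associated grasp features; feature extraction and template learning are omitted, and the structure is illustrative rather than the exact implementation.

```python
# Minimal sketch of Mahalanobis-based grasp template matching.
import numpy as np

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def match_template(query_feature, templates):
    """templates: dict mapping template name -> (mean vector, covariance matrix)."""
    return min(templates,
               key=lambda name: mahalanobis(query_feature, *templates[name]))
```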

Humanoid Robots (RoboCup-HL)

After obtaining extensive knowledge about real-time intelligent robotic systems in the Middle Size League, I turned to building humanoid robots and formed two new robotic teams, Persia and BehRobot, to participate in the RoboCup humanoid leagues. We worked on three different types of humanoid robots: kid-size (height 59 cm, weight 4 kg), teen-size (height 93 cm, weight 7 kg), and adult-size (height 155 cm, weight 11.5 kg). We were one of the successful teams in the humanoid leagues and achieved several rankings in national and international competitions.

Middle Size Soccer Robots (RoboCup-MSL)

During the second year of my undergraduate program, I became familiar with the RoboCup competitions. In 2006, I formed a Middle Size League soccer robot team (RoboCup-MSL) named ADRO. We built five player robots and one goalkeeper robot with a similar structure, the goalkeeper being equipped with some additional accessories and sensors. Through this teamwork, I took an active role in developing the robots' software. Furthermore, I worked on the mechanical design of the robots using Autodesk Inventor. We achieved several rankings in national and international RoboCup competitions.

Contact



Dr. Hamidreza Kasaei
Artificial Intelligence Department,
University of Groningen,
Bernoulliborg building,
Nijenborgh 9, 9747 AG Groningen,
The Netherlands.
Office: 340
Tel: +31-50-363-33926
E-mail: hamidreza.kasaei@rug.nl