Situated Natural Language Understanding in HRI via LLMs / Claudiu Daniel Hromei, 2025 Jun 03. 37th cycle, Academic Year 2024/2025.
Situated Natural Language Understanding in HRI via LLMs
HROMEI, CLAUDIU DANIEL
2025-06-03
Abstract
This thesis focuses on advancing situated natural language understanding for human-robot interaction (HRI) through the use of large language models (LLMs), aiming to create robots that can understand, process, and respond to human commands in real-world environments. The work addresses key challenges in integrating multitasking capabilities, scaling models across multiple languages and modalities, including visual data, and adopting sustainable training techniques. The key contributions of this work include:

• Integrating multitask learning into LLMs: Methods enabling LLMs to integrate various problem-solving approaches are explored, initially handling tasks individually and then within a unified structure. This is demonstrated through the ExtremITA model, which participated in the EVALITA challenge on a diverse set of Italian-specific linguistic tasks. Following the exploration of multitasking capabilities, the thesis further investigates the application of LLMs for syntactic analysis across languages.
• Neural transcoding for grammatical parsing based on LLMs: The thesis introduces U-DepPLLaMA, a model for universal dependency parsing using large autoregressive language models. This model achieves state-of-the-art results in dependency parsing across multiple languages and demonstrates the feasibility of scaling models using low-rank adapted parameters. It can handle multiple languages without task-specific architectural modifications.
• Grounding language understanding in HRI: Building on these capabilities, the thesis presents the GrUT approach, which interprets robotic commands in multiple languages, particularly English and Italian. This method combines frame semantics, a knowledge base, and lexical similarity to understand natural language commands.
• Multimodal interaction: The study incorporates multimodal models, with a particular focus on visual question answering (VQA) in Italian, using the GQA-it dataset. The MiniCPM-V model was optimised to improve its performance on the GQA-it dataset.
• Developing the MM-IGLU resource: The thesis introduces MM-IGLU, an interactive multimodal resource for grounded language understanding. It also presents MM-IGLU-it, an extension that supports the Italian language. These resources facilitate the training and evaluation of models in a multilingual context, grounded in a Minecraft-like world through environmental images.
• Dialogue systems in HRI: Preliminary work is presented on creating a dialogue resource on top of MM-IGLU, in which a robot can ask follow-up questions to clarify commands, and on evaluating how well multimodal models plan the interaction and resolve ambiguities in the input commands.

The thesis also provides a comprehensive evaluation of various models, using both automatic and manual metrics. Error analyses are conducted to evaluate the strengths and limitations of the models. In summary, these contributions collectively pave the way for developing more efficient, versatile, and interactive robots capable of understanding complex commands, performing tasks in multilingual and multimodal contexts, and seamlessly integrating into real-world human environments.


