Cross-lingual distillation for domain knowledge transfer with sentence transformers
Bacco L.;Merone M.;Pecchia L.
2025-01-01
Abstract
Recent advancements in Natural Language Processing (NLP) have substantially enhanced language understanding. However, non-English languages, especially in specialized and low-resource domains like biomedicine, remain largely underrepresented. Bridging this gap is essential for promoting inclusivity and expanding the global applicability of NLP technologies. This study presents a cross-lingual knowledge distillation framework that uses sentence transformers to improve domain-specific NLP capabilities in non-English languages, focusing on biomedical text classification tasks. By aligning sentence embeddings between a teacher model trained on English biomedical corpora and a multilingual student model, the proposed method transfers both domain-specific and task-specific knowledge. This alignment allows the student model to process and adapt to biomedical texts in Spanish, French, and German, particularly in low-resource settings with limited tuning data. Extensive experiments with domain-adapted models such as BioBERT and with multilingual BERT, using machine-translated text pairs, demonstrate substantial performance improvements in downstream biomedical NLP tasks. The framework proves highly effective when little training data is available. The results highlight the scalability and effectiveness of this approach, facilitating the development of robust multilingual models tailored to the biomedical domain and thus advancing global accessibility and impact in biomedical NLP applications.
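The embedding alignment described in the abstract can be illustrated with a toy sketch. The abstract does not state the training objective, so the mean-squared-error alignment loss used here is an assumption, as are all names in the code. Random vectors stand in for sentence features, a fixed linear map stands in for the English teacher (BioBERT in the paper), a trainable linear map stands in for the multilingual student, and a noisy copy of each source vector stands in for its machine translation. This is not the authors' implementation.

```python
import random

random.seed(0)
DIM_IN, DIM_OUT = 4, 3  # toy feature and embedding sizes (assumed)

def embed(weights, features):
    """Apply a linear map (list of rows) to a feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical frozen teacher: a fixed random linear map.
teacher_weights = [[random.uniform(-1, 1) for _ in range(DIM_IN)]
                   for _ in range(DIM_OUT)]

# Parallel pairs: (teacher embedding of the source, source features,
# "translated" features). The translation is simulated by adding noise.
pairs = []
for _ in range(50):
    src = [random.uniform(-1, 1) for _ in range(DIM_IN)]
    tgt = [x + random.gauss(0, 0.05) for x in src]
    pairs.append((embed(teacher_weights, src), src, tgt))

# Hypothetical student: a trainable linear map, initialised to zero.
student_weights = [[0.0] * DIM_IN for _ in range(DIM_OUT)]

def train_epoch(weights, data, lr=0.05):
    """One SGD pass pulling the student's embeddings of both the source and
    the translated input toward the teacher's embedding of the source."""
    total = 0.0
    for teacher_emb, src, tgt in data:
        for feats in (src, tgt):
            pred = embed(weights, feats)
            total += mse(pred, teacher_emb)
            for i, row in enumerate(weights):
                # Gradient of the MSE with respect to output component i.
                grad = 2.0 * (pred[i] - teacher_emb[i]) / len(pred)
                for j, f in enumerate(feats):
                    row[j] -= lr * grad * f
    return total / (2 * len(data))

losses = [train_epoch(student_weights, pairs) for _ in range(100)]
print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.4f}")
```

In a realistic setting the two linear maps would be replaced by pretrained sentence encoders and the noisy copies by actual machine-translated sentence pairs; the sketch only shows the shape of the alignment step.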
File: 1-s2.0-S0950705125001261-main.pdf
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 2.48 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.