#265
Author: Data Science Gems
Uploaded: 2025-07-30
Views: 63
Description:
Large, high-quality datasets are crucial for training LLMs. However, few datasets are available for specialized, critical domains such as law, and the ones that exist are often small and English-only. MultiLegalPile is a 689GB corpus in 24 languages from 17 jurisdictions. It includes diverse legal data sources and allows pretraining NLP models under fair use, with most of the dataset licensed very permissively. Two RoBERTa models and one Longformer are pretrained multilingually on the entire corpus, and 24 monolingual models are pretrained on the language-specific subsets; all are evaluated on LEXTREME. Additionally, the English and multilingual models are evaluated on LexGLUE. The multilingual models set a new SotA on LEXTREME and the English models on LexGLUE. The dataset, trained models, and all code are publicly released.
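Since the corpus and pretrained models are publicly released, here is a minimal sketch of how one might load them with the Hugging Face datasets and transformers libraries. The repository IDs, config name, and column name below are assumptions for illustration and should be checked against the paper's release page.

```python
# A minimal sketch, assuming the corpus and models are hosted on the Hugging Face Hub.
# Repo IDs, the config name, and the "text" column are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Stream one language/source subset of the corpus instead of downloading all 689GB.
corpus = load_dataset(
    "joelniklaus/Multi_Legal_Pile",  # assumed dataset repo ID
    "de_caselaw",                    # assumed config: German caselaw subset
    split="train",
    streaming=True,
)
print(next(iter(corpus))["text"][:500])  # peek at the first document (assumed "text" column)

# Load one of the released multilingual legal encoders (assumed model repo ID).
tokenizer = AutoTokenizer.from_pretrained("joelniklaus/legal-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("joelniklaus/legal-xlm-roberta-base")
```

Streaming mode keeps memory and disk usage low, which matters for a corpus of this size; a full pretraining run would instead shard and tokenize the subsets offline.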
In this video, I talk about the following: What is MultiLegalPile? How do Legal-XLM models perform?
For more details, please look at https://arxiv.org/pdf/2306.02069
Niklaus, Joel, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel Ho. "MultiLegalPile: A 689GB Multilingual Legal Corpus." In Proceedings of ACL, pp. 15077-15094, 2024.
Thanks for watching!
LinkedIn: http://aka.ms/manishgupta
HomePage: https://sites.google.com/view/manishg/