#265
Author: Data Science Gems
Uploaded: 2025-07-30
Views: 63
Description:
Large, high-quality datasets are crucial for training LLMs. However, few datasets are available for specialized, critical domains such as law, and the ones that exist are often small and English-only. MultiLegalPile is a 689GB corpus in 24 languages from 17 jurisdictions. It includes diverse legal data sources and allows pretraining NLP models under fair use, with most of the dataset licensed very permissively. Two RoBERTa models and one Longformer are pretrained multilingually on the entire corpus, and 24 monolingual models are pretrained on the language-specific subsets; all are evaluated on LEXTREME. Additionally, the English and multilingual models are evaluated on LexGLUE. The multilingual models set a new SotA on LEXTREME and the English models on LexGLUE. The dataset, trained models, and all code are publicly released.
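Since the corpus and pretrained models are publicly released, here is a minimal sketch of how one might load them with the Hugging Face datasets and transformers libraries. The repository IDs, config name, and column name below are assumptions for illustration and should be checked against the paper's release page.

```python
# A minimal sketch, assuming the corpus and models are hosted on the Hugging Face Hub.
# Repo IDs, the config name, and the "text" column are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Stream one language/source subset of the corpus instead of downloading all 689GB.
corpus = load_dataset(
    "joelniklaus/Multi_Legal_Pile",  # assumed dataset repo ID
    "de_caselaw",                    # assumed config: German caselaw subset
    split="train",
    streaming=True,
)
print(next(iter(corpus))["text"][:500])  # peek at the first document (assumed "text" column)

# Load one of the released multilingual legal encoders (assumed model repo ID).
tokenizer = AutoTokenizer.from_pretrained("joelniklaus/legal-xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("joelniklaus/legal-xlm-roberta-base")
```

Streaming mode keeps memory and disk usage low, which matters for a corpus of this size; a full pretraining run would instead shard and tokenize the subsets offline.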
In this video, I talk about the following: What is MultiLegalPile? How do Legal-XLM models perform?
For more details, please look at https://arxiv.org/pdf/2306.02069
Niklaus, Joel, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel Ho. "MultiLegalPile: A 689GB Multilingual Legal Corpus." In Proceedings of ACL, pp. 15077-15094, 2024.
Thanks for watching!
LinkedIn: http://aka.ms/manishgupta
HomePage: https://sites.google.com/view/manishg/