How to Build a Finance Domain Specific LLM from Scratch Using Python
Author: Analytics in Practice
Uploaded: 2025-12-30
Views: 162
Description: This notebook walks through an end-to-end workflow for building a finance domain-specific LLM in Python. The goal is to ingest real financial language, such as 10-K and 10-Q filings, and train a model that answers finance questions, follows finance instructions, and can later support citation-style retrieval with RAG. It begins by installing the core tooling and setting environment flags that reduce multiprocessing and tokenizer threading issues, which helps stability on Windows. The pipeline downloads recent SEC filings for a small set of tickers and form types using sec-edgar-downloader, and emphasizes identifying yourself properly to the SEC with a user agent and email. Before heavy processing, it checks available RAM to avoid crashes when loading and tokenizing large documents. Next, it traverses the SEC filing directory tree, selects high-signal files such as full-submission.txt, filters out tiny or noisy documents, and builds a Hugging Face Dataset containing the raw text plus metadata such as file path and ticker. The notebook then tokenizes the text with a pretrained tokenizer, removes the raw text to save memory, and “packs” the tokens into fixed-size blocks suitable for language-model training by concatenating sequences and chunking them into 1024-token windows. To avoid repeating expensive preprocessing, it saves the tokenized and packed datasets to disk and demonstrates reloading them later. For training, it switches to a CPU-friendly base model and uses LoRA via peft together with trl’s SFTTrainer to run a small instruction-tuning job on a subset of the packed dataset, keeping the number of steps limited so it is practical on a laptop. Finally, it shows how to load the base model with the LoRA adapter and query it with a chat-style prompt template so that the model responds as a finance tutor, producing explanatory answers rather than code or unrelated output.
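The packing step mentioned in the description can be sketched in plain Python. This is a minimal illustration and not the notebook's actual code: the function name pack_tokens, the column names input_ids and attention_mask, and the choice to drop the leftover tokens that do not fill a final block are all assumptions made here.

```python
from itertools import chain


def pack_tokens(batch, block_size=1024):
    """Concatenate tokenized sequences and cut them into fixed-size blocks.

    `batch` maps a column name to a list of per-document token lists,
    e.g. {"input_ids": [[...], [...]], "attention_mask": [[...], [...]]}.
    Assumption: the batch holds only token columns (no text or metadata).
    """
    # Flatten every column into one long list, document after document.
    flat = {name: list(chain.from_iterable(seqs)) for name, seqs in batch.items()}

    # Assumption: tokens that do not fill a whole block are discarded.
    total = (len(flat["input_ids"]) // block_size) * block_size

    # Slice each flattened column into consecutive block_size windows.
    return {
        name: [values[i:i + block_size] for i in range(0, total, block_size)]
        for name, values in flat.items()
    }


# Tiny demonstration with a block size of 4 instead of 1024.
example = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
packed = pack_tokens(example, block_size=4)
print(packed["input_ids"])
```

With Hugging Face datasets, a function of this shape is usually applied through Dataset.map with batched=True. Because packing changes the number of rows, any remaining metadata columns (such as file path or ticker) would have to be dropped first, for example with the remove_columns argument.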