Wessel Sandtke - Don’t judge a book by its cover: Using LLM created datasets to train models...
Author: PyData
Uploaded: 2023-11-22
Views: 509
Description:
Don’t judge a book by its cover: Using LLM created datasets to train models that detect literary features
Existing book recommendation systems like Goodreads are based on correlating the reading habits of people. But what if you want a humorous book? Or a book that is set in 19th century Paris? Or a thriller, but without violence?
We build book recommendation systems for Dutch libraries based on more than a dozen features, ranging from historical setting to writing style to the characteristics of the main character. This allows us to tailor each recommendation to individual readers.
The recent developments in LLMs are an interesting area for us to explore to improve our recommendations. However, running LLMs in production is unfortunately not always feasible. The associated costs may be too high, and running code from third parties in your daily pipeline may be undesirable. And then there’s data privacy - or, in our case, intellectual property - to be considered as well.
So how can you reap the benefits of an LLM, without exposing yourself or your company to some of these major downsides?
We utilized LLMs to generate custom, tailor-made datasets for our literary feature detection models to train on. This allowed us to benefit from the high performance of large language models, without continued reliance on external parties such as OpenAI or Google.
While you may think LLMs are not as effective for languages other than English, we’ve seen major improvements in several of our models.
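The idea described above, labelling data once with an LLM and then training a model you run yourself, can be illustrated with a minimal sketch. This is not the speaker's pipeline: `llm_label` is a hypothetical stand-in for an LLM call (a keyword rule, so the example runs offline), the passages are made up, and the small naive Bayes classifier is only an assumed placeholder for a real feature-detection model.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w]


def llm_label(passage):
    """Hypothetical stand-in for an LLM labelling call.

    A real pipeline would prompt a model here; a keyword rule is used
    instead so the sketch runs without any external service.
    """
    funny_words = {"laughed", "joke", "giggled", "silly"}
    return "humorous" if set(tokenize(passage)) & funny_words else "serious"


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: the "LLM" labels raw passages once, producing a training set.
passages = [
    "He told a silly joke and everyone laughed.",
    "The clown giggled at his own joke.",
    "The funeral was held in the rain.",
    "She mourned the loss in silence.",
]
labels = [llm_label(p) for p in passages]

# Step 2: a local model is trained on those labels.
model = NaiveBayes().fit(passages, labels)

# Step 3: inference uses only the local model; no LLM call is needed.
prediction = model.predict("Everyone laughed at the joke.")
print(prediction)
```

In this arrangement the LLM is only involved while the dataset is being built, so the recurring cost, third-party dependency, and data-sharing concerns apply to dataset creation rather than to the daily production pipeline.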
In this talk, we’ll highlight:
A note on recommenders: Why does the Goodreads recommender not work for me, while Spotify’s Discover Weekly is so good?
Different methods of getting data from books
Iterative process of creating a dataset using an LLM and retraining our models
Some notes on intellectual property and evaluation of models.
Bio:
Wessel Sandtke
Typewriter repairman turned Machine Learning Engineer, now at Bookarang, a Dutch startup that works with Dutch libraries to improve the recommendations for their members.
Wrote several picture books, but is not allowed to boost those in the recommendation system.
===
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.
Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...