Mastering Context Extraction for Fine-Tuning GPT Models

Extracting Domain Specific Context from OpenWebText Data

Sadrach Pierre, Ph.D.
DataFabrica

--

Image by Pixabay on Pexels

Generative Pre-trained Transformer (GPT) models are large language models used for natural language processing tasks. GPT models are developed by training on a diverse set of text data. By learning from a wide range of textual data, GPT models are able to generate contextually relevant content based on the terms provided in a prompt. These models achieve this through what is called the transformer architecture, which enables GPT models to identify and weigh the important words in a sequence given a word or set of words. As a result, these models are reliably able to capture grammatical structure, semantics, and overall language patterns.
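To give a rough sense of the weighing mechanism at the heart of the transformer, here is a minimal NumPy sketch of scaled dot-product attention. The matrices and dimensions below are toy values chosen purely for illustration; real GPT models stack many such attention heads across deep layers:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Toy single-head attention: weigh each token's value vector
    # by its similarity to every other token in the sequence.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                          # weighted mix of token values

# Three tokens with four-dimensional embeddings (illustrative numbers only)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V))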

GPT models have a wide range of applications across industries, far too numerous to summarize in a single post. The most popular application built on top of a GPT model is ChatGPT, which was released by OpenAI in 2022. ChatGPT has quickly evolved into a modern-day search engine, amassing over 180 million active users since its release. Further, ChatGPT logged 1.7 billion visits this past October alone. Because ChatGPT functions much like a search engine, its use cases are extremely broad. They include summarizing documents, looking up well-known, documented concepts, generating code, and much more.

Currently, OpenAI offers the publicly available GPT-3.5 and, behind a subscription, GPT-4. The public version is free to use, while the paid version costs $20/month. The paid version is also multimodal and more accurate: you can generate text, image, or audio outputs by inputting text-based prompts or by speaking directly to the tool!

Given the extreme success of ChatGPT, many companies have expressed strong interest in developing industry-specific versions of ChatGPT. This is done by fine-tuning the underlying GPT model on industry-specific text data and deploying the fine-tuned model in an application. For example, many financial institutions either retrain or fine-tune GPT models on finance-specific knowledge bases. This is a useful way of updating the weights placed on finance-specific terms that are relevant to their business use case. A specific example is fine-tuning a GPT model on financial news data to enhance the quality of the output generated for writing financial reports.
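To make that concrete, here is a minimal sketch of launching a fine-tuning job with the OpenAI Python SDK. The file name finance_train.jsonl and the choice of gpt-3.5-turbo as the base model are assumptions for illustration, not part of any specific production workflow:

# Minimal fine-tuning sketch using the OpenAI Python SDK (v1.x).
# Assumes finance_train.jsonl already contains chat-formatted training examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the finance-specific training data
training_file = client.files.create(
    file=open("finance_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a fine-tuning job on top of a base GPT model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # illustrative base model choice
)
print(job.id, job.status)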

There are several public data options that can be used to fine-tune GPT models for industry-specific use cases. One of the most well-known public data sources for building GPTs is the OpenWebText corpus.

What is OpenWebText?

OpenWebText is an open-source version of the WebText corpus created by OpenAI. It contains web data scraped from links shared on Reddit, where higher-quality texts are prioritized. Specifically, OpenAI used a cutoff of at least 3 karma on Reddit as a heuristic for assessing quality. The data was generated by extracting the URLs from the Reddit submissions data set.

The OpenWebText data has been used for a wide variety of research and industry applications. These include developing pre-trained LLMs, generating industry-specific content, language translation, named entity recognition, text summarization, and much more.

Accessing OpenWebText data

You can access the data set by navigating to the OpenWebTextCorpus site. Once you download the .tar.xz file, which is roughly 12 GB compressed, you can extract the files with the following command:

tar -xf path/to/files/openwebtext.tar.xz -C path/to/files/

This will result in a directory called OpenWebText containing the .xz files.

Screenshot taken by Author

If we go into some of these files we can see the following content:

Screenshot taken by Author
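You can also peek inside the files programmatically. Here is a minimal sketch that assumes each .xz file is itself an xz-compressed tar archive of plain-text documents, which is how the corpus is commonly distributed; adjust the reading logic if your download is structured differently:

# Minimal sketch: print the first few documents from one .xz file.
# Assumes each .xz file is an xz-compressed tar archive of .txt documents.
import tarfile
from pathlib import Path

data_dir = Path("path/to/files/openwebtext")  # directory produced by the tar command above
sample = next(data_dir.glob("*.xz"))          # pick an arbitrary .xz file

with tarfile.open(sample, mode="r:xz") as tar:
    for member in tar.getmembers()[:3]:       # first few documents only
        doc = tar.extractfile(member)
        if doc is None:
            continue
        print(member.name)
        print(doc.read().decode("utf-8", errors="ignore")[:300])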

We can also extract text containing specific keywords. This is useful for tasks such as fine-tuning LLMs like GPTs. You can write a simple Python script that iterates over these .xz files, extracts context related to a keyword, and writes that context to a .txt file. Let’s extract context related to the keyword “Finance”. Here is a snapshot of the resulting .txt file:

Screenshot taken by Author
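The full extraction script is available from DataFabrica (linked later in this post), but a minimal sketch of the approach might look like the following. The paths, output file name, and function name are illustrative, and it again assumes each .xz file is a tar archive of text documents:

# Sketch: collect every OpenWebText document that mentions a keyword
# and write the matches to a single .txt file. Paths and names are illustrative.
import tarfile
from pathlib import Path

def extract_keyword_context(data_dir, keyword, out_path):
    keyword = keyword.lower()
    with open(out_path, "w", encoding="utf-8") as out:
        for xz_file in sorted(Path(data_dir).glob("*.xz")):
            with tarfile.open(xz_file, mode="r:xz") as tar:
                for member in tar.getmembers():
                    doc = tar.extractfile(member)
                    if doc is None:
                        continue
                    text = doc.read().decode("utf-8", errors="ignore")
                    if keyword in text.lower():
                        out.write(text.strip() + "\n\n")

extract_keyword_context("path/to/files/openwebtext", "Finance", "finance_context.txt")

Swapping in a different keyword, such as “Equity Markets” or “Fixed Income”, produces the more targeted extracts discussed below.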

This file contains relatively broad text related to the keyword “Finance”. The extract can be further refined by including additional or more specific keywords. For example, you can try extracting context related to “Equity Markets”:

Screenshot taken by Author

This context can be used to fine-tune a GPT model for generating sentiment scores on equity market news. Such a model can serve as decision support for traders, investors, and financial analysts by helping them gauge the sentiment around specific stocks.
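As a rough illustration of how such extracts could be turned into training data for that kind of model, here is a hypothetical example written in the chat-style JSONL format accepted by OpenAI’s fine-tuning endpoint. The headline and the sentiment score are invented for illustration:

import json

# Hypothetical training record; the headline and score are invented for illustration.
record = {
    "messages": [
        {"role": "system",
         "content": "Score the sentiment of equity market news from -1 (bearish) to 1 (bullish)."},
        {"role": "user",
         "content": "Tech shares rally as chipmakers beat earnings expectations."},
        {"role": "assistant", "content": "0.8"},
    ]
}

# Fine-tuning data is stored as one JSON object per line (.jsonl)
with open("equity_sentiment_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")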

Another example is “Fixed Income”. Let’s extract some context related to “Fixed Income”:

Screenshot taken by Author

This context can be used to fine tune a GPT model that generates fixed income investment recommendations based on a user profile and financial goals.

The script used to extract keyword-related context can be found on DataFabrica. You can access the script here.

Conclusions

In this post we discussed how the public data source OpenWebText can be leveraged to generate domain-specific context for fine-tuning LLMs. We first discussed how to access the data through the OpenWebText site. We then showed how to decompress the downloaded .tar.xz file and extract the .xz files. Finally, we looked at some keyword-specific extracts and how they can be used to improve the outputs of GPT models. You can access the context extraction script here.

--

Sadrach Pierre, Ph.D.
DataFabrica

Writer for Built In & Towards Data Science. Cornell University Ph.D. in Chemical Physics.