Easy Peasy NLP With LLMs

LLMs have opened up a whole new way of doing tricky NLP tasks that I haven't seen documented in other places yet.

Recently I had a need to disentangle a text field that was a messy combination of dates (written out in words) and a place of birth. For example: "Fourteenth of June 1956 (8:05am), Sacred Heart Hospital, Seattle". I wanted to extract the date and the place of birth separately.

I could have used a regex to extract the date, but as soon as the data didn't fit the pattern everything would fall to bits. I was dealing with 100,000+ rows of messy, manually entered data, so I wanted a model that could handle that. I tried spaCy, but it didn't work very well, and I would have had to write a lot of rules to get it to almost work.

I decided to try using an LLM (Large Language Model) for the task. Like the rest of the world, I've had nothing but LLMs in my face for the last year, so I was curious to see whether they would work in this context. It turns out that they worked really well, and made the task super easy!

V1 - Prompting

The first thing I tried was prompting the model with the text I wanted to extract from. I used the Hugging Face Transformers library to do this, with the "meta-llama/Llama-2-7b-chat-hf" model, which is the smallest of the Llama-2 models.

My environment for this experiment was a Paperspace Gradient notebook, usually running on a P6000 GPU.

First steps - install the required libraries and download the model.
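The original cells aren't reproduced here, but the gist was something like this (a sketch; the exact arguments and versions may have differed):

```python
# Install the required libraries (this was a %pip cell in the notebook)
# %pip install transformers accelerate torch

import torch
from transformers import pipeline

# The last line does the heavy lifting: it downloads the weights from the
# Hugging Face Hub (you need to have accepted the Llama-2 licence and be
# logged in via `huggingface-cli login`), loads them onto the GPU, and wraps
# everything in a text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
```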

 

The last line obviously does a lot of the lifting. It downloads the model from the Hugging Face Hub, loads it into memory, and sets up a pipeline for generating text. The pipeline is a wrapper around the model that makes it easy to use: it takes care of tokenizing the input text, running it through the model, and decoding the output.

At this point, we can start generating text. Let's try it out with a simple prompt.
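For example (illustrative; the exact prompt and sampling settings aren't shown in the original):

```python
text = "Fourteenth of June 1956 (8:05am), Sacred Heart Hospital, Seattle"

# Feed the raw text straight in and let the model continue it.
result = generator(text, max_new_tokens=60, do_sample=True)
print(result[0]["generated_text"])
```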

The model has used the prompt as the start of a story. Not bad, but no good for us. We need to steer the model in the right direction by showing it some example inputs and outputs. We can do this by adding some extra text to the prompt. This is a type of prompt engineering called "few-shot in-context learning", and most importantly it requires no training!

There are some special tags which we can use to craft our prompt. These tags are specific to the model, so in this case we're using the tags for the Llama-2 model family. I'm not completely sure whether I'm using them correctly, but they seem to work. Each example starts with a <s> tag and ends with a </s> tag. We differentiate between the user's input and the model's output by wrapping the user input in [INST] and [/INST] tags. Of course in reality we generated all of the examples ourselves, so in a sense we're tricking the model into thinking that the examples are previous turns of the conversation.

It's important to note that each call to the model starts from a blank slate, so for every query we show the model the whole set of examples all over again.
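Roughly what that looks like in code. The example pairs and the "date | place" output format below are illustrative stand-ins, not the exact ones I used:

```python
# A handful of hand-written examples in the Llama-2 chat format.
# Each example is wrapped in <s> ... </s>, with the "user" part inside [INST] ... [/INST].
examples = [
    ("Fourteenth of June 1956 (8:05am), Sacred Heart Hospital, Seattle",
     "date: 1956-06-14 | place: Sacred Heart Hospital, Seattle"),
    ("3rd of March nineteen eighty-two, St Mary's, London",
     "date: 1982-03-03 | place: St Mary's, London"),
]

def build_prompt(query: str) -> str:
    prompt = ""
    for text, answer in examples:
        prompt += f"<s>[INST] {text} [/INST] {answer} </s>"
    # The new query is left "open" so the model generates the answer as its continuation.
    prompt += f"<s>[INST] {query} [/INST]"
    return prompt

result = generator(
    build_prompt("First of January 2001 about 3pm, somewhere in Fiji"),
    max_new_tokens=50,
    do_sample=False,
    return_full_text=False,  # only return the newly generated text, not the prompt
)
print(result[0]["generated_text"])
```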

Amazing! With a very small amount of code, we've got a very robust model that will happily accept out-of-domain inputs and return the correct answer. It would take a very long time to craft a set of rules that could do the same thing.

But the native language of the model is numbers, so why should we limit ourselves to telling the model what to do in text? This is the idea behind the next version of the experiment: prompt tuning.

V2 - Prompt tuning

The idea behind prompt tuning is that we can steer the model by prepending a learned vector to the input text. We freeze the model and optimise the vector to produce the desired output. I think of it as gradient-descent-powered prompt engineering. It's much faster than training the whole model, and it also means we can train multiple task-specific vectors for the same model and swap them in and out as needed. However, because we're going to be optimising the vector, we need some training data.

As usual, getting perfect training data here is a catch-22: if I had it, I wouldn't need to do this experiment. So I'm going to use some fake data generated by the Faker library. I'm going to generate 100,000 examples of the type of text I want to extract from, and use those examples as the training data for the model. To make the data look a bit more realistic, I'll rough it up a little by adding some errors here and there. It's not perfect, but we'll see that it still works pretty well for both in-domain and out-of-domain inputs.

Also, for anyone looking to copy this notebook: at this point I restarted it to clear out some memory. I'm sure there is a better way of doing this.

We'll start this time by creating our synthetic dataset. The goal is to make a somewhat realistic dataset that is at least illustrative of the task at hand. Generating synthetic data is usually easier said than done, especially when you have to worry about the conditional distributions of dependent variables. There have been a number of times where I've had to stop myself falling down a rabbit hole by realising that generating the synthetic data is just as much work as (if not more than) whatever alternative I was avoiding. But in this case it's relatively straightforward.
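A sketch of the generation step. The field names, formats, and the "roughing up" rules below are my own illustrative choices, not the exact recipe from the notebook:

```python
import random
from faker import Faker

fake = Faker()

def make_example() -> dict:
    # A date of birth written out with a time, followed by a hospital-ish place of birth.
    dob = fake.date_of_birth()
    time = fake.time(pattern="%I:%M%p").lower()
    place = f"{fake.last_name()} Hospital, {fake.city()}"
    text = f"{dob.strftime('%d %B %Y')} ({time}), {place}"

    # Rough the text up a little so it looks more like messy manual data entry.
    if random.random() < 0.1:
        text = text.replace(",", "")
    if random.random() < 0.1:
        text = text.lower()

    return {"text": text, "date": dob.isoformat(), "place": place}

examples = [make_example() for _ in range(100_000)]
```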

Next we have to do some preprocessing on these examples so that they are ready to be fed into the language model.

What we'll feed into the model is three vectors (sketched in the code after this list):

  1. The first vector is input_ids, which is the tokenized input + desired output, padded out to max_length

  2. The second vector is attention_mask, which is 1s for the real tokens and 0s for the padding, so the model ignores the padded positions

  3. The third is the desired output labels: the same as input_ids, but with the input and padding positions set to -100 so that, while training, the loss is only computed on the desired output and the model can't cheat
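A minimal sketch of that preprocessing, assuming a Bloom-family tokenizer and the synthetic examples from above. The "->" separator and the "date | place" target format are illustrative assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
max_length = 64

def preprocess(example: dict) -> dict:
    prompt = example["text"] + " ->"
    target = f" date: {example['date']} | place: {example['place']}{tokenizer.eos_token}"

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + target_ids
    # Only the target tokens contribute to the loss; prompt positions are masked with -100.
    labels = [-100] * len(prompt_ids) + target_ids

    # Left-pad everything out to max_length.
    pad = max_length - len(input_ids)
    input_ids = [tokenizer.pad_token_id] * pad + input_ids
    labels = [-100] * pad + labels
    attention_mask = [0] * pad + [1] * (max_length - pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

In practice you'd also want to truncate anything longer than max_length, but the synthetic examples are short enough that it doesn't matter much here.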

 

Now that we have some synthetic data and have preprocessed it, we can put together our prompt tuning model.

I'm going to use a smaller model as the base, since it seems to be more stable for this workflow.

Using Hugging Face PEFT (parameter-efficient fine-tuning), we do this by creating a new instance of the model with some extra bits, as defined in the PromptTuningConfig. If we wanted to fine-tune a model using LoRA, the setup is very similar.
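A sketch of that setup, assuming "bigscience/bloomz-560m" as the smaller Bloom base model (the exact checkpoint I used isn't shown here):

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,                # we're steering a causal language model
    prompt_tuning_init=PromptTuningInit.RANDOM,  # start the virtual tokens from random embeddings
    num_virtual_tokens=8,                        # the 8 trainable 'tokens' mentioned below
)

model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # only the virtual token embeddings are trainable
```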

 

 

 

We now have a model that functions almost the same as the base Bloom model, except with the addition of 8 trainable 'tokens' that will be prepended to the input. These will steer the model in the right direction.

Time to train the model! First, initialise the optimizer and learning rate scheduler.
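Something like this, assuming the preprocessed examples have been wrapped in a PyTorch DataLoader called train_dataloader (the hyperparameters here are illustrative):

```python
import torch
from transformers import get_linear_schedule_with_warmup

num_epochs = 5
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-2)  # prompt tuning tolerates a fairly high LR
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs,
)
```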

This next bit is a fairly standard model training loop, and could probably be rewritten more elegantly using the Hugging Face Trainer interface.
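For completeness, a sketch of that loop, keeping the same names as above:

```python
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for batch in tqdm(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # the PEFT model returns the usual causal LM loss
        outputs.loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        total_loss += outputs.loss.item()
    print(f"epoch {epoch}: mean loss {total_loss / len(train_dataloader):.4f}")
```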

Finally, we can check some output of the prompt-tuned model.
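For example, with a made-up test input, using the same "->" convention as in the preprocessing sketch:

```python
model.eval()
text = "Twenty third of April nineteen ninety, Mercy Hospital, Chicago ->"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```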

Not too bad!

Although I was hoping for more than 0.87 accuracy on the test set, I think this method makes up for it with how it handles the out-of-domain examples above. This isn't going to replace rules-based approaches completely, not least because of the dramatically higher resource requirements, but it is a useful tool to have in the box.

To extend this further, the next step might be to look at LoRA fine-tuning. With LoRA you train a very small proportion of the overall model's parameters, which will probably lead to better results on these tasks. As with all of these approaches, better results probably mean more training time.
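For reference, the LoRA setup in PEFT looks very similar to the prompt tuning one. The values below are common defaults, not ones from this post:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
)

lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()  # still a small fraction of the full model
```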

Useful resources

Below are some useful resources I found while writing this.

I also found this snippet useful for seeing memory estimates for models at different quantisation levels.
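I can't reproduce the original snippet here, but the back-of-the-envelope version is just bytes-per-parameter arithmetic, something like:

```python
def estimate_memory_gb(num_params: float) -> dict:
    # Rough weights-only estimate: bytes per parameter at each precision,
    # ignoring activations, optimizer state, and framework overhead.
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    return {k: round(num_params * b / 1e9, 1) for k, b in bytes_per_param.items()}

print(estimate_memory_gb(7e9))  # e.g. a 7B-parameter model
# {'fp32': 28.0, 'fp16': 14.0, 'int8': 7.0, 'int4': 3.5}
```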

And another handy one for diagnosing memory issues.
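Again not the original snippet, but the PyTorch built-ins cover most of what I needed:

```python
import torch

print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB currently allocated")
print(f"{torch.cuda.memory_reserved() / 1e9:.2f} GB reserved by the caching allocator")
print(torch.cuda.memory_summary())  # detailed breakdown of the allocator's state
```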

 


 

Note on how I generated this page

This blog is hosted in Jekyll. I don't post often, but whenever I do I find that the set of tools I use has usually moved on a bit since I last posted, and integrating the new workflow into Jekyll seems to be a pain every time. This time, I wasn't happy to just post a plain Jupyter notebook because I didn't like the way it looked; I wanted it to be rendered like a GitHub markdown page. I spent some time trying to export the Jupyter notebook as markdown and style it directly in Jekyll, but eventually gave up. The way I found was to export it to markdown, open that markdown file in Typora, apply the default GitHub theme, and then export it as HTML with styles. It feels like there should be an easier way...