Learnings from an unsupervised fine-tuning

My goal with this experiment was to add new knowledge to a model with LoRA, and to do it in an automated way. For the topic I picked a niche area of vintage computing: teaching a model about the NextStep operating system. In some parts I relied on intuition and often on naive implementations, but it will get better over time. It turned out a questionable success, but it achieved the goal and I learned a lot. In this post I’d like to share my approach and learnings.

The data

Getting data is a pain. I mean, getting the raw data is not so difficult; you can grab some PDF, HTML, or doc files, but converting them into meaningful training data takes some processing. I got the idea from the Alpaca approach of using another LLM to generate the training data, and I added a twist.
Here’s how I did it. I ran a 13B Orca2 Q8 model locally, read the raw documents, and split them into chunks that did not exceed 3500 tokens, so that together with my instructions each chunk would fit the 4096-token context length of the Orca model.
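
A minimal sketch of the chunking step, as I’d reconstruct it (the tokenizer choice and function name are my assumptions, not the exact script):

from transformers import AutoTokenizer

# Tokenize with the same vocabulary the generator model uses,
# so the 3500-token budget lines up with its 4096-token context window.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Orca-2-13b")

def chunk_document(text, max_tokens=3500):
    # Split a raw document into consecutive windows of at most max_tokens tokens.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
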
With these chunks, I instructed the model to generate question-answer pairs related to each chunk:

messages = [
        {"role": "system", "content": "You are an assistant helping me to create a dataset"},
        {"role": "user", "content": f"Generate multiple at least 3 Question Answer pairs. Use EXACTLY the 'Question:' 'Answer:' format. \
        The answers should borrow, verbatim from the data below. \
        Provide each question consider the reader does not have access to any other question for the context. \
        Vary the style and format of the questions. \
        in the context of: {document_context}, about the following data: \n\n {context_str}"}
    ]

I used regular expressions and high hopes 🙂 to parse the model output and store the individual datapoints (a sketch of that parsing step follows the snippet below). To have meaningful training data, I stored it in the following format. Note: for the context I simply used either the filename or the file path, whichever carried more meaning (e.g., when the filename was index.html I fell back to the path).

training_data.append({"input":    question, 
                      "output":   answer, 
                      "context":  document_context,
                      "data":     chunk,
                      "chunk_id": chunk_id,
                      "file":     file_path})

This process took a while on a relatively powerful workstation, and the duration depends on the data size. I managed to generate around 30k samples in 24 hours. (Note: I haven’t optimized the local LLM calls.)

For the evaluation data I used a similar approach: on 1% of all the chunks, I used a different local LLM (LLaMA 7B) to generate the same kind of pairs. The intention was to provide somewhat different samples while still being able to feed the whole document into the training process.
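
The split itself is trivial; something like this (my own sketch, variable names are hypothetical):

import random

# Hold out roughly 1% of the chunks for eval generation with the second model.
eval_chunks = random.sample(chunks, k=max(1, len(chunks) // 100))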

The training

I chose a smaller model to make my life easier from the GPU memory perspective, so I used phi-2 as the base model. I added a LoRA adapter with a rank on the higher side, R=128, which by itself amounted to training the equivalent of 3.5% of the original model’s parameters. For alpha I also used 128, keeping the scaling factor alpha/R at 1 so the new information isn’t damped, as I wanted to add new knowledge. For the training I used my 2×3090 with vanilla DDP and ran 100 epochs, which was a guess (now I know better). I used a batch size of 1 and limited sequences to 325 tokens to fit into memory. Even with this, GPU memory utilization was above 90% (a 3090 has 24GB). The training took ~60 hours. The tightly packed blower-style 3090s didn’t have adequate airflow/cooling, so I had to supplement them with a big fan.
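
A sketch of the LoRA setup under these settings, using the PEFT library. The target module names for phi-2 are my assumption here; check the model’s actual layer names:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

lora_config = LoraConfig(
    r=128,                # rank on the higher side, for extra capacity
    lora_alpha=128,       # alpha == r keeps the adapter scaling factor at 1
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed for phi-2
    lora_dropout=0.05,    # illustrative; the post doesn't fix a dropout value
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # should report roughly 3.5% trainable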

The result

Training completed, which by itself was a big win for me. It took a few trials to avoid crashing because memory consumption hit the ceiling, or because after a couple of hours one GPU died due to thermal conditions. The TensorBoard logs showed severe overfitting. Some overfitting I did want, to achieve memorization, but this might be too much, and the model won’t be able to generalize as well.

The produced model seems to have picked up the new knowledge and provided answers in the right context. Sometimes it generated a reasonable short answer to the question, followed by something like 10 other questions. Anyway, after using up that much of my family’s patience and that much electricity, I released the model on HuggingFace; feel free to play with it.
https://huggingface.co/DevQuasar/vintage-nextstep_os_systemadmin-ft-phi2

The learnings

  1. For new knowledge I probably don’t need the data in a question-answer format. It may be beneficial to do a two-phase fine-tuning: first teach the new knowledge by just showing the data itself without instructions, and in another iteration show examples of asking about the topic.
  2. My unsupervised training-data generation approach put too much trust in the model, the prompt, and the regexp :D. Once I eyeballed the generated data, I found multiple entries where the input contained multiple sets of Question-Answer pairs. This might explain why the final model occasionally generates some additional question-answer pairs after a reasonable answer to the user’s question.
  3. Data chunking is critical. I’ll need a more sophisticated way to keep contextually related pieces of the input data together. Chunking on a token limit splits the original data mid-context, so the quality of the generated training Q-A pairs suffers.
  4. Human documents use references to other parts of the document. I hadn’t even thought about this, but while testing the tuned model, the answer sometimes contained a reference to other chapters. That makes sense from the model’s perspective, but it doesn’t help when you’re using the model, relying only on its knowledge, with no access to the original data source.
    I was somewhat able to mitigate this behavior with prompt engineering like:
    “Give me a complete answer do not refer to other chapters but collect the information from them. How to setup a local network in Openstep OS?”
    but whether the seemingly coherent answer was true or just the result of hallucination is yet to be confirmed.
    Anyway, I feel this can be an interesting topic to dig deeper into in the future.
  5. I should have run far fewer epochs and applied early stopping based on the eval results; a sketch of how I’d wire that up follows this list.
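
For item 5, this is how I’d set up early stopping with the Hugging Face Trainer next time; the argument values are illustrative and the dataset variables are placeholders:

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="phi2-nextstep-lora",
    num_train_epochs=10,               # upper bound; early stopping cuts this short
    per_device_train_batch_size=1,
    eval_strategy="steps",             # "evaluation_strategy" on older transformers
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                       # the PEFT-wrapped phi-2 from above
    args=args,
    train_dataset=train_ds,            # placeholder: tokenized Q-A training set
    eval_dataset=eval_ds,              # placeholder: the LLaMA-generated eval set
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()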