Model merging is one of the most efficient ways to create your own LLM without the cost of training from scratch. By combining a few top-ranked models, we can build a custom model that often delivers better performance, improved efficiency, and a closer fit to our use cases.
In this blog post, I'll cover how model merging can transform your AI development workflow, using Ori Global Cloud (OGC) and Mergekit to blend the best code-generating LLMs into an even better one: a mini developer assistant.
If you're already familiar with Mergekit, you can jump straight to the model built in this post on Ori's Hugging Face page: ori-cloud/ds-trinity-7b-v1.
Model merging blends two or more LLMs into a single, custom model. It's a fast and effective way to build models cheaply (only a CPU is needed, no GPUs), it works surprisingly well, and it has produced many state-of-the-art models on the Open LLM Leaderboard.
Merging combines multiple pre-trained models into a single, more robust model. Unlike traditional training, which builds a model from the ground up using vast datasets and significant computational power, merging leverages existing models that have already been trained on diverse datasets. This allows the unique strengths and knowledge bases of the individual models to be combined, resulting in a composite model that performs better, or is more versatile, than any of its constituents. Merging also significantly reduces the resources and time needed for model enhancement: instead of spending weeks or months on training and fine-tuning, developers can merge models in a fraction of the time and with far less computational demand.
Mergekit allows for straightforward integration into AI development workflows. It supports various merging techniques, making it adaptable to different requirements and objectives.
By providing a practical solution for merging pre-trained models, Mergekit opens up new possibilities for AI development, making it easier for developers to exploit the full potential of existing LLMs and accelerate innovation in AI applications.
These techniques represent Mergekit's diverse approaches to model integration, each with unique strengths: SLERP preserves the geometric integrity of weight vectors during interpolation, while Passthrough enables frankenmerges with custom parameter counts.
Before we dive into setting up the Mergekit environment, here is a quick guide on how to provision a GPU instance on Ori.
First, let's install Mergekit.
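A minimal sketch of the installation, following the steps in the Mergekit README (the repository URL assumes its current home under arcee-ai):

```bash
# Clone the Mergekit repository and install it in editable mode
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .
```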
Now we'll add the dependencies: follow the GitHub page to install the additional libraries needed. Before installing them, run python3 -m pip install --upgrade pip to avoid Python version errors.
Depending on your chosen use case, select a base model and other well-performing models that target the same task.
In this guide, our goal is to create a better Code Generation model.
The Big Code Models Leaderboard was used to select Llama-2-based models that rank well at code generation. This leaderboard benchmarks models on HumanEval, a code-generation dataset.
Three Llama-based models were chosen with matching specifications: 6.74B parameters in this case, and tensor type BF16.
Base model = deepseek-ai/deepseek-coder-6.7b-base
Model 1 = deepseek-ai/deepseek-coder-6.7b-instruct
Model 2 = m-a-p/OpenCodeInterpreter-DS-6.7B
TIES is a popular merging method, in part because it can merge more than two models at once. The merge YAML configuration includes the following parameters (a full configuration sketch follows their descriptions):
Density: Refers to the proportion of weight differences from the base model that are kept.
Gradient values: a sequence of floating-point numbers that dictate the blending proportions when merging tensors from two models, usually within the range 0.0 to 1.0. (Read more about Gradient Parameters.)
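Putting these together, here is a minimal sketch of a TIES configuration for our three models; the density and weight values are illustrative assumptions, not necessarily the exact values used for ds-trinity-7b-v1:

```yaml
models:
  - model: deepseek-ai/deepseek-coder-6.7b-base
    # the base model is the reference point for weight differences
  - model: deepseek-ai/deepseek-coder-6.7b-instruct
    parameters:
      density: 0.5  # keep half of the weight differences from the base
      weight: 0.5   # blending proportion for this model's tensors
  - model: m-a-p/OpenCodeInterpreter-DS-6.7B
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: deepseek-ai/deepseek-coder-6.7b-base
parameters:
  normalize: true
dtype: bfloat16
```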
With the configuration saved (for example, as config.yaml), the merge is run with the mergekit-yaml CLI; a sketch of the full command follows the flag descriptions below. Understanding its parameters:
--allow-crimes (allows mixing architectures)
--copy-tokenizer (copies a tokenizer to the output)
--out-shard-size 1B (number of parameters per output shard)
--lazy-unpickle (experimental lazy unpickling for lower memory usage)
Additionally, we may use the following parameters:
--low-cpu-memory (stores results and intermediate values on the GPU; useful if VRAM > RAM)
--write-model-card (outputs a README.md containing details of the merge)
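Putting it together, a hedged sketch of the full command (the config filename and output path are illustrative):

```bash
mergekit-yaml config.yaml ./output-model-directory \
  --allow-crimes \
  --copy-tokenizer \
  --out-shard-size 1B \
  --lazy-unpickle
```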
The above should now start downloading the models into the “output-model-directory” folder. Merge time varies with the type of GPU/CPU you are using; in this case, a single V100 GPU with 16GB VRAM and 8 vCPUs was used.
Once the merge completes, the new model weights can be uploaded to Hugging Face using a WRITE token. You may create an organisation, or use any personal space, where the model can be uploaded. Use the following Python script to initiate the upload.
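A minimal sketch using the huggingface_hub library; the token placeholder, repository name, and folder path should be replaced to match your own setup:

```python
from huggingface_hub import HfApi

# Authenticate with a WRITE-scoped Hugging Face token (placeholder shown)
api = HfApi(token="hf_your_write_token")

# Create the target repository if it doesn't exist yet
api.create_repo(repo_id="ori-cloud/ds-trinity-7b-v1", exist_ok=True)

# Upload the merged model weights and config files
api.upload_folder(
    folder_path="./output-model-directory",
    repo_id="ori-cloud/ds-trinity-7b-v1",
)
```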
Due to the large size of the newly merged model, a GPU with higher specs (in particular, more VRAM) can be used to check its performance. In this case, 1x H100 was used.
Install all the Python dependencies as suggested earlier in the guide; a virtual environment, however, is not needed just to run the model.
First, run the new merged model using the base model's tokenizer, deepseek-ai/deepseek-coder-6.7b-base, and its prompt format in Python:
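A hedged sketch with transformers; the prompt is illustrative, and the base tokenizer expects plain completion-style prompts rather than chat turns:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_model_id = "ori-cloud/ds-trinity-7b-v1"

# Load the merged weights, but pair them with the base model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
model = AutoModelForCausalLM.from_pretrained(
    merged_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Completion-style prompt: the model continues the code
prompt = "# Write a Python function that checks whether a number is prime\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```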
Then repeat the run with each instruct-style tokenizer and its prompt format: first the AutoTokenizer from m-a-p/OpenCodeInterpreter-DS-6.7B, then the one from deepseek-ai/deepseek-coder-6.7b-instruct (a combined sketch follows):
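A combined sketch for both instruct-style tokenizers, assuming each ships a chat template (both derive from deepseek-coder-6.7b-instruct, which does); the user message is illustrative, and the model object is reused from the previous snippet:

```python
for tokenizer_id in [
    "m-a-p/OpenCodeInterpreter-DS-6.7B",
    "deepseek-ai/deepseek-coder-6.7b-instruct",
]:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    messages = [{"role": "user",
                 "content": "Write a Python function that checks whether a number is prime."}]
    # apply_chat_template wraps the message in this tokenizer's own prompt format
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=256)
    print(f"--- {tokenizer_id} ---")
    # Decode only the newly generated tokens
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```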
Now we can see that the best output generated above comes from the m-a-p/OpenCodeInterpreter-DS-6.7B tokenizer. To make the model usable, we'll update the tokenizer files of the new merged model accordingly: the eos_token_id: <value> setting, tokenizer.json, and tokenizer_config.json.
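One way to apply the change, as a hedged sketch, is simply to save the winning tokenizer's files over those of the merged model before re-uploading (the directory path is illustrative):

```python
from transformers import AutoTokenizer

# Load the tokenizer that produced the best output...
tokenizer = AutoTokenizer.from_pretrained("m-a-p/OpenCodeInterpreter-DS-6.7B")

# ...and write its files (tokenizer.json, tokenizer_config.json, etc.)
# over those in the merged model's folder
tokenizer.save_pretrained("./output-model-directory")
```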
The techniques we've discussed for merging pre-trained models are not limited to the use case shown here. Mergekit can integrate a wide variety of LLMs, offering a broad canvas for innovation and customisation. For those eager to dive deeper and experiment firsthand, the /examples folder in the mergekit GitHub repository is an excellent resource, filled with sample scripts and scenarios to test and learn from.
We encourage you not just to follow along but to actively participate in this journey: try merging models yourself and experiment with Mergekit's capabilities. To get you started, the merged model we've discussed, ori-cloud/ds-trinity-7b-v1, is readily available on the Hugging Face platform for you to try out.
Stay tuned for our next blog posts where we'll share insights on measuring LLM performance, and much more!
References:
Hugging Face - Code Llama Models
Hugging Face’s Docker Spaces