
How to Merge Models for Code-Generating LLMs

Generative AI coding assistants are powerful tools for software developers. Mergekit offers an easy way to blend pre-trained code LLMs and create your own custom LLM.

Model merging is one of the most efficient ways to create your own LLM with minimal overhead. By combining a few top-ranked models, we can build a custom model that often performs better, runs more efficiently, and is more closely tailored to our use cases.

In this blog post, I'll cover how merging models can transform your AI development workflow using Ori Global Cloud (OGC) and Mergekit to blend the best code-generating LLMs into an even better one, resulting in a mini-developer assistant.

If you're already familiar with Mergekit, you can jump straight to the model built in this post on Ori's Hugging Face page: ori-cloud/ds-trinity-7b-v1.

Merging Large Language Models

Model merging blends two or more LLMs into a single, custom model. It's a fast and effective way to build models cheaply (only a CPU is needed, no GPUs). Model merging works surprisingly well and has produced many state-of-the-art models on the Open LLM Leaderboard. Here's why merging is often more practical than building or fine-tuning an LLM yourself:

  1. Complexity and Cost: Training and fine-tuning Large Language Models (LLMs) present significant challenges in both complexity and computational resources required.

  2. Know-how: Expertise in deep learning and machine learning is a significant barrier to creating LLMs from the ground up. The initial training phase of an LLM is typically accessible only to organizations with substantial resources.

  3. Data: LLMs require massive, diverse, and high-quality datasets for training to achieve broad understanding and generate coherent responses. Curating such datasets can be resource-intensive, requiring extensive cleaning, labelling, and preprocessing efforts.

  4. Single models: Individual models come with their own strengths, weaknesses, and failure modes. Combining models reduces the risk of relying on a single LLM and leverages the combined strengths of each one.

How Model Merging Works

Merging models involves combining multiple pre-trained models into a single, more robust model. Unlike traditional model training, which requires building a model from the ground up using vast datasets and significant computational power, model merging leverages existing models that have already been trained on diverse datasets. This approach allows for the amalgamation of the unique strengths and knowledge bases of individual models, resulting in a composite model that performs better or is more versatile than its constituents. Using the merge approach significantly reduces the resources and time needed for model enhancement. Instead of spending weeks or months on training and fine-tuning, developers can merge models in a fraction of the time, with less computational demand, achieving advanced capabilities.
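As a toy illustration of the core idea, the simplest possible merge is just a weighted average of two same-architecture models' parameters. The snippet below is a rough sketch only: the model names are placeholders, and real merge methods such as those in Mergekit are considerably more careful than a plain average.

import torch
from transformers import AutoModelForCausalLM

# Placeholder model names; both models must share the same architecture and parameter names.
model_a = AutoModelForCausalLM.from_pretrained("org/code-model-a", torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained("org/code-model-b", torch_dtype=torch.float16)

alpha = 0.5  # blend ratio: 0.0 keeps model A as-is, 1.0 replaces it with model B
state_b = model_b.state_dict()
merged_state = {
    name: (1 - alpha) * param + alpha * state_b[name]
    for name, param in model_a.state_dict().items()
}

model_a.load_state_dict(merged_state)        # reuse model A's architecture to hold the merged weights
model_a.save_pretrained("naive-linear-merge")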



Mergekit: Create your own custom code-generating LLM

Mergekit allows for straightforward integration into AI development workflows. It supports various merging techniques, making it adaptable to different requirements and objectives. 

By providing a practical solution for merging pre-trained models, Mergekit opens up new possibilities for AI development, making it easier for developers to exploit the full potential of existing LLMs and accelerate innovation in AI applications.

We'll explore four key techniques Mergekit uses for model integration, with the goal of producing a model that generates code from natural-language input.

These techniques represent mergekit's diverse approaches to model integration, each with unique strengths, from preserving vector integrity with SLERP to innovating with Passthrough for custom parameter scales.

  1. SLERP: This method interpolates between vectors on a spherical surface, maintaining a constant rate of change and preserving geometric integrity. It's favoured over linear interpolation in high-dimensional spaces because it avoids shrinking the vectors' scale and preserves directional change, both of which matter for what the weights encode. SLERP, which merges only two models at a time, normalises the vectors, calculates the angle between them, and applies interpolation-dependent scale factors to blend their properties (see the sketch after this list).

  2. TIES: Focuses on merging multiple models by reducing redundancy and resolving parameter sign conflicts. It trims excess parameters, elects a dominant sign direction, and merges aligned parameters, offering a multi-model merging capability.

  3. DARE: Similar to TIES but includes pruning (resetting weights to base values) and rescaling to maintain output expectations. Mergekit offers DARE with or without TIES's sign step, supporting the integration of multiple models while optimising parameter efficiency.

  4. Passthrough: A novel approach creating "frankenmerges" by layer concatenation, resulting in models with unique parameter counts. This experimental method has yielded large, innovative models by combining layers from different LLMs, showcasing the potential for creating highly customised AI tools.
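To make the SLERP steps concrete, here is a minimal, self-contained sketch of spherical interpolation between two weight tensors. It is only an illustration of the idea, not mergekit's own implementation, which handles edge cases and per-layer configuration beyond what is shown here.

import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherically interpolate between two weight tensors at blend ratio t in [0, 1]."""
    a, b = v0.flatten().float(), v1.flatten().float()
    a_n = a / (a.norm() + eps)                      # normalise both vectors
    b_n = b / (b.norm() + eps)
    dot = torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0)
    theta = torch.acos(dot)                         # angle between the vectors
    if theta.abs() < 1e-4:                          # nearly parallel: fall back to linear interpolation
        merged = (1 - t) * a + t * b
    else:
        s0 = torch.sin((1 - t) * theta) / torch.sin(theta)  # scale factor for v0
        s1 = torch.sin(t * theta) / torch.sin(theta)        # scale factor for v1
        merged = s0 * a + s1 * b
    return merged.reshape(v0.shape).to(v0.dtype)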

Merging code-generating LLMs with Mergekit: step by step

Before we dive into setting up the Mergekit environment, here is a quick guide on how to provision a GPU instance on Ori.

Setting up your environment

First, let's install Mergekit.

git clone https://github.com/arcee-ai/mergekit.git && cd mergekit


Now we'll add dependencies: follow the instructions on the GitHub page to install the additional libraries needed. Before you start installing them, run python3 -m pip install --upgrade pip to avoid pip version errors.
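For reference, on a fresh instance the setup typically boils down to something like the following, run from inside the cloned mergekit directory (the exact steps may change, so treat the repository README as the source of truth):

python3 -m pip install --upgrade pip
python3 -m pip install -e .   # installs mergekit and its dependencies in editable mode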

📣  In the process you may encounter an `externally-managed-environment` error, in which case you'll need to set up a virtual environment to address it:

sudo apt install python3-venv
python3 -m venv <virtual-env-name>
source <virtual-env-name>/bin/activate  # activates your virtual environment

 

Selecting Models to Merge

Depending on your use case, select a base model plus other well-performing models that target a similar task.

In this guide, our goal is to create a better Code Generation model. 

The Big Code Models Leaderboard was used to select models built on the Llama architecture that rank well for code generation. This leaderboard benchmarks models on HumanEval, a code-generation dataset.


Three Llama-architecture models were chosen with matching specifications: 6.74B parameters and BF16 tensor type.

Base model = deepseek-ai/deepseek-coder-6.7b-base

Model 1 = deepseek-ai/deepseek-coder-6.7b-instruct

Model 2 = m-a-p/OpenCodeInterpreter-DS-6.7B

Create a merge configuration in YAML format

With TIES being a popular merging method thanks to its ability to merge more than two models, the merge YAML configuration has:

  • deepseek-coder-6.7b-base set as the base model

  • model 1 and model 2 merged onto it. Model 1 (deepseek-coder-6.7b-instruct) gets a density gradient of [1, 0.7, 0.1], meaning the proportion of its weight differences that are kept tapers across the layers: 100% at the start, around 70% in the middle, and 10% at the end. Model 2 (OpenCodeInterpreter-DS-6.7B) gets a fixed density of 0.5 and a weight gradient of [0, 0.3, 0.7, 1], so its contribution grows layer by layer from 0% at the first layers to 100% at the last.

  • A weight of 1.0 on model 1, meaning its retained weight differences contribute at full strength in every layer. Because normalize is set to true, the weights are rescaled to sum to 1 at each layer, so model 1's relative share is 100% where model 2's weight is 0 and shrinks as model 2's layer-dependent weight grows.
models:
  - model: deepseek-ai/deepseek-coder-6.7b-instruct
    parameters:
      density: [1, 0.7, 0.1] # density gradient
      weight: 1.0
  - model: m-a-p/OpenCodeInterpreter-DS-6.7B
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient
merge_method: ties
base_model: deepseek-ai/deepseek-coder-6.7b-base
parameters:
  normalize: true
  int8_mask: true
dtype: float16


Density: Refers to the proportion of weight differences from the base model that are kept.

Gradient values: This parameter is a sequence of floating-point numbers, usually between 0.0 and 1.0, that dictates how the blending proportions vary across the model's layers when merging tensors from the two models. (Read more about Gradient Parameters.)
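To build intuition for how a gradient spec spreads across a model's layers, here is a rough illustration in Python. Mergekit's exact interpolation may differ in detail, but the idea is a piecewise-linear ramp between the listed anchor values:

import numpy as np

def layer_values(gradient, num_layers):
    """Interpolate a gradient spec (e.g. [1, 0.7, 0.1]) across layer indices."""
    anchors = np.linspace(0, 1, len(gradient))   # relative positions of the anchor values
    layers = np.linspace(0, 1, num_layers)       # relative position of each layer
    return np.interp(layers, anchors, gradient)

print(np.round(layer_values([1, 0.7, 0.1], 8), 2))
# [1.   0.91 0.83 0.74 0.61 0.44 0.27 0.1 ]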

We run the merge with the mergekit-yaml command, which takes the configuration file, an output directory, and a handful of flags; the ones used here are listed below, with a full example command after the list:

--allow-crimes  (allows mixing architectures)

--copy-tokenizer  (copy a tokenizer to the output) 

--out-shard-size 1B  (number of parameters per output shard)

--lazy-unpickle  (experimental lazy unpickle for lower memory usage)

Additionally, we may use the following optional flags:

--low-cpu-memory  (store results and intermediate values on GPU, useful if VRAM > RAM)

--write-model-card  (output README.md containing details of the merge)
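Putting it together, a typical invocation might look like the following, assuming the configuration above is saved as config.yaml (the flag set is illustrative; adjust it to your needs):

mergekit-yaml config.yaml output-model-directory \
    --allow-crimes --copy-tokenizer --out-shard-size 1B --lazy-unpickle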

The command starts by downloading the models and writes the merged weights into the output-model-directory folder. Merge time varies with the GPU/CPU you are using; in this case, a single V100 GPU with 16 GB VRAM and 8 vCPUs was used.




Upload the Merged Model

Once the merge completes, the new model weights can be uploaded to Hugging Face using a WRITE token.

You may create an organisation or any personal space where the model can be uploaded. 

Use the following Python script to initiate the upload.

from huggingface_hub import HfApi

# Assumes you are already authenticated with a WRITE token (e.g. via `huggingface-cli login`).
username = "neha-ori"
api = HfApi()
model_repo_name = "ori-cloud/ds-trinity-7b-v1"

api.upload_folder(
    repo_id=model_repo_name,
    folder_path="output-model-directory",
)


Check the output of the newly merged model

Because the newly merged model is large, a GPU with more VRAM makes it easier to check its output. In this case, a single H100 was used.

Install all the Python dependencies as suggested earlier in the guide; a virtual environment is not needed to run the model.

Running the new merged model with the base model's tokenizer, deepseek-ai/deepseek-coder-6.7b-base, and its prompt format in Python:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
model = AutoModelForCausalLM.from_pretrained(
    "ori-cloud/ds-trinity-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

 

Output A

#write a quick sort algorithm in python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)

print(quick_sort([3,6,8,10,1,2,1]))
# Output: [1, 1, 2, 3, 6, 8, 10]

# This is a quick sort algorithm that sorts an array in ascending order. It works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.
# The time complexity of the quick sort algorithm is O(n log n) in the average case and O(n^2) in the worst case. The space complexity is O(log n) due to the recursive call stack.
# The quick sort algorithm is an efficient sorting algorithm for large datasets. It is a divide-and-conquer algorithm that works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.
# The worst-case time complexity of quick sort is O(n^2), but this scenario is rare. The average-case time complexity is O(n log n), which makes quick sort an efficient sorting algorithm for large datasets. The space complexity of quick sort is O(log n) due to the recursive call stack.
...

 

Run the merged model with the m-a-p/OpenCodeInterpreter-DS-6.7B tokenizer and its prompt format in Python:

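A minimal sketch of this run, assuming OpenCodeInterpreter's chat template and following the same pattern as the deepseek-coder-6.7b-instruct example further below:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("m-a-p/OpenCodeInterpreter-DS-6.7B")
model = AutoModelForCausalLM.from_pretrained(
    "ori-cloud/ds-trinity-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{'role': 'user', 'content': "write a quick sort algorithm."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))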

 

Output B

Sure, here is a Python implementation of the Quick Sort algorithm:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Test the function
print(quick_sort([3,6,8,10,1,2,1]))  # Output: [1, 1, 2, 3, 6, 8, 10]
```

This algorithm works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.

Run the merged model with the deepseek-ai/deepseek-coder-6.7b-instruct tokenizer and its prompt format in Python:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "ori-cloud/ds-trinity-7b-v1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {'role': 'user', 'content': "write a quick sort algorithm."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# tokenizer.eos_token_id is the id of the <|EOT|> token
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

 

Output C

Setting `pad_token_id` to `eos_token_id`:32021 for open-end generation.

Sure, here is a simple implementation of the Quick Sort algorithm in Python:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Test the function
print(quick_sort([3,6,8,10,1,2,1]))  # Output: [1, 1, 2, 3, 6, 8, 10]
```

This algorithm works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted. Please note that this is a simple implementation and may not be the most efficient for large arrays. In practice, you would likely want to use a more efficient pivot selection strategy, such as the "median of three" strategy.

 

Update Tokenizer files

We can see that the best output above (Output B) was generated using the m-a-p/OpenCodeInterpreter-DS-6.7B tokenizer. To make the merged model usable on its own, we'll update three tokenizer-related files in the new merged model (a sketch of one way to do this follows the list):

  • config.json - replace the field eos_token_id: <value>

  • Replace the files tokenizer.json and tokenizer_config.json with the versions from m-a-p/OpenCodeInterpreter-DS-6.7B
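One way to do this is sketched below; the local output-model-directory path is an assumption, and you should verify the resulting files before re-uploading the model:

import json
from transformers import AutoTokenizer

out_dir = "output-model-directory"  # local folder holding the merged model (assumed path)

# Overwrite tokenizer.json and tokenizer_config.json with the better-performing tokenizer's files
tok = AutoTokenizer.from_pretrained("m-a-p/OpenCodeInterpreter-DS-6.7B")
tok.save_pretrained(out_dir)

# Update eos_token_id in config.json to match the new tokenizer
with open(f"{out_dir}/config.json") as f:
    cfg = json.load(f)
cfg["eos_token_id"] = tok.eos_token_id
with open(f"{out_dir}/config.json", "w") as f:
    json.dump(cfg, f, indent=2)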

Conclusion

The techniques we've discussed for merging pre-trained models are not limited to the use case provided. Mergekit is capable of facilitating the integration of various LLMs, offering a broad canvas for innovation and customisation. For those eager to dive deeper and experiment firsthand, the /examples folder in the mergekit GitHub repository is an excellent resource, filled with sample scripts and scenarios to test and learn from.

We encourage our readers to not just follow along but to actively participate in this journey. Try merging models yourself and experiment with the capabilities of Mergekit. To get you started, the merged model we've discussed is readily available to try out on the Hugging Face platform: ori-cloud/ds-trinity-7b-v1.

Stay tuned for our next blog posts where we'll share insights on measuring LLM performance, and much more!


 

References:

Hugging Face - Code Llama Models

Hugging Face’s Docker Spaces
