cross-posted from: https://lemmy.world/post/3879861
Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B
Hello everyone! This post marks an exciting moment for [email protected] and everyone in the open-source large language model and AI community.
We appear to have a new contender on the block, a model apparently capable of surpassing OpenAI's state of the art ChatGPT-4 in coding evals (evaluations).
This is huge. Not too long ago I made an offhand comment on us catching up to GPT-4 within a year. I did not expect that prediction to end up being reality in half the time. Let's hope this isn't a one-off scenario and that we see a new wave of open-source models that begin to challenge OpenAI.
Buckle up, it's going to get interesting!
Here's some notes from the blog, which you should visit and read in its entirety:
Blog Post
We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset that achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to their official technical report in March. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.
The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.
- CodeLlama-34B achieved 48.8% pass@1 on HumanEval
- CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval
We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent a native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples.
The methodology is:
- For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters.
- A match was identified if any sampled substring was a substring of the processed training example.
For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report. Presented below are the pass@1 scores we achieved with our fine-tuned models:
- Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
- Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
Download
We are releasing both models on Huggingface for verifiability and to bolster the open-source community. We welcome independent verification of results.
If you get a chance to try either of these models out, let us know how it goes in the comments below!
If you found anything about this post interesting, consider subscribing to [email protected].
Cheers to the power of open-source! May we continue the fight for optimization, efficiency, and performance.
I'm just glad for some positive results training ai on ai content. I assume theres still a lot of ways we can improve or adjust large bodies of data with AI.