
Commit e7f6bca

juliensimon, Julien Simon, philschmid, jeffboudier, pcuenca
authored
New post: SafeCoder vs. Closed-source Services (#1447)
* Initial version
* Update safecoder-vs-closed-models.md Co-authored-by: Philipp Schmid <[email protected]>
* Update safecoder-vs-closed-models.md Co-authored-by: Philipp Schmid <[email protected]>
* Update safecoder-vs-closed-models.md Co-authored-by: Philipp Schmid <[email protected]>
* Update safecoder-vs-closed-models.md Co-authored-by: Philipp Schmid <[email protected]>
* Update safecoder-vs-closed-models.md Co-authored-by: Philipp Schmid <[email protected]>
* Fix typo
* Update safecoder-vs-closed-models.md Co-authored-by: Jeff Boudier <[email protected]>
* Update safecoder-vs-closed-models.md Co-authored-by: Jeff Boudier <[email protected]>
* Update safecoder-vs-closed-models.md Co-authored-by: Pedro Cuenca <[email protected]>
* Minor changes
* Add image
* - Minor tweaks
  - Update blog index

---------

Co-authored-by: Julien Simon <[email protected]>
Co-authored-by: Philipp Schmid <[email protected]>
Co-authored-by: Jeff Boudier <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 829b46b commit e7f6bca

File tree

3 files changed

+97
-0
lines changed

_blog.yml

Lines changed: 8 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -2777,3 +2777,11 @@
2777 2777
- collaboration
2778 2778
- diffusers
2779 2779
- diffusion
2780+
2781+
- local: safecoder-vs-closed-source-code-assistants
2782+
title: "SafeCoder vs. Closed-source Code Assistants"
2783+
author: julsimon
2784+
thumbnail: /blog/assets/safecoder-vs-closed-source-code-assistants/image.png
2785+
date: September 11, 2023
2786+
tags:
2787+
- bigcode
128 KB
Lines changed: 89 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,89 @@
1+
---
2+
title: "SafeCoder vs. Closed-source Code Assistants"
3+
thumbnail: /blog/assets/safecoder-vs-closed-source-code-assistants/image.png
4+
authors:
5+
- user: juliensimon
6+
---
7+
8+
# SafeCoder vs. Closed-source Code Assistants
9+
10+
<!-- {blog_metadata} -->
11+
<!-- {authors} -->
12+
13+
14+
For decades, software developers have designed methodologies, processes, and tools that help them improve code quality and increase productivity. For instance, agile, test-driven development, code reviews, and CI/CD are now staples in the software industry.
15+
16+
In "How Google Tests Software" (Addison-Wesley, 2012), Google reports that fixing a bug during system tests - the final testing stage - is 1000x more expensive than fixing it at the unit testing stage. This puts a lot of pressure on developers - the first link in the chain - to write quality code from the get-go.
17+
18+
For all the hype surrounding generative AI, code generation seems a promising way to help developers deliver better code fast. Indeed, early studies show that managed services like [GitHub Copilot](https://github.blog/2023-06-27-the-economic-impact-of-the-ai-powered-developer-lifecycle-and-lessons-from-github-copilot) or [Amazon CodeWhisperer](https://aws.amazon.com/codewhisperer/) help developers be more productive.
19+
20+
However, these services rely on closed-source models that can't be customized to your technical culture and processes. Hugging Face released [SafeCoder](https://huggingface.co/blog/safecoder) a few weeks ago to fix this. SafeCoder is a code assistant solution built for the enterprise that gives you state-of-the-art models, transparency, customizability, IT flexibility, and privacy.
21+
22+
In this post, we'll compare SafeCoder to closed-source services and highlight the benefits you can expect from our solution.
23+
24+
25+
## State-of-the-art models
26+
27+
SafeCoder is currently built on top of the [StarCoder](https://huggingface.co/blog/starcoder) models, a family of open-source models designed and trained within the [BigCode](https://huggingface.co/bigcode) collaborative project.
28+
29+
StarCoder is a 15.5 billion parameter model trained for code generation in over 80 programming languages. It uses innovative architectural concepts, like [Multi-Query Attention](https://arxiv.org/abs/1911.02150) (MQA), to improve throughput and reduce latency. This technique is also present in the [Falcon](https://huggingface.co/blog/falcon) models and was adapted for the [LLaMa 2](https://huggingface.co/blog/llama2) models.
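
To give a feel for why MQA helps, here is a back-of-the-envelope sketch of the key/value cache a decoder keeps in memory while generating. With MQA, all query heads share a single key/value head, so the cache shrinks by a factor equal to the number of heads, and less memory has to be read for every generated token. This is an estimate, not a measurement: `kv_cache_bytes` is a hypothetical helper, and the layer count, head count, head size, and 16-bit precision are illustrative assumptions for a StarCoder-sized model that are not stated in this post.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Estimate the key/value cache size for one sequence.

    Each layer stores one key and one value vector of size
    n_kv_heads * head_dim for every token, hence the factor 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative assumptions (not taken from this post).
N_LAYERS = 40
N_HEADS = 48
HEAD_DIM = 128
SEQ_LEN = 8192  # the context window mentioned in this post

# Multi-head attention: every query head has its own key/value head.
mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)

# Multi-query attention: all query heads share one key/value head.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head : {mha / 2**20:.1f} MiB")
print(f"multi-query: {mqa / 2**20:.1f} MiB")
print(f"reduction  : {mha // mqa}x")
```

Under these assumptions, the cache for a full 8192-token sequence drops from about 7.5 GiB to 160 MiB, which leaves room for larger batches on the same hardware.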
30+
31+
StarCoder has an 8192-token context window, helping it take into account more of your code to generate new code. It can also do fill-in-the-middle, i.e., insert within your code, instead of just appending new code at the end.
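
As an illustration of fill-in-the-middle, the sketch below builds a prompt from the code before and after the cursor. It assumes the `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` sentinel tokens described on the StarCoder model card; `build_fim_prompt` is a hypothetical helper, and the way SafeCoder's editor integrations assemble prompts may differ.

```python
# Sentinel tokens assumed from the StarCoder model card.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt.

    The model sees the code before the cursor (prefix) and after the
    cursor (suffix), and is asked to generate the missing middle part.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before and after the cursor position in the editor.
before_cursor = "def print_hello_world():\n    "
after_cursor = "\n    print('Hello world!')"

prompt = build_fim_prompt(before_cursor, after_cursor)
print(prompt)
```

The text the model generates after `<fim_middle>` is what gets inserted at the cursor, between the prefix and the suffix.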
32+
33+
Lastly, like [HuggingChat](https://huggingface.co/chat/), SafeCoder will introduce new state-of-the-art models over time, giving you a seamless upgrade path.
34+
35+
Unfortunately, closed-source code assistant services don't share information about the underlying models, their capabilities, or their training data.
36+
37+
## Transparency
38+
39+
In line with the [Chinchilla Scaling Law](https://arxiv.org/abs/2203.15556v1), the StarCoder model behind SafeCoder is a compute-optimal model trained on 1 trillion (1,000 billion) code tokens. These tokens are extracted from [The Stack](https://huggingface.co/datasets/bigcode/the-stack), a 2.7 terabyte dataset built from permissively licensed open-source repositories.
40+
All efforts are made to honor opt-out requests, and we built a [tool](https://huggingface.co/spaces/bigcode/in-the-stack) that lets repository owners check if their code is part of the dataset.
41+
42+
In the spirit of transparency, our [research paper](https://arxiv.org/abs/2305.06161) discloses the model architecture, the training process, and detailed metrics.
43+
44+
Unfortunately, closed-source services stick to vague information, such as "[the model was trained on] billions of lines of code." To the best of our knowledge, no metrics are available.
45+
46+
## Customization
47+
48+
The StarCoder models have been specifically designed to be customizable, and we have already built different versions:
49+
50+
* [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): the original model trained on 80+ languages from The Stack.
51+
* [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
52+
* [StarCoder+](https://huggingface.co/bigcode/starcoderplus): StarCoderBase further trained on English web data for coding conversations.
53+
54+
We also shared the [fine-tuning code](https://github.com/bigcode-project/starcoder/) on GitHub.
55+
56+
Every company has its preferred languages and coding guidelines, e.g., how to write inline documentation or unit tests, or do's and don'ts on security and performance. With SafeCoder, we can help you train models that learn the peculiarities of your software engineering process. Our team will help you prepare high-quality datasets and fine-tune StarCoder on your infrastructure. Your data will never be exposed to anyone.
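
To make dataset preparation a little more concrete, here is a minimal sketch of one possible first step: keeping only source files in your preferred languages and dropping empty files and exact duplicates. It is an illustration under stated assumptions, not the actual SafeCoder workflow. `build_records` and the record fields are hypothetical, and a real pipeline would typically add steps such as near-duplicate detection and removal of secrets.

```python
import hashlib


def build_records(files: dict[str, str],
                  extensions: tuple[str, ...] = (".py",)) -> list[dict[str, str]]:
    """Turn a mapping of file path -> file content into training records.

    Keeps files whose path ends with one of the given extensions,
    skips files with only whitespace, and drops exact duplicates
    (same content hash), keeping the first path in sorted order.
    """
    seen_hashes = set()
    records = []
    for path in sorted(files):
        content = files[path]
        if not path.endswith(extensions):
            continue  # not one of the selected languages
        if not content.strip():
            continue  # empty or whitespace-only file
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a file already kept
        seen_hashes.add(digest)
        records.append({"path": path, "content": content})
    return records


# Hypothetical in-memory repository snapshot.
repo = {
    "app/main.py": "print('hello')\n",
    "app/copy_of_main.py": "print('hello')\n",
    "app/empty.py": "   \n",
    "README.md": "# My project\n",
}

records = build_records(repo)
print([record["path"] for record in records])
```

Because everything runs on your own machines, a step like this never requires your code to leave your infrastructure.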
57+
58+
Unfortunately, closed-source services cannot be customized.
59+
60+
## IT flexibility
61+
62+
SafeCoder relies on Docker containers for fine-tuning and deployment. It's easy to run on-premise or in the cloud on any container management service.
63+
64+
In addition, SafeCoder includes our [Optimum](https://github.com/huggingface/optimum) hardware acceleration libraries. Whether you work with CPU, GPU, or AI accelerators, Optimum will kick in automatically to help you save time and money on training and inference. Since you control the underlying hardware, you can also tune the cost-performance ratio of your infrastructure to your needs.
65+
66+
Unfortunately, closed-source services are only available as managed services.
67+
68+
## Security and privacy
69+
70+
Security is always a top concern, all the more when source code is involved. Intellectual property and privacy must be protected at all costs.
71+
72+
Whether you run on-premise or in the cloud, SafeCoder is under your complete administrative control. You can apply and monitor your security checks and maintain strong and consistent compliance across your IT platform.
73+
74+
SafeCoder doesn't spy on any of your data. Your prompts and suggestions are yours and yours only. SafeCoder doesn't call home or send telemetry data to Hugging Face or anyone else. No one but you needs to know how and when you're using SafeCoder. SafeCoder doesn't even require an Internet connection. You can (and should) run it fully air-gapped.
75+
76+
Closed-source services rely on the security of the underlying cloud. Whether this works or not for your compliance posture is your call. For enterprise users, prompts and suggestions are not stored (for individual users, they are). However, we regret to point out that GitHub collects ["user engagement data"](https://docs.github.com/en/copilot/overview-of-github-copilot/about-github-copilot-for-business) with no possibility to opt out. AWS does the same by default but lets you [opt out](https://docs.aws.amazon.com/codewhisperer/latest/userguide/sharing-data.html).
77+
78+
## Conclusion
79+
80+
We're very excited about the future of SafeCoder, and so are our customers. No one should have to compromise on state-of-the-art code generation, transparency, customization, IT flexibility, security, and privacy. We believe SafeCoder delivers them all, and we'll keep working hard to make it even better.
81+
82+
If you're interested in SafeCoder for your company, please [contact us](mailto:[email protected]). Our team will get back to you shortly to learn more about your use case and discuss requirements.
83+
84+
Thanks for reading!
85+
86+
87+
88+
89+
