content/en/docs/about/ip.md (3 additions & 2 deletions)

@@ -15,13 +15,14 @@ toc: true
We believe the soul of BigCode to be clear and transparent communication striving towards open collaboration. The project, therefore, runs under the following set of open and permissive licenses.

+**Datasets**. We value openness and transparency about the training data of LLMs and intend to release datasets whenever we have the rights to do so. We will also provide data cards for all datasets we release. Please see the [Dataset Card for The Stack](https://huggingface.co/datasets/bigcode/the-stack#dataset-card-for-the-stack). We are aware of ongoing discussions around the data governance of LLMs and would like to research better processes for it. See, for example, how we experiment with giving developers the possibility to [have their data removed from The Stack](https://www.bigcode-project.org/docs/about/the-stack/). You can track [which models have been trained using The Stack](https://huggingface.co/models?dataset=dataset:bigcode/the-stack) on the Hugging Face platform.
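The model listing linked above can also be queried programmatically. Below is a minimal sketch, assuming the `huggingface_hub` client library is installed and that models trained on The Stack carry the same `dataset:bigcode/the-stack` filter tag that the web listing uses:

```python
# Minimal sketch: list Hub models that declare The Stack as a training dataset.
# Assumes `pip install huggingface_hub` and that the "dataset:bigcode/the-stack"
# filter shown on the website is also exposed as a tag through the API.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(filter="dataset:bigcode/the-stack"):
    print(model.id)
```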

**Code**. All inbound code contributions (e.g. for model training or dataset analysis) must be made under an [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). All outbound code will be made available under the Apache 2.0 license. The Apache 2.0 license is the most commonly used open source license due to its permissive character and clarity regarding copyright and patents.

**Documentation**. Documentation will be received and made available by the Project under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). For the sake of clarity, “Documentation” means any material related to the project which is not code, a machine learning model or related features, or a dataset. For instance, “Documentation” includes, but is not limited to, specifications, guidelines, blog posts, and academic papers.

-**Machine Learning Models**. Any machine learning model and related features (e.g. checkpoints) resulting from the Project will be licensed under an [Open & Responsible AI License](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses). Please take a look at the [FAQ for our CodeML OpenRAIL-M v0.1](https://www.bigcode-project.org/docs/pages/model-license-faq). OpenRAILs are licenses designed to permit free and open access, re-use, and downstream distribution of the Model and its derivatives while establishing a set of behavioral-use restrictions for which the model cannot be used, due to ethics-informed concerns and/or the technical limitations of the model as informed by its model card.
+**Machine Learning Models**. Any machine learning model and related features (e.g. checkpoints) resulting from the Project will be licensed under an [Open & Responsible AI License](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses). Please take a look at the [FAQ for our BigCode OpenRAIL-M](https://www.bigcode-project.org/docs/pages/bigcode-openrail/). OpenRAILs are licenses designed to permit free and open access, re-use, and downstream distribution of the Model and its derivatives while establishing a set of behavioral-use restrictions for which the model cannot be used, due to ethics-informed concerns and/or the technical limitations of the model as informed by its model card.

-**Datasets**. We value openness and transparency about the training data of LLMs and intend to release datasets whenever we have the rights to do so. We will also provide [data cards](https://arxiv.org/abs/2204.01075) for all datasets we release. We are aware of ongoing discussions around the data governance of LLMs and would like to research better processes for it. See, for example, how we experiment with giving developers the possibility [to have their data removed from The Stack]({{< relref "docs/about/the-stack.md" >}}).
## Contributions under a different license
We are flexible and understand that each individual contributor or contributing party might have their own interests besides the collective BigCode effort. If you do not feel comfortable licensing some of your contributions to the project under Apache 2.0, please get in touch with us and we will work out an arrangement that everyone is comfortable with. Note that for contributions with a non-permissive license, our general policy is to put them in a separate GitHub repository living outside the BigCode organisation.

content/en/docs/about/organization.md (21 additions & 2 deletions)

@@ -14,10 +14,25 @@ toc: true
---
## Sponsors

-BigCode is led by [ServiceNow Research](https://servicenow.com/research) and [HuggingFace](https://huggingface.co). Both organizations commit research and engineering time to ensure that the collaboration runs smoothly and makes progress towards the pre-set goals. ServiceNow Research also makes their compute cluster available for large-scale training.
+BigCode is a community project jointly led by [Hugging Face](https://huggingface.co) and [ServiceNow](https://servicenow.com/research). Both organizations committed research, engineering, ethics, governance, and legal resources to ensure that the collaboration runs smoothly and makes progress towards the stated goals. ServiceNow Research and Hugging Face have made their respective compute clusters available for large-scale training of the BigCode models, and Hugging Face hosts the community’s datasets, models, and related applications to make them easy for everyone to access and use.

+## Project Governance

+The BigCode project is governed by a steering committee jointly led by ServiceNow and Hugging Face. The steering committee is responsible for organizing and managing the project (including research strategy and publication goals) and provides oversight across all working groups.

+Decisions that cannot be addressed at the community level are elevated to the lead of the relevant Working Group for facilitated discussion, with further input and tie-breaker decision making by the Steering Committee as a last resort.

+Governance for the project is open, meaning that the BigCode project encourages anyone from the community to join any working group or task force of interest and to engage and contribute to the work and decision making in that group.

+Please see the [Governance Card](https://huggingface.co/datasets/bigcode/governance-card) for more details.
## Members

-We will shortly provide more details on the governance structure of the project. For now, the organization is led by **core members** of ServiceNow and HuggingFace while we are onboarding **contributors**. Core members dedicate a significant portion of their working time to the BigCode project whereas contributors advise on specific aspects of the project or take on smaller tasks.
+BigCode is a research collaboration and is open to participants who:
+- have a professional research background and
+- are able to commit time to the project.

+In general, participants are affiliated with a research organization (either in academia or industry) and work on the technical/ethical/legal aspects of LLMs for coding applications.

+Community-invited guest subject matter experts are also encouraged to participate in relevant discussions where they are able to make an active contribution to the goals of the project.
## Operations
We run the BigCode project through the following tools and platforms:
@@ -27,5 +42,9 @@ We run the BigCode project through the following tools and platforms:
- We host all code repositories on [Github](https://github.com/bigcode-project)
- We host all model weights and datasets on [HuggingFace](https://huggingface.co/BigCode)

+## Supporters

+We are thankful for the support and contributions of the broader AI ecosystem. In particular, we would like to thank [Toloka](https://www.toloka.ai) for supporting BigCode with their crowdsourcing platform and professional services in support of the work of our PII task force.
<!-- ## Bi-yearly goals
Big Code runs on a bi-yearly cadence. Each half year, the core members discuss and set milestones for the next six months of work. You can find the objectives for the first iteration here. -->

content/en/docs/about/the-stack.md

-As part of the BigCode project, we released and will maintain [The Stack v1.1](https://huggingface.co/datasets/bigcode/the-stack), a 6.4 TB dataset of permissively licensed source code in 358 programming languages. One of our goals in this project is to give people agency over their source code by letting them decide whether or not it should be used to develop and evaluate LLMs, as we acknowledge that not all developers may wish to have their data used for that purpose.
+As part of the BigCode project, we released and will maintain [The Stack](https://huggingface.co/datasets/bigcode/the-stack), a 6.4 TB dataset of permissively licensed source code in 358 programming languages, along with a collection of datasets created through the course of research during the project.

-Our first step to that end was to select source code with permissive licenses, i.e. those with minimal restrictions on how the software can be copied, modified and redistributed. You can find the list of [selected open-source licenses below]({{< relref "#licenses" >}}). In addition, we are giving developers the ability to have their code removed from the dataset upon request. The process for submitting and enacting removal requests will keep evolving throughout the project as we receive feedback and build up more data governance tools. The following FAQ presents the current state of this process, as well as the planned next steps.

+| Release | Description |
+| ------- | ----------- |
+| v1.0 | Initial release of The Stack. Included 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size. |
+| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages was increased from 30 to 358. Opt-out requests submitted by 15.11.2022 were also excluded from this version of the dataset. The resulting near-deduplicated dataset is 6 TB in size. |
+| v1.2 | Opt-out requests submitted by 09.02.2023 were excluded from this version of the dataset, as well as initially flagged malicious files (not exhaustive). |

+## Datasets and data governance tools released by BigCode

+- The Stack: Exact deduplicated version of The Stack.
+- The Stack dedup: Near deduplicated version of The Stack (recommended for training).
+- The Stack issues: Collection of GitHub issues.
+- The Stack Metadata: Metadata of the repositories in The Stack.
+- Am I in The Stack: Check if your data is in The Stack and request opt-out.

+One of our goals in this project is to give people agency over their source code by letting them decide whether or not it should be used to develop and evaluate LLMs, as we acknowledge that not all developers may wish to have their data used for that purpose.

+Our first step to that end was to select source code with permissive licenses, i.e. those with minimal restrictions on how the software can be copied, modified and redistributed. You can find the list of selected open-source licenses below. In addition, we are giving developers the ability to have their code removed from the dataset upon request. The process for submitting and enacting removal requests will keep evolving throughout the project as we receive feedback and build up more data governance tools. The following FAQ presents the current state of this process, as well as the planned next steps.
### How do I know if my data is in The Stack?

We have developed a tool to help users understand whether their data is in The Stack. Check out [Am I in The Stack?](https://huggingface.co/spaces/bigcode/in-the-stack).
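For a programmatic check alongside the hosted app, the sketch below streams a slice of The Stack and looks for a given repository name. It is only an illustration: it assumes the `datasets` library, that you have accepted the dataset's terms of use on the Hub, and that samples expose a `max_stars_repo_name` column under per-language `data/<language>` folders (the column and folder names are assumptions here).

```python
# Illustrative sketch only -- the "Am I in The Stack?" Space is the supported route.
# Assumptions: `datasets` is installed, you are logged in and have accepted the
# dataset's terms of use, and samples carry a `max_stars_repo_name` column.
from datasets import load_dataset

REPO = "your-username/your-repo"  # hypothetical repository name to look for

# Stream so that the multi-terabyte dataset is not downloaded in full.
stack_python = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

for i, sample in enumerate(stack_python):
    if sample.get("max_stars_repo_name") == REPO:
        print("Found a file from", REPO)
        break
    if i >= 100_000:  # bounded scan; this is only an illustration
        print("Repository not seen in the scanned slice.")
        break
```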
### How can I request that my data be removed from The Stack?

-In order to request that data from your repositories be removed from The Stack, we ask that you first fill out the [following form](https://forms.gle/6o2A6h3YcAuGYxtm7) with your GitHub username and the email address associated with your git activity. After submitting the form, we will invite you to a private repository on the BigCode organization and ask you to open an issue with the topic “remove my Github repositories from The Stack”. This will verify your Github username and we will mark all public repositories under your username for removal in the next dataset release cycle. The verification process is manual at the moment but we are looking into ways to fully automate it.
+You can opt out your repositories from The Stack dataset by creating an issue in our GitHub opt-out repository and listing the repositories you would like to exclude. We will then exclude those repositories in the next iteration of The Stack. To initiate this process, you should first check whether any of your repositories are actually in The Stack using the Am I in The Stack? app.

+If you decide that you wish to have repos owned by you removed from The Stack, please create an issue so that we can verify that you are in fact the owner of the repositories requested for opt-out.

-The form also has a field for general feedback and motivation for requesting the removal; it is not required for the request, but we would be very grateful for any additional information to help inform future data policies.
+If you are experiencing difficulty with this process, please email [email protected].
### What data can I request be removed from The Stack?

-Currently, you can request that we remove all public repositories under the provided username. In the coming months, we will be working on expanding the scope of data removal requests to address requests at a finer granularity (specific repositories, specific files) and to a greater range of contribution types (for example, based on whether a file or repository contains push events associated with your username according to [https://www.gharchive.org/](https://www.gharchive.org/)).
+You can request removal of either (1) all repositories that you own, or (2) a selection of specific repositories that you own. You can also specify commits and GitHub issues to be removed as part of your opt-out request. More details about this process are available on GitHub.
### Can I also prevent my data from being included in future versions of The Stack?

-The removal request form will be used to validate removal requests and remove appropriate data. Validated requests and associated code pointers will also be stored in order to ensure that the code does not appear in future versions of The Stack.
+The removal request process is used to validate opt-out requests and to remove the opted-out data. Validated requests and associated code pointers will also be stored in order to ensure that the code does not appear in future versions of The Stack.
### What happens to my data once I’ve requested its removal?
For as long as we are maintaining The Stack dataset, we will provide regular updates to the dataset to remove data that has been flagged since the last version. The current plan is to update the dataset every 3 months, although the schedule may change based on the volume of requests received. If we are not in a position to continue maintaining the dataset, we plan to stop distributing it in its current format and update its terms of use to limit its range of applications further, including for training new LLMs. Finally, we [require](https://huggingface.co/datasets/bigcode/the-stack#terms-of-use-for-the-stack) that people who download the dataset agree to use the most recent allowed version in order to incorporate the removal requests.
### What is the license for The Stack dataset? {#licenses}

-The Stack is a collection of source code from repositories with various licenses. Any use of code gathered in The Stack must abide by the code’s original license terms, including attribution clauses when relevant. To facilitate this, The Stack contains provenance information, including the source of the code and its license, for each data point.

-The Stack was filtered to include only permissive licenses, i.e. those with minimal restrictions on how the software can be copied, modified, and redistributed (e.g., MIT and Apache 2.0). Note that we intentionally exclude copyleft licenses like GPL, as this community has [strongly expressed their concern](https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/) that machine learning models and inferred outputs violate the terms and conditions of the licenses.

-For The Stack v1.0, we included the following list of [SPDX license identifiers](https://spdx.org/licenses/) in the dataset:
-- MIT-0
-- MIT
-- MIT-feh
-- Apache-2.0
-- BSD-3-Clause
-- BSD-3-Clause-Clear
-- BSD-3-Clause-No-Nuclear-License-2014
-- BSD-2-Clause
-- CC0-1.0
-- EPL-1.0
-- MPL-2.0
-- Unlicense
-- ISC
-- Artistic-2.0
-- deprecated_LGPL-3.0+
-- deprecated_LGPL-2.1+
-- ECL-2.0
-- SHL-0.51
-- MPL-2.0-no-copyleft-exception

-You can find the license distribution in [Table 2 of the supporting research paper](https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i_7G5Bb/view). MIT and Apache 2.0 make up the majority of the released dataset.

+The Stack is a collection of source code from repositories with various licenses. Any use of code gathered in The Stack must abide by the code’s original license terms, including attribution clauses when relevant. We facilitate this by providing provenance information, including the source of the code and its license, for each data point.

-After releasing the dataset, it was brought to our attention that licenses such as MPL, LGPL, and EPL were erroneously labeled as permissive when they are in fact [weak copyleft licenses](https://blueoakcouncil.org/copyleft). For the new version of The Stack (v1.1), we have removed these weak copyleft license files from the dataset. We've also added more programming languages and permissive licenses to the dataset. You can find the new set of 193 licenses [here](https://huggingface.co/datasets/bigcode/the-stack/blob/main/licenses.json).

+The list of SPDX license identifiers included in the dataset can be found [here](https://huggingface.co/datasets/bigcode/the-stack/blob/main/licenses.json).
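For reference, the linked license list can also be fetched and inspected programmatically. A minimal sketch, assuming the `huggingface_hub` library, access to the gated dataset repository, and that `licenses.json` is a plain JSON document of SPDX identifiers (its exact structure is an assumption here):

```python
# Sketch: download and inspect the SPDX license list shipped with The Stack.
# Assumes `huggingface_hub` is installed and that you are logged in and have
# accepted the dataset's terms of use; the JSON structure is an assumption.
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bigcode/the-stack",
    repo_type="dataset",
    filename="licenses.json",
)

with open(path) as f:
    licenses = json.load(f)

print(f"{len(licenses)} entries, for example: {list(licenses)[:5]}")
```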