
Commit faf1fc4

committed
Additional information added, including code and more detailed, better-explained READMEs
1 parent bf28986 commit faf1fc4

File tree

10 files changed: +1966 −4 lines


src/pages/MainPage/index.tsx

Lines changed: 2 additions & 1 deletion

@@ -157,7 +157,8 @@ export const MainPage: React.FC<Props> = ({data}) => {
      <br/>
      <br/>
      <Typography variant="body1" align="justify"> <b>Tagging process:</b> In the following <a href="https://github.com/TheSoftwareDesignLab/ML_best_practices/tree/main/tagging">link</a>, you will find the information related to the labels assigned to each post by each tagger. </Typography>
-     <Typography variant="body1" align="justify"> <b>Dataset:</b> In the following Zenodo <a href="https://zenodo.org/record/7908722#.ZFkxOS8RqJ8">link</a>, you will find the original posts (i.e., questions and answers) used in this study.</Typography>
+     <Typography variant="body1" align="justify"> <b>Dataset:</b> In the following Zenodo <a href="https://zenodo.org/record/8058979">link</a>, you will find the original posts (i.e., questions and answers) used in this study.</Typography>
+     <Typography variant="body1" align="justify"> <b>Code:</b> In the following <a href="https://github.com/TheSoftwareDesignLab/ML_best_practices/tree/main/used_code">link</a>, you will find the code used to extract the tagged and analyzed Stack Exchange posts.</Typography>
      </>
      : <TaxPage data={data} onBackClick={() => setOpenTax(false)}/>}
    </Container>

tagging/Readme.md

Lines changed: 13 additions & 1 deletion
@@ -1 +1,13 @@
- In this folder, we have the files related to the tagging and merging process
+ # Description
+
+ In this folder, we have the files related to the tagging and merging process. In particular, one file lists, for each answer (post), the tags that were assigned to it. For each row, we present the following information:
+
+ * *Tagger*: Identification of the tagger in charge. Possible values: Merge, Tagger1_, Tagger1_2, Tagger2_, Tagger2_2, Tagger3_, Tagger3_2. Merge indicates the final tags after the merging process (discussion). Tagger1_, Tagger2_, and Tagger3_ indicate that the tags in that row are the initial tags given by a tagger. Tagger1_2, Tagger2_2, and Tagger3_2 indicate that the tags in that row were given after a conflict was identified in the initial tags.
+ * *Answer ID*: The unique ID of the answer; it is complemented by the *Website_URL* column.
+ * *externalReferences*: External references (URLs) identified in the post during the tagging process.
+ * *goodPractice*: Good/best practices identified in the post during the tagging process.
+ * *isFalsePositive*: Indicates whether a post is a false positive.
+ * *mlPipeline*: ML pipeline(s) associated with the post.
+ * *Reason False Positive*: Extra clarification about why a post is considered a false positive.
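As a quick illustration of how these columns could be consumed downstream, the sketch below filters the final merged tags out of a handful of invented rows. The real rows live in Tagging_merging_info.xlsx; only the column names are taken from this README, everything else is made up:

```python
# Invented sample rows using the columns described above; the real data is in
# tagging/Tagging_merging_info.xlsx.
rows = [
    {"Tagger": "Tagger1_", "Answer ID": 101, "goodPractice": "use cross-validation", "isFalsePositive": False},
    {"Tagger": "Merge",    "Answer ID": 101, "goodPractice": "use cross-validation", "isFalsePositive": False},
    {"Tagger": "Merge",    "Answer ID": 102, "goodPractice": "", "isFalsePositive": True},
]

# Keep only the final tags (Tagger == "Merge", i.e., after the merging
# discussion), dropping rows flagged as false positives.
final_tags = [r for r in rows if r["Tagger"] == "Merge" and not r["isFalsePositive"]]
print([r["Answer ID"] for r in final_tags])  # → [101]
```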

tagging/Tagging_merging_info.xlsx

67.8 KB
Binary file not shown.

used_code/Nootebook1.ipynb

Lines changed: 711 additions & 0 deletions
Large diffs are not rendered by default.

used_code/Notebook2.ipynb

Lines changed: 894 additions & 0 deletions
Large diffs are not rendered by default.

used_code/Readme.md

Lines changed: 33 additions & 2 deletions
@@ -1,3 +1,34 @@
+ # General description of the code
  In this folder, the code used in the article can be found.
- * Notebook 1: It is used to extract the analyzed posts in the study, as filtered as possible
- * Notebook 2: It is used to extract all the machine learning posts (posts that have the machine-learning tag) and compare them with the ones that discuss best practices (BP)
+ * Notebook1.ipynb: Used to extract the posts analyzed in the study, as filtered as possible.
+ * Notebook2.ipynb: Used to extract all the machine learning posts (posts that have the machine-learning tag) and compare them with the ones that discuss best practices (BP).
+ * config_file.ini: Configuration file used to store the credentials and the suggested database names for each of the websites selected in the article.
+ * create_data_noSTO.sh: Bash script used to create the needed databases (DBs) for each of the Q&A websites mentioned in the article. It also loads the respective data into the DBs and creates the required indexes for the post table, using the auxiliary script *load_create_noSTO.sql*. **(Not used for the Stack Overflow (STO) dump.)**
+ * load_create_noSTO.sql: SQL script that creates the required tables and loads the information into the post table. **(Not used for the Stack Overflow (STO) dump.)**
+ * create_and_load_STO.sql: SQL script that was manually executed to create the STO database and the required tables and indexes for the posts.
+
+ ## Requisites
+ * MySQL (8.0) installed and running.
+ * A MySQL user name with a password.
+
+ ## Recommendations and guidelines
+
+ ### Data --> Databases
+ 1. Download the data from [Stack Exchange dump 03.2021](https://archive.org/details/stackexchange_20210301)
+ 2. Uncompress the data into a specific path, e.g., [path_to_the_base_folder], using the name of each database as a folder name, i.e., [path_to_the_base_folder]/[wesite_folder]/. For example, for the website *Data Science*, the dump must be decompressed into [path_to_the_base_folder]/Data Science/.
+ 3. **[Not for STO].** Inside the base folder (i.e., [path_to_the_base_folder]), create a new folder, e.g., [temp]. This folder is used in *create_data_noSTO.sh*.
+ 4. **[Not for STO].** In the script *create_data_noSTO.sh*, identify the [path_to_the_original_folder]/[temp] path and replace it with your own paths.
+ 5. **[Not for STO].** In the SQL script *load_create_noSTO.sql*, replace the path /[path_to_the_base_folder]/[temp] with your own path.
+ 6. **[Not for STO].** Execute *create_data_noSTO.sh* for each of the 13 studied Q&A websites (not including Stack Overflow). For this, you should replace [wesite_folder] with the corresponding value for each of the 13 websites. In addition, [Name_database] should be changed to a database name of your preference for each database. The *config_file.ini* file lists suggested names for all 13 websites.
+ 7. **[For STO].** Be sure to have extracted *Posts.xml*, then copy it into the [path_to_the_base_folder]/[temp]/ folder.
+ 8. **[For STO].** In the SQL script *create_and_load_STO.sql*, replace the path /[path_to_the_base_folder]/[temp] with your own path.
+ 9. **[For STO].** Read the script *create_and_load_STO.sql* and identify which parts you need to execute. In any case, *Part 1* is needed; then you can choose between executing *Part 2* (the table required for extracting the posts analyzed in the article) and *Part 3* (posts that do not consider the score as a filtering criterion).
+
+ ### Databases --> Studied Posts
+ * With *Notebook1.ipynb* you can export the set of posts used in the article (before tagging): a set of questions and answers filtered by defined criteria.
+ * *Notebook2.ipynb* is complementary material that allows you to extract a broader set of questions and answers related to "machine-learning" and/or "best practices", without considering the score of the posts.
+
+ ##### Acronyms
+ * STO = Stack Overflow
+ * BP = Best practices
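The filtering criteria mentioned above (Part 2 of *create_and_load_STO.sql*: questions with a positive score, an accepted answer, and the machine-learning tag) can be restated as a minimal pure-Python sketch. The sample posts below are invented for illustration; only the column names and conditions come from the SQL in this folder:

```python
# Invented sample posts mirroring the columns used by the Part 2 filter in
# create_and_load_STO.sql.
posts = [
    {"Id": 1, "PostTypeId": 1, "Score": 5, "AcceptedAnswerId": 10, "Tags": "<machine-learning><python>"},
    {"Id": 2, "PostTypeId": 1, "Score": 0, "AcceptedAnswerId": 11, "Tags": "<machine-learning>"},
    {"Id": 3, "PostTypeId": 2, "Score": 7, "AcceptedAnswerId": None, "Tags": ""},
]

def keep(p):
    # Question (PostTypeId = 1), positive score, accepted answer, ML tag.
    return (p["PostTypeId"] == 1
            and p["Score"] > 0
            and p["AcceptedAnswerId"] is not None
            and "machine-learning" in p["Tags"])

filtered = [p["Id"] for p in posts if keep(p)]
print(filtered)  # → [1]
```

Part 3 of the same script is the identical filter minus the `Score > 0` condition.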

used_code/config_file.ini

Lines changed: 6 additions & 0 deletions
[DATABASE]
HOST=localhost
database=ArtificiaIntelligenceExchange,CodeReviewExchange,ComputationalScienceExchange,ComputerScienceExchange,DataScienceExchange,ElectricalEngineeringExchange,OpenDataExchange,SignalProcessingExchange,SoftwareEngineeringExchange,StatisticalExchange,TheoreticalCOmputerScienceExchange,iOTExchange,EngineeringExchange,StackOverFlow
urls=https://ai.stackexchange.com/questions/,https://codereview.stackexchange.com/questions/,https://scicomp.stackexchange.com/questions/,https://cs.stackexchange.com/questions/,https://datascience.stackexchange.com/questions/,https://electronics.stackexchange.com/questions/,https://opendata.stackexchange.com/questions/,https://dsp.stackexchange.com/questions/,https://softwareengineering.stackexchange.com/questions/,https://stats.stackexchange.com/questions/,https://cstheory.stackexchange.com/questions/,https://iot.stackexchange.com/questions/,https://engineering.stackexchange.com/questions/,https://stackoverflow.com/questions/
user=root
passwd=ML2021ai_
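The `database` and `urls` keys are parallel comma-separated lists, so each suggested database name pairs with its website URL by position. A sketch of reading them with the standard-library `configparser` (shown with a trimmed two-entry copy of the section; the real file lists all 14 websites and its own credentials):

```python
import configparser

# Trimmed stand-in for used_code/config_file.ini; the real [DATABASE] section
# has 14 comma-separated entries in each of `database` and `urls`.
ini_text = """
[DATABASE]
HOST=localhost
database=DataScienceExchange,StackOverFlow
urls=https://datascience.stackexchange.com/questions/,https://stackoverflow.com/questions/
user=root
passwd=example
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
db = cfg["DATABASE"]

# Zip the parallel lists into a name -> URL mapping.
pairs = dict(zip(db["database"].split(","), db["urls"].split(",")))
print(pairs["StackOverFlow"])  # → https://stackoverflow.com/questions/
```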

used_code/create_and_load_STO.sql

Lines changed: 216 additions & 0 deletions
# This code is based on the following gist
# https://gist.github.com/enricorotundo/1e074af39d90629252a7df3fc1066397
# Some comments are taken from the following post in Stack Exchange
# https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
# We added the content licence for comments, posts, and postlinks.
# Be sure InnoDB is the default table engine; otherwise specify it (needed for a full-text index).
# If you are having problems loading data into your database:
# 1) run mysql with --secure_file_priv=""
# 2) If you are on a Mac and you are having problems running the complementary files, you can run mysql with mysqld_safe --secure_file_priv="" located in the bin folder
# 3) The path of your XML file has to be an absolute path

# - -------- ----------- ------------ --------------- ----------

# ---PART 1--- CREATE the Posts TABLE for Stack Overflow (the only difference from the other websites' tables is the absence of the title-body full-text index).
# It also creates the schema for the auxiliary table (PostsQuestionsFiltered).

CREATE TABLE Posts (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    # 1 = Question
    # 2 = Answer
    # 3 = Orphaned tag wiki
    # 4 = Tag wiki excerpt
    # 5 = Tag wiki
    # 6 = Moderator nomination
    # 7 = "Wiki placeholder" (seems to only be the election description)
    # 8 = Privilege wiki
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

CREATE TABLE PostsQuestionsFiltered (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

SELECT DATABASE();

SELECT COUNT(*) AS postsCount FROM Posts;

# Needed (it depends on your security settings and OS)
load xml LOCAL infile '/[path_to_the_base_folder]/[temp]/Posts.xml' into table Posts rows identified by '<row>';

show databases;

create index Posts_idx_1 on Posts(AcceptedAnswerId);
create index Posts_idx_2 on Posts(ParentId);
create index Posts_idx_3 on Posts(OwnerUserId);
create index Posts_idx_4 on Posts(LastEditorUserId);

SHOW INDEX FROM Posts;

CREATE FULLTEXT INDEX index_Tags ON Posts(Tags);

SHOW INDEX FROM Posts;

# ---PART 2--- (ARTICLE) Query that filters all the posts that are questions (this creates a smaller STO table)

# ---PART 2.1--- Populate PostsQuestionsFiltered
INSERT INTO PostsQuestionsFiltered
SELECT * FROM Posts p
WHERE p.PostTypeId = 1 and
    p.Score > 0 and p.AcceptedAnswerId is not NULL and
    MATCH (p.Tags) AGAINST ('"machine-learning"' IN BOOLEAN MODE);

# Due to INSERT performance, index creation is done after the data is loaded
create index Posts_idx_1 on PostsQuestionsFiltered(AcceptedAnswerId);
create index Posts_idx_2 on PostsQuestionsFiltered(ParentId);
create index Posts_idx_3 on PostsQuestionsFiltered(OwnerUserId);
create index Posts_idx_4 on PostsQuestionsFiltered(LastEditorUserId);

SHOW INDEX FROM PostsQuestionsFiltered;

CREATE FULLTEXT INDEX index_Tags ON PostsQuestionsFiltered(Tags);

SHOW INDEX FROM PostsQuestionsFiltered;

CREATE FULLTEXT INDEX index_Text_title ON PostsQuestionsFiltered(Title, Body);

SHOW INDEX FROM PostsQuestionsFiltered;

# ---PART 3--- Query that filters all the posts that are questions, NOT filtering by score (this creates a smaller STO table)
# We assume the Posts table has already been created with the indexes from PART 1

CREATE TABLE PostsQuestionsFilteredNoScore (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

INSERT INTO PostsQuestionsFilteredNoScore
SELECT * FROM Posts p
WHERE p.PostTypeId = 1 and
    p.AcceptedAnswerId is not NULL and
    MATCH (p.Tags) AGAINST ('"machine-learning"' IN BOOLEAN MODE);
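The `load xml ... rows identified by '<row>'` statement above maps each `<row>` element's attributes onto the table's columns by name. A self-contained sketch of that mapping, using an invented sample row that follows the dump's attribute naming:

```python
import xml.etree.ElementTree as ET

# Invented sample <row> in the style of a Stack Exchange Posts.xml dump; only
# the attribute names are taken from the schema above.
sample = ('<row Id="42" PostTypeId="1" Score="3" AcceptedAnswerId="43" '
          'Tags="&lt;machine-learning&gt;" CreationDate="2020-01-01T00:00:00" />')

row = ET.fromstring(sample)
# Each XML attribute corresponds to a column of the Posts table.
post = {
    "Id": int(row.attrib["Id"]),
    "PostTypeId": int(row.attrib["PostTypeId"]),
    "Score": int(row.attrib["Score"]),
    "Tags": row.attrib["Tags"],
}
print(post["Tags"])  # → <machine-learning>
```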

used_code/create_data_noSTO.sh

Lines changed: 5 additions & 0 deletions
#!/bin/bash

cp /[path_to_the_original_folder]/[wesite_folder]/Posts.xml /[path_to_the_base_folder]/[temp]
mysql --load-data-local-dir=/[path_to_the_base_folder]/[temp] -u [user] -p[password] -e "SET @@SESSION.SQL_MODE=''; create database [Name_database] DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci; use [Name_database]; $(cat load_create_noSTO.sql)"
rm -rf /[path_to_the_base_folder]/[temp]/Posts.xml
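Per the used_code README (step 6), the bracketed placeholders in this script are replaced by hand for each of the 13 websites. That textual substitution could also be sketched programmatically on an in-memory copy of a line (illustrative only; the paths and website name below are placeholders, not real values):

```python
# One line of create_data_noSTO.sh with its bracketed placeholders, copied
# verbatim from the script above.
template = ('cp /[path_to_the_original_folder]/[wesite_folder]/Posts.xml '
            '/[path_to_the_base_folder]/[temp]')

# Substitute the website folder for one of the 13 studied sites; the remaining
# placeholders would be replaced with your own paths the same way.
line = template.replace("[wesite_folder]", "Data Science")
print(line)
```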

used_code/load_create_noSTO.sql

Lines changed: 86 additions & 0 deletions
# This code is based on the following gist
# https://gist.github.com/enricorotundo/1e074af39d90629252a7df3fc1066397
# Some comments are taken from the following post in Stack Exchange
# https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
# We added the content licence for comments, posts, and postlinks.
# Be sure InnoDB is the default table engine; otherwise specify it (needed for a full-text index).
# If you are having problems loading data into your database:
# 1) uncomment the SET @@SESSION.SQL_MODE line below, and 2) run mysql with --secure_file_priv=""
# 2.1) If you are on a Mac and you are having problems running the complementary files, you can run mysql with mysqld_safe --secure_file_priv="" located in the bin folder
# 3) The paths of your XML files have to be absolute paths

# - -------- ----------- ------------ --------------- ----------

#SET @@SESSION.SQL_MODE='';

CREATE TABLE Posts (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    # 1 = Question
    # 2 = Answer
    # 3 = Orphaned tag wiki
    # 4 = Tag wiki excerpt
    # 5 = Tag wiki
    # 6 = Moderator nomination
    # 7 = "Wiki placeholder" (seems to only be the election description)
    # 8 = Privilege wiki
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

SELECT DATABASE();

SELECT COUNT(*) AS postsCount FROM Posts;

# Needed (it depends on your security settings and OS)
load xml LOCAL infile '/[path_to_the_base_folder]/[temp]/Posts.xml' into table Posts rows identified by '<row>';

show databases;

create index Posts_idx_1 on Posts(AcceptedAnswerId);
create index Posts_idx_2 on Posts(ParentId);
create index Posts_idx_3 on Posts(OwnerUserId);
create index Posts_idx_4 on Posts(LastEditorUserId);

SHOW INDEX FROM Posts;

CREATE FULLTEXT INDEX index_Tags ON Posts(Tags);

SHOW INDEX FROM Posts;

# This index is not created for Stack Overflow due to the amount of information and lack of resources; for STO it is created over a subset of the posts instead.
CREATE FULLTEXT INDEX index_Text_title ON Posts(Title, Body);

SHOW INDEX FROM Posts;
