
Commit faf1fc4

committed
Additional information added, including code and more detailed, better-explained READMEs
1 parent bf28986 commit faf1fc4

File tree

10 files changed: +1966 −4 lines


src/pages/MainPage/index.tsx

Lines changed: 2 additions & 1 deletion

@@ -157,7 +157,8 @@ export const MainPage: React.FC<Props> = ({data}) => {
      <br/>
      <br/>
      <Typography variant="body1" align="justify"> <b>Tagging process:</b> In the following <a href="https://github.com/TheSoftwareDesignLab/ML_best_practices/tree/main/tagging">link</a>, you will find the information related to the labels assigned to each post by each tagger. </Typography>
-     <Typography variant="body1" align="justify"> <b>Dataset:</b> In the following Zenodo <a href="https://zenodo.org/record/7908722#.ZFkxOS8RqJ8">link</a>, you will find the original posts (i.e., questions and answers) used in this study.</Typography>
+     <Typography variant="body1" align="justify"> <b>Dataset:</b> In the following Zenodo <a href="https://zenodo.org/record/8058979">link</a>, you will find the original posts (i.e., questions and answers) used in this study.</Typography>
+     <Typography variant="body1" align="justify"> <b>Code:</b> In the following <a href="https://github.com/TheSoftwareDesignLab/ML_best_practices/tree/main/used_code">link</a>, you will find the code used to extract the tagged and analyzed Stack Exchange posts.</Typography>
      </>
      : <TaxPage data={data} onBackClick={() => setOpenTax(false)}/>}
    </Container>

tagging/Readme.md

Lines changed: 13 additions & 1 deletion
@@ -1 +1,13 @@
- In this folder, we have the files related to the tagging and merging process
+ # Description
+
+ In this folder, we have the files related to the tagging and merging process. In particular, one file lists, for each answer (post), the tags that were assigned to it. For each row, we present the following information:
+
+ * *Tagger*: Identification of the tagger in charge. Possible values: Merge, Tagger1_, Tagger1_2, Tagger2_, Tagger2_2, Tagger3_, Tagger3_2. Merge indicates the final tags after the merging process (discussion). Tagger1_, Tagger2_, and Tagger3_ indicate that the tags in that row are the initial tags given by a tagger. Tagger1_2, Tagger2_2, and Tagger3_2 indicate that the tags in that row were given after a conflict was identified in the initial tags.
+ * *Answer ID*: The unique ID of the answer; it is complemented by the *Website_URL* column.
+ * *externalReferences*: External references (URLs) identified in the post during the tagging process.
+ * *goodPractice*: Good/best practices identified in the post during the tagging process.
+ * *isFalsePositive*: Indicates whether a post is a false positive.
+ * *mlPipeline*: ML pipeline(s) associated with the post.
+ * *Reason False Positive*: Extra clarification about why a post is considered a false positive.
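As a quick illustration of how these columns could be consumed downstream, the sketch below filters the final merged tags out of a handful of invented rows. The real rows live in Tagging_merging_info.xlsx; only the column names are taken from this README, everything else is made up:

```python
# Invented sample rows using the columns described above; the real data is in
# tagging/Tagging_merging_info.xlsx.
rows = [
    {"Tagger": "Tagger1_", "Answer ID": 101, "goodPractice": "use cross-validation", "isFalsePositive": False},
    {"Tagger": "Merge",    "Answer ID": 101, "goodPractice": "use cross-validation", "isFalsePositive": False},
    {"Tagger": "Merge",    "Answer ID": 102, "goodPractice": "", "isFalsePositive": True},
]

# Keep only the final tags (Tagger == "Merge", i.e., after the merging
# discussion), dropping rows flagged as false positives.
final_tags = [r for r in rows if r["Tagger"] == "Merge" and not r["isFalsePositive"]]
print([r["Answer ID"] for r in final_tags])  # → [101]
```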

tagging/Tagging_merging_info.xlsx

67.8 KB
Binary file not shown.

used_code/Nootebook1.ipynb

Lines changed: 711 additions & 0 deletions
Large diffs are not rendered by default.

used_code/Notebook2.ipynb

Lines changed: 894 additions & 0 deletions
Large diffs are not rendered by default.

used_code/Readme.md

Lines changed: 33 additions & 2 deletions
@@ -1,3 +1,34 @@
+ # General description of the code
  In this folder, the code used in the article can be found.
- * Notebook 1: It is used to extract the analyzed posts in the study, as filtered as possible
- * Notebook 2: It is used to extract all the machine learning posts (posts that have the machine-learning tag) and compare them with the ones that discuss best practices (BP)
+ * Notebook1.ipynb: Used to extract the posts analyzed in the study, as filtered as possible.
+ * Notebook2.ipynb: Used to extract all the machine learning posts (posts that have the machine-learning tag) and compare them with the ones that discuss best practices (BP).
+ * config_file.ini: Configuration file used to store the credentials and the suggested database names for each of the websites selected in the article.
+ * create_data_noSTO.sh: Bash script used to create the needed databases (DBs) for each of the Q&A websites mentioned in the article. It also loads the respective data into the DBs and creates the required indexes for the post table, using the auxiliary script *load_create_noSTO.sql*. **(Not used for the Stack Overflow (STO) dump.)**
+ * load_create_noSTO.sql: SQL script that creates the required tables and loads the information into the post table. **(Not used for the Stack Overflow (STO) dump.)**
+ * create_and_load_STO.sql: SQL script that was manually executed to create the STO database and the required tables and indexes for the posts.
+
+ ## Requisites
+ * MySQL (8.0) installed and running.
+ * A MySQL user name with a password.
+
+ ## Recommendations and guidelines
+
+ ### Data --> Databases
+ 1. Download the data from [Stack Exchange dump 03.2021](https://archive.org/details/stackexchange_20210301)
+ 2. Uncompress the data into a specific path, e.g., [path_to_the_base_folder], using the name of each database as a folder name, i.e., [path_to_the_base_folder]/[wesite_folder]/. For example, for the website *Data Science*, the dump must be decompressed into [path_to_the_base_folder]/Data Science/.
+ 3. **[Not for STO].** Inside the base folder (i.e., [path_to_the_base_folder]), create a new folder, e.g., [temp]. This folder is used in *create_data_noSTO.sh*.
+ 4. **[Not for STO].** In the script *create_data_noSTO.sh*, identify the [path_to_the_original_folder]/[temp] path and replace it with your own paths.
+ 5. **[Not for STO].** In the SQL script *load_create_noSTO.sql*, replace the path /[path_to_the_base_folder]/[temp] with your own path.
+ 6. **[Not for STO].** Execute *create_data_noSTO.sh* for each of the 13 studied Q&A websites (not including Stack Overflow). For this, you should replace [wesite_folder] with the corresponding value for each of the 13 websites. In addition, [Name_database] should be changed to a database name of your preference for each database. The *config_file.ini* file lists suggested names for all 13 websites.
+ 7. **[For STO].** Be sure to have extracted *Posts.xml*, then copy it into the [path_to_the_base_folder]/[temp]/ folder.
+ 8. **[For STO].** In the SQL script *create_and_load_STO.sql*, replace the path /[path_to_the_base_folder]/[temp] with your own path.
+ 9. **[For STO].** Read the script *create_and_load_STO.sql* and identify which parts you need to execute. In any case, *Part 1* is needed; then you can choose between executing *Part 2* (the table required for extracting the posts analyzed in the article) and *Part 3* (posts that do not consider the score as a filtering criterion).
+
+ ### Databases --> Studied Posts
+ * With *Notebook1.ipynb* you can export the set of posts used in the article (before tagging): a set of questions and answers filtered by defined criteria.
+ * *Notebook2.ipynb* is complementary material that allows you to extract a broader set of questions and answers related to "machine-learning" and/or "best practices", without considering the score of the posts.
+
+ ##### Acronyms
+ * STO = Stack Overflow
+ * BP = Best practices
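The filtering criteria mentioned above (Part 2 of *create_and_load_STO.sql*: questions with a positive score, an accepted answer, and the machine-learning tag) can be restated as a minimal pure-Python sketch. The sample posts below are invented for illustration; only the column names and conditions come from the SQL in this folder:

```python
# Invented sample posts mirroring the columns used by the Part 2 filter in
# create_and_load_STO.sql.
posts = [
    {"Id": 1, "PostTypeId": 1, "Score": 5, "AcceptedAnswerId": 10, "Tags": "<machine-learning><python>"},
    {"Id": 2, "PostTypeId": 1, "Score": 0, "AcceptedAnswerId": 11, "Tags": "<machine-learning>"},
    {"Id": 3, "PostTypeId": 2, "Score": 7, "AcceptedAnswerId": None, "Tags": ""},
]

def keep(p):
    # Question (PostTypeId = 1), positive score, accepted answer, ML tag.
    return (p["PostTypeId"] == 1
            and p["Score"] > 0
            and p["AcceptedAnswerId"] is not None
            and "machine-learning" in p["Tags"])

filtered = [p["Id"] for p in posts if keep(p)]
print(filtered)  # → [1]
```

Part 3 of the same script is the identical filter minus the `Score > 0` condition.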

used_code/config_file.ini

Lines changed: 6 additions & 0 deletions
[DATABASE]
HOST=localhost
database=ArtificiaIntelligenceExchange,CodeReviewExchange,ComputationalScienceExchange,ComputerScienceExchange,DataScienceExchange,ElectricalEngineeringExchange,OpenDataExchange,SignalProcessingExchange,SoftwareEngineeringExchange,StatisticalExchange,TheoreticalCOmputerScienceExchange,iOTExchange,EngineeringExchange,StackOverFlow
urls=https://ai.stackexchange.com/questions/,https://codereview.stackexchange.com/questions/,https://scicomp.stackexchange.com/questions/,https://cs.stackexchange.com/questions/,https://datascience.stackexchange.com/questions/,https://electronics.stackexchange.com/questions/,https://opendata.stackexchange.com/questions/,https://dsp.stackexchange.com/questions/,https://softwareengineering.stackexchange.com/questions/,https://stats.stackexchange.com/questions/,https://cstheory.stackexchange.com/questions/,https://iot.stackexchange.com/questions/,https://engineering.stackexchange.com/questions/,https://stackoverflow.com/questions/
user=root
passwd=ML2021ai_
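The `database` and `urls` keys are parallel comma-separated lists, so each suggested database name pairs with its website URL by position. A sketch of reading them with the standard-library `configparser` (shown with a trimmed two-entry copy of the section; the real file lists all 14 websites and its own credentials):

```python
import configparser

# Trimmed stand-in for used_code/config_file.ini; the real [DATABASE] section
# has 14 comma-separated entries in each of `database` and `urls`.
ini_text = """
[DATABASE]
HOST=localhost
database=DataScienceExchange,StackOverFlow
urls=https://datascience.stackexchange.com/questions/,https://stackoverflow.com/questions/
user=root
passwd=example
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
db = cfg["DATABASE"]

# Zip the parallel lists into a name -> URL mapping.
pairs = dict(zip(db["database"].split(","), db["urls"].split(",")))
print(pairs["StackOverFlow"])  # → https://stackoverflow.com/questions/
```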

used_code/create_and_load_STO.sql

Lines changed: 216 additions & 0 deletions
# This code is based on the following gist
# https://gist.github.com/enricorotundo/1e074af39d90629252a7df3fc1066397
# Some comments are taken from the following post in Stack Exchange
# https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
# We added the content licence for comments, posts, and postlinks.
# Be sure InnoDB is the default table engine; otherwise specify it (needed for a full-text index).
# If you are having problems loading data into your database:
# 1) run mysql with --secure_file_priv=""
# 2) If you are on a Mac and you are having problems running the complementary files, you can run mysql with mysqld_safe --secure_file_priv="" located in the bin folder
# 3) The path of your XML file has to be an absolute path

# - -------- ----------- ------------ --------------- ----------

# ---PART 1--- CREATE the Posts TABLE for Stack Overflow (the only difference from the other websites' tables is the absence of the title-body full-text index).
# It also creates the schema for the auxiliary table (PostsQuestionsFiltered).

CREATE TABLE Posts (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    # 1 = Question
    # 2 = Answer
    # 3 = Orphaned tag wiki
    # 4 = Tag wiki excerpt
    # 5 = Tag wiki
    # 6 = Moderator nomination
    # 7 = "Wiki placeholder" (seems to only be the election description)
    # 8 = Privilege wiki
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

CREATE TABLE PostsQuestionsFiltered (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

SELECT DATABASE();

SELECT COUNT(*) AS postsCount FROM Posts;

# Needed (it depends on your security settings and OS)
load xml LOCAL infile '/[path_to_the_base_folder]/[temp]/Posts.xml' into table Posts rows identified by '<row>';

show databases;

create index Posts_idx_1 on Posts(AcceptedAnswerId);
create index Posts_idx_2 on Posts(ParentId);
create index Posts_idx_3 on Posts(OwnerUserId);
create index Posts_idx_4 on Posts(LastEditorUserId);

SHOW INDEX FROM Posts;

CREATE FULLTEXT INDEX index_Tags ON Posts(Tags);

SHOW INDEX FROM Posts;

# ---PART 2--- (ARTICLE) Query that filters all the posts that are questions (this creates a smaller STO table)

# ---PART 2.1--- Populate PostsQuestionsFiltered
INSERT INTO PostsQuestionsFiltered
SELECT * FROM Posts p
WHERE p.PostTypeId = 1 and
    p.Score > 0 and p.AcceptedAnswerId is not NULL and
    MATCH (p.Tags) AGAINST ('"machine-learning"' IN BOOLEAN MODE);

# Due to INSERT performance, index creation is done after the data is loaded
create index Posts_idx_1 on PostsQuestionsFiltered(AcceptedAnswerId);
create index Posts_idx_2 on PostsQuestionsFiltered(ParentId);
create index Posts_idx_3 on PostsQuestionsFiltered(OwnerUserId);
create index Posts_idx_4 on PostsQuestionsFiltered(LastEditorUserId);

SHOW INDEX FROM PostsQuestionsFiltered;

CREATE FULLTEXT INDEX index_Tags ON PostsQuestionsFiltered(Tags);

SHOW INDEX FROM PostsQuestionsFiltered;

CREATE FULLTEXT INDEX index_Text_title ON PostsQuestionsFiltered(Title, Body);

SHOW INDEX FROM PostsQuestionsFiltered;

# ---PART 3--- Query that filters all the posts that are questions, NOT filtering by score (this creates a smaller STO table)
# We assume the Posts table has already been created with the indexes from PART 1

CREATE TABLE PostsQuestionsFilteredNoScore (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

INSERT INTO PostsQuestionsFilteredNoScore
SELECT * FROM Posts p
WHERE p.PostTypeId = 1 and
    p.AcceptedAnswerId is not NULL and
    MATCH (p.Tags) AGAINST ('"machine-learning"' IN BOOLEAN MODE);
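The `load xml ... rows identified by '<row>'` statement above maps each `<row>` element's attributes onto the table's columns by name. A self-contained sketch of that mapping, using an invented sample row that follows the dump's attribute naming:

```python
import xml.etree.ElementTree as ET

# Invented sample <row> in the style of a Stack Exchange Posts.xml dump; only
# the attribute names are taken from the schema above.
sample = ('<row Id="42" PostTypeId="1" Score="3" AcceptedAnswerId="43" '
          'Tags="&lt;machine-learning&gt;" CreationDate="2020-01-01T00:00:00" />')

row = ET.fromstring(sample)
# Each XML attribute corresponds to a column of the Posts table.
post = {
    "Id": int(row.attrib["Id"]),
    "PostTypeId": int(row.attrib["PostTypeId"]),
    "Score": int(row.attrib["Score"]),
    "Tags": row.attrib["Tags"],
}
print(post["Tags"])  # → <machine-learning>
```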

used_code/create_data_noSTO.sh

Lines changed: 5 additions & 0 deletions
#!/bin/bash

cp /[path_to_the_original_folder]/[wesite_folder]/Posts.xml /[path_to_the_base_folder]/[temp]
mysql --load-data-local-dir=/[path_to_the_base_folder]/[temp] -u [user] -p[password] -e "SET @@SESSION.SQL_MODE=''; create database [Name_database] DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci; use [Name_database]; $(cat load_create_noSTO.sql)"
rm -rf /[path_to_the_base_folder]/[temp]/Posts.xml
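Per the used_code README (step 6), the bracketed placeholders in this script are replaced by hand for each of the 13 websites. That textual substitution could also be sketched programmatically on an in-memory copy of a line (illustrative only; the paths and website name below are placeholders, not real values):

```python
# One line of create_data_noSTO.sh with its bracketed placeholders, copied
# verbatim from the script above.
template = ('cp /[path_to_the_original_folder]/[wesite_folder]/Posts.xml '
            '/[path_to_the_base_folder]/[temp]')

# Substitute the website folder for one of the 13 studied sites; the remaining
# placeholders would be replaced with your own paths the same way.
line = template.replace("[wesite_folder]", "Data Science")
print(line)
```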

used_code/load_create_noSTO.sql

Lines changed: 86 additions & 0 deletions
# This code is based on the following gist
# https://gist.github.com/enricorotundo/1e074af39d90629252a7df3fc1066397
# Some comments are taken from the following post in Stack Exchange
# https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
# We added the content licence for comments, posts, and postlinks.
# Be sure InnoDB is the default table engine; otherwise specify it (needed for a full-text index).
# If you are having problems loading data into your database:
# 1) uncomment the SET @@SESSION.SQL_MODE line below, and 2) run mysql with --secure_file_priv=""
# 2.1) If you are on a Mac and you are having problems running the complementary files, you can run mysql with mysqld_safe --secure_file_priv="" located in the bin folder
# 3) The paths of your XML files have to be absolute paths

# - -------- ----------- ------------ --------------- ----------

#SET @@SESSION.SQL_MODE='';

CREATE TABLE Posts (
    Id INT NOT NULL PRIMARY KEY,
    PostTypeId TINYINT NOT NULL,
    # 1 = Question
    # 2 = Answer
    # 3 = Orphaned tag wiki
    # 4 = Tag wiki excerpt
    # 5 = Tag wiki
    # 6 = Moderator nomination
    # 7 = "Wiki placeholder" (seems to only be the election description)
    # 8 = Privilege wiki
    AcceptedAnswerId INT,
    ParentId INT,
    CreationDate DATETIME NOT NULL,
    DeletionDate DATETIME,
    Score INT NULL,
    ViewCount INT NULL,
    Body TEXT NULL,
    OwnerUserId INT,
    OwnerDisplayName VARCHAR(256),
    LastEditorUserId INT,
    LastEditorDisplayName VARCHAR(40),
    LastEditDate DATETIME,
    LastActivityDate DATETIME,
    Title VARCHAR(256),
    Tags VARCHAR(256),
    AnswerCount INT DEFAULT 0,
    CommentCount INT DEFAULT 0,
    FavoriteCount INT DEFAULT 0,
    ClosedDate DATETIME,
    CommunityOwnedDate DATETIME,
    ContentLicense VARCHAR(20)
);

SELECT DATABASE();

SELECT COUNT(*) AS postsCount FROM Posts;

# Needed (it depends on your security settings and OS)
load xml LOCAL infile '/[path_to_the_base_folder]/[temp]/Posts.xml' into table Posts rows identified by '<row>';

show databases;

create index Posts_idx_1 on Posts(AcceptedAnswerId);
create index Posts_idx_2 on Posts(ParentId);
create index Posts_idx_3 on Posts(OwnerUserId);
create index Posts_idx_4 on Posts(LastEditorUserId);

SHOW INDEX FROM Posts;

CREATE FULLTEXT INDEX index_Tags ON Posts(Tags);

SHOW INDEX FROM Posts;

# This index is not created for Stack Overflow due to the amount of information and lack of resources; for STO it is created over a subset of the posts instead.
CREATE FULLTEXT INDEX index_Text_title ON Posts(Title, Body);

SHOW INDEX FROM Posts;
