Commit f934b98

Initial version
0 parents  commit f934b98

3 files changed

+216
-0
lines changed

README.md

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
# Data Repository MathOverflow - zbMATH links

This repository provides the data used for the
EMS Newsletter article

__References to academic literature in QA forums -
A case study on zbMATH links from MathOverflow__
_Fabian Müller, Moritz Schubotz, Olaf Teschke_
EMS Newsletter October 19

## Guide to reproduce

* Download the MathOverflow dump from archive.org:
```
wget https://archive.org/download/stackexchange/mathoverflow.net.7z
```
* Check that the md5 sum of the file is `8011aabf2ae76358abbcf9a493ba9655`.

If the md5 sum does not match, you might have downloaded a newer version.
To reproduce our results, we stored the downloaded archive as [GitHub release]().

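The checksum comparison can be scripted. The following is a minimal sketch, assuming GNU `md5sum` is available (on macOS the equivalent tool is `md5`); the helper name `check_md5` is hypothetical and not part of this repository.

```bash
# Expected checksum as stated in this README.
expected_md5="8011aabf2ae76358abbcf9a493ba9655"

# check_md5 FILE EXPECTED (hypothetical helper):
# succeeds only if the md5 sum of FILE equals EXPECTED.
check_md5() {
  actual=$(md5sum "$1" | cut -d ' ' -f 1)
  [ "$actual" = "$2" ]
}

# Run the check only if the archive has already been downloaded.
if [ -f mathoverflow.net.7z ]; then
  if check_md5 mathoverflow.net.7z "$expected_md5"; then
    echo "md5 sum matches"
  else
    echo "md5 sum differs, this is probably a newer dump"
  fi
fi
```

If the sums differ, the counts reported further down will most likely differ as well.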
<details>
<summary>Take a moment to review the [dataset description](https://archive.org/details/stackexchange) and its meta information</summary>

- Format: 7zipped
- Files:
  - **badges**.xml
    - UserId, e.g.: "420"
    - Name, e.g.: "Teacher"
    - Date, e.g.: "2008-09-15T08:55:03.923"
  - **comments**.xml
    - Id
    - PostId
    - Score
    - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
    - CreationDate, e.g.: "2008-09-06T08:07:10.730"
    - UserId
  - **posts**.xml
    - Id
    - PostTypeId
      - 1: Question
      - 2: Answer
    - ParentID (only present if PostTypeId is 2)
    - AcceptedAnswerId (only present if PostTypeId is 1)
    - CreationDate
    - Score
    - ViewCount
    - Body
    - OwnerUserId
    - LastEditorUserId
    - LastEditorDisplayName, e.g.: "Jeff Atwood"
    - LastEditDate, e.g.: "2009-03-05T22:28:34.823"
    - LastActivityDate, e.g.: "2009-03-11T12:51:01.480"
    - CommunityOwnedDate, e.g.: "2009-03-11T12:51:01.480"
    - ClosedDate, e.g.: "2009-03-11T12:51:01.480"
    - Title
    - Tags
    - AnswerCount
    - CommentCount
    - FavoriteCount
  - **posthistory**.xml
    - Id
    - PostHistoryTypeId
      - 1: Initial Title - The first title a question is asked with.
      - 2: Initial Body - The first raw body text a post is submitted with.
      - 3: Initial Tags - The first tags a question is asked with.
      - 4: Edit Title - A question's title has been changed.
      - 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown.
      - 6: Edit Tags - A question's tags have been changed.
      - 7: Rollback Title - A question's title has reverted to a previous version.
      - 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here.
      - 9: Rollback Tags - A question's tags have reverted to a previous version.
      - 10: Post Closed - A post was voted to be closed.
      - 11: Post Reopened - A post was voted to be reopened.
      - 12: Post Deleted - A post was voted to be removed.
      - 13: Post Undeleted - A post was voted to be restored.
      - 14: Post Locked - A post was locked by a moderator.
      - 15: Post Unlocked - A post was unlocked by a moderator.
      - 16: Community Owned - A post has become community owned.
      - 17: Post Migrated - A post was migrated.
      - 18: Question Merged - A question has had another, deleted question merged into itself.
      - 19: Question Protected - A question was protected by a moderator.
      - 20: Question Unprotected - A question was unprotected by a moderator.
      - 21: Post Disassociated - An admin removes the OwnerUserId from a post.
      - 22: Question Unmerged - A previously merged question has had its answers and votes restored.
    - PostId
    - RevisionGUID: At times more than one type of history record can be recorded by a single action. All of these will be grouped using the same RevisionGUID.
    - CreationDate, e.g.: "2009-03-05T22:28:34.823"
    - UserId
    - UserDisplayName: populated if a user has been removed and is no longer referenced by user Id
    - Comment: This field will contain the comment made by the user who edited a post
    - Text: A raw version of the new value for a given revision
      - If PostHistoryTypeId = 10, 11, 12, 13, 14, or 15 this column will contain a JSON encoded string with all users who have voted for the PostHistoryTypeId
      - If PostHistoryTypeId = 17 this column will contain migration details of either `from <url>` or `to <url>`
    - CloseReasonId
      - 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question.
      - 2: off-topic
      - 3: subjective
      - 4: not a real question
      - 7: too localized
  - **postlinks**.xml
    - Id
    - CreationDate
    - PostId
    - RelatedPostId
    - PostLinkTypeId
      - 1: Linked
      - 3: Duplicate
  - **users**.xml
    - Id
    - Reputation
    - CreationDate
    - DisplayName
    - EmailHash
    - LastAccessDate
    - WebsiteUrl
    - Location
    - Age
    - AboutMe
    - Views
    - UpVotes
    - DownVotes
  - **votes**.xml
    - Id
    - PostId
    - VoteTypeId
      - ` 1`: AcceptedByOriginator
      - ` 2`: UpMod
      - ` 3`: DownMod
      - ` 4`: Offensive
      - ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated
      - ` 6`: Close
      - ` 7`: Reopen
      - ` 8`: BountyStart
      - ` 9`: BountyClose
      - `10`: Deletion
      - `11`: Undeletion
      - `12`: Spam
      - `13`: InformModerator
    - CreationDate
    - UserId (only for VoteTypeId 5)
    - BountyAmount (only for VoteTypeId 9)
</details>

<details>
<summary>Extract the file</summary>

```bash
physikerwelt@math-docker:~/mathoverflow$ 7z e mathoverflow.net.7z

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs Intel Core Processor (Skylake, IBRS) (506E3),ASM,AES-NI)

Scanning the drive for archives:
1 file, 317278828 bytes (303 MiB)

Extracting archive: mathoverflow.net.7z
--
Path = mathoverflow.net.7z
Type = 7z
Physical Size = 317278828
Headers Size = 349
Method = BZip2
Solid = +
Blocks = 7

Everything is Ok

Files: 8
Size: 1725254003
Compressed: 317278828
```
</details>

<details>
<summary>Extract posts with references to zbmath.org</summary>

```bash
physikerwelt@math-docker:~/mathoverflow$ wc -l Posts.xml
252154 Posts.xml
physikerwelt@math-docker:~/mathoverflow$ grep 'zbmath.org' Posts.xml > zblPosts.xml
physikerwelt@math-docker:~/mathoverflow$ wc -l zblPosts.xml
774 zblPosts.xml
```

</details>

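Because `grep` keeps one line per matching post, `zblPosts.xml` can be broken down further using the `PostTypeId` attribute from the dataset description (1: Question, 2: Answer). The following is a rough sketch of such a breakdown; the function name `count_post_types` is hypothetical, and it assumes every post is a single `<row ... />` line, as in the dump used here.

```bash
# count_post_types FILE (hypothetical helper):
# print "PostTypeId,count" for every post type found in FILE.
count_post_types() {
  grep -o ' PostTypeId="[0-9]*"' "$1" |   # one attribute per post row
    tr -dc '0-9\n' |                      # keep only the numeric type id
    sort -n |
    uniq -c |
    awk '{ print $2 "," $1 }'             # reorder to "type,count"
}

# Run only if the filtered file from the previous step exists.
if [ -f zblPosts.xml ]; then
  count_post_types zblPosts.xml
fi
```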
In the following we analyse
[Posts](https://github.com/ag-gipp/19emsMathOverflow/releases/download/v0.1/PostHistory.7z)
that contain the string `zbmath.org`.

## Files in the repository

The following CSV files were all created using search
and replace in a standard text editor.


* [counts.csv](counts.csv) contains the number of all posts
(not only those with links to zbMATH) grouped by month.
The first incomplete year and the current year were excluded.
* [mathoverflow-links-stat.xlsx](mathoverflow-links-stat.xlsx) is a Microsoft Excel file that analyses the date
distribution of the MathOverflow posts with links to zbMATH in the table
dates.
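
The files here were produced by hand in a text editor, so the following is only a sketch of how a table shaped like `counts.csv` could be derived with standard tools. It assumes that `rel` is `abs` divided by the largest monthly count, which is an inference from the published values (for example 18220 / 19626 ≈ 0.928360338). The function name `monthly_counts` and the year bounds in the usage line are hypothetical, since the README does not name the years that were dropped.

```bash
# monthly_counts FILE FIRST_YEAR LAST_YEAR (hypothetical helper):
# count posts per calendar month for the given inclusive year range
# and print them as "month,abs,rel", where rel = abs / max(abs).
monthly_counts() {
  grep -o ' CreationDate="[0-9]\{4\}-[0-9]\{2\}' "$1" |
    cut -d '"' -f 2 |                      # yields e.g. 2010-04
    awk -F '-' -v first="$2" -v last="$3" '
      $1 >= first && $1 <= last { cnt[$2 + 0]++ }
      END {
        for (m = 1; m <= 12; m++) if (cnt[m] > max) max = cnt[m]
        print "month,abs,rel"
        for (m = 1; m <= 12; m++)
          printf "%d,%d,%.9g\n", m, cnt[m], (max > 0 ? cnt[m] / max : 0)
      }'
}

# Placeholder year bounds (an assumption); run only if the dump was extracted.
if [ -f Posts.xml ]; then
  monthly_counts Posts.xml 2010 2018 > counts-sketch.csv
fi
```

The output is written to a separate file, `counts-sketch.csv`, so that the published `counts.csv` is not overwritten.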

counts.csv

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
month,abs,rel
1,18220,0.928360338
2,18072,0.920819321
3,19488,0.992968511
4,19626,1
5,19191,0.977835524
6,16816,0.856822582
7,17070,0.869764598
8,19530,0.99510853
9,18189,0.926780801
10,19299,0.983338429
11,18826,0.959237746
12,17457,0.889483338

mathoverflow-links-stat.xlsx

48.5 KB
Binary file not shown.
