| 1 | +# Data Repository MathOverflow - zbMATH links |
| 2 | + |
| 3 | +This repository provides data used for the |
| 4 | +EMS Newsletter article |
| 5 | + |
| 6 | + __References to academic literature in QA forums - |
| 7 | + A case study on zbMATH links from MathOverflow__ |
| 8 | + _Fabian Müller, Moritz Schubotz, Olaf Teschke_ |
| 9 | + EMS Newsletter October 19 |
| 10 | + |
| 11 | +## Guide to reproduce |
| 12 | + |
| 13 | +* Download the MathOverflow dump from archive.org: |
| 14 | +``` |
| 15 | +wget https://archive.org/download/stackexchange/mathoverflow.net.7z |
| 16 | + |
| 17 | +``` |
| 18 | +* Check that the md5 sum of the file is `8011aabf2ae76358abbcf9a493ba9655`. |
| 19 | + |
| 20 | +If the md5 sum does not match, you might have downloaded a newer version. |
| 21 | +To make our results reproducible, we stored the downloaded archive as a [GitHub release](). |
| 22 | + |
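The checksum comparison above can also be scripted. The following is a minimal sketch, assuming GNU `md5sum` is installed; `verify_md5` is a hypothetical helper name and not part of this repository.

```bash
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper: returns 0 only if the md5 sum of FILE equals EXPECTED_MD5.
verify_md5() {
    local file="$1" expected="$2" actual
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # md5sum prints "<hash>  <filename>"; keep only the hash
    actual="$(md5sum "$file" | cut -d ' ' -f 1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file has md5 sum $actual" >&2
        return 1
    fi
}
```

For the dump described here, the call would be `verify_md5 mathoverflow.net.7z 8011aabf2ae76358abbcf9a493ba9655`.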
| 23 | +<details> |
| 24 | +<summary>Take a moment to review the [dataset description](https://archive.org/details/stackexchange) and the |
| 25 | +meta information</summary> |
| 26 | + |
| 27 | + - Format: 7zipped |
| 28 | + - Files: |
| 29 | + - **badges**.xml |
| 30 | + - UserId, e.g.: "420" |
| 31 | + - Name, e.g.: "Teacher" |
| 32 | + - Date, e.g.: "2008-09-15T08:55:03.923" |
| 33 | + - **comments**.xml |
| 34 | + - Id |
| 35 | + - PostId |
| 36 | + - Score |
| 37 | + - Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?" |
| 38 | + - CreationDate, e.g.: "2008-09-06T08:07:10.730" |
| 39 | + - UserId |
| 40 | + - **posts**.xml |
| 41 | + - Id |
| 42 | + - PostTypeId |
| 43 | + - 1: Question |
| 44 | + - 2: Answer |
| 45 | + - ParentId (only present if PostTypeId is 2) |
| 46 | + - AcceptedAnswerId (only present if PostTypeId is 1) |
| 47 | + - CreationDate |
| 48 | + - Score |
| 49 | + - ViewCount |
| 50 | + - Body |
| 51 | + - OwnerUserId |
| 52 | + - LastEditorUserId |
| 53 | + - LastEditorDisplayName, e.g.: "Jeff Atwood" |
| 54 | + - LastEditDate, e.g.: "2009-03-05T22:28:34.823" |
| 55 | + - LastActivityDate, e.g.: "2009-03-11T12:51:01.480" |
| 56 | + - CommunityOwnedDate, e.g.: "2009-03-11T12:51:01.480" |
| 57 | + - ClosedDate, e.g.: "2009-03-11T12:51:01.480" |
| 58 | + - Title |
| 59 | + - Tags |
| 60 | + - AnswerCount |
| 61 | + - CommentCount |
| 62 | + - FavoriteCount |
| 63 | + - **posthistory**.xml |
| 64 | + - Id |
| 65 | + - PostHistoryTypeId |
| 66 | + - 1: Initial Title - The first title a question is asked with. |
| 67 | + - 2: Initial Body - The first raw body text a post is submitted with. |
| 68 | + - 3: Initial Tags - The first tags a question is asked with. |
| 69 | + - 4: Edit Title - A question's title has been changed. |
| 70 | + - 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown. |
| 71 | + - 6: Edit Tags - A question's tags have been changed. |
| 72 | + - 7: Rollback Title - A question's title has reverted to a previous version. |
| 73 | + - 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here. |
| 74 | + - 9: Rollback Tags - A question's tags have reverted to a previous version. |
| 75 | + - 10: Post Closed - A post was voted to be closed. |
| 76 | + - 11: Post Reopened - A post was voted to be reopened. |
| 77 | + - 12: Post Deleted - A post was voted to be removed. |
| 78 | + - 13: Post Undeleted - A post was voted to be restored. |
| 79 | + - 14: Post Locked - A post was locked by a moderator. |
| 80 | + - 15: Post Unlocked - A post was unlocked by a moderator. |
| 81 | + - 16: Community Owned - A post has become community owned. |
| 82 | + - 17: Post Migrated - A post was migrated. |
| 83 | + - 18: Question Merged - A question has had another, deleted question merged into itself. |
| 84 | + - 19: Question Protected - A question was protected by a moderator. |
| 85 | + - 20: Question Unprotected - A question was unprotected by a moderator. |
| 86 | + - 21: Post Disassociated - An admin removes the OwnerUserId from a post. |
| 87 | + - 22: Question Unmerged - A previously merged question has had its answers and votes restored. |
| 88 | + - PostId |
| 89 | + - RevisionGUID: At times more than one type of history record can be recorded by a single action. All of these will be grouped using the same RevisionGUID |
| 90 | + - CreationDate: "2009-03-05T22:28:34.823" |
| 91 | + - UserId |
| 92 | + - UserDisplayName: populated if a user has been removed and is no longer referenced by user Id |
| 93 | + - Comment: This field will contain the comment made by the user who edited a post |
| 94 | + - Text: A raw version of the new value for a given revision |
| 95 | + - If PostHistoryTypeId = 10, 11, 12, 13, 14, or 15 this column will contain a JSON encoded string with all users who have voted for the PostHistoryTypeId |
| 96 | + - If PostHistoryTypeId = 17 this column will contain migration details of either "from <url>" or "to <url>" |
| 97 | + - CloseReasonId |
| 98 | + - 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question. |
| 99 | + - 2: off-topic |
| 100 | + - 3: subjective |
| 101 | + - 4: not a real question |
| 102 | + - 7: too localized |
| 103 | + - **postlinks**.xml |
| 104 | + - Id |
| 105 | + - CreationDate |
| 106 | + - PostId |
| 107 | + - RelatedPostId |
| 108 | + - PostLinkTypeId |
| 109 | + - 1: Linked |
| 110 | + - 3: Duplicate |
| 111 | + - **users**.xml |
| 112 | + - Id |
| 113 | + - Reputation |
| 114 | + - CreationDate |
| 115 | + - DisplayName |
| 116 | + - EmailHash |
| 117 | + - LastAccessDate |
| 118 | + - WebsiteUrl |
| 119 | + - Location |
| 120 | + - Age |
| 121 | + - AboutMe |
| 122 | + - Views |
| 123 | + - UpVotes |
| 124 | + - DownVotes |
| 125 | + - **votes**.xml |
| 126 | + - Id |
| 127 | + - PostId |
| 128 | + - VoteTypeId |
| 129 | + - ` 1`: AcceptedByOriginator |
| 130 | + - ` 2`: UpMod |
| 131 | + - ` 3`: DownMod |
| 132 | + - ` 4`: Offensive |
| 133 | + - ` 5`: Favorite - if VoteTypeId = 5 UserId will be populated |
| 134 | + - ` 6`: Close |
| 135 | + - ` 7`: Reopen |
| 136 | + - ` 8`: BountyStart |
| 137 | + - ` 9`: BountyClose |
| 138 | + - `10`: Deletion |
| 139 | + - `11`: Undeletion |
| 140 | + - `12`: Spam |
| 141 | + - `13`: InformModerator |
| 142 | + - CreationDate |
| 143 | + - UserId (only for VoteTypeId 5) |
| 144 | + - BountyAmount (only for VoteTypeId 9) |
| 145 | +</details> |
| 146 | + |
| 147 | +<details> |
| 148 | +<summary>extract the file</summary> |
| 149 | + |
| 150 | +```bash |
| 151 | +physikerwelt@math-docker:~/mathoverflow$ 7z e mathoverflow.net.7z |
| 152 | + |
| 153 | +7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 |
| 154 | +p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs Intel Core Processor (Skylake, IBRS) (506E3),ASM,AES-NI) |
| 155 | + |
| 156 | +Scanning the drive for archives: |
| 157 | +1 file, 317278828 bytes (303 MiB) |
| 158 | + |
| 159 | +Extracting archive: mathoverflow.net.7z |
| 160 | +-- |
| 161 | +Path = mathoverflow.net.7z |
| 162 | +Type = 7z |
| 163 | +Physical Size = 317278828 |
| 164 | +Headers Size = 349 |
| 165 | +Method = BZip2 |
| 166 | +Solid = + |
| 167 | +Blocks = 7 |
| 168 | + |
| 169 | +Everything is Ok |
| 170 | + |
| 171 | +Files: 8 |
| 172 | +Size: 1725254003 |
| 173 | +Compressed: 317278828 |
| 174 | +``` |
| | +</details> |
| 175 | +<details> |
| 176 | +<summary>extract posts with references to zbmath.org</summary> |
| 177 | + |
| 178 | +```bash |
| 179 | +physikerwelt@math-docker:~/mathoverflow$ wc -l Posts.xml |
| 180 | +252154 Posts.xml |
| 181 | +physikerwelt@math-docker:~/mathoverflow$ grep 'zbmath.org' Posts.xml > zblPosts.xml |
| 182 | +physikerwelt@math-docker:~/mathoverflow$ wc -l zblPosts.xml |
| 183 | +774 zblPosts.xml |
| 184 | +``` |
| 185 | + |
| 186 | +</details> |
| 187 | + |
| 188 | +In the following we analyse |
| 189 | +[Posts](https://github.com/ag-gipp/19emsMathOverflow/releases/download/v0.1/PostHistory.7z) |
| 190 | +that contain the string `zbmath.org`. |
| 191 | + |
| 192 | +## Files in the repository |
| 193 | + |
| 194 | +The following CSV files were all created using search |
| 195 | +and replace in a standard text editor. |
| 196 | + |
| 197 | + |
| 198 | +* [counts.csv](counts.csv) contains the number of all posts |
| 199 | +(not only with links to zbMATH) grouped by month. |
| 200 | +The first incomplete year and the current year were deleted. |
| 201 | +* [mathoverflow-links-stat.xlsx](mathoverflow-links-stat.xlsx) is a Microsoft Excel file that analyses the date |
| 202 | +distribution of the MathOverflow posts with links to zbMATH in the table |
| 203 | +dates. |
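
The files above were produced by hand in a text editor. As an alternative, a monthly count could be derived on the command line. This is only a sketch, not the method used for this repository. It assumes that every `<row>` of `Posts.xml` sits on a single line and carries exactly one `CreationDate` attribute. The helper name `count_posts_by_month` and the `YYYY-MM,count` output layout are hypothetical and may differ from the layout of `counts.csv`.

```bash
# count_posts_by_month FILE
# Hypothetical helper: prints one "YYYY-MM,count" line per month,
# based on the CreationDate attribute of each row in FILE.
count_posts_by_month() {
    # Keep only the year and month of each CreationDate, then count duplicates
    sed -n 's/.* CreationDate="\([0-9]\{4\}-[0-9]\{2\}\)-.*/\1/p' "$1" \
        | sort \
        | uniq -c \
        | awk '{ print $2 "," $1 }'
}
```

Calling `count_posts_by_month Posts.xml` covers all posts, while `count_posts_by_month zblPosts.xml` covers only the posts that mention `zbmath.org`. The sketch does not drop the first incomplete year or the current year; those rows would still have to be removed afterwards.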