<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>COMP6237 Data Mining</title>
<link rel="stylesheet" href="../stylesheets/styles.css">
<link rel="stylesheet" href="../stylesheets/pygment_trac.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="../javascripts/respond.js"></script>
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!--[if lt IE 8]>
<link rel="stylesheet" href="../stylesheets/ie.css">
<![endif]-->
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
</head>
<body>
<div class="wrapper">
<section>
<div id="title">
<div style="background-image: url(../images/matrix.png); height: 250px; position: relative">
<div style="position: absolute; bottom: 0; width: 850px; opacity:0.8; filter:alpha(opacity=80); background-color: #000000; padding-top:20px">
<h1>COMP6237 Data Mining</h1>
<p>Coursework 2: Understanding Data</p>
</div>
</div>
<hr>
<span class="credits left">Maintained by <a href="http://www.ecs.soton.ac.uk/people/jsh2">Dr Jo Houghton</a>.</span>
</div>
<div style="text-align: justify">
<p><em>COMP6237 Data Mining</em></p>
<h1 id="coursework-2-understanding-data">Coursework 2: Understanding Data</h1>
<h2 id="brief">Brief</h2>
<p>Due date: Friday 29th March 2019, 16:00.<br />
Handin: 1819/COMP6237/3/<br />
Required files: report.pdf<br />
Data: <a href="https://secure.ecs.soton.ac.uk/notes/comp6237/data/gap-html.zip">gap-html.zip</a><br />
Credit: 20% of overall module mark</p>
<h2 id="overview">Overview</h2>
<p>In this coursework you need to perform exploratory/descriptive data mining on a data set that we provide. You will need to write scripts to parse the data into a usable format, perform some kind of feature extraction, and then apply standard techniques to explore relationships between data items, such as K-Means and Hierarchical Clustering, and data-analytic visualisation techniques like Multidimensional Scaling. Finally, you need to put together a report that details your approach and your findings.</p>
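<p>As an illustration of the feature extraction step, the following is a minimal sketch of one possible representation: TF-IDF weighted bag-of-words vectors compared by cosine distance. The function names, the lowercase <code class="highlighter-rouge">[a-z]+</code> tokeniser and the choice of TF-IDF are assumptions made for this example, not requirements of the coursework; other representations may suit the data better.</p>

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Turn {name: text} into {name: {term: tf-idf weight}}."""
    # Assumption: tokens are runs of ASCII letters, so accented or Greek text is dropped.
    counts = {name: Counter(re.findall(r"[a-z]+", text.lower()))
              for name, text in texts.items()}
    # Document frequency: the number of texts in which each term occurs.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(texts)
    return {
        name: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        for name, c in counts.items()
    }


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    # An all-zero vector has no direction, so treat it as maximally distant.
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(vectors):
    """Return (names, matrix) where matrix[i][j] is the distance between texts i and j."""
    names = sorted(vectors)
    matrix = [[0.0 if a == b else cosine_distance(vectors[a], vectors[b]) for b in names]
              for a in names]
    return names, matrix
```

<p>Terms that occur in every text receive a weight of zero under this weighting, which is one simple way of discounting very common words. The resulting distance matrix can be passed to clustering or scaling methods that accept precomputed distances.</p>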
<h2 id="details">Details</h2>
<p>The data you will be using for this assignment is a set of 24 texts about Antiquity (both classical and secondary literature); the original books have been scanned and run through an Optical Character Recognition system to produce an HTML document for each page. The scans and OCR data were produced by Google as part of the <a href="https://en.wikipedia.org/wiki/Google_Books_Library_Project">Google Books Library Project</a>. A typical scanned page and its OCR result are shown below:</p>
<div style="text-align:center">
<div style="float:left">
<img width="400" src="https://secure.ecs.soton.ac.uk/notes/comp6237/data/gap-images/gap_2X5KAAAAYAAJ/00000012.png" />
</div>
<div style="float:left">
</div>
<div style="float:right">
<iframe width="400" height="716" src="https://secure.ecs.soton.ac.uk/notes/comp6237/data/gap-html/gap_2X5KAAAAYAAJ/00000012.html"></iframe>
</div>
<div style="clear:both">
Book scanning example.<br /><br />
</div>
</div>
<p>You can download a Zip file containing the HTML pages with the OCR results <a href="https://secure.ecs.soton.ac.uk/notes/comp6237/data/gap-html.zip">here</a>. Inside the zip file, there are 24 folders representing the 24 texts, with each page stored as a sequentially numbered HTML file. We haven’t included the original scanned images due to their size (around 4GB); however, you can browse the original scans <a href="https://secure.ecs.soton.ac.uk/notes/comp6237/data/gap-images/">here</a> if you wish.</p>
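<p>One way to turn the archive into one text per book is sketched below using only the Python standard library. It assumes that each page's parent folder name identifies its book and that the pages decode as UTF-8; the names <code class="highlighter-rouge">TextExtractor</code>, <code class="highlighter-rouge">extract_text</code> and <code class="highlighter-rouge">read_books</code> are hypothetical helpers introduced for this example.</p>

```python
import zipfile
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import PurePosixPath


class TextExtractor(HTMLParser):
    """Collects the text of one OCR page, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the whitespace-normalised text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def read_books(zip_path):
    """Map each book folder name to the concatenated text of its pages."""
    pages = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        # Sorting the names keeps the sequentially numbered pages in order.
        for name in sorted(archive.namelist()):
            path = PurePosixPath(name)
            if path.suffix.lower() != ".html":
                continue
            # Assumption: pages are UTF-8; undecodable bytes are replaced, not fatal.
            raw = archive.read(name).decode("utf-8", errors="replace")
            pages[path.parent.name].append(extract_text(raw))
    return {book: " ".join(texts) for book, texts in pages.items()}
```

<p>Calling <code class="highlighter-rouge">read_books("gap-html.zip")</code> on the downloaded archive should then give a dictionary with one entry per folder. OCR output is noisy, so further cleaning (for example of page headers or hyphenated line breaks) may be worth considering before feature extraction.</p>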
<p><strong>The aim of this coursework is for you to explore how these 24 texts are related by applying appropriate data mining techniques.</strong> You’ll need to create software to extract the contents of the HTML files and build some form of feature representation to which you can apply standard descriptive data mining techniques. At a minimum, we’re expecting you to experiment with Hierarchical Clustering and Multidimensional Scaling; however, you might also explore other approaches.</p>
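<p>To make the two minimum techniques concrete, here is a small pure-Python sketch that works from any symmetric matrix of pairwise distances between texts. It shows average-linkage agglomerative clustering and classical (Torgerson) Multidimensional Scaling computed by power iteration. Both are just one variant of each technique, chosen here for brevity; the function names and parameters are assumptions for this example, and in practice an established library implementation would normally be preferable.</p>

```python
import math


def average_linkage(names, dist):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(names))]
    merges = []

    def mean_distance(ca, cb):
        return sum(dist[a][b] for a in ca for b in cb) / (len(ca) * len(cb))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average pairwise distance.
        d, i, j = min(
            (mean_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((
            tuple(names[a] for a in clusters[i]),
            tuple(names[b] for b in clusters[j]),
            d,
        ))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def classical_mds(dist, dims=2, iterations=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration; assumes the dominant eigenvalue of B is positive.
        v = [math.sin(i + k + 1) for i in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so that the next pass converges to the next eigenvector.
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

<p>The list of merges can be read as a dendrogram from the bottom up, and the coordinates can be drawn as a scatter plot with one labelled point per text. Different linkage criteria and distance measures can produce noticeably different pictures, so it is worth comparing several and reporting which were used.</p>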
<h3 id="deliverable">Deliverable</h3>
<p>You need to produce a <strong>concise</strong> 2-page “working notes” paper (see http://ceur-ws.org/Vol-1043/ for examples of standard academic working notes papers) using the <a href="https://www.acm.org/publications/proceedings-template">standard 2017 ACM conference proceedings style</a> (use the <code class="highlighter-rouge">sigconf</code> style option for the template). The two-page limit on the paper is final; no additional pages or appendices are permitted. We’re expecting the paper to illustrate (with pictures as appropriate) what you have done and to demonstrate your ability to interpret what the data mining techniques are showing.</p>
<h2 id="marking-and-feedback">Marking and Feedback</h2>
<p>Full details of the marking scheme are given below:</p>
<h4 id="learning-outcomes">Learning Outcomes</h4>
<ul>
<li>Solve real-world data-mining, data-indexing and information extraction tasks</li>
<li>Demonstrate knowledge and understanding of:
<ul>
<li>Key concepts, tools and approaches for data mining on complex unstructured data sets</li>
<li>Theoretical concepts and the motivations behind different data-mining approaches</li>
</ul>
</li>
</ul>
<h3 id="mark-scheme">Mark Scheme</h3>
<p>Good working notes papers not only effectively apply techniques and describe results, but also <strong>offer critical insight</strong> into the findings of the analysis in the context of the underlying data. In particular you need to demonstrate that you understand the data, and, in the context of that understanding, that you can rationalise and reflect on why the analytic techniques are giving the results they do. The working notes paper will be marked using the following criteria:</p>
<table>
<thead>
<tr>
<th>Criterion</th>
<th>Description</th>
<th>Marks</th>
</tr>
</thead>
<tbody>
<tr>
<td>Experimentation</td>
<td>Analyse the problem and define suitable preprocessing and feature extraction operations</td>
<td>28</td>
</tr>
<tr>
<td>Application of techniques</td>
<td>Show ability to apply exploratory data mining techniques</td>
<td>28</td>
</tr>
<tr>
<td>Analysis</td>
<td>Reflection on what can be understood from the data through the application of exploratory techniques</td>
<td>28</td>
</tr>
<tr>
<td>Reporting</td>
<td>Clear and professional reporting</td>
<td>16</td>
</tr>
</tbody>
</table>
<p>Standard ECS late submission penalties apply.</p>
<p>Written individual feedback will be given covering the above points, and will be emailed out once marking is complete. We’ll also use one of the lecture slots for further group feedback and a discussion about the data and the analysis.</p>
<h2 id="tools">Tools</h2>
<p>You can use any available existing tools, programming environments and software libraries for this coursework. It is, however, important that you include full details in your report: state which specific variants of the standard techniques you used, with references as appropriate. Also describe anything non-standard that the implementation does (for example, making approximations for the sake of efficiency), and give all parameters.</p>
<h2 id="questions">Questions</h2>
<p>If you have any problems/questions then <a href="mailto:j.houghton@ecs.soton.ac.uk">email</a> or speak to <a href="http://ecs.soton.ac.uk/people/jh1c18">Jo</a> in her office or after the lectures.</p>
</div>
</section>
<section style="text-align:center">
<span style="font-size: 11px; font-family: 'OpenSansRegular', 'Helvetica Neue', Helvetica, Arial, sans-serif; font-weight: normal; color: #696969;">Copyright ©2019 <a href="http://www.soton.ac.uk">The University of Southampton</a>. All rights reserved.</span>
</section>
</div>
<!--[if !IE]><script>fixScale(document);</script><![endif]-->
</body>
</html>