UCL_SECU0057/UCL_SECU0057_module_outline.Rmd at master · OwenTheAnalyst/UCL_SECU0057 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
---
title: "APPLIED DATA SCIENCE (UCL SECU0057)"
author: "Bennett Kleinberg (bennett.kleinberg@ucl.ac.uk)"
date: "13/01/2020"
output:
  pdf_document:
    toc: yes
  html_document:
    theme: paper
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: no
subtitle: Department of Security and Crime Science, UCL
urlcolor: blue
---


## Introduction

This MSc module introduces students to the field of data science and provides them with the conceptual understanding of the underlying principles. The students will be able to address crime and security problems in a computational manner (e.g., using data-driven crime analysis) and will become critical consumers of data science approaches. It will also allow students to understand the approaches which must be taken to translate these theoretical concepts into real-world applications used by the police and security services, as well in academia to influence and drive government policy.

The techniques covered in this module will be of relevance to students undertaking a data science-related independent research project as well as to those who wish to pursue a career in a data science or analyst role.

## Dates & times

The module is running in Term 2, 2019/2020, from 13 January 2020 - 27 March 2020.

- Time and date: Thursdays, 2-4pm.
- Each session includes a lecture and a tutorial (practical session)

UCL timetable page: [https://timetable.ucl.ac.uk/tt/createCustomTimet.do#](https://timetable.ucl.ac.uk/tt/createCustomTimet.do#)

## Contact & resources

- Dr Bennett Kleinberg, Assistant Professor in Data Science, bennett.kleinberg@ucl.ac.uk
- TA: Felix Soldner, Doctoral researcher, felix.soldner@ucl.ac.uk

The **moodle page** will accompany this module [here](https://moodle.ucl.ac.uk/course/view.php?id=16425).

Q&A forum: if you have a questions/problem related to the content of the lectures/tutorials, then please use [the Q&A forum](https://moodle.ucl.ac.uk/course/view.php?id=16425&section=4).

## Learning outcomes

Upon successful completion of this module, you will be able to:

- develop a holistic understanding of what Data Science encompasses and its pervasive influence in modern life, including its impact within law enforcement, crime analysis and security science
-	understand of key mathematical, computational, and statistical analysis techniques employed in the field of security to determine trends and patterns in measured crime data
- retrieve data from unstructured sources through web-scraping techniques
- write the code needed for the analytical steps in R
- appreciate the fundamentals of supervised and unsupervised machine learning
- understand the methodologies used to extract features from data.
- use techniques used in natural language processing to gain insights from text data
- discuss the ethical issues concerned with deploying latest advances in AI to areas such as criminal. justice, profiling
- understand the role that open data and the open science movement is playing in society, from academic research and the private sector, to shaping government policy

## Learning hours

This module is worth 15 UCL credits (= 7.5 ECTS) which equals to 150 hours of study, i.e. 150h/11 weeks (incl reading week) = 14 hours per week.

_Note that the content structure and the assessment assumes that you spend (on average) that amount of time with this module._

| Component                    	| Amount 	| Duration 	| Total hours 	|
|------------------------------	|--------	|----------	|-------------	|
| Lectures                     	| 10     	| 1h       	| 10h         	|
| Tutorials/practicals         	| 10     	| 1h       	| 10h         	|
| Assessment: class test       	| 1      	| 1.5h     	| 1.5h        	|
| Assessment: project          	| 1      	| 40.5h    	| 40.5h       	|
| Homework/revision/self-study 	| 11     	| 8h       	| 88h         	|
| TOTAL                        	| -      	| -        	| 150         	|


## Structure

The general structure of this module is as follows: there are five content blocks (web data collection, text mining, machine learning, advanced techniques) which each will be covered in weekly 2h-sessions consisting of lectures and tutorials. The lectures cover the approaches on a conceptual (what do they do?) and functional level (how do they work?). The tutorials are practical sessions in which you will learn how to implement the techniques in the R programming language. During the tutorials, we will be there to assist you and help you.

Each week is (roughly) structured as follows:

- Reading and comprehending the required literature (necessary preparation for the lecture)
- Lecture (weekly content)
- Tutorial (practical implementation of lecture content)
- Homework (helps you consolidate the concepts and build your R skills portfolio)

## Timetable

| UCL week 	| Module week 	| Date   	| Topic                    	|
|----------	|-------------	|--------	|--------------------------	|
| 21       	| 1           	| 16 Jan 	| Web data collection I    	|
| 22       	| 2           	| 23 Jan 	| Web data collection II   	|
| 23       	| 3           	| 30 Jan 	| Text mining I            	|
| 24       	| 4           	| 6 Feb  	| Text mining II           	|
| 25       	| 5           	| 13 Feb 	| Text mining III          	|
| 26       	| 6           	| -      	| READING WEEK             	|
| 27       	| 7           	| 27 Feb 	| Machine learning I       	|
| 28       	| 8           	| 5 Mar  	| Machine learning II      	|
| 29       	| 9           	| 12 Mar 	| Machine learning III     	|
| 30       	| 10          	| 19 Mar 	| Case studies            	|
| 31       	| 11          	| 26 Mar 	| **Class test**          	|


<!-- Tutorials: -->

<!-- - API Twitter -->
<!-- - API YouTube -->
<!-- - Reddit -->

<!-- - Newspaper -->
<!-- - MisPers -->
<!-- - Exotic animals -->

<!-- - lexical diversity on rap lyrics -->
<!-- - tfidf on rap lyrics -->

<!-- - ngrams, POS, named entities on guardian (which author used the most pronouns?) -->
<!-- - replication on drill music corpus -->

<!-- - text similarity 1, 2, 3 -->
<!-- - [M&K Word embeddings](...) -->
<!-- - across domains -->

<!-- - fake news classification -->
<!-- - sentiment classification -->
<!-- - creators for change classification -->

<!-- - clustering of YouTube videos -->
<!-- - .. -->

<!-- - simple DL in R -->


## Content

_Note: slides and tutorials will be added as the module progresses._

### Week 0 - Module preparation and software setup

Please ensure that you follow the guides below to setup your computer with the necessary software. We also assume that you have a basic understanding of pragmatically solving coding problems. Do do this, please take the time to walk through the tutorial below.

- Tutorial guide: [R for crime scientists in 12 steps](https://rawcdn.githack.com/ben-aaron188/UCL_SECU0050/8cee7970ae540c8c23d04a308ec7e19cb732abc3/tutorials/week_0/r_in_12_steps.html)
- Tutorial guide: [Getting ready for R](https://rawcdn.githack.com/ben-aaron188/UCL_SECU0050/8cee7970ae540c8c23d04a308ec7e19cb732abc3/tutorials/week_0/getting_ready_for_r.html)
- Tutorial guide: [How to solve data science problems](https://rawcdn.githack.com/ben-aaron188/UCL_SECU0050/df0b0c3114f3fc804468bbcd831fc8345547b91a/tutorials/week_0/how_to_solve_data_science_problems.html)

### Week 1 - Web data collection I (16 Jan)

- Topics covered: web data collection with APIs
- Required preparation
    - Pfeffer, J., Mayer, K., & Morstatter, F. (2018). Tampering with Twitter’s Sample API. EPJ Data Science, 7(1), 50. [https://doi.org/10.1140/epjds/s13688-018-0178-0](https://doi.org/10.1140/epjds/s13688-018-0178-0)
    - Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. ArXiv:1306.5204 [Physics]. Retrieved from [http://arxiv.org/abs/1306.5204](http://arxiv.org/abs/1306.5204)
    - [MDN HTML basics](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics), [MDN HTML tables](https://developer.mozilla.org/en-US/docs/Learn/HTML/Tables/Basics)
- Slides: [Week 1: Web data collection 1](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_1/week1_secu0057.html), [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/lectures/week_1/week1_secu0057.pdf)
- Tutorial + Homework: [Tutorial, week 1: web data collection](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_1/week1_tutorial_secu0057.nb.html), [Tutorial solutions](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_1/week1_tutorial_secu0057_solutions.nb.html)

### Week 2 - Web data collection II (23 Jan)

- Topics covered: HTML basics, web scraping in R
- Required preparation
    - [MDN Javascript basics](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/What_is_JavaScript)
    - [MDN CSS first steps](https://developer.mozilla.org/en-US/docs/Learn/CSS/First_steps/Getting_started)
    - Ignatow & Mihalcea, 2018: C6 Web crawling
    - [Rvest package documentation](https://cran.r-project.org/web/packages/rvest/rvest.pdf)
- Slides: [Week 2: Web data collection 2](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_2/week2_secu0057.html), [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/lectures/week_2/week2_secu0057.pdf)
- Tutorial + Homework: [Tutorial, week 2: web data collection 2](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_2/week2_tutorial_secu0057.nb.html), [Tutorial solutions](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/tutorials/week_2/week2_tutorial_secu0057_solutions.Rmd)


### Week 3 - Text mining I (30 Jan)

- Topics covered: quantification problem, TF-IDF, text metrics
- Required preparation
    - [Zipf's Mystery (YouTube)](https://www.youtube.com/watch?v=fCn8zs912OE)
    - Ignatow & Mihalcea, 2018: C7 Lexical resources
    - Ignatow & Mihalcea, 2018: C8 Basic text properties
    - [Grolemund & Wickham, 2016: C14 Strings](https://r4ds.had.co.nz/strings.html)
    - [Quanteda quick start guide](https://quanteda.io/articles/quickstart.html)
    - Schoonvelde, M., Brosius, A., Schumacher, G., & Bakker, B. N. (2019). Liberals lecture, conservatives communicate: Analyzing complexity and ideology in 381,609 political speeches. PLOS ONE, 14(2), e0208450. [https://doi.org/10.1371/journal.pone.0208450](https://doi.org/10.1371/journal.pone.0208450)
- Recommended:
    - [Regular expressions with stringr](https://stringr.tidyverse.org/articles/regular-expressions.html)
- Slides: [Week 3: Text Mining 1](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_3/week3_secu0057.html), [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/lectures/week_3/week3_secu0057.pdf)
- Tutorial + Homework: [Tutorial, week 3: Text Mining 1](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_3/week3_tutorial_secu0057.nb.html), [Tutorial solutions](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_3/week3_tutorial_secu0057_solutions.nb.html)

### Week 4 - Text mining II (6 Feb)

- Topics covered: POS tagging, ngrams, sentiment analysis, sentiment trajectories
- Required preparation:
    - Soldner, F., Ho, J. C., Makhortykh, M., van der Vegt, I. W. J., Mozes, M., & Kleinberg, B. (2019). Uphill from here: Sentiment patterns in videos from left- and right-wing YouTube news channels. Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, 84–93. [https://doi.org/10.18653/v1/W19-2110](https://doi.org/10.18653/v1/W19-2110)
    - Kleinberg, B., Mozes, M., & van der Vegt, I. (2018). Identifying the sentiment styles of YouTube’s vloggers. 3581–3590. [https://www.aclweb.org/anthology/D18-1394](https://www.aclweb.org/anthology/D18-1394)
    - Gao, J., Jockers, M. L., Laudun, J., & Tangherlini, T. (2016). A multiscale theory for the dynamical evolution of sentiment in novels. Behavioral, Economic and Socio-Cultural Computing (BESC), 2016 International Conference On, 1–4.
    - Jockers, M. (2015). Revealing Sentiment and Plot Arcs with the Syuzhet Package. http://www.matthewjockers.net/2015/02/02/syuzhet/
- Slides: [Week 4: Text mining 2](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_4/week4_secu0057.html), [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/lectures/week_4/week4_secu0057.pdf)
- Tutorial+ Homework: [Tutorial: week 4: Text mining 2](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_4/week4_tutorial_secu0057.nb.html), [Tutorial solutions](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_4/week4_tutorial_secu0057_solutions.nb.html)


### Week 5 - Text mining III (13 Feb)

- Topics covered: text similarity, word embeddings
- Required preparation
    - [Introduction to word embeddings, Mozes (2019)](https://github.com/maximilianmozes/word_embeddings_workshop_resources/blob/master/slides/theo.pdf)
    - Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. [https://doi.org/10.3115/v1/D14-1162](https://doi.org/10.3115/v1/D14-1162) from [https://nlp.stanford.edu/pubs/glove.pdf](https://nlp.stanford.edu/pubs/glove.pdf)
    - Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. ArXiv:1310.4546 [Cs, Stat]. [http://arxiv.org/abs/1310.4546](http://arxiv.org/abs/1310.4546)
    - Vector dot products from [https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length)
    - Matrix vector products from [https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/null-column-space/v/matrix-vector-products](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/null-column-space/v/matrix-vector-products)
    - [Limitations of word embeddings, Burdick, 2019](https://github.com/maximilianmozes/word_embeddings_workshop_resources/blob/master/slides/limitations.pdf)
    - Wendlandt, L., Kummerfeld, J. K., & Mihalcea, R. (2018). Factors Influencing the Surprising Instability of Word Embeddings. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2092–2102. [https://doi.org/10.18653/v1/N18-1190](https://doi.org/10.18653/v1/N18-1190)
- Recommended:
    - Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644. [https://doi.org/10.1073/pnas.1720347115](https://doi.org/10.1073/pnas.1720347115)
    - Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. ArXiv:1607.06520 [Cs, Stat]. [http://arxiv.org/abs/1607.06520](http://arxiv.org/abs/1607.06520)
    - Vylomova, E., Rimell, L., Cohn, T., & Baldwin, T. (2016). Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning. ArXiv:1509.01692 [Cs]. [http://arxiv.org/abs/1509.01692](http://arxiv.org/abs/1509.01692)
- Slides: [Week 5: Text mining 3](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_5/week5_secu0057.html), [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/lectures/week_5/week5_secu0057.pdf)
- Tutorial + Homework: [Tutorial, week 5: Text mining 3](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_5/week5_tutorial_secu0057.nb.html), [Tutorial solutions](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/tutorials/week_5/week5_tutorial_secu0057_solutions.Rmd)

### Week 6 - Reading week

_Use this week to catch up on any literature or homework/tutorials._

### Week 7 - Machine learning I (27 Feb)

- Topics covered: supervised machine learning (classification + regression), core algorithms, performance metrics
- Required preparation
    - Ignatow & Mihalcea, 2018: C13 Text classification
    - Ignatow & Mihalcea, 2018: C9 Supervised learning
    - [Gatto, 2019: Supervised learning](https://lgatto.github.io/IntroMachineLearningWithR/supervised-learning.html)
    - [Deisenroth et al., 2019: C12 Classification with Support Vector Machines](https://mml-book.github.io/book/mml-book.pdf)
    - [Kuhn, 2019: C5 Model training](https://topepo.github.io/caret/model-training-and-tuning.html)
    - [Kuhn, 2019: C17 Measuring performance](https://topepo.github.io/caret/measuring-performance.html)

- Recommended: [YT, 3Blue1Brown: Bayes theorem, and making probability intuitive](https://www.youtube.com/watch?v=HZGCoVF3YvM)
- Slides: [Week 7: Machine Learning 1](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_7/week7_secu0057.html), [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/lectures/week_7/week7_secu0057.pdf)
- Tutorial + Homework: [Tutorial, week 7: Machine Learning 1](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_7/week7_tutorial_secu0057.nb.html), [Tutorial solutions](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_7/week7_tutorial_secu0057_solutions.nb.html)


### Week 8 - Machine learning II (5 Mar)

- Topics covered: unsupervised machine learning, core algorithm, problems
- Required preparation
    - [Gatto, 2019: C4 Unsupervised learning](https://lgatto.github.io/IntroMachineLearningWithR/unsupervised-learning.html)
    - Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. [https://doi.org/10.1017/pan.2017.44](https://doi.org/10.1017/pan.2017.44)
    - [Datacamp unsupervised learning](https://www.datacamp.com/courses/unsupervised-learning-in-r)
- Slides: [Week 8: Machine Learning 2](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/lectures/week_8/week8_secu0057.html)
- Tutorial + Homework: [Tutorial, week 8: Machine Learning 2](https://raw.githack.com/ben-aaron188/UCL_SECU0057/master/tutorials/week_8/week8_tutorial_secu0057.nb.html)


### Week 9 - Machine learning III (12 Mar)

Guest lecture: Josh Kamps (Doctoral researcher)

- Topics covered: neural networks in R, intro to deep learning
- Required preparation: [tbc]
- Slides: [tbc]
- Tutorial + homework: [tbc]


### Week 10 - Case studies (19 Mar)

<!-- This lecture includes a guest talk: -->

<!-- Guest lecture: Daniel Hammocks (Doctoral researcher) - "Facial recognition systems"  -->

<!-- - Required preparation -->
<!--     - Metropolitan Police (2019). Live Facial Recognition. [https://www.met.police.uk/live-facial-recognition-trial/](https://www.met.police.uk/live-facial-recognition-trial/) -->
<!--     - Big Brother Watch (2018). The lawless growth of facial recognition in UK policing. [https://bigbrotherwatch.org.uk/wp-content/uploads/2018/05/Face-Off-final-digital-1.pdf](https://bigbrotherwatch.org.uk/wp-content/uploads/2018/05/Face-Off-final-digital-1.pdf) -->
<!--     - Machine Learning From Scratch: Part 3 - Arrays and representations. [https://towardsdatascience.com/machine-learning-from-scratch-part-3-ed572330367d](https://towardsdatascience.com/machine-learning-from-scratch-part-3-ed572330367d) -->
<!--     - Biometrics and Forensics Ethics Group (2019). Ethical issues arising from the police use of live facial recognition technology. [https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/781745/Facial_Recognition_Briefing_BFEG_February_2019.pdf](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/781745/Facial_Recognition_Briefing_BFEG_February_2019.pdf) -->
<!--     - Recommended: Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 1, I–I. [https://doi.org/10.1109/CVPR.2001.990517](https://doi.org/10.1109/CVPR.2001.990517) -->
<!-- - Slides -->
<!-- - Tutorial -->
<!-- - Homework -->

<!-- ### Week 11 - Case studies II (26 Mar) -->

<!-- This lecture includes two guest talks: -->

- Daniel Hammocks - "Facial recognition systems"
- Maximilian Mozes - "On the robustness of intelligent systems"
- Isabelle van der Vegt - "Data Science for Threat Assessment"

- Required preparation
    - Metropolitan Police (2019). Live Facial Recognition. [https://www.met.police.uk/live-facial-recognition-trial/](https://www.met.police.uk/live-facial-recognition-trial/)
    - Machine Learning From Scratch: Part 3 - Arrays and representations. [https://towardsdatascience.com/machine-learning-from-scratch-part-3-ed572330367d](https://towardsdatascience.com/machine-learning-from-scratch-part-3-ed572330367d)
    - Biometrics and Forensics Ethics Group (2019). Ethical issues arising from the police use of live facial recognition technology. [https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/781745/Facial_Recognition_Briefing_BFEG_February_2019.pdf](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/781745/Facial_Recognition_Briefing_BFEG_February_2019.pdf)
    - Gu, T., Dolan-Gavitt, B., & Garg, S. (2019). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. ArXiv:1708.06733 [Cs]. Retrieved from [http://arxiv.org/abs/1708.06733](http://arxiv.org/abs/1708.06733)
    - Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial examples in the physical world. ArXiv:1607.02533 [Cs, Stat]. Retrieved from [http://arxiv.org/abs/1607.02533](http://arxiv.org/abs/1607.02533)
    - Azucar, D., Marengo, D., & Settanni, M. (2018). Predicting the Big 5 personality traits from digital footprints on social media: A meta-analysis. Personality and Individual Differences, 124, 150–159. [https://doi.org/10.1016/j.paid.2017.12.018](https://doi.org/10.1016/j.paid.2017.12.018)
    - van der Vegt, I., Mozes, M., Gill, P., & Kleinberg, B. (2019). Online influence, offline violence: Linguistic responses to the “Unite the Right” rally. ArXiv:1908.11599 [Cs]. Retrieved from http://arxiv.org/abs/1908.11599
- Slides: [tbc]
- Tutorial + Homework: [tbc]


## Materials

### Software

We will use the R programming language. All packages, required resources and tools needed are openly available and free to download to any computer. We strongly encourage students to bring their own laptops to the tutorials so they can customise their work environment. However, we will have a computer cluster available where you can use the UCL computers.

### Literature

You can find the required reading (as well as suggested further reading) in the table above and on the moodle page.)

For each week, the required literature includes one reading on techniques/tools that are important for data science projects but are not covered in the lectures or tutorials. You will likely need these techniques in the project and we assume that you have read and prepared the reading so that you are able to use the techniques.


Key resources for this module are:

- [An Introduction to Text Mining](http://us.sagepub.com/en-us/nam/an-introduction-to-text-mining/book249451#contents) (Ignatow & Mihalcea, 2018) - SAGE
    - _UCL library services have ordered a few copies of this book_
    - You can buy the book at a UCL student discount via this link [https://studysites.uk.sagepub.com/dts-studentdeal/](https://studysites.uk.sagepub.com/dts-studentdeal/)
- Text mining with R (Silge & Robinson, 2019) - interactive book version freely available at [https://www.tidytextmining.com/](https://www.tidytextmining.com/)
- Mathematics for Machine Learning (Deisenroth et al., 2019) - freely available at [https://mml-book.github.io/book/mml-book.pdf](https://mml-book.github.io/book/mml-book.pdf)
- R for Data Science (Grolemund & Wickham, 2016) - interactive book version freely available at [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/)
- The Caret package (Kuhn, 2019) - interactive book version freely available at [https://topepo.github.io/caret/index.html](https://topepo.github.io/caret/index.html)
- Machine learning with R (Gatto, 2019) - interactive book version freely available at [https://lgatto.github.io/IntroMachineLearningWithR/index.html](https://lgatto.github.io/IntroMachineLearningWithR/index.html)
- Applied Predictive Modelling (Kuhn & Johnson, 2013) - this book is freely available to you as UCL student through UCL Library's [free e-book access](https://ucl-new-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=TN_springer_s978-1-4614-6849-3_216330&context=PC&vid=UCL_VU2&lang=en_US&search_scope=CSCOP_UCL&adaptor=primo_central_multiple_fe&tab=local&query=any,contains,Applied%20Predictive%20Modeling&offset=0).

Other useful resources:

- Data Visualization with R (Kabacoff, 2018) - interactive book version freely available at [https://rkabacoff.github.io/datavis/](https://rkabacoff.github.io/datavis/), especially:
    - [Kabacoff, 2018: C2 Intro to ggplot2](https://rkabacoff.github.io/datavis/IntroGGPLOT.html)
    - [Kabacoff, 2018: C3 Univariate graphs](https://rkabacoff.github.io/datavis/Univariate.html)
    - [Kabacoff, 2018: C4 Bivariate graphs](https://rkabacoff.github.io/datavis/Bivariate.html)
    - [Kabacoff, 2018: C5 Multivariate graphs](https://rkabacoff.github.io/datavis/Multivariate.html)
- [Quanteda text visualisation](https://quanteda.io/articles/pkgdown/examples/plotting.html) -->

### Data

All datasets used are open-source and available without restrictions.


## Assessment

### Class test

- Weight for final grade: 30%
- Learning outcomes tested: (1) demonstrating knowledge of a broader range of analytical techniques used in the field of Security and Crime Science, (2) understanding the purpose, advantages and disadvantages of different forms of data science techniques, (3) interpreting the results of data science techniques.
- Date: 26 March 2020, during the normal lecture+tutorial hours, 2pm-4pm.

**Details:** This 1.5-hour closed-book test covers theoretical, technical and conceptual aspects of the lectures, tutorials and literature. You will be given 15 multiple-choice questions. For each question, there will be five possible answer options. At least one option is correct and you are asked to select all options that apply (i.e. this could be one, two, three, four, or all five options). **If you select the correct - _and only the correct_ - combination of answer options, you will be given ten points for a question.** In total, you can achieve 150 points in this test (transformed to a percentage grade).


### Applied Data Science Project

- Weight for final grade: 70%
- Learning outcomes tested: (1) demonstrating knowledge of a broader range of analytical techniques used in the field of Security and Crime Science, (2) performing data science analyses on crime and/or-security related issues, (3) applying the data science pipeline on crime and/or-security related issues, (4) interpreting and effectively reporting the results of said techniques
- Deadline: 16 April 2020, 4pm.
- Page limit: See below.
- Assessment details as [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/assessment/SECU0057_project_assignment.pdf).

> _This assessment is the capstone project of the module. It requires you to address a research problem in the full data science workflow (e.g., obtaining the data, processing the data, modelling the data, building predictive models, reporting on the findings, interpreting the outcomes). You will write a brief report on your project (a template will be provided) and you have to submit the R code needed to reproduce your findings. After passing this assessment, you will have the demonstrated the skills to solve a problem using data science techniques._

#### Feedback

A full project is a major step in your data science skills career. To help you in the process, you will receive feedback from us on a concept draft of your project. The deadline for the concept draft is **9 March 2020 (4pm)** via Turn-it-in on moodle. The requirements for the feedback submission are available on moodle. _The submission of the concept draft accounts for 10% of the Applied Data Science Project._

- Feedback form template as [PDF](https://github.com/ben-aaron188/UCL_SECU0057/blob/master/assessment/project_concept_draft_template.pdf).

#### Assessment topic

This year's topic is authorship attribution. Recently, an area of digital forensics closely connected to NLP that seeks to identify who wrote a piece of text/novel/song/etc. has received considerable attention (e.g., [Why Molière most likely did write his plays](https://advances.sciencemag.org/content/5/11/eaax5489.abstract), [Researcher uses AI to unravel the mystery of Shakespeare’s co-author](https://thenextweb.com/artificial-intelligence/2019/11/27/ai-shakespeare-henry-viii/), and [Plecháč, 2019](https://arxiv.org/abs/1911.05652)).

With more techniques at the researchers' disposal, this area is likely to impact on how we do digital forensic investigations (e.g. when trying to find who wrote a post). The project requires you to conduct your own authorship attribution project including (web) data collection, text processing and text mining, and approaches to identify authorship (e.g. through machine learning). _The exact nature of the project is up to you._ **The only restrictions on the topic and scope are: you must not use novels or Twitter as the source of data, and you must have at least 500 data points (e.g. texts/paragraphs/etc.) for each of at least 10 authors.**.

Further details:

- the project should be on authorship attribution (not author profiling)
- all steps in this project must be reproducible with your code supplement
- the project should have a large-scale focus (i.e. not a case study with one document)
- all three core areas of this module should be used: web data collection, text mining, machine learning


#### The code supplement

Submit your R code in the form of a commented R notebook. To ensure that no code is lost and that we can review all code equally, submit your code as an anonymised view-only version on the [Open Science Framework](https://osf.io/). Create a private repository, upload your code as an R notebook and create an anonymised, view-only link that you include in your report (for a guide on creating that link, see [here](https://help.osf.io/hc/en-us/articles/360019930333-Create-a-View-only-Link-for-a-Project)). For details on reporting your code as an R notebook, you can consult these guides: [guide 1](https://bookdown.org/yihui/rmarkdown/notebook.html), [guide 2](http://uc-r.github.io/r_notebook).

#### The report

For this assignment, you are asked to report your findings in the form of a short paper. Specifically, you should use the template of the ACL conference proceedings (these can be downloaded  [here for Latex and Word](http://acl2020.org/downloads/acl2020-templates.zip) or can be imported into Overleaf [here](https://www.overleaf.com/latex/templates/acl-2020-proceedings-template/zsrkcwjptpcd)).

Additional requirements for the report:

- use the ACL style guidelines (easiest through the templates)
- the **page limit is 8 content pages** + unlimited pages for the reference list
- use the ACL referencing style (this is available in reference managers like Zotero or handled directly in Overleaf) - i.e. adhere to their font type, font size and heading guidelines.
- use the anonymous submission version which contains line numbers
- the paper must contain only your examination number in the author line
- include a footnote with an anonymised view-only link to your code on the OSF (see above).
- submit the report via moodle as a pdf file using the following file name: _SECU0057_12345.pdf_ (replace 12345 with your examination number)


#### Deliverables

- Concept draft (9 Mar 2020)
- Project report (16 April 2020)
- Code supplement (16 April 2020)

#### Grading criteria

| Criterion                          	| Meaning                                                                                                                                            	| Weight 	|
|------------------------------------	|----------------------------------------------------------------------------------------------------------------------------------------------------	|--------	|
| Originality of project             	| The degree to which the student demonstrates insight and is able to use an innovative approach to address the question of authorship attribution.  	| 15%    	|
| Quality of data science techniques 	| The degree to which the techniques used in this project are appropriate to answer the research question and are utilised and interpreted properly. 	| 15%    	|
| Quality of the R code              	| The degree to which the R code is well-documented, reproducible (with provided data if needed), and correct.                                       	| 20%    	|
| Report "Introduction" section      	| The quality of the review of related works and the logical flow of the argument in the introduction section.                                       	| 10%    	|
| Report "Method" section            	| The clarity of the method section that details the steps of data acquisition, preprocessing and analysis.                                          	| 10%    	|
| Report "Results" section           	| The suitability and clarity of the results section.                                                                                                	| 10%    	|
| Report "Discussion" section        	| The quality of the overall interpretation  as well as limitations and suggestions for future work.                                                 	| 15%    	|
| Concept draft submission           	| Whether or not the concept draft was submitted in the required format.                                                                             	| 5%     	|

## Attendance requirement

We are obliged to record the attendance at all sessions (lectures and tutorials) and each student will have to attend at least 70% of the sessions to be able to pass the module. If you cannot attend for a good reason, please let the TA know about this well in advance. *We strongly advise you to attend all sessions as this eases the assessment for you and will help you get the most out of this module.*

------