diff --git a/Gemfile.lock b/Gemfile.lock
index 14e2579..4eb6148 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -103,9 +103,6 @@ GEM
       pathutil (~> 0.9)
       rouge (>= 1.7, < 4)
       safe_yaml (~> 1.0)
-    jekyll-asciidoc (3.0.0)
-      asciidoctor (>= 1.5.0)
-      jekyll (>= 3.0.0)
     jekyll-avatar (0.7.0)
       jekyll (>= 3.0, < 5.0)
     jekyll-coffeescript (1.1.1)
@@ -250,7 +247,7 @@ GEM
     thread_safe (0.3.6)
     typhoeus (1.4.0)
       ethon (>= 0.9.0)
-    tzinfo (1.2.9)
+    tzinfo (1.2.10)
       thread_safe (~> 0.1)
     unf (0.1.4)
       unf_ext
@@ -263,7 +260,6 @@ PLATFORMS
 DEPENDENCIES
   github-pages
-  jekyll-asciidoc
   jekyll-seo-tag
   kramdown (>= 2.3.1)
diff --git a/_data/people.yml b/_data/people.yml
index cebd3c5..53f789e 100644
--- a/_data/people.yml
+++ b/_data/people.yml
@@ -7,9 +7,9 @@
   twitter: jandot
   blog: http://saaientist.blogspot.com
   bio: |
-    Jan has a background in genetics and genomics, and performed his doctoral research at the University of Wageningen (Netherlands) on the Chicken Genome Sequencing Project. He then moved to Scotland to work as a postdoctoral researcher at the Roslin Institute on the Cow Genome Sequencing Project. Next, he continued his research at the Wellcome Trust Sanger Institute near Cambridge (UK) focusing on structural variation in the human and other primate genomes. At his return to Belgium at the KU Leuven in 2010 (as assistent and later associate professor), he shifted focus to data visualization and visual analytics, with the aim of finding interesting questions in large datasets (big data). His main research topics revolve around visual design, interaction design, and (human and computational) scalability. Since 2019, he is professor at Hasselt University where he continues his visual analytics work and helps build a new Data Science Institute.
+    Jan has a background in genetics and genomics, and performed his doctoral research at the University of Wageningen (Netherlands) on the Chicken Genome Sequencing Project. He then moved to Scotland to work as a postdoctoral researcher at the Roslin Institute on the Cow Genome Sequencing Project. Next, he continued his research at the Wellcome Trust Sanger Institute near Cambridge (UK) focusing on structural variation in the human and other primate genomes. On his return to Belgium at KU Leuven in 2010 (as assistant and later associate professor), he shifted focus to data visualization and visual analytics, with the aim of finding interesting questions and gaining deeper insight into large and complex datasets. In support of this, he also researches topological data analysis (TDA) as a method to identify underlying structures. From 2019 until 2021 he was Full Professor at Hasselt University and Director of its Data Science Institute. Since 2022 he has been part-time Professor at the universities of Leuven and Hasselt, and Research Fellow for Amador Bioscience.

-    Jan has been on the organising committees of several conferences (including BioVis and Beyond The Genome), and has chaired visualization-related sessions at conferences including VIZBI, the Bioinformatics Open Source Conference BOSC and EuroVis/VMLS. He is also Associate Editor for the BioMedCentral Thematic Series on Biological Data Visualization, and academic editor for PLoS One. He was founding member of the Young Academy – Royal Flemish Academy of Belgium for Sciences and the Arts.
+    Jan has been on the organising committees of several conferences (including BioVis and Beyond The Genome), and has chaired visualization-related sessions at conferences including VIZBI, the Bioinformatics Open Source Conference BOSC and EuroVis/VMLS. He is also Associate Editor for the BioMedCentral Thematic Series on Biological Data Visualization, and academic editor for PLoS One and Frontiers in Bioinformatics. He was a founding member of the Young Academy - Royal Flemish Academy of Belgium for Sciences and the Arts.

 - name: Danai Kafetzaki
   class: current
diff --git a/assets/exam_20220623.html b/assets/exam_20220623.html
new file mode 100644
index 0000000..4d9d67b
--- /dev/null
+++ b/assets/exam_20220623.html
@@ -0,0 +1,920 @@
+
+
+
+
+
+
+
+
+Data Management - Exam 23/6/2022 Section Jan Aerts
+
+
+
+
+
+
+
+
+
+
+

These are the questions for the key/value and document stores.

+
+
+
+
+

Instructions

+
+
+

For the questions in this section, we will consider a document-oriented database with Yelp data. Imagine there are 3 collections: businesses, users and reviews.

+
+
+

Please email the answer to jan.aerts@uhasselt.be. Your email should include the answers for each statement like this (obviously mock-up):

+
+
+
Example answers
+
+
1.1: false
+1.2: true
+1.3: true
+1.4: false
+2.1: true
+2.2: ...
+
+
+
+ + + + + +
+
Note
+
+Explicitly state which are false and which are true. Do not just send a list of the true statements. +
+
+
+
+
+

Dataset

+
+
+
businesses
+
+
{
+    "_key": "tnhfDv5Il8EaGSXZGiuQGg",
+    "_id": "businesses/tnhfDv5Il8EaGSXZGiuQGg",
+
+    // the business's name
+    "name": "Garaje",
+
+    // the city
+    "city": "San Francisco",
+
+    // 2 character state code
+    "state": "CA",
+
+    // star rating
+    "stars": 4.5,
+
+    // number of reviews
+    "review_count": 1198,
+
+    // object, business attributes to values. note: some attribute values might be objects
+    "attributes": {
+        "RestaurantsTakeOut": true,
+        "BusinessParking": {
+            "garage": false,
+            "street": true,
+            "lot": false
+        },
+    },
+
+    // business category: Restaurant, Plumber, ...
+    "category": "Restaurant"
+}
+
+
+
+
reviews
+
+
{
+    "_key": "zdSx_SD6obEhz9VrW9uAWA",
+    "_id": "reviews/zdSx_SD6obEhz9VrW9uAWA",
+
+    // user id, maps to the user in users collection
+    "user_id": "users/Ha3iJu77CxlrFm-vQRs_8g",
+
+    // business id, maps to business in businesses collection
+    "business_id": "businesses/tnhfDv5Il8EaGSXZGiuQGg",
+
+    // star rating
+    "stars": 4,
+
+    // date of review
+    "date": {
+        "year": 2016,
+        "month": 3,
+        "day": 9
+    },
+
+    // number of useful votes received
+    "useful": 15,
+
+    // the review itself
+    "text": "Great place to hang out after work"
+}
+
+
+
+
users
+
+
{
+    "_key": "Ha3iJu77CxlrFm-vQRs_8g",
+    "_id": "users/Ha3iJu77CxlrFm-vQRs_8g",
+
+    // the user's first name
+    "name": "Sebastien",
+
+    // the number of reviews they've written
+    "review_count": 56,
+
+    // when the user joined Yelp
+    "yelping_since": {
+        "year": 2011,
+        "month": 1,
+        "day": 1
+    },
+
+    // number of fans the user has
+    "fans": 1032,
+
+    // the years the user was elite
+    "elite": [
+        2012,
+        2013
+    ],
+
+    // average rating of all reviews
+    "average_stars": 4.31
+}
+
+
+
+

We will make the following assumptions:

+
+
+
    +
  • +

    All documents are well-formed, and therefore have the same schema. In other words: all keys are present in all documents (e.g. attributes is not missing from one of the businesses).

    +
  • +
  • +

    There are users who have written no reviews and there are businesses that have received no reviews.

    +
  • +
+
+
+
+
+

Question 1

+
+
+

Consider the following query:

+
+
+
+
FOR r IN reviews
+COLLECT m=r.date.month AGGREGATE u=MAX(r.useful)
+LIMIT 5
+SORT u DESC
+RETURN {m:m, u:u}
+
+
+
+

Which of the following statements are true? Attention: there might be none, there might be more than one.

+
+
+

Possible answer 1.1 - This query shows the 5 months with the highest number of useful votes their reviews received.

+
+
+

Possible answer 1.2 - This query shows the 5 most useful reviews.

+
+
+

Possible answer 1.3 - This query will return a value for each month of the year, even if there are no reviews in that month.

+
+
+

Possible answer 1.4 - This query shows 5 random months together with the highest number of useful votes a review in them received.

+
+
+
+
+

Question 2

+
+
+

Consider the following query:

+
+
+
+
FOR u IN users
+FOR r IN reviews
+FILTER r.user_id == u._id
+FILTER r.stars < (u.average_stars/2)
+RETURN {n:u.name,us:u.average_stars,s:r.stars}
+
+
+
+

Which of the following statements are true? Attention: there might be none, there might be more than one.

+
+
+

Possible answer 2.1 - This query will only return results for users who have written reviews.

+
+
+

Possible answer 2.2 - All users will appear in the results.

+
+
+

Possible answer 2.3 - This query returns a result for each review where the user gives less than half of their average number of stars.

+
+
+

Possible answer 2.4 - This query will return the same results if the first two lines were swapped (i.e. first FOR r IN reviews, then FOR u IN users).

+
+
+
+
+

Question 3

+
+
+

Consider the following query:

+
+
+
+
FOR u IN users
+SORT u.fans DESC
+LIMIT 1
+RETURN {a:u.name, b:u.average_stars}
+
+
+
+

Which of the following statements are true? Attention: there might be none, there might be more than one.

+
+
+

Possible answer 3.1 - The result is not deterministic because there might be multiple users with an equal number of fans.

+
+
+

Possible answer 3.2 - This query returns the name and average stars given for the user with the fewest fans.

+
+
+

Possible answer 3.3 - This query returns the name and average stars given for the user with the most fans.

+
+
+

Possible answer 3.4 - The result is independent of the maximum number of stars a user gave in their reviews.

+
+
+
+
+

Question 4

+
+
+

Consider the following query:

+
+
+
+
FOR b IN businesses
+FILTER b.state == "CA"
+RETURN DISTINCT {
+  name: b.name,
+  stars: (
+    FOR r IN reviews
+    FILTER r.business_id == b._id
+    FILTER r.date.year == 2016
+    RETURN r.stars
+)}
+
+
+
+

Which of the following statements are true? Attention: there might be none, there might be more than one.

+
+
+

Possible answer 4.1 - This returns the name of each business in California, plus an array of the stars they received in 2016. If a business didn’t have a review in 2016, that business is not included in the output.

+
+
+

Possible answer 4.2 - This returns the name of each business in California, plus an array of the stars they received in 2016. If a business didn’t have a review in 2016, an empty array is returned for the stars.

+
+
+

Possible answer 4.3 - The DISTINCT has no effect on the output and could have been removed.

+
+
+

Possible answer 4.4 - The SORT r.stars DESC has no effect on the output and could be removed.

+
+
+
+
+

Question 5

+
+
+

Which of the following queries returns the take-out restaurant with the highest number of reviews in 2018? The output should be a single object and look like this:

+
+
+
+
{
+    "_key": "GBTPC53ZrG1ZBY3DT8Mbcw",
+    "_id": "businesses/GBTPC53ZrG1ZBY3DT8Mbcw",
+    "name": "Luke",
+    "city": "New Orleans",
+    "state": "LA",
+    "stars": 4,
+    "review_count": 4554,
+    "attributes": {
+        "RestaurantsReservations": "True",
+        "RestaurantsTakeOut": "True"
+    },
+    "category": "Restaurant"
+}
+
+
+
+

Attention: there might be none, there might be more than one.

+
+
+

Possible answer 5.1

+
+
+
+
FOR a IN businesses
+FILTER a.attributes.RestaurantsTakeOut == "True" AND a.category == "Restaurant"
+SORT a.review_count DESC
+LIMIT 1
+RETURN a
+
+
+
+

Possible answer 5.2

+
+
+
+
LET a = (
+  FOR b IN reviews
+  FILTER b.date.year == 2018
+  COLLECT c = b.business_id WITH COUNT INTO cnt
+  SORT cnt DESC
+  LIMIT 1
+  RETURN DOCUMENT(c)
+)
+
+FOR d IN a
+FILTER d.attributes.RestaurantsTakeout == "True"
+FILTER d.category == "Restaurant"
+RETURN d
+
+
+
+

Possible answer 5.3

+
+
+
+
FOR r IN reviews
+FOR b IN businesses
+FILTER r.business_id == b._id
+FILTER r.date.year == 2018
+FILTER b.category == "Restaurant"
+FILTER b.attributes.RestaurantsTakeOut == "True"
+COLLECT c = r.business_id WITH COUNT INTO d
+SORT d DESC
+LIMIT 1
+RETURN DOCUMENT(c)
+
+
+
+

Possible answer 5.4

+
+
+
+
FOR b IN businesses
+FILTER b.attributes.RestaurantsTakeOut == "True"
+FILTER b.category == "Restaurant"
+SORT b.review_count DESC
+LIMIT 1
+RETURN b.name
+
+
+
+
+
+

Question 6

+
+
+

Which of the following queries results in a list of unique business categories? It would look like this:

+
+
+
+
["Restaurant","Plumber","Beauty & Spas","Gunsmith","Wedding Planner"]
+
+
+
+

Attention: there might be none, there might be more than one.

+
+
+

Possible answer 6.1

+
+
+
+
FOR b IN businesses
+COLLECT c=b.category
+RETURN c
+
+
+
+

Possible answer 6.2

+
+
+
+
FOR b IN businesses
+RETURN DISTINCT b.category
+
+
+
+

Possible answer 6.3

+
+
+
+
LET categories = (
+    FOR b IN businesses
+    RETURN b.category
+)
+FOR c IN categories
+RETURN DISTINCT c
+
+
+
+

Possible answer 6.4

+
+
+
+
FOR c IN (
+  FOR b IN businesses
+  RETURN b.category
+)
+RETURN DISTINCT c
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/assets/sketch_example.png b/assets/sketch_example.png
new file mode 100644
index 0000000..267612a
Binary files /dev/null and b/assets/sketch_example.png differ
diff --git a/dvds.md b/dvds.md
index 56a9b39..24b0a9a 100644
--- a/dvds.md
+++ b/dvds.md
@@ -14,6 +14,8 @@ Content:
 * Programming visualizations: static and dynamic
 * Project: visualization of expert dataset

+## 2021-2022
+Information about the retake exam in August 2022 is available [here]({{site.baseurl}}/visds_resit2022.html).

 ## 2020-2021

diff --git a/resources.md b/resources.md
index ad53082..6911e9d 100644
--- a/resources.md
+++ b/resources.md
@@ -47,6 +47,7 @@ color: "#fff2ae"

 * [Svelte](http://svelte.dev) - General framework for creating web content and visuals
 * [D3](d3js.org) - Javascript visualisation library
+* [D3 Discovery](d3-discovery.net) - Finding D3 plugins with ease
 * [P5](p5js.org) - Another javascript visualisation library
 * [OpenProcessing](openprocessing.org) - classroom platform for teaching P5
 * [Quil](quil.info) - clojure implementation of P5
diff --git a/visds_resit2022.md b/visds_resit2022.md
new file mode 100644
index 0000000..5d6bf3e
--- /dev/null
+++ b/visds_resit2022.md
@@ -0,0 +1,124 @@
+---
+layout: page
+title: Data Visualisation in Data Science - instructions August 2022
+permalink: visds_resit2022.html
+---
+## Overview
+For the July/August term, we ask you to create designs on a Yelp dataset, as well as implement two visuals using Svelte. The exercises done during the course will _not_ be part of this exam period. Thirty percent of the grade will be based on the designs; 70% on the implementation.
+
+Please submit your contributions **before August 22**.
+
+**Getting help** - The teaching assistants will be available to support you again. Please **do not wait until the last week before the deadline** to ask them your questions. Any open office moments will be announced on Blackboard/Toledo. Not all teaching assistants are available throughout the entire July/August term. Include all three in your correspondence for the best chance of a timely reply.
+
+- Jelmer Bot: not available before July 25th or from August 15th onward
+- Jannes Peeters: not available in the last 2 weeks of August
+- Dries Heylen: varying availability
+
+## Design
+We ask you to go through a diverge and emerge phase for a Yelp dataset. Yelp (http://www.yelp.com) is an online directory of businesses and services that includes reviews by customers. The data consists of information on businesses, users, and reviews. You can create sketches for these independently (e.g. businesses), or combined (e.g. businesses vs users); that is up to you.
+
+We refer you to the teaching material (slides and video) on methods to explore design space in the diverge and emerge stages.
+
+### The data
+Imagine having tens of thousands of records for each datatype, but we only show one below to give you an idea of the information each holds.
+
+**Businesses**
+{% highlight json %}
+{
+    "business_id": "Pns2l4eNsfO8kk83dixA6A",
+    "name": "Abby Rappoport, LAC, CMQ",
+    "address": "1616 Chapala St, Ste 2",
+    "city": "Santa Barbara",
+    "state": "CA",
+    "postal_code": "93101",
+    "latitude": 34.4266787,
+    "longitude": -119.7111968,
+    "stars": 5,
+    "review_count": 7,
+    "is_open": 0,
+    "attributes": {
+        "ByAppointmentOnly": "True"
+    },
+    "categories": "Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists",
+    "hours": null
+}
+{% endhighlight %}
+
+**Users**
+{% highlight json %}
+{
+    "user_id": "qVc8ODYU5SZjKXVBgXdI7w",
+    "name": "Walker",
+    "review_count": 585,
+    "yelping_since": "2007-01-25 16:47:26",
+    "useful": 7217,
+    "funny": 1259,
+    "cool": 5994,
+    "elite": "2007",
+    "friends": "NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA",
+    "fans": 267,
+    "average_stars": 3.91,
+    "compliment_hot": 250,
+    "compliment_more": 65,
+    "compliment_profile": 55,
+    "compliment_cute": 56,
+    "compliment_list": 18,
+    "compliment_note": 232,
+    "compliment_plain": 844,
+    "compliment_cool": 467,
+    "compliment_funny": 467,
+    "compliment_writer": 239,
+    "compliment_photos": 180
+}
+{% endhighlight %}
+
+**Reviews**
+{% highlight json %}
+{
+    "review_id": "KU_O5udG6zpxOg-VcAEodg",
+    "user_id": "mh_-eMZ6K5RLWhZyISBhwA",
+    "business_id": "XQfwVwDr-v0ZS3_CbbE5Xw",
+    "stars": 3,
+    "useful": 0,
+    "funny": 0,
+    "cool": 0,
+    "text": "If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.",
+    "date": "2018-07-07 22:09:11"
+}
+{% endhighlight %}
+
+### Expected for the diverge stage
+For the diverge phase, we expect 10 to 20 sketches that explore design space. Your collection will be evaluated on diversity, novelty and relevance. Please find a balance between novelty and relevance: even though we want you to not be critical at this stage, don't scribble random things under the pretense of "novelty".
+
+The image below shows an example sketch from one of your colleagues for the energy dataset.
+
+
+
+### Expected for the emerge stage
+For the emerge phase, we expect 5 to 10 sketches that either combine different sketches, or take certain sketches a step further. Again: refer to the teaching material for inspiration on different ways to combine sketches.
+
+### Specific instructions
+As we did in the group session, please:
+
+* put your initials in the top-right corner of each sketch
+* number each sketch in the top-left corner
+* clearly indicate what each mark means
+* for the emerge sketches, indicate which sketch(es) from the diverge (or emerge) phase are combined
+
+## Implementation
+For the implementation part, we will use the energy dataset that we used for the exercises during the year. This is to make sure that you have access to the data. There are two designs that you need to implement. Specific instructions as well as the designs are available at [https://datavis-exercises.vercel.app/resit_project](https://datavis-exercises.vercel.app/resit_project).
+
+You have two choices to obtain these instructions:
+
+1. Update your existing website by following the [Receiving new instructions](https://datavis-exercises.vercel.app/instructions/working_on_exercises) section.
+2. Create a fresh website for this term:
+   1. Create a new fork of the [exercise repository](https://gitlab.com/vda-lab/datavis_exercises).
+   2. Create a Vercel deployment for your new fork.
+   3. Send us your new GitLab and Vercel URLs by email.
+
+**Do not delay asking for help if you run into issues at this stage!**
+
+## How to submit
+For the **designs**, we want you to submit a single zip file which contains 2 folders: one called "diverge" with pictures of your diverge sketches, and one called "emerge" with pictures of your emerge sketches. We will create a Toledo/Blackboard assignment where you can upload them.
+
+For the **implementation**, we have created an additional folder in the git repository ("resit_project"), just like we did for the final visualisations in May. Remember that your visualisations have to show up on Vercel to get graded.
\ No newline at end of file