feat: New ns_graph() and lower-level ns_df() and ns_metadata() to download from [netzschleuder](https://networks.skewed.de)
#23
Conversation
|
Is it documented how networks are encoded into CSV on Netzschleuder, and if not, can we write a short description and include it in comments? I expect this won't be as simple to get working with the majority of datasets as it might seem. |
|
Here's a stress-test ;-) Not too large, but plenty of attributes. https://networks.skewed.de/net/blumenau_drug |
My plan is to write a script that runs through all networks once to see if there are any differences. My sampling so far showed that it looks pretty consistent: edges.csv has the form source, target, attributes. |
This one has a logical vertex attribute that is not currently imported as such. I'm not sure how easy it is to detect that this is logical while maintaining good performance. |
Good catch. I would add minty as a dependency for this.
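For context on what minty::type_convert() would buy here, a minimal sketch (the "True"/"False" encoding and the toy data frame are made up for illustration):

nodes_raw <- data.frame(
  id = 0:2,
  hormone = c("True", "False", "True"),
  stringsAsFactors = FALSE
)
# type_convert() re-guesses column types, so a character column of
# boolean-looking strings comes back as logical without per-column checks
nodes <- minty::type_convert(nodes_raw)
str(nodes)  # hormone should now be logical rather than character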
maelle
left a comment
🎉 I find that the function name is a mouthful 😸
Co-authored-by: Maëlle Salmon <[email protected]>
|
@maelle how do I properly document the dots argument? R CMD check complains about them |
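The usual fix for R CMD check is an explicit @param entry for the dots; a rough sketch (the parameter descriptions here are placeholders, not the PR's actual docs):

#' @param name Name of the dataset, e.g. "blumenau_drug" or "fresh_webs/AkatoreA".
#' @param token Optional netzschleuder API token.
#' @param ... Further arguments passed on to the underlying request helpers.
#' @export
ns_graph <- function(name, token = NULL, ...) {
  # ...
}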
|
The function now has custom rate limiting: 20 calls per minute. Is that reasonable @szhorvat? |
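For comparison, httr2's built-in throttling expresses the same budget roughly like this (the realm string is an assumption, and newer httr2 versions phrase the rate as capacity/fill time instead of a rate argument):

library(httr2)
req <- request("https://networks.skewed.de/api") |>
  req_throttle(rate = 20 / 60, realm = "netzschleuder")  # at most 20 requests per minute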
Indeed. Do you have a suggestion? I don't know how to make it shorter. |
|
Some comments from my side: Name: I think the name is good. It should be descriptive since it will be a relatively rarely seen function. Auto-complete makes this a non-issue. Or do you mean that Netzschleuder is just hard to spell for people with no exposure to German?
I get a warning with some networks, e.g. … Multi-network datasets: these do not work currently; this is expected, right? When they do start working, what will happen if only the collection name is given, not a subnetwork? Can we support the slash syntax, like here? Is it very un-R-like to also support the tuple syntax? With these approaches we could get rid of the … |
|
Thanks for the feedback!
Yes, that was due to the CSV format and me not wanting to add a dependency when this was still planned as an igraph function. I will implement this via the API, same as bipartite.
Thanks, that's fixable.
Do you mean that it should be possible to download a whole collection at once? Like a list of all … You are right, currently we need collection name + network name to get one network from a collection. For consistency with gt, I think I will go with the slash notation.
Yes. When I saw this, I immediately thought you want to fetch two networks, not |
No, I didn't mean to imply that, I was just asking about what your plans were. I vote for not allowing a full set download. But we still need to choose some response to providing only a set name instead of a single network. This could be a simple "not found", but perhaps it's nicer to tell the user to choose a specific network? Maybe the error message could even link to the relevant networks.skewed.de dataset page, to aid the user in selecting one? |
|
OK, yes, I agree that downloading a full collection shouldn't be possible. I will think of a logic for this. |
|
Some more brainstorming: Now we have … "read" to me does not say "data frame", but perhaps this is because of my inexperience with R. There are "read" functions that don't produce data frames, such as … If we could find a more descriptive name that suggests "this downloads network data", then we can shorten … The output format doesn't necessarily need to be split between two functions. It can also be controlled by a parameter, in which case we could have a single function. For example,
Other prefix ideas instead of
|
|
While you brainstormed, I implemented what we discussed before. Now we have graph_from_netzschleuder():

pkgload::load_all("~/git/R_packages/igraphdata/")
#> ℹ Loading igraphdata
graph_from_netzschleuder("blumenau_drug")
#> IGRAPH 2849f38 UNW- 75 181 --
#> + attr: name (v/c), p_node (v/n), hormone (v/l), len_i (v/n), color
#> | (v/c), label (v/c), class (v/c), x (v/n), y (v/n), tau (e/n),
#> | severity (e/c), weight (e/n), color (e/c), gender (e/c),
#> | patients_norm (e/n), RRI.F (e/n), patients (e/n), tau_norm (e/n),
#> | RRI.M (e/n)
#> + edges from 2849f38 (vertex names):
#> [1] DB01174--DB00741 DB01174--DB00860 DB01174--DB00635 DB01174--DB00571
#> [5] DB01174--DB00367 DB01174--DB01223 DB00741--DB00252 DB00252--DB00537
#> [9] DB00252--DB01223 DB00252--DB00695 DB00252--DB00359 DB00252--DB00338
#> [13] DB00252--DB00196 DB00252--DB01234 DB00252--DB00860 DB00252--DB00829
#> + ... omitted several edges
graph_from_netzschleuder("fresh_webs/AkatoreA")
#> IGRAPH 58361eb DNW- 85 227 --
#> + attr: name (v/c), x (v/n), y (v/n), weight (e/n)
#> + edges from 58361eb (vertex names):
#> [1] 1 ->45 1 ->46 1 ->47 1 ->48 1 ->49 1 ->51 1 ->52 1 ->53 1 ->54 1 ->55
#> [11] 1 ->56 1 ->57 1 ->58 1 ->63 1 ->64 1 ->66 1 ->67 1 ->68 1 ->69 1 ->70
#> [21] 1 ->71 1 ->72 1 ->73 1 ->74 1 ->75 1 ->76 1 ->78 1 ->81 1 ->83 1 ->84
#> [31] 1 ->85 2 ->83 3 ->46 3 ->47 3 ->48 3 ->50 3 ->53 3 ->54 3 ->55 3 ->56
#> [41] 3 ->58 3 ->59 3 ->60 3 ->61 3 ->62 3 ->65 3 ->67 3 ->68 3 ->70 3 ->71
#> [51] 3 ->72 3 ->74 3 ->75 3 ->78 3 ->82 3 ->83 4 ->45 4 ->46 4 ->66 4 ->80
#> [61] 5 ->83 6 ->46 6 ->48 6 ->50 6 ->53 6 ->55 6 ->61 6 ->65 6 ->67 6 ->69
#> [71] 6 ->75 6 ->83 7 ->55 7 ->61 7 ->83 8 ->55 8 ->83 9 ->83 10->71 10->72
#> + ... omitted several edges
graph_from_netzschleuder("fresh_webs")
#> Error in `read_from_netzschleuder()` at igraphdata/R/netzschleuder.R:139:3:
#> ! fresh_webs is a collection and downloading a whole collection is not
#> permitted.
#> ℹ see <https://networks.skewed.de/net/fresh_webs>
graph_from_netzschleuder("fresh_webs/nope")
#> Error in `read_from_netzschleuder()` at igraphdata/R/netzschleuder.R:139:3:
#> ! nope is not part of the collection fresh_webs.
#> ℹ see <https://networks.skewed.de/net/fresh_webs>

Created on 2025-04-10 with reprex v2.1.1 |
|
This is maybe personal, but I do not like it if the output of a function changes depending on a parameter like … I am not good at naming things, so I am not sure what the best option would be. |
krlmlr
left a comment
Thanks, a very useful function. Some general remarks:
- Do we need to close the connections opened with unz() to avoid warnings? An easy way is withr::local_connection(unz(...)) if we don't mind the dependency.
- I prefer [[ over [ for accessing single elements of a vector, and vapply() over sapply(): Strictness.
- Should we use the longer netzschleuder_ prefix instead of ns_?
- I wonder if the df and graph functions should accept a metadata object. We'd add a class and provide a print() method for extra sugar.
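A minimal sketch of the withr idea (helper name and arguments are illustrative, not the PR's code): opening the zip entry in read mode and registering it with withr means the connection is closed when the helper exits, so no "closing unused connection" warnings accumulate.

read_nodes <- function(zip_path, node_file_name) {
  con <- withr::local_connection(unz(zip_path, node_file_name, open = "r"))
  utils::read.csv(con)
}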
R/netzschleuder.R
Outdated
if (net_ident[1] == net_ident[2] && length(unlist(raw$nets)) > 1) {
  cli::cli_abort(
    c(
      "{net_ident[1]} is a collection and downloading a whole collection is not permitted.",
      "i" = "see {.url {collection_url}}"
    )
  )
} else if (net_ident[1] == net_ident[2]) {
  return(raw)
} else {
  idx <- which(unlist(raw[["nets"]]) == net_ident[2])
  if (length(idx) == 0) {
    cli::cli_abort(
      c(
        "{net_ident[2]} is not part of the collection {net_ident[1]}.",
        "i" = "see {.url {collection_url}}"
      ),
      call = call
    )
  }
  raw[["analyses"]] <- raw[["analyses"]][[net_ident[2]]]
  raw[["nets"]] <- raw[["nets"]][idx]
  raw
}
Do you have examples for each of the code paths?
Yes, all three are needed and … The error handling should tell users what they are supposed to do to get the correct network.
if (grepl("/", x)) {
  return(strsplit(x, "/", fixed = TRUE)[[1]])
} else {
  c(x, x)
This looks interesting to me. I wonder if we can make the intent here clearer.
This mechanism is needed for the API. If a collection has only one network, the URL is
https://networks.skewed.de/net/7th_graders/files/7th_graders.csv.zip. So the name is duplicated. For a network in a collection, the zip file has a different name.
Got it. Should we then return a named list instead, to clarify the intent?
sounds good
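A sketch of what the named-list version could look like (helper and element names are assumptions for illustration):

split_name <- function(x) {
  if (grepl("/", x, fixed = TRUE)) {
    parts <- strsplit(x, "/", fixed = TRUE)[[1]]
    list(collection = parts[[1]], network = parts[[2]])
  } else {
    # single-network datasets reuse the collection name as the network name
    list(collection = x, network = x)
  }
}

split_name("7th_graders")          # collection and network both "7th_graders"
split_name("fresh_webs/AkatoreA")  # collection "fresh_webs", network "AkatoreA"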
|
|
download_file <- function(zip_url, token = NULL, file) {
  resp <- make_request(zip_url, token)
  writeBin(httr2::resp_body_raw(resp), file)
Are the files always small enough to fit into memory?
There are some networks that go up to 10 GB. Should we implement something that stops users from downloading files over a certain size?
Should we implement something that stops users from downloading files over a certain size?
IMO it's not a bad idea. Many of our users are inexperienced, and the risk of accidentally trying to download something large is too high. But is it possible to determine the size in advance? Or would you use the edge and vertex count as a proxy for the size?
Do I understand correctly that httr2 does automatically cache requests, preventing the re-downloading of large data?
If you do this, I'd suggest having an option to override and download anyway.
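One way the size guard could look, assuming the metadata exposes an edge count (the field names, threshold and force argument are all assumptions for illustration):

check_download_size <- function(meta, max_edges = 1e7, force = FALSE) {
  n_edges <- meta$analyses$num_edges
  if (!force && !is.null(n_edges) && n_edges > max_edges) {
    cli::cli_abort(c(
      "This network has {n_edges} edges and the download may be very large.",
      "i" = "Re-run with `force = TRUE` to download it anyway."
    ))
  }
  invisible(TRUE)
}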
Something to consider is whether R has limits on the number of elements in dataframes, or number of rows it can read from CSVs. R still has this silly limitation of 32-bit integers, and while it can index with doubles now (weird!), not every object has support for such indexing. E.g. matrices don't.
This is about raw data loaded into memory, which is what this code appears to be doing. Can we let curl handle the downloading in this case, to write directly to disk in chunks? There might be a way to show a progress bar too, and the user can cancel if it's too large.
I will dig into httr2 a bit on this. I would like to keep using it here too for the rate limiting. I am sure there is something that can help us here. Otherwise my idea was to create an option that specifies a hard upper limit on the number of edges/nodes that are permitted.
Is the rate limiting important for downloads, or just for API calls?
We can probably get the body in chunks and write piecemeal if really necessary.
The rate limiting is the most important for files because that is the heaviest load on the server.
shouldn't
We discussed it further up that
Good idea. I will explore that |
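A sketch of the streaming idea with httr2 alone (URL and destination are placeholders; req_progress() needs a recent httr2 version): passing a path to req_perform() writes the body to disk instead of keeping it in memory.

library(httr2)
zip_url <- "https://networks.skewed.de/net/7th_graders/files/7th_graders.csv.zip"
dest <- tempfile(fileext = ".zip")
request(zip_url) |>
  req_progress() |>         # download progress bar, cancellable
  req_perform(path = dest)  # stream the body straight to `dest`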
|
Not sure the switch to an S3 method for … |
|
I worked a bit more on trying to break things, and here's what I found. Note that … Let's misspell the subnetwork: … This error could be nicer. Let's write nothing after … Let's add one more slash and an extra name: … Let's take a dataset that has only a single network, so it should not use the … |
|
Should |
|
Thanks for the thorough testing. Now try to break it again ;-). I think you are right with |
This sometimes happens when passing the wrong type, as that type is converted to character, which tends to result in a vector of more than one element.
I'm out of ideas on how to break it. Out of curiosity, are downloaded networks cached? It's my impression that |
|
I was thinking about caching. If I find a way that is not too convoluted, then I will implement it. httr2 has a cache function, but I am not sure how persistent it is. Creating a persistent rate limit was already quite convoluted. |
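For reference, httr2's req_cache() writes responses to a directory on disk, so it can survive across sessions; how much gets reused depends on the caching headers the server sends. The cache directory and max_age below are assumptions:

library(httr2)
cache_dir <- tools::R_user_dir("igraphdata", which = "cache")
req <- request("https://networks.skewed.de/api/net/blumenau_drug") |>
  req_cache(path = cache_dir, max_age = 7 * 24 * 3600)  # keep cached responses up to a week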
|
This suggests that |
krlmlr
left a comment
Thanks! A wide range from questions to nitpicks to more serious concerns.
DESCRIPTION
Outdated
cli,
httr2,
All of those must be in "Suggests".
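Roughly what that would look like in DESCRIPTION (the exact package list is whatever the PR ends up needing; minty and withr are included only because they come up in this review):

Suggests:
    cli,
    httr2,
    minty,
    withr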
# remove trailing /
x <- sub("/$", "", x)
# remove double slash
x <- sub("//", "/", x)
Why is this important?
Because @szhorvat tried to break the name parameter; this is just some cleanup of common user input errors that could occur.
nodes_df_raw <- utils::read.csv(unz(temp, node_file_name))
# suppress warning if no character columns found
nodes_df <- suppressWarnings(minty::type_convert(nodes_df_raw))
names(nodes_df)[1] <- "id"
I'm leaning towards implementing lintr just for this, here and elsewhere.
names(nodes_df)[1] <- "id"    # current
names(nodes_df)[[1]] <- "id"  # suggested
R/netzschleuder.R
Outdated
#' @rdname netzschleuder
#' @export
ns_graph <- function(name, token = NULL) {
  graph_data <- ns_df(name, token = token)
My comment was not about S3; we can implement poor man's dispatch here (a hand-rolled check whether it's a character or a classed object). Users may want to inspect or do other things to the meta object before creating a graph or a data frame from it. We could also decouple downloading from df creation, but that's a bit of a can of worms.
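A sketch of the hand-rolled dispatch (the class name "ns_metadata" and the error wording are assumptions):

ns_graph <- function(name, token = NULL) {
  if (inherits(name, "ns_metadata")) {
    meta <- name                              # metadata object passed in directly
  } else if (is.character(name) && length(name) == 1) {
    meta <- ns_metadata(name, token = token)  # plain "collection/network" string
  } else {
    cli::cli_abort("`name` must be a single string or an ns_metadata object.")
  }
  # ... build the igraph object from `meta` ...
}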
|
I think I have caught up with all suggestions now. The big ones:
|
|
Thanks! Opening a new issue to address remaining comments. |
Fix #22