Commit 997f3e7

Merge branch 'arfffiles'

2 parents 9e25440 + 04883ad

5 files changed: +177 -96 lines changed

Project.toml

Lines changed: 8 additions & 3 deletions
@@ -1,15 +1,20 @@
 name = "MLJOpenML"
 uuid = "cbea4545-8c96-4583-ad3a-44078d60d369"
 authors = ["Anthony D. Blaom <[email protected]>"]
-version = "1.0.0"
+version = "2.0.0"

 [deps]
+ARFFFiles = "da404889-ca92-49ff-9e8b-0aa6b4d38dc8"
 HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
 JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
+Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
+ScientificTypes = "321657f4-b219-11e9-178b-2701a2544e81"

 [compat]
-HTTP = "^0.8, 0.9"
-JSON = "^0.21"
+HTTP = "0.8, 0.9"
+JSON = "0.21"
+ScientificTypes = "2"
+ARFFFiles = "1.3"
 julia = "1"

 [extras]
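A side note on the `[compat]` change above: dropping the `^` prefix is purely cosmetic, because Julia's Pkg gives a bare version string caret semantics by default. A quick check (a sketch, assuming a standard Julia installation with the stdlib `Pkg`):

```julia
using Pkg

# In a [compat] entry, a bare version like "0.21" already has caret
# semantics, so "^0.21" specifies exactly the same version range.
bare  = Pkg.Types.semver_spec("0.21")
caret = Pkg.Types.semver_spec("^0.21")

@show bare == caret        # true
@show v"0.21.4" in bare    # true
@show v"0.22.0" in bare    # false: for 0.x releases, a minor bump is breaking
```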

README.md

Lines changed: 17 additions & 3 deletions
@@ -8,6 +8,8 @@ A package providing integration of [OpenML](https://www.openml.org) with the
 [MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/) machine
 learning framework.

+Based entirely on Diego Arenas' original code contribution to MLJBase.jl.
+

 ## Installation

@@ -22,15 +24,27 @@ Load the iris data set from OpenML:

 ```julia
 using MLJOpenML
-rowtable = MLJOpenML.load(61)
+table = MLJOpenML.load(61) # a Tables.DictColumnTable
 ```

 Convert to a `DataFrame`:

-```
+```julia
 Pkg.add("DataFrames")
 using DataFrames
-df = DataFrame(rowtable)
+df = DataFrame(table)
+```
+
+Browsing and filtering datasets:
+
+```julia
+using DataFrames
+ds = MLJOpenML.list_datasets(output_format = DataFrame)
+MLJOpenML.describe_dataset(6)
+MLJOpenML.list_tags() # lists valid tags
+ds = MLJOpenML.list_datasets(tag = "OpenML100",
+         filter = "number_instances/100..1000/number_features/1..10",
+         output_format = DataFrame)
 ```

 ## Documentation
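The README's conversion step relies only on the Tables.jl interface, so it can be sketched without a network call. The column values below are invented stand-ins for the table that `MLJOpenML.load(61)` would return:

```julia
using DataFrames

# A NamedTuple of column vectors is a valid Tables.jl column table,
# playing the role of the table returned by MLJOpenML.load.
table = (sepal_length = [5.1, 4.9, 4.7],
         class = ["Iris-setosa", "Iris-setosa", "Iris-setosa"])

df = DataFrame(table)
@show size(df)   # (3, 2)
```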

src/MLJOpenML.jl

Lines changed: 3 additions & 0 deletions
@@ -1,5 +1,8 @@
 module MLJOpenML

+const OpenML = MLJOpenML
+export OpenML
+
 include("openml.jl")

 end # module
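The two added lines implement a self-aliasing pattern: the module binds a shorter name to itself and exports it, so `using MLJOpenML` also brings `OpenML` into scope. A minimal sketch, using a hypothetical `DemoPkg` module:

```julia
# Hypothetical module illustrating the alias-and-export pattern above.
module DemoPkg

const Demo = DemoPkg   # alias the module under a shorter name
export Demo            # users get the alias via `using`

greet() = "hello"

end # module

using .DemoPkg
@show Demo.greet()   # "hello"
```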

src/openml.jl

Lines changed: 147 additions & 88 deletions
@@ -1,5 +1,8 @@
 using HTTP
 using JSON
+import ARFFFiles
+import ScientificTypes: Continuous, Count, Textual, Multiclass, coerce, autotype
+using Markdown

 const API_URL = "https://www.openml.org/api/v1/json"

@@ -8,10 +11,10 @@ const API_URL = "https://www.openml.org/api/v1/json"
 # https://github.com/openml/OpenML/tree/master/openml_OS/views/pages/api_new/v1/xsd
 # https://www.openml.org/api_docs#!/data/get_data_id

-# To do:
-# - Save the file in a local folder
-# - Check downloaded files in local folder before downloading it again
-# - Use local stored file whenever possible
+# TODO:
+# - Use e.g. DataDeps to cache data locally
+# - Put the ARFF parser to a separate package or use ARFFFiles when
+#   https://github.com/cjdoris/ARFFFiles.jl/issues/4 is fixed.

 """
 Returns information about a dataset. The information includes the name,
@@ -42,73 +45,33 @@ function load_Dataset_Description(id::Int; api_key::String="")
 end

 """
-Returns a Vector of NamedTuples.
-Receives an `HTTP.Message.response` that has an
-ARFF file format in the `body` of the `Message`.
-"""
-function convert_ARFF_to_rowtable(response)
-    data = String(response.body)
-    data2 = split(data, "\n")
-
-    featureNames = String[]
-    dataTypes = String[]
-    # TODO: make this more performant by anticipating types?
-    named_tuples = [] # `Any` type here bad
-    for line in data2
-        if length(line) > 0
-            if line[1:1] != "%"
-                d = []
-                if occursin("@attribute", lowercase(line))
-                    push!(featureNames, replace(replace(split(line, " ")[2], "'" => ""), "-" => "_"))
-                    push!(dataTypes, split(line, " ")[3])
-                elseif occursin("@relation", lowercase(line))
-                    nothing
-                elseif occursin("@data", lowercase(line))
-                    # it means the data starts
-                    nothing
-                else
-                    values = split(line, ",")
-                    for i in eachindex(featureNames)
-                        if lowercase(dataTypes[i]) in ["real","numeric"]
-                            push!(d, featureNames[i] => Meta.parse(values[i]))
-                        else
-                            # all the rest will be considered as String
-                            push!(d, featureNames[i] => values[i])
-                        end
-                    end
-                    push!(named_tuples, (; (Symbol(k) => v for (k,v) in d)...))
-                end
-            end
-        end
-    end
-    return identity.(named_tuples) # not performant; see above
-end
+    MLJOpenML.load(id; parser = :arff)

-"""
-    MLJOpenML.load(id)
+Load the OpenML dataset with specified `id`, from those listed by
+[`list_datasets`](@ref) or on the [OpenML site](https://www.openml.org/search?type=data).
+With `parser = :arff` (default) the ARFFFiles.jl parser is used.
+With `parser = :auto` the output of the ARFFFiles parser is coerced to
+automatically detected scientific types.

-Load the OpenML dataset with specified `id`, from those listed on the
-[OpenML site](https://www.openml.org/search?type=data).
+Returns a table.

-Returns a "row table", i.e., a `Vector` of identically typed
-`NamedTuple`s. A row table is compatible with the
-[Tables.jl](https://github.com/JuliaData/Tables.jl) interface and can
-therefore be readily converted to other compatible formats. For
-example:
+# Examples

 ```julia
 using DataFrames
-rowtable = MLJOpenML.load(61);
-df = DataFrame(rowtable);
-
-using MLJ
-df2 = coerce(df, :class=>Multiclass)
+table = MLJOpenML.load(61);
+df = DataFrame(table);
 ```
 """
-function load(id::Int)
+function load(id::Int; parser = :arff)
     response = load_Dataset_Description(id)
     arff_file = HTTP.request("GET", response["data_set_description"]["url"])
-    return convert_ARFF_to_rowtable(arff_file)
+    data = ARFFFiles.load(IOBuffer(arff_file.body))
+    if parser == :auto
+        return coerce(data, autotype(data))
+    else
+        return data
+    end
 end

@@ -205,33 +168,9 @@ function load_Data_Qualities(id::Int; api_key::String = "")
 end

 """
-List datasets, possibly filtered by a range of properties.
-Any number of properties can be combined by listing them one after
-the other in the
-form '/data/list/{filter}/{value}/{filter}/{value}/...'
-Returns an array with all datasets that match the constraints.
-
-Any combination of these filters /limit/{limit}/offset/{offset} -
-returns only {limit} results starting from result number {offset}.
-Useful for paginating results. With /limit/5/offset/10,
-results 11..15 will be returned.
-
-Both limit and offset need to be specified.
-/status/{status} - returns only datasets with a given status,
-either 'active', 'deactivated', or 'in_preparation'.
-/tag/{tag} - returns only datasets tagged with the given tag.
-/{data_quality}/{range} - returns only tasks for which the
-underlying datasets have certain qualities.
-{data_quality} can be data_id, data_name, data_version, number_instances,
-number_features, number_classes, number_missing_values. {range} can be a
-specific value or a range in the form 'low..high'.
-Multiple qualities can be combined, as in
-'number_instances/0..50/number_features/0..10'.
-
-- 370 - Illegal filter specified.
-- 371 - Filter values/ranges not properly specified.
-- 372 - No results. There where no matches for the given constraints.
-- 373 - Can not specify an offset without a limit.
+    load_List_And_Filter(filters; api_key = "")
+
+See [OpenML API](https://www.openml.org/api_docs#!/data/get_data_list_filters).
 """
 function load_List_And_Filter(filters::String; api_key::String = "")
     if api_key == ""
@@ -257,6 +196,126 @@ function load_List_And_Filter(filters::String; api_key::String = "")
     return nothing
 end

+qualitynames(x) = haskey(x, "name") ? [x["name"]] : []
+
+"""
+    list_datasets(; tag = nothing, filters = "", api_key = "", output_format = NamedTuple)
+
+Lists all active OpenML datasets if `tag = nothing` (default).
+To list only datasets with a given tag, choose one of the tags in [`list_tags()`](@ref).
+An alternative `output_format` can be chosen, e.g. `DataFrame`, if the
+`DataFrames` package is loaded.
+
+A filter is a string of `<data quality>/<range>` or `<data quality>/<value>`
+pairs, concatenated using `/`, such as
+
+```julia
+filter = "number_features/10/number_instances/500..10000"
+```
+
+The allowed data qualities include `tag`, `status`, `limit`, `offset`,
+`data_id`, `data_name`, `data_version`, `uploader`,
+`number_instances`, `number_features`, `number_classes`,
+`number_missing_values`.
+
+For more on the format and effect of `filters` refer to the [OpenML
+API](https://www.openml.org/api_docs#!/data/get_data_list_filters).
+
+# Examples
+```
+julia> using DataFrames
+
+julia> ds = MLJOpenML.list_datasets(
+           tag = "OpenML100",
+           filter = "number_instances/100..1000/number_features/1..10",
+           output_format = DataFrame
+       )
+
+julia> sort!(ds, :NumberOfFeatures)
+```
+"""
+function list_datasets(; tag = nothing, filter = "", filters = filter,
+                       api_key = "", output_format = NamedTuple)
+    if tag !== nothing
+        if is_valid_tag(tag)
+            filters *= "/tag/$tag"
+        else
+            @warn "$tag is not a valid tag. See `list_tags()` for a list of tags."
+            return
+        end
+    end
+    data = MLJOpenML.load_List_And_Filter(filters; api_key = api_key)
+    datasets = data["data"]["dataset"]
+    qualities = Symbol.(union(vcat([vcat(qualitynames.(entry["quality"])...) for entry in datasets]...)))
+    result = merge((id = Int[], name = String[], status = String[]),
+                   NamedTuple{tuple(qualities...)}(ntuple(i -> Union{Missing, Int}[], length(qualities))))
+    for entry in datasets
+        push!(result.id, entry["did"])
+        push!(result.name, entry["name"])
+        push!(result.status, entry["status"])
+        for quality in entry["quality"]
+            push!(getproperty(result, Symbol(quality["name"])),
+                  Meta.parse(quality["value"]))
+        end
+        for quality in qualities
+            if length(getproperty(result, quality)) < length(result.id)
+                push!(getproperty(result, quality), missing)
+            end
+        end
+    end
+    output_format(result)
+end
+
+is_valid_tag(tag::String) = tag ∈ list_tags()
+is_valid_tag(tag) = false
+
+"""
+    list_tags()
+
+List all available tags.
+"""
+function list_tags()
+    url = string(API_URL, "/data/tag/list")
+    try
+        r = HTTP.request("GET", url)
+        return JSON.parse(String(r.body))["data_tag_list"]["tag"]
+    catch
+        return nothing
+    end
+end
+
+"""
+    describe_dataset(id)
+
+Load and show the OpenML description of the data set `id`.
+Use [`list_datasets`](@ref) to browse available data sets.
+
+# Examples
+```
+julia> MLJOpenML.describe_dataset(6)
+Author: David J. Slate Source: UCI
+(https://archive.ics.uci.edu/ml/datasets/Letter+Recognition) - 01-01-1991 Please cite: P.
+W. Frey and D. J. Slate. "Letter Recognition Using Holland-style Adaptive Classifiers".
+Machine Learning 6(2), 1991
+
+1. TITLE:
+
+Letter Image Recognition Data
+
+The objective is to identify each of a large number of black-and-white
+rectangular pixel displays as one of the 26 capital letters in the English
+alphabet. The character images were based on 20 different fonts and each
+letter within these 20 fonts was randomly distorted to produce a file of
+20,000 unique stimuli. Each stimulus was converted into 16 primitive
+numerical attributes (statistical moments and edge counts) which were then
+scaled to fit into a range of integer values from 0 through 15. We
+typically train on the first 16000 items and then use the resulting model
+to predict the letter category for the remaining 4000. See the article
+cited above for more details.
+```
+"""
+describe_dataset(id) = Markdown.parse(load_Dataset_Description(id)["data_set_description"]["description"])
+
 # Flow API

 # Task API
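The `parser = :auto` branch in the new `load` is just `coerce(data, autotype(data))` from ScientificTypes. Its effect can be sketched offline on a hand-made table; here a small `DataFrame` with invented values stands in for the parsed ARFF data:

```julia
using DataFrames
using ScientificTypes

# A string column with few distinct values, as the iris `class` column
# would arrive from the ARFF parser.
df = DataFrame(class = ["setosa", "versicolor", "setosa", "versicolor"])

# autotype suggests scientific types (e.g. Multiclass for the strings
# above, under the default :few_to_finite rule); coerce applies them.
df2 = coerce(df, autotype(df))
@show elscitype(df2.class)
```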

test/openml.jl

Lines changed: 2 additions & 2 deletions
@@ -23,8 +23,8 @@ end

 @testset "ARFF file conversion to NamedTuples" begin
     @test isempty(ntp_test) == false
-    @test length(ntp_test) == 150
-    @test length(ntp_test[1]) == 5
+    @test length(ntp_test[1]) == 150
+    @test length(ntp_test) == 5
 end

 @testset "data api functions" begin
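The swapped expectations reflect the switch from a row table (a vector of 150 row-`NamedTuple`s, so `length` was 150) to a column-oriented table (5 columns, each of length 150). The analogy below uses a plain `NamedTuple` of vectors with invented values:

```julia
# Column-oriented: the outer container holds 5 columns of length 150,
# where the old row table held 150 rows of length 5.
cols = (sepallength = fill(5.1, 150),
        sepalwidth  = fill(3.5, 150),
        petallength = fill(1.4, 150),
        petalwidth  = fill(0.2, 150),
        class       = fill("Iris-setosa", 150))

@show length(cols)      # 5
@show length(cols[1])   # 150
```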
