Commit 36a539d

added documentation for dataset creation
1 parent 0701711 commit 36a539d

5 files changed: +256 -2 lines changed

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
MIT License
22

3-
Copyright (c) 2020 Lei Ma
3+
Copyright (c) 2021 Lei Ma
44

55
Permission is hereby granted, free of charge, to any person obtaining a copy
66
of this software and associated documentation files (the "Software"), to deal

MANIFEST.in

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
1-
include README.md
1+
include README.md
2+
graft dataherb/serve/mkdocs_template

docs/tutorials/create/index.md

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
1+
## Create New Dataset
2+
3+
!!! warning "The demo dataset"
4+
We will use the data files in [this repository](https://github.com/DataHerb/dataherb-python-demo-dataset) as a demo. Please [download the zip of this repo](https://github.com/DataHerb/dataherb-python-demo-dataset/archive/refs/heads/main.zip).
5+
6+
7+
8+
### Create
9+
10+
11+
After downloading and unzipping [the demo dataset](https://github.com/DataHerb/dataherb-python-demo-dataset/archive/refs/heads/main.zip), `cd` into the folder. In my case, it is
12+
13+
```bash
14+
cd ~/Downloads/dataherb-python-demo-dataset-main
15+
```
16+
17+
Run the command
18+
19+
```bash
20+
dataherb create
21+
```
22+
23+
and a few prompts will appear.
24+
25+
```bash
26+
Your current working directory is /Users/itsme/Downloads/dataherb-python-demo-dataset-main
27+
A dataherb.json file will be created right here.
28+
Are you sure this is the correct path? [y/N]: y
29+
```
30+
31+
This makes sure that we are working in the correct folder. In this case, we type `y` to confirm.
32+
33+
#### Which Type of Remote
34+
35+
```bash
36+
[?] Where is/will be the dataset synced to?: git
37+
> git
38+
s3
39+
```
40+
41+
Dataherb supports two types of remote source: S3 and git. Here we choose `git` because we want to sync the dataset to a GitHub repository.
42+
43+
#### Name of the Dataset
44+
45+
```bash
46+
[?] How would you like to name the dataset?: Dataherb Demo Dataset
47+
```
48+
49+
This will be the name of your dataset.
50+
51+
#### ID of the Dataset
52+
53+
```bash
54+
[?] Please specify a unique id for the dataset: git-dataherb-python-demo-dataset
55+
```
56+
57+
The ID of the dataset has to be unique across your whole flora.
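
A minimal sketch of what the uniqueness requirement means, assuming the flora can be read as a list of metadata records that each carry an `id` key. The `flora` list and the `is_id_available` helper are hypothetical and are not part of the `dataherb` API.

```python
def is_id_available(flora, candidate_id):
    """Return True if no herb in the flora already uses candidate_id."""
    existing_ids = {herb["id"] for herb in flora}
    return candidate_id not in existing_ids


# Hypothetical flora that already contains the demo herb.
flora = [
    {"id": "git-dataherb-python-demo-dataset", "name": "Dataherb Demo Dataset"},
]

taken = is_id_available(flora, "git-dataherb-python-demo-dataset")
free = is_id_available(flora, "git-another-dataset")
print(taken, free)
```

Prefixing the ID with the remote type, as in `git-dataherb-python-demo-dataset`, is one way to keep IDs distinct when the same dataset is synced to more than one kind of remote.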
58+
59+
#### Description
60+
61+
```bash
62+
[?] What is the dataset about? This will be the description of the dataset.: This is a demo dataset to test and show dataherb.
63+
```
64+
65+
Describe the dataset here.
66+
67+
#### URI
68+
69+
```bash
70+
[?] What is the dataset's URI? This will be the URI of the dataset.: https://github.com/DataHerb/dataherb-python-demo-dataset-created.git
71+
```
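
The `metadata_uri` recorded in the generated `dataherb.json` (shown in the Result section) matches GitHub's raw-file URL for `dataherb.json` on the `main` branch of this repository. The sketch below reproduces that mapping. `guess_metadata_uri` is a hypothetical helper, and how `dataherb` itself builds the value is an assumption.

```python
def guess_metadata_uri(git_uri, branch="main", filename="dataherb.json"):
    """Map a GitHub HTTPS clone URL to the raw URL of a file on a branch."""
    prefix = "https://github.com/"
    if not git_uri.startswith(prefix):
        raise ValueError("this sketch only handles https://github.com/ URLs")
    # Keep "owner/repo" and drop the optional ".git" suffix.
    path = git_uri[len(prefix):]
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"https://raw.githubusercontent.com/{path}/{branch}/{filename}"


uri = "https://github.com/DataHerb/dataherb-python-demo-dataset-created.git"
metadata_uri = guess_metadata_uri(uri)
print(metadata_uri)
```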
72+
73+
74+
75+
76+
### Result
77+
78+
Two things happen after this.
79+
80+
1. A `dataherb.json` file is created in the current folder.
81+
2. The metadata for this dataset is added to the flora.
82+
83+
84+
85+
!!! note "Content of the `dataherb.json` file"
86+
The content of the file should be the following.
87+
88+
```json
89+
{
90+
"source": "git",
91+
"name": "Dataherb Demo Dataset",
92+
"id": "git-dataherb-python-demo-dataset",
93+
"description": "This is a demo dataset to test and show dataherb.",
94+
"uri": "https://github.com/DataHerb/dataherb-python-demo-dataset-created.git",
95+
"metadata_uri": "https://raw.githubusercontent.com/DataHerb/dataherb-python-demo-dataset-created/main/dataherb.json",
96+
"datapackage": {
97+
"profile": "tabular-data-package",
98+
"resources": [
99+
{
100+
"path": "dataset/indeed_job_listing.csv",
101+
"profile": "tabular-data-resource",
102+
"name": "indeed_job_listing",
103+
"format": "csv",
104+
"mediatype": "text/csv",
105+
"encoding": "windows-1252",
106+
"schema": {
107+
"fields": [
108+
{
109+
"name": "title",
110+
"type": "string",
111+
"format": "default"
112+
},
113+
{
114+
"name": "location",
115+
"type": "string",
116+
"format": "default"
117+
},
118+
{
119+
"name": "company",
120+
"type": "string",
121+
"format": "default"
122+
},
123+
{
124+
"name": "description",
125+
"type": "string",
126+
"format": "default"
127+
},
128+
{
129+
"name": "salary",
130+
"type": "string",
131+
"format": "default"
132+
},
133+
{
134+
"name": "url",
135+
"type": "string",
136+
"format": "default"
137+
},
138+
{
139+
"name": "published_at",
140+
"type": "string",
141+
"format": "default"
142+
},
143+
{
144+
"name": "id",
145+
"type": "string",
146+
"format": "default"
147+
}
148+
],
149+
"missingValues": [
150+
""
151+
]
152+
}
153+
},
154+
{
155+
"path": "dataset/stackoverflow_job_listing.csv",
156+
"profile": "tabular-data-resource",
157+
"name": "stackoverflow_job_listing",
158+
"format": "csv",
159+
"mediatype": "text/csv",
160+
"encoding": "utf-8",
161+
"schema": {
162+
"fields": [
163+
{
164+
"name": "link",
165+
"type": "string",
166+
"format": "default"
167+
},
168+
{
169+
"name": "category",
170+
"type": "string",
171+
"format": "default"
172+
},
173+
{
174+
"name": "title",
175+
"type": "string",
176+
"format": "default"
177+
},
178+
{
179+
"name": "description",
180+
"type": "string",
181+
"format": "default"
182+
},
183+
{
184+
"name": "published_at",
185+
"type": "string",
186+
"format": "default"
187+
},
188+
{
189+
"name": "location",
190+
"type": "string",
191+
"format": "default"
192+
},
193+
{
194+
"name": "stackoverflow_id",
195+
"type": "integer",
196+
"format": "default"
197+
},
198+
{
199+
"name": "author",
200+
"type": "string",
201+
"format": "default"
202+
},
203+
{
204+
"name": "location_country",
205+
"type": "string",
206+
"format": "default"
207+
},
208+
{
209+
"name": "location_city",
210+
"type": "string",
211+
"format": "default"
212+
},
213+
{
214+
"name": "updated_at",
215+
"type": "string",
216+
"format": "default"
217+
}
218+
],
219+
"missingValues": [
220+
""
221+
]
222+
}
223+
}
224+
]
225+
}
226+
}
227+
```
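
As a rough illustration of how this metadata can be consumed, the sketch below parses a trimmed copy of the file with Python's standard `json` module and lists each resource with its encoding and field names. The trimmed string keeps only one field per resource and is an assumption made for brevity. It does not use the `dataherb` API.

```python
import json

# Trimmed copy of the dataherb.json above: one field kept per resource.
raw = """
{
  "id": "git-dataherb-python-demo-dataset",
  "datapackage": {
    "resources": [
      {
        "path": "dataset/indeed_job_listing.csv",
        "encoding": "windows-1252",
        "schema": {"fields": [{"name": "title", "type": "string"}]}
      },
      {
        "path": "dataset/stackoverflow_job_listing.csv",
        "encoding": "utf-8",
        "schema": {"fields": [{"name": "stackoverflow_id", "type": "integer"}]}
      }
    ]
  }
}
"""

meta = json.loads(raw)

# Collect (path, encoding, field names) for every resource.
summary = [
    (res["path"], res["encoding"], [f["name"] for f in res["schema"]["fields"]])
    for res in meta["datapackage"]["resources"]
]

for path, encoding, fields in summary:
    print(path, encoding, fields)
```

Note that the two CSV files declare different encodings, so a reader should take the `encoding` value from each resource rather than assume UTF-8.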

docs/tutorials/index.md

Lines changed: 25 additions & 0 deletions
@@ -1,3 +1,28 @@
11
## Tutorials
22

33
This is a series of tutorials to help you get started with `dataherb`.
4+
5+
### Herbs and Flora
6+
7+
The name Dataherb is a metaphor. Each data file is like a **Leaf** (or **Resource**) of a **Herb** (or **Dataset**). Many Herbs form a **Flora**.
8+
9+
| Dataherb Term | Meaning |
10+
|-----|-----|
11+
| Herb | A Dataset |
12+
| Resource | A Data File |
13+
14+
!!! note "Leaf and Resource"
15+
Leaf is an alias of Resource.
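
The relationship between these terms can be pictured as a small data model. This is only a sketch of the metaphor: the class names and attributes below are hypothetical and are not the classes that `dataherb` defines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """A single data file."""
    path: str


# Leaf is an alias of Resource.
Leaf = Resource


@dataclass
class Herb:
    """A dataset made up of one or more resources."""
    id: str
    resources: List[Resource] = field(default_factory=list)


@dataclass
class Flora:
    """A collection of herbs."""
    herbs: List[Herb] = field(default_factory=list)


herb = Herb(
    id="git-dataherb-python-demo-dataset",
    resources=[
        Leaf("dataset/indeed_job_listing.csv"),
        Leaf("dataset/stackoverflow_job_listing.csv"),
    ],
)
flora = Flora(herbs=[herb])
print(len(flora.herbs), len(flora.herbs[0].resources))
```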
16+
17+
### Managing the Flora
18+
19+
This Python package ships a command-line tool that can be used to manage the **Flora**.
20+
21+
At its core, a flora is a JSON file that lists the metadata of herbs. The default flora can be configured using `dataherb configure`.
22+
23+
Using this configuration, we can store the JSON file that represents the flora and recreate the whole flora on other devices.
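
A minimal sketch of that idea, assuming the flora is a JSON list of herb metadata records. The file name `flora.json` and the record layout are assumptions made for illustration, not the layout `dataherb` necessarily uses.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical flora holding the metadata of one herb.
flora = [
    {"id": "git-dataherb-python-demo-dataset", "name": "Dataherb Demo Dataset"},
]

with tempfile.TemporaryDirectory() as tmp:
    flora_path = Path(tmp) / "flora.json"  # hypothetical file name

    # Store the flora as JSON ...
    flora_path.write_text(json.dumps(flora, indent=2), encoding="utf-8")

    # ... and recreate it later, for example on another device.
    restored = json.loads(flora_path.read_text(encoding="utf-8"))

print(restored == flora)
```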
24+
25+
!!! note "Sync the Flora to GitHub"
26+
For example, we can put the flora under version control, push it to GitHub, and pull it on other devices. In this way, we can sync and restore the flora.
27+
28+

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ nav:
7676
- "Tutorials":
7777
- "Tutorials": tutorials/index.md
7878
- "Configuration": tutorials/configuration/index.md
79+
- "Create Dataset": tutorials/create/index.md
7980
- References:
8081
- "Introduction": references/index.md
8182
- "dataherb.command": references/command.md
