Commit 3c9dd3b: Refactor (#168)

* full refactor
* README and tidy up
* tidy

1 parent 55fdb56 commit 3c9dd3b

24 files changed: +3450 −1827 lines

README.md

Lines changed: 80 additions & 76 deletions
## Introduction

When populating CouchDB databases, often the source of the data is initially a CSV or TSV file. *couchimport* is designed to assist you with importing flat data into CouchDB efficiently.

It can be used either as the command-line utilities `couchimport` and `couchexport`, or the underlying functions can be used programmatically:

* simply pipe the data file to *couchimport* on the command line.
* handles tab- or comma-separated data.
* uses Node.js streams for memory efficiency.
* plug in a custom function to make your own changes before the data is written.
* writes the data in bulk for speed.
* can also read huge JSON files using a streaming JSON parser.
* allows multiple HTTP writes to happen at once using the `--parallelism` option.

![schematic](https://github.com/glynnbird/couchimport/raw/master/images/couchimport.png "Schematic Diagram")
## Installation

Requirements

* node.js
* npm

```sh
sudo npm install -g couchimport
```

## Configuration

*couchimport*'s configuration parameters can be stored in environment variables or supplied as command-line arguments.

### The location of CouchDB

Simply set the `COUCH_URL` environment variable, e.g. for a hosted Cloudant database:

```sh
export COUCH_URL="https://myusername:myPassw0rd@myhost.cloudant.com"
```

or a local CouchDB installation:

```sh
export COUCH_URL="http://localhost:5984"
```

### The name of the database - default "test"

Define the name of the CouchDB database to write to by setting the `COUCH_DATABASE` environment variable, e.g.

```sh
export COUCH_DATABASE="mydatabase"
```

### Transformation function - default nothing

Define the path of a file containing a transformation function, e.g.

```sh
export COUCH_TRANSFORM="/home/myuser/transform.js"
```

The file should:

* be a JavaScript file
* export one function that takes a single doc and returns a single object, or an array of objects if you need to split a row into multiple docs

(see examples directory).

### Delimiter - default "\t"

Define the column delimiter in the input data, e.g.

```sh
export COUCH_DELIMITER=","
```

## Running

Simply pipe the text data into `couchimport`:

```sh
cat ~/test.tsv | couchimport
```

This example downloads public crime data, unzips it and imports it:

```sh
curl 'http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip' > crime.zip
unzip crime.zip
export COUCH_DATABASE="crime_2013"
cat crime_incidents_2013_CSV.csv | couchimport
```

In the above example we use [ccurl](https://github.com/glynnbird/ccurl), a command-line utility that uses the same environment variables as *couchimport*.
99101

## Output

The following output is visible on the console when `couchimport` runs:

```
couchimport
-----------
url         : "https://****:****@myhost.cloudant.com"
database    : "test"
delimiter   : "\t"
buffer      : 500
parallelism : 1
type        : "text"
-----------
couchimport Written ok:500 - failed: 0 - (500) +0ms
couchimport { documents: 500, failed: 0, total: 500, totalfailed: 0 } +0ms
couchimport Written ok:499 - failed: 0 - (999) +368ms
couchimport { documents: 499, failed: 0, total: 999, totalfailed: 0 } +368ms
couchimport writecomplete { total: 999, totalfailed: 0 } +0ms
couchimport Import complete +81ms
```

The configuration, whether default or overridden by environment variables or command-line arguments, is shown. This is followed by a line of output for each block of 500 documents written, plus a cumulative total.
If you want to see a preview of the JSON that would be created from your CSV/TSV files then add `--preview true` to your command line:

```sh
> cat text.txt | couchimport --preview true
Detected a TAB column delimiter
{ product_id: '1',
```

As well as showing a JSON preview, preview mode also attempts to detect the column delimiter.
If your source document is a GeoJSON text file, `couchimport` can be used. Let's say your JSON looks like this:

```js
{ "features": [ {"a":1}, {"a":2} ] }
```

and we need to import each feature object into CouchDB as a separate document, then this can be done using the `type="json"` argument, specifying the JSON path with the `jsonpath` argument:

```sh
cat myfile.json | couchimport --database mydb --type json --jsonpath "features.*"
```
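As a rough illustration of what the `features.*` path selects, the sketch below uses plain `JSON.parse` rather than couchimport's streaming parser, so it only suits small inputs:

```javascript
// Rough illustration of what the "features.*" JSON path selects.
// couchimport streams huge files; JSON.parse is used here only to show
// which objects become CouchDB documents.
const input = '{ "features": [ {"a":1}, {"a":2} ] }';
const docs = JSON.parse(input).features; // each element becomes one document
console.log(docs); // [ { a: 1 }, { a: 2 } ]
```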
## Importing JSON Lines file

If your source document is a [JSON Lines](http://jsonlines.org/) text file, `couchimport` can be used. Let's say your JSON Lines file looks like this:

```js
{"a":1}
{"a":2}
{"a":3}
```

and we need to import each line as a JSON object into CouchDB as a separate document, then this can be done using the `type="jsonl"` argument:

```sh
cat myfile.json | couchimport --database mydb --type jsonl
```
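The mapping from JSON Lines to documents can be sketched like this (couchimport itself streams the input rather than loading it whole):

```javascript
// Sketch of the JSON Lines -> documents mapping:
// each non-empty line is parsed independently into one document.
const jsonl = '{"a":1}\n{"a":2}\n{"a":3}\n';
const docs = jsonl
  .split('\n')
  .filter((line) => line.trim() !== '') // skip blank lines
  .map((line) => JSON.parse(line));     // one document per line
```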
## Importing a stream of JSONs

If your source data is a lot of JSON objects meshed or appended together, `couchimport` can be used. Let's say your file looks like this:

```js
{"a":1}{"a":2} {"a":3}{"a":4}
{"a":5} {"a":6}
{"a":7}{"a":8}
```

and we need to import each JSON object into CouchDB as a separate document.
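couchimport handles this with its streaming JSON parser; purely to illustrate the idea, concatenated objects can be separated by tracking brace depth outside of string literals:

```javascript
// Illustration only: split concatenated JSON objects by tracking brace
// depth, ignoring braces that appear inside string values.
// couchimport itself uses a streaming JSON parser for this.
function splitConcatenatedJSON(text) {
  const docs = [];
  let depth = 0, start = -1, inString = false, escaped = false;
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (inString) {
      if (escaped) escaped = false;       // skip the escaped character
      else if (c === '\\') escaped = true;
      else if (c === '"') inString = false;
      continue;
    }
    if (c === '"') inString = true;
    else if (c === '{') { if (depth === 0) start = i; depth++; }
    else if (c === '}') {
      depth--;
      // depth back to zero means one complete top-level object
      if (depth === 0) docs.push(JSON.parse(text.slice(start, i + 1)));
    }
  }
  return docs;
}
```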
You can also configure `couchimport` and `couchexport` using command-line parameters:

* `--help` - show help
* `--version` - simply prints the version and exits
* `--url`/`-u` - the URL of the CouchDB instance (required, or to be supplied in the environment)
* `--database`/`--db`/`-d` - the database to deal with (required, or to be supplied in the environment)
* `--delimiter` - the delimiter to use (default '\t', not required)
* `--transform` - the path of a transformation function (not required)
* `--meta`/`-m` - a JSON object which will be passed to the transform function (not required)
* `--buffer`/`-b` - the number of records written to CouchDB per bulk write (defaults to 500, not required)
* `--type`/`-t` - the type of file being imported, either "text", "json" or "jsonl" (defaults to "text", not required)
* `--jsonpath`/`-j` - the path into the incoming JSON document (only required for type=json imports)
* `--preview`/`-p` - if 'true', runs in preview mode (default false)
* `--ignorefields`/`-i` - a comma-separated list of fields to ignore on input or output (default none)
* `--parallelism` - the number of HTTP requests to have in flight at any one time (default 1)

e.g.

```sh
cat test.csv | couchimport --database bob --delimiter ","
```
## couchexport

If you have structured data in a CouchDB or Cloudant database that has fixed keys and values, e.g.

```js
{
  "_id": "badger",
  "_rev": "5-a9283409e3253a0f3e07713f42cd4d40",
```

then it can be exported to a CSV like so (note how we set the delimiter):

```sh
couchexport --url http://localhost:5984 --database animaldb --delimiter "," > test.csv
```

or to a TSV like so (we don't need to specify the delimiter since tab `\t` is the default):

```sh
couchexport --url http://localhost:5984 --database animaldb > test.tsv
```

N.B.

* design documents are ignored
* COUCH_DELIMITER or --delimiter can be used to provide a custom column delimiter (not required when tab-delimited)
* if your document values contain carriage returns or the column delimiter, then this may not be the tool for you
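Conceptually, the export flattens each document's values into one delimited row, as sketched below. couchexport streams rows from the database; `docToRow` is an illustrative helper, not part of its API.

```javascript
// Sketch of the document -> delimited row mapping used by a CSV/TSV export.
// docToRow is illustrative only, not couchexport's real API.
function docToRow(doc, delimiter) {
  return Object.values(doc).join(delimiter);
}

const doc = { _id: 'badger', class: 'mammal', legs: 4 };
console.log(Object.keys(doc).join(',')); // header: _id,class,legs
console.log(docToRow(doc, ','));         // row:    badger,mammal,4
```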
## Using programmatically

In your project, add `couchimport` into the dependencies of your package.json or run `npm install couchimport`. In your code, require the library with

```js
var couchimport = require('couchimport');
```

and your options are set in an object whose keys share the names of the command-line parameters, e.g.

```js
var opts = { delimiter: ",", url: "http://localhost:5984", database: "mydb" };
```
285292
To import data from a readable stream (rs):

```js
var rs = process.stdin;
couchimport.importStream(rs, opts, function(err, data) {
  console.log("done");
});
```
To import data from a named file:

```js
couchimport.importFile("input.txt", opts, function(err, data) {
  console.log("done", err, data);
});
```

To export data to a writable stream (ws):

```js
var ws = process.stdout;
couchimport.exportStream(ws, opts, function(err, data) {
  console.log("done", err, data);
});
```

To export data to a named file:

```js
couchimport.exportFile("output.txt", opts, function(err, data) {
  console.log("done", err, data);
});
```

To preview a file:

```js
couchimport.previewCSVFile('./hp.csv', opts, function(err, data, delimiter) {
  console.log("done", err, data, delimiter);
});
```

To preview a CSV/TSV on a URL:

```js
couchimport.previewURL('https://myhosting.com/hp.csv', opts, function(err, data, delimiter) {
  console.log("done", err, data, delimiter);
});
```
Both `importStream` and `importFile` return an EventEmitter which emits a `written` event as each batch of documents is written, e.g.

```js
couchimport.importFile("input.txt", opts, function(err, data) {
  console.log("done", err, data);
}).on("written", function(data) {
  // data = { documents: 500, failed: 6, total: 63000, totalfailed: 42 }
});
```
The emitted data is an object containing:

* `documents` - the number of documents written in the latest batch
* `failed` - the number of documents that failed in the latest batch
* `total` - the running total of documents written
* `totalfailed` - the running total of documents that failed
## Parallelism

Using the `COUCH_PARALLELISM` environment variable or the `--parallelism` command-line option, couchimport can be configured to write data in multiple parallel operations. If you have the network bandwidth, this can significantly speed up large data imports, e.g.

```sh
cat bigdata.csv | couchimport --database mydb --parallelism 10 --delimiter ","
```
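The idea behind `--parallelism` can be sketched as a small worker pool that keeps at most N writes in flight at once; `writeBatch` below is a stand-in for couchimport's bulk HTTP write, not its real API:

```javascript
// Sketch of the --parallelism idea: n workers pull batches from a shared
// queue, so at most n writes are in flight at any one time.
// writeBatch is a stand-in for a bulk HTTP write, not couchimport's real API.
async function parallelWrite(batches, n, writeBatch) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < batches.length) {
      const i = next++;              // claim the next batch
      results[i] = await writeBatch(batches[i]);
    }
  }
  // start n workers and wait for them all to drain the queue
  await Promise.all(Array.from({ length: n }, worker));
  return results;
}
```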
