Commit 8a90df9

Merge pull request doccano#294 from CatalystCode/feature/excel_import
Feature/Import from Excel
2 parents cf7d827 + 8a72d28 commit 8a90df9

16 files changed: +120 -22 lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -199,4 +199,4 @@ node_modules/
 bundle/
 webpack-stats.json
 
-.vscode/
+.vscode

README.md

Lines changed: 25 additions & 17 deletions
@@ -58,20 +58,19 @@ Doccano can be deployed to AWS ([Cloudformation](https://docs.aws.amazon.com/AWS
 
 > Notice: (1) EC2 KeyPair cannot be created automatically, so make sure you have an existing EC2 KeyPair in one region. Or [create one yourself](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). (2) If you want to access doccano via HTTPS in AWS, here is an [instruction](https://github.com/chakki-works/doccano/wiki/HTTPS-setting-for-doccano-in-AWS).
 
-
 ## Features
 
-* Collaborative annotation
-* Multi-Language support
-* Emoji :smile: support
-* (future) Auto labeling
+- Collaborative annotation
+- Multi-Language support
+- Emoji :smile: support
+- (future) Auto labeling
 
 ## Requirements
 
-* Python 3.6+
-* Django 2.1.7+
-* Node.js 8.0+
-* Google Chrome(highly recommended)
+- Python 3.6+
+- Django 2.1.7+
+- Node.js 8.0+
+- Google Chrome(highly recommended)
 
 ## Installation
 
@@ -162,7 +161,9 @@ Finally, to start the server, run the following command:
 ```bash
 python manage.py runserver
 ```
+
 Optionally, you can change the bind ip and port using the command
+
 ```bash
 python manage.py runserver <ip>:<port>
 ```
@@ -197,28 +198,34 @@ After creating a project, you will see the "Import Data" page, or click `Import
 
 <img src="./docs/upload.png" alt="Upload project" width=600>
 
-You can upload two types of files:
-- `CSV file`: file must contain a header with a `text` column or be one-column csv file.
-- `JSON file`: each line contains a JSON object with a `text` key. JSON format supports line breaks rendering.
+You can upload the following types of files (depending on project type):
+
+- `Text file`: file must contain one sentence/document per line separated by new lines.
+- `CSV file`: file must contain a header with `"text"` as the first column or be one-column csv file. If using labels the sencond column must be the labels.
+- `Excel file`: file must contain a header with `"text"` as the first column or be one-column excel file. If using labels the sencond column must be the labels. Supports multiple sheets as long as format is the same.
+- `JSON file`: each line contains a JSON object with a `text` key. JSON format supports line breaks rendering.
 
 > Notice: Doccano won't render line breaks in annotation page for sequence labeling task due to the indent problem, but the exported JSON file still contains line breaks.
 
-`example.txt` (or `example.csv`)
-```python
+`example.txt/csv/xlsx`
+
+```txt
 EU rejects German call to boycott British lamb.
 President Obama is speaking at the White House.
 He lives in Newark, Ohio.
 ...
 ```
+
 `example.json`
+
 ```JSON
 {"text": "EU rejects German call to boycott British lamb."}
 {"text": "President Obama is speaking at the White House."}
 {"text": "He lives in Newark, Ohio."}
 ...
 ```
 
-Any other columns (for csv) or keys (for json) are preserved and will be exported in the `metadata` column or key as is.
+Any other columns (for csv/excel) or keys (for json) are preserved and will be exported in the `metadata` column or key as is.
 
 Once you select a TXT/JSON file on your computer, click `Upload dataset` button. After uploading the dataset file, we will see the `Dataset` page (or click `Dataset` button list in the left bar). This page displays all the documents we uploaded in one project.
 
@@ -228,7 +235,6 @@ Click `Labels` button in left bar to define your own labels. You should see the
 
 <img src="./docs/label_editor.png" alt="Edit label" width=600>
 
-
 ### Annotation
 
 Now, you are ready to annotate the texts. Just click the `Annotate Data` button in the navigation bar, you can start to annotate the documents you uploaded.
@@ -249,11 +255,14 @@ by adding `external_id` to the imported file. For example:
 
 Input file may look like this:
 `import.json`
+
 ```JSON
 {"text": "EU rejects German call to boycott British lamb.", "meta": {"external_id": 1}}
 ```
+
 and the exported file will look like this:
 `output.json`
+
 ```JSON
 {"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["news"], "username": "root", "meta": {"external_id": 1}}
 ```
@@ -270,7 +279,6 @@ As with any software, doccano is under continuous development. If you have reque
 
 Here are some tips might be helpful. [How to Contribute to Doccano Project](https://github.com/chakki-works/doccano/wiki/How-to-Contribute-to-Doccano-Project)
 
-
 ## Contact
 
 For help and feedback, please feel free to contact [the author](https://github.com/Hironsan).
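
Note on the README change: the new CSV/Excel rules (header with `text` first, labels second, remaining columns kept as metadata) can be illustrated with a small standard-library sketch that builds such a file. The helper name `build_import_csv` and the column names `label` and `source` are hypothetical and chosen only for illustration; the README does not prescribe header names beyond `text`.

```python
import csv
import io


def build_import_csv(records, extra_columns=()):
    """Return CSV text with 'text' first, the label second, then extra columns.

    Hypothetical helper: only the column order follows the README description.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # Header row: text column, label column, then any metadata columns.
    writer.writerow(['text', 'label', *extra_columns])
    for record in records:
        writer.writerow([record['text'], record['label'],
                         *(record[name] for name in extra_columns)])
    return buffer.getvalue()


content = build_import_csv(
    [{'text': 'He lives in Newark, Ohio.', 'label': 'news', 'source': 'wire'}],
    extra_columns=('source',),
)
print(content)
```

Using `csv.writer` rather than joining strings by hand means texts that contain commas, such as the Newark example, are quoted so they stay in a single column.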
9.61 KB
Binary file not shown.
9.65 KB
Binary file not shown.

app/api/tests/data/example.xlsx

9.61 KB
Binary file not shown.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+text
+AAA
+BBB
+CCC
9.58 KB
Binary file not shown.
Binary file not shown.

app/api/tests/test_api.py

Lines changed: 44 additions & 1 deletion
@@ -759,7 +759,7 @@ def setUp(self):
     def upload_test_helper(self, project_id, filename, file_format, expected_status, **kwargs):
         url = reverse(viewname='doc_uploader', args=[project_id])
 
-        with open(os.path.join(DATA_DIR, filename)) as f:
+        with open(os.path.join(DATA_DIR, filename), 'rb') as f:
             response = self.client.post(url, data={'file': f, 'format': file_format})
 
         self.assertEqual(response.status_code, expected_status)
@@ -803,6 +803,12 @@ def test_can_upload_seq2seq_csv(self):
                                 file_format='csv',
                                 expected_status=status.HTTP_201_CREATED)
 
+    def test_can_upload_single_column_csv(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example_one_column.csv',
+                                file_format='csv',
+                                expected_status=status.HTTP_201_CREATED)
+
     def test_cannot_upload_csv_file_does_not_match_column_and_row(self):
         self.upload_test_helper(project_id=self.classification_project.id,
                                 filename='example.invalid.1.csv',
@@ -815,6 +821,43 @@ def test_cannot_upload_csv_file_has_too_many_columns(self):
                                 file_format='csv',
                                 expected_status=status.HTTP_400_BAD_REQUEST)
 
+    def test_can_upload_classification_excel(self):
+        self.upload_test_helper(project_id=self.classification_project.id,
+                                filename='example.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+
+    def test_can_upload_seq2seq_excel(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+
+    def test_can_upload_single_column_excel(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example_one_column.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+
+    def test_cannot_upload_excel_file_does_not_match_column_and_row(self):
+        self.upload_test_helper(project_id=self.classification_project.id,
+                                filename='example.invalid.1.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_400_BAD_REQUEST)
+
+    def test_cannot_upload_excel_file_has_too_many_columns(self):
+        self.upload_test_helper(project_id=self.classification_project.id,
+                                filename='example.invalid.2.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_400_BAD_REQUEST)
+
+    @override_settings(IMPORT_BATCH_SIZE=1)
+    def test_can_upload_small_batch_size(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example_one_column_no_header.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+
     def test_can_upload_classification_jsonl(self):
         self.upload_test_helper(project_id=self.classification_project.id,
                                 filename='classification.jsonl',
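
Note on the `'rb'` change in `upload_test_helper`: `.xlsx` fixtures are binary, and `CSVParser.parse` in this commit wraps the uploaded file in `io.TextIOWrapper`, which expects a byte stream. The sketch below is an illustration under that reading, not the project's test code. It feeds bytes equivalent to the four-line text fixture added in this commit through the same wrapping; the variable name `raw_upload` is hypothetical.

```python
import csv
import io

# Bytes standing in for a one-column fixture opened with open(..., 'rb').
raw_upload = io.BytesIO(b'text\nAAA\nBBB\nCCC\n')

# Decode the byte stream as UTF-8 text before handing it to csv.reader,
# mirroring the io.TextIOWrapper call in CSVParser.parse.
reader = csv.reader(io.TextIOWrapper(raw_upload, encoding='utf-8'))
rows = list(reader)
print(rows)
```

Opening the fixture in binary mode keeps one code path for both formats: text decoding happens inside the parser for CSV, while the Excel parser reads the raw bytes directly.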

app/api/utils.py

Lines changed: 21 additions & 2 deletions
@@ -9,6 +9,7 @@
 import conllu
 from django.db import transaction
 from django.conf import settings
+import pyexcel
 from rest_framework.renderers import JSONRenderer
 from seqeval.metrics.sequence_labeling import get_entities
 
@@ -324,13 +325,32 @@ class CSVParser(FileParser):
     def parse(self, file):
         file = io.TextIOWrapper(file, encoding='utf-8')
         reader = csv.reader(file)
+        yield from ExcelParser.parse_excel_csv_reader(reader)
+
+
+class ExcelParser(FileParser):
+    def parse(self, file):
+        excel_book = pyexcel.iget_book(file_type="xlsx", file_content=file.read())
+        # Handle multiple sheets
+        for sheet_name in excel_book.sheet_names():
+            reader = excel_book[sheet_name].to_array()
+            yield from self.parse_excel_csv_reader(reader)
+
+    @staticmethod
+    def parse_excel_csv_reader(reader):
         columns = next(reader)
         data = []
+        if len(columns) == 1 and columns[0] != 'text':
+            data.append({'text': columns[0]})
         for i, row in enumerate(reader, start=2):
             if len(data) >= settings.IMPORT_BATCH_SIZE:
                 yield data
                 data = []
-            if len(row) == len(columns) and len(row) >= 2:
+            # Only text column
+            if len(row) == len(columns) and len(row) == 1:
+                data.append({'text': row[0]})
+            # Text, labels and metadata columns
+            elif len(row) == len(columns) and len(row) >= 2:
                 text, label = row[:2]
                 meta = json.dumps(dict(zip(columns[2:], row[2:])))
                 j = {'text': text, 'labels': [label], 'meta': meta}
@@ -352,7 +372,6 @@ def parse(self, file):
                 data = []
             try:
                 j = json.loads(line)
-                #j = json.loads(line.decode('utf-8'))
                 j['meta'] = json.dumps(j.get('meta', {}))
                 data.append(j)
             except json.decoder.JSONDecodeError:
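
Note on `parse_excel_csv_reader`: the row handling shared by the CSV and Excel parsers can be sketched as a standalone generator. This is a simplified illustration, not the project's code: the function name `rows_to_documents`, the `batch_size` parameter (standing in for `settings.IMPORT_BATCH_SIZE`) and the `ValueError` raised on mismatched rows are assumptions, since the diff does not show how the real parser handles that branch.

```python
import json
from typing import Iterable, Iterator, List


def rows_to_documents(rows: Iterable[List[str]], batch_size: int = 2) -> Iterator[List[dict]]:
    """Yield batches of document dicts from tabular rows (hypothetical sketch)."""
    reader = iter(rows)  # accept plain lists as well as csv.reader-style iterators
    columns = next(reader)
    batch = []
    # A single first cell that is not the 'text' header is treated as data.
    if len(columns) == 1 and columns[0] != 'text':
        batch.append({'text': columns[0]})
    for line_num, row in enumerate(reader, start=2):
        # Flush a full batch before processing the next row.
        if len(batch) >= batch_size:
            yield batch
            batch = []
        if len(row) != len(columns):
            # Assumed error handling; the real exception type is not shown in the diff.
            raise ValueError(f'line {line_num}: expected {len(columns)} columns, got {len(row)}')
        if len(row) == 1:
            # Only a text column.
            batch.append({'text': row[0]})
        else:
            # Text, label, then metadata keyed by the remaining header names.
            text, label = row[:2]
            meta = json.dumps(dict(zip(columns[2:], row[2:])))
            batch.append({'text': text, 'labels': [label], 'meta': meta})
    if batch:
        yield batch


batches = list(rows_to_documents(
    [['text', 'label', 'source'],
     ['AAA', 'pos', 'web'],
     ['BBB', 'neg', 'mail'],
     ['CCC', 'pos', 'web']],
    batch_size=2,
))
print(batches)
```

With `batch_size=1` and a headerless one-column input, every row becomes its own batch, which appears to be the situation `test_can_upload_small_batch_size` exercises.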
