Commit 2f6d917

Added dumper option that outputs to an Excel file
1 parent 094ec1f commit 2f6d917

File tree

3 files changed (+90, -5 lines changed)

README.md

Lines changed: 43 additions & 3 deletions
@@ -4,9 +4,17 @@ Python utility package for scraping information on SINTA (Science and Technology
 
 ## A. Documentation
 
-### A.1. Author Verification
+### A.1. Installation
 
-#### A.1.i. Authentication
+You can install `sintautils` using PIP as follows:
+
+```sh
+pip install sintautils
+```
+
+### A.2. Author Verification
+
+#### A.2.i. Authentication
 
 Author verification menu is a restricted menu of SINTA. You must be registered as a university administrator and obtain an admin credential in order to use this function. An author verification (AV) admin's credential consists of an email-based username and a password.
 
@@ -25,6 +33,38 @@ from sintautils import AV
 scraper = AV('admin@university.edu', 'password1234', autologin=True)
 ```
 
+#### A.2.ii. Basic Usage
+
+After importing the modules and initializing the `AV` class, you can start dumping the research information of a given author in SINTA using the `dump_author()` method. The following code dumps all research data pertaining to a SINTA author and saves the result to an Excel file named `sintautils_dump_author-1234.xlsx` under the current working directory. Each data category (IPR, book, Google Scholar publication, etc.) is represented by a separate Excel sheet.
+
+```python
+# Change "1234" to the respective author's SINTA ID.
+scraper.dump_author('1234')
+```
+
+You can customize which data types to scrape by specifying the `fields` parameter:
+
+```python
+# Possible values for the "fields" parameter:
+# book, garuda, gscholar, ipr, research, scopus, service, wos
+# Use an asterisk "*" (the default) in order to scrape all information.
+scraper.dump_author('1234', fields='book garuda wos')
+```
+
+Also, you can change the output format, save directory, and filename prefix as follows:
+
+```python
+# Possible values for the "out_format" parameter:
+# csv, json, json-pretty, xlsx
+scraper.dump_author('1234',
+    out_format='json-pretty',
+    out_folder='/path/to/save/directory',
+    out_prefix='filename_prefix-'
+)
+```
+
+If multiple fields are specified when using `out_format='csv'`, each data type will be saved as a separate CSV file under the same `out_folder` directory.
+
 ## B. To-Do
 
 ### B.1. New Features
@@ -34,7 +74,7 @@ scraper = AV('admin@university.edu', 'password1234', autologin=True)
 - [X] Add scraper for IPR and book of each author.
 - [X] Add garuda scraper per author.
 - [X] Add author info dumper.
-- [ ] Add author info dumper using `openpyxl` implementation that outputs to an Excel/spreadsheet workbook file.
+- [X] Add author info dumper using an `openpyxl` implementation that outputs to an Excel/spreadsheet workbook file.
 
 ### B.2. Bug Fixes
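The `fields` parameter documented above takes a space-separated string of data-type names, with `"*"` selecting everything. As a rough standalone sketch of how such a string can be validated (an illustration only — `parse_fields` and `ALLOWED_FIELDS` are hypothetical names, not part of `sintautils`):

```python
# Hypothetical sketch: validate a space-separated "fields" string against
# the allowed values listed in the README. Not the library's actual code.
ALLOWED_FIELDS = {'book', 'garuda', 'gscholar', 'ipr', 'research', 'scopus', 'service', 'wos'}

def parse_fields(fields: str) -> list:
    """Expand '*' to all fields; reject unknown field names."""
    if fields.strip() == '*':
        return sorted(ALLOWED_FIELDS)
    selected = fields.split()
    unknown = [f for f in selected if f not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError(f'Unknown fields: {unknown}')
    return selected

print(parse_fields('book garuda wos'))  # ['book', 'garuda', 'wos']
```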

src/sintautils/core.py

Lines changed: 35 additions & 2 deletions
@@ -29,6 +29,7 @@
 import requests as rq
 import time
 
+from .exceptions import EmptyFieldException
 from .exceptions import InvalidLoginCredentialException
 from .exceptions import InvalidParameterException
 from .exceptions import NoLoginCredentialsException
@@ -197,6 +198,10 @@ def dump_author(
             - ["*"]
         """
 
+        # Validating the fields.
+        if fields.__len__() < 1:
+            raise EmptyFieldException('book, garuda, gscholar, ipr, research, scopus, service, wos')
+
         # Validating the output format.
         if type(out_format) is not str or out_format not in ['csv', 'json', 'xlsx']:
             raise InvalidParameterException('"out_format" must be one of "csv", "json", and "xlsx"')
@@ -244,8 +249,36 @@ def dump(dump_id):
                    json.dump(b, fo)
 
            elif out_format == 'xlsx':
-                # TODO: Work on the implementation of xlsx dumper using openpyxl.
-                pass
+                wb = Workbook()
+                for m in sorted(a.keys()):
+                    ws = wb.create_sheet(m, -1)
+
+                    # Obtaining the data list and validate the data length.
+                    b: list = a[m]
+                    if b.__len__() < 1:
+                        continue
+
+                    # Write the spreadsheet header.
+                    headers: list = list(b[0].keys())
+                    for i in range(len(headers)):
+                        n = headers[i]
+                        ws.cell(row=1, column=(i + 1), value=n)
+
+                        # Write the column's content.
+                        for j in range(len(b)):
+                            c: dict = b[j]
+                            # Offset the row number by two, because the first row is header.
+                            ws.cell(row=(j + 2), column=(i + 1), value=c[n])
+
+                # Remove sheets that do not represent data type.
+                if wb.sheetnames.__len__() > 0:
+                    for d in wb.sheetnames:
+                        if d not in a.keys():
+                            wb.remove(wb[d])
+
+                # Saving the spreadsheet.
+                save_file = str(out_folder) + os.sep + str(out_prefix) + str(dump_id) + '.xlsx'
+                wb.save(save_file)
 
        if type(author_id) is str:
            dump(dump_id=author_id)
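The new `xlsx` branch above writes, for each data category, a header row taken from the first record's keys and then the records themselves, offset by two rows because row 1 holds the header. That cell-layout arithmetic can be sketched in isolation, without the `openpyxl` dependency (`layout_cells` and the sample data below are hypothetical, not part of `sintautils`):

```python
# Standalone sketch of the row/column layout used by the xlsx dumper.
# Cell (1, i+1) holds the i-th header; record j lands on row j+2.
def layout_cells(records: list) -> dict:
    """Map (row, column) -> value, mirroring the dumper's cell arithmetic."""
    cells = {}
    if len(records) < 1:
        return cells
    headers = list(records[0].keys())
    for i in range(len(headers)):
        n = headers[i]
        cells[(1, i + 1)] = n
        for j in range(len(records)):
            cells[(j + 2, i + 1)] = records[j][n]
    return cells

data = [{'title': 'Paper A', 'year': 2020}, {'title': 'Paper B', 'year': 2021}]
cells = layout_cells(data)
print(cells[(1, 1)], cells[(3, 2)])  # title 2021
```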

src/sintautils/exceptions.py

Lines changed: 12 additions & 0 deletions
@@ -39,6 +39,18 @@ def __repr__(self):
     __str__ = __repr__
 
 
+class EmptyFieldException(SintaException):
+    """ Error raised when there is no field selection in the scraper function passed. """
+
+    def __init__(self, arg: str = ''):
+        self.arg = arg
+
+    def __repr__(self):
+        return f'You must specify at least one of the following fields: {self.arg}. Use "*" to select all fields.'
+
+    __str__ = __repr__
+
+
 class InvalidAuthorIDException(SintaException):
     """ Error raised when the user specifies an invalid (i.e., non-numerical) author ID. """
 
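As a quick check of the new exception's behavior, the snippet below reproduces its message with a bare `Exception` stand-in for `SintaException` (the real base class lives in `sintautils.exceptions`):

```python
# Minimal stand-in: SintaException is reduced to a bare Exception subclass
# here so the snippet runs without installing sintautils.
class SintaException(Exception):
    pass

class EmptyFieldException(SintaException):
    """ Error raised when there is no field selection in the scraper function passed. """

    def __init__(self, arg: str = ''):
        self.arg = arg

    def __repr__(self):
        return f'You must specify at least one of the following fields: {self.arg}. Use "*" to select all fields.'

    __str__ = __repr__

exc = EmptyFieldException('book, garuda, wos')
print(str(exc))
```

Aliasing `__str__ = __repr__` makes the human-readable message appear both when the exception is printed and when it is inspected interactively.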

0 commit comments
