diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5adab6b
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+__pycache__/
+gephi/
+*.csv
+
diff --git a/README.MD b/README.MD
index 74f1fac..8d005bf 100644
--- a/README.MD
+++ b/README.MD
@@ -18,17 +18,33 @@ _Based on dpapathanasiou's [example script for pdfminer](https://github.com/dpap
 
 For the above to work, we do some text normalization (removing punctuation, whitespace, special characters) and assume that the title_y would only appear in text_x if it appears in the references section...
 
+### Configuration
+
+Before using it, make sure that you have installed the dependencies in
+an isolated environment.
+
+Create and activate a virtual environment called `venv`:
+
+```
+python -m venv venv
+source venv/bin/activate
+```
+
+And then install the dependencies:
+
+```
+pip install -r requirements.txt
+```
+
 ### Usage:
+
 1. Export list of articles as .csv from Zotero, (articles should have File attachments)
-2. Run `analyze_papers.py zotero_file.csv`
-3. Script should produce two files: Edges_titles.csv and Nodes_titles.csv in folder "gephi"
+2. Run `python analyze_papers.py zotero_file.csv`
+3. Script should produce two files in the `gephi` folder: `Edges_titles.csv` and `Nodes_titles.csv`
 4. Load them into [Gephi](https://gephi.org) with "Load Spreadsheet"
 
-
 ## Notes
 
 * Tested with Python3
 * Uses the library [pdfminer](https://pypi.org/project/pdfminer/)
 * You can specify number of processes the script uses to parse the PDFs with parameter --processes (default value is 4)
-
-
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..2edae43
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,2 @@
+pdfminer==20191125
+