DepositOnce - TU Berlin's repository

This document explains how the API of TU Berlin's DepositOnce works and records some facts about the content of the repository.

OAI-PMH

From the official website: The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.

Here is a cheat sheet with the possible queries, created by Richard Urban.

DepositOnce - TU's repository

Here are the guidelines from the TUB's website.

The base URL is https://depositonce.tu-berlin.de/oai.
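As a minimal sketch of how a request against this base URL is put together: every OAI-PMH request is the base URL plus a `verb` argument and that verb's own arguments. The helper name `build_request_url` is made up for illustration and is not taken from this repository's code; `oai_dc` is the metadata prefix that every OAI-PMH repository has to support.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url, verb, **params):
    """Return the GET URL for an OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Arguments passed as None are skipped so optional ones can be omitted.
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


print(build_request_url(
    "https://depositonce.tu-berlin.de/oai", "ListRecords", metadataPrefix="oai_dc"
))
```

The same helper should work for the other repositories by swapping the base URL.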

edoc-Server - HU's repository

Here are the guidelines for the HU's repository.

The base URL is https://edoc.hu-berlin.de/oai.

Refubium - FU's repository

Here are the guidelines for the FU's repository.

The base URL is https://refubium.fu-berlin.de/oai.
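Since all three repositories speak OAI-PMH, one response parser can in principle serve them all. The sketch below parses a `ListIdentifiers` response and returns the record identifiers together with the `resumptionToken`, which the protocol uses to page through long result lists. The function name and the sample XML are illustrative assumptions; they are not copied from `harvester.py` or from a real response.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken marks the last page.
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


# Hypothetical response, shortened to the parts the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header><identifier>oai:example.org:2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(parse_list_identifiers(SAMPLE))
```

A harvesting loop would repeat the request with `resumptionToken=<token>` until the returned token is `None`.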

Important data files

Here are brief descriptions of what the relevant data files contain. They are all stored in the folder data/json.

subjects.json: dictionary with the subjects as keys and a list of identifiers as values. The identifiers in the list refer to the publications that contain that subject in their metadata.
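To illustrate the structure described above, here is a small sketch that inverts such a dictionary, so that the subjects of a given publication can be looked up directly. The sample data and the function name `invert_subjects` are made up; the real file would be loaded from data/json/subjects.json with `json.load`.

```python
import json
from collections import defaultdict


def invert_subjects(subjects):
    """Map each publication identifier to the sorted list of its subjects."""
    by_publication = defaultdict(list)
    for subject, identifiers in subjects.items():
        for identifier in identifiers:
            by_publication[identifier].append(subject)
    return {pub: sorted(subs) for pub, subs in by_publication.items()}


# Hypothetical sample mirroring the described layout of subjects.json.
sample = json.loads('{"robotics": ["id-1", "id-2"], "acoustics": ["id-2"]}')
print(invert_subjects(sample))
```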

Open questions

Language

Even German publications seem to have an English description. My plan is to discard German texts altogether and focus on English ones. Going bilingual would increase the engineering workload without adding any value to my research.

Can I mine the PDFs?

Looking at the most popular licenses, I couldn't find any constraint on such uses, but several papers mention legal barriers to text mining.
Even if it is legally allowed, is it possible? OAI-PMH doesn't seem to offer the option of downloading the PDFs.

How can I evaluate the models?

My first idea is to combine authors, dates and the author keywords.

Publications with items in both languages

How can I remove the German subjects? langdetect doesn't seem able to do so. Answer: in the dim metadata format, each subject is given a language. Abstracts are also labelled with a language in this format.
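A minimal sketch of that filtering step follows. It assumes the DSpace DIM layout, in which every value is a `field` element with `element`, `qualifier` and `lang` attributes. The namespace URI, the attribute names and the sample record are assumptions that should be checked against an actual response requested with `metadataPrefix=dim`. Language codes may also vary between records (for example `en` versus `eng`).

```python
import xml.etree.ElementTree as ET

# Assumed qualified tag name of DIM fields (DSpace convention).
DIM_FIELD = "{http://www.dspace.org/xmlns/dspace/dim}field"


def subjects_by_language(dim_xml, language):
    """Return the subject values whose lang attribute equals `language`."""
    root = ET.fromstring(dim_xml)
    subjects = []
    for field in root.iter(DIM_FIELD):
        if field.get("element") != "subject":
            continue
        if field.get("lang") == language and field.text:
            subjects.append(field.text.strip())
    return subjects


# Hypothetical record, not copied from any of the repositories.
SAMPLE_DIM = """<dim:dim xmlns:dim="http://www.dspace.org/xmlns/dspace/dim">
  <dim:field mdschema="dc" element="subject" qualifier="other" lang="en">machine learning</dim:field>
  <dim:field mdschema="dc" element="subject" qualifier="other" lang="de">maschinelles Lernen</dim:field>
  <dim:field mdschema="dc" element="description" qualifier="abstract" lang="en">An abstract.</dim:field>
</dim:dim>"""

print(subjects_by_language(SAMPLE_DIM, "en"))
```

The same check on `element="description"` with `qualifier="abstract"` would select the English abstracts.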

Repository analysis

Including only theses and publications written in English:

How many English subjects are there and how frequent are they? How many English subjects does each document have?
How specific are the DDC subjects, how frequent are they and how are they distributed?
How many potential duplicate authors are there?
How many documents have English abstracts?
How many documents have dates and how are they distributed?
How many types of documents are there and how are the documents distributed among the types?
How many documents have attachments?
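The first question could be answered directly from a dictionary shaped like subjects.json. The sketch below counts how many documents each subject occurs in and how many subjects each document has. The function name and the sample data are hypothetical, and the input is assumed to be already restricted to English subjects.

```python
from collections import Counter


def subject_statistics(subjects):
    """Return (documents per subject, subjects per document)."""
    # set() guards against an identifier listed twice under one subject.
    frequency = {s: len(set(ids)) for s, ids in subjects.items()}
    per_document = Counter(i for ids in subjects.values() for i in set(ids))
    return frequency, dict(per_document)


# Hypothetical sample in the layout of subjects.json.
english_subjects = {"robotics": ["id-1", "id-2"], "acoustics": ["id-2"]}
frequency, per_document = subject_statistics(english_subjects)
print(frequency)
print(per_document)
```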

Approaches

TF-IDF
TextRank
Sequential Information Bottleneck clustering
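As an illustration of the first approach, here is a small TF-IDF sketch in plain Python. It uses relative term frequency and the unsmoothed inverse document frequency log(N / df), which is only one of several common variants and not necessarily the one used in this project. The function name and the toy documents are made up.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-scoring terms for each tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    results = []
    for doc in documents:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Toy abstracts, already lowercased and tokenized.
docs = [
    ["neural", "network", "training"],
    ["neural", "acoustics"],
    ["neural", "robotics"],
]
print(tfidf_keywords(docs, top_n=1))
```

A term that appears in every document, such as "neural" here, gets a score of zero and therefore never ranks as a keyword.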

Interesting methods

POS tagging
Entity linking


uni-carlosfranzreb/thesis_repo_analysis
