backend.oai module

class backend.oai.BASEDCTranslator[source]

Bases: backend.oai.OAIDCTranslator

base_dc is very similar to oai_dc, so we don’t have much to change

class backend.oai.CiteprocReader[source]

Bases: oaipmh.metadata.MetadataReader

class backend.oai.CiteprocTranslator[source]

Bases: object

A translator for the JSON-based Citeproc format served by Crossref

translate(header, metadata)[source]

class backend.oai.CustomSourceOAIDCTranslator(source)[source]

Bases: backend.oai.OAIDCTranslator

Just like OAIDCTranslator, but with a custom source (not assuming that the endpoint is proaixy)

get_source(header, record)[source]

class backend.oai.OAIDCTranslator[source]

Bases: object

Translator for the default format supplied by OAI-PMH interfaces, called oai_dc.

add_oai_record(header, metadata, source, paper)[source]

Add a record (from OAI-PMH) to the given paper

extract_urls(header, metadata, source_identifier)[source]

Extracts URLs from the record, based on the identifier of its source.

The semantics of URLs vary greatly from provider to provider, so we build custom rules for each of the providers we cover. These rules are stored as URLExtractor.

Returns: a pair of URLs (or None values): the splash URL and the PDF URL. The splash URL is required (cannot be None) and points to the URI where the resource is mentioned. This is typically an abstract page. The PDF URL is non-empty if and only if we think a full text is available. If possible, this URL should point to the full text directly, otherwise to a page where we think a human user can find the full text by themselves (and for free).
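The return contract above can be illustrated with a minimal sketch. It is not the module's implementation: the function name `extract_urls_sketch`, the `always_full_text` flag and the ".pdf" suffix rule are hypothetical, chosen only to show a pair whose first element is never None and whose second element is set only when a full text is believed to exist. Raising `ValueError` when no splash URL is found is also a choice of this sketch.

```python
def extract_urls_sketch(identifiers, always_full_text=False):
    """Return (splash_url, pdf_url) from a list of identifier strings.

    Hypothetical rule set, for illustration only:
    - the splash URL is the first http(s) identifier;
    - the PDF URL is the first identifier ending in ".pdf", if any;
    - for a provider assumed to always offer full texts, fall back to
      the splash page as the place where a reader can find the full text.
    """
    urls = [u for u in identifiers if u.startswith(("http://", "https://"))]
    if not urls:
        # The splash URL is required, so this sketch refuses to return a pair.
        raise ValueError("no splash URL found in record")

    splash_url = urls[0]
    pdf_url = next((u for u in urls if u.lower().endswith(".pdf")), None)
    if pdf_url is None and always_full_text:
        pdf_url = splash_url
    return splash_url, pdf_url


pair = extract_urls_sketch(["oai:example:1", "http://example.org/abs/1"])
```

With these inputs the second element of `pair` is None, because nothing in the record suggests that a full text is available.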

Find the latest publication date (if any) in a record


Get the authors' names out of a metadata record

get_source(header, metadata)[source]

Find the OAI source to use for this record

translate(header, metadata)[source]

Creates a BarePaper

class backend.oai.OaiPaperSource(endpoint, day_granularity=False, *args, **kwargs)[source]

Bases: backend.papersource.PaperSource

A paper source that fetches records from the OAI-PMH proxy (typically: proaixy).

It uses the ListRecords verb to fetch records from the OAI-PMH source. Each record is then converted to a BarePaper by an OaiTranslator that handles the format the metadata is served in.


Adds the given translator to the paper source, so that we know how to translate papers in the given format.

The paper source cannot hold more than one translator per OAI format (it decides which translator to use based solely on the format), so if there is already a translator for that format, it will be overridden.
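The one-translator-per-format rule can be pictured as a dictionary keyed by format name. The sketch below is an assumption about the bookkeeping, not the class's actual code: `TranslatorRegistry`, `DummyTranslator` and the `format_name` method are hypothetical names introduced for illustration.

```python
class DummyTranslator:
    """Hypothetical translator that only knows its name and its format."""

    def __init__(self, name, fmt):
        self.name = name
        self._fmt = fmt

    def format_name(self):
        # Hypothetical accessor for the metadata format handled.
        return self._fmt


class TranslatorRegistry:
    """Hypothetical stand-in for the translator bookkeeping of a paper source."""

    def __init__(self):
        self._translators = {}  # maps format name -> translator

    def add_translator(self, translator):
        # Assigning to an existing key replaces the previous translator.
        self._translators[translator.format_name()] = translator

    def get(self, metadata_prefix):
        # Returns None when no translator is registered for the format.
        return self._translators.get(metadata_prefix)


registry = TranslatorRegistry()
registry.add_translator(DummyTranslator("first", "oai_dc"))
registry.add_translator(DummyTranslator("second", "oai_dc"))
```

After the two calls, only the translator named "second" remains registered for `oai_dc`; looking up a format with no translator yields None.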

create_paper_by_identifier(identifier, metadataPrefix)[source]

Queries the OAI-PMH proxy for a single paper.

  • identifier – the OAI identifier to fetch
  • metadataPrefix – the format to use (a translator has to be registered for that format, otherwise we return None with a warning message)

Returns: a Paper or None

ingest(from_date=None, metadataPrefix=u'any', resumptionToken=None)[source]

Main method to fill Dissemin with papers!

  • from_date – only fetch papers modified after that date in the proxy (useful for incremental fetching)
  • metadataPrefix – restrict the ingest to this metadata format

listRecords_or_empty(source, *args, **kwargs)[source]

pyoai raises NoRecordsMatchError when no records match; we would rather get an empty list in that case.
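The behaviour described above amounts to a small try/except wrapper. The sketch below assumes that `source` exposes a `listRecords` method, as suggested by the signature. To stay self-contained it defines a local stand-in exception instead of importing pyoai's `NoRecordsMatchError`, and the two source classes are hypothetical test doubles.

```python
class NoRecordsMatchError(Exception):
    """Local stand-in for the pyoai exception of the same name."""


def list_records_or_empty(source, *args, **kwargs):
    """Call source.listRecords, turning 'no records match' into an empty list."""
    try:
        return source.listRecords(*args, **kwargs)
    except NoRecordsMatchError:
        return []


class EmptySource:
    """Hypothetical source for which no record ever matches."""

    def listRecords(self, **kwargs):
        raise NoRecordsMatchError("no records match")


class OneRecordSource:
    """Hypothetical source returning a single (header, metadata) pair."""

    def listRecords(self, **kwargs):
        return [("header", "metadata")]


empty = list_records_or_empty(EmptySource(), metadataPrefix="oai_dc")
found = list_records_or_empty(OneRecordSource(), metadataPrefix="oai_dc")
```

Callers can then iterate over the result without special-casing the empty harvest.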

process_record(header, metadata)[source]

Saves the record given by the header and metadata (as returned by pyoai) as a Paper. Returns None if anything failed.


Saves all the records contained in this list as Papers

class backend.oai.OaiTranslator[source]

Bases: object

A translator takes a metadata record from the OAI-PMH proxy and converts it to a BarePaper.


Returns the metadata format expected by the translator

translate(header, metadata)[source]

Main method of the translator: translates a metadata record to a BarePaper.

  • header – the OAI-PMH header, as returned by pyoai
  • metadata – the dictionary of the record, as returned by pyoai

Returns: a BarePaper, or None if creation failed
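The translate contract (a paper on success, None on failure) could look like the following minimal sketch. Everything here is illustrative: `SketchPaper` is a hypothetical stand-in for BarePaper, `SketchTranslator` and `format_name` are invented names, and the metadata is assumed to be a dictionary mapping oai_dc field names such as `title` and `creator` to lists of strings.

```python
from collections import namedtuple

# Hypothetical stand-in for BarePaper, holding only two fields.
SketchPaper = namedtuple("SketchPaper", ["title", "authors"])


class SketchTranslator:
    """Hypothetical translator for oai_dc-like metadata dictionaries."""

    def format_name(self):
        # Hypothetical accessor for the metadata format this translator expects.
        return "oai_dc"

    def translate(self, header, metadata):
        # header is unused in this sketch; a real translator may need it.
        titles = metadata.get("title", [])
        authors = metadata.get("creator", [])
        if not titles or not authors:
            # Not enough information to create a paper: signal failure.
            return None
        return SketchPaper(title=titles[0], authors=list(authors))


translator = SketchTranslator()
paper = translator.translate(
    header=None,
    metadata={"title": ["A title"], "creator": ["Doe, Jane"]},
)
missing = translator.translate(header=None, metadata={"title": ["A title"]})
```

Returning None instead of raising lets the caller skip unusable records and continue with the rest of the harvest.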