papers.utils module

papers.utils.affiliation_is_greater(a, b)[source]

Compares two affiliation values. Returns True when the first contains more information than the second.

>>> affiliation_is_greater(None, None)
False
>>> affiliation_is_greater(None, 'UPenn')
False
>>> affiliation_is_greater('UPenn', None)
True
>>> affiliation_is_greater('0000-0001-8633-6098', 'Ecole normale superieure, Paris')
True
>>> affiliation_is_greater('Ecole normale superieure', 'Upenn')
True
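
For illustration, a minimal sketch consistent with the doctests above (the ORCID test and the length comparison are assumptions about the implementation, not the actual code):

import re

# Hypothetical helper: an ORCID-shaped string is treated as more
# informative than a plain affiliation name; otherwise the longer
# string wins, and None always loses.
ORCID_RE = re.compile(r'^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$')

def affiliation_is_greater_sketch(a, b):
    if a is None:
        return False
    if b is None:
        return True
    a_orcid = bool(ORCID_RE.match(a))
    b_orcid = bool(ORCID_RE.match(b))
    if a_orcid != b_orcid:
        return a_orcid
    return len(a) > len(b)
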
papers.utils.date_from_dateparts(dateparts)[source]

Constructs a date from a list of at most 3 integers.

>>> date_from_dateparts([])
datetime.date(1970, 1, 1)
>>> date_from_dateparts([2015])
datetime.date(2015, 1, 1)
>>> date_from_dateparts([2015, 2])
datetime.date(2015, 2, 1)
>>> date_from_dateparts([2015, 2, 16])
datetime.date(2015, 2, 16)
>>> date_from_dateparts([2015, 2, 35])
Traceback (most recent call last):
    ...
ValueError: day is out of range for month
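
A minimal sketch of this behaviour (an assumption, not the actual source): fill missing parts with Unix-epoch defaults and let datetime.date raise for out-of-range values.

import datetime

def date_from_dateparts_sketch(dateparts):
    # Default to 1970-01-01, overriding with whichever parts are given;
    # datetime.date itself raises ValueError for values like a day of 35.
    year = dateparts[0] if len(dateparts) >= 1 else 1970
    month = dateparts[1] if len(dateparts) >= 2 else 1
    day = dateparts[2] if len(dateparts) >= 3 else 1
    return datetime.date(year, month, day)
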
papers.utils.datetime_to_date(dt)[source]

Converts a datetime or date object to a date object.

>>> datetime_to_date(datetime.datetime(2016, 2, 11, 18, 34, 12))
datetime.date(2016, 2, 11)
>>> datetime_to_date(datetime.date(2015, 3, 1))
datetime.date(2015, 3, 1)
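
Since datetime.datetime is a subclass of datetime.date, a sketch of this conversion (an assumption) needs a single isinstance check:

import datetime

def datetime_to_date_sketch(dt):
    # The datetime case must be detected explicitly; plain dates
    # pass through unchanged.
    if isinstance(dt, datetime.datetime):
        return dt.date()
    return dt
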
papers.utils.extract_domain(url)[source]

Extracts the domain name of a URL.

>>> extract_domain(u'https://gnu.org/test.html')
u'gnu.org'
>>> extract_domain(u'nonsense') is None
True
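
A standard-library sketch (an assumption about the implementation): urlparse yields an empty netloc for scheme-less strings, which maps to the None result above.

from urlparse import urlparse  # urllib.parse on Python 3

def extract_domain_sketch(url):
    # u'nonsense' has no scheme, so netloc is empty and None is returned.
    return urlparse(url).netloc or None
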
papers.utils.filter_punctuation(lst)[source]
Parameters: lst – list of strings
Returns: all the strings that contain at least one alphanumeric character
>>> filter_punctuation([u'abc',u'ab.',u'/,',u'a-b',u'#=', u'0'])
[u'abc', u'ab.', u'a-b', u'0']
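
A one-line sketch of the filter (an assumption consistent with the doctest):

def filter_punctuation_sketch(lst):
    # Keep a string as soon as any of its characters is alphanumeric.
    return [s for s in lst if any(c.isalnum() for c in s)]
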
papers.utils.index_of(elem, choices)[source]

Returns the index of elem (understood as a code) in the list of choices, where choices are expected to be pairs of (code, verbose_description). Returns 0 when the code is not found.

>>> index_of(42, [])
0
>>> index_of('ok', [('ok','This is ok'),('nok','This is definitely not OK')])
0
>>> index_of('nok', [('ok','This is ok'),('nok','This is definitely not OK')])
1
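
A sketch consistent with the doctests, including the fallback to 0 (an assumption):

def index_of_sketch(elem, choices):
    for i, (code, _description) in enumerate(choices):
        if code == elem:
            return i
    return 0  # the first doctest shows 0 as the not-found fallback
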
papers.utils.iunaccent(s)[source]

Removes diacritics and converts to lowercase.

>>> iunaccent(u'BÉPO forever')
'bepo forever'
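
Given remove_diacritics (documented below), a plausible one-line sketch (the composition is an assumption):

from papers.utils import remove_diacritics

def iunaccent_sketch(s):
    # Strip accents first, then fold case.
    return remove_diacritics(s).lower()
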
papers.utils.jpath(path, js, default=None)[source]

XPath for JSON!

Parameters:
  • path – a list of keys to follow in the tree of dicts, written in a string, separated by forward slashes
  • js – the tree of dicts to search
  • default – the default value to return when the key is not found
>>> jpath(u'message/items', {u'message':{u'items':u'hello'}})
u'hello'
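
A sketch of the traversal (an assumption, not the actual code):

def jpath_sketch(path, js, default=None):
    # Descend through nested dicts, one key per path segment,
    # falling back to the default as soon as a key is missing.
    node = js
    for key in path.split(u'/'):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node
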
papers.utils.kill_double_dollars(s)[source]

Removes double dollars (they generate line breaks with MathJax). This is included in the sanitize_html function.

>>> kill_double_dollars('This equation $$\\mathrm{P} = \\mathrm{NP}$$ breaks my design')
u'This equation $\\mathrm{P} = \\mathrm{NP}$ breaks my design'
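
The doctest is consistent with a plain substring replacement; a sketch (an assumption):

def kill_double_dollars_sketch(s):
    # Turn display math ($$...$$) into inline math ($...$).
    return s.replace(u'$$', u'$')
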
papers.utils.kill_html(s)[source]

Removes every tag except <div> (but there are no <div> in titles, as sanitize_html removes them).

>>> kill_html('My title<sub>is</sub><a href="http://dissem.in"><sup>nice</sup>    </a>')
u'My titleisnice'
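
A rough sketch (an assumption; the real function is likely more careful):

import re

TAG_RE = re.compile(r'<[^>]+>')

def kill_html_sketch(s):
    # Drop every tag, then trim the whitespace the tags leave behind.
    return re.sub(r'\s+', u' ', TAG_RE.sub(u'', s)).strip()
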
papers.utils.maybe_recapitalize_title(title)[source]

Recapitalizes a title if it is mostly uppercase (number of uppercase letters > number of lowercase letters).

>>> maybe_recapitalize_title(u'THIS IS CALLED SCREAMING')
u'This Is Called Screaming'
>>> maybe_recapitalize_title(u'This is just a normal title')
u'This is just a normal title'
>>> maybe_recapitalize_title(u'THIS IS JUST QUITE Awkward')
u'THIS IS JUST QUITE Awkward'
papers.utils.nocomma(lst)[source]

Joins fields using ‘,’, ensuring that the comma does not appear in the fields. This is used to output similarity graphs to be visualized with Gephi.

Parameters: lst – list of strings
Returns: these strings joined by commas, ensuring they do not contain commas themselves
>>> nocomma([u'a',u'b',u'cd'])
u'a,b,cd'
>>> nocomma([u'a,',u'b'])
u'a,b'
>>> nocomma([u'abc',u'',u'\n',u'def'])
u'abc, , ,def'
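
A sketch reproducing the doctests (an assumption): strip commas and newlines from each field and keep empty fields visible as a single space.

def nocomma_sketch(lst):
    fields = [s.replace(u',', u'').replace(u'\n', u'') for s in lst]
    return u','.join(f if f else u' ' for f in fields)
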
papers.utils.nstrip(s)[source]

Just like unicode.strip(), but works for None too.

>>> nstrip(None) is None
True
>>> nstrip(u'aa')
u'aa'
>>> nstrip(u'  aa \n')
u'aa'
papers.utils.parse_int(val, default)[source]

Returns an int, or the default value if parsing fails.

>>> parse_int(90, None)
90
>>> parse_int(None, 90)
90
>>> parse_int('est', 8)
8
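
A sketch (an assumption): int() raises TypeError for None and ValueError for non-numeric strings, and both fall back to the default.

def parse_int_sketch(val, default):
    try:
        return int(val)
    except (TypeError, ValueError):
        return default
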
papers.utils.remove_diacritics(s)[source]

Removes diacritics using the unidecode package.

Param: a str or unicode string
Returns: if str, the same string; if unicode, the unidecoded string.
>>> remove_diacritics(u'aéèï')
'aeei'
>>> remove_diacritics(u'aéè'.encode('utf-8'))
'a\xc3\xa9\xc3\xa8'
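
Since the docstring names the unidecode package, a sketch (the isinstance test is an assumption):

from unidecode import unidecode

def remove_diacritics_sketch(s):
    # Only unicode strings are unidecoded; byte strings pass
    # through unchanged, as the second doctest shows.
    if isinstance(s, unicode):
        return unidecode(s)
    return s
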
papers.utils.remove_latex_braces(s)[source]

Removes spurious braces such as in “Th{é}odore” or “a {CADE} conference”. This should be run after unescape_latex.

>>> remove_latex_braces(u'Th{é}odore')
u'Th\xe9odore'
>>> remove_latex_braces(u'the {CADE} conference')
u'the CADE conference'
>>> remove_latex_braces(u'consider 2^{a+b}')
u'consider 2^{a+b}'
>>> remove_latex_braces(u'{why these braces?}')
u'why these braces?'
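
A regex sketch that reproduces the four doctests (an assumption; the actual rule may be stricter): drop a brace pair unless it is preceded by ^ or _, which preserves LaTeX exponents.

import re

BRACES_RE = re.compile(r'(?<![\^_])\{([^{}]*)\}')

def remove_latex_braces_sketch(s):
    return BRACES_RE.sub(r'\1', s)
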
papers.utils.remove_latex_math_dollars(string)[source]

Removes LaTeX dollar tags.

>>> remove_latex_math_dollars(u'This is $\\beta$-reduction explained')
u'This is \\beta-reduction explained'
>>> remove_latex_math_dollars(u'Compare $\\frac{2}{3}$ to $\\pi$')
u'Compare \\frac{2}{3} to \\pi'
>>> remove_latex_math_dollars(u'Click here to win $100')
u'Click here to win $100'
>>> remove_latex_math_dollars(u'What do you prefer, $50 or $100?')
u'What do you prefer, $50 or $100?'
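
A regex sketch matching all four doctests (an assumption): only treat $...$ as math when the delimited content starts and ends with a non-space character, so currency amounts survive.

import re

MATH_RE = re.compile(r'\$(\S[^$]*?\S|\S)\$')

def remove_latex_math_dollars_sketch(string):
    return MATH_RE.sub(r'\1', string)
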
papers.utils.remove_nones(dct)[source]

Returns a dict without the None values.

>>> remove_nones({u'orcid':None,u'wtf':u'pl'})
{u'wtf': u'pl'}
>>> remove_nones({u'orcid':u'blah',u'hey':u'you'})
{u'orcid': u'blah', u'hey': u'you'}
>>> remove_nones({None:1})
{None: 1}
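
A dict-comprehension sketch (an assumption): filtering on values only, so a None key survives, as the last doctest shows.

def remove_nones_sketch(dct):
    return {k: v for k, v in dct.items() if v is not None}
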
papers.utils.sanitize_html(s)[source]

Removes most HTML tags, keeping the harmless ones. This also renders some LaTeX characters with unescape_latex, fixes overescaped HTML characters, and applies a few other fixes.

>>> sanitize_html('My title<sub>is</sub><a href="http://dissem.in"><sup>nice</sup></a>')
u'My title<sub>is</sub><sup>nice</sup>'
>>> sanitize_html('$\\alpha$-conversion')
u'$\u03b1$-conversion'
>>> sanitize_html('$$\\eta + \\omega$$')
u'$\u03b7 + \u03c9$'
>>> sanitize_html('abc & def')
u'abc &amp; def'
papers.utils.tokenize(l)[source]

A (very very simple) tokenizer.

>>> tokenize(u'Hello world!')
[u'Hello', u'world!']
>>> tokenize(u'99\tbottles\nof  beeron \tThe Wall')
[u'99', u'bottles', u'of', u'beeron', u'The', u'Wall']
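
The doctests are consistent with plain whitespace splitting; a sketch (an assumption):

def tokenize_sketch(l):
    # split() without arguments splits on any run of whitespace,
    # which covers the tabs and newlines in the doctest.
    return l.split()
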
papers.utils.tolerant_datestamp_to_datetime(datestamp)[source]

Converts a datestamp to a datetime, tolerating more diverse inputs than a strict parser. Taken from pyoai.

>>> tolerant_datestamp_to_datetime('2016-02-11T18:34:12Z')
datetime.datetime(2016, 2, 11, 18, 34, 12)
>>> tolerant_datestamp_to_datetime('2016-02-11')
datetime.datetime(2016, 2, 11, 0, 0)
>>> tolerant_datestamp_to_datetime('2016/02/11')
datetime.datetime(2016, 2, 11, 0, 0)
>>> tolerant_datestamp_to_datetime('2016-02')
datetime.datetime(2016, 2, 1, 0, 0)
>>> tolerant_datestamp_to_datetime('2016')
datetime.datetime(2016, 1, 1, 0, 0)
>>> tolerant_datestamp_to_datetime('2016-02-11T18:34:12') # Z needed
Traceback (most recent call last):
    ...
ValueError: Invalid datestamp: 2016-02-11T18:34:12
>>> tolerant_datestamp_to_datetime('2016-02-11-3') # too many numbers
Traceback (most recent call last):
    ...
ValueError: Invalid datestamp: 2016-02-11-3
>>> tolerant_datestamp_to_datetime('2016-02-11T18:37:09:38') # too many numbers
Traceback (most recent call last):
    ...
ValueError: Invalid datestamp: 2016-02-11T18:37:09:38
>>> tolerant_datestamp_to_datetime('20151023371')
Traceback (most recent call last):
    ...
ValueError: Invalid datestamp: 20151023371
>>> tolerant_datestamp_to_datetime('2014T')
Traceback (most recent call last):
    ...
ValueError: Invalid datestamp: 2014T
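
A condensed sketch of the tolerant parsing (an assumption; the original is adapted from pyoai):

import datetime

def tolerant_datestamp_to_datetime_sketch(datestamp):
    splitted = datestamp.split('T')
    if len(splitted) == 2:
        d, t = splitted
        if not t or t[-1] != 'Z':  # any time part must end with Z
            raise ValueError('Invalid datestamp: ' + datestamp)
        t = t[:-1]
    else:
        d, t = splitted[0], '00:00:00'
    # Accept either '-' or '/' as the date separator.
    d_parts = d.split('-') if '-' in d else d.split('/')
    if len(d_parts) == 3:
        YYYY, MM, DD = d_parts
    elif len(d_parts) == 2:
        (YYYY, MM), DD = d_parts, '01'
    elif len(d_parts) == 1:
        YYYY, MM, DD = d_parts[0], '01', '01'
    else:
        raise ValueError('Invalid datestamp: ' + datestamp)
    t_parts = t.split(':')
    if len(YYYY) != 4 or len(t_parts) != 3:
        raise ValueError('Invalid datestamp: ' + datestamp)
    hh, mm, ss = t_parts
    return datetime.datetime(int(YYYY), int(MM), int(DD),
                             int(hh), int(mm), int(ss))
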
papers.utils.try_date(year, month, day)[source]
papers.utils.ulower(s)[source]

Converts to unicode and lowercase.

Param: s – a string
Returns: unicode(s).lower()

>>> ulower('abSc')
u'absc'
>>> ulower(None)
u'none'
>>> ulower(89)
u'89'
papers.utils.unescape_latex(s)[source]

Replaces LaTeX symbols by their unicode counterparts using the unicode_tex package.

>>> unescape_latex(u'the $\\alpha$-rays of $\\Sigma$-algebras')
u'the $\u03b1$-rays of $\u03a3$-algebras'
>>> unescape_latex(u'$\\textit{K}$ -trivial')
u'$\\textit{K}$ -trivial'
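
A sketch of how the unicode_tex package could be applied (an assumption, including the tex_to_unicode_map attribute): map each \command token through the package's lookup table and keep commands, such as \textit, that have no unicode counterpart.

import re
import unicode_tex

def unescape_latex_sketch(s):
    def replace(match):
        cmd = match.group(0)
        return unicode_tex.tex_to_unicode_map.get(cmd, cmd)
    return re.sub(r'\\[a-zA-Z]+', replace, s)
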
papers.utils.urlize(val)[source]

Ensures a would-be URL actually starts with “http://” or “https://”.

Parameters: val – the URL
Returns: the cleaned URL
>>> urlize(u'gnu.org')
u'http://gnu.org'
>>> urlize(None) is None
True
>>> urlize(u'https://gnu.org')
u'https://gnu.org'
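
A sketch consistent with the doctests (an assumption):

def urlize_sketch(val):
    # Leave None untouched; prepend a scheme only when none is present.
    if val and not val.startswith(u'http://') and not val.startswith(u'https://'):
        val = u'http://' + val
    return val
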
papers.utils.valid_publication_date(dt)[source]

Checks that the date is not too far in the future (otherwise it is not a plausible publication date).

>>> valid_publication_date(datetime.date(6789, 1, 1))
False
>>> valid_publication_date(datetime.date(2018, 3, 4))
True
>>> valid_publication_date(None)
False
papers.utils.validate_orcid(orcid)[source]
Returns: a cleaned ORCID if the argument represents a valid ORCID, None otherwise

This does not check that the identifier actually exists on orcid.org; it only checks that it is syntactically valid (including the checksum). See http://support.orcid.org/knowledgebase/articles/116780-structure-of-the-orcid-identifier

See the test suite for a more complete set of examples

>>> validate_orcid(u' 0000-0001-8633-6098\n')
u'0000-0001-8633-6098'
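
A sketch of the validation (an assumption; the check digit follows the ISO 7064 mod 11-2 scheme described at the URL above):

import re

ORCID_SHAPE_RE = re.compile(r'^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$')

def validate_orcid_sketch(orcid):
    if not isinstance(orcid, basestring):  # str on Python 3
        return None
    orcid = orcid.strip()
    if not ORCID_SHAPE_RE.match(orcid):
        return None
    digits = orcid.replace('-', '')
    # ISO 7064 mod 11-2 over the first 15 digits.
    total = 0
    for c in digits[:-1]:
        total = (total + int(c)) * 2
    result = (12 - total % 11) % 11
    check = 'X' if result == 10 else str(result)
    return orcid if digits[-1] == check else None
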