ArXiv Parser

arXiv client with convenient typed wrappers.

This module provides ArxivParser and helper functions to search for papers, fetch metadata, download PDFs, and extract text. It is intentionally lightweight and dependency-minimal.

Example:

from shared.arxiv_parser import ArxivParser

parser = ArxivParser()
results = parser.search_papers("RAG small datasets", max_results=5)
for p in results:
    print(p.id, p.title)

class shared.arxiv_parser.ArxivPaper(id, title, authors, summary, categories, published, updated, pdf_url, abs_url, journal_ref=None, doi=None, comment=None, primary_category=None)

Bases: object

Class for representing a scientific article from arXiv.

Variables:

id – arXiv identifier (e.g., "2301.07041").
title – Paper title.
authors – Author names.
summary – Abstract text.
categories – arXiv categories.
published – Submission date.
updated – Last updated date.
pdf_url – Link to the PDF.
abs_url – Link to the abstract page.
journal_ref – Optional journal reference.
doi – Optional DOI.
comment – Optional author comment.
primary_category – Optional primary category.

Parameters:

id (str)
title (str)
authors (List[str])
summary (str)
categories (List[str])
published (datetime)
updated (datetime)
pdf_url (str)
abs_url (str)
journal_ref (str | None)
doi (str | None)
comment (str | None)
primary_category (str | None)

abs_url: str

authors: List[str]

categories: List[str]

comment: Optional[str] = None

doi: Optional[str] = None

id: str

journal_ref: Optional[str] = None

pdf_url: str

primary_category: Optional[str] = None

published: datetime

summary: str

title: str

updated: datetime
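The field listing above reads like a plain typed record. As a rough illustration only, the sketch below defines a hypothetical stand-in, PaperRecord, with the same field names, types, and defaults, and builds one instance. It is not the module's actual class definition, and every value passed in is a placeholder rather than real metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PaperRecord:
    """Hypothetical stand-in mirroring the documented ArxivPaper fields."""

    id: str
    title: str
    authors: List[str]
    summary: str
    categories: List[str]
    published: datetime
    updated: datetime
    pdf_url: str
    abs_url: str
    # The four optional fields default to None, as in the documented signature.
    journal_ref: Optional[str] = None
    doi: Optional[str] = None
    comment: Optional[str] = None
    primary_category: Optional[str] = None


# All values below are placeholders chosen for illustration.
paper = PaperRecord(
    id="2301.07041",
    title="Placeholder title",
    authors=["A. Author", "B. Author"],
    summary="Placeholder abstract.",
    categories=["cs.AI", "cs.LG"],
    published=datetime(2023, 1, 17, tzinfo=timezone.utc),
    updated=datetime(2023, 1, 17, tzinfo=timezone.utc),
    pdf_url="https://arxiv.org/pdf/2301.07041",
    abs_url="https://arxiv.org/abs/2301.07041",
    primary_category="cs.AI",
)
```

Callers should expect journal_ref, doi, comment, and primary_category to be None and check before using them.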

class shared.arxiv_parser.ArxivParser(downloads_dir='downloads')

Bases: object

Main class for working with the arXiv API.

Parameters: downloads_dir (str) – Directory used to store temporary files when downloading PDFs.
Returns: None.

download_pdf(paper, filename=None)

Download a paper’s PDF file.

Parameters:

paper (ArxivPaper) – The paper to download.
filename (Optional[str]) – Optional filename override; defaults to a safe name from id/title.

Return type:

Optional[str]

Returns:

Path to the downloaded file or None on error.
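The filename parameter is described as defaulting to "a safe name from id/title", but the exact rule is not documented. The sketch below shows one plausible way such a name could be derived. The helper safe_pdf_filename, its max_title_len parameter, and the substitution rules are all assumptions made for illustration, not the module's behavior.

```python
import re


def safe_pdf_filename(paper_id: str, title: str, max_title_len: int = 60) -> str:
    """Build a filesystem-friendly PDF name from an id and a title (hypothetical helper)."""
    # Old-style ids such as "hep-th/9901001" contain a slash, which is not
    # valid inside a single path component, so replace it.
    safe_id = paper_id.replace("/", "_")
    # Collapse every run of characters other than letters, digits, dots,
    # and hyphens into a single underscore, then trim stray underscores.
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", title).strip("_")
    # Keep the name short; trim again in case the cut landed on an underscore.
    slug = slug[:max_title_len].rstrip("_")
    return f"{safe_id}_{slug}.pdf" if slug else f"{safe_id}.pdf"
```

Passing an explicit filename to download_pdf sidesteps the question of how the default name is built.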

extract_text_from_pdf(pdf_path)

Extract text from a PDF file.

Parameters: pdf_path (str) – Path to the local PDF file.
Return type: Optional[str]
Returns: Extracted text or None on error.

get_paper_by_id(arxiv_id)

Get article data by ID.

Parameters: arxiv_id (str) – Article ID on arXiv (e.g., "2301.07041").
Return type: Optional[ArxivPaper]
Returns: The paper if found, otherwise None.
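Since a lookup returns None when the paper is not found, it can help to reject malformed identifiers before making a request. The sketch below is a hedged example of such a check. The helper normalize_arxiv_id is hypothetical, and the two patterns cover the common new-style (YYMM.NNNNN) and old-style (archive/YYMMNNN) identifier shapes with an optional version suffix. Whether the module itself accepts prefixed or versioned ids is not stated in this reference.

```python
import re
from typing import Optional

# New-style ids: four digits, a dot, four or five digits, optional "vN".
_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style ids: archive name, optional subject class, slash, seven digits.
_OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def normalize_arxiv_id(raw: str) -> Optional[str]:
    """Strip whitespace and an optional "arXiv:" prefix; return None if malformed."""
    candidate = raw.strip()
    if candidate.lower().startswith("arxiv:"):
        candidate = candidate[len("arxiv:"):]
    if _NEW_STYLE.match(candidate) or _OLD_STYLE.match(candidate):
        return candidate
    return None
```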

get_paper_text_online(paper)

Get article text online without downloading the PDF.

Parameters: paper (ArxivPaper) – The paper descriptor.
Return type: Optional[str]
Returns: The article text or None on error.
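Because each of these methods reports failure by returning None, callers usually chain them with explicit checks. The sketch below shows one possible fallback order: try the online text first, then download the PDF and extract it locally. The function fetch_text is a hypothetical helper. It only assumes an object that exposes the four documented methods, so it can be exercised with a stub instead of a live ArxivParser.

```python
from typing import Any, Optional


def fetch_text(parser: Any, arxiv_id: str) -> Optional[str]:
    """Return the paper text, or None if every route fails (hypothetical helper)."""
    paper = parser.get_paper_by_id(arxiv_id)
    if paper is None:
        return None  # Unknown id or lookup error.

    # Try the online route first; it avoids writing a file to disk.
    text = parser.get_paper_text_online(paper)
    if text is not None:
        return text

    # Fall back to downloading the PDF and extracting the text locally.
    pdf_path = parser.download_pdf(paper)
    if pdf_path is None:
        return None
    return parser.extract_text_from_pdf(pdf_path)
```

With the real class this would be called as fetch_text(ArxivParser(), "2301.07041"). The order of the two routes is a design choice, not something this reference prescribes.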

get_recent_papers(category=None, days=7, max_results=10)

Get recent articles.

Parameters:

category (Optional[str]) – Category filter.
days (int) – Number of days back from now.
max_results (int) – Maximum number of results.

Return type:

List[ArxivPaper]

Returns:

List of recent articles.
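The days parameter counts back from the current time. How the module applies the window is not documented here, so the sketch below only illustrates the arithmetic: compute a cutoff with timedelta and keep records published on or after it. The helper filter_recent and the (id, published) tuples are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple


def filter_recent(
    records: Iterable[Tuple[str, datetime]],
    days: int = 7,
    now: Optional[datetime] = None,
) -> List[str]:
    """Keep ids whose published timestamp falls within the last `days` days."""
    # Use timezone-aware datetimes throughout; comparing an aware value
    # with a naive one raises TypeError.
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [paper_id for paper_id, published in records if published >= cutoff]
```

Accepting now as an argument keeps the function deterministic in tests.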

search_by_author(author_name, max_results=10)

Search articles by author.

Parameters:

author_name (str) – Author name.
max_results (int) – Maximum number of results.

Return type:

List[ArxivPaper]

Returns:

List of the author’s articles.

search_by_category(category, max_results=10)

Search articles by category.

Parameters:

category (str) – Category (e.g., "cs.AI", "cs.LG").
max_results (int) – Maximum number of results.

Return type:

List[ArxivPaper]

Returns:

List of articles in the category.

search_papers(query, max_results=10, sort_by=arxiv.SortCriterion.Relevance, sort_order=arxiv.SortOrder.Descending, categories=None, date_from=None, date_to=None, start=0)

Search articles by query with optional filters.

Parameters:

query (str) – Search query string.
max_results (int) – Maximum number of results to return.
sort_by (SortCriterion) – Sort criterion, e.g., arxiv.SortCriterion.Relevance.
sort_order (SortOrder) – Sort order, e.g., arxiv.SortOrder.Descending.
categories (Optional[List[str]]) – Category filter like ["cs.AI", "cs.LG"].
date_from (Optional[datetime]) – Start date for results (inclusive).
date_to (Optional[datetime]) – End date for results (inclusive).
start (int) – Starting index for pagination (default 0).

Return type:

List[ArxivPaper]

Returns:

Found papers as typed records.
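This reference does not say how the categories, date_from, and date_to filters are translated into an arXiv query. One plausible translation uses the arXiv API's cat: field prefix and its submittedDate range syntax with GMT timestamps in YYYYMMDDHHMM form. The sketch below illustrates that approach. The function build_query and its wide default bounds for an open-ended range are assumptions, not the module's implementation.

```python
from datetime import datetime
from typing import List, Optional


def build_query(
    query: str,
    categories: Optional[List[str]] = None,
    date_from: Optional[datetime] = None,
    date_to: Optional[datetime] = None,
) -> str:
    """Compose an arXiv API search_query string (illustrative sketch)."""
    # The free-text part is passed through unchanged, grouped in parentheses.
    parts = [f"({query})"]
    if categories:
        # Match any of the listed categories.
        cats = " OR ".join(f"cat:{c}" for c in categories)
        parts.append(f"({cats})")
    if date_from or date_to:
        # Open-ended ranges fall back to arbitrary wide bounds (an assumption).
        lo = (date_from or datetime(1991, 1, 1)).strftime("%Y%m%d%H%M")
        hi = (date_to or datetime(2100, 1, 1)).strftime("%Y%m%d%H%M")
        parts.append(f"submittedDate:[{lo} TO {hi}]")
    return " AND ".join(parts)
```

For example, build_query("RAG", ["cs.AI", "cs.LG"]) yields "(RAG) AND (cat:cs.AI OR cat:cs.LG)".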

shared.arxiv_parser.download_paper(arxiv_id, downloads_dir='downloads')

Quick article download.

Parameters:

arxiv_id (str) – arXiv identifier.
downloads_dir (str) – Directory to store the PDF file.

Return type:

Optional[str]

Returns:

Path to the downloaded PDF or None.

shared.arxiv_parser.get_paper(arxiv_id)

Quick article retrieval by ID.

Parameters: arxiv_id (str) – arXiv identifier.
Return type: Optional[ArxivPaper]
Returns: ArxivPaper instance or None.

shared.arxiv_parser.search_papers(query, max_results=10)

Quick article search.

Parameters:

query (str) – Free-text search query.
max_results (int) – Maximum number of results to return.

Return type:

List[ArxivPaper]

Returns:

List of ArxivPaper instances.