ArXiv Parser
arXiv client with convenient typed wrappers.
This module provides ArxivParser and helper functions to search for papers, fetch metadata, download PDFs, and extract text. It is intentionally lightweight and keeps dependencies to a minimum.
Example:
    from shared.arxiv_parser import ArxivParser

    parser = ArxivParser()
    results = parser.search_papers("RAG small datasets", max_results=5)
    for p in results:
        # Each result is an ArxivPaper record.
        print(p.id, p.title)
- class shared.arxiv_parser.ArxivPaper(id, title, authors, summary, categories, published, updated, pdf_url, abs_url, journal_ref=None, doi=None, comment=None, primary_category=None)
Bases: object
Class for representing a scientific article from arXiv.
- Variables:
id – arXiv identifier (e.g., "2301.07041").
title – Paper title.
authors – Author names.
summary – Abstract text.
categories – arXiv categories.
published – Submission date.
updated – Last updated date.
pdf_url – Link to the PDF.
abs_url – Link to the abstract page.
journal_ref – Optional journal reference.
doi – Optional DOI.
comment – Optional author comment.
primary_category – Optional primary category.
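As a rough illustration of the record shape described above, the sketch below mirrors the documented fields in a dataclass. `PaperRecord` and `make_example_record` are hypothetical stand-ins, not the library's class. The field types (lists of strings, `datetime` values) are assumptions because the reference does not state them, and every value except the example identifier is a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PaperRecord:
    """Hypothetical stand-in that mirrors the documented ArxivPaper fields."""

    id: str
    title: str
    authors: List[str]        # assumed: a list of author names
    summary: str
    categories: List[str]     # assumed: a list of category codes
    published: datetime       # assumed type; the reference only says "date"
    updated: datetime         # assumed type
    pdf_url: str
    abs_url: str
    # The remaining fields are documented as optional and default to None.
    journal_ref: Optional[str] = None
    doi: Optional[str] = None
    comment: Optional[str] = None
    primary_category: Optional[str] = None


def make_example_record() -> PaperRecord:
    """Build a record from placeholder values (not real metadata)."""
    return PaperRecord(
        id="2301.07041",
        title="Placeholder title",
        authors=["A. Author"],
        summary="Placeholder abstract.",
        categories=["cs.AI"],
        published=datetime(2023, 1, 1),  # placeholder date
        updated=datetime(2023, 1, 1),    # placeholder date
        pdf_url="https://arxiv.org/pdf/2301.07041",
        abs_url="https://arxiv.org/abs/2301.07041",
    )
```

Keeping the four optional fields last, each with a `None` default, matches the order of the documented constructor signature.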
- class shared.arxiv_parser.ArxivParser(downloads_dir='downloads')
Bases: object
Main class for working with the arXiv API.
- Parameters:
downloads_dir (str) – Directory used to store temporary files when downloading PDFs.
- Returns:
None.
- download_pdf(paper, filename=None)
Download a paper's PDF file.
- Parameters:
paper (ArxivPaper) – The paper to download.
filename (Optional[str]) – Optional filename override; defaults to a safe name derived from the paper's id and title.
- Returns:
Path to the downloaded file, or None on error.
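The reference does not say how the default "safe name" is derived. The sketch below shows one plausible approach, assuming the name combines the identifier with a slug of the title. `safe_pdf_name` and its `max_len` parameter are hypothetical and are not part of `shared.arxiv_parser`.

```python
import re


def safe_pdf_name(paper_id: str, title: str, max_len: int = 80) -> str:
    """Build a filesystem-friendly PDF name from an id and a title.

    Hypothetical helper: the library's real naming rule is not documented.
    """
    # Old-style identifiers such as "hep-th/9901001" contain a slash,
    # which cannot appear inside a file name.
    safe_id = paper_id.replace("/", "_")
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    # Truncate long titles, then drop any underscore left at the cut.
    slug = slug[:max_len].rstrip("_")
    return f"{safe_id}_{slug}.pdf"
```

Under these assumptions, `safe_pdf_name("2301.07041", "RAG: small datasets?")` gives `2301.07041_RAG_small_datasets.pdf`. A caller who needs a specific name can pass `filename` explicitly.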
- get_paper_by_id(arxiv_id)
Get article data by ID.
- Parameters:
arxiv_id (str) – Article ID on arXiv (e.g., "2301.07041").
- Returns:
The paper if found, otherwise None.
- get_paper_text_online(paper)
Get article text online without downloading the PDF.
- Parameters:
paper (ArxivPaper) – The paper descriptor.
- Returns:
The article text, or None on error.
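Both get_paper_by_id and get_paper_text_online are documented as returning None on failure, so a caller that chains them has to check the intermediate result. The sketch below shows one way to do that. `fetch_text` is a hypothetical helper, and `parser` is assumed to be any object exposing the two documented methods, such as an ArxivParser instance.

```python
from typing import Any, Optional


def fetch_text(parser: Any, arxiv_id: str) -> Optional[str]:
    """Look up a paper by id and return its text, or None if a step fails.

    Hypothetical helper built only on the two documented methods.
    """
    paper = parser.get_paper_by_id(arxiv_id)
    if paper is None:
        # The id was not found; there is nothing to extract text from.
        return None
    # May itself return None when text extraction fails.
    return parser.get_paper_text_online(paper)


# Possible usage (requires the real module and network access):
#   from shared.arxiv_parser import ArxivParser
#   text = fetch_text(ArxivParser(), "2301.07041")
#   if text is not None:
#       print(text[:200])
```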
- search_by_author(author_name, max_results=10)
Search articles by author.
- Returns:
List of the author's articles.
- search_by_category(category, max_results=10)
Search articles by category.
- Returns:
List of articles in the category.
- search_papers(query, max_results=10, sort_by=arxiv.SortCriterion.Relevance, sort_order=arxiv.SortOrder.Descending, categories=None, date_from=None, date_to=None, start=0)
Search articles by query with optional filters.
- Parameters:
query (str) – Search query string.
max_results (int) – Maximum number of results to return.
sort_by (SortCriterion) – Sort criterion, e.g., arxiv.SortCriterion.Relevance.
sort_order (SortOrder) – Sort order, e.g., arxiv.SortOrder.Descending.
categories (Optional[List[str]]) – Category filter like ["cs.AI", "cs.LG"].
date_from (Optional[datetime]) – Start date for results (inclusive).
date_to (Optional[datetime]) – End date for results (inclusive).
start (int) – Starting index for pagination (default 0).
- Returns:
Found papers as typed records.
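The start and max_results parameters suggest offset-based pagination. The sketch below walks through result pages on that basis. `iter_search_pages`, `page_size`, and `max_pages` are hypothetical names. The code assumes that search_papers returns a list and that a page shorter than `page_size` means there are no further results; neither assumption is confirmed by the reference.

```python
from typing import Any, Iterator, List


def iter_search_pages(
    parser: Any,
    query: str,
    page_size: int = 25,
    max_pages: int = 4,
    **filters: Any,
) -> Iterator[List[Any]]:
    """Yield successive pages of results from parser.search_papers.

    Hypothetical helper; extra keyword arguments (categories, date_from,
    date_to, sort_by, sort_order) are forwarded unchanged.
    """
    for page in range(max_pages):
        batch = parser.search_papers(
            query,
            max_results=page_size,
            start=page * page_size,  # offset of the first result on this page
            **filters,
        )
        if not batch:
            return  # nothing left to fetch
        yield batch
        if len(batch) < page_size:
            return  # assumed: a short page is the last one


# Possible usage (requires the real module and network access):
#   from shared.arxiv_parser import ArxivParser
#   for page in iter_search_pages(ArxivParser(), "RAG small datasets",
#                                 page_size=5, categories=["cs.AI", "cs.LG"]):
#       for p in page:
#           print(p.id, p.title)
```

Capping the loop with `max_pages` bounds the number of requests even for a broad query.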
- shared.arxiv_parser.download_paper(arxiv_id, downloads_dir='downloads')
Quick article download.
- shared.arxiv_parser.get_paper(arxiv_id)
Quick article retrieval by ID.
- Parameters:
arxiv_id (str) – arXiv identifier.
- Returns:
ArxivPaper instance or None.
- shared.arxiv_parser.search_papers(query, max_results=10)
Quick article search.
- Returns:
List of ArxivPaper instances.