Search Module

Search utilities for the pipeline.

This module provides:

- Query generation (simple heuristic without embeddings)
- Retrieval from multiple sources (arXiv, Google Scholar, PubMed, GitHub)

All functions are synchronous wrappers around synchronous parsers, which keeps the initial integration simple. For now, the pipeline orchestrator can call them directly or run them in threads.
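
Since the search functions block until they return, an orchestrator could fan them out over a thread pool. The following is only a minimal sketch of that idea: `run_searches_in_threads` and `fake_search` are hypothetical names introduced for illustration, and `fake_search` stands in for the real search functions of this module.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_searches_in_threads(
    searches: Dict[str, Callable[[], List[str]]],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Run blocking search callables concurrently and gather results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the calls overlap in time.
        futures = {name: pool.submit(fn) for name, fn in searches.items()}
        # result() blocks until the call finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-in for a blocking search function (assumption).
def fake_search(source: str, query: str) -> List[str]:
    return [f"{source}: {query}"]


results = run_searches_in_threads({
    "arxiv": lambda: fake_search("arxiv", "RAG"),
    "pubmed": lambda: fake_search("pubmed", "RAG"),
})
print(results)
```

Each callable takes no arguments, so the query and any per-source options are bound in advance (here with a `lambda`).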

Search arXiv and convert results to PaperCandidate items.

Parameters:
  • query (str) – Search query string.

  • categories (Optional[List[str]]) – Optional list of arXiv categories, e.g. ["cs.AI", "cs.LG"].

  • max_results (int) – Page size for the search request (default 100).

  • start (int) – Offset for pagination (default 0).

Return type:

List[PaperCandidate]

Returns:

A list of candidate papers converted from arXiv results.

Example:

items = arxiv_search(query="RAG AND small datasets", max_results=10)
print(len(items))

agent.pipeline.search.collect_candidates(task, queries, per_query_limit=50)

Run source-specific search per query and collect unique candidates.

Parameters:
  • task (PipelineTask) – The pipeline task providing categories and other context.

  • queries (Iterable[GeneratedQuery]) – Queries to run; each GeneratedQuery specifies the source to search.

  • per_query_limit (int) – Max results retrieved for each query (default 50).

Return type:

List[PaperCandidate]

Returns:

Unique candidates from all queries.
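
The dispatch-and-deduplicate behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the module's implementation: `Candidate` and `Query` are simplified stand-ins for PaperCandidate and GeneratedQuery, the backend functions are fakes, and using the link as the uniqueness key is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass(frozen=True)
class Candidate:  # simplified stand-in for PaperCandidate (assumption)
    title: str
    link: str


@dataclass(frozen=True)
class Query:  # simplified stand-in for GeneratedQuery (assumption)
    text: str
    source: str


SearchFn = Callable[[str, int], List[Candidate]]


def collect_unique(
    queries: Iterable[Query],
    backends: Dict[str, SearchFn],
    per_query_limit: int = 50,
) -> List[Candidate]:
    """Run the backend named by each query and keep first-seen candidates."""
    seen: Set[str] = set()
    unique: List[Candidate] = []
    for query in queries:
        search = backends.get(query.source)
        if search is None:
            continue  # unknown source: skipped in this sketch
        for candidate in search(query.text, per_query_limit):
            if candidate.link in seen:  # assumption: the link identifies an item
                continue
            seen.add(candidate.link)
            unique.append(candidate)
    return unique


# Fake backends returning fixed data, for illustration only.
def fake_arxiv(text: str, limit: int) -> List[Candidate]:
    items = [Candidate("Paper A", "https://example.org/a"),
             Candidate("Paper B", "https://example.org/b")]
    return items[:limit]


def fake_github(text: str, limit: int) -> List[Candidate]:
    items = [Candidate("Repo for A", "https://example.org/a"),
             Candidate("Repo C", "https://example.org/c")]
    return items[:limit]


found = collect_unique(
    [Query("rag", "arxiv"), Query("rag", "github")],
    {"arxiv": fake_arxiv, "github": fake_github},
)
print([candidate.title for candidate in found])
```

Results keep their first-seen order, so earlier queries take precedence when two sources return the same link.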

Search GitHub repositories and represent them as candidates.

The pipeline treats repositories as candidates with title and snippet.

Parameters:
  • query (str) – Search query string.

  • max_results (int) – Page size for the search request (default 50).

  • start (int) – Offset for pagination (default 0).

Return type:

List[PaperCandidate]

Returns:

Candidate list with repo name and link.
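
As an illustration of representing a repository as a candidate, the sketch below maps one repository record to a title, link, and snippet. It assumes the input is a dict shaped like an item from GitHub's repository search response (`full_name`, `html_url`, `description`); whether this module consumes that exact shape is not stated in the source. `RepoCandidate` and `repo_to_candidate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class RepoCandidate:  # simplified stand-in for PaperCandidate (assumption)
    title: str
    link: str
    summary: str


def repo_to_candidate(repo: Dict[str, Any]) -> RepoCandidate:
    """Map a repository record to a title/link/snippet candidate."""
    return RepoCandidate(
        title=repo.get("full_name", ""),
        link=repo.get("html_url", ""),
        # A repository's description may be null, so fall back to "".
        summary=(repo.get("description") or "").strip(),
    )


# Illustrative input only; not output captured from a real request.
sample = {
    "full_name": "octocat/Hello-World",
    "html_url": "https://github.com/octocat/Hello-World",
    "description": None,
}
candidate = repo_to_candidate(sample)
print(candidate.title, candidate.link)
```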

Search PubMed and convert results to candidates.

Parameters:
  • query (str) – Search query string.

  • max_results (int) – Page size for the search request (default 50).

  • start (int) – Offset for pagination (default 0).

Return type:

List[PaperCandidate]

Returns:

Candidate list with title and PubMed link.

Search Google Scholar and convert results to lightweight candidates.

Since Scholar results do not provide abstracts, the summary field uses the snippet text when available. Categories and arXiv-specific fields are left empty.

Parameters:
  • query (str) – Search query string.

  • max_results (int) – Page size for the search request (default 50).

  • start (int) – Offset for pagination (default 0).

Return type:

List[PaperCandidate]

Returns:

Candidate list with title and link.
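
The snippet fallback described for Scholar results can be sketched as follows. The field names on `ScholarCandidate` (including `arxiv_id`) and the input keys `title`, `link`, and `snippet` are assumptions made for illustration; the real PaperCandidate type and parser output may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ScholarCandidate:  # simplified stand-in for PaperCandidate (assumption)
    title: str
    link: str
    summary: str = ""
    categories: List[str] = field(default_factory=list)  # left empty for Scholar
    arxiv_id: Optional[str] = None  # hypothetical arXiv-specific field, unset


def scholar_to_candidate(result: Dict[str, Any]) -> ScholarCandidate:
    """Build a lightweight candidate, using the snippet as the summary."""
    snippet = result.get("snippet") or ""  # no abstract is available
    return ScholarCandidate(
        title=result.get("title", ""),
        link=result.get("link", ""),
        summary=snippet.strip(),
    )


with_snippet = scholar_to_candidate(
    {"title": "T", "link": "https://example.org/t", "snippet": "  Short excerpt  "}
)
without_snippet = scholar_to_candidate(
    {"title": "U", "link": "https://example.org/u"}
)
print(repr(with_snippet.summary), repr(without_snippet.summary))
```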