API Reference¶

`scholaraio.index` ¶

index.py: SQLite FTS5 full-text search index¶

Indexed fields: title + abstract + conclusion (all searchable). The remaining fields (paper_id, authors, year, journal, doi, paper_type, citation_count, md_path) are stored but not searchable.

Usage::

from scholaraio.index import build_index, search
build_index(papers_dir, db_path)
results = search("turbulent boundary layer", db_path)

`build_index(papers_dir, db_path, rebuild=False)` ¶

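Build or incrementally update the SQLite FTS5 full-text search index.

Indexed fields: title + abstract + conclusion, all of which take part in full-text search. The remaining fields (paper_id, authors, etc.) are stored only.

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Directory of ingested papers; its `*.json` files are scanned.	required
`db_path`	`Path`	Path to the SQLite database; created automatically if it does not exist.	required
`rebuild`	`bool`	When `True`, clear the old data and rebuild.	`False`

Returns:

Type	Description
`int`	Number of papers indexed in this run.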
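The split between searchable and stored-only fields can be illustrated with plain `sqlite3`. This is a minimal sketch, not the schema `build_index` creates: the table name `papers`, the column order, and the helper `fts_search` are assumptions made for the example. FTS5's `UNINDEXED` column option keeps a value in the table without adding it to the full-text index.

```python
import sqlite3

# In-memory database; the table layout is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE papers USING fts5("
    "title, abstract, conclusion, paper_id UNINDEXED, year UNINDEXED)"
)
conn.executemany(
    "INSERT INTO papers (title, abstract, conclusion, paper_id, year) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Turbulent boundary layer drag", "We study wall turbulence.",
         "Drag is reduced.", "p1", "2021"),
        ("Graph neural networks", "A survey of message passing.",
         "Models scale well.", "p2", "2023"),
    ],
)


def fts_search(query):
    """Return paper ids matching the FTS5 query, best BM25 match first."""
    rows = conn.execute(
        "SELECT paper_id FROM papers WHERE papers MATCH ? ORDER BY bm25(papers)",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


searchable_hit = fts_search("turbulent boundary")  # matches the title of p1
stored_only_hit = fts_search("p2")                 # paper_id is not indexed
```

Space-separated terms are combined with an implicit AND in FTS5, which is why both words must appear for a row to match.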

`build_proceedings_index(proceedings_root, db_path, rebuild=False)` ¶

Build a keyword index for proceedings child papers.

`search(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

FTS5 keyword full-text search.

Runs an FTS5 MATCH against the title, abstract, and conclusion fields and returns results ranked by BM25 relevance.

Parameters:

Name	Type	Description	Default
`query`	`str`	Search terms (separate multiple terms with spaces; FTS5 syntax).	required
`db_path`	`Path`	Path to the SQLite index database.	required
`top_k`	`int \| None`	Maximum number of results; when `None`, read from `cfg.search.top_k`.	`None`
`cfg`	`Config \| None`	Optional `scholaraio.config.Config` instance.	`None`
`year`	`str \| None`	Year filter (`"2023"` / `"2020-2024"` / `"2020-"`).	`None`
`journal`	`str \| None`	Journal name filter (fuzzy LIKE match).	`None`
`paper_type`	`str \| None`	Paper type filter (e.g. `"review"`, `"journal-article"`).	`None`
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs; only results in the set are returned.	`None`

Returns:

Type	Description
`list[dict]`	List of matching paper dicts; each contains `paper_id`, `title`, `authors`, `year`, `journal`, `doi`, `paper_type`, and `citation_count`.

Raises:

Type	Description
`FileNotFoundError`	The index file or the FTS5 table does not exist.
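The three accepted forms of the `year` filter (`"2023"`, `"2020-2024"`, `"2020-"`) can be read as a closed or open-ended range. The sketch below shows one way to interpret them; `parse_year_filter` and `year_matches` are hypothetical helpers written for this example, not functions exported by `scholaraio.index`.

```python
def parse_year_filter(spec):
    """Return inclusive (start, end) bounds; None means the side is open."""
    spec = spec.strip()
    if "-" not in spec:
        year = int(spec)
        return year, year
    start, _, end = spec.partition("-")
    return (int(start) if start else None, int(end) if end else None)


def year_matches(year, spec):
    """Check whether a publication year falls inside the filter range."""
    if year is None:
        return False
    start, end = parse_year_filter(spec)
    return (start is None or year >= start) and (end is None or year <= end)
```

With this reading, `"2020-"` keeps everything from 2020 onward, while `"2020-2024"` includes both endpoints.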

`search_author(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

Search papers by author name (fuzzy LIKE match).

Parameters:

Name	Type	Description	Default
`query`	`str`	Author name (or part of a name); case-insensitive.	required
`db_path`	`Path`	Path to the SQLite index database.	required
`top_k`	`int \| None`	Maximum number of results; when `None`, read from `cfg.search.top_k`.	`None`
`cfg`	`Config \| None`	Optional `scholaraio.config.Config` instance.	`None`
`year`	`str \| None`	Year filter (`"2023"` / `"2020-2024"` / `"2020-"`).	`None`
`journal`	`str \| None`	Journal name filter (fuzzy LIKE match).	`None`
`paper_type`	`str \| None`	Paper type filter (e.g. `"review"`, `"journal-article"`).	`None`
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs; only results in the set are returned.	`None`

Returns:

Type	Description
`list[dict]`	List of matching paper dicts.

`top_cited(db_path, top_k=10, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

Return papers in descending order of citation count.

Parameters:

Name	Type	Description	Default
`db_path`	`Path`	Path to the SQLite index database.	required
`top_k`	`int`	Maximum number of results.	`10`
`year`	`str \| None`	Year filter (`"2023"` / `"2020-2024"` / `"2020-"`).	`None`
`journal`	`str \| None`	Journal name filter (fuzzy LIKE match).	`None`
`paper_type`	`str \| None`	Paper type filter (e.g. `"review"`, `"journal-article"`).	`None`
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs; only results in the set are returned.	`None`

Returns:

Type	Description
`list[dict]`	List of paper dicts, sorted by `citation_count` in descending order.

Raises:

Type	Description
`FileNotFoundError`	The index file or the FTS5 table does not exist.

`unified_search(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

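Hybrid search: FTS5 keyword search + FAISS semantic vector search, merged, deduplicated, and ranked.

Both retrieval paths run in parallel, each returning up to top_k candidates; results are deduplicated by paper_id and returned sorted by combined score. Papers hit by FTS5 receive a rank-based bonus, and papers from vector search are scored by cosine similarity. Papers hit by both paths have their scores added together and therefore rank higher.

When the vector index is unavailable (embed has not been run), the search automatically falls back to FTS5 only.

Parameters:

Name	Type	Description	Default
`query`	`str`	Natural-language query text.	required
`db_path`	`Path`	Path to the SQLite database.	required
`top_k`	`int \| None`	Maximum number of results; when `None`, read from the config.	`None`
`cfg`	`Config \| None`	Optional `scholaraio.config.Config` instance.	`None`
`year`	`str \| None`	Year filter.	`None`
`journal`	`str \| None`	Journal name filter.	`None`
`paper_type`	`str \| None`	Paper type filter.	`None`
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs; only results in the set are returned.	`None`

Returns:

Type	Description
`list[dict]`	List of paper dicts in descending order of combined score. Each contains `paper_id`, `title`, `authors`, `year`, `journal`, `score`, and `match` (`"fts"` / `"vec"` / `"both"`).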
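The merge step described for `unified_search` can be sketched as follows. The reference does not state the exact rank bonus, so the `1 / (rank + 1)` term below is an assumption chosen for illustration, and `fuse` is a hypothetical helper rather than part of the module.

```python
def fuse(fts_ids, vec_hits, top_k=5):
    """Merge two candidate lists into one ranked list.

    fts_ids:  paper ids in BM25 order (best first).
    vec_hits: mapping of paper id -> cosine similarity.
    """
    scores, match = {}, {}
    for rank, pid in enumerate(fts_ids):
        scores[pid] = 1.0 / (rank + 1)  # assumed rank bonus, not the real formula
        match[pid] = "fts"
    for pid, similarity in vec_hits.items():
        # A paper found by both paths has its two scores added together.
        match[pid] = "both" if pid in match else "vec"
        scores[pid] = scores.get(pid, 0.0) + similarity
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [
        {"paper_id": pid, "score": scores[pid], "match": match[pid]}
        for pid in ranked[:top_k]
    ]


results = fuse(["a", "b"], {"b": 0.8, "c": 0.6})
```

Here paper `b` is second in the keyword list but also appears in the vector hits, so its summed score puts it ahead of `a`.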

`search_proceedings(query, db_path, top_k=20, *, year=None, journal=None, paper_type=None)` ¶

Keyword search over proceedings child papers.

`lookup_paper(db_path, user_input)` ¶

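Look up a paper by UUID, dir_name, DOI, or patent publication number.

Matches are attempted in this order: UUID → dir_name → DOI → publication_number. Publication-number queries are automatically normalized to uppercase.

Parameters:

Name	Type	Description	Default
`db_path`	`Path`	Path to the SQLite database.	required
`user_input`	`str`	UUID, directory name, DOI, or patent publication number.	required

Returns:

Type	Description
`dict \| None`	The `papers_registry` row as a dict, or `None` if not found.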
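The lookup order can be pictured as a cascade over candidate columns. The sketch below works on an in-memory list of row dicts instead of SQLite; the key names `id`, `dir_name`, `doi`, and `publication_number` are assumptions about the `papers_registry` columns, and `lookup` is a hypothetical stand-in for `lookup_paper`.

```python
def lookup(registry, user_input):
    """Try each identifier type in the documented order; first match wins."""
    key = user_input.strip()
    attempts = (
        ("id", key),
        ("dir_name", key),
        ("doi", key),
        ("publication_number", key.upper()),  # normalized to uppercase
    )
    for field, value in attempts:
        for row in registry:
            if row.get(field) == value:
                return row
    return None


registry = [
    {"id": "uuid-1", "dir_name": "Doe-2021-Drag", "doi": "10.1000/x1"},
    {"id": "uuid-2", "dir_name": "Patent-A", "publication_number": "US1234567B2"},
]
by_doi = lookup(registry, "10.1000/x1")
by_patent = lookup(registry, "us1234567b2")
```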

`get_references(paper_id, db_path, *, paper_ids=None)` ¶

Query the reference list of a paper.

Parameters:

Name	Type	Description	Default
`paper_id`	`str`	Paper UUID.	required
`db_path`	`Path`	Path to the SQLite database.	required
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs (filters in-library results only).	`None`

Returns:

Type	Description
`list[dict]`	List of references; each item contains `target_doi` and `target_id`, and in-library papers additionally contain `title`, `dir_name`, `year`, and `first_author`.

`get_citing_papers(paper_id, db_path, *, paper_ids=None)` ¶

Query which local papers cite the given paper (reverse lookup within the library).

Parameters:

Name	Type	Description	Default
`paper_id`	`str`	UUID of the cited paper.	required
`db_path`	`Path`	Path to the SQLite database.	required
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs.	`None`

Returns:

Type	Description
`list[dict]`	List of citing papers; each item contains `source_id`, `dir_name`, `title`, and `year`.

`get_shared_references(paper_id_list, db_path, min_shared=2, *, paper_ids=None)` ¶

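Query the references shared by multiple papers.

Parameters:

Name	Type	Description	Default
`paper_id_list`	`list[str]`	List of paper UUIDs.	required
`db_path`	`Path`	Path to the SQLite database.	required
`min_shared`	`int`	Minimum number of papers that must cite a reference for it to be included.	`2`
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs (filters in-library results only).	`None`

Returns:

Type	Description
`list[dict]`	List of shared references; each item contains `target_doi`, `shared_count`, and `target_id`, and in-library papers additionally contain `title` and `dir_name`.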

`scholaraio.loader` ¶

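loader.py: layered content loading + TOC extraction + L3 conclusion extraction¶

L1: title / authors / year / journal / doi ← JSON fields
L2: abstract ← JSON field
L3: conclusion ← JSON field (requires running enrich_l3 first)
L4: full markdown ← read from the .md file

TOC extraction (enrich_toc)¶

Extract all # headings and their line numbers with a regex
Use an LLM to filter out noise (author running headers, journal names, repeated paper titles, etc.) and assign a level to each real section heading
Write to JSON["toc"]: [{"line": N, "level": N, "title": "..."}]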
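The first TOC step, collecting `#` headings with their line numbers, might look like the sketch below. It is an illustration under assumptions: `extract_headings` is a hypothetical name, line numbers are taken as 1-based, and the `level` here simply counts `#` characters, whereas the documented flow lets the LLM assign the final level after filtering noise.

```python
import re

# One to six leading '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def extract_headings(markdown):
    """Collect candidate headings as {"line", "level", "title"} dicts."""
    headings = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        found = HEADING_RE.match(line)
        if found:
            headings.append(
                {
                    "line": line_no,
                    "level": len(found.group(1)),
                    "title": found.group(2),
                }
            )
    return headings


sample = "# A Paper Title\n\nJ. Doe\n\n# 1 Introduction\ntext\n## 1.1 Background\n"
candidates = extract_headings(sample)
```

The resulting list is what the LLM filtering step would receive; entries such as a repeated paper title would be dropped there.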

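L3 extraction (enrich_l3)¶

If the JSON already has a TOC, the conclusion section is located from it directly (skipping the first LLM call). Otherwise the primary path is used: the LLM picks the conclusion section from the raw heading list → Python slices the text → the LLM validates it. Fallback path: the LLM gives the start and end line numbers directly → Python slices the text → the LLM validates it.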

`load_l1(json_path)` ¶

Load L1 metadata (title, authors, year, journal, DOI).

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the paper's JSON metadata file.	required

Returns:

Type	Description
`dict`	Dict containing `paper_id`, `title`, `authors`, `year`, `journal`, and `doi`.

`load_l2(json_path)` ¶

Load the L2 abstract text.

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the paper's JSON metadata file.	required

Returns:

Type	Description
`str`	The abstract text, or `"[No abstract available]"` when there is no abstract.

`load_l3(json_path)` ¶

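Load the L3 conclusion text.

`enrich_l3` must be run first to extract the conclusion section into the JSON.

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the paper's JSON metadata file.	required

Returns:

Type	Description
`str \| None`	The conclusion text, or `None` if it has not been extracted yet.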

`load_l4(md_path, *, lang=None)` ¶

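Load the L4 full-text Markdown, optionally loading a translated version.

When lang is given, paper_{lang}.md (e.g. paper_zh.md) is loaded preferentially, falling back to the original paper.md if it does not exist.

Parameters:

Name	Type	Description	Default
`md_path`	`Path`	Path to the `.md` file produced by MinerU.	required
`lang`	`str \| None`	Target language code (e.g. `"zh"`); when `None`, the original is loaded.	`None`

Returns:

Type	Description
`str`	The complete Markdown text.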
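The translated-file fallback can be sketched with `pathlib`. `resolve_markdown` is a hypothetical helper that only picks the path; it assumes the translated file sits next to the original and follows the `paper_{lang}.md` naming described above.

```python
import tempfile
from pathlib import Path


def resolve_markdown(md_path, lang=None):
    """Prefer <stem>_<lang>.md next to md_path; otherwise use md_path itself."""
    md_path = Path(md_path)
    if lang:
        translated = md_path.with_name(f"{md_path.stem}_{lang}{md_path.suffix}")
        if translated.exists():
            return translated
    return md_path


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "paper.md"
    original.write_text("# Original", encoding="utf-8")
    (Path(tmp) / "paper_zh.md").write_text("# Translated", encoding="utf-8")
    picked_zh = resolve_markdown(original, lang="zh").name  # translation exists
    picked_de = resolve_markdown(original, lang="de").name  # falls back
    picked_default = resolve_markdown(original).name
```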

`load_notes(paper_dir)` ¶

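Load the agent's analysis notes for a paper.

The notes file (notes.md) is created and appended to automatically by the agent while it analyzes papers, so that analysis conclusions can be reused across sessions and workspaces.

Parameters:

Name	Type	Description	Default
`paper_dir`	`Path`	Path to the paper directory (the directory containing `meta.json`).	required

Returns:

Type	Description
`str \| None`	The notes text, or `None` if the file does not exist.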

`append_notes(paper_dir, section)` ¶

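Append an analysis record to the paper's notes file.

notes.md is created if it does not exist. Records are separated by a blank line.

Parameters:

Name	Type	Description	Default
`paper_dir`	`Path`	Path to the paper directory.	required
`section`	`str`	Note content to append (Markdown; starting with `## date \| source` is recommended).	required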

`enrich_toc(json_path, md_path, config, *, force=False, inspect=False)` ¶

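Extract the paper's table-of-contents structure with an LLM and write it to JSON["toc"].

Extracts all # headings from the Markdown, filters out noise such as running headers, journal names, and author names via an LLM, and assigns a level to each real section heading.

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the paper's JSON metadata file (results are written back to this file).	required
`md_path`	`Path`	Path to the paper's Markdown file.	required
`config`	`Config`	Global config (used for LLM calls).	required
`force`	`bool`	When `True`, overwrite an existing TOC.	`False`
`inspect`	`bool`	When `True`, print details of the filtering process.	`False`

Returns:

Type	Description
`bool`	`True` on success, `False` on failure.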

`enrich_l3(json_path, md_path, config, *, force=False, max_retries=2, inspect=False)` ¶

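Extract the conclusion section with an LLM and write it to JSON["l3_conclusion"].

Extraction strategies (in priority order):
1. Locate the conclusion section from an existing TOC → Python slices the text → the LLM validates and cleans it
2. Primary path: the LLM picks the conclusion section from the heading list → slice → validate
3. Fallback path: the LLM gives the start and end line numbers directly → slice → validate

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the paper's JSON metadata file (results are written back to this file).	required
`md_path`	`Path`	Path to the paper's Markdown file.	required
`config`	`Config`	Global config (used for LLM calls).	required
`force`	`bool`	When `True`, overwrite an existing conclusion.	`False`
`max_retries`	`int`	Maximum number of retries per path.	`2`
`inspect`	`bool`	When `True`, print details of the extraction process.	`False`

Returns:

Type	Description
`bool`	`True` on success, `False` on failure.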

`scholaraio.export` ¶

export.py: paper export (BibTeX / RIS / Markdown / DOCX and other formats)¶

Converts meta.json into standard citation formats.

`meta_to_bibtex(meta)` ¶

Convert a single meta.json dict to a BibTeX entry string.

Parameters:

Name	Type	Description	Default
`meta`	`dict`	Paper metadata dictionary.	required

Returns:

Type	Description
`str`	Formatted BibTeX entry string.

`export_bibtex(papers_dir, *, paper_ids=None, year=None, journal=None, paper_type=None)` ¶

Export papers to BibTeX format.

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Root papers directory.	required
`paper_ids`	`list[str] \| None`	Specific paper dir names to export. None = all.	`None`
`year`	`str \| None`	Year filter (e.g. "2023", "2020-2024").	`None`
`journal`	`str \| None`	Journal name filter (case-insensitive substring).	`None`
`paper_type`	`str \| None`	Paper type filter (case-insensitive substring).	`None`

Returns:

Type	Description
`str`	Complete BibTeX string with all matching entries.

`scholaraio.audit` ¶

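audit.py: data-quality audit for ingested papers¶

Scans all papers in data/papers/ and checks for problems with metadata completeness, data quality, and content consistency. Returns a structured issue report for the user to review.

Rule-based checks (no LLM required):
- Missing key fields (doi, abstract, year, authors, journal)
- Pairing completeness (whether both meta.json and paper.md are present in the directory)
- Naming conventions (directory name does not follow the Author-Year-Title format)
- Duplicate DOI detection
- MD content too short (conversion may have failed)
- JSON title does not match the first H1 in the MD

`Issue` `dataclass` ¶

A single audit issue.

Attributes:

Name	Type	Description
`paper_id`	`str`	Paper ID (directory name).
`severity`	`str`	Severity: `"error"` \| `"warning"` \| `"info"`.
`rule`	`str`	Name of the check rule.
`message`	`str`	Description of the issue.

`audit_papers(papers_dir)` ¶

Run a full data-quality audit on the papers directory.

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Directory of ingested papers (one directory per paper).	required

Returns:

Type	Description
`list[Issue]`	List of issues sorted by severity (errors first).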
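A minimal sketch of the `Issue` record and the severity ordering of the report follows. The dataclass mirrors the four documented attributes; the rule names in the sample data, the `SEVERITY_ORDER` table, and `sort_issues` are assumptions for illustration, and the secondary sort by `paper_id` is a choice made here, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    paper_id: str
    severity: str  # "error" | "warning" | "info"
    rule: str
    message: str


# Lower number sorts first, so errors lead the report.
SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}


def sort_issues(issues):
    """Order issues by severity, then by paper id for a stable report."""
    return sorted(
        issues,
        key=lambda issue: (SEVERITY_ORDER.get(issue.severity, 99), issue.paper_id),
    )


report = sort_issues(
    [
        Issue("Doe-2021-Drag", "info", "short_md", "Markdown is very short"),
        Issue("Roe-2019-Flow", "error", "missing_doi", "DOI is missing"),
        Issue("Ann-2020-Wake", "warning", "title_mismatch", "Title differs from H1"),
    ]
)
```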

`scholaraio.workspace` ¶

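workspace.py: managing workspace paper subsets¶

Each workspace is a workspace/<name>/ directory containing a papers.json index file that points to papers in data/papers/. Notes, code, drafts, and so on can be stored freely inside a workspace.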

`create(ws_dir)` ¶

Create the workspace directory and initialize an empty papers.json.

Parameters:

Name	Type	Description	Default
`ws_dir`	`Path`	Path to the workspace directory.	required

Returns:

Type	Description
`Path`	Path to the papers.json file.

`add(ws_dir, paper_refs, db_path, *, resolved=None)` ¶

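Add papers to a workspace.

Resolves papers by UUID, directory name, or DOI, deduplicates them, and appends them to papers.json.

When the caller already holds resolved paper information, it can pass it directly via the resolved parameter, skipping the per-paper lookup_paper() queries (and avoiding the overhead of O(N) DB connections).

Parameters:

Name	Type	Description	Default
`ws_dir`	`Path`	Path to the workspace directory.	required
`paper_refs`	`list[str]`	List of paper references (UUID / directory name / DOI). Ignored when resolved is not `None`.	required
`db_path`	`Path`	Path to index.db, used by lookup_paper.	required
`resolved`	`list[dict] \| None`	Pre-resolved list of papers; each element must contain the `"id"` and `"dir_name"` keys. When provided, the lookup_paper queries are skipped.	`None`

Returns:

Type	Description
`list[dict]`	List of newly added entries.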
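The deduplicating append can be sketched as below for the `resolved` path. The on-disk layout of papers.json is not specified in this reference, so the sketch assumes a plain JSON list of `{"id", "dir_name"}` entries; `add_entries` is a hypothetical helper, not the module's `add`.

```python
import json
import tempfile
from pathlib import Path


def add_entries(ws_dir, resolved):
    """Append pre-resolved papers to papers.json, skipping known UUIDs."""
    index_path = Path(ws_dir) / "papers.json"
    if index_path.exists():
        entries = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        entries = []
    seen = {entry["id"] for entry in entries}
    added = []
    for paper in resolved:
        if paper["id"] in seen:
            continue  # already in the workspace
        entry = {"id": paper["id"], "dir_name": paper["dir_name"]}
        entries.append(entry)
        added.append(entry)
        seen.add(paper["id"])
    index_path.write_text(
        json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return added


with tempfile.TemporaryDirectory() as ws:
    first = add_entries(ws, [{"id": "u1", "dir_name": "A"}, {"id": "u2", "dir_name": "B"}])
    second = add_entries(ws, [{"id": "u2", "dir_name": "B"}, {"id": "u3", "dir_name": "C"}])
    stored_ids = [e["id"] for e in json.loads((Path(ws) / "papers.json").read_text(encoding="utf-8"))]
```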

`remove(ws_dir, paper_refs, db_path)` ¶

Remove papers from a workspace.

Parameters:

Name	Type	Description	Default
`ws_dir`	`Path`	Path to the workspace directory.	required
`paper_refs`	`list[str]`	List of paper references (UUID / directory name / DOI).	required
`db_path`	`Path`	Path to index.db.	required

Returns:

Type	Description
`list[dict]`	List of removed entries.

`list_workspaces(ws_root)` ¶

List all workspaces that contain a papers.json.

Parameters:

Name	Type	Description	Default
`ws_root`	`Path`	The workspace/ root directory.	required

Returns:

Type	Description
`list[str]`	List of workspace names (sorted).

`read_paper_ids(ws_dir)` ¶

Return the set of UUIDs of all papers in the workspace.

Parameters:

Name	Type	Description	Default
`ws_dir`	`Path`	Path to the workspace directory.	required

Returns:

Type	Description
`set[str]`	Set of UUID strings, used for search filtering.

`scholaraio.papers` ¶

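papers.py: the single source of truth for the paper directory structure¶

All modules access paper paths through this module rather than building paths themselves.

Directory structure:

data/papers//
├── meta.json    # contains the "id": "" field
└── paper.md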

`paper_dir(papers_dir, dir_name)` ¶

Return the directory path for a paper.

`meta_path(papers_dir, dir_name)` ¶

Return the meta.json path for a paper.

`md_path(papers_dir, dir_name)` ¶

Return the paper.md path for a paper.

`iter_paper_dirs(papers_dir)` ¶

Yield sorted subdirectories containing meta.json.

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Root papers directory.	required

Yields:

Type	Description
`Path`	Path to each paper subdirectory that contains a `meta.json`.

`scholaraio.proceedings` ¶

Helpers for proceedings library storage and iteration.

`proceedings_db_path(root)` ¶

`iter_proceedings_dirs(proceedings_root)` ¶

`iter_proceedings_papers(proceedings_root)` ¶

Yield child-paper rows enriched with proceeding metadata.

`scholaraio.vectors` ¶

vectors.py: vector embeddings and semantic search¶

Generates paper vectors with Qwen3-Embedding-0.6B (from the local ModelScope cache). Embedded text = title + abstract, stored in the paper_vectors table of index.db.

Usage::

from scholaraio.vectors import build_vectors, vsearch
build_vectors(papers_dir, db_path)
results = vsearch("turbulent drag reduction", db_path, top_k=5)

`build_vectors(papers_dir, db_path, rebuild=False, cfg=None)` ¶

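Generate semantic embedding vectors for papers and write them to the paper_vectors table.

Embedded text = title + abstract, concatenated. Uses a Sentence Transformer model (Qwen3-Embedding-0.6B by default).

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Directory of ingested papers; its `*.json` files are scanned.	required
`db_path`	`Path`	Path to the SQLite database; created automatically if it does not exist.	required
`rebuild`	`bool`	When `True`, clear the old vectors and rebuild.	`False`
`cfg`	`Config \| None`	Optional `scholaraio.config.Config`, used to read model/device settings.	`None`

Returns:

Type	Description
`int`	Number of vectors newly written in this run.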

`vsearch(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

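Semantic vector search, using FAISS to accelerate cosine-similarity search.

Encodes the query text as a vector and retrieves the most similar papers via FAISS IndexFlatIP. The FAISS index is built automatically on the first query and cached to disk; it is invalidated and rebuilt automatically after the vectors change.

Parameters:

Name	Type	Description	Default
`query`	`str`	Natural-language query text.	required
`db_path`	`Path`	Path to the SQLite database (must contain the `paper_vectors` table).	required
`top_k`	`int \| None`	Maximum number of results; when `None`, read from `cfg.embed.top_k`.	`None`
`cfg`	`Config \| None`	Optional `scholaraio.config.Config`, used to load the embedding model.	`None`
`year`	`str \| None`	Year filter (`"2023"` / `"2020-2024"` / `"2020-"`).	`None`
`journal`	`str \| None`	Journal name filter (fuzzy LIKE match).	`None`
`paper_type`	`str \| None`	Paper type filter (e.g. `"review"`, `"journal-article"`).	`None`
`paper_ids`	`set[str] \| None`	Whitelist of paper UUIDs; only results in the set are returned.	`None`

Returns:

Type	Description
`list[dict]`	List of paper dicts sorted by `score` in descending order. Each contains `paper_id`, `title`, `authors`, `year`, `journal`, and `score`.

Raises:

Type	Description
`FileNotFoundError`	The index file or the `paper_vectors` table does not exist.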
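An inner-product index yields cosine similarity when the vectors are L2-normalized first. The pure-Python sketch below shows that relationship without FAISS; it is an exhaustive scan like a flat index, and it assumes (the reference does not say so explicitly) that normalization is applied to both stored and query vectors. `normalize` and `top_k_by_inner_product` are hypothetical helpers.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_by_inner_product(query, vectors, k=2):
    """Exhaustive inner-product search over unit vectors (cosine similarity)."""
    unit_query = normalize(query)
    scored = [
        (pid, sum(a * b for a, b in zip(unit_query, normalize(vec))))
        for pid, vec in vectors.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


vectors = {"p1": [1.0, 0.0], "p2": [0.0, 2.0], "p3": [3.0, 3.0]}
hits = top_k_by_inner_product([2.0, 0.1], vectors)
```

Because every vector has unit length, the scores stay within [-1, 1] regardless of the original magnitudes, so `p2` and `p3` are compared by direction only.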

`scholaraio.topics` ¶

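topics.py: BERTopic topic modeling¶

Reuses the Qwen3 embedding vectors in the paper_vectors table and clusters all papers in the library into topics with BERTopic. Supports a library-wide topic overview, per-topic paper lists, a hierarchical topic tree, and discovery of relations between topics.

Usage::

from scholaraio.topics import build_topics, get_topic_overview, get_topic_papers
model = build_topics(db_path, papers_dir)
overview = get_topic_overview(model)
papers = get_topic_papers(model, topic_id=0)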

`build_topics(db_path, papers_dir=None, *, papers_map=None, min_topic_size=5, nr_topics='auto', save_path=None, cfg=None, **fit_kwargs)` ¶

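Build a BERTopic topic model.

Reuses the embedding vectors already in the paper_vectors table without re-encoding. Uses UMAP dimensionality reduction + HDBSCAN clustering + c-TF-IDF to extract topic keywords.

Parameters:

Name	Type	Description	Default
`db_path`	`Path`	Path to the SQLite database.	required
`papers_dir`	`Path \| None`	Directory of paper JSON files (main-library mode). Provide either this or `papers_map`.	`None`
`papers_map`	`dict[str, dict] \| None`	Mapping of paper_id → metadata dict (explore mode).	`None`
`min_topic_size`	`int`	Minimum topic size (HDBSCAN `min_cluster_size`).	`5`
`nr_topics`	`int \| str \| None`	Target number of topics. `"auto"` merges automatically; `None` does not merge; an integer sets the number.	`'auto'`
`save_path`	`Path \| None`	Path to save the model; when `None`, the model is not saved.	`None`
`cfg`	`Config \| None`	Optional config.	`None`
`**fit_kwargs`	`Any`	Extra arguments passed to `_fit_bertopic()` (e.g. `n_neighbors`, `n_components`, `ngram_range`).	`{}`

Returns:

Type	Description
`BERTopic`	The trained BERTopic model instance.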

`load_model(path)` ¶

Load a saved BERTopic model.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to the model directory.	required

Returns:

Type	Description
`BERTopic`	The loaded BERTopic model instance (with attached metadata).

Raises:

Type	Description
`FileNotFoundError`	The model file does not exist.

`get_topic_overview(model)` ¶

Get overview information for all topics.

Parameters:

Name	Type	Description	Default
`model`	`BERTopic`	A trained BERTopic model.	required

Returns:

Type	Description
`list[dict]`	List of topic dicts; each contains `topic_id`, `count`, `name`, `keywords` (top 10 keywords), and `representative_papers`.

`get_topic_papers(model, topic_id)` ¶

Get all papers in the given topic.

Parameters:

Name	Type	Description	Default
`model`	`BERTopic`	A trained BERTopic model.	required
`topic_id`	`int`	Topic ID.	required

Returns:

Type	Description
`list[dict]`	List of metadata for all papers in the topic.

`get_outliers(model)` ¶

Get papers not assigned to any topic (outliers, topic_id == -1).

Parameters:

Name	Type	Description	Default
`model`	`BERTopic`	A trained BERTopic model.	required

Returns:

Type	Description
`list[dict]`	List of metadata for the outlier papers.

`reduce_topics_to(model, nr_topics, save_path=None, cfg=None)` ¶

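Quickly merge topics in an existing model down to a given number (without re-clustering).

Parameters:

Name	Type	Description	Default
`model`	`BERTopic`	A trained BERTopic model.	required
`nr_topics`	`int`	Target number of topics.	required
`save_path`	`Path \| None`	Path to save the model; when `None`, the model is not saved.	`None`
`cfg`	`Config \| None`	Optional config (used to initialize the embedding model).	`None`

Returns:

Type	Description
`BERTopic`	The merged BERTopic model instance.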

`merge_topics_by_ids(model, topics_to_merge, save_path=None, cfg=None)` ¶

Manually merge the given topics (intended to be called by Claude Code).

Parameters:

Name	Type	Description	Default
`model`	`BERTopic`	A trained BERTopic model.	required
`topics_to_merge`	`list[list[int]]`	Lists of topic IDs to merge; e.g. `[[1, 6, 14], [3, 5]]` merges 1/6/14 into one topic and 3/5 into another.	required
`save_path`	`Path \| None`	Path to save the model; when `None`, the model is not saved.	`None`
`cfg`	`Config \| None`	Optional config (used to initialize the embedding model).	`None`

Returns:

Type	Description
`BERTopic`	The merged BERTopic model instance.

`scholaraio.translate` ¶

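translate.py: automatic translation of paper Markdown¶

Translates paper Markdown that is not in the target language into the target language (Chinese by default), preserving LaTeX formulas, code blocks, image references, and Markdown formatting.

The translation is saved as paper_{lang}.md (e.g. paper_zh.md); the original paper.md is left unchanged.

Translating a single paper issues concurrent requests for multiple chunks according to config.translate.concurrency, persists intermediate results to a temporary working directory, and advances the final output in the original order. When portable export is enabled, a portable bundle is additionally generated at workspace/translation-ws/<paper-dir>/.

Usage::

from scholaraio.translate import translate_paper, batch_translate
translate_paper(paper_dir, config)
batch_translate(papers_dir, config)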

`translate_paper(paper_dir, config, *, target_lang=None, force=False, portable=False, chunk_workers=None, progress_callback=None)` ¶

Translate a paper's markdown to the target language.

The translation is saved as paper_{lang}.md in the same directory. Original paper.md is preserved.

Parameters:

Name	Type	Description	Default
`paper_dir`	`Path`	Paper directory containing `paper.md`.	required
`config`	`Config`	Global config.	required
`target_lang`	`str \| None`	Target language code, defaults to `config.translate.target_lang`.	`None`
`force`	`bool`	Re-translate even if translation file already exists.	`False`

Returns:

Type	Description
`TranslateResult`	A `TranslateResult` with `path` (or `None`) and `skip_reason`.

`batch_translate(papers_dir, config, *, target_lang=None, force=False, portable=False, paper_ids=None)` ¶

Batch translate all papers in the library.

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Papers directory.	required
`config`	`Config`	Global config.	required
`target_lang`	`str \| None`	Target language code.	`None`
`force`	`bool`	Re-translate existing translations.	`False`
`paper_ids`	`set[str] \| None`	Optional set of paper UUIDs to limit translation scope.	`None`

Returns:

Type	Description
`dict[str, int]`	Stats dict with `translated`, `skipped`, `failed` counts.

`detect_language(text)` ¶

Detect the primary language of text using character-class heuristics.

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text (first ~2000 chars are examined).	required

Returns:

Type	Description
`str`	ISO 639-1 code: `"zh"`, `"ja"`, `"ko"`, `"en"`, `"de"`, `"fr"`, or `"es"`. Falls back to `"en"` when uncertain.
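A character-class heuristic for the CJK part of this detection might count code points per Unicode block. The sketch below is an assumption about how such a heuristic can work, not the actual rules of `detect_language`; it does not attempt to separate `"de"`, `"fr"`, and `"es"`, and `detect_cjk_language` is a hypothetical name.

```python
def detect_cjk_language(text, sample_size=2000):
    """Guess "ja", "ko", "zh", or "en" from the first sample_size characters."""
    kana = hangul = han = 0
    for ch in text[:sample_size]:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:    # Hiragana and Katakana
            kana += 1
        elif 0xAC00 <= code <= 0xD7A3:  # Hangul syllables
            hangul += 1
        elif 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs
            han += 1
    if kana:
        return "ja"  # Japanese mixes kana with Han characters
    if hangul:
        return "ko"
    if han:
        return "zh"
    return "en"      # fallback when no CJK characters are seen
```

Checking kana before Han matters because Japanese text contains both, while Chinese text contains no kana.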

`scholaraio.explore` ¶

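explore.py: academic exploration¶

Bulk-fetches papers from OpenAlex (supporting filters along multiple dimensions such as ISSN / concept / author / institution / keyword), then provides local embedding + FAISS semantic search + FTS5 keyword search + RRF hybrid search. Topic modeling, visualization, and queries reuse topics.py (via the papers_map parameter). Data is stored in data/explore/<name>/, fully isolated from the main library.

Usage::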

from scholaraio.explore import fetch_explore, build_explore_vectors, build_explore_topics
fetch_explore("jfm", issn="0022-1120")
fetch_explore("turbulence", concept="C62520636", year_range="2020-2025")
build_explore_vectors("jfm")
build_explore_topics("jfm")

`fetch_explore(name, *, issn=None, concept=None, topic=None, author=None, institution=None, keyword=None, source_type=None, year_range=None, min_citations=None, oa_type=None, incremental=False, limit=None, cfg=None)` ¶

Bulk-fetch papers from OpenAlex (with multi-dimensional filters).

Uses cursor-based pagination to iterate over all matching papers, extracts fields such as title, abstract, and authors, and writes them to a JSONL file.

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name (e.g. `"jfm"`), used as the directory name.	required
`issn`	`str \| None`	Journal ISSN filter (e.g. `"0022-1120"`).	`None`
`concept`	`str \| None`	OpenAlex concept ID (e.g. `"C62520636"` = Turbulence).	`None`
`topic`	`str \| None`	OpenAlex topic ID.	`None`
`author`	`str \| None`	OpenAlex author ID.	`None`
`institution`	`str \| None`	OpenAlex institution ID.	`None`
`keyword`	`str \| None`	Keyword search over title/abstract.	`None`
`source_type`	`str \| None`	Source type filter (journal / conference / repository).	`None`
`year_range`	`str \| None`	Year filter (e.g. `"2020-2025"`).	`None`
`min_citations`	`int \| None`	Minimum citation count filter.	`None`
`oa_type`	`str \| None`	OpenAlex work type filter (article / review, etc.).	`None`
`incremental`	`bool`	When `True`, append to the existing JSONL, deduplicating by DOI.	`False`
`limit`	`int \| None`	Upper limit on the number of papers to fetch (`None` means no limit).	`None`
`cfg`	`Config \| None`	Optional global config.	`None`

Returns:

Type	Description
`int`	Number of papers newly fetched in this run.

`build_explore_vectors(name, *, rebuild=False, cfg=None)` ¶

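Generate semantic vectors for an explore library.

Reuses the main library's Qwen3-Embedding model; vectors are stored in the explore library's own explore.db.

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name.	required
`rebuild`	`bool`	When `True`, clear and rebuild.	`False`
`cfg`	`Config \| None`	Optional global config (used for model loading).	`None`

Returns:

Type	Description
`int`	Number of papers newly embedded in this run.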

`build_explore_topics(name, *, rebuild=False, min_topic_size=30, nr_topics=None, cfg=None)` ¶

Run BERTopic topic modeling on an explore library.

Reuses the main library's build_topics() flow, with parameters adjusted for large-scale data (default min_topic_size=30). The model is saved in the unified format (bertopic_model.pkl + scholaraio_meta.pkl) and can be loaded directly with topics.load_model().

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name.	required
`rebuild`	`bool`	When `True`, rebuild the model.	`False`
`min_topic_size`	`int`	HDBSCAN minimum cluster size.	`30`
`nr_topics`	`int \| str \| None`	Target number of topics. `"auto"` merges automatically.	`None`
`cfg`	`Config \| None`	Optional global config.	`None`

Returns:

Type	Description
`dict`	Stats dict: `{"n_topics": N, "n_outliers": N, "n_papers": N}`.

`explore_search(name, query, *, top_k=20, cfg=None)` ¶

FTS5 keyword search within an explore library.

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name.	required
`query`	`str`	Query text.	required
`top_k`	`int`	Number of results to return.	`20`
`cfg`	`Config \| None`	Optional global config.	`None`

Returns:

Type	Description
`list[dict]`	List of papers, ranked by BM25.

`explore_vsearch(name, query, *, top_k=10, cfg=None)` ¶

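Semantic search within an explore library (FAISS-accelerated).

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name.	required
`query`	`str`	Query text.	required
`top_k`	`int`	Number of results to return.	`10`
`cfg`	`Config \| None`	Optional global config.	`None`

Returns:

Type	Description
`list[dict]`	List of papers, in descending order of cosine similarity.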

`explore_unified_search(name, query, *, top_k=20, cfg=None)` ¶

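Hybrid search for an explore library: FTS5 keyword + FAISS semantic, merged and ranked with RRF.

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name.	required
`query`	`str`	Query text.	required
`top_k`	`int`	Number of results to return.	`20`
`cfg`	`Config \| None`	Optional global config.	`None`

Returns:

Type	Description
`list[dict]`	List of papers, in descending order of combined RRF score.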
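Reciprocal Rank Fusion (RRF) combines ranked lists by summing `1 / (k + rank)` for each list in which a document appears. The sketch below is a generic illustration; the constant `k = 60` is the value commonly used in the literature and is an assumption here, since this reference does not state which constant the explore search uses. `rrf_fuse` is a hypothetical helper.

```python
def rrf_fuse(rankings, k=60, top_k=20):
    """Fuse several ranked id lists with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    Returns (paper_id, score) pairs, highest fused score first.
    """
    scores = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


keyword_ranking = ["a", "b", "c"]   # e.g. BM25 order
semantic_ranking = ["c", "a", "d"]  # e.g. cosine-similarity order
fused = rrf_fuse([keyword_ranking, semantic_ranking])
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be put on a common scale.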

`list_explore_libs(cfg=None)` ¶

List the names of all explore libraries.

`explore_db_path(name, cfg=None)` ¶

Return the SQLite DB path for an explore library.

Parameters:

Name	Type	Description	Default
`name`	`str`	Explore library name.	required
`cfg`	`Config \| None`	Optional Config instance; resolved from environment if omitted.	`None`

Returns:

Type	Description
`Path`	Path to `explore.db` inside the library directory.

`validate_explore_name(name)` ¶

Return True if name is a safe, non-traversing library identifier.

Rejects empty strings, absolute paths, and names that contain path separators or .. components so that callers cannot escape the data/explore/ directory.

Parameters:

Name	Type	Description	Default
`name`	`str`	Candidate explore library name supplied by the user.	required

Returns:

Type	Description
`bool`	`True` when the name is safe to use in path construction.
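The checks listed above can be sketched with `pathlib`'s pure path classes, which work the same on every platform. `is_safe_name` is a hypothetical stand-in for `validate_explore_name`, and treating both `/` and `\` as separators is an assumption made for this example.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_safe_name(name):
    """Return True if name cannot escape its parent directory."""
    if not name:
        return False  # empty string
    if PurePosixPath(name).is_absolute() or PureWindowsPath(name).is_absolute():
        return False  # absolute path
    if "/" in name or "\\" in name:
        return False  # contains a path separator
    if name == "..":
        return False  # parent-directory component
    return True
```

A caller would typically run this check before joining the name onto the data/explore/ directory.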

`scholaraio.insights` ¶

Research behavior analytics helpers used by the insights CLI.

`extract_hot_keywords(search_events, *, top_k=10)` ¶

Return the most frequent non-stopword tokens from search events.

`aggregate_most_read_titles(read_events, papers_dir, *, top_k=10)` ¶

Aggregate read counts by resolved paper title.

`build_weekly_read_trend(read_events)` ¶

Group read events by ISO year-week key.

`recent_unique_read_names(read_events, *, limit=5)` ¶

Return recent unique paper names preserving newest-first order.

`recommend_unread_neighbors(store, cfg, *, recent_days=7, recent_limit=5, top_k=5)` ¶

Recommend semantically similar unread papers from recent reading behavior.

`list_workspace_counts(ws_root)` ¶

Return workspace names with paper counts.

`scholaraio.ingest.extractor` ¶

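extractor.py: paper metadata extractors¶

Provides four Stage-1 implementations (extracting title/authors/year/doi/journal from MinerU markdown):

RegexExtractor: calls the regex extraction logic in metadata.py (default)
LLMExtractor: calls an LLM API (OpenAI-compatible protocol); suited to edge cases where regex fails
FallbackExtractor: regex first, falling back to the LLM on failure (auto mode)
RobustExtractor: runs both regex and LLM, with the LLM correcting OCR errors + multi-DOI detection (robust mode)

Usage¶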

from pathlib import Path

from scholaraio.config import load_config
from scholaraio.ingest.extractor import get_extractor

config = load_config()
extractor = get_extractor(config)
meta = extractor.extract(Path("paper.md"))

`get_extractor(config)` ¶

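Return the metadata extractor instance that matches the config.

Parameters:

Name	Type	Description	Default
`config`	`Config`	Global config; the mode is read from `config.ingest.extractor`.	required

Returns:

Type	Description
`MetadataExtractor`	An extractor instance implementing the `MetadataExtractor` protocol.

Raises:

Type	Description
`RuntimeError`	Raised when `robust` or `llm` mode is missing an API key.

Supported modes

regex: pure regex; fastest; no LLM calls.
auto: regex first, with the LLM as a fallback when key fields are missing.
robust: regex + LLM always both run, with the LLM correcting OCR errors.
llm: pure LLM; regex is not run.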

`scholaraio.ingest.metadata` ¶

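metadata: paper metadata extraction, API lookup, JSON output, and file renaming¶

Extracts metadata (title, authors, year, DOI, journal) from academic-paper markdown files converted by MinerU, queries citation counts, abstracts, and paper types via three APIs (Crossref / Semantic Scholar / OpenAlex), writes a JSON metadata file with the same name, and renames the md + json to the format {first author}-{year}-{full title}.

Submodules

_models: PaperMetadata dataclass + constants + HTTP session
_extract: markdown regex parsing (title/authors/doi/year/journal)
_api: API queries (Crossref/S2/OA) + enrich_metadata
_abstract: abstract extraction (regex/LLM/DOI fetch/backfill)
_writer: JSON serialization + file renaming
_cli: standalone CLI commands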

`PaperMetadata` `dataclass` ¶

Complete metadata for one academic paper.

Attributes:

Name	Type	Description
`id`	`str`	UUID, generated at ingestion and never changed.
`title`	`str`	Paper title.
`authors`	`list[str]`	List of authors.
`first_author`	`str`	Full name of the first author.
`first_author_lastname`	`str`	Last name of the first author (used to generate the file name).
`year`	`int \| None`	Publication year.
`doi`	`str`	DOI identifier (without the `https://doi.org/` prefix).
`journal`	`str`	Journal or conference name.
`abstract`	`str`	Abstract text.
`paper_type`	`str`	Paper type (article, review, conference-paper, etc.).
`citation_count_s2`	`int \| None`	Semantic Scholar citation count.
`citation_count_openalex`	`int \| None`	OpenAlex citation count.
`citation_count_crossref`	`int \| None`	Crossref citation count.
`s2_paper_id`	`str`	Semantic Scholar paper ID.
`openalex_id`	`str`	OpenAlex paper ID.
`crossref_doi`	`str`	DOI returned by Crossref.
`api_sources`	`list[str]`	List of APIs that returned data successfully.
`references`	`list[str]`	List of reference DOIs (obtained from Semantic Scholar).
`source_file`	`str`	Original file name.
`arxiv_id`	`str`	arXiv identifier (e.g. `2401.12345` or `hep-th/9901001`, without the version suffix).
`extraction_method`	`str`	Extraction method (`doi_lookup` \| `arxiv_lookup` \| `title_search` \| `title_search_relaxed` \| `title_search_s2` \| `local_only`).

`enrich_metadata(meta)` ¶

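Complete and overwrite metadata via API queries.

Query strategy (multi-tier fallback):
1. Tier 1: direct DOI lookup (three APIs in parallel, no rate limiting)
2. Tier 2: Crossref + OA title search (strict match ≥0.85)
3. Tier 3: Crossref + OA relaxed search (match ≥0.65)
4. Tier 4: S2 title search (last resort; may be rate-limited)
5. Tier 5: local data (no API results available)

Merge priority: Crossref > Semantic Scholar > OpenAlex > regex extraction.

Parameters:

Name	Type	Description	Default
`meta`	`PaperMetadata`	Already-extracted metadata (at least `title` or `doi` is required).	required

Returns:

Type	Description
`PaperMetadata`	The same `PaperMetadata` instance, with fields overwritten/completed by API data.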

`extract_abstract_from_md(md_path, cfg=None)` ¶

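Extract the Abstract paragraph from a MinerU-parsed markdown file.

The extraction flow is controlled by cfg.ingest.abstract_llm_mode:

"off": pure regex extraction.
"fallback": call the LLM to extract directly when regex fails.
"verify" (default): after regex succeeds, the LLM verifies/corrects the result; when regex fails, the LLM extracts directly.

Without cfg or without an LLM API key, it automatically degrades to pure regex.

Parameters:

Name	Type	Description	Default
`md_path`	`Path`	Path to the `.md` file produced by MinerU.	required
`cfg`	`Config \| None`	Optional `scholaraio.config.Config`.	`None`

Returns:

Type	Description
`str \| None`	The extracted abstract text, or `None` if it cannot be extracted.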

`fetch_abstract_by_doi(doi)` ¶

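Fetch the abstract from the publisher's landing page by DOI.

First accesses https://doi.org/<doi> with requests, following redirects, and extracts the abstract from HTML meta tags. On a Cloudflare 403, it falls back to curl_cffi (which mimics a browser TLS fingerprint) and retries.

Parameters:

Name	Type	Description	Default
`doi`	`str`	Paper DOI, e.g. `"10.1017/jfm.2024.1191"`.	required

Returns:

Type	Description
`str \| None`	The extracted abstract text, or `None` on failure.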

`backfill_abstracts(papers_dir, *, dry_run=False, doi_fetch=False, cfg=None)` ¶

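Fill in or update paper abstracts in bulk.

Scans all JSON files under papers_dir:

Default mode: only processes papers missing an abstract, extracting from the .md with LLM fallback.
doi_fetch=True: for every paper with a DOI, tries to fetch the official abstract from the publisher's web page. On success it overwrites the existing abstract (the official source takes priority); on failure the existing value is kept, and only papers that still have no abstract fall back to .md extraction.

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Directory of ingested papers.	required
`dry_run`	`bool`	When `True`, preview only; no files are written.	`False`
`doi_fetch`	`bool`	When `True`, enable DOI web-page fetching (the official source takes priority).	`False`
`cfg`	`Config \| None`	Optional `scholaraio.config.Config`; when provided, LLM fallback is enabled.	`None`

Returns:

Type	Description
`dict`	Stats dict: `{"filled": N, "skipped": N, "failed": N, "updated": N}`.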

`generate_new_stem(meta)` ¶

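Generate a standardized file name stem: {LastName}-{year}-{FullTitle}.

Strips diacritics, LaTeX formulas, and illegal characters so the result is safe for the file system.

Parameters:

Name	Type	Description	Default
`meta`	`PaperMetadata`	Metadata instance.	required

Returns:

Type	Description
`str`	File-system-safe file name stem (without extension).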
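The cleaning steps named above (diacritics, LaTeX formulas, illegal characters) can be sketched with the standard library. This is an illustration under assumptions: `make_stem` is a hypothetical helper, only inline `$...$` math is removed, the illegal-character set is the usual Windows-reserved one, and how the real function treats spaces or very long titles is not specified here.

```python
import re
import unicodedata


def make_stem(last_name, year, title):
    """Build a {LastName}-{year}-{FullTitle} stem from cleaned parts."""

    def clean(text):
        text = re.sub(r"\$[^$]*\$", "", text)       # drop inline LaTeX math
        text = unicodedata.normalize("NFKD", text)  # split base char + accent
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        text = re.sub(r'[\\/:*?"<>|]', "", text)    # characters unsafe in file names
        return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

    return f"{clean(last_name)}-{year}-{clean(title)}"


stem = make_stem("Jiménez", 2018, "Wall turbulence: a $Re_\\tau$ study")
```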

`metadata_to_dict(meta)` ¶

Convert a `PaperMetadata` into a serializable dict.

Output fields include title, authors, year, doi, journal, abstract, citation_count, ids, api_sources, and others.

Parameters:

Name	Type	Description	Default
`meta`	`PaperMetadata`	Metadata instance.	required

Returns:

Type	Description
`dict`	JSON-serializable dict.

`refetch_metadata(json_path)` ¶

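Re-query the APIs for an ingested paper to fill in fields such as citation counts.

Reconstructs a `PaperMetadata` from the JSON, calls `enrich_metadata` to re-query the three APIs, then merges the new data back into the JSON (preserving existing enrichment fields such as toc and l3_conclusion).

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the JSON file of an ingested paper.	required

Returns:

Type	Description
`bool`	`True` if any field was updated; `False` if nothing changed or the query failed.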

`rename_paper(json_path, *, dry_run=False)` ¶

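Rename the paper directory based on its JSON metadata.

Reads first_author_lastname, year, and title from meta.json and generates the standard directory name with `generate_new_stem`. If the new and old directory names are identical, the rename is skipped.

Parameters:

Name	Type	Description	Default
`json_path`	`Path`	Path to the paper's `meta.json` file.	required
`dry_run`	`bool`	When `True`, only return the new path without actually renaming.	`False`

Returns:

Type	Description
`Path \| None`	The new `meta.json` path after renaming, or `None` if nothing changed.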

`write_metadata_json(meta, output_path)` ¶

Write the metadata to a JSON file (atomic write).

Parameters:

Name	Type	Description	Default
`meta`	`PaperMetadata`	Metadata instance.	required
`output_path`	`Path`	Path of the output JSON file.	required
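An atomic write is commonly done by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that pattern with a plain dict; `write_json_atomic` is a hypothetical helper, and whether `write_metadata_json` uses exactly this mechanism is not stated in this reference.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(data, output_path):
    """Write JSON so readers never observe a half-written file."""
    output_path = Path(output_path)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=False, indent=2)
        os.replace(tmp_name, output_path)  # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "meta.json"
    write_json_atomic({"title": "Wall turbulence", "year": 2018}, target)
    loaded = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in Path(tmp).iterdir() if p.suffix == ".tmp"]
```

If the process dies mid-write, the target still holds its previous content because the rename is the only step that touches it.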

`scholaraio.ingest.pipeline` ¶

pipeline.py: composable step pipeline¶

Steps (by scope):
inbox: for each PDF, in order: mineru → extract → dedup → ingest
papers: for each ingested paper: toc → l3
global: run once overall: embed → index

Presets:
full = mineru, extract, dedup, ingest, toc, l3, embed, index
ingest = mineru, extract, dedup, ingest, embed, index
enrich = toc, l3, embed, index
reindex = embed, index

Usage (CLI):
scholaraio pipeline full
scholaraio pipeline enrich --force
scholaraio pipeline --steps toc,l3
scholaraio pipeline full --dry-run
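The presets and scopes can be modeled as plain lookup tables. The sketch below copies the preset contents listed above and places `embed` in the global scope, following the `run_pipeline` description; `PRESETS`, `SCOPES`, `resolve_steps`, and `group_by_scope` are hypothetical names for illustration, not the module's actual data structures.

```python
PRESETS = {
    "full": ["mineru", "extract", "dedup", "ingest", "toc", "l3", "embed", "index"],
    "ingest": ["mineru", "extract", "dedup", "ingest", "embed", "index"],
    "enrich": ["toc", "l3", "embed", "index"],
    "reindex": ["embed", "index"],
}

SCOPES = {
    "mineru": "inbox", "extract": "inbox", "dedup": "inbox", "ingest": "inbox",
    "toc": "papers", "l3": "papers",
    "embed": "global", "index": "global",
}


def resolve_steps(preset=None, steps=None):
    """Turn a preset name or a comma-separated --steps value into a step list."""
    if steps:
        return [name.strip() for name in steps.split(",") if name.strip()]
    return list(PRESETS[preset])


def group_by_scope(step_names):
    """Split steps into the three phases, keeping their relative order."""
    phases = {"inbox": [], "papers": [], "global": []}
    for name in step_names:
        phases[SCOPES[name]].append(name)
    return phases


enrich_phases = group_by_scope(resolve_steps("enrich"))
custom_steps = resolve_steps(steps="toc,l3")
```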

`StepResult` ¶

Bases: Enum

Return value of a pipeline step.

`InboxCtx` `dataclass` ¶

Per-file context passed between inbox steps.

Attributes:

Name	Type	Description
`pdf_path`	`Path \| None`	Path to the original PDF; `None` for md-only ingestion.
`inbox_dir`	`Path`	Path to the inbox directory.
`papers_dir`	`Path`	Path to the ingested-papers directory.
`existing_dois`	`dict[str, Path]`	Mapping of DOI → JSON path for ingested papers (used for deduplication).
`cfg`	`Config`	Global config.
`opts`	`dict[str, Any]`	Run options (dry_run, no_api, force, etc.).
`pending_dir`	`Path \| None`	Review directory for papers without a DOI.
`md_path`	`Path \| None`	Path to the Markdown file (MinerU output or placed directly).
`meta`	`Any`	The extracted `scholaraio.ingest.metadata.PaperMetadata`.
`status`	`str`	Current status: `"pending"` \| `"ingested"` \| `"duplicate"` \| `"needs_review"` \| `"failed"` \| `"skipped"`.

`run_pipeline(step_names, cfg, opts)` ¶

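Run the given sequence of steps.

Execution proceeds in three phases, grouped by scope:

inbox: file by file: mineru → extract → dedup → ingest
papers: for each ingested paper: toc → l3 → translate (injected automatically when auto_translate is enabled)
global: run once for the whole library: embed → index

When config.translate.auto_translate is True and the pipeline includes inbox steps, a translate step is automatically injected into the papers scope phase (before embed/index).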

Parameters:

Name	Type	Description	Default
`step_names`	`list[str]`	List of step names, e.g. `["extract", "dedup", "ingest"]`. See `STEPS` for the available steps.	required
`cfg`	`Config`	Global configuration.	required
`opts`	`dict[str, Any]`	Dictionary of run options. Supported keys: `dry_run` (bool): preview mode, no files are written. `no_api` (bool): skip external API queries. `force` (bool): force reprocessing (toc/l3). `inspect` (bool): show processing details. `max_retries` (int): maximum number of l3 retries. `rebuild` (bool): rebuild the index (index/embed). `inbox_dir` (Path): custom inbox directory. `papers_dir` (Path): custom papers directory.	required
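The three-phase grouping and the translate injection described for `run_pipeline` can be sketched as below. `STEP_SCOPE` and `plan_phases` are hypothetical names; the sketch also assumes that steps keep the order in which they were requested within each scope, which the reference does not state.

```python
from typing import Dict, List

# Scope of each step, as described in the run_pipeline documentation.
STEP_SCOPE = {
    "mineru": "inbox", "extract": "inbox", "dedup": "inbox", "ingest": "inbox",
    "toc": "papers", "l3": "papers", "translate": "papers",
    "embed": "global", "index": "global",
}


def plan_phases(step_names: List[str], auto_translate: bool = False) -> Dict[str, List[str]]:
    """Group requested steps into the inbox, papers, and global phases."""
    phases: Dict[str, List[str]] = {"inbox": [], "papers": [], "global": []}
    for name in step_names:
        if name not in STEP_SCOPE:
            raise ValueError(f"unknown step: {name}")
        phases[STEP_SCOPE[name]].append(name)
    # translate is injected only when the run also contains inbox steps.
    if auto_translate and phases["inbox"] and "translate" not in phases["papers"]:
        phases["papers"].append("translate")
    return phases


plan = plan_phases(["extract", "dedup", "ingest", "toc", "l3", "embed", "index"], auto_translate=True)
```

Because the papers phase runs before the global phase, an injected translate step naturally lands before embed and index.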

`import_external(records, cfg, *, pdf_paths=None, no_api=False, dry_run=False)` ¶

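Bulk-import papers from external sources (EndNote, etc.).

Runs dedup + ingest for each record, then performs a single embed + index pass at the end. If pdf_paths is provided (aligned by index with records), each PDF is copied into the paper directory during ingestion.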

Parameters:

Name	Type	Description	Default
`records`	`list`	List of PaperMetadata instances.	required
`cfg`	`Config`	Global configuration.	required
`pdf_paths`	`list[Path \| None] \| None`	List of PDF paths aligned with records (optional).	`None`
`no_api`	`bool`	Skip API queries.	`False`
`dry_run`	`bool`	Preview mode.	`False`

Returns:

Type	Description
`dict[str, int]`	Statistics dictionary `{"ingested": N, "duplicate": N, "needs_review": N, "failed": N, "skipped": N}`.
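The shape of the returned statistics can be illustrated with a small tally over DOI-based deduplication. This is a sketch under stated assumptions: `tally_import` is a hypothetical helper, records are plain dicts rather than `PaperMetadata`, DOIs are compared case-insensitively, and records without a DOI are counted as `needs_review`.

```python
from typing import Dict, Iterable, List, Optional, Set

STATUSES = ("ingested", "duplicate", "needs_review", "failed", "skipped")


def tally_import(records: List[Dict[str, Optional[str]]], existing_dois: Iterable[str]) -> Dict[str, int]:
    """Classify records by DOI and count each outcome."""
    stats = {status: 0 for status in STATUSES}
    seen: Set[str] = {doi.lower() for doi in existing_dois}
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        if not doi:
            stats["needs_review"] += 1  # assumed: no DOI means manual review
        elif doi in seen:
            stats["duplicate"] += 1
        else:
            seen.add(doi)  # later records with the same DOI count as duplicates
            stats["ingested"] += 1
    return stats


stats = tally_import(
    [{"doi": "10.1/a"}, {"doi": "10.1/A"}, {"doi": None}, {"doi": "10.1/b"}],
    existing_dois={"10.1/b"},
)
```

Every status key is present even when its count is zero, matching the fixed set of keys in the documented return value.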

`batch_convert_pdfs(cfg, *, enrich=False)` ¶

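Batch-convert the PDFs of ingested papers to paper.md, with optional enrichment.

Scans data/papers/ for papers that have a PDF but no paper.md. In cloud mode, convert_pdfs_cloud_batch() performs a true batch conversion; in local mode, papers are converted one at a time. After conversion, toc + l3 + abstract backfill can optionally run, followed by a single embed + index pass.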

Parameters:

Name	Type	Description	Default
`cfg`	`Config`	Global configuration.	required
`enrich`	`bool`	Whether to run toc + l3 + abstract backfill after conversion.	`False`

Returns:

Type	Description
`dict[str, int]`	Statistics dictionary `{"converted": N, "failed": N, "skipped": N}`.
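The scan for papers that have a PDF but no paper.md can be sketched as below. `find_unconverted` is a hypothetical helper; it assumes one subdirectory per paper and accepts any `*.pdf` file name, since the reference does not specify how the PDF is named.

```python
from pathlib import Path
from typing import List


def find_unconverted(papers_dir: Path) -> List[Path]:
    """Return paper directories that contain a PDF but no paper.md."""
    pending = []
    for paper_dir in sorted(p for p in Path(papers_dir).iterdir() if p.is_dir()):
        has_pdf = any(
            f.is_file() and f.suffix.lower() == ".pdf" for f in paper_dir.iterdir()
        )
        if has_pdf and not (paper_dir / "paper.md").exists():
            pending.append(paper_dir)
    return pending
```

The resulting list is what a batch converter would hand to either the cloud batch call or the per-paper local loop.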

`step_embed(papers_dir, cfg, opts)` ¶

Generate semantic vectors and write them to index.db (global scope).

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Papers directory.	required
`cfg`	`Config`	Global configuration.	required
`opts`	`dict`	Run options.	required

Returns:

Type	Description
`StepResult`	`StepResult.OK`; when the embed dependencies are missing, the step is skipped and `StepResult.SKIP` is returned.
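Skipping a step when optional dependencies are missing is a common pattern; a minimal sketch follows. The `StepResult` class here is a local stand-in (only the member names `OK` and `SKIP` come from this reference; the values are invented), and `run_optional_step` is a hypothetical helper.

```python
import importlib.util
from enum import Enum
from typing import Callable, Iterable


class StepResult(Enum):
    """Local stand-in; the member values are assumptions."""

    OK = "ok"
    SKIP = "skip"


def run_optional_step(required_modules: Iterable[str], action: Callable[[], None]) -> StepResult:
    """Run action only if every required top-level module can be imported."""
    # find_spec returns None for a missing top-level module without importing it.
    missing = [m for m in required_modules if importlib.util.find_spec(m) is None]
    if missing:
        return StepResult.SKIP
    action()
    return StepResult.OK
```

Checking with `find_spec` avoids paying the import cost of heavy embedding libraries just to discover that they are absent. Pass top-level module names only, since `find_spec` on a dotted name imports the parent package first.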

`step_index(papers_dir, cfg, opts)` ¶

Update the SQLite FTS5 index (global scope).

Parameters:

Name	Type	Description	Default
`papers_dir`	`Path`	Papers directory.	required
`cfg`	`Config`	Global configuration.	required
`opts`	`dict`	Run options.	required

Returns:

Type	Description
`StepResult`	`StepResult.OK`.

`scholaraio.index` ¶

`build_index(papers_dir, db_path, rebuild=False)` ¶

`build_proceedings_index(proceedings_root, db_path, rebuild=False)` ¶

`search(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

`search_author(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

`top_cited(db_path, top_k=10, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

`unified_search(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

`search_proceedings(query, db_path, top_k=20, *, year=None, journal=None, paper_type=None)` ¶

`lookup_paper(db_path, user_input)` ¶

`get_references(paper_id, db_path, *, paper_ids=None)` ¶

`get_citing_papers(paper_id, db_path, *, paper_ids=None)` ¶

`get_shared_references(paper_id_list, db_path, min_shared=2, *, paper_ids=None)` ¶

`scholaraio.loader` ¶

`load_l1(json_path)` ¶

`load_l2(json_path)` ¶

`load_l3(json_path)` ¶

`load_l4(md_path, *, lang=None)` ¶

`load_notes(paper_dir)` ¶

`append_notes(paper_dir, section)` ¶

`enrich_toc(json_path, md_path, config, *, force=False, inspect=False)` ¶

`enrich_l3(json_path, md_path, config, *, force=False, max_retries=2, inspect=False)` ¶

`scholaraio.export` ¶

`meta_to_bibtex(meta)` ¶

`export_bibtex(papers_dir, *, paper_ids=None, year=None, journal=None, paper_type=None)` ¶

`scholaraio.audit` ¶

`Issue` `dataclass` ¶

`audit_papers(papers_dir)` ¶

`scholaraio.workspace` ¶

`create(ws_dir)` ¶

`add(ws_dir, paper_refs, db_path, *, resolved=None)` ¶

`remove(ws_dir, paper_refs, db_path)` ¶

`list_workspaces(ws_root)` ¶

`read_paper_ids(ws_dir)` ¶

`scholaraio.papers` ¶

`paper_dir(papers_dir, dir_name)` ¶

`meta_path(papers_dir, dir_name)` ¶

`md_path(papers_dir, dir_name)` ¶

`iter_paper_dirs(papers_dir)` ¶

`scholaraio.proceedings` ¶

`proceedings_db_path(root)` ¶

`iter_proceedings_dirs(proceedings_root)` ¶

`iter_proceedings_papers(proceedings_root)` ¶

`scholaraio.vectors` ¶

`build_vectors(papers_dir, db_path, rebuild=False, cfg=None)` ¶

`vsearch(query, db_path, top_k=None, cfg=None, *, year=None, journal=None, paper_type=None, paper_ids=None)` ¶

`scholaraio.topics` ¶

`build_topics(db_path, papers_dir=None, *, papers_map=None, min_topic_size=5, nr_topics='auto', save_path=None, cfg=None, **fit_kwargs)` ¶

`load_model(path)` ¶

`get_topic_overview(model)` ¶

`get_topic_papers(model, topic_id)` ¶

`get_outliers(model)` ¶

`reduce_topics_to(model, nr_topics, save_path=None, cfg=None)` ¶

`merge_topics_by_ids(model, topics_to_merge, save_path=None, cfg=None)` ¶

`scholaraio.translate` ¶

`translate_paper(paper_dir, config, *, target_lang=None, force=False, portable=False, chunk_workers=None, progress_callback=None)` ¶

`batch_translate(papers_dir, config, *, target_lang=None, force=False, portable=False, paper_ids=None)` ¶

`detect_language(text)` ¶

`scholaraio.explore` ¶

`fetch_explore(name, *, issn=None, concept=None, topic=None, author=None, institution=None, keyword=None, source_type=None, year_range=None, min_citations=None, oa_type=None, incremental=False, limit=None, cfg=None)` ¶

`build_explore_vectors(name, *, rebuild=False, cfg=None)` ¶

`build_explore_topics(name, *, rebuild=False, min_topic_size=30, nr_topics=None, cfg=None)` ¶

`explore_search(name, query, *, top_k=20, cfg=None)` ¶

`explore_vsearch(name, query, *, top_k=10, cfg=None)` ¶

`explore_unified_search(name, query, *, top_k=20, cfg=None)` ¶

`list_explore_libs(cfg=None)` ¶

`explore_db_path(name, cfg=None)` ¶

`validate_explore_name(name)` ¶

`scholaraio.insights` ¶

`extract_hot_keywords(search_events, *, top_k=10)` ¶

`aggregate_most_read_titles(read_events, papers_dir, *, top_k=10)` ¶

`build_weekly_read_trend(read_events)` ¶

`recent_unique_read_names(read_events, *, limit=5)` ¶

`recommend_unread_neighbors(store, cfg, *, recent_days=7, recent_limit=5, top_k=5)` ¶

`list_workspace_counts(ws_root)` ¶

`scholaraio.ingest.extractor` ¶

`get_extractor(config)` ¶

`scholaraio.ingest.metadata` ¶

`PaperMetadata` `dataclass` ¶

`enrich_metadata(meta)` ¶

`extract_abstract_from_md(md_path, cfg=None)` ¶