Status: Current layout specification
Last Updated: 2026-04-24
Scope: repository layout, runtime instance layout, agent-surface placement, and migration constraints for future refactors.
2026-04-24 status note:
- data/libraries/, data/spool/, and data/state/
- scholaraio migrate finalize --confirm

This document defines the target directory structure for ScholarAIO and the compatibility constraints that MUST be preserved while migrating from the current layout.
This is a refactoring specification, not a release note. It exists to:
- cli.py, pipeline.py, and workspace refactors

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are to be interpreted as requirement levels for future refactors.
ScholarAIO MUST distinguish between:
The repository root and the runtime-instance root MAY be the same directory in local-clone mode. In plugin mode, the runtime-instance root MAY instead be ~/.scholaraio/.
Directories MUST be partitioned by lifecycle and ownership, not only by feature name. At minimum, the design MUST distinguish:
Agent host discovery relies on fixed file locations. Therefore:
The repository root is the top-level project tree used by contributors and agent hosts.
The following files or directories MUST remain at repository root:
- AGENTS.md
- CLAUDE.md
- AGENTS_CN.md
- .qwen/QWEN.md
- .cursor/rules/
- .clinerules
- .windsurfrules
- .github/copilot-instructions.md
- .claude-plugin/
- clawhub.yaml

Rationale:
The canonical skill source MUST be:
- .claude/skills/

The following compatibility entry points MUST continue to resolve to the same skill set:
- .agents/skills
- .qwen/skills
- skills

These MAY remain symlinks or MAY become equivalent wrapper directories, but they MUST continue to expose the same skill inventory.
scholaraio/ MUST NOT become the canonical physical home of SKILL.md files.
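The requirement that every compatibility entry point exposes the same skill inventory can be checked mechanically. The sketch below is illustrative only: the helper names `skill_inventory` and `mismatched_entry_points` are hypothetical, and it assumes a skill is any subdirectory that contains a SKILL.md file.

```python
from pathlib import Path

# Canonical source and compatibility entry points named in this specification.
CANONICAL = Path(".claude/skills")
ENTRY_POINTS = [Path(".agents/skills"), Path(".qwen/skills"), Path("skills")]


def skill_inventory(skills_dir: Path) -> set[str]:
    """Return the names of subdirectories that contain a SKILL.md file."""
    if not skills_dir.is_dir():
        return set()
    return {p.name for p in skills_dir.iterdir() if (p / "SKILL.md").is_file()}


def mismatched_entry_points(repo_root: Path) -> list[str]:
    """List entry points whose inventory differs from the canonical source."""
    expected = skill_inventory(repo_root / CANONICAL)
    return [
        entry.as_posix()
        for entry in ENTRY_POINTS
        if skill_inventory(repo_root / entry) != expected
    ]
```

Because the check compares inventories rather than link targets, it behaves the same whether an entry point is a symlink or an equivalent wrapper directory.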
The target repository layout is:
repo-root/
├── AGENTS.md
├── CLAUDE.md
├── AGENTS_CN.md
├── .claude/skills/
├── .agents/skills -> ../.claude/skills
├── .qwen/QWEN.md
├── .qwen/skills -> ../.claude/skills
├── .cursor/rules/
├── .clinerules
├── .windsurfrules
├── .github/copilot-instructions.md
├── .claude-plugin/
├── clawhub.yaml
├── scholaraio/
├── gui/
├── docs/
├── tests/
└── scripts/
scholaraio/ Package Layout

The Python package SHOULD evolve toward the following second-level structure:
scholaraio/
├── core/
├── providers/
├── stores/
├── projects/
├── services/
├── interfaces/
└── compat/
The intended responsibilities are:
- core/: config, logging, observability, shared primitives, error types
- providers/: external adapters such as LLM providers, scholarly APIs, parsing backends, import adapters, and transport-specific clients for web/runtime integrations (for example HTTP services or MCP-backed adapters)
- stores/: persistence-facing contracts for papers, proceedings, explore, toolref, citation styles, and similar durable stores
- projects/: workspace and other project-level boundaries built on top of shared stores
- services/: ingest, retrieval, authoring, scientific runtime, and operational orchestration
- interfaces/: CLI-facing and agent-facing entry adapters
- compat/: temporary compatibility shims during migration

gui/

gui/ MUST be reserved as a top-level source directory for the future presentation shell.
gui/:
The runtime-instance root is the directory relative to which ScholarAIO resolves config and user data.
Until config discovery is redesigned, the following files MUST remain valid at the runtime-instance root:
- config.yaml
- config.local.yaml

Rationale:
- load_config() searches config.yaml upward from the current working directory and falls back to ~/.scholaraio/config.yaml
- config/ would break current discovery behavior

For compatibility with the current codebase, the runtime-instance root MUST continue to support:
- data/
- workspace/

This applies both in repository-local mode and in plugin mode.
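The discovery behavior cited in the rationale above (an upward search for config.yaml from the current working directory, with a fallback to ~/.scholaraio/config.yaml) can be sketched as follows. This is not the actual load_config() implementation: the function name `find_config` and its parameters are hypothetical, and the home directory is passed in explicitly so the behavior is easy to test.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, home: Path) -> Optional[Path]:
    """Search upward from `start` for config.yaml, then fall back to `home`."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / "config.yaml"
        if candidate.is_file():
            return candidate
    # Assumed fallback location, mirroring the plugin-mode instance root.
    fallback = home / ".scholaraio" / "config.yaml"
    return fallback if fallback.is_file() else None
```

A search of this shape only ever looks for a file named config.yaml directly inside each visited directory, which is why relocating the file elsewhere would change what discovery finds.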
In addition, future migration-capable versions SHOULD reserve a root-level control directory:
- .scholaraio-control/

Within those top-level compatibility anchors, the target runtime layout is:
instance-root/
├── config.yaml
├── config.local.yaml
├── data/
│ ├── libraries/
│ ├── spool/
│ ├── state/
│ ├── cache/
│ └── runtime/
├── .scholaraio-control/
└── workspace/
The purpose of each subtree is defined below.
.scholaraio-control/ is the reserved root-level control directory for migration and instance metadata.
It SHOULD contain control-plane artifacts such as:
- instance.json
- migration.lock

It MUST NOT be treated as part of:
- data/libraries/
- data/state/
- workspace/

Rationale:
- data/ and workspace/ are themselves migration targets

The detailed contract for this directory is defined in:
- docs/development/migration-mechanism-spec.md

data/ Subtree Specification

data/libraries/

data/libraries/ contains durable, user-meaningful knowledge stores.
Target second-level layout:
data/libraries/
├── papers/
├── proceedings/
├── explore/
├── toolref/
└── citation_styles/
Requirements:
data/spool/

data/spool/ contains queued work items awaiting later processing or manual review.
Target second-level layout:
data/spool/
├── inbox/
├── inbox-thesis/
├── inbox-patent/
├── inbox-doc/
├── inbox-proceedings/
└── pending/
Requirements:
data/state/

data/state/ contains persistent internal state that is important to the application but is not itself a user-facing library.
Target second-level layout:
data/state/
├── search/
├── metrics/
├── topics/
└── sessions/
Examples:
Requirements:
data/cache/

data/cache/ contains rebuildable derived data.
Target second-level layout:
data/cache/
├── parser/
├── previews/
├── vectors/
└── topics/
Requirements:
data/runtime/

data/runtime/ contains temporary runtime artifacts.
Target second-level layout:
data/runtime/
├── tmp/
├── locks/
└── sockets/
Requirements:
workspace/ Subtree Specification

workspace/ MUST be treated as a first-class project root, not merely as a paper-subset helper.
Each workspace MAY contain:
- .git/

Therefore, workspace/ MUST NOT be modeled only as a view over data/libraries/papers/.
The target layout for a user workspace is:
workspace/<name>/
├── workspace.yaml
├── refs/
│ ├── papers.json
│ ├── explore.json
│ └── toolref.json
├── notes/
├── drafts/
├── outputs/
├── runs/
└── .git/
This reference shape is not a rigid scaffold. Named workspaces remain user-owned project roots and MAY contain additional files or subdirectories beyond the examples above.
Requirements:
- workspace/<name>/ MUST be safe to use as an independent project root
- workspace/<name>/ MUST remain a free-form user project tree
- papers.json, refs/papers.json, and any future additive workspace.yaml
- workspace.yaml MUST stay additive and MUST NOT replace papers.json / refs/papers.json as the paper-reference compatibility chain

workspace.yaml Envelope

For the next design pass, the minimal additive workspace.yaml envelope SHOULD be:
schema_version: 1
name: turbulence-review
description: Drafting workspace for a turbulence review article
tags:
  - review
  - turbulence
mounts:
  explore: []
  toolref: []
outputs:
  default_dir: outputs/
Interpretation rules:
- schema_version identifies the manifest format and is required once workspace.yaml exists
- name, description, and tags are optional metadata only
- mounts is optional and expresses explicit opt-in external attachments; if explore mounts are ever implemented, they SHOULD start as read-only shared-store references
- outputs is optional and only expresses workspace-local preferences such as a default output directory
- papers.json / refs/papers.json, not be duplicated or replaced inside workspace.yaml

The minimal workspace.yaml envelope above SHOULD follow these validation and normalization rules:
- workspace.yaml MUST remain a normal, fully supported state
- schema_version: 1; if a newer schema version is encountered, implementations SHOULD treat the file as opaque metadata and MUST NOT rewrite it blindly
- name and description, when present, SHOULD be strings trimmed of surrounding whitespace; empty strings SHOULD normalize to absence
- name is descriptive metadata only and MUST NOT be treated as the canonical workspace directory name
- tags, when present, SHOULD be a list of strings; normalization SHOULD trim whitespace, drop empty items, and de-duplicate exact repeats while preserving order
- mounts.explore and mounts.toolref, when present, SHOULD be lists of logical shared-store identifiers rather than filesystem paths
- .. traversal, and MUST NOT imply ownership of data/libraries/ or workspace/_system/
- outputs.default_dir, when present, MUST resolve to a workspace-relative path; it MUST NOT be absolute and MUST NOT escape the workspace root
- workspace.yaml MUST NOT duplicate paper-reference payloads, search indexes, or other heavyweight derived state that already belongs elsewhere

System-generated workspaces or workspace-like output trees SHOULD use a reserved namespace under workspace/.
Recommended form:
workspace/_system/
Examples:
Legacy compatibility outputs such as workspace/translation-ws/, workspace/figures/, or root-level files like workspace/output.docx MAY remain temporarily, but system-owned or cross-workspace outputs SHOULD converge under workspace/_system/. Only outputs that are explicitly scoped to one named workspace SHOULD prefer workspace/<name>/outputs/.
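Returning to the workspace.yaml validation and normalization rules stated earlier, the following is a minimal sketch of how they might be applied to an already-parsed manifest. All function names are hypothetical. The sketch assumes that `None` represents a workspace with no manifest, treats any schema_version other than 1 as opaque (a simplification of the "newer version" rule), passes mounts through without validation, and checks outputs.default_dir lexically using POSIX-style paths only, without resolving symlinks.

```python
from pathlib import PurePosixPath
from typing import Any, Optional

SUPPORTED_SCHEMA_VERSION = 1


def _clean_str(value: Any) -> Optional[str]:
    """Trim a string; empty or non-string values normalize to absence."""
    if not isinstance(value, str):
        return None
    return value.strip() or None


def is_workspace_relative(path_str: str) -> bool:
    """True if the path is relative and never climbs above the workspace root."""
    path = PurePosixPath(path_str)
    if path.is_absolute():
        return False
    depth = 0
    for part in path.parts:
        if part == "..":
            depth -= 1
            if depth < 0:
                return False
        else:
            depth += 1
    return True


def normalize_envelope(raw: Optional[dict]) -> Optional[dict]:
    """Normalize parsed workspace.yaml data; None means 'no manifest'."""
    if raw is None:
        return None  # assumed: an absent manifest is a valid state
    if raw.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        return raw  # unknown schema: treat as opaque, never rewrite
    out: dict = {"schema_version": SUPPORTED_SCHEMA_VERSION}
    for key in ("name", "description"):
        cleaned = _clean_str(raw.get(key))
        if cleaned is not None:
            out[key] = cleaned
    tags = raw.get("tags")
    if isinstance(tags, list):
        cleaned_tags = [t for t in map(_clean_str, tags) if t is not None]
        out["tags"] = list(dict.fromkeys(cleaned_tags))  # order-preserving de-dup
    if "mounts" in raw:
        out["mounts"] = raw["mounts"]  # mount validation omitted in this sketch
    outputs = raw.get("outputs")
    if isinstance(outputs, dict) and isinstance(outputs.get("default_dir"), str):
        default_dir = outputs["default_dir"]
        if not is_workspace_relative(default_dir):
            raise ValueError(f"default_dir escapes the workspace: {default_dir!r}")
        out["outputs"] = {"default_dir": default_dir}
    return out
```

Returning the original object untouched for an unrecognized schema version is one way to honor the rule that such files are not rewritten blindly.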
- workspace/ MUST reference libraries through stable IDs or manifests, not by taking ownership of library files
- data/libraries/ MUST NOT depend on workspace layout
- data/state/, data/cache/, and data/runtime/ MUST NOT store user-authored canonical content
- providers/ MUST NOT depend on interfaces/
- stores/ MUST NOT depend on interfaces/
- projects/ MAY depend on stores/ and services/, but MUST NOT define external provider clients
- services/ MAY compose providers/, stores/, and projects/
- interfaces/ SHOULD remain thin and MUST NOT become the only place where business rules exist
- websearch / webextract MUST NOT be locked to a single skill-packaging or HTTP-only implementation shape
- AGENTS.md plus the skill system

The following constraints are mandatory:
- .claude/skills/ MUST remain the canonical skill source

The following discovery surfaces MUST continue to work:
- CLAUDE.md and .claude/skills/
- AGENTS.md and .agents/skills/
- .qwen/QWEN.md and .qwen/skills/
- .cursor/rules/scholaraio.mdc, then AGENTS.md, then .claude/skills/*/SKILL.md
- .clinerules and .claude/skills/
- .windsurfrules
- .github/copilot-instructions.md
- .claude-plugin/ and clawhub.yaml

Any refactor that changes the physical location or wrapper path of skills MUST update:
No directory-structure migration is complete until those discovery surfaces have been verified to still work.
The current codebase still uses legacy paths. During migration, the following logical mapping SHOULD be adopted:
| Current path | Target logical location |
|---|---|
| data/papers/ | data/libraries/papers/ |
| data/proceedings/ | data/libraries/proceedings/ |
| data/explore/ | data/libraries/explore/ |
| data/toolref/ | data/libraries/toolref/ |
| data/citation_styles/ | data/libraries/citation_styles/ |
| data/inbox* | data/spool/* |
| data/pending/ | data/spool/pending/ |
| data/index.db | data/state/search/index.db |
| data/metrics.db | data/state/metrics/metrics.db |
| data/topic_model/ | data/state/topics/ or data/cache/topics/, depending on rebuild policy |
| workspace/translation-ws/ | workspace/_system/translation-bundles/ |
| workspace/figures/ | workspace/_system/figures/ |
| workspace/output.* | workspace/_system/output/ |
This mapping is a migration target, not a requirement for an all-at-once rename.
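Because the rename is not all-at-once, code may need to locate a resource under either layout during the transition. The sketch below shows one possible approach, assuming a policy of preferring the target location when it already exists and otherwise falling back to the legacy path. The names `LEGACY_TO_TARGET` and `resolve_path` are hypothetical, and only the unambiguous one-to-one rows of the mapping table are included.

```python
from pathlib import Path

# Legacy relative path -> target relative path (subset of the mapping table;
# wildcard rows and the policy-dependent topic_model row are left out).
LEGACY_TO_TARGET = {
    "data/papers": "data/libraries/papers",
    "data/proceedings": "data/libraries/proceedings",
    "data/explore": "data/libraries/explore",
    "data/toolref": "data/libraries/toolref",
    "data/citation_styles": "data/libraries/citation_styles",
    "data/pending": "data/spool/pending",
    "data/index.db": "data/state/search/index.db",
    "data/metrics.db": "data/state/metrics/metrics.db",
}


def resolve_path(instance_root: Path, legacy_rel: str) -> Path:
    """Prefer the target location if it exists, else fall back to legacy."""
    target_rel = LEGACY_TO_TARGET.get(legacy_rel)
    if target_rel is not None:
        target = instance_root / target_rel
        if target.exists():
            return target
    # Unmapped paths, or targets not yet migrated, resolve to the legacy path.
    return instance_root / legacy_rel
```

Routing lookups through a single function like this keeps the choice between layouts in one place, so individual feature modules do not need to know which layout an instance currently uses.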
The migration MUST be incremental.
Before any large directory move, ScholarAIO SHOULD first:
- cfg._root / "data" / ... expressions in feature modules

The current codebase is not yet ready for an atomic layout flip. Therefore:
- data/ or workspace/ SHOULD NOT happen first
- config.yaml discovery behavior SHOULD remain stable until an explicit config-discovery redesign is approved
- git init inside a workspace without affecting the main repository

This specification does not define:
Those belong in companion architecture and execution documents.
Until superseded by a later approved version, future refactors SHOULD treat this document as the governing directory-structure target for:
- cli.py decomposition
- ingest/pipeline.py decomposition