Technical architecture for Owlat's semantic file storage: version tracking with provenance, plus embedding-based retrieval and auto-tagging, both of which are now wired in.

Semantic File System Architecture

Owlat already stores media assets for email campaigns. The semantic file system extends this into a broader organizational file layer — where files get summaries, auto-tags, embeddings, and conversation-linked provenance. The vision is that files become part of the agent's context, surfacing automatically when relevant.

Most of this is built today: the semanticFiles table, CRUD functions in apps/api/convex/semanticFiles.ts, full-text search, version history, and a dashboard files UI (apps/web/app/pages/dashboard/files/, plus a files tab on each contact). The AI processing pipeline (text extraction, summary, auto-tags, embeddings) is now wired in and scheduled — the create mutation kicks off processFile, true vector search runs over the vector_files index, and the agent's context_retrieval step pulls relevant files into its briefing. Inbound email attachments are auto-ingested: mail.delivery.ingestFromWebhook pulls each attachment leaf out of the delivered .eml and calls the semanticFiles.ingest entry point with sourceType: 'email_attachment'. What remains: an agent_generated producer (the ingest entry point already accepts that source type — no agent flow emits files yet), real extraction for non-PDF binaries (Word/Excel/images), and tag-merge suggestions. Each is called out inline below.

How it works

Files enter the system through three source types (the sourceType field):

Direct upload (upload) — through the dashboard files UI.
Email attachment (email_attachment) — inbound attachments, auto-ingested from the delivered .eml by mail.delivery.ingestFromWebhook.
Agent-generated (agent_generated) — artifacts an agent produces.

Today vs. planned

Direct upload is wired end-to-end through the dashboard, and now runs the full processing pipeline. The shared insertSemanticFile helper (semanticFiles.ts) supports all three source types but only inserts the row (and syncs contacts). Two entry points schedule processFile: the user-facing create mutation (upload), and the internal ingest mutation (email_attachment / agent_generated). Inbound attachments now flow automatically — mail.delivery.ingestFromWebhook extracts each attachment from the raw .eml and calls ingest, so they land in the file library and the file→knowledge pipeline. The remaining gap is the agent_generated source: ingest accepts it, but no agent flow emits artifacts yet.

The processing pipeline (processFile in apps/api/convex/semanticFileProcessing.ts) runs on every new file:

File uploaded / received
  → Store binary in Convex _storage
  → Extract text content (by MIME type):
    - text/*, application/json → raw text
    - text/html → tags stripped
    - text/csv (or .csv) → raw text
    - PDF → real text extraction via unpdf (placeholder on failure)
    - Word, Excel, images → placeholder string ([Word document: name], etc.)
  → Generate title, 2-3 sentence summary, and 5-10 auto-tags via the LLM
  → Inherit context auto-tags from the thread subject + related contacts
  → Generate an embedding (text-embedding-3-small, 1536 dims)
  → Record embeddingModel + embeddingGeneratedAt for re-embedding
  → Build searchableText for full-text search
  → Patch metadata back onto the semanticFiles row
  → Feed real extracted text into the knowledge graph (extractFromFile)

The processing pipeline is now scheduled

processFile is an internalAction that the user-facing create mutation schedules (ctx.scheduler.runAfter(0, …)) the moment a file is inserted. A backfillUnprocessed cron (crons.ts, every 15 minutes) is a safety net that re-schedules processing for recently-created files whose embedding never landed (e.g. lost to a deploy gap). So uploaded files now carry an LLM-generated title/summary, auto-tags, a real embedding, and embeddingModel / embeddingGeneratedAt.
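The safety-net selection can be pictured as a simple filter over recent rows. This is a minimal sketch, not the actual backfillUnprocessed implementation: the FileRow shape, the selectBackfillCandidates name, and the lookbackMs window are all assumptions made for illustration, since the source only says the cron targets recently created files whose embedding never landed.

```typescript
// Hypothetical row shape: only the fields this sketch needs.
interface FileRow {
  id: string
  createdAt: number // epoch milliseconds
  embeddingGeneratedAt?: number // set once processing has produced an embedding
}

// Pick files that were created within the lookback window but never got an
// embedding. `lookbackMs` is an assumed parameter; the real window is not
// documented here.
function selectBackfillCandidates(files: FileRow[], now: number, lookbackMs: number): string[] {
  return files
    .filter((f) => f.embeddingGeneratedAt === undefined && now - f.createdAt <= lookbackMs)
    .map((f) => f.id)
}
```

Bounding the scan to recent files keeps each cron run cheap, at the cost of never retrying files older than the window.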

Binary text extraction is active for PDFs only

extractText() reads plaintext, JSON, HTML, and CSV, and now extracts real text from PDFs using the unpdf package (pure-JS, serverless-friendly), falling back to the [PDF file: contract.pdf] placeholder on any failure. Word, Excel, and image files are still stored with a placeholder string (e.g. [Word document: report.docx], [Image: scan.png]) and rely on the filename and any user-provided title for the summary and embedding — there is no DOCX/XLSX parser and no image OCR yet. Files left as a placeholder stub are skipped by the file→knowledge extraction step.
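The dispatch described above can be sketched as follows. This is an illustration under stated assumptions, not Owlat's extractText(): the function and type names are hypothetical, the input is taken as an already-decoded string, the PDF branch models only the failure fallback (the real pipeline tries unpdf first), and the final generic placeholder is invented for the sketch.

```typescript
// Result of extraction; `isPlaceholder` marks stubs that downstream
// knowledge extraction should skip.
type Extraction = { text: string; isPlaceholder: boolean }

const WORD_MIME_TYPES = [
  'application/msword',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
]

function stripHtmlTags(html: string): string {
  return html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script/style bodies
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ') // collapse whitespace left behind
    .trim()
}

function extractTextSketch(mimeType: string, filename: string, raw: string): Extraction {
  const name = filename.toLowerCase()
  if (mimeType === 'text/html') {
    return { text: stripHtmlTags(raw), isPlaceholder: false }
  }
  if (mimeType.startsWith('text/') || mimeType === 'application/json' || name.endsWith('.csv')) {
    return { text: raw, isPlaceholder: false }
  }
  if (mimeType === 'application/pdf') {
    // Assumption: only the fallback path is modelled; real extraction uses unpdf.
    return { text: `[PDF file: ${filename}]`, isPlaceholder: true }
  }
  if (WORD_MIME_TYPES.indexOf(mimeType) !== -1) {
    return { text: `[Word document: ${filename}]`, isPlaceholder: true }
  }
  if (mimeType.startsWith('image/')) {
    return { text: `[Image: ${filename}]`, isPlaceholder: true }
  }
  // Hypothetical generic stub for anything else (not taken from the source).
  return { text: `[File: ${filename}]`, isPlaceholder: true }
}
```

Carrying an explicit isPlaceholder flag is one way to let the file→knowledge step skip stubs without pattern-matching on the placeholder text.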

Retrieval

Full-text search

Keyword search over searchableText is live, via the search query in apps/api/convex/semanticFiles.ts:

const results = await ctx.db
  .query('semanticFiles')
  .withSearchIndex('search_files', (q) => q.search('searchableText', searchQuery))
  .take(limit ?? 20)

There is no filterFields on the search index — Owlat runs one organization per deployment (see apps/api/convex/lib/sessionOrganization.ts), so there is no organizationId to filter on.

Semantic search

True vector search over files is live. Because Convex's vector API is action-only, it runs in an internalAction — semanticSearch in apps/api/convex/semanticFileProcessing.ts. It is a hybrid retrieval: the action embeds the query text (or accepts a pre-computed embedding) and runs two over-fetched legs — a vector leg via ctx.vectorSearch over the vector_files index, and a full-text leg via ftsRankedFileIds over search_files — then fuses the two rankings with reciprocal rank fusion (lib/rrf.ts). The fused ids are hydrated into full file documents (with storage URLs and a similarity _score) via getByIds, post-filtered by the required scopeToContact arg for contact-level data isolation, and finally sliced to limit:

// semanticFileProcessing.semanticSearch (internalAction) — simplified excerpt:
// Over-fetch both legs so the post-fusion contact filter still has survivors.
const fetchLimit = Math.min(256, Math.max(limit * 5, 50))
const hits = await ctx.vectorSearch('semanticFiles', 'vector_files', {
  vector, // query embedding produced in the same action
  limit: fetchLimit,
})
const ftsRanked = await ctx.runQuery(internal.semanticFiles.ftsRankedFileIds, {
  queryText,
  limit: fetchLimit,
})
// Fuse the vector + full-text rankings (scale-agnostic).
const fusedIds = reciprocalRankFusion([hits.map((h) => h._id), ftsRanked])
const files = await ctx.runQuery(internal.semanticFiles.getByIds, { ids: fusedIds })
// Contact-scope AFTER fusion, then slice to `limit`.
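
For readers unfamiliar with reciprocal rank fusion, here is a minimal standalone sketch of the technique. It is not the code in lib/rrf.ts, whose signature may differ: the function name and the constant k = 60 (a commonly used default) are assumptions. Each id scores 1 / (k + rank) in every ranking it appears in, and the scores are summed, so only rank positions matter and the two legs' raw scores never need to be on the same scale.

```typescript
// Fuse several ranked id lists into one ranking.
// `k` dampens the advantage of top ranks; 60 is a conventional choice, assumed here.
function reciprocalRankFusionSketch(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1 // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0])
}

// An id that appears in both legs outranks ids that appear in only one.
const fusedExample = reciprocalRankFusionSketch([
  ['a', 'b', 'c'], // vector leg
  ['b', 'd'], // full-text leg
])
// fusedExample is ['b', 'a', 'd', 'c']
```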

There is intentionally no `semanticSearch` query

The old recency-fallback stub was removed. semanticFiles.ts now carries only a comment explaining why no semanticSearch query exists: a query context can't call ctx.vectorSearch, so such a query would silently return recency-ordered files while masquerading as semantic search — a trap. The only semanticSearch is the internalAction in semanticFileProcessing.ts, which does true ctx.vectorSearch. The agent and any real semantic-search caller go through that action.

Contextual retrieval for agents

Now wired

The agent pipeline searches semanticFiles when it handles a message. The context_retrieval step (apps/api/convex/agent/steps/context_retrieval/index.ts) builds a query from the inbound subject + body and calls internal.semanticFileProcessing.semanticSearch (limited to fileLimit, default 3), folding the hits into a [RELEVANT FILES] briefing section alongside the [KNOWLEDGE] section it pulls from internal.knowledge.retrieval.semanticSearch. Both lookups sit behind the step's existing token budget (normal / compacted / emergency tiers).

The context step therefore pulls in relevant documents (contracts, invoices, proposals), so responses are grounded in real artifacts (filename, title, summary) rather than in the message text alone. File and knowledge retrieval only run when the inbound message has enough query text (more than ~10 characters of subject + body).
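
The query gate and briefing section might look roughly like this. It is a sketch only: the FileHit shape, both function names, and the exact line format of the section are assumptions, and the real step also applies its token-budget tiers, which are omitted here.

```typescript
// Hypothetical subset of a hydrated file document.
interface FileHit {
  filename: string
  title?: string
  summary?: string
}

const MIN_QUERY_CHARS = 10 // mirrors the "more than ~10 characters" rule

// Returns the query text, or null when there is too little to search with.
function buildQueryText(subject: string, body: string): string | null {
  const text = `${subject} ${body}`.trim()
  return text.length > MIN_QUERY_CHARS ? text : null
}

// Render up to `fileLimit` hits as a briefing section (format is illustrative).
function formatRelevantFiles(hits: FileHit[], fileLimit: number = 3): string {
  const lines = hits.slice(0, fileLimit).map((hit) => {
    const title = hit.title ?? hit.filename
    return hit.summary
      ? `- ${title} (${hit.filename}): ${hit.summary}`
      : `- ${title} (${hit.filename})`
  })
  return lines.length > 0 ? ['[RELEVANT FILES]'].concat(lines).join('\n') : ''
}
```

Returning an empty string when nothing matches lets the caller omit the section entirely rather than emit an empty heading.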

Auto-tagging

Every file can carry two kinds of tags:

Auto-tags (autoTags) — generated by the LLM summarize call during processing (from extracted text and filename), then augmented with context tags inherited from the conversation it was shared in. processFile slugifies the thread subject and related contact names (e.g. a file dropped in a "Q3 Financials" thread with "Acme Corp" picks up q3-financials and acme-corp) via slugifyTag and merges them into autoTags.
Manual tags (tags) — user-applied through the update mutation, stored as-is.
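
The context-tag inheritance can be illustrated with a small sketch. The real slugifyTag may normalize differently (for example around accents or length limits); slugifyTagSketch and mergeAutoTags are hypothetical names, and the de-duplication shown is per file only.

```typescript
// Lowercase, replace every run of non-alphanumerics with one hyphen,
// and trim hyphens from both ends.
function slugifyTagSketch(input: string): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
}

// Merge LLM-produced tags with slugs derived from the thread subject and
// contact names, keeping first-seen order and dropping duplicates.
function mergeAutoTags(llmTags: string[], contextSources: string[]): string[] {
  const merged = new Set<string>(llmTags)
  for (const source of contextSources) {
    const slug = slugifyTagSketch(source)
    if (slug.length > 0) merged.add(slug)
  }
  return Array.from(merged)
}

const mergedExample = mergeAutoTags(['invoice', 'acme-corp'], ['Q3 Financials', 'Acme Corp'])
// mergedExample is ['invoice', 'acme-corp', 'q3-financials']
```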

Version provenance also lands here: when a file supersedes a previous version, processFile computes a coarse changeSummary (e.g. "42 words added vs previous version") from the word-count delta against the prior version's extracted text.
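
A word-count delta of this kind could be computed as in the sketch below. Only the "N words added vs previous version" wording comes from the example above; the function names and the phrasing for removals and for no change are assumptions.

```typescript
// Count whitespace-separated words; an empty or blank string has zero words.
function countWords(text: string): number {
  const trimmed = text.trim()
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length
}

// Coarse summary from the word-count difference between two versions.
function changeSummarySketch(previousText: string, currentText: string): string {
  const delta = countWords(currentText) - countWords(previousText)
  if (delta === 0) return 'No change in word count vs previous version'
  const amount = Math.abs(delta)
  const noun = amount === 1 ? 'word' : 'words'
  const verb = delta > 0 ? 'added' : 'removed'
  return `${amount} ${noun} ${verb} vs previous version`
}
```

This is deliberately coarse: a rewrite that swaps words one for one reports no change, which is why the summary is best read as a hint rather than a diff.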

Tag reconciliation is not implemented

Context-inherited auto-tags ship today, but there is still no tag-frequency tracking and no suggestion to merge similar tags (e.g. q3-finances and q3-financials). Auto-tags are produced per file at processing time; they are not reconciled or de-duplicated across the corpus. Corpus-wide tag reconciliation / merge suggestions remain a future idea.

Version tracking

Files are version-linked, not replaced. Each upload can reference a previousVersionId, and create derives the next version number from it:

Contract v1 (uploaded Feb 10 by Alice, threadId = "Acme negotiation")
  → Contract v2 (uploaded Feb 18 by Bob, same thread, after legal review)
    → Contract v3 (uploaded Feb 25 by Alice, final signed version)

Each row records:

Who uploaded it (uploadedBy)
Where — the conversation thread (threadId)
Which contacts it relates to (contactIds)

The previousVersionId field forms a linked list of versions, which getVersionHistory walks to surface the full chain in the UI.
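
Walking that linked list is straightforward. The sketch below works over an in-memory map instead of the database and is not the real getVersionHistory: the VersionedFile shape and both function names are assumptions, and it only walks backward from a given version.

```typescript
// Hypothetical subset of a semanticFiles row.
interface VersionedFile {
  id: string
  version: number
  previousVersionId?: string
}

// Version number for a new upload: one past its predecessor, or 1 for a first upload.
function nextVersion(previous: VersionedFile | undefined): number {
  return previous ? previous.version + 1 : 1
}

// Follow previousVersionId links from `latestId` back to the first version.
// The `seen` set guards against an accidental cycle in the data.
function versionHistorySketch(files: Map<string, VersionedFile>, latestId: string): VersionedFile[] {
  const chain: VersionedFile[] = []
  const seen = new Set<string>()
  let cursor: string | undefined = latestId
  while (cursor !== undefined && !seen.has(cursor)) {
    seen.add(cursor)
    const file: VersionedFile | undefined = files.get(cursor)
    if (file === undefined) break // dangling link: stop with what we have
    chain.push(file)
    cursor = file.previousVersionId
  }
  return chain // newest first
}
```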

Schema

The real table lives in apps/api/convex/schema/knowledge.ts (knowledgeTables.semanticFiles):

semanticFiles: defineTable({
  storageId: v.id('_storage'),
  filename: v.string(),
  mimeType: v.string(),
  fileSize: v.number(),
  // Semantic metadata
  title: v.optional(v.string()),
  summary: v.optional(v.string()),
  extractedText: v.optional(v.string()),
  tags: v.optional(v.array(v.string())),
  autoTags: v.optional(v.array(v.string())),
  // Provenance
  sourceType: v.union(
    v.literal('upload'),
    v.literal('email_attachment'),
    v.literal('agent_generated')
  ),
  sourceMessageId: v.optional(v.string()),
  uploadedBy: v.optional(v.string()),
  // Why/where this version was shared (JSON-stringified context blob)
  uploadContext: v.optional(v.string()),
  // Relationships
  contactIds: v.optional(v.array(v.id('contacts'))),
  threadId: v.optional(v.id('conversationThreads')),
  // Versioning
  version: v.number(),
  previousVersionId: v.optional(v.id('semanticFiles')),
  // Human-readable diff vs the previous version (text files only)
  changeSummary: v.optional(v.string()),
  // Embedding for semantic search
  embedding: v.array(v.float64()),
  // Model that produced `embedding`; re-embed when this changes
  embeddingModel: v.optional(v.string()),
  // When `embedding` was generated; used to schedule re-embedding
  embeddingGeneratedAt: v.optional(v.number()),
  // Full-text search
  searchableText: v.optional(v.string()),
  createdAt: v.number(),
  updatedAt: v.number(),
})
  .index('by_created_at', ['createdAt'])
  .index('by_thread', ['threadId'])
  .index('by_previous_version', ['previousVersionId'])
  .searchIndex('search_files', {
    searchField: 'searchableText',
    filterFields: [],
  })
  .vectorIndex('vector_files', {
    vectorField: 'embedding',
    dimensions: 1536,
    filterFields: [],
  })

There is no organizationId field — Owlat is single-org-per-deployment, so neither index filters by organization. embeddingModel and embeddingGeneratedAt record which model produced the vector (text-embedding-3-small, the CURRENT_EMBEDDING_MODEL constant) so stale embeddings can be re-generated when the model changes.
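
A staleness check built on those two fields might look like the sketch below. The CURRENT_EMBEDDING_MODEL value and the 1536 dimension count come from this page; the EmbeddedFile shape, the needsReembedding name, and the specific rules are assumptions about how such a check could work, not a description of the existing code.

```typescript
const CURRENT_EMBEDDING_MODEL = 'text-embedding-3-small'
const EMBEDDING_DIMENSIONS = 1536

// Hypothetical subset of a semanticFiles row.
interface EmbeddedFile {
  embedding: number[]
  embeddingModel?: string
  embeddingGeneratedAt?: number
}

// A file needs (re-)embedding when its vector has the wrong size, was never
// generated, or was produced by a model other than the current one.
function needsReembedding(file: EmbeddedFile): boolean {
  if (file.embedding.length !== EMBEDDING_DIMENSIONS) return true
  if (file.embeddingGeneratedAt === undefined) return true
  return file.embeddingModel !== CURRENT_EMBEDDING_MODEL
}
```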

Integration with existing systems

The semantic file system does not replace the existing mediaAssets table (apps/api/convex/schema/templates.ts). Media assets are purpose-built for the email builder (images, with width/height, search, tagging). Semantic files are a broader layer for organizational documents. Both use Convex _storage for binary data.

Surfacing files inside the agent's context now works (the context_retrieval step queries vector_files). A unified view — searching media assets and semantic files through a single query — is still the longer-term goal: today the two layers are separate. The remaining cross-cutting pieces are agent-generated auto-ingestion (the ingest entry point accepts the source type, but no agent flow emits files yet — inbound email attachments already auto-ingest), real extraction for non-PDF binaries (Word/Excel/images), and corpus-wide tag reconciliation.

Files are not a separate product

The file system is a layer in the same architecture: the same deployment, the same permissions model (writes go through requireAdminContext), and the same Convex storage. See the Files guide for the user-facing view and the Knowledge Graph vision for the related typed-knowledge layer.