Semantic File System
Technical architecture for Owlat's semantic file storage — embedding-based retrieval, auto-tagging, and version tracking with provenance.
Semantic File System Architecture
Owlat already stores media assets for email campaigns. The semantic file system extends this into a full organizational file layer — where every file gets embeddings, auto-tags, and conversation-linked provenance. Files become part of the agent's context, surfacing automatically when relevant.
How it works
Files enter the system through three channels:
- Direct upload — drag-and-drop in the dashboard (extends existing media library)
- Email attachment — inbound emails with attachments automatically index the files
- Agent-generated — agents produce artifacts (reports, visualizations, drafts) stored as files
Every file goes through the same processing pipeline:
File uploaded / received
  → Store binary in Convex _storage (existing pattern)
  → Extract text content:
      - PDF: text extraction (pdf-lib, already in deps, creates and edits PDFs but has no text-extraction API, so a parser such as pdfjs-dist would be needed)
      - Images: OCR via LLM vision (if available)
      - Documents: text extraction
  → Generate summary and auto-tags via AI SDK
  → Generate embedding for semantic search
  → Link to source (conversation, contact, uploader)
  → Store metadata in semanticFiles table
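The pipeline steps above can be sketched as a single function with injected dependencies. This is a minimal illustration, not Owlat's implementation: `PipelineDeps`, `IncomingFile`, `SemanticFileDraft`, and `processFile` are hypothetical names, the filename fallback for files without extractable text is an assumption, and so is the way `searchableText` is composed.

```typescript
type SourceType = 'upload' | 'email_attachment' | 'agent_generated';

interface IncomingFile {
  organizationId: string;
  filename: string;
  mimeType: string;
  bytes: Uint8Array;
  sourceType: SourceType;
  uploadedBy?: string;
  threadId?: string;
}

// Hypothetical stand-ins for Convex storage, a text extractor,
// the AI SDK, and an embedding model.
interface PipelineDeps {
  storeBinary(bytes: Uint8Array): Promise<string>;
  extractText(bytes: Uint8Array, mimeType: string): Promise<string | undefined>;
  summarizeAndTag(text: string): Promise<{ summary: string; autoTags: string[] }>;
  embed(text: string): Promise<number[]>;
}

// Subset of the semanticFiles fields that the pipeline fills in.
interface SemanticFileDraft {
  organizationId: string;
  storageId: string;
  filename: string;
  mimeType: string;
  fileSize: number;
  summary: string;
  extractedText?: string;
  autoTags: string[];
  sourceType: SourceType;
  uploadedBy?: string;
  threadId?: string;
  version: number;
  embedding: number[];
  searchableText: string;
  createdAt: number;
  updatedAt: number;
}

async function processFile(
  file: IncomingFile,
  deps: PipelineDeps,
  now: number = Date.now(),
): Promise<SemanticFileDraft> {
  const storageId = await deps.storeBinary(file.bytes);
  const extractedText = await deps.extractText(file.bytes, file.mimeType);
  // Assumption: fall back to the filename when no text could be extracted.
  const basis = extractedText ?? file.filename;
  const { summary, autoTags } = await deps.summarizeAndTag(basis);
  const embedding = await deps.embed(basis);
  return {
    organizationId: file.organizationId,
    storageId,
    filename: file.filename,
    mimeType: file.mimeType,
    fileSize: file.bytes.byteLength,
    summary,
    extractedText,
    autoTags,
    sourceType: file.sourceType,
    uploadedBy: file.uploadedBy,
    threadId: file.threadId,
    version: 1, // a first upload starts the version chain
    embedding,
    // Assumption: keyword search covers filename, summary, and extracted text.
    searchableText: [file.filename, summary, extractedText ?? ''].join(' ').trim(),
    createdAt: now,
    updatedAt: now,
  };
}
```

Passing the services in as `deps` keeps the ordering of the steps testable without a live model or storage backend.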
Retrieval
Semantic search
Natural language file retrieval using vector search:
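// "Find the contract we signed with Acme Corp"
// Vector search is only available inside Convex actions.
const results = await ctx.vectorSearch('semanticFiles', 'vector_files', {
  vector: await generateEmbedding(query), // length must match the index's 1536 dimensions
  limit: 10,
  filter: (q) => q.eq('organizationId', orgId), // scope results to the caller's organization
})
// Each result carries only `_id` and `_score`; load the documents in a follow-up query.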
Contextual retrieval for agents
When the Agent Pipeline processes a message, the context retrieval step (Step 1) also searches relevant files:
// Agent handling a contract question from Acme Corp
const relevantFiles = await ctx.vectorSearch('semanticFiles', 'vector_files', {
  vector: await generateEmbedding(messageContent),
  limit: 5,
  filter: (q) => q.eq('organizationId', orgId),
})
The agent receives not just knowledge graph entries but the actual source documents — contracts, invoices, proposals. This grounds responses in real artifacts rather than summaries.
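Convex vector search returns document IDs and similarity scores rather than full documents, so the agent step needs a second lookup before it can pass source documents to the model. The sketch below shows one way to do that. `loadRelevantFiles`, the `loadFile` callback, and the `minScore` cutoff of 0.75 are hypothetical and chosen only for illustration.

```typescript
interface VectorHit {
  _id: string;
  _score: number;
}

interface FileDoc {
  _id: string;
  filename: string;
  summary?: string;
}

type ScoredFile = FileDoc & { score: number };

// Drops weak matches, loads the remaining documents, and skips any
// that no longer exist (for example, deleted since indexing).
async function loadRelevantFiles(
  hits: VectorHit[],
  loadFile: (id: string) => Promise<FileDoc | null>,
  minScore: number = 0.75, // assumed cutoff, tune per embedding model
): Promise<ScoredFile[]> {
  const strong = hits.filter((hit) => hit._score >= minScore);
  const docs = await Promise.all(strong.map((hit) => loadFile(hit._id)));
  const files: ScoredFile[] = [];
  strong.forEach((hit, i) => {
    const doc = docs[i];
    if (doc) files.push({ ...doc, score: hit._score });
  });
  return files;
}
```

In a Convex action, `loadFile` would typically wrap a query that reads the document by ID; here it is injected so the filtering logic stays independent of the backend.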
Full-text search
For keyword-based file search in the UI:
const results = await ctx.db
  .query('semanticFiles')
  .withSearchIndex('search_files', (q) =>
    q.search('searchableText', searchQuery)
      .eq('organizationId', orgId)
  )
  .take(25)
Auto-tagging
Every file gets two types of tags:
- Auto-tags: generated by the AI from file content and the context in which it was shared. A PDF shared in a thread about "Q3 financials" with "Acme Corp" automatically inherits q3-financials and acme-corp tags.
- Manual tags: user-applied tags for custom organization.
Tags evolve as the organization's vocabulary evolves. The system tracks tag frequency and suggests merging similar tags (e.g., q3-finances and q3-financials).
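The source does not say how similar tags are detected. As one possible approach, the sketch below normalizes labels into tag slugs and flags pairs whose normalized edit distance is small, keeping the more frequently used tag. `toTagSlug`, `suggestTagMerges`, and the 0.75 similarity threshold are hypothetical; an embedding-based comparison would be an equally plausible choice.

```typescript
// Turns a free-form label such as "Q3 financials" into a slug like "q3-financials".
function toTagSlug(label: string): string {
  return label
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Classic Levenshtein edit distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

interface MergeSuggestion {
  keep: string; // the more frequently used tag
  merge: string; // the tag suggested for merging into `keep`
  similarity: number; // 1 means identical, 0 means nothing in common
}

function suggestTagMerges(
  counts: Record<string, number>,
  minSimilarity: number = 0.75, // assumed threshold
): MergeSuggestion[] {
  const tags = Object.keys(counts);
  const suggestions: MergeSuggestion[] = [];
  for (let i = 0; i < tags.length; i++) {
    for (let j = i + 1; j < tags.length; j++) {
      const a = tags[i];
      const b = tags[j];
      const similarity = 1 - levenshtein(a, b) / Math.max(a.length, b.length);
      if (similarity >= minSimilarity) {
        const [keep, merge] = counts[a] >= counts[b] ? [a, b] : [b, a];
        suggestions.push({ keep, merge, similarity });
      }
    }
  }
  return suggestions;
}
```

The pairwise comparison is quadratic in the number of tags, which is acceptable for a per-organization vocabulary but would need bucketing for very large tag sets.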
Version tracking
Files are version-linked, not replaced:
Contract v1 (shared Feb 10 by Alice in "Acme negotiation" thread)
→ Contract v2 (shared Feb 18 by Bob in same thread, after legal review)
→ Contract v3 (shared Feb 25 by Alice, final signed version)
Each version records:
- Who uploaded or shared it
- Where — which conversation thread
- Why — context from the surrounding messages
- What changed — diff summary (for text-based files)
The previousVersionId field creates a linked list of versions. The UI shows the full version history with conversation context for each revision.
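Walking that linked list can be sketched as follows. `FileVersion` is a trimmed-down, hypothetical view of a `semanticFiles` document, and the in-memory `Map` stands in for database lookups by ID. The `seen` set is a defensive guard against accidental cycles, not something the source describes.

```typescript
interface FileVersion {
  _id: string;
  version: number;
  previousVersionId?: string;
  uploadedBy?: string;
}

// Follows previousVersionId links from the newest version back to the first.
// Returns versions newest-first; stops on a missing document or a cycle.
function versionHistory(
  latestId: string,
  byId: Map<string, FileVersion>,
): FileVersion[] {
  const history: FileVersion[] = [];
  const seen = new Set<string>();
  let currentId: string | undefined = latestId;
  while (currentId !== undefined && !seen.has(currentId)) {
    seen.add(currentId);
    const doc: FileVersion | undefined = byId.get(currentId);
    if (!doc) break;
    history.push(doc);
    currentId = doc.previousVersionId;
  }
  return history;
}
```

Because each version only points backward, finding the newest version from an older one requires the reverse lookup, which is what the `by_previous_version` index in the schema supports.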
Schema
semanticFiles: defineTable({
  organizationId: v.string(),
  storageId: v.id('_storage'),
  filename: v.string(),
  mimeType: v.string(),
  fileSize: v.number(),
  // Semantic metadata
  title: v.optional(v.string()),
  summary: v.optional(v.string()),
  extractedText: v.optional(v.string()),
  tags: v.optional(v.array(v.string())),
  autoTags: v.optional(v.array(v.string())),
  // Provenance
  sourceType: v.union(
    v.literal('upload'),
    v.literal('email_attachment'),
    v.literal('agent_generated')
  ),
  sourceMessageId: v.optional(v.string()),
  uploadedBy: v.optional(v.string()),
  // Relationships
  contactIds: v.optional(v.array(v.id('contacts'))),
  threadId: v.optional(v.id('conversationThreads')),
  // Versioning
  version: v.number(),
  previousVersionId: v.optional(v.id('semanticFiles')),
  // Embedding for semantic search
  embedding: v.array(v.float64()),
  // Full-text search
  searchableText: v.optional(v.string()),
  createdAt: v.number(),
  updatedAt: v.number(),
})
  .index('by_organization', ['organizationId'])
  // Caveat: Convex indexes compare the whole array value, so this index cannot
  // look up files by a single contact ID; that would likely need a join table.
  .index('by_contact', ['contactIds'])
  .index('by_thread', ['threadId'])
  .index('by_previous_version', ['previousVersionId'])
  .searchIndex('search_files', {
    searchField: 'searchableText',
    filterFields: ['organizationId'],
  })
  .vectorIndex('vector_files', {
    vectorField: 'embedding',
    dimensions: 1536, // must match the embedding model's output size
    filterFields: ['organizationId'],
  })
Integration with existing systems
The semantic file system does not replace the existing mediaAssets table. Media assets are purpose-built for the email builder: images with width and height metadata, search, and tagging. Semantic files are a broader layer for organizational documents.
However, both use Convex _storage for binary data. In the future, the media library UI can surface semantic files alongside media assets — searching across both with a unified query.
The file system is a layer in the same architecture. Files are scoped to the organization, searchable through the same retrieval pipeline, governed by the same permissions model, and logged in the same audit trail.