Semantic File System

Technical architecture for Owlat's semantic file storage — embedding-based retrieval, auto-tagging, and version tracking with provenance.

Semantic File System Architecture

Owlat already stores media assets for email campaigns. The semantic file system extends this into a full organizational file layer — where every file gets embeddings, auto-tags, and conversation-linked provenance. Files become part of the agent's context, surfacing automatically when relevant.

How it works

Files enter the system through three channels:

  1. Direct upload — drag-and-drop in the dashboard (extends existing media library)
  2. Email attachment — attachments on inbound emails are indexed automatically
  3. Agent-generated — agents produce artifacts (reports, visualizations, drafts) stored as files

Every file goes through the same processing pipeline:

File uploaded / received
  → Store binary in Convex _storage (existing pattern)
  → Extract text content:
    - PDF: text extraction with a PDF parser (pdf-lib, already in deps, creates and edits PDFs but does not extract text)
    - Images: OCR via LLM vision (if available)
    - Documents: text extraction
  → Generate summary and auto-tags via AI SDK
  → Generate embedding for semantic search
  → Link to source (conversation, contact, uploader)
  → Store metadata in semanticFiles table
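The pipeline above can be sketched as a single orchestration function. This is a minimal sketch under stated assumptions, not Owlat's implementation: IncomingFile, PipelineSteps, ProcessedFile, and processFile are hypothetical names, and the AI-dependent steps are injected as plain async functions so the flow can be read and tested without real services. Falling back to the filename when no text can be extracted is also an assumption, made so that every record can carry the embedding the schema requires.

```typescript
// Hypothetical shapes for illustration; none of these names come from the Owlat codebase.
type SourceType = 'upload' | 'email_attachment' | 'agent_generated'

interface IncomingFile {
  storageId: string // ID returned after the binary is stored in Convex _storage
  filename: string
  mimeType: string
  sourceType: SourceType
}

// Text extraction, summarization, and embedding are injected dependencies.
interface PipelineSteps {
  extractText(file: IncomingFile): Promise<string | undefined>
  summarizeAndTag(text: string): Promise<{ summary: string; autoTags: string[] }>
  embed(text: string): Promise<number[]>
}

interface ProcessedFile extends IncomingFile {
  extractedText: string | undefined
  summary: string | undefined
  autoTags: string[]
  embedding: number[]
}

async function processFile(file: IncomingFile, steps: PipelineSteps): Promise<ProcessedFile> {
  // 1. Extract text; undefined when no extractor applies to this file type.
  const extractedText = await steps.extractText(file)

  // 2. Summary and auto-tags are only generated when there is text to read.
  const tagged = extractedText ? await steps.summarizeAndTag(extractedText) : undefined

  // 3. Embed the text, or the filename as a fallback (assumption, see lead-in).
  const embedding = await steps.embed(extractedText ?? file.filename)

  return {
    ...file,
    extractedText,
    summary: tagged?.summary,
    autoTags: tagged?.autoTags ?? [],
    embedding,
  }
}
```

Injecting the steps keeps the ordering of the pipeline separate from the choice of extractor or model, so either can change without touching the other.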

Retrieval

Natural language file retrieval using vector search:

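// "Find the contract we signed with Acme Corp"
// Note: in Convex, ctx.vectorSearch is only available inside actions.
const results = await ctx.vectorSearch('semanticFiles', 'vector_files', {
  vector: await generateEmbedding(query), // embed the natural-language query
  limit: 10,
  filter: (q) => q.eq('organizationId', orgId), // scope results to the caller's organization
})
// Each result carries only _id and _score; load the full documents in a follow-up query.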
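Conceptually, a vector search ranks files by how close their embeddings are to the query embedding. The brute-force sketch below illustrates that idea with cosine similarity over an in-memory list. It is not how Convex implements its index, and EmbeddedFile and rankBySimilarity are hypothetical names used only for illustration.

```typescript
// Hypothetical in-memory stand-in for rows of the semanticFiles table.
interface EmbeddedFile {
  id: string
  embedding: number[]
}

// Cosine similarity: 1 means same direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('embedding dimensions differ')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Score every file against the query and keep the `limit` best matches.
function rankBySimilarity(query: number[], files: EmbeddedFile[], limit: number) {
  return files
    .map((file) => ({ id: file.id, score: cosineSimilarity(query, file.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
}
```

A real index avoids scoring every row, but the ordering it returns follows the same principle: the closest embeddings come first.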

Contextual retrieval for agents

When the Agent Pipeline processes a message, the context retrieval step (Step 1) also searches for relevant files:

// Agent handling a contract question from Acme Corp
const relevantFiles = await ctx.vectorSearch('semanticFiles', 'vector_files', {
  vector: await generateEmbedding(messageContent),
  limit: 5,
  filter: (q) => q.eq('organizationId', orgId),
})

The agent receives not just knowledge graph entries but the actual source documents — contracts, invoices, proposals. This grounds responses in real artifacts rather than summaries.
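Because a Convex vector search returns only document IDs and similarity scores, the matching documents have to be loaded and turned into something the agent can read. The sketch below shows one way to do that join, assuming the documents have already been fetched. VectorHit mirrors the _id and _score fields of a search result; FileDoc, buildFileContext, and the 0.75 score cutoff are hypothetical choices for illustration.

```typescript
// Shape of one vector search result: an ID plus a similarity score.
interface VectorHit {
  _id: string
  _score: number
}

// Hypothetical subset of a semanticFiles document.
interface FileDoc {
  _id: string
  filename: string
  summary?: string
}

// Join hits with their documents, drop weak matches, and format one line per file.
function buildFileContext(hits: VectorHit[], docs: FileDoc[], minScore = 0.75): string {
  const byId = new Map<string, FileDoc>()
  docs.forEach((doc) => byId.set(doc._id, doc))

  const lines: string[] = []
  hits.forEach((hit) => {
    if (hit._score < minScore) return // assumed cutoff; tune against real data
    const doc = byId.get(hit._id)
    if (!doc) return // document was deleted between search and load
    lines.push(`- ${doc.filename} (score ${hit._score.toFixed(2)}): ${doc.summary ?? 'no summary'}`)
  })
  return lines.join('\n')
}
```

A score cutoff keeps loosely related files out of the prompt; without one, the top five results are always included even when none of them is relevant.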

For keyword-based file search in the UI:

const results = await ctx.db
  .query('semanticFiles')
  .withSearchIndex('search_files', (q) =>
    q.search('searchableText', searchQuery)
      .eq('organizationId', orgId)
  )
  .take(25)
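The search index reads a single searchableText field, so something has to assemble that field when a file is stored. The document does not say how; the sketch below is one plausible composition, assuming the field concatenates the title, filename, summary, tags, and extracted text. SearchableParts, buildSearchableText, and the length cap are all hypothetical.

```typescript
// Hypothetical subset of the semanticFiles fields that feed keyword search.
interface SearchableParts {
  filename: string
  title?: string
  summary?: string
  tags?: string[]
  autoTags?: string[]
  extractedText?: string
}

// Concatenate the human-readable fields into one string for the search index.
// maxChars is an arbitrary cap so very long documents do not bloat the record.
function buildSearchableText(file: SearchableParts, maxChars = 10000): string {
  const parts: Array<string | undefined> = [
    file.title,
    file.filename,
    file.summary,
    ...(file.tags ?? []),
    ...(file.autoTags ?? []),
    file.extractedText,
  ]
  return parts
    .filter((p): p is string => typeof p === 'string' && p.trim().length > 0)
    .join(' ')
    .slice(0, maxChars)
}
```

Putting short, high-signal fields such as the title and tags first means they survive the cap even when the extracted text is long.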

Auto-tagging

Every file gets two types of tags:

  • Auto-tags — generated by the AI from file content and the context in which it was shared. A PDF shared in a thread about "Q3 financials" with "Acme Corp" automatically inherits q3-financials and acme-corp tags.
  • Manual tags — user-applied tags for custom organization.
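For a thread topic such as "Q3 financials" to become the tag q3-financials, phrases need to be normalized into a consistent slug form. A minimal sketch of that normalization follows; toTagSlug is a hypothetical helper, and it assumes tags are restricted to lowercase ASCII letters, digits, and hyphens.

```typescript
// Normalize a free-text phrase into a lowercase, hyphen-separated tag.
function toTagSlug(phrase: string): string {
  return phrase
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse any run of other characters into one hyphen
    .replace(/^-+|-+$/g, '') // trim hyphens left at either end
}
```

Applying the same function to manual tags keeps the two tag types comparable, so "Acme Corp" typed by a user and acme-corp produced by the model end up as the same tag.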

Tags evolve as the organization's vocabulary evolves. The system tracks tag frequency and suggests merging similar tags (e.g., q3-finances and q3-financials).
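One simple way to find merge candidates is to compare tags by edit distance and propose keeping whichever tag is used more often. The sketch below takes that approach; it is an illustration rather than the system's actual method, and editDistance, suggestTagMerges, MergeSuggestion, and the 0.75 similarity threshold are all assumptions.

```typescript
// Levenshtein distance: minimum single-character insertions, deletions, or substitutions.
function editDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 characters of a and the first j of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

interface MergeSuggestion {
  keep: string // the more frequently used tag
  merge: string // the tag proposed for folding into `keep`
}

// Compare every pair of tags; suggest a merge when they are similar enough.
function suggestTagMerges(counts: Map<string, number>, minSimilarity = 0.75): MergeSuggestion[] {
  const tags = Array.from(counts.keys())
  const suggestions: MergeSuggestion[] = []
  for (let i = 0; i < tags.length; i++) {
    for (let j = i + 1; j < tags.length; j++) {
      const a = tags[i]
      const b = tags[j]
      // Normalize by the longer tag so short and long tags are treated alike.
      const similarity = 1 - editDistance(a, b) / Math.max(a.length, b.length)
      if (similarity < minSimilarity) continue
      const aWins = (counts.get(a) ?? 0) >= (counts.get(b) ?? 0)
      suggestions.push(aWins ? { keep: a, merge: b } : { keep: b, merge: a })
    }
  }
  return suggestions
}
```

With the example from the text, q3-finances and q3-financials are three edits apart on a 13-character tag, which clears the threshold. Pairwise comparison is quadratic in the number of tags, which is fine for an organization-sized vocabulary but would need bucketing for very large ones.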

Version tracking

Files are version-linked, not replaced:

Contract v1 (shared Feb 10 by Alice in "Acme negotiation" thread)
  → Contract v2 (shared Feb 18 by Bob in same thread, after legal review)
    → Contract v3 (shared Feb 25 by Alice, final signed version)

Each version records:

  • Who uploaded or shared it
  • Where — which conversation thread
  • Why — context from the surrounding messages
  • What changed — diff summary (for text-based files)

The previousVersionId field creates a linked list of versions. The UI shows the full version history with conversation context for each revision.
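Building the history for the UI means walking that linked list from the newest version back to the first. The sketch below does this over an in-memory map; FileVersion and versionHistory are hypothetical names, plain strings stand in for Convex document IDs, and the cycle guard is a defensive assumption rather than something the source describes.

```typescript
// Hypothetical subset of a semanticFiles document relevant to versioning.
interface FileVersion {
  _id: string
  version: number
  previousVersionId?: string
  uploadedBy?: string
}

// Follow previousVersionId links from the latest version back to the first,
// then return the chain oldest-first for display.
function versionHistory(latestId: string, byId: Map<string, FileVersion>): FileVersion[] {
  const history: FileVersion[] = []
  const seen = new Set<string>() // guards against accidental cycles in the links
  let currentId: string | undefined = latestId
  while (currentId !== undefined && !seen.has(currentId)) {
    seen.add(currentId)
    const file: FileVersion | undefined = byId.get(currentId)
    if (!file) break // dangling link: stop with what has been collected
    history.push(file)
    currentId = file.previousVersionId
  }
  return history.reverse()
}
```

In Convex each step of the walk would be a document lookup by ID, so the cost grows with the number of versions; the by_previous_version index supports the opposite direction, finding the successor of a given version.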

Schema

semanticFiles: defineTable({
  organizationId: v.string(),
  storageId: v.id('_storage'),
  filename: v.string(),
  mimeType: v.string(),
  fileSize: v.number(),
  // Semantic metadata
  title: v.optional(v.string()),
  summary: v.optional(v.string()),
  extractedText: v.optional(v.string()),
  tags: v.optional(v.array(v.string())),
  autoTags: v.optional(v.array(v.string())),
  // Provenance
  sourceType: v.union(
    v.literal('upload'),
    v.literal('email_attachment'),
    v.literal('agent_generated')
  ),
  sourceMessageId: v.optional(v.string()),
  uploadedBy: v.optional(v.string()),
  // Relationships
  contactIds: v.optional(v.array(v.id('contacts'))),
  threadId: v.optional(v.id('conversationThreads')),
  // Versioning
  version: v.number(),
  previousVersionId: v.optional(v.id('semanticFiles')),
  // Embedding for semantic search
  embedding: v.array(v.float64()),
  // Full-text search
  searchableText: v.optional(v.string()),
  createdAt: v.number(),
  updatedAt: v.number(),
})
  .index('by_organization', ['organizationId'])
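  // Caveat: Convex indexes an array field by its whole value, so by_contact cannot look up files by a single contact ID.
  .index('by_contact', ['contactIds'])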
  .index('by_thread', ['threadId'])
  .index('by_previous_version', ['previousVersionId'])
  .searchIndex('search_files', {
    searchField: 'searchableText',
    filterFields: ['organizationId'],
  })
  .vectorIndex('vector_files', {
    vectorField: 'embedding',
    dimensions: 1536,
    filterFields: ['organizationId'],
  })

Integration with existing systems

The semantic file system does not replace the existing mediaAssets table. Media assets are purpose-built for the email builder: images with width/height metadata, search, and tagging. Semantic files are a broader layer for organizational documents.

Both tables store their binary data in Convex _storage, so a future version of the media library UI could surface semantic files alongside media assets, searching across both with a unified query.

Files are not a separate product

The file system is a layer in the same architecture. Files are scoped to the organization, searchable through the same retrieval pipeline, governed by the same permissions model, and logged in the same audit trail.