
Documents API Reference

Upload, process, and search documents with AI-powered text extraction

Document Processing Pipeline

  • Supported formats: PDF, DOCX, TXT, MD (max 50MB per file)
  • Processing: Text extraction → Content segmentation → Search indexing
  • Search: Semantic similarity search for finding content by meaning
  • Chat Integration: Documents can be added to chat sessions for grounded AI responses

Document Upload

POST /api/v2/document-uploads/request-presigned-url

Generate a presigned URL for direct S3 document upload. The URL expires in 10 minutes.

Request

{
"filename": "report.pdf",
"content_type": "application/pdf",
"size_bytes": 2048576,
"storage_target": "default",
"idempotency_key": "unique-retry-key-123"
}
// Supported content_type values:
// - application/pdf
// - application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
// - text/plain
// - text/markdown
//
// storage_target: "default" | "custom"
// idempotency_key: optional, 1-128 chars, alphanumeric with _-. allowed

Response

{
"upload_url": "https://nyc3.digitaloceanspaces.com/bucket/presigned-url...",
"upload_method": "PUT",
"upload_fields": null,
"upload_headers": {
"Content-Type": "application/pdf"
},
"object_key": "documents/{tenant_id}/{uuid}/report.pdf",
"expires_at": "2025-01-15T10:40:00",
"max_size_bytes": 52428800,
"storage_target": "default",
"bucket_name": null
}
// upload_fields: populated for POST uploads (null for PUT)
// upload_headers: populated for PUT uploads (null for POST)
// storage_target: "default" or "custom"
// bucket_name: populated only for custom storage targets
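The request constraints listed for this endpoint can be checked client-side before the call is made. The following is a minimal sketch, not part of the API: the helper name `build_presign_request` is hypothetical, the size cap is taken from `max_size_bytes` in the sample response, and rejecting a size of zero is an assumption.

```python
import re

# Content types listed in this reference.
SUPPORTED_CONTENT_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "text/plain",
    "text/markdown",
}
MAX_SIZE_BYTES = 52428800  # max_size_bytes from the sample response (50 MB)
# 1-128 chars: alphanumeric plus "_", "-", "." (per the idempotency_key note).
IDEMPOTENCY_KEY_RE = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def build_presign_request(filename, content_type, size_bytes,
                          storage_target="default", idempotency_key=None):
    """Hypothetical helper: validate inputs and return the JSON body as a dict."""
    if content_type not in SUPPORTED_CONTENT_TYPES:
        raise ValueError(f"unsupported content_type: {content_type}")
    if not 0 < size_bytes <= MAX_SIZE_BYTES:  # zero-byte rejection is assumed
        raise ValueError("size_bytes must be between 1 and 52428800")
    if storage_target not in ("default", "custom"):
        raise ValueError("storage_target must be 'default' or 'custom'")
    body = {
        "filename": filename,
        "content_type": content_type,
        "size_bytes": size_bytes,
        "storage_target": storage_target,
    }
    if idempotency_key is not None:
        if not IDEMPOTENCY_KEY_RE.fullmatch(idempotency_key):
            raise ValueError("invalid idempotency_key")
        body["idempotency_key"] = idempotency_key
    return body
```

The returned dict can be sent as the JSON body with any HTTP client; the file itself is then sent to `upload_url` using the `upload_method` and `upload_headers` from the response.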

POST /api/v2/document-uploads/confirm

Confirm document upload and trigger text extraction processing (returns 201)

Request

{
"object_key": "documents/{tenant_id}/{uuid}/report.pdf",
"size_bytes": 2048576,
"content_type": "application/pdf",
"checksum": "sha256:abc123def456..."
}
// checksum is optional; if omitted, a hash is generated from object_key:size_bytes

Response

{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"object_key": "documents/{tenant_id}/{uuid}/report.pdf",
"filename": "report.pdf",
"document_type": "pdf",
"status": "queued",
"confirmed": true,
"deduplicated": false,
"is_idempotent_retry": false
}
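The `checksum` field in the confirm request can be computed locally before confirming. This sketch assumes the value is the prefix `sha256:` followed by the lowercase hex digest, which matches the shape of the sample but is not confirmed by this reference; both function names are hypothetical.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Return 'sha256:<hex digest>' for in-memory bytes (hex encoding is assumed)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def file_checksum(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large documents need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return "sha256:" + digest.hexdigest()
```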

GET /api/v2/document-uploads/{document_id}/status

Check document processing status

Response

{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "report.pdf",
"document_type": "pdf",
"text_extraction_status": "completed",
"page_count": 15,
"chunk_count": 42,
"created_at": "2025-01-15T10:30:00Z",
"processing_started_at": "2025-01-15T10:30:05Z",
"completed_at": "2025-01-15T10:32:00Z"
}
// text_extraction_status values: pending | processing | completed | failed
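Because processing is asynchronous, a client typically polls this endpoint until `text_extraction_status` reaches `completed` or `failed`. A minimal polling sketch follows; `fetch_status` is a caller-supplied, hypothetical callable that performs the GET and returns the parsed JSON, and the interval and timeout values are illustrative rather than recommended by the API.

```python
import time

# Statuses after which no further change is expected.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_extraction(fetch_status, document_id, interval_s=2.0,
                        timeout_s=300.0, sleep=time.sleep):
    """Poll until text_extraction_status is terminal or the timeout elapses.

    fetch_status(document_id) is a hypothetical callable that performs
    GET /api/v2/document-uploads/{document_id}/status and returns the JSON dict.
    """
    waited = 0.0
    while True:
        status = fetch_status(document_id)
        state = status["text_extraction_status"]
        if state in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"document {document_id} still {state}")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.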

GET /api/v2/document-uploads/quota-check

Check document upload quota before starting

Request

// Query parameters:
?file_count=5 // required

Response

{
"can_proceed": true,
"requested": 5,
"available": 995,
"monthly_limit": 1000,
"current_usage": 5,
"message": null
}
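A quota check can gate an upload before any presigned URLs are requested. The sketch below builds the query URL and interprets the response; the base URL `https://api.example.com` and both function names are hypothetical, and the remaining-quota arithmetic is inferred from the sample values rather than stated by the API.

```python
from urllib.parse import urlencode


def quota_check_url(base_url: str, file_count: int) -> str:
    """Build the quota-check URL with the required file_count query parameter."""
    query = urlencode({"file_count": file_count})
    return f"{base_url.rstrip('/')}/api/v2/document-uploads/quota-check?{query}"


def ensure_quota(quota: dict) -> int:
    """Return the documents left after this upload, or raise if it cannot proceed."""
    if not quota["can_proceed"]:
        raise RuntimeError(quota.get("message") or "upload quota exceeded")
    # Inferred from the sample: available minus requested is what remains.
    return quota["available"] - quota["requested"]
```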

Batch Document Upload

POST /api/v2/document-uploads/batch-prepare

Prepare batch document upload (1-100 files per batch)

Request

{
"files": [
{
"filename": "report1.pdf",
"size_bytes": 1048576,
"content_type": "application/pdf",
"idempotency_key": "file-1-retry-key",
"storage_target": "default"
},
{
"filename": "manual.docx",
"size_bytes": 2097152,
"content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}
],
"additional_params": {}
}
// Note: custom storage_target is not supported for batch uploads.
// Use the single-file endpoint for custom storage.

Response

{
"batch_id": "batch_doc_550e8400",
"upload_plan": {
"strategy": "parallel",
"max_concurrent": 5,
"retry_policy": {"max_retries": 3, "backoff_ms": 1000},
"timeout_per_upload": 300
},
"presigned_urls": [
{
"file_index": 0,
"filename": "report1.pdf",
"upload_url": "https://nyc3.digitaloceanspaces.com/...",
"upload_method": "PUT",
"upload_fields": null,
"upload_headers": null,
"object_key": "documents/batch_doc_550e8400/report1.pdf",
"expires_at": "2025-01-15T10:40:00Z",
"upload_intent_id": "intent_abc123"
}
],
"total_size_bytes": 3145728,
"estimated_time_seconds": 45,
"expires_at": "2025-01-15T10:40:00Z"
}
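The `upload_plan` in this response suggests how a client might schedule the uploads. The following sketch honors `max_concurrent` and `retry_policy` using a thread pool. It assumes `backoff_ms` is a fixed delay between attempts (the reference does not say whether the backoff grows), it ignores `timeout_per_upload`, and `upload_one` is a caller-supplied, hypothetical callable that sends one file and raises on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_upload_plan(plan, presigned_urls, upload_one, sleep=time.sleep):
    """Upload each entry with bounded concurrency and fixed-delay retries.

    upload_one(entry) is a hypothetical callable that sends the file to
    entry["upload_url"] and raises an exception on failure.
    Returns {object_key: None on success, or the last error message}.
    """
    retries = plan["retry_policy"]["max_retries"]
    backoff_s = plan["retry_policy"]["backoff_ms"] / 1000  # assumed fixed delay

    def attempt(entry):
        error = None
        for tries in range(retries + 1):  # first try plus max_retries retries
            try:
                upload_one(entry)
                return entry["object_key"], None
            except Exception as exc:  # keep the last failure for the caller
                error = str(exc)
                if tries < retries:
                    sleep(backoff_s)
        return entry["object_key"], error

    with ThreadPoolExecutor(max_workers=plan["max_concurrent"]) as pool:
        return dict(pool.map(attempt, presigned_urls))
```

The resulting mapping lines up with the per-file `success` and `error_message` fields expected by batch-confirm.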

POST /api/v2/document-uploads/batch-confirm

Confirm batch document uploads and trigger processing

Request

{
"batch_id": "batch_doc_550e8400",
"confirmations": [
{
"object_key": "documents/batch_doc_550e8400/report1.pdf",
"success": true,
"file_size": 1048576,
"checksum": "sha256:abc123...",
"error_message": null
}
]
}

Response

{
"batch_id": "batch_doc_550e8400",
"successful_uploads": 2,
"failed_uploads": 0,
"processing_status": "queued",
"failed_files": null,
"message": "All 2 documents confirmed and queued for processing",
"documents": [
{
"document_id": "doc_550e8400_001",
"object_key": "documents/batch_doc_550e8400/report1.pdf",
"filename": "report1.pdf",
"document_type": "pdf",
"text_extraction_status": "pending",
"created_at": "2025-01-15T10:30:00Z"
}
]
}
// processing_status values: queued | partial | failed | already_processed
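The batch-confirm body can be assembled from per-file upload outcomes. Note that the sample uses `file_size` here, whereas single-file confirm uses `size_bytes`. In this sketch the helper name and the shape of the `uploads` input are hypothetical, and `checksum` is treated as optional on the assumption that it behaves as in single-file confirm.

```python
def build_batch_confirm(batch_id, uploads):
    """Build the batch-confirm JSON body.

    uploads: list of dicts with "object_key", "file_size", and optionally
    "checksum" and "error" (hypothetical input shape for this sketch).
    """
    confirmations = []
    for item in uploads:
        error = item.get("error")
        entry = {
            "object_key": item["object_key"],
            "success": error is None,
            "file_size": item["file_size"],
            "error_message": error,
        }
        if item.get("checksum"):  # assumed optional, as in single-file confirm
            entry["checksum"] = item["checksum"]
        confirmations.append(entry)
    return {"batch_id": batch_id, "confirmations": confirmations}
```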

GET /api/v2/document-uploads/batch/{batch_id}/status

Get batch document upload status

Response

{
"batch_id": "batch_doc_550e8400",
"status": "completed",
"total_documents": 2,
"confirmed_documents": 2,
"processed_documents": 2,
"failed_documents": 0,
"created_at": "2025-01-15T10:30:00Z",
"updated_at": "2025-01-15T10:35:00Z",
"documents": [
{
"document_id": "doc_550e8400_001",
"object_key": "documents/batch_doc_550e8400/report1.pdf",
"filename": "report1.pdf",
"document_type": "pdf",
"text_extraction_status": "completed",
"created_at": "2025-01-15T10:30:00Z"
}
]
}
// status values: pending | uploading | processing | completed | failed | expired

Document Management

GET /api/v2/documents

List all documents with pagination and filtering

Request

// Query parameters:
?page=1 // optional, default: 1
&page_size=20 // optional, default: 20, 1-100
&status_filter=completed // optional, options: pending | processing | completed | failed

Response

{
"documents": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "safety_manual.pdf",
"content_type": "application/pdf",
"size_bytes": 2048576,
"page_count": 45,
"chunk_count": 128,
"extraction_status": "completed",
"created_at": "2025-01-15T10:30:00Z"
}
],
"total_count": 15,
"page": 1,
"page_size": 20
}
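Walking the full listing means requesting successive pages until `total_count` documents have been seen. A sketch of that loop follows; `fetch_page` is a caller-supplied, hypothetical callable that performs the GET with the given query parameters and returns the parsed JSON.

```python
def iter_documents(fetch_page, page_size=20, status_filter=None):
    """Yield every document, requesting pages until total_count is reached.

    fetch_page(params) is a hypothetical callable that performs
    GET /api/v2/documents with the query parameters and returns the JSON dict.
    """
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    page, seen = 1, 0
    while True:
        params = {"page": page, "page_size": page_size}
        if status_filter is not None:
            params["status_filter"] = status_filter
        data = fetch_page(params)
        documents = data["documents"]
        yield from documents
        seen += len(documents)
        # Stop on an empty page too, in case total_count changes mid-walk.
        if not documents or seen >= data["total_count"]:
            return
        page += 1
```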

GET /api/v2/documents/{document_id}

Get document metadata

Response

{
"id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "safety_manual.pdf",
"content_type": "application/pdf",
"size_bytes": 2048576,
"page_count": 45,
"chunk_count": 128,
"extraction_status": "completed",
"created_at": "2025-01-15T10:30:00Z"
}

PATCH /api/v2/documents/{document_id}

Update document metadata (title and tags). Only provided fields are updated.

Request

{
"title": "Updated Safety Manual 2025",
"tags": ["safety", "compliance", "manual"]
}
// Both fields are optional - send only what you want to update
// title: max 255 characters
// tags: max 40 tags, each tag max 50 characters

Response

{
"id": "550e8400-e29b-41d4-a716-446655440000",
"title": "Updated Safety Manual 2025",
"tags": ["safety", "compliance", "manual"],
"updated_at": "2025-01-15T11:00:00Z"
}
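Since only the provided fields are updated, a client can build the PATCH body from whichever values it has and check the documented length limits first. This is a sketch with a hypothetical helper name; rejecting an empty update is a choice made here, not documented API behavior.

```python
def build_metadata_patch(title=None, tags=None):
    """Return a PATCH body containing only the fields that were provided."""
    body = {}
    if title is not None:
        if len(title) > 255:
            raise ValueError("title exceeds 255 characters")
        body["title"] = title
    if tags is not None:
        if len(tags) > 40:
            raise ValueError("more than 40 tags")
        if any(len(tag) > 50 for tag in tags):
            raise ValueError("a tag exceeds 50 characters")
        body["tags"] = list(tags)
    if not body:  # assumption: an empty update is treated as a caller error
        raise ValueError("provide title, tags, or both")
    return body
```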

GET /api/v2/documents/{document_id}/text

Get full extracted text from document

Response

{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "safety_manual.pdf",
"text": "SAFETY MANUAL\n\nChapter 1: Introduction\n\nThis manual provides...",
"page_count": 45,
"metadata": {
"extraction_method": "pdfplumber",
"language": "en"
}
}

GET /api/v2/documents/{document_id}/chunks

Get all chunks from a document

Request

// Query parameters:
?include_embeddings=true // optional, default: false

Response

{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"chunks": [
{
"chunk_id": "chunk_001",
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"document_filename": "",
"chunk_index": 0,
"content": "Safety inspections must be conducted quarterly...",
"page_numbers": [12, 13],
"heading_hierarchy": ["Chapter 3", "Inspections"],
"similarity_score": 1.0,
"metadata": {
"token_count": 256,
"chunk_type": "paragraph",
"embedding_status": "completed"
}
}
],
"total_chunks": 128,
"status_counts": {"completed": 128, "pending": 0, "failed": 0}
}
// status_counts is only included when include_embeddings=true
// similarity_score is always 1.0 (not from a search query)
// embedding_status in metadata is null when include_embeddings=false
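Each chunk carries `chunk_index` and `page_numbers`, so the response can be reorganized on the client, for example to see which chunks touch a given page. The helper below is a hypothetical illustration of that regrouping and relies only on fields shown in the sample response.

```python
from collections import defaultdict


def chunks_by_page(chunks):
    """Map each page number to the chunk_index values that touch it."""
    pages = defaultdict(list)
    for chunk in sorted(chunks, key=lambda c: c["chunk_index"]):
        for page in chunk["page_numbers"]:
            pages[page].append(chunk["chunk_index"])
    return dict(pages)
```

A chunk that spans two pages, like the sample chunk on pages 12 and 13, appears under both.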

GET /api/v2/documents/{document_id}/download

Download original document file (redirects to presigned S3 URL)

Response

// Returns 302 Redirect to presigned S3 URL
// URL expires in 1 hour
// Content-Disposition header set for download

DELETE /api/v2/documents/{document_id}

Delete document with cascade cleanup (chunks and S3 file)

Response

// Returns 204 No Content on success
// Note: Cannot delete while document is being processed
// Returns 409 Conflict if text extraction or embedding in progress

POST /api/v2/documents/batch-delete

Delete multiple documents at once (max 100 per batch)

Request

{
"document_ids": [
"550e8400-e29b-41d4-a716-446655440000",
"660f9500-f39c-52e5-b827-557766550111"
]
}

Response

{
"deleted": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "deleted",
"message": null,
"deleted_at": "2025-01-15T10:30:00Z"
}
],
"skipped": [],
"failed": [
{
"id": "660f9500-f39c-52e5-b827-557766550111",
"status": "failed",
"message": "Document is currently being processed",
"deleted_at": null
}
],
"summary": {
"total": 2,
"deleted": 1,
"skipped": 0,
"failed": 1
}
}
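Deleting more than 100 documents requires splitting the IDs across several requests, and the per-item results show which IDs did not go through. A minimal sketch with hypothetical helper names; whether a failed ID is worth retrying depends on its `message`, which this sketch does not inspect.

```python
def delete_batches(document_ids, batch_size=100):
    """Split IDs into request bodies of at most 100 documents each."""
    if not 1 <= batch_size <= 100:
        raise ValueError("batch_size must be between 1 and 100")
    ids = list(document_ids)
    return [{"document_ids": ids[i:i + batch_size]}
            for i in range(0, len(ids), batch_size)]


def failed_ids(response):
    """Collect the IDs reported under "failed" in a batch-delete response."""
    return [item["id"] for item in response["failed"]]
```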

Semantic Search

AI-Powered Search

Search uses semantic similarity to find content by meaning rather than by exact keywords: "damaged equipment" matches "broken machinery" even though the two phrases share no words. Multiple embeddings power search across both documents and image descriptions.

POST /api/v2/documents/search

Search document chunks by semantic similarity

Request

{
"query": "safety inspection requirements",
"limit": 20,
"similarity_threshold": 0.3,
"document_ids": ["550e8400-e29b-41d4-a716-446655440000"]
}
// document_ids is optional - omit to search all documents

Response

{
"query": "safety inspection requirements",
"results": [
{
"chunk_id": "chunk_001",
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"document_filename": "safety_manual.pdf",
"chunk_index": 5,
"content": "Safety inspections must be conducted quarterly...",
"page_numbers": [12, 13],
"heading_hierarchy": ["Chapter 3", "Inspections", "Schedule"],
"similarity_score": 0.87,
"metadata": {"chunk_type": "paragraph"}
}
],
"total_count": 15,
"search_time_ms": 45
}
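A search request can be validated against the documented query-length and result-count limits before it is sent, and the results can be filtered further on the client. In this sketch both helper names are hypothetical, and the default `limit` and `similarity_threshold` simply mirror the sample request; the API's actual defaults are not stated in this reference.

```python
def build_search_request(query, limit=20, similarity_threshold=0.3,
                         document_ids=None):
    """Return the search JSON body after checking the documented limits."""
    if not 1 <= len(query) <= 1000:
        raise ValueError("query must be 1-1000 characters")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    body = {
        "query": query,
        "limit": limit,
        "similarity_threshold": similarity_threshold,
    }
    if document_ids:  # omit the field to search all documents
        body["document_ids"] = list(document_ids)
    return body


def top_hits(response, min_score):
    """Keep results scoring at least min_score, highest similarity first."""
    hits = [r for r in response["results"] if r["similarity_score"] >= min_score]
    return sorted(hits, key=lambda r: r["similarity_score"], reverse=True)
```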

Limits & Constraints

Document Limits

  • Max file size: 50MB per document
  • Batch size: 1-100 documents per batch
  • Concurrent batches: 200 active batches per tenant
  • Presigned URL expiry: 10 minutes
  • Search query length: 1-1000 characters
  • Search results: 1-100 per query
  • Tags per document: max 40 tags, each max 50 characters
  • Title length: max 255 characters
  • Batch delete: max 100 documents per request