Uploading Documents
Upload, process, and search documents with AI-powered text extraction and semantic search
One Resource for Every Media Type
Documents share the unified /api/v2/files/* surface with images and videos — same upload endpoints, same listing endpoint, same deletion. The server detects the file type from content and routes the document through extraction, chunking, and embedding automatically.
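Content-based detection usually means sniffing magic bytes rather than trusting the file extension. A rough sketch of the idea using standard signatures — this is an illustration, not the server's actual implementation:

```python
# Standard magic-byte signatures; "document" is also the fallback for
# plain text / markdown, which have no signature.
def sniff_media_type(data: bytes) -> str:
    if data.startswith(b"%PDF-"):
        return "document"   # PDF
    if data.startswith(b"PK\x03\x04"):
        return "document"   # DOCX (ZIP container)
    if data.startswith(b"\x89PNG\r\n\x1a\n") or data.startswith(b"\xff\xd8\xff"):
        return "image"      # PNG / JPEG
    if len(data) >= 12 and data[4:8] == b"ftyp":
        return "video"      # MP4 family
    return "document"       # plain text, markdown, unknown

print(sniff_media_type(b"%PDF-1.7 ..."))  # document
```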
RAG-Ready Documents
Documents are automatically processed through a pipeline: text extraction (PDF/DOCX parsing) → chunking (splitting into searchable segments) → embedding (generating vectors for semantic search). This enables semantic search and RAG integration with chat.
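The chunking step can be pictured as a sliding window with overlap, so text that spans a chunk boundary still appears whole in at least one chunk. A toy sketch with arbitrary sizes (the real chunker runs server-side and its parameters are not part of this API):

```python
# Fixed-size windows with overlap. step = size - overlap, so consecutive
# chunks share `overlap` characters.
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("word " * 200)   # 1000 characters of input
print(len(chunks), len(chunks[0]))   # 7 200
```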
Supported Formats
.pdf, .docx, .txt, and .md. Maximum 50 MB per document via streaming. Larger documents can use the presigned upload flow under /api/v2/files/uploads.
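A minimal client-side sketch of routing on that limit — the route labels are illustrative, and the presigned SDK call itself is not shown in this section:

```python
# Stream small files; switch to the presigned flow above 50 MB.
STREAM_LIMIT = 50 * 1024 * 1024  # documented streaming cap

def upload_route(size_bytes: int) -> str:
    return "streaming" if size_bytes <= STREAM_LIMIT else "presigned"

print(upload_route(10 * 1024 * 1024))    # streaming
print(upload_route(120 * 1024 * 1024))   # presigned
```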
Quick Start
Upload a Document
```python
async with Scopix(api_key="scopix_...") as client:
    result = await client.files.upload("report.pdf")
    print(f"File ID: {result.image_id}")  # image_id is the unified file ID
    print(f"Filename: {result.filename}")

    # Poll the unified processing status for extraction details
    status = await client.files.get_processing_status(result.image_id)
    print(f"Extraction: {status.text_extraction_status}")
    print(f"Pages: {status.page_count}")
    print(f"Chunks: {status.chunk_count}")
```

Upload Options (SDK)
```python
result = await client.files.upload(
    "report.pdf",
    folder_id=None,               # Optional folder UUID
    project_id=None,              # Optional project workspace
    storage_target="default",     # "default" or "custom" (BYOB)
    skip_duplicates=True,         # Return existing file_id on hash match
    content_category="document",  # Tailors AI processing
)
```

Batch Upload
Upload Multiple Documents
```python
# upload_batch sends one multipart/form-data request with all files.
# Returns BatchUploadResults — a list subclass of UploadResult; iterate directly.
results = await client.files.upload_batch([
    "report1.pdf",
    "report2.docx",
    "notes.txt",
])

print(f"Uploaded {len(results)} documents")
for r in results:
    print(f"  {r.filename}: {r.image_id}")

# Helper methods for batch inspection
if results.has_failures:
    for r in results.failed():
        print(f"FAIL {r.filename}: {r.description_error}")
print(results.summary())  # e.g. "3 succeeded"
```

Processing Status
Check Processing Status
```python
status = await client.files.get_processing_status("550e8400-...")

print(f"Extraction: {status.text_extraction_status}")  # pending | processing | completed | failed
print(f"Pages: {status.page_count}")
print(f"Chunks: {status.chunk_count}")
```

Per-Page Digitization
For PDF documents, the digitization pipeline returns per-page structural elements (headings, paragraphs, tables, key-value pairs) with normalized bounding boxes.
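To draw or crop against a rendered page, normalized boxes scale by the page's pixel size. A sketch assuming an `(x, y, w, h)` fraction layout — the actual field names come from the element schema, so check that first:

```python
# Normalized coordinates are fractions of page width/height in [0, 1];
# multiply by the rendered page size to get pixels.
def to_pixels(bbox: tuple[float, float, float, float],
              page_w: int, page_h: int) -> tuple[int, int, int, int]:
    x, y, w, h = bbox
    return (round(x * page_w), round(y * page_h),
            round(w * page_w), round(h * page_h))

print(to_pixels((0.1, 0.25, 0.5, 0.1), page_w=1700, page_h=2200))
# (170, 550, 850, 220)
```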
```python
# Lightweight status (no element data)
status = await client.files.get_digitization_status("550e8400-...")
print(status["status"])  # pending | processing | completed | failed

# Full digitization (all pages)
result = await client.files.get_digitization("550e8400-...")
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['element_count']} elements")

# Single page
page = await client.files.get_digitization_page("550e8400-...", page_number=2)
for el in page["elements"]:
    print(f"  {el['type']}: {el['content'][:60]}...")
```

Semantic Search
AI-Powered Search
Search uses semantic similarity — search by meaning, not just keywords. "damaged equipment" will find content about "broken machinery" even if those exact words aren't present.
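Queries and chunks are compared as embedding vectors, typically by cosine similarity, and `similarity_threshold` drops low-scoring chunks. A toy illustration with made-up 3-dimensional "embeddings" (real embeddings have hundreds of dimensions):

```python
import math

# Cosine similarity: dot product of the vectors divided by the product
# of their magnitudes. Close meanings -> close directions -> high score.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

damaged_equipment = [0.9, 0.1, 0.2]   # pretend embedding
broken_machinery  = [0.8, 0.2, 0.3]   # similar meaning, similar vector
quarterly_revenue = [0.1, 0.9, 0.1]   # unrelated topic

print(round(cosine(damaged_equipment, broken_machinery), 2))   # high
print(round(cosine(damaged_equipment, quarterly_revenue), 2))  # low, below a 0.3 threshold
```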
Search Documents
```python
results = await client.files.search(
    query="safety inspection requirements",
    limit=20,
    similarity_threshold=0.3,
)

for chunk in results.results:
    print(f"Document: {chunk.document_filename}")
    print(f"Score: {chunk.score:.2f}")
    print(f"Content: {chunk.content[:200]}...")
```

Search Specific Documents
```python
results = await client.files.search(
    query="compliance requirements",
    document_ids=["doc_abc123", "doc_def456"],
    limit=10,
)

for chunk in results.results:
    print(f"{chunk.document_filename}: {chunk.content[:100]}...")
```

Document Management
Documents are managed through the same unified files resource as images and videos.
List, Get, Download, Delete
```python
# List documents only — filter by media_type
files = await client.files.list(media_types=["document"], limit=20)
print(f"Total: {files.total_count}")
for f in files.items:
    print(f"  {f.filename} ({f.document_type})")

# Get document details
doc = await client.files.get("550e8400-...")
print(f"Filename: {doc.filename}, Pages: {doc.page_count}")

# Get extracted text
text_result = await client.files.get_text("550e8400-...")
print(f"Text length: {len(text_result['text'])} characters")

# Get chunks (for RAG debugging)
chunks_result = await client.files.get_chunks("550e8400-...")
print(f"Total chunks: {chunks_result['total_chunks']}")

# Get a temporary download URL for the original file
download_url = await client.files.download_url("550e8400-...")

# Delete document and all chunks
await client.files.delete("550e8400-...")
```

Documents in Chat (RAG)
Automatic Document Access
The chat system's document search agent has access to all your uploaded documents automatically. There is no need to explicitly attach documents — the AI searches relevant documents based on your query.
Chat with Documents
```python
async with Scopix(api_key="scopix_...") as client:
    async with client.chat_session() as session:
        response = await session.send(
            "What are the key safety requirements mentioned in the documents?"
        )
        print(response.content)

        response2 = await session.send("Which section covers equipment maintenance?")
        print(response2.content)
```

Deduplication
Documents are deduplicated by SHA-256 content hash — safe to retry failed uploads.
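Because the key is a plain SHA-256 over the file bytes, you can compute it locally with the standard library and spot a duplicate before sending anything:

```python
import hashlib

# Stream the file in 1 MiB blocks so large documents don't load into memory.
def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()
```

Comparing this digest against hashes of files you have already uploaded (e.g. in a local index you maintain — such an index is not part of the SDK) avoids re-sending bytes the server will deduplicate anyway.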
```python
r1 = await client.files.upload("report.pdf", skip_duplicates=True)
r2 = await client.files.upload("report.pdf", skip_duplicates=True)
# r2.deduplicated == True; r2.image_id references the same document as r1
# (populated on presigned/multipart completions; may be None on streaming uploads)
```

Limits & Quotas
- Streaming max file size: 50 MB per document (use presigned upload for larger)
- Batch size: 1–100 documents per batch (tier-dependent)
- Concurrent batches: 200 active batches per tenant
- Supported formats: PDF, DOCX, TXT, MD
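A preflight helper that mirrors these limits client-side can fail fast before any network round trip. The constants restate the table above; the batch cap is tier-dependent, so treat 100 as an upper bound:

```python
import os

MAX_STREAM_BYTES = 50 * 1024 * 1024
MAX_BATCH = 100
ALLOWED_EXTS = {".pdf", ".docx", ".txt", ".md"}

def preflight(paths: list[str], sizes: list[int]) -> None:
    """Raise ValueError before uploading if a documented limit is exceeded."""
    if not 1 <= len(paths) <= MAX_BATCH:
        raise ValueError(f"batch must contain 1-{MAX_BATCH} documents")
    for path, size in zip(paths, sizes):
        if os.path.splitext(path)[1].lower() not in ALLOWED_EXTS:
            raise ValueError(f"unsupported format: {path}")
        if size > MAX_STREAM_BYTES:
            raise ValueError(f"{path} exceeds 50 MB; use presigned upload")
```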

