Documentation

Uploading Documents

Upload, process, and search documents with AI-powered text extraction and semantic search

One Resource for Every Media Type

Documents share the unified /api/v2/files/* surface with images and videos — same upload endpoints, same listing endpoint, same deletion. The server detects the file type from content and routes the document through extraction, chunking, and embedding automatically.

RAG-Ready Documents

Documents are automatically processed through a pipeline: text extraction (PDF/DOCX parsing) → chunking (splitting into searchable segments) → embedding (generating vectors for semantic search). This enables semantic search and RAG integration with chat.
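
The chunking step runs server-side, but its effect is easy to picture. The sketch below is an illustrative fixed-size chunker with overlap, not the service's actual algorithm (the real chunk sizes and boundary rules are not documented here):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping segments, as RAG pipelines typically do."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

extracted = "A" * 450  # stand-in for extracted document text
segments = chunk_text(extracted, chunk_size=200, overlap=50)
print(len(segments))  # 3 chunks: [0:200], [150:350], [300:450]
```

The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk, which is why chunk counts are usually a bit higher than `len(text) / chunk_size`.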

Supported Formats

.pdf, .docx, .txt, and .md. Streaming uploads accept up to 50 MB per document; larger documents can use the presigned upload flow under /api/v2/files/uploads.

Quick Start

Upload a Document

```python
async with Scopix(api_key="scopix_...") as client:
    result = await client.files.upload("report.pdf")
    print(f"File ID: {result.image_id}")
    print(f"Filename: {result.filename}")

    # Poll the unified processing status for extraction details
    status = await client.files.get_processing_status(result.image_id)
    print(f"Extraction: {status.text_extraction_status}")
    print(f"Pages: {status.page_count}")
    print(f"Chunks: {status.chunk_count}")
```
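
Extraction runs asynchronously, so callers usually poll until the status leaves pending/processing. A generic polling helper, sketched with an injected status function so it is not tied to any particular SDK method (the interval and timeout values are illustrative):

```python
import asyncio
from typing import Awaitable, Callable

async def wait_for_completion(
    get_status: Callable[[], Awaitable[str]],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Poll get_status() until it returns a terminal state or we time out."""
    elapsed = 0.0
    while elapsed < timeout:
        state = await get_status()
        if state in ("completed", "failed"):
            return state
        await asyncio.sleep(interval)
        elapsed += interval
    raise TimeoutError(f"extraction not terminal after {timeout}s")
```

In practice `get_status` would wrap `client.files.get_processing_status(file_id)` and return `status.text_extraction_status`.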

Upload Options (SDK)

```python
result = await client.files.upload(
    "report.pdf",
    folder_id=None,               # Optional folder UUID
    project_id=None,              # Optional project workspace
    storage_target="default",     # "default" or "custom" (BYOB)
    skip_duplicates=True,         # Return existing file_id on hash match
    content_category="document",  # Tailors AI processing
)
```

Batch Upload

Upload Multiple Documents

```python
# upload_batch sends one multipart/form-data request with all files.
# Returns BatchUploadResults — a list subclass whose items are UploadResult
# objects, so you can iterate over it directly.
results = await client.files.upload_batch([
    "report1.pdf",
    "report2.docx",
    "notes.txt",
])
print(f"Uploaded {len(results)} documents")
for r in results:
    print(f"  {r.filename}: {r.image_id}")

# Helper methods for batch inspection
if results.has_failures:
    for r in results.failed():
        print(f"FAIL {r.filename}: {r.description_error}")
print(results.summary())  # e.g. "3 succeeded"
```

Processing Status

Check Processing Status

```python
status = await client.files.get_processing_status("550e8400-...")
print(f"Extraction: {status.text_extraction_status}")  # pending | processing | completed | failed
print(f"Pages: {status.page_count}")
print(f"Chunks: {status.chunk_count}")
```

Per-Page Digitization

For PDF documents, the digitization pipeline returns per-page structural elements (headings, paragraphs, tables, key-value pairs) with normalized bounding boxes.

```python
# Lightweight status (no element data)
status = await client.files.get_digitization_status("550e8400-...")
print(status["status"])  # pending | processing | completed | failed

# Full digitization (all pages)
result = await client.files.get_digitization("550e8400-...")
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['element_count']} elements")

# Single page
page = await client.files.get_digitization_page("550e8400-...", page_number=2)
for el in page["elements"]:
    print(f"  {el['type']}: {el['content'][:60]}...")
```

Semantic Search

AI-Powered Search

Search uses semantic similarity — search by meaning, not just keywords. "damaged equipment" will find content about "broken machinery" even if those exact words aren't present.

Search Documents

```python
results = await client.files.search(
    query="safety inspection requirements",
    limit=20,
    similarity_threshold=0.3,
)
for chunk in results.results:
    print(f"Document: {chunk.document_filename}")
    print(f"Score: {chunk.score:.2f}")
    print(f"Content: {chunk.content[:200]}...")
```

Search Specific Documents

```python
results = await client.files.search(
    query="compliance requirements",
    document_ids=["doc_abc123", "doc_def456"],
    limit=10,
)
for chunk in results.results:
    print(f"{chunk.document_filename}: {chunk.content[:100]}...")
```
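
Under the hood, scores like `chunk.score` are typically cosine similarities between embedding vectors. A toy illustration with hand-made three-dimensional vectors (the service's real embeddings are high-dimensional and computed server-side, so these numbers are purely for intuition):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: related phrases point in similar directions
damaged_equipment = [0.9, 0.1, 0.3]
broken_machinery = [0.8, 0.2, 0.35]
quarterly_revenue = [0.1, 0.9, 0.0]

print(cosine_similarity(damaged_equipment, broken_machinery))   # high, ~0.99
print(cosine_similarity(damaged_equipment, quarterly_revenue))  # low, ~0.21
```

This is why "damaged equipment" can match "broken machinery": their embeddings land close together even though the surface words differ, and `similarity_threshold` simply cuts off results below a chosen cosine score.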

Document Management

Documents are managed through the same unified files resource as images and videos.

List, Get, Download, Delete

```python
# List documents only — filter by media_type
files = await client.files.list(media_types=["document"], limit=20)
print(f"Total: {files.total_count}")
for f in files.items:
    print(f"  {f.filename} ({f.document_type})")

# Get document details
doc = await client.files.get("550e8400-...")
print(f"Filename: {doc.filename}, Pages: {doc.page_count}")

# Get extracted text
text_result = await client.files.get_text("550e8400-...")
print(f"Text length: {len(text_result['text'])} characters")

# Get chunks (for RAG debugging)
chunks_result = await client.files.get_chunks("550e8400-...")
print(f"Total chunks: {chunks_result['total_chunks']}")

# Get a temporary download URL for the original file
download_url = await client.files.download_url("550e8400-...")

# Delete the document and all of its chunks
await client.files.delete("550e8400-...")
```

Documents in Chat (RAG)

Automatic Document Access

The chat system's document search agent has access to all your uploaded documents automatically. There is no need to explicitly attach documents — the AI searches relevant documents based on your query.

Chat with Documents

```python
async with Scopix(api_key="scopix_...") as client:
    async with client.chat_session() as session:
        response = await session.send(
            "What are the key safety requirements mentioned in the documents?"
        )
        print(response.content)

        response2 = await session.send("Which section covers equipment maintenance?")
        print(response2.content)
```

Deduplication

Documents are deduplicated by SHA-256 content hash — safe to retry failed uploads.

```python
r1 = await client.files.upload("report.pdf", skip_duplicates=True)
r2 = await client.files.upload("report.pdf", skip_duplicates=True)
# r2.deduplicated == True; r2.image_id references the same document as r1
# (populated on presigned/multipart completions; may be None on streaming uploads)
```
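
Because dedup keys on the SHA-256 of the file contents, you can compute the same hash locally and predict whether an upload will be deduplicated, assuming the server hashes the raw bytes (the usual convention; the helper name is ours, not the SDK's):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Hashing in 1 MB blocks keeps memory flat even for documents at the 50 MB streaming limit.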

Limits & Quotas

  • Streaming max file size: 50 MB per document (use presigned upload for larger)
  • Batch size: 1–100 documents per batch (tier-dependent)
  • Concurrent batches: 200 active batches per tenant
  • Supported formats: PDF, DOCX, TXT, MD
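
The limits above can be enforced client-side before spending an upload round-trip. A hypothetical pre-flight helper (the constants mirror the documented limits; the function is ours, not part of the SDK):

```python
from pathlib import Path

STREAMING_MAX_BYTES = 50 * 1024 * 1024  # 50 MB streaming limit
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}

def choose_upload_route(path: str) -> str:
    """Return 'streaming' or 'presigned'; raise for unsupported formats."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported format: {p.suffix}")
    size = p.stat().st_size
    return "streaming" if size <= STREAMING_MAX_BYTES else "presigned"
```

A caller would branch on the result: `client.files.upload(...)` for "streaming", or the presigned flow under /api/v2/files/uploads for "presigned".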