
Files API Reference

Unified file resource for images, documents, and videos — uploads, retrieval, search, digitization, and management

One Resource for Every Media Type

All file operations live under /api/v2/files/*. Images, documents, videos, and links share the same CRUD endpoints; media-specific sub-paths (variants, text, chunks, digitization, similar) return 400 if used on the wrong media type.

Automatic File Type Detection

The API auto-detects file types from content using magic byte signatures. You don't need to set the correct Content-Type header in multipart form data — if omitted or mismatched, the server inspects the payload and routes the file to the right pipeline. Unrecognizable or unsafe files (executables, scripts) are rejected.
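
Magic-byte detection can be pictured with a small client-side sketch. The signature table below is illustrative only (JPEG, PNG, PDF, and the Windows executable prefix are well-known signatures); the server's real table and its unsafe-type list are not documented here, and the names `sniff_content_type` and `is_acceptable` are hypothetical.

```python
from typing import Optional

# Illustrative signatures only; the server's actual detection table is not documented here.
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"MZ", "application/x-msdownload"),  # Windows executable
]
UNSAFE_TYPES = {"application/x-msdownload"}


def sniff_content_type(head: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes, or return None if unknown."""
    for signature, mime in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return mime
    return None


def is_acceptable(head: bytes) -> bool:
    """Reject payloads that are unrecognized or on the unsafe list."""
    mime = sniff_content_type(head)
    return mime is not None and mime not in UNSAFE_TYPES
```

Running a check like this before uploading is optional; it only avoids a round trip for files the server would reject anyway.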

Choosing an Upload Method

Three REST endpoints cover three scenarios. If you use the Python SDK, client.files.upload() auto-routes by file size, so you don't need to pick.

Use case | Endpoint | Size cap | Roundtrips
Single small file | POST /files/upload | 100 MB | 1
Multiple small files | POST /files/upload/batch | 100 MB / file | 1
Large file (>100 MB) | POST /files/direct-uploads, then PUT to S3, then POST /files/direct-uploads/{id}/complete | 5 TB | 3

If you send a file over 100 MB to the streaming endpoints, the response includes a suggestion.endpoint pointing you at the presigned flow.
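
If you call the REST endpoints directly, the routing rule above can be sketched as follows. This is a minimal illustration, not the SDK's logic: `choose_upload_endpoint` is a hypothetical helper, and it assumes "100 MB" means 100 × 1024 × 1024 bytes.

```python
# Assumption: the 100 MB streaming cap is interpreted as 100 MiB.
STREAMING_CAP_BYTES = 100 * 1024 * 1024


def choose_upload_endpoint(size_bytes, file_count=1):
    """Pick the endpoint for one file, given how many files are being sent together."""
    if size_bytes > STREAMING_CAP_BYTES:
        # Too large for streaming: start the presigned flow for this file.
        return "/api/v2/files/direct-uploads"
    if file_count > 1:
        return "/api/v2/files/upload/batch"
    return "/api/v2/files/upload"
```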

Streaming Upload

Single-request multipart upload for files up to 100 MB. The recommended path for almost every upload — no init/complete dance, no client-side hashing.

POST /api/v2/files/upload

Upload a single file (multipart/form-data). Auto-routes by detected media type. Returns 201 Created.

Request

json
// multipart/form-data fields:
// Required:
// file: (binary) — file to upload (up to 100 MB)
// Optional:
// title: (string, max 255) — file title
// tags: (string) — comma-separated tags
// auto_describe: (boolean, default true) — run AI description pipeline
// skip_duplicates: (boolean, default false) — skip if hash already exists
// storage_target: (string, default "default") — "default" or "custom"
// folder_id: (string) — destination folder UUID
// project_id: (string) — project workspace UUID (used when no folder_id)
// content_category: (string, default "general") — content category for tailored AI
// Valid values: general, blueprint, ce_plan, technical_diagram,
// architectural_design, product_photo, real_estate, mining, robotics,
// artwork, screenshot, document, map, pid, pfd, construction,
// facility_assessment
// custom_schema_id: (string) — optional saved custom extraction schema UUID;
// triggers a second VLM pass with that schema
// compliance_type: (string) — "mls" or "marketplace"
// compliance_standard: (string) — required if compliance_type is set
// (e.g. "nar_baseline", "amazon")
// compliance_image_type: (string, default "main") — "main" or "secondary"
curl -X POST https://api.scopix.ai/api/v2/files/upload \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@photo.jpg" \
-F "title=Site Inspection" \
-F "tags=inspection,site"

Response

json
// 201 Created
{
"file_id": "550e8400-e29b-41d4-a716-446655440000",
"upload_method": "STREAMING",
"status": "completed", // "completed" | "processing" | "skipped"
"processing_time_ms": 1250.5,
"upload_completed": true,
"thumbnail_generation_started": true,
"analysis_started": true,
"skipped": false,
"existing_file_id": null, // set when skipped=true and a prior copy exists
"storage_target": "default",
"media_type": "image", // "image" | "document" | "video"
"document_type": null, // "pdf" | "docx" | "txt" | "md" (documents only)
"text_extraction_status": null // "pending" | "processing" | "completed" | "failed" (documents)
}
// 429 Too Many Requests — backpressure (Retry-After header set)
// 413 Payload Too Large — file exceeds streaming limit (use /files/direct-uploads multipart)
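
A client typically branches on the status code and the skipped flag. The sketch below shows one way to do that; `next_action` and its return tuples are hypothetical, the headers argument is assumed to be a plain dict with the exact key "Retry-After", and only the seconds form of Retry-After is handled.

```python
def next_action(status_code, headers, body):
    """Map a streaming-upload response to a (action, detail) tuple."""
    if status_code == 201:
        if body.get("skipped"):
            # Duplicate content: reuse the prior copy instead of a new file.
            return ("use_existing", body.get("existing_file_id"))
        return ("done", body["file_id"])
    if status_code == 429:
        try:
            delay = float(headers.get("Retry-After", "1"))
        except ValueError:
            delay = 1.0  # Retry-After was an HTTP date; fall back to a short wait
        return ("retry_after", delay)
    if status_code == 413:
        return ("switch_to_direct_upload", None)
    return ("error", status_code)
```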

What does "batch" mean here?

"Batch" in the Scopix API means multiple files uploaded in one HTTP request — the endpoint groups them into a tracked upload session. It is not a job queue. All AI processing of uploaded files happens automatically in the background; you don't submit jobs separately.

POST /api/v2/files/upload/batch

Multi-file batch upload. Per-tier file count: FREE 10, STARTER 50, PROFESSIONAL 100, ENTERPRISE 200. Each file is capped at 100 MB. Returns 201 Created.

Request

json
// multipart/form-data fields:
// Required:
// files: (binary[]) — multiple files (each up to 100 MB)
// Optional:
// tags: (string) — comma-separated tags applied to all files
// auto_describe: (boolean, default true) — run AI description pipeline
// skip_duplicates: (boolean, default false)
// storage_target: (string, default "default")
// folder_id: (string) — destination folder UUID
// project_id: (string) — project workspace UUID
// content_category: (string, default "general")
// custom_schema_id: (string) — optional saved custom extraction schema UUID
// applied to every file in the batch
// compliance_type: (string) — "mls" or "marketplace"
// compliance_standard: (string) — required if compliance_type is set
// compliance_image_type: (string, default "main") — "main" or "secondary"
curl -X POST https://api.scopix.ai/api/v2/files/upload/batch \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "files=@photo1.jpg" \
-F "files=@photo2.jpg" \
-F "files=@report.pdf"

Response

json
// 201 Created
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"total_files": 3,
"accepted_files": 3,
"rejected_files": 0,
"status": "completed", // "completed" | "partial" | "processing" | "rejected"
"immediate_results": [
{
"file_id": "660f9500-e29b-41d4-a716-446655440000",
"filename": "photo1.jpg",
"status": "completed", // "completed" | "failed" | "skipped"
"processing_time_ms": 850.2,
"skipped": false,
"existing_file_id": null, // set when skipped=true and a prior copy exists
"error": null,
"storage_target": "default",
"media_type": "image",
"document_type": null,
"text_extraction_status": null
}
],
"status_url": "/api/v2/files/sessions/{session_id}/status",
"websocket_channel": "batch.{session_id}",
"rejections": null
}
// For larger batches, poll status_url or subscribe to websocket_channel
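
When you have more files than your tier allows per request, split them client-side. A minimal sketch using the per-tier counts listed above; `plan_batches` is a hypothetical helper and does not check the 100 MB per-file cap.

```python
# Per-request file counts from the endpoint description above.
TIER_FILE_LIMITS = {"FREE": 10, "STARTER": 50, "PROFESSIONAL": 100, "ENTERPRISE": 200}


def plan_batches(filenames, tier):
    """Split filenames into lists no longer than the tier's per-request limit."""
    limit = TIER_FILE_LIMITS[tier]
    return [filenames[i:i + limit] for i in range(0, len(filenames), limit)]
```

Each returned list maps to one POST /api/v2/files/upload/batch request and therefore one session_id.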

Presigned & Multipart Upload

For files larger than 100 MB or when you want the bytes to bypass the API entirely, use the upload-intent flow: request → PUT directly to S3 → complete. Use upload_mode: "single_shot" for files up to 5 GB; "multipart" for anything larger (videos, large datasets).

POST /api/v2/files/direct-uploads

Create an upload intent. Returns a presigned PUT URL (single-shot) or per-part presigned URLs (multipart). The client must compute the SHA-256 of the file and pin it as claimed_file_hash; the server verifies it on /complete. Accepts an optional Idempotency-Key header (1-128 chars, [a-zA-Z0-9_.-]).

Request

json
{
"filename": "inspection.mp4",
"content_type": "video/mp4",
"size_bytes": 524288000,
"claimed_file_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"upload_mode": "multipart", // optional — omit to let server pick by size.
// "single_shot" (<=5 GB) | "multipart" (>=5 MB)
"part_size_bytes": 8388608, // multipart only — min 5 MB per part
"title": "Site Inspection", // optional
"tags": ["inspection", "site-a"], // optional, max 20 tags (1-50 chars each)
"folder_id": null, // optional folder UUID
"project_id": null, // optional project UUID
"skip_duplicates": false, // optional
"storage_target": "default", // optional (not currently honored server-side)
"auto_describe": true, // optional, default true
"content_category": "general", // optional
"custom_schema_id": null, // optional saved schema UUID
"compliance_type": null, // optional: "mls" | "marketplace"
"compliance_standard": null, // required if compliance_type is set
"compliance_image_type": "main" // optional: "main" | "secondary"
}
// Required: filename, content_type, size_bytes, claimed_file_hash
// upload_mode is OPTIONAL — the server auto-selects by size_bytes
// claimed_file_hash: 64-char SHA-256 hex (server verifies post-upload)
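
The required fields can be assembled with the standard library. The sketch below hashes the file in 1 MiB blocks so large videos are never loaded into memory; `build_upload_intent` and `s3_checksum_header` are hypothetical helper names. The second function only illustrates how a hex digest relates to the base64 form shown in x-amz-checksum-sha256; in practice you send the headers the server returns.

```python
import base64
import hashlib
import os


def build_upload_intent(path, content_type):
    """Return the four required fields for POST /files/direct-uploads."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size blocks until read() returns b"" at end of file.
        for block in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(block)
    return {
        "filename": os.path.basename(path),
        "content_type": content_type,
        "size_bytes": os.path.getsize(path),
        "claimed_file_hash": digest.hexdigest(),  # 64-char hex string
    }


def s3_checksum_header(hex_digest):
    """Convert a hex SHA-256 digest to its base64 representation."""
    return base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")
```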

Response

json
// Single-shot response:
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"upload_mode": "single_shot",
"media_type": "video",
"method": "PUT",
"presigned_url": "https://s3.amazonaws.com/...",
"headers": {
"Content-Type": "video/mp4",
"x-amz-checksum-sha256": "<base64(sha256)>",
"x-amz-sdk-checksum-algorithm": "SHA256"
},
"object_key": "videos/<tenant>/<hash>.mp4",
"expires_at": "2026-04-15T10:40:00Z",
"max_size_bytes": 524288000,
"bucket_name": "scopix-uploads"
}
// Multipart response:
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"upload_mode": "multipart",
"media_type": "video",
"s3_upload_id": "abc...XYZ",
"object_key": "videos/<tenant>/<hash>.mp4",
"part_urls": [
{"part_number": 1, "url": "https://s3.amazonaws.com/...", "expires_at": "2026-04-15T10:40:00Z"},
{"part_number": 2, "url": "https://s3.amazonaws.com/...", "expires_at": "2026-04-15T10:40:00Z"}
],
"part_size_bytes": 8388608,
"total_parts": 63,
"expires_at": "2026-04-15T10:40:00Z",
"bucket_name": "scopix-uploads"
}

GET /api/v2/files/direct-uploads/{upload_id}

Get the current state of an upload intent (PENDING, UPLOADED, COMPLETED, FAILED) and per-part progress for multipart.

Response

json
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"upload_mode": "multipart",
"status": "UPLOADED", // PENDING | UPLOADED | COMPLETED | FAILED
"media_type": "video",
"object_key": "videos/<tenant>/<hash>.mp4",
"filename": "inspection.mp4",
"size_bytes": 524288000,
"total_parts": 63, // null for single_shot
"parts_confirmed": 63, // null for single_shot
"progress_percent": 100.0, // null for single_shot
"created_at": "2026-04-15T10:30:00Z",
"expires_at": "2026-04-15T10:40:00Z",
"confirmed_at": "2026-04-15T10:38:00Z",
"error_message": null
}

POST /api/v2/files/direct-uploads/{upload_id}/parts/confirm

Confirm a successfully uploaded multipart chunk. Call after each PUT to S3 with the returned ETag.

Request

json
{
"part_number": 1,
"etag": "\"abc123def456\"",
"size_bytes": 8388608
}
// part_number: 1-indexed
// etag: from S3 PUT response (quoted form is fine)
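
The part_number and size_bytes values follow from the intent's size_bytes and part_size_bytes. A sketch of that arithmetic, assuming every part except the last is exactly part_size_bytes long; `plan_parts` and `progress_percent` are hypothetical helpers, and the rounding to two decimals simply matches the sample responses.

```python
def plan_parts(size_bytes, part_size_bytes):
    """Return (part_number, offset, length) tuples; part numbers are 1-indexed."""
    total_parts = -(-size_bytes // part_size_bytes)  # integer ceiling division
    parts = []
    for part_number in range(1, total_parts + 1):
        offset = (part_number - 1) * part_size_bytes
        length = min(part_size_bytes, size_bytes - offset)
        parts.append((part_number, offset, length))
    return parts


def progress_percent(parts_confirmed, total_parts):
    """Percentage of confirmed parts, rounded to two decimals."""
    return round(100.0 * parts_confirmed / total_parts, 2)
```

For the 524288000-byte example with 8388608-byte parts this yields 63 parts, the last one 4194304 bytes long.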

Response

json
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"part_number": 1,
"parts_confirmed": 1,
"total_parts": 63,
"progress_percent": 1.59
}

POST /api/v2/files/direct-uploads/{upload_id}/parts/retry

Get a fresh presigned URL for re-uploading a failed multipart chunk.

Request

json
{
"part_number": 5
}
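
One way to combine the PUT, /parts/retry, and /parts/confirm calls for a single part is sketched below. The three callables are hypothetical stand-ins for your HTTP client: put_part(url, data) returns the ETag or raises OSError, refresh_url(part_number) returns the /parts/retry response, and confirm_part(body) posts to /parts/confirm.

```python
def upload_part_with_retry(part_number, data, url, put_part, confirm_part,
                           refresh_url, max_attempts=3):
    """Upload one part, fetching a fresh presigned URL after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            etag = put_part(url, data)
        except OSError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            url = refresh_url(part_number)["url"]
            continue
        # Confirm only after S3 accepted the bytes and returned an ETag.
        return confirm_part({
            "part_number": part_number,
            "etag": etag,
            "size_bytes": len(data),
        })
```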

Response

json
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"part_number": 5,
"url": "https://s3.amazonaws.com/...",
"expires_at": "2026-04-15T10:50:00Z"
}

POST /api/v2/files/direct-uploads/{upload_id}/complete

Finalize an upload (single-shot or multipart). The server completes the S3 multipart upload, verifies the SHA-256 against claimed_file_hash, creates the file record, and queues media-specific processing (variants and description for images, extraction for documents, ffprobe and analysis for videos). The request body is empty; the server is fully authoritative. Supports the Idempotency-Key header; see /docs/api/idempotency.

Request

json
{}
// Body must be empty by design. The server uses claimed_file_hash from the
// initiate request and the parts list it tracked from /parts/confirm calls.
// No client-supplied duration/analysis params — videos use server-side
// ffprobe and a 2-credit reservation that the worker reconciles.

Response

json
// 202 Accepted — file stored, downstream processing queued.
// Poll GET /files/{file_id} until text_extraction_status /
// description_status / video_analysis_status leave "pending".
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"file_id": "660f9500-e29b-41d4-a716-446655440000",
"media_type": "video", // "image" | "document" | "video"
"filename": "inspection.mp4",
"object_key": "videos/<tenant>/<hash>.mp4",
"size_bytes": 524288000,
"deduplicated": false, // true if an existing file had the same hash
"status": "processing" // "processing" | "completed"
}
// 409 Conflict — claimed_file_hash mismatch (SHA-256 didn't match S3 object)
// 422 Unprocessable Entity — required parts missing on multipart complete

DELETE /api/v2/files/direct-uploads/{upload_id}

Abort an upload intent. For multipart, also aborts the underlying S3 multipart upload (refunds reserved credits if applicable).

Request

json
// Optional query parameter:
?reason=User%20cancelled // up to 255 chars

Response

json
{
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"aborted": true,
"reason": "User cancelled"
}

File Listing & Retrieval

GET /api/v2/files

List files with full-text search and filters. Results can mix media types; use the media_types query parameter to restrict them.

Request

json
// Query parameters:
?search=damage%20report // optional, full-text search (URL-encode spaces)
&search_mode=all // optional, default: all, options: all | metadata | visible_text
&tags=safety&tags=inspection // optional, multi-value filter by tags
&media_types=image&media_types=document // optional, multi-value: image | document | video | link
&folder_id=folder_abc123 // optional, filter by folder
&project_id=uuid // optional, filter by project workspace
&has_description=true // optional, filter by description status
&ids=uuid1&ids=uuid2 // optional, multi-value filter by file IDs
&compliance_status=passed // optional, filter by compliance status
&date_from=2026-01-01T00:00:00Z // optional
&date_to=2026-01-31T23:59:59Z // optional
&sort_by=content_created_at // optional, options: created_at | content_created_at | title | size_bytes
&sort_order=desc // optional, default: desc
&limit=20 // optional, default: 20, 1-100
&offset=0 // optional, default: 0
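
Multi-value filters such as tags and media_types are sent as repeated keys. A sketch of building the query string with the standard library; `build_list_query` is a hypothetical helper, and lowercasing booleans to "true"/"false" is an assumption based on the has_description=true example above.

```python
from urllib.parse import urlencode


def _encode(value):
    """Render booleans the way the examples show them (true/false)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return value


def build_list_query(**filters):
    """Drop None values and expand lists into repeated query keys."""
    params = {}
    for key, value in filters.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            params[key] = [_encode(item) for item in value]
        else:
            params[key] = _encode(value)
    return urlencode(params, doseq=True)
```

urlencode encodes spaces as "+", which is equivalent to %20 inside a query string.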

Response

json
{
"items": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"title": "Site Photo A",
"filename": "site_photo.jpg",
"thumbnail_url": "https://cdn.scopix.ai/thumbs/...",
"upload_description": "Damaged concrete pillar with visible cracks...",
"visible_text": "WARNING: STRUCTURAL DAMAGE",
"tags": ["damage", "concrete"],
"size_bytes": 2048576,
"created_at": "2026-01-15T10:30:00Z",
"content_created_at": "2026-01-14T08:00:00Z",
"has_full_description": true,
"dimensions": {"width": 4000, "height": 3000},
"format": "jpeg",
"primary_status": "completed",
"variant_status": "completed",
"variant_count": 5,
"medium_url": "https://cdn.scopix.ai/medium/...",
"full_url": "https://cdn.scopix.ai/large/...",
"blur_hash": "L6PZfSi_.AyE_3t7t7R**0o#DgR4",
"description_status": "completed",
"description_error": null,
"content_type": "image/jpeg",
"media_type": "image",
"content_category": "general",
"document_type": null,
"source_url": null
}
],
"total_count": 150,
"limit": 20,
"offset": 0,
"has_more": true
}
// primary_status is the canonical "poll until done" field — one of:
// "pending" | "processing" | "completed" | "failed" | "partially_completed"
// It's derived per media type from the component statuses below; prefer it over
// branching on media_type + per-component statuses in client code.
//
// Component statuses (all share the ComponentStatus enum:
// "pending" | "queued" | "processing" | "completed" | "failed" | "skipped"):
// image: variant_status, description_status
// document: text_extraction_status, digitization_status, description_status
// video: video_analysis_status, description_status
// link: crawl_status, description_status
//
// Conditional fields by media_type:
// document: document_type, page_count, text_extraction_status, chunk_count, document_url
// video: duration_seconds, frame_rate, video_codec, resolution, video_analysis_status
// link: source_url, domain, og_metadata, favicon_url, crawl_status,
// extracted_images, extracted_images_count
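
Polling on primary_status can be sketched as below. `fetch_file` is a hypothetical callable that returns a file record containing primary_status (for example, one item from GET /files?ids=...); treating "partially_completed" as terminal is an assumption, and the sleep function is injectable so the loop can be tested without waiting.

```python
import time

# Assumption: these three values mean no further processing will happen.
TERMINAL_STATUSES = {"completed", "failed", "partially_completed"}


def wait_until_done(fetch_file, file_id, interval_s=2.0, max_polls=30,
                    sleep=time.sleep):
    """Poll until primary_status is terminal, or raise TimeoutError."""
    for _ in range(max_polls):
        record = fetch_file(file_id)
        if record["primary_status"] in TERMINAL_STATUSES:
            return record
        sleep(interval_s)
    raise TimeoutError("file %s still processing after %d polls" % (file_id, max_polls))
```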

GET /api/v2/files/{file_id}

Get detailed file information. Discriminated by media_type — variant-specific fields appear only on the matching variant. Accepts full UUID or 8-character prefix.

Request

json
// Optional query parameter:
?format=markdown // optional — when set to "markdown" on an image, the
// response includes a formatted_document rendering
// of CE plan / legend / schedule / description data

Response

json
// media_type: "image"
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"media_type": "image",
"title": "Site Photo A",
"tags": ["damage", "concrete"],
"size_bytes": 2048576,
"content_type": "image/jpeg",
"dimensions": {"width": 4000, "height": 3000},
"format": "jpeg",
"full_url": "https://cdn.scopix.ai/large/...",
"thumbnail_url": "https://cdn.scopix.ai/thumbs/...",
"medium_url": "https://cdn.scopix.ai/medium/...",
"original_url": "https://cdn.scopix.ai/originals/...",
"variant_status": "completed",
"variant_count": 5,
"upload_description": "Damaged concrete pillar...",
"visible_text": "WARNING: STRUCTURAL DAMAGE",
"text_regions": [
{"text": "WARNING: STRUCTURAL DAMAGE",
"bounding_box": {"x_min": 0.25, "y_min": 0.4, "x_max": 0.75, "y_max": 0.52}}
],
"description_generated_at": "2026-01-15T10:32:00Z",
"full_descriptions": [...],
"created_at": "2026-01-15T10:30:00Z",
"updated_at": "2026-01-15T10:35:00Z",
"blur_hash": "L6PZfSi_.AyE_3t7t7R**0o#DgR4",
"description_status": "completed",
"content_category": "general"
}
// media_type: "document"
{
"id": "...", "media_type": "document",
"filename": "safety_manual.pdf",
"document_type": "pdf",
"page_count": 45,
"chunk_count": 128,
"text_extraction_status": "completed",
"extracted_text": "SAFETY MANUAL\n\nChapter 1...",
...
}
// media_type: "video"
{
"id": "...", "media_type": "video",
"filename": "inspection.mp4",
"duration_seconds": 240.5,
"frame_rate": 30.0,
"video_codec": "h264",
"resolution": "1920x1080",
"video_analysis_status": "completed",
"thumbnail_url": "https://...",
...
}

GET /api/v2/files/{file_id}/download

Download the original file. Returns a 302 redirect to a temporary download URL with a Content-Disposition header.

Response

json
// Returns 302 Redirect to presigned download URL
// URL expires in 5 minutes (300 seconds)
// Content-Disposition header set for download

File Updates & Deletion

PATCH /api/v2/files/{file_id}

Update file metadata (title, tags, user_description). Pass only the fields you want to change.

Request

json
{
"title": "Updated Photo Title",
"tags": ["updated", "reviewed"],
"user_description": "Quarterly inspection — minor surface cracks only"
}
// title: optional, max 255 characters
// tags: optional, max 40 tags, each max 50 characters
// user_description: optional, max 10000 chars; pass null to reset to AI-generated description
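
The limits above can be checked before sending the request. A minimal sketch; `validate_file_patch` is a hypothetical helper, and rejecting unknown keys is a client-side choice rather than documented server behavior.

```python
def validate_file_patch(patch):
    """Return a list of problems; an empty list means the body looks valid."""
    problems = []
    unknown = set(patch) - {"title", "tags", "user_description"}
    if unknown:
        problems.append("unknown fields: " + ", ".join(sorted(unknown)))
    title = patch.get("title")
    if title is not None and len(title) > 255:
        problems.append("title exceeds 255 characters")
    tags = patch.get("tags")
    if tags is not None:
        if len(tags) > 40:
            problems.append("more than 40 tags")
        if any(len(tag) > 50 for tag in tags):
            problems.append("a tag exceeds 50 characters")
    # None is valid here: it resets to the AI-generated description.
    description = patch.get("user_description")
    if description is not None and len(description) > 10000:
        problems.append("user_description exceeds 10000 characters")
    return problems
```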

Response

json
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"title": "Updated Photo Title",
"tags": ["updated", "reviewed"],
"user_description": "Quarterly inspection — minor surface cracks only",
"upload_description": "A concrete pillar with visible damage...",
"updated_at": "2026-01-15T11:00:00Z"
}

DELETE /api/v2/files/{file_id}

Soft-delete a file. Recoverable within 30 days.

Response

json
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"deleted_at": "2026-01-15T11:00:00Z",
"message": "File deleted successfully"
}
// 409 Conflict — cannot delete while document text extraction or
// embedding is in progress

POST /api/v2/files/batch-delete

Delete up to 100 files in a single request. Each file is reported individually so partial failures don't block the batch.

Request

json
{
"file_ids": [
"550e8400-e29b-41d4-a716-446655440000",
"660f9500-f39c-52e5-b827-557766550111"
]
}
// 1-100 unique UUIDs

Response

json
{
"deleted": [
{"id": "550e8400-e29b-41d4-a716-446655440000", "status": "deleted",
"message": null, "deleted_at": "2026-01-15T11:00:00Z"}
],
"skipped": [],
"failed": [
{"id": "660f9500-f39c-52e5-b827-557766550111", "status": "failed",
"message": "File not found", "deleted_at": null}
],
"summary": {"total": 2, "deleted": 1, "skipped": 0, "failed": 1}
}

Image Operations

Image-only sub-paths. Calling these on a non-image file returns 400.

GET /api/v2/files/{file_id}/variant/{variant_type}

Get a specific image variant. Returns 302 redirect to the variant URL (1-hour expiry).

Request

json
// variant_type options:
// - original: Original uploaded image
// - tiny_64: 64px max dimension
// - small_256: 256px max dimension
// - medium_750: 750px max dimension
// - large_1024: 1024px max dimension

Response

json
// Returns 302 Redirect to variant URL
// 400 Bad Request if file media_type != "image"

POST /api/v2/files/{file_id}/trigger-variants

Manually re-queue variant generation. Useful for recovery if the original variant pipeline failed.

Response

json
{
"success": true,
"message": "Variant generation triggered",
"task_id": "task_550e8400",
"current_status": "processing",
"image_id": "550e8400-e29b-41d4-a716-446655440000"
}
// If already processing:
// {"success": true, "message": "Variant generation already in progress",
// "skipped_duplicate": true, ...}

GET /api/v2/files/{file_id}/similar

Find visually similar images using hybrid embedding + semantic similarity.

Request

json
// Query parameters:
?limit=20 // optional, 1-50, default: 20

Response

json
{
"reference_image_id": "550e8400-e29b-41d4-a716-446655440000",
"items": [
{
"image_id": "660f9500-e29b-41d4-a716-446655440000",
"title": "Similar beam photo",
"description": "Steel beam with surface corrosion...",
"relevance_score": 0.92,
"vector_similarity": 0.88,
"thumbnail_url": "https://cdn.scopix.ai/thumbs/...",
"medium_url": "https://cdn.scopix.ai/medium/...",
"full_url": "https://cdn.scopix.ai/large/...",
"folder_id": "770a0600-e29b-41d4-a716-446655440000",
"created_at": "2026-01-10T08:00:00Z"
}
],
"total_count": 1
}
// 400 Bad Request if file media_type != "image"

PATCH /api/v2/files/{file_id}/extractions/{domain_name}/review

Review AI extraction results — confirm, reject, or edit extracted items for a domain. Corrections layer on top of AI outputs (originals preserved). Multiple calls merge additively.

Request

json
{
"item_reviews": {
"furniture_items.0": "confirmed",
"furniture_items.1": "rejected",
"materials.2": "confirmed"
},
"field_edits": {
"furniture_items.0.name": "Barcelona Chair",
"furniture_items.0.material": "leather"
}
}
// At least one of item_reviews or field_edits is required.
//
// domain_name: one of:
// architectural_design, ce_plan, layout_region, legend,
// mining, real_estate, technical_diagram, pid, pfd,
// text_regions, mls_compliance, schedule
//
// item_reviews: keys are dot-path identifiers (e.g. "items.0"),
// values must be "confirmed" or "rejected"
// field_edits: keys are dot-path field identifiers (e.g. "items.0.name"),
// values are the corrected data

Response

json
{
"image_id": "550e8400-e29b-41d4-a716-446655440000",
"domain_name": "architectural_design",
"corrections": {
"item_reviews": {"furniture_items.0": "confirmed", "furniture_items.1": "rejected"},
"field_edits": {"furniture_items.0.name": "Barcelona Chair"}
},
"updated_at": "2026-04-13T10:30:00Z"
}
// 400 Bad Request if file media_type != "image" or invalid domain
// 404 Not Found if file or extraction does not exist

Document Operations

Document-only sub-paths. Calling these on a non-document file returns 400.

GET /api/v2/files/{file_id}/text

Get the full extracted plain text from a document.

Response

json
{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "safety_manual.pdf",
"text": "SAFETY MANUAL\n\nChapter 1: Introduction\n\nThis manual provides...",
"page_count": 45,
"metadata": {"language": "en"}
}

GET /api/v2/files/{file_id}/chunks

Get all chunks (for RAG / search) from a document. Optionally include the embedding vectors.

Request

json
// Query parameters:
?include_embeddings=false // optional, default: false

Response

json
{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"chunks": [
{
"chunk_id": "chunk_001",
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"document_filename": "safety_manual.pdf",
"chunk_index": 0,
"content": "Safety inspections must be conducted quarterly...",
"page_numbers": [12, 13],
"heading_hierarchy": ["Chapter 3", "Inspections"],
"similarity_score": null,
"metadata": {
"token_count": 256,
"chunk_type": "paragraph",
"embedding_status": "completed"
}
}
],
"total_chunks": 128,
"status_counts": {"completed": 128, "pending": 0, "failed": 0}
}
// status_counts is only included when include_embeddings=true
// similarity_score is null for direct-fetch (only populated in search results)

GET /api/v2/files/{file_id}/digitization

Get the full structural digitization (per-page elements with bounding boxes) for a document.

Response

json
{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"total_pages": 3,
"completed_pages": 3,
"failed_pages": 0,
"pages": [
{
"page_number": 1,
"status": "completed",
"element_count": 5,
"elements": [
{
"type": "heading",
"content": "Safety Manual",
"bounding_box": {"x": 0.15, "y": 0.05, "w": 0.70, "h": 0.04},
"metadata": {"level": 1}
},
{
"type": "paragraph",
"content": "This manual provides comprehensive safety guidelines...",
"bounding_box": {"x": 0.10, "y": 0.12, "w": 0.80, "h": 0.15}
},
{
"type": "table",
"content": "| Category | Frequency |\n|---|---|\n| Fire | Quarterly |",
"bounding_box": {"x": 0.10, "y": 0.30, "w": 0.80, "h": 0.20}
}
],
"error_message": null
}
]
}
// status: pending | processing | completed | failed
// element types: heading, paragraph, table, key_value, list, figure
// bounding_box coordinates are normalized 0-1 relative to page dimensions
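
To draw or crop an element, scale the normalized box by the rendered page size. The sketch assumes x and y mark the top-left corner with the origin at the top-left of the page, which the response format does not state explicitly; `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(box, page_width, page_height):
    """Convert a normalized {x, y, w, h} box to (left, top, right, bottom) pixels."""
    left = round(box["x"] * page_width)
    top = round(box["y"] * page_height)
    right = round((box["x"] + box["w"]) * page_width)
    bottom = round((box["y"] + box["h"]) * page_height)
    return (left, top, right, bottom)
```

For the table element above on a 1000 × 2000 pixel page render, this gives (100, 600, 900, 1000).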

GET /api/v2/files/{file_id}/digitization/pages/{page_number}

Get digitization elements for a single page (1-indexed).

Response

json
{
"page_number": 2,
"status": "completed",
"element_count": 3,
"elements": [
{
"type": "heading",
"content": "Chapter 2: Fire Safety",
"bounding_box": {"x": 0.10, "y": 0.05, "w": 0.60, "h": 0.04},
"metadata": {"level": 2}
}
],
"error_message": null
}
// 404 Not Found if no digitization exists for the requested page

GET /api/v2/files/{file_id}/digitization/status

Lightweight status check for digitization progress (no element data).

Response

json
{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "processing",
"total_pages": 5,
"page_statuses": {
"1": "completed",
"2": "completed",
"3": "processing",
"4": "pending",
"5": "pending"
}
}
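
A progress indicator can be derived from page_statuses. In this sketch a page counts as settled once it is "completed" or "failed"; that definition and the `summarize_pages` helper are assumptions, not part of the API.

```python
from collections import Counter


def summarize_pages(status_response):
    """Count pages per status and compute the share that has finished."""
    counts = Counter(status_response["page_statuses"].values())
    total = status_response["total_pages"]
    settled = counts["completed"] + counts["failed"]
    percent = round(100.0 * settled / total, 1) if total else 0.0
    return {"counts": dict(counts), "percent_settled": percent}
```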

GET /api/v2/files/{file_id}/processing-status

Cross-media processing status (works for image, document, and video). Includes per-component subprocess statuses.

Response

json
{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "safety_manual.pdf",
"document_type": "pdf",
"text_extraction_status": "completed",
"page_count": 45,
"chunk_count": 128,
"created_at": "2026-01-15T10:30:00Z",
"processing_started_at": "2026-01-15T10:30:05Z",
"completed_at": "2026-01-15T10:32:00Z",
"error_message": null
}

Semantic Search & Analyze

AI-Powered Search

Search uses semantic similarity to find content by meaning, not just keywords — "damaged equipment" matches "broken machinery" even without exact words.
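
As a rough illustration of matching by meaning, the sketch below ranks toy vectors by cosine similarity and applies a threshold. Cosine similarity is one common measure; the source does not state which metric or embedding model Scopix uses, and the vectors and helper names here are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vector, chunk_vectors, threshold=0.3):
    """Return (chunk_id, score) pairs at or above threshold, best first."""
    scored = [(cid, cosine_similarity(query_vector, vec))
              for cid, vec in chunk_vectors.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Two phrases with different wording but nearby embeddings score high, which is why "damaged equipment" can match "broken machinery".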

POST /api/v2/files/search

Semantic search over document chunks. Scope to specific documents via document_ids; omit to search all.

Request

json
{
"query": "safety inspection requirements",
"limit": 20,
"similarity_threshold": 0.3,
"document_ids": ["550e8400-e29b-41d4-a716-446655440000"]
}
// document_ids is optional — omit to search all documents

Response

json
{
"query": "safety inspection requirements",
"items": [
{
"chunk_id": "chunk_001",
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"document_filename": "safety_manual.pdf",
"chunk_index": 5,
"content": "Safety inspections must be conducted quarterly...",
"page_numbers": [12, 13],
"heading_hierarchy": ["Chapter 3", "Inspections", "Schedule"],
"similarity_score": 0.87,
"metadata": {"chunk_type": "paragraph"}
}
],
"total_count": 15,
"search_time_ms": 45
}

POST /api/v2/files/analyze

Upload a document and receive extracted text in a single call. The request waits up to timeout seconds (default 60). If processing exceeds the timeout, the response has status "processing" and a job_id for polling. Maximum file size is 10 MB; for larger files use POST /files/upload (or /files/direct-uploads above 100 MB).

Request

json
curl -X POST https://api.scopix.ai/api/v2/files/analyze \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@report.pdf" \
-F "timeout=60"
// Form fields:
// file: required (PDF, DOCX, TXT, MD)
// timeout: optional, 5-120 (default: 60)
// skip_duplicates: optional (default: false)
// folder_id: optional
// project_id: optional

Response

json
// Discriminated union — check status first.
// status: "completed"
{
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"filename": "report.pdf",
"size_bytes": 2048576,
"processing_time_ms": 4500.0,
"document_type": "pdf",
"text_extraction_status": "completed",
"page_count": 15,
"chunk_count": 42,
"extracted_text": "SAFETY MANUAL\n\nChapter 1: Introduction..."
}
// status: "processing" (timeout exceeded — poll GET /job/{job_id})
{
"document_id": "...",
"status": "processing",
"job_id": "...",
"poll_url": "/api/v2/job/...",
"document_type": "pdf",
"text_extraction_status": "pending"
}
// status: "failed" or "skipped" (content-hash duplicate)
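
Because the response is a discriminated union, client code usually switches on status before reading other fields. A sketch with a hypothetical `handle_analyze_response` helper; the fields present on "failed" and "skipped" responses are not shown above, so only document_id is read, and defensively.

```python
def handle_analyze_response(body):
    """Return (kind, payload) based on the status discriminator."""
    status = body["status"]
    if status == "completed":
        return ("text", body["extracted_text"])
    if status == "processing":
        # Timeout exceeded: keep polling the job until it finishes.
        return ("poll", body["poll_url"])
    if status in ("failed", "skipped"):
        return (status, body.get("document_id"))
    raise ValueError("unexpected status: %r" % status)
```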

POST /api/v2/files/analyze/async

Same input as POST /files/analyze but always returns 202 immediately with a job_id. Use for fire-and-forget or concurrent document processing.

Request

json
curl -X POST https://api.scopix.ai/api/v2/files/analyze/async \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@report.pdf"

Response

json
// 202 Accepted
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "processing",
"poll_url": "/api/v2/job/550e8400-e29b-41d4-a716-446655440000"
}

Export, Quota & Deduplication

GET /api/v2/files/export/columns

Get available columns grouped by category for building export requests.

Response

json
{
"groups": {
"basic": [
{"field_key": "id", "display_name": "ID", "group": "basic"},
{"field_key": "filename", "display_name": "Filename", "group": "basic"},
{"field_key": "title", "display_name": "Title", "group": "basic"},
{"field_key": "size_bytes", "display_name": "Size (bytes)", "group": "basic"},
{"field_key": "created_at", "display_name": "Created At", "group": "basic"}
],
"descriptions": [
{"field_key": "upload_description", "display_name": "AI Description", "group": "descriptions"},
{"field_key": "user_description", "display_name": "User Description", "group": "descriptions"},
{"field_key": "tags", "display_name": "Tags", "group": "descriptions"}
]
}
}

POST /api/v2/files/export

Export file metadata as CSV, XLSX, DOCX, or Google Sheets.

Request

json
{
"format": "csv",
"columns": [
{"field_key": "filename"},
{"field_key": "title"},
{"field_key": "upload_description", "display_name": "AI Description"},
{"field_key": "tags"},
{"field_key": "created_at"}
],
"folder_id": "550e8400-e29b-41d4-a716-446655440000",
"include_subfolders": true,
"flatten_tags": true,
"sheet_name": "Files"
}
// format: required — "csv", "xlsx", "docx", or "google_sheets"
// columns: required, at least 1 column
// field_key: required — from the /export/columns registry
// file_ids: optional UUIDs to scope export
// folder_id: optional folder scope
// include_subfolders: optional, default: false
// flatten_tags: optional, default: true
// google_sheets_title: optional (for google_sheets format)
// connection_id: optional UUID — Google Drive connection (required for google_sheets)

Response

json
{
"download_url": "https://storage.example.com/exports/files_2026-04-13.csv",
"spreadsheet_url": null,
"record_count": 42,
"format": "csv"
}

GET /api/v2/files/quota-check

Check upload quota before starting (prevents failed uploads from quota exhaustion).

Request

json
// Query parameters:
?file_count=10 // required

Response

json
{
"can_proceed": true,
"requested": 10,
"available": 990,
"monthly_limit": 1000,
"current_usage": 10,
"prepaid_credits": 0,
"max_batch_size": 50,
"max_concurrent_uploads": 10,
"message": null
}
// monthly_limit: -1 for unlimited tiers
// When quota exceeded, can_proceed=false and message describes the shortfall

POST /api/v2/files/check-duplicates

Check which file hashes already exist for this tenant before uploading.

Request

json
{
"hashes": [
"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592"
]
}
// hashes: SHA-256 content hashes (1-250 items)

Response

json
{
"duplicates": [
"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
],
"unique": [
"d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592"
]
}
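
A client needs to hash each file before calling this endpoint. The sketch below assumes the hash is the lowercase hex SHA-256 digest of the raw file bytes, which matches the sample values (the first sample hash is the digest of an empty file). The helper names are hypothetical.

```python
import hashlib

MAX_HASHES_PER_CALL = 250  # documented limit for check-duplicates


def sha256_hex(data: bytes) -> str:
    """Lowercase hex SHA-256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def hash_batches(payloads):
    """Yield request bodies holding at most 250 hashes each."""
    hashes = [sha256_hex(p) for p in payloads]
    for i in range(0, len(hashes), MAX_HASHES_PER_CALL):
        yield {"hashes": hashes[i:i + MAX_HASHES_PER_CALL]}


def files_to_upload(payloads_by_name, response):
    """Return the names whose hash the server listed under "unique"."""
    unique = set(response["unique"])
    return [name for name, data in payloads_by_name.items()
            if sha256_hex(data) in unique]
```

For large files, hashing in chunks with `hashlib.sha256().update(...)` avoids loading the whole file into memory.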

Upload Sessions & Status

Batch uploads create a session — a per-batch tracking record. Use these endpoints to poll progress, retrieve per-file results, cancel pending work, and look up the unified processing status of any individual file by ID.

GET/api/v2/files/uploads-status/{file_id}

Get the unified upload + processing status for any file ID.

Response

json
{
"file_id": "550e8400-e29b-41d4-a716-446655440000",
"session_id": "660f9511-f3ac-52e5-b827-557766551111",
"unified_status": "processing", // uploading | confirming | queued | processing |
// completed | failed | partially_completed
"component_statuses": {
"variant_status": "completed",
"description_status": "processing",
"upload_status": "completed",
"processing_status": "processing"
},
"processing_ids": ["task_001", "task_002"],
"error_message": null,
"last_error_at": null,
"created_at": "2026-01-15T10:30:00Z",
"last_updated_at": "2026-01-15T10:30:45Z",
"completed_at": null,
"retry_count": 0,
"processing_duration_seconds": 45.2,
"is_stuck": false,
"is_terminal": false
}
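
A typical use of this endpoint is polling until the file reaches a terminal state. The sketch below is one possible loop, not a prescribed one: `fetch_status` is a caller-supplied function that performs the GET and returns the parsed JSON, and the interval and timeout defaults are arbitrary assumptions.

```python
import time


def wait_for_file(fetch_status, file_id, interval=2.0, timeout=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until the status reports is_terminal or is_stuck, or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(file_id)
        if status["is_terminal"] or status["is_stuck"]:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"file {file_id} still {status['unified_status']}")
        sleep(interval)
```

Returning on `is_stuck` as well as `is_terminal` lets the caller decide whether to retry instead of waiting out the full timeout.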
GET/api/v2/files/sessions

List upload sessions for the authenticated tenant.

Request

json
// Query parameters:
?status=processing // optional, filter by status
&upload_method=streaming // optional, "streaming" or "presigned"
&offset=0 // pagination (default: 0)
&limit=20 // default: 20, 1-100

Response

json
{
"items": [
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"upload_method": "streaming",
"total_files": 20,
"completed_files": 18,
"failed_files": 1,
"skipped_files": 1,
"progress_percentage": 100.0,
"created_at": "2026-01-15T10:30:00Z",
"completed_at": "2026-01-15T10:32:00Z"
}
],
"total_count": 15,
"limit": 20,
"offset": 0,
"has_more": false
}
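
The `offset`, `limit`, and `has_more` fields support a simple paging loop. The following sketch assumes a caller-supplied `fetch_page` function (hypothetical) that issues the GET with the given query parameters and returns the parsed JSON.

```python
def iter_sessions(fetch_page, limit=20, **filters):
    """Yield every session by advancing offset until has_more is false."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=limit, **filters)
        yield from page["items"]
        if not page["has_more"]:
            return
        offset += limit
```

Filters such as `status="processing"` or `upload_method="streaming"` are passed through unchanged as keyword arguments.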
GET/api/v2/files/sessions/{session_id}/status

Get current progress and recent activity for an upload session.

Response

json
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "processing", // pending | uploading | processing |
// completed | failed | expired | cancelled
"total_files": 20,
"completed_files": 15,
"failed_files": 1,
"skipped_files": 2,
"pending_files": 2,
"progress_percentage": 90.0,
"created_at": "2026-01-15T10:30:00Z",
"started_at": "2026-01-15T10:30:00Z",
"completed_at": null,
"estimated_completion_time": null,
"recent_completions": [
{
"filename": "photo1.jpg",
"file_id": "660e8400-e29b-41d4-a716-446655440001",
"status": "completed",
"description": "A site inspection showing...",
"processing_time_ms": null
}
],
"recent_errors": [],
"results_url": "/api/v2/files/sessions/{session_id}/results",
"websocket_channel": "batch.{session_id}"
}
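
When polling this endpoint, a client usually needs to know when to stop and what to show in the meantime. The sketch below treats `completed`, `failed`, `expired`, and `cancelled` as final states; that grouping is an assumption based on the status names, and both helper names are hypothetical.

```python
# Assumption: these statuses do not change again once reached.
TERMINAL_SESSION_STATUSES = {"completed", "failed", "expired", "cancelled"}


def session_is_finished(status):
    """True when the session status is one of the assumed final states."""
    return status["status"] in TERMINAL_SESSION_STATUSES


def progress_line(status):
    """One-line progress text built from the session status counters."""
    handled = (status["completed_files"] + status["failed_files"]
               + status["skipped_files"])
    return (f"{handled}/{status['total_files']} files handled "
            f"({status['progress_percentage']:.0f}%), "
            f"{status['pending_files']} pending")
```

The percentage is read from the response rather than recomputed, so the display always matches what the server reports.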
GET/api/v2/files/sessions/{session_id}/results

Paginated per-file results from a session.

Request

json
// Query parameters:
?include_failed=true // include failed files (default: true)
&offset=0 // default: 0
&limit=100 // 1-500, default: 100

Response

json
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"results": [
{
"file_id": "660e8400-e29b-41d4-a716-446655440001",
"filename": "photo1.jpg",
"status": "completed",
"description": "Safety inspection showing...",
"visible_text": "EXIT sign visible...",
"tags": ["safety", "construction"],
"processing_time_ms": null,
"error_message": null,
"thumbnail_url": "https://...",
"created_at": "2026-01-15T10:30:05Z"
}
],
"total_count": 20,
"offset": 0,
"limit": 100,
"has_more": false,
"summary": {"total_files": 20, "completed": 17, "failed": 1, "skipped": 2}
}
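
A common follow-up is collecting the failures from a results page so they can be retried or reported. This sketch assumes failed entries carry `"status": "failed"` and a non-null `error_message`; the sample above only shows a completed entry, so treat that as an assumption. `failed_results` is a hypothetical helper.

```python
def failed_results(results_page):
    """Map filename to error_message for failed entries in one results page."""
    return {
        item["filename"]: item["error_message"]
        for item in results_page["results"]
        if item["status"] == "failed"
    }
```

Call it once per page while `has_more` is true and merge the dictionaries to cover the whole session.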
POST/api/v2/files/sessions/{session_id}/cancel

Cancel a pending or in-progress session. Already-processed files keep their results.

Response

json
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "cancelled",
"total_files": 20,
"completed_files": 10,
"failed_files": 0,
"skipped_files": 0,
"pending_files": 10,
"progress_percentage": 50.0,
"created_at": "2026-01-15T10:30:00Z",
"started_at": "2026-01-15T10:30:00Z",
"completed_at": "2026-01-15T10:31:00Z",
"estimated_completion_time": null,
"recent_completions": [],
"recent_errors": [],
"results_url": "/api/v2/files/sessions/{session_id}/results",
"websocket_channel": "batch.{session_id}"
}
// 400 Bad Request if session is already completed/cancelled/expired
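
To avoid the 400 response, a client can check the session status before cancelling. The sketch below assumes two caller-supplied functions (both hypothetical): `get_status` performs GET .../status and `post_cancel` performs the POST; each returns the parsed JSON.

```python
# Statuses for which the cancel endpoint is documented to return 400.
NOT_CANCELLABLE = {"completed", "cancelled", "expired"}


def cancel_if_active(get_status, post_cancel, session_id):
    """Cancel only when the session is not already in a final state."""
    current = get_status(session_id)
    if current["status"] in NOT_CANCELLABLE:
        return current  # cancelling now would return 400 Bad Request
    return post_cancel(session_id)
```

A session can still finish between the two calls, so callers should handle a 400 from the POST as well.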
GET/api/v2/files/sessions/{session_id}/summary

Aggregated per-file status counts for every file in a session (uploading / processing / completed / failed / stuck) plus description status counts. Optimised for dashboards.

Response

json
{
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"overall_status": "processing",
"completion_percentage": 65.0,
"counts": {
"total": 20,
"uploading": 2,
"confirming": 0,
"queued": 1,
"processing": 4,
"completed": 13,
"failed": 0,
"partially_completed": 0,
"stuck": 0
},
"description_counts": {
"pending": 0,
"processing": 4,
"completed": 13,
"failed": 0,
"skipped": 0
},
"error_summary": {
"count": 0,
"messages": []
},
"created_at": "2026-01-15T10:30:00Z",
"last_activity_at": "2026-01-15T10:33:15Z"
}
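
For a dashboard, the counts can be reduced to a few display values. This is an illustrative sketch: `dashboard_row` and the output keys are hypothetical, and grouping `uploading`, `confirming`, `queued`, and `processing` as "in flight" is an assumption.

```python
IN_FLIGHT = ("uploading", "confirming", "queued", "processing")


def dashboard_row(summary):
    """Condense a session summary into a small dict for display."""
    counts = summary["counts"]
    return {
        "session_id": summary["session_id"],
        "percent": summary["completion_percentage"],
        "in_flight": sum(counts[key] for key in IN_FLIGHT),
        "done": counts["completed"],
        # Flag the row when any file failed or stopped making progress.
        "needs_attention": counts["failed"] + counts["stuck"] > 0,
    }
```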
GET/api/v2/files/sessions/stuck

List uploads that have made no progress within the threshold window (stuck_minutes). Useful for client-side recovery flows. Each entry has the same shape as the response of GET /files/uploads-status/{file_id}.

Request

json
// Query parameters:
?stuck_minutes=30 // default: 30, threshold for "stuck" (min 1)
&limit=100 // default: 100, 1-500

Response

json
{
"stuck_count": 1,
"images": [
{
"file_id": "550e8400-e29b-41d4-a716-446655440000",
"session_id": "770a0600-e29b-41d4-a716-446655440000",
"unified_status": "uploading",
"component_statuses": {
"variant_status": null,
"description_status": null,
"upload_status": "streaming",
"processing_status": null
},
"processing_ids": [],
"error_message": null,
"last_error_at": null,
"created_at": "2026-01-15T10:00:00Z",
"last_updated_at": "2026-01-15T10:00:15Z",
"completed_at": null,
"retry_count": 0,
"processing_duration_seconds": 1845.0,
"is_stuck": true,
"is_terminal": false
}
]
}
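
A recovery flow might validate the query parameters and then group stuck files by session. The sketch below uses hypothetical helper names; the parameter ranges and the `images` key are taken from the request and response above.

```python
from collections import defaultdict


def stuck_query(stuck_minutes=30, limit=100):
    """Validate and build the query parameters for GET /files/sessions/stuck."""
    if stuck_minutes < 1:
        raise ValueError("stuck_minutes must be at least 1")
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    return {"stuck_minutes": stuck_minutes, "limit": limit}


def stuck_by_session(response):
    """Group stuck file IDs under their session_id."""
    grouped = defaultdict(list)
    for entry in response["images"]:
        grouped[entry["session_id"]].append(entry["file_id"])
    return dict(grouped)
```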

Telemetry

POST/api/v2/files/log-upload-event

Fire-and-forget endpoint for client-side upload telemetry (e.g., browser-side errors, retries). No authentication is required. The server logs the event for diagnostics and never blocks the upload.

Request

json
{
"event_type": "upload_retry", // required, max 100 chars
"message": "Chunk 5 failed with NetworkError, retrying", // required, max 2000 chars
"data": { // optional, serialized size <= 10 KB
"upload_id": "550e8400-e29b-41d4-a716-446655440000",
"part_number": 5,
"user_agent": "Mozilla/5.0 ..."
},
"timestamp": "2026-04-15T10:35:00Z", // optional client-side timestamp
"session_id": "660f9511-f3ac-52e5-b827-557766551111", // optional, max 100 chars (matches session_id from POST /files/upload/batch)
"file_index": 5, // optional, 0-10000
"file_name": "huge.pdf" // optional, max 500 chars
}

Response

json
{
"status": "logged",
"timestamp": "2026-04-15T10:35:00.123456+00:00"
}
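
The field limits in the request comments can be enforced before sending. This is a minimal sketch with a hypothetical helper name; it assumes "10 KB" means 10 * 1024 bytes of UTF-8 encoded JSON and that an ISO 8601 UTC timestamp is acceptable.

```python
import json
from datetime import datetime, timezone

MAX_DATA_BYTES = 10 * 1024  # assumption: "10 KB" is 10 * 1024 bytes of JSON


def build_upload_event(event_type, message, data=None, session_id=None,
                       file_index=None, file_name=None):
    """Build a body for POST /api/v2/files/log-upload-event."""
    if not 1 <= len(event_type) <= 100:
        raise ValueError("event_type must be 1-100 characters")
    if not 1 <= len(message) <= 2000:
        raise ValueError("message must be 1-2000 characters")
    event = {
        "event_type": event_type,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if data is not None:
        if len(json.dumps(data).encode("utf-8")) > MAX_DATA_BYTES:
            raise ValueError("data exceeds 10 KB when serialized")
        event["data"] = data
    if session_id is not None:
        if len(session_id) > 100:
            raise ValueError("session_id must be at most 100 characters")
        event["session_id"] = session_id
    if file_index is not None:
        if not 0 <= file_index <= 10000:
            raise ValueError("file_index must be between 0 and 10000")
        event["file_index"] = file_index
    if file_name is not None:
        event["file_name"] = file_name[:500]  # truncate instead of rejecting
    return event
```

Because the endpoint is fire-and-forget, callers would typically send the event without waiting on or retrying the response.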

Limits & Constraints

  • Streaming upload max: 100 MB per file
  • Single-shot presigned max: 5 GB (S3 PUT cap)
  • Multipart max: 5 TB (per S3 limits)
  • Multipart part size: 5 MB minimum per part; S3 imposes a 5 GB per-part hard limit
  • Synchronous document analysis: max 10 MB (`/files/analyze`); use `/files/upload` for larger files
  • Streaming batch size: up to 10–200 files per request, depending on tier; each file capped at 100 MB
  • Batch delete: max 100 files per request
  • Search query length: 1–1000 characters
  • Tags per file: max 40 tags, each max 50 characters
  • Title length: max 255 characters
  • User description: max 10000 characters
  • Hash dedup batch: 1–250 hashes per call
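
Several of these limits can be checked before any request is made. The sketch below is illustrative only: the function names and the returned method labels are hypothetical, and sizes use decimal units (1 MB = 1,000,000 bytes), which is the conservative reading if the server counts in binary units.

```python
MB = 1000 * 1000  # assumption: decimal units; the server may use MiB
GB = 1000 * MB
TB = 1000 * GB


def choose_upload_method(size_bytes):
    """Pick an upload path from the documented size caps."""
    if size_bytes <= 100 * MB:
        return "streaming"
    if size_bytes <= 5 * GB:
        return "presigned_single"
    if size_bytes <= 5 * TB:
        return "presigned_multipart"
    raise ValueError("file exceeds the 5 TB multipart limit")


def validate_metadata(title=None, tags=(), user_description=None):
    """Return a list of limit violations; an empty list means all checks pass."""
    problems = []
    if title is not None and len(title) > 255:
        problems.append("title exceeds 255 characters")
    if len(tags) > 40:
        problems.append("more than 40 tags")
    problems.extend(f"tag too long: {tag[:20]}" for tag in tags if len(tag) > 50)
    if user_description is not None and len(user_description) > 10000:
        problems.append("user description exceeds 10000 characters")
    return problems
```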