Supported Document Formats

Hubrix supports nine file formats for upload to the Documents library. Files outside these types are rejected with an error message at the time of upload.

Figure: The Documents upload zone, showing the supported file format icons.

Supported formats

Format	Extension	Notes
PDF	`.pdf`	Text-based PDFs extract best. Scanned PDFs without OCR extract poorly.
Word document	`.docx`	Tables and lists are extracted. Embedded images are ignored.
Excel spreadsheet	`.xlsx`	Each sheet is extracted as tab-separated text.
PowerPoint	`.pptx`	Slide text is extracted; speaker notes are included.
Plain text	`.txt`	Extracted as-is. Any encoding is accepted.
Markdown	`.md`	Extracted as plain text; Markdown syntax is preserved.
CSV	`.csv`	Rows are extracted as structured text for RAG queries.
JSON	`.json`	Keys and values are extracted as readable text.
HTML	`.html`	Tags are stripped; visible text is extracted.
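
If you prepare files with a script, a pre-upload check can catch unsupported types early. The following is a minimal sketch, not part of Hubrix: the helper name `is_supported` is hypothetical, the extensions are copied from the table above, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# Extensions copied from the supported formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".txt",
    ".md", ".csv", ".json", ".html",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension is in the supported list.

    Assumption: extensions are compared case-insensitively. Hubrix's own
    check may be stricter.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("report.pdf")` returns `True` and `is_supported("archive.zip")` returns `False`.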

File size limit

The maximum file size for manual uploads is 50 MB per file. Larger files are rejected at upload. To work with a larger document, split it into smaller parts before uploading.

Data Connector sync (Google Drive / OneDrive) has a lower limit: it skips files larger than 25 MB. The 50 MB limit applies to manual uploads only.
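
The two limits above can be expressed as a small local check. This is a sketch under stated assumptions: the function name `size_status` is hypothetical, and it assumes 1 MB means 1,048,576 bytes, which Hubrix may count differently.

```python
# Assumption: 1 MB = 1024 * 1024 bytes. Hubrix may use 1,000,000 instead.
MB = 1024 * 1024

MANUAL_UPLOAD_LIMIT = 50 * MB   # files larger than this are rejected at upload
CONNECTOR_SYNC_LIMIT = 25 * MB  # files larger than this are skipped by sync


def size_status(size_bytes: int) -> str:
    """Describe how a file of the given size fares against both limits."""
    if size_bytes > MANUAL_UPLOAD_LIMIT:
        return "too large: split before uploading"
    if size_bytes > CONNECTOR_SYNC_LIMIT:
        return "manual upload only: skipped by Data Connector sync"
    return "ok for manual upload and Data Connector sync"
```

To check a file on disk, pass `os.path.getsize(path)` as `size_bytes`.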

Getting the best extraction quality

Extraction quality determines how accurately the AI can answer questions about your document. Here are the most important factors:

Text-based PDFs vs scanned PDFs

The biggest quality difference is between:

Text-based PDFs: created digitally (exported from Word, Google Docs, or other software). The text is embedded and extracts reliably.
Scanned PDFs: images of physical pages. Hubrix does not currently perform OCR (optical character recognition). If your PDF is a scan with no OCR text layer, the extracted text will be empty or garbled.

To check if your PDF is text-based: try selecting text in your PDF viewer. If you can highlight words, the PDF is text-based.

Other quality tips

Avoid heavily formatted files: PDFs with complex multi-column layouts or overlapping text boxes may extract in the wrong reading order.
Use DOCX for Word content: for Word documents with tables and lists, DOCX extracts more cleanly than a PDF export.
Keep CSV files clean: make sure each CSV file has a header row and the same number of columns in every row.
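
The CSV tip can be checked locally before upload. The sketch below is a hypothetical helper, not a Hubrix tool: it uses Python's standard `csv` module, treats the first row as the header, and reports rows whose column count differs from it. It cannot tell whether the first row really is a header; it only flags empty column names.

```python
import csv
import io


def check_csv(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the CSV looks clean."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]  # assumption: the first row is the header
    if any(not cell.strip() for cell in header):
        problems.append("header row has an empty column name")

    # Row numbers are 1-based and count the header as row 1.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no} has {len(row)} columns, expected {len(header)}"
            )
    return problems
```

To check a file, read it with `open(path, newline="", encoding="utf-8").read()` and pass the result to `check_csv`.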

After upload, check the document's status in the Documents library. If the status shows Failed, the file most likely could not be parsed. Try re-exporting it in a different supported format or cleaning up the file's structure.

Unsupported file types

If you upload a file type that is not in the supported list (for example, a .zip archive or a .png image), the upload is rejected immediately with a message explaining the issue. There is no workaround for image-only files.
