Supported Document Formats

Hubrix supports nine file formats for upload to the Documents library. Files outside these types are rejected with an error message at the time of upload.

The Documents upload zone listing the supported file format icons

Supported formats

Format             Extension   Notes
PDF                .pdf        Text-based PDFs extract best. Scanned PDFs without OCR extract poorly.
Word document      .docx       Tables and lists are extracted. Embedded images are ignored.
Excel spreadsheet  .xlsx       Each sheet is extracted as tab-separated text.
PowerPoint         .pptx       Slide text is extracted; speaker notes are included.
Plain text         .txt        Extracted as-is. Any encoding is accepted.
Markdown           .md         Extracted as plain text; Markdown syntax is preserved.
CSV                .csv        Rows are extracted as structured text for RAG queries.
JSON               .json       Keys and values are extracted as readable text.
HTML               .html       Tags are stripped; visible text is extracted.

File size limit

The maximum file size is 50 MB per file. Files larger than 50 MB are rejected at upload. If you need to work with larger files, consider splitting the document into smaller parts before uploading.

Data Connector sync (Google Drive / OneDrive) has a lower limit: it skips files larger than 25 MB. The 50 MB limit applies to manual uploads only.
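
If you prepare files in bulk, the format and size rules above can be checked before uploading. The following is a minimal Python sketch, not part of Hubrix. The function name check_upload, the constant names, the case-insensitive extension match, and the choice of 1 MB = 1,048,576 bytes are all assumptions. Only the extension list and the 50 MB and 25 MB limits come from this article.

```python
from pathlib import Path

# Extensions and limits are taken from this article; all names are illustrative.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".txt", ".md", ".csv", ".json", ".html",
}
MANUAL_UPLOAD_LIMIT_MB = 50
CONNECTOR_SYNC_LIMIT_MB = 25
BYTES_PER_MB = 1024 * 1024  # assumption: the article does not define "MB"


def check_upload(filename: str, size_bytes: int, via_connector: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []

    # Assumption: extensions are compared case-insensitively.
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(no extension)'}")

    # Data Connector sync uses the lower limit; manual uploads use the higher one.
    limit_mb = CONNECTOR_SYNC_LIMIT_MB if via_connector else MANUAL_UPLOAD_LIMIT_MB
    if size_bytes > limit_mb * BYTES_PER_MB:
        problems.append(f"file exceeds the {limit_mb} MB limit")

    return problems
```

For example, a 30 MB .docx file passes as a manual upload, but check_upload reports one problem for it when via_connector=True.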

Getting the best extraction quality

Extraction quality determines how accurately the AI can answer questions about your document. Here are the most important factors:

Text-based PDFs vs scanned PDFs

The biggest quality difference is between:

  • Text-based PDFs: created digitally (exported from Word, Google Docs, or other software). The text is embedded in the file and usually extracts cleanly.
  • Scanned PDFs: images of physical pages. Hubrix does not currently perform OCR (optical character recognition), so if your PDF is a scan, the extracted text will be empty or gibberish.

To check if your PDF is text-based: try selecting text in your PDF viewer. If you can highlight words, the PDF is text-based.

Other quality tips

  • Avoid heavily formatted files: PDFs with complex multi-column layouts or overlapping text boxes may extract in the wrong reading order.
  • Use DOCX for Word content: for Word documents with tables and lists, DOCX extracts more cleanly than a PDF export.
  • Keep CSV files clean: make sure each CSV file has a header row and the same number of columns in every row.
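
The CSV tip above can be checked before upload. This minimal Python sketch uses only the standard csv module, and the function name check_csv is illustrative. It assumes the first non-blank row is the header and checks only that every header cell is non-empty, because a script cannot tell reliably whether a first row is really a header. It then flags any row whose column count differs from the header's.

```python
import csv
import io


def check_csv(text: str) -> list[str]:
    """Return a list of problems found in CSV text; empty means it looks clean."""
    reader = csv.reader(io.StringIO(text, newline=""))
    problems = []
    header = None

    for row in reader:
        if not row:
            continue  # skip completely blank lines
        if header is None:
            # Assumption: the first non-blank row is the header.
            header = row
            if any(not cell.strip() for cell in header):
                problems.append("header row has an empty column name")
            continue
        if len(row) != len(header):
            problems.append(
                f"line {reader.line_num}: expected {len(header)} columns, found {len(row)}"
            )

    if header is None:
        problems.append("file is empty")
    return problems
```

To check a file on disk, read it first, for example with pathlib.Path("data.csv").read_text(), and pass the result to check_csv.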

After upload, check the document's status in the Documents library. If the status shows Failed, the file likely could not be parsed. Try re-exporting it in a different supported format or cleaning up the file's structure.

Unsupported file types

If you upload a file type that is not in the supported list (for example, a .zip archive or a .png image), the upload is rejected immediately with a message explaining the issue. There is no workaround for image-only files.
