Optional
api
Optional
chunking
Optional
client
Optional
combine
If a chunking strategy is set, combine elements until a section reaches a length of n characters. Default: 500
Optional
coordinates
If True, return coordinates for each element extracted via OCR. Default: False
Optional
enable
The Unstructured SDK logs messages at request time via console.info. Passing true logs these messages. The default, false, overwrites the console.info function so that it does not log.
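The logging toggle described above can be pictured with a short sketch. It makes no claim about the SDK's internals: `runWithInfoLogs` and its `logsEnabled` parameter are hypothetical names, used only to show how replacing `console.info` with a no-op silences request-time messages.

```typescript
// Hypothetical helper, not part of the Unstructured SDK: it illustrates how
// replacing console.info with a no-op suppresses request-time log messages.
function runWithInfoLogs<T>(logsEnabled: boolean, fn: () => T): T {
  const originalInfo = console.info;
  if (!logsEnabled) {
    // Overwrite console.info so that calls made inside fn print nothing.
    console.info = () => {};
  }
  try {
    return fn();
  } finally {
    // Always restore the original function, even if fn throws.
    console.info = originalInfo;
  }
}

// Usage: the message is swallowed because logging is disabled.
const logDemoResult = runWithInfoLogs(false, () => {
  console.info("partition request sent");
  return "done";
});
```

Unlike this sketch, which restores `console.info` after the callback, the description above suggests the override stays in place, so other code that relies on `console.info` may be affected when logging is left disabled.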
Optional
encoding
The encoding method used to decode the text input. Default: utf-8
Optional
extract
The types of elements to extract, for use in extracting image blocks as base64-encoded data stored in metadata fields.
Optional
gz
If the file is gzipped, use this content type after unzipping.
Optional
hi
The name of the inference model used when strategy is hi_res.
Optional
http
Optional
include
When a chunking strategy is specified, each returned chunk will include the elements consolidated to form that chunk as .metadata.orig_elements. Default: true.
Optional
include
If true, the output will include page breaks if the file type supports it. Default: false
Optional
languages
The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages.
Optional
max
If a chunking strategy is set, cut off new sections after reaching a length of n characters (hard max). Default: 500
Optional
multipage
If a chunking strategy is set, determines whether sections can span multiple pages. Default: true
Optional
new
If a chunking strategy is set, cut off new sections after reaching a length of n characters (soft max). Default: 1500
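The interplay between the hard and soft maximums above can be sketched as a greedy packer. This is an illustration under stated assumptions, not the service's algorithm: `chunkElements`, `hardMax`, and `softMax` are hypothetical names, elements are joined with a single space, and oversized elements are cut at fixed character offsets.

```typescript
// Hypothetical sketch of size-bounded chunking; not the Unstructured implementation.
// hardMax: no chunk may exceed this many characters.
// softMax: once a chunk reaches this length, it is closed and a new one starts.
function chunkElements(texts: string[], hardMax: number, softMax: number): string[] {
  const chunks: string[] = [];
  let current = "";

  const flush = (): void => {
    if (current.length > 0) {
      chunks.push(current);
      current = "";
    }
  };

  for (const text of texts) {
    if (text.length > hardMax) {
      // An oversized element is emitted on its own, split into hardMax-sized pieces.
      flush();
      for (let start = 0; start < text.length; start += hardMax) {
        chunks.push(text.slice(start, start + hardMax));
      }
      continue;
    }
    const candidate = current.length === 0 ? text : `${current} ${text}`;
    if (candidate.length > hardMax) {
      // Adding this element would break the hard max, so start a new chunk with it.
      flush();
      current = text;
    } else {
      current = candidate;
    }
    // Soft max: close the chunk once it has grown long enough.
    if (current.length >= softMax) {
      flush();
    }
  }
  flush();
  return chunks;
}

const sizedChunks = chunkElements(["aaaa", "bbbb", "cccc", "dddddddddddd"], 10, 8);
```

In this sketch the soft maximum only has an effect when it is smaller than the hard maximum; otherwise the hard limit always triggers first.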
Optional
ocr
Deprecated! The languages present in the document, for use in partitioning and/or OCR.
Optional
output
The format of the response. Supported formats are application/json and text/csv. Default: application/json.
Optional
overlap
Specifies the length of a string ('tail') to be drawn from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this only applies to split-chunks where an oversized element is divided into multiple chunks by text-splitting. Default: 0
Optional
overlap
When True, apply overlap between 'normal' chunks formed from whole elements and not subject to text-splitting. Use this with caution as it entails a certain level of 'pollution' of otherwise clean semantic chunk boundaries. Default: False
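The tail-prefixing mechanism described in the two overlap options can be sketched as follows. `applyOverlap` is a hypothetical helper, and joining the tail to the next chunk with a single space is an assumption made for readability; the sketch applies overlap between every pair of adjacent chunks, which corresponds to the "apply to all chunks" behavior rather than the split-chunks-only default.

```typescript
// Hypothetical sketch: prefix each chunk with the last `overlap` characters
// of the chunk before it. Not taken from the Unstructured codebase.
function applyOverlap(chunks: string[], overlap: number): string[] {
  return chunks.map((chunk, i) => {
    // The first chunk has no predecessor; an overlap of 0 disables the feature.
    if (i === 0 || overlap <= 0) {
      return chunk;
    }
    const previous = chunks[i - 1] ?? "";
    const tail = previous.slice(-overlap);
    return `${tail} ${chunk}`;
  });
}

const overlapped = applyOverlap(["alpha beta", "gamma delta", "epsilon"], 4);
// overlapped is ["alpha beta", "beta gamma delta", "elta epsilon"]
```

The last chunk in the example shows the 'pollution' the description warns about: a tail cut at a fixed character count can start mid-word.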
Optional
partition
Optional
pdf
Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from PDF files regardless of skip_infer_table_types contents.
Optional
post
Optional
retry
Allows overriding the default retry config used by the SDK.
Optional
security
The security details required to authenticate the SDK.
Optional
server
Allows overriding the default server used by the SDK.
Optional
serverURL
Allows overriding the default server URL used by the SDK.
Optional
similarity
A value between 0.0 and 1.0 describing the minimum similarity two elements must have to be included in the same chunk. Note that similar elements may be separated to meet chunk-size criteria; this value can only guarantee that two elements with similarity below the threshold will appear in separate chunks.
Optional
skip
The document types for which to skip table extraction. Default: []
Optional
split
Maximum number of concurrent requests made when splitting a PDF. Ignored on the backend.
Optional
split
Whether the PDF file should be split on the client. Ignored on the backend.
Optional
starting
When a PDF is split into pages before being sent to the API, providing this information allows page numbers to be assigned correctly. Introduced in 1.0.27.
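To see why a starting page number matters when a PDF is split on the client, consider the following sketch. `planPageBatches`, `PageBatch`, and `firstPage` are hypothetical names and do not reflect the SDK's splitting logic; the point is only that each batch needs to carry the page number of its first page so that results can be numbered relative to the whole document.

```typescript
// Hypothetical sketch of client-side PDF splitting bookkeeping.
interface PageBatch {
  firstPage: number; // 1-based page number, in the original PDF, of the batch's first page
  pageCount: number; // number of pages in this batch
}

function planPageBatches(totalPages: number, batchSize: number): PageBatch[] {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }
  const batches: PageBatch[] = [];
  for (let first = 1; first <= totalPages; first += batchSize) {
    batches.push({
      firstPage: first,
      // The final batch may be shorter than batchSize.
      pageCount: Math.min(batchSize, totalPages - first + 1),
    });
  }
  return batches;
}

const pageBatches = planPageBatches(7, 3);
// pageBatches: firstPage 1 (3 pages), firstPage 4 (3 pages), firstPage 7 (1 page)
```

Without the first-page value, every batch would be numbered from page 1 and the page metadata of the merged result would be wrong.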
Optional
strategy
Optional
timeout
Optional
unique
When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 hash of the element text is used. Default: False
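The difference between the two ID schemes can be illustrated with Node's built-in `crypto` module. `elementId` is a hypothetical function, and hashing the raw text into a full hex digest is a simplification; the service may hash additional data or format the digest differently.

```typescript
import { createHash, randomUUID } from "node:crypto";

// Hypothetical sketch: choose between a random UUID and a content hash.
function elementId(text: string, unique: boolean): string {
  if (unique) {
    // Random UUID: different on every call, even for identical text.
    return randomUUID();
  }
  // Content hash: deterministic, so identical text yields identical IDs.
  return createHash("sha256").update(text).digest("hex");
}

const hashedA = elementId("Quarterly results", false);
const hashedB = elementId("Quarterly results", false);
const uuidA = elementId("Quarterly results", true);
const uuidB = elementId("Quarterly results", true);
// hashedA === hashedB, while uuidA !== uuidB
```

This is why hashed IDs can collide when a document repeats the same text, which makes UUIDs the safer choice for primary keys.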
Optional
xml
If True, retain the XML tags in the output. Otherwise, extract only the text from within the tags. Only applies to XML documents.
Use one of the supported strategies to chunk the returned elements after partitioning. When 'chunking_strategy' is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: 'basic', 'by_page', 'by_similarity', or 'by_title'
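The four strategy names listed above lend themselves to a string-literal union. The following is a sketch for caller-side validation; `ChunkingStrategy` and `isChunkingStrategy` are hypothetical names and are not claimed to match the SDK's own exported types.

```typescript
// The strategy names are taken from the description above; the type and
// guard names are illustrative assumptions.
const SUPPORTED_STRATEGIES = ["basic", "by_page", "by_similarity", "by_title"] as const;

type ChunkingStrategy = (typeof SUPPORTED_STRATEGIES)[number];

// Type guard: narrows an arbitrary string to a supported strategy name.
function isChunkingStrategy(value: string): value is ChunkingStrategy {
  return (SUPPORTED_STRATEGIES as readonly string[]).includes(value);
}

const requested = "by_title";
const strategy: ChunkingStrategy | undefined = isChunkingStrategy(requested)
  ? requested
  : undefined; // undefined here means "no chunking", mirroring the unset case
```

Validating the name before sending a request avoids silently passing other chunking parameters that would be ignored when no strategy is set.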