Document Operations
Supported Storage Options
- Local: Loads data from the application's local storage
- Azure Blob Storage: Loads data from Azure Blob Storage
- Amazon S3: Loads data from Amazon S3 buckets
- Google Cloud Storage: Loads data from Google Cloud Storage
Use document operations to load a single document or a list of documents. Each document is first parsed and then optionally split into chunks of the provided size.
These operations are meant to be followed by an [Embedding] Generate from document operation. The output payload of a document operation can be passed to that operation without any transformation.
Document | Load single
The [Document] Load single operation parses a document and optionally splits it into text chunks based on the provided size.
How to Use
Add Document to Store
The [Document] Load single operation should be followed by an [Embedding] Generate from document operation.
The output payload is ready to be used by the [Embedding] Generate from document operation without any transformation.
Input Fields
Module Configuration
This refers to the MuleSoft Vectors Document Configuration set up in the Getting Started section.
Document Fields
- File Type: The type of the document to be ingested into the embedding store. Currently, four file types are supported:
  - any: Any type except txt, url or crawl
  - text: Any type of text file (json, xml, txt, csv, etc.)
  - url: Only a single URL is supported.
  - crawl: The file type created by the webcrawler connector.
- Context Path: Behaviour changes based on storage type.
  - Local: Contains the path of the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
  - AZURE_BLOB: Contains the container name and blob item name in the format <container-name>/<blob-item-name> (e.g. ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...).
  - S3: Contains the AWS S3 bucket and AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g. s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...).
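The cloud storage path formats above can be checked before they are passed to the connector. The following is a minimal Python sketch, not part of the connector: the names StorageLocation, parse_s3_path and parse_azure_blob_path are hypothetical helpers introduced only for illustration, and the sketch assumes that everything after the first "/" belongs to the object key or blob item name.

```python
from typing import NamedTuple


class StorageLocation(NamedTuple):
    """Hypothetical helper type: not part of the connector."""
    container: str  # S3 bucket or Azure container name
    key: str        # S3 object key or Azure blob item name


def parse_s3_path(context_path: str) -> StorageLocation:
    """Split s3://<s3-bucket>/<s3-object-key> into bucket and object key."""
    prefix = "s3://"
    if not context_path.startswith(prefix):
        raise ValueError(f"expected an s3:// path, got {context_path!r}")
    # Assumption: the first "/" after the prefix ends the bucket name.
    bucket, _, key = context_path[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or object key in {context_path!r}")
    return StorageLocation(bucket, key)


def parse_azure_blob_path(context_path: str) -> StorageLocation:
    """Split <container-name>/<blob-item-name> into container and blob name."""
    # Assumption: the first "/" ends the container name.
    container, _, blob = context_path.partition("/")
    if not container or not blob:
        raise ValueError(f"missing container or blob name in {context_path!r}")
    return StorageLocation(container, blob)
```

With the examples from the list above, parse_s3_path("s3://ms-vectors-bucket/folder/setup.adoc") yields the bucket ms-vectors-bucket and the key folder/setup.adoc.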
Segmentation Fields
- Max Segment Size (Characters): The maximum size, in characters, of each segment the document is split into.
- Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments, used to fine-tune the similarity search.
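To illustrate how the two segmentation fields interact, here is a simplified Python sketch of fixed-size splitting with overlap. It is an assumption-laden illustration, not the connector's implementation: the real splitter may break on sentence or paragraph boundaries, and split_with_overlap is a hypothetical function name.

```python
from typing import List


def split_with_overlap(text: str, max_segment_size: int, max_overlap_size: int) -> List[str]:
    """Cut text into character windows that share max_overlap_size characters."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if not 0 <= max_overlap_size < max_segment_size:
        raise ValueError("max_overlap_size must be >= 0 and smaller than max_segment_size")

    # Each new window starts this many characters after the previous one.
    step = max_segment_size - max_overlap_size
    segments = []
    start = 0
    while start < len(text):
        segments.append(text[start:start + max_segment_size])
        # Stop once the current window reaches the end of the text.
        if start + max_segment_size >= len(text):
            break
        start += step
    return segments


print(split_with_overlap("abcdefghij", 4, 1))  # ['abcd', 'defg', 'ghij']
```

In this sketch, a larger overlap repeats more context at segment borders at the cost of producing more segments for the same document.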
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:document-load-single
doc:name="[Document] Load single"
doc:id="9d197b8b-6ea7-46b6-9ed2-bdc9d7ed3c4f"
config-ref="MuleSoft_Vectors_Connector_Document_config"
fileType="any"
contextPath="#[payload.contextPath]"
maxSegmentSizeInChar="3000"
maxOverlapSizeInChars="300"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
Here is an example of the JSON output.
{
"text-segments": [
{
"metadata": {
"index": "0",
"source": "s3://ms-vectors/invoicesample.pdf",
"file_type": "any",
"file_name": "invoicesample.pdf"
},
"text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
},
...
]
}
- text-segments: The segments of the text of the document / file.
  - list-item (text-segment):
    - text: The text segment.
    - metadata: The metadata key-value pairs.
      - index: The segment/chunk number for the uploaded data source.
      - absolute_directory_path: The full path to the file that contains the relevant text segment.
      - file_name: The name of the file in which the text segment was found.
      - full_path: The full path to the file.
      - file_type: The file/source type.
      - source: File path set by cloud storage services (e.g. Amazon S3).
      - url: Web page URL when processing file type url.
      - title: Web page title.
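The payload structure described above can be read with a few lines of code once it leaves the flow. The Python sketch below is illustrative only: summarize_segments is a hypothetical helper, and the sample payload is a shortened copy of the example output with an abbreviated text value. Note that index arrives as a string in the example, so the sketch converts it to an integer.

```python
from typing import List, Optional, Tuple


def summarize_segments(payload: dict) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, file_name, source) for every text segment in the payload."""
    rows = []
    for segment in payload.get("text-segments", []):
        metadata = segment.get("metadata", {})
        rows.append((
            int(metadata["index"]),     # "index" is a string in the example output
            metadata.get("file_name"),  # may be absent, e.g. for file type url
            metadata.get("source"),     # set by cloud storage services
        ))
    return rows


# Shortened copy of the example output; the text value is abbreviated.
sample_payload = {
    "text-segments": [
        {
            "metadata": {
                "index": "0",
                "source": "s3://ms-vectors/invoicesample.pdf",
                "file_type": "any",
                "file_name": "invoicesample.pdf",
            },
            "text": "Denny Gunawan\n\n221 Queen St",
        }
    ]
}

print(summarize_segments(sample_payload))
```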
Attributes
- DocumentResponseAttributes:
- fileType: Contains the type of the document to be ingested into the embedding store.
- contextPath: Behaviour changes based on storage type.
Document | Load list
The [Document] Load list operation parses a list of documents and optionally splits them into text chunks based on the provided size.
How to Use
Add Folder to Store
The [Document] Load list operation can be followed by a Batch Job, For Each or For Each Parallel scope that includes an [Embedding] Generate from document operation.
Each document in the output payload is ready to be used by the [Embedding] Generate from document operation without any transformation.
Input Fields
Module Configuration
This refers to the MuleSoft Vectors Document Configuration set up in the Getting Started section.
Document Fields
- File Type: The type of the documents to be ingested into the embedding store. Currently, four file types are supported:
  - any: Any type except txt, url or crawl
  - text: Any type of text file (json, xml, txt, csv, etc.)
  - url: Only a single URL is supported.
  - crawl: The file type created by the webcrawler connector.
- Context Path: Behaviour changes based on storage type.
  - Local: Contains the path of the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
  - AZURE_BLOB: Contains the container name and blob item name in the format <container-name>/<blob-item-name> (e.g. ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...).
  - S3: Contains the AWS S3 bucket and AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g. s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...).
Segmentation Fields
- Max Segment Size (Characters): The maximum size, in characters, of each segment the documents are split into.
- Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments, used to fine-tune the similarity search.
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:document-load-list
doc:name="[Document] Load list"
doc:id="9d197b8b-6ea7-46b6-9ed2-bdc9d7ed3c4f"
config-ref="MuleSoft_Vectors_Connector_Document_config"
fileType="any"
contextPath="#[payload.contextPath]"
maxSegmentSizeInChar="3000"
maxOverlapSizeInChars="300"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
Here is an example of the JSON output.
[
{
"text-segments": [
{
"metadata": {
"index": "0",
"source": "s3://ms-vectors/invoicesample.pdf",
"file_type": "any",
"file_name": "invoicesample.pdf"
},
"text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
},
...
]
}
]
- list-item (document):
  - text-segments: The segments of the text of the document / file.
    - list-item (text-segment):
      - text: The text segment.
      - metadata: The metadata key-value pairs.
        - index: The segment/chunk number for the uploaded data source.
        - absolute_directory_path: The full path to the file that contains the relevant text segment.
        - file_name: The name of the file in which the text segment was found.
        - full_path: The full path to the file.
        - file_type: The file/source type.
        - source: File path set by cloud storage services (e.g. Amazon S3).
        - url: Web page URL when processing file type url.
        - title: Web page title.
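Because the list payload nests text segments inside documents, a consumer typically walks both levels, much as a For Each scope would walk the outer list. The Python sketch below shows one way to do that outside of Mule. The function count_segments_per_file and the sample data are hypothetical: the file names are borrowed from the examples in this page, and the text values are placeholders.

```python
from collections import Counter
from typing import Dict, List


def count_segments_per_file(documents: List[dict]) -> Dict[str, int]:
    """Count text segments per file_name across a Load list payload."""
    counts: Counter = Counter()
    for document in documents:                             # outer list: one entry per document
        for segment in document.get("text-segments", []):  # inner list: segments of that document
            file_name = segment.get("metadata", {}).get("file_name", "<unknown>")
            counts[file_name] += 1
    return dict(counts)


# Illustrative payload only; text values are placeholders.
sample_documents = [
    {
        "text-segments": [
            {"metadata": {"index": "0", "file_name": "invoicesample.pdf"}, "text": "first segment"},
            {"metadata": {"index": "1", "file_name": "invoicesample.pdf"}, "text": "second segment"},
        ]
    },
    {
        "text-segments": [
            {"metadata": {"index": "0", "file_name": "setup.adoc"}, "text": "only segment"},
        ]
    },
]

print(count_segments_per_file(sample_documents))  # {'invoicesample.pdf': 2, 'setup.adoc': 1}
```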
Attributes
- DocumentResponseAttributes:
- fileType: Contains the type of the document to be ingested into the embedding store.
- contextPath: Behaviour changes based on storage type.