Document Operations

Supported Storage Options

Local: Loads data from the application's local storage
Azure Blob Storage: Loads data from Azure Blob Storage
Amazon S3: Loads data from Amazon S3 buckets
Google Cloud Storage: Loads data from Google Cloud Storage

Use the document operations to load a single document or a list of documents. Each document is first parsed and then optionally split into chunks of the provided size.

These operations are intended to be followed by an [Embedding] Generate from document operation. Their output payload can be passed to that operation without any transformation.

Document | Load single

The [Document] Load single operation parses a document and optionally splits it into text chunks based on the provided size.

How to Use

Add Document to Store

The [Document] Load single operation should be followed by an [Embedding] Generate from document operation. The output payload is ready to be used by the [Embedding] Generate from document operation without any transformation.

Input Fields

Module Configuration

This refers to the MuleSoft Vectors Storage Configuration set up in the Getting Started section.

Document Fields

File Type: The type of the document to be ingested into the embedding store. The following file types are supported:
- any: Any type except text, url or crawl
- text: Any type of text file (json, xml, txt, csv, etc.)
- url: A single URL. Only one URL is supported.
- crawl: The file type created by the webcrawler connector.
Context Path: The expected value depends on the storage type.
- Local: The path to the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
- AZURE_BLOB: The container name and blob item name in the format <container-name>/<blob-item-name> (e.g. ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...)
- S3: The AWS S3 bucket and AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g. s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...)
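
The two cloud path formats above can be checked before a value is handed to the operation. The following is a minimal Python sketch, not part of the connector: the function names are hypothetical, and it only assumes the formats exactly as written above (s3://<s3-bucket>/<s3-object-key> and <container-name>/<blob-item-name>).

```python
from typing import Tuple


def parse_s3_context_path(context_path: str) -> Tuple[str, str]:
    """Split s3://<s3-bucket>/<s3-object-key> into (bucket, object key)."""
    prefix = "s3://"
    if not context_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}: {context_path!r}")
    # Everything up to the first slash is the bucket; the rest is the object key.
    bucket, sep, key = context_path[len(prefix):].partition("/")
    if not bucket or not sep or not key:
        raise ValueError(f"expected s3://<s3-bucket>/<s3-object-key>: {context_path!r}")
    return bucket, key


def parse_azure_context_path(context_path: str) -> Tuple[str, str]:
    """Split <container-name>/<blob-item-name> into (container, blob item name)."""
    # The blob item name may itself contain slashes (virtual folders).
    container, sep, blob = context_path.partition("/")
    if not container or not sep or not blob:
        raise ValueError(f"expected <container-name>/<blob-item-name>: {context_path!r}")
    return container, blob
```

Splitting at the first slash keeps any folder prefix inside the object key or blob item name, which matches the folder examples listed above.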

Segmentation Fields

Max Segment Size (Characters): The maximum size, in characters, of each segment the document is split into.
Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments, used to fine-tune the similarity search.
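
To make the interaction between the two segmentation fields concrete, here is a minimal Python sketch of a fixed-size splitter with overlap. It is an illustration only: the splitting strategy the connector actually uses is not described in this section and may differ (for example, by respecting paragraph or sentence boundaries). The function name and parameters are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, max_segment_size: int, max_overlap_size: int) -> List[str]:
    """Split text into segments of at most max_segment_size characters.

    Consecutive segments share max_overlap_size characters.
    """
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if not 0 <= max_overlap_size < max_segment_size:
        raise ValueError("max_overlap_size must be >= 0 and smaller than max_segment_size")

    # Each new segment starts this many characters after the previous one.
    step = max_segment_size - max_overlap_size
    segments: List[str] = []
    start = 0
    while start < len(text):
        segments.append(text[start:start + max_segment_size])
        if start + max_segment_size >= len(text):
            break  # the last segment already reaches the end of the text
        start += step
    return segments
```

Under this simplified model, a larger overlap repeats more context at segment boundaries at the cost of producing more segments for the same document.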

XML Configuration

Below is the XML configuration for this operation:

<ms-vectors:document-load-single
  doc:name="[Document] Load single"
  doc:id="9d197b8b-6ea7-46b6-9ed2-bdc9d7ed3c4f"
  config-ref="MuleSoft_Vectors_Connector_Document_config"
  fileType="any"
  contextPath="#[payload.contextPath]"
  maxSegmentSizeInChar="3000"
  maxOverlapSizeInChars="300"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example

Here is an example of the JSON output.

{
    "text-segments": [
        {
            "metadata": {
                "index": "0",
                "source": "s3://ms-vectors/invoicesample.pdf",
                "file_type": "any",
                "file_name": "invoicesample.pdf"
            },
            "text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
        },
        ...
    ]
}

text-segments: The text segments of the document / file.
- list-item (text-segment):
  - text: The text of the segment.
  - metadata: The metadata key-value pairs.
    - index: The segment/chunk number within the uploaded data source.
    - absolute_directory_path: The full path to the file that contains the text segment.
    - file_name: The name of the file in which the text segment was found.
    - full_path: The full path to the file.
    - file_type: The file/source type.
    - source: The file path set by cloud storage services (e.g. Amazon S3).
    - url: The web page URL when processing file type url.
    - title: The web page title.
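
As an illustration of the payload structure described above, the following Python sketch walks the text-segments array and pulls out a few metadata keys. The sample dictionary is a shortened copy of the example output; the helper name iter_segments is hypothetical, and the sketch assumes that metadata keys other than index may be absent depending on the storage type.

```python
from typing import Any, Dict, Iterator, Tuple

# Shortened copy of the example payload (the text is truncated for readability).
SAMPLE: Dict[str, Any] = {
    "text-segments": [
        {
            "metadata": {
                "index": "0",
                "source": "s3://ms-vectors/invoicesample.pdf",
                "file_type": "any",
                "file_name": "invoicesample.pdf",
            },
            "text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000",
        }
    ]
}


def iter_segments(document: Dict[str, Any]) -> Iterator[Tuple[int, str, str]]:
    """Yield (index, file_name, text) for each entry in "text-segments"."""
    for segment in document.get("text-segments", []):
        metadata = segment.get("metadata", {})
        # "index" is a string in the example payload, so convert it to an int.
        yield int(metadata["index"]), metadata.get("file_name", ""), segment["text"]


for index, file_name, text in iter_segments(SAMPLE):
    print(index, file_name, len(text))
```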

Attributes

DocumentResponseAttributes:
- fileType: Contains the type of the document to be ingested into the embedding store.
- contextPath: Behaviour changes based on storage type.

Document | Load list

The [Document] Load list operation parses a list of documents and optionally splits them into text chunks based on the provided size.

How to Use

Add Folder to Store

The [Document] Load list operation can be followed by a Batch Job, For Each or Parallel For Each scope that contains an [Embedding] Generate from document operation. Each item of the output payload is ready to be used by the [Embedding] Generate from document operation without any transformation.

Input Fields

Module Configuration

This refers to the MuleSoft Vectors Storage Configuration set up in the Getting Started section.

Document Fields

File Type: The type of the documents to be ingested into the embedding store. The following file types are supported:
- any: Any type except text, url or crawl
- text: Any type of text file (json, xml, txt, csv, etc.)
- url: A single URL. Only one URL is supported.
- crawl: The file type created by the webcrawler connector.
Context Path: The expected value depends on the storage type.
- Local: The path to the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
- AZURE_BLOB: The container name and blob item name in the format <container-name>/<blob-item-name> (e.g. ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...)
- S3: The AWS S3 bucket and AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g. s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...)

Segmentation Fields

Max Segment Size (Characters): The maximum size, in characters, of each segment a document is split into.
Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments, used to fine-tune the similarity search.

XML Configuration

Below is the XML configuration for this operation:

<ms-vectors:document-load-list
  doc:name="[Document] Load list"
  doc:id="9d197b8b-6ea7-46b6-9ed2-bdc9d7ed3c4f"
  config-ref="MuleSoft_Vectors_Connector_Document_config"
  fileType="any"
  contextPath="#[payload.contextPath]"
  maxSegmentSizeInChar="3000"
  maxOverlapSizeInChars="300"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example

Here is an example of the JSON output.

[
  {
      "text-segments": [
          {
              "metadata": {
                  "index": "0",
                  "source": "s3://ms-vectors/invoicesample.pdf",
                  "file_type": "any",
                  "file_name": "invoicesample.pdf"
              },
              "text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
          },
          ...
      ]
  }
]

list-item (document):
- text-segments: The text segments of the document / file.
  - list-item (text-segment):
    - text: The text of the segment.
    - metadata: The metadata key-value pairs.
      - index: The segment/chunk number within the uploaded data source.
      - absolute_directory_path: The full path to the file that contains the text segment.
      - file_name: The name of the file in which the text segment was found.
      - full_path: The full path to the file.
      - file_type: The file/source type.
      - source: The file path set by cloud storage services (e.g. Amazon S3).
      - url: The web page URL when processing file type url.
      - title: The web page title.
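
The per-document iteration that a For Each scope performs over this list can be pictured with a short Python sketch. It is not Mule code: process_documents and the handle_document callback are hypothetical stand-ins, with the callback playing the role of the step that would pass one document to the [Embedding] Generate from document operation.

```python
from typing import Any, Callable, Dict, List


def process_documents(
    documents: List[Dict[str, Any]],
    handle_document: Callable[[Dict[str, Any]], None],
) -> Dict[str, int]:
    """Call handle_document once per document and return segment counts by file name."""
    counts: Dict[str, int] = {}
    for document in documents:
        segments = document.get("text-segments", [])
        if not segments:
            continue  # nothing to hand over for an empty document
        # Assumes all segments of one document share the same file_name.
        file_name = segments[0].get("metadata", {}).get("file_name", "<unknown>")
        handle_document(document)  # each list item is passed on unchanged
        counts[file_name] = counts.get(file_name, 0) + len(segments)
    return counts
```

Passing each list item on unchanged mirrors the statement that the output needs no transformation before the embedding step.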

Attributes

DocumentResponseAttributes:
- fileType: Contains the type of the document to be ingested into the embedding store.
- contextPath: Behaviour changes based on storage type.

Document | Load from payload

The [Document] Load from payload operation parses a document sent as payload in either base64 or binary format and optionally splits it into text chunks based on the provided size.

Input Fields

Payload Fields

Content: The document represented either in binary or base64 format.
Format: The content format used to represent the document. Available formats are:
- binary
- base64
File Type: The type of the document to be ingested into the embedding store. The following file types are supported:
- any: Any type except text, url or crawl
- text: Any type of text file (json, xml, txt, csv, etc.)
- url: A single URL. Only one URL is supported.
- crawl: The file type created by the webcrawler connector.
File Name: [Optional] The document file name.
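
For the base64 format, the caller has to encode the document bytes before sending them. A minimal Python sketch of that encoding step, using only the standard library, is shown below. The helper names are hypothetical, and how the encoded string is delivered to the Mule flow (for example, in an HTTP request body) is outside what this section specifies.

```python
import base64


def to_base64_content(raw: bytes) -> str:
    """Encode raw document bytes as a base64 string for the Content field."""
    return base64.b64encode(raw).decode("ascii")


def from_base64_content(content: str) -> bytes:
    """Decode a base64 Content value back to the original bytes."""
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(content, validate=True)


encoded = to_base64_content(b"hello")
print(encoded)
```

With the binary format, the raw bytes would be sent as they are and no encoding step is needed.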

Segmentation Fields

Max Segment Size (Characters): The maximum size, in characters, of each segment the document is split into.
Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments, used to fine-tune the similarity search.

XML Configuration

Below is the XML configuration for this operation:

<ms-vectors:document-load-from-payload
  doc:name="[Document] Load from payload"
  doc:id="44108a74-cd69-44df-afbf-74f59b33ae29"
  format="binary"
  fileType="any"
  fileName="example.pdf"
  maxSegmentSizeInChar="1500"
  maxOverlapSizeInChars="150"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example

Here is an example of the JSON output.

{
    "text-segments": [
        {
            "metadata": {
                "index": "0",
                "file_type": "any",
                "file_name": "invoicesample.pdf"
            },
            "text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
        },
        ...
    ]
}

text-segments: The text segments of the document / file.
- list-item (text-segment):
  - text: The text of the segment.
  - metadata: The metadata key-value pairs.
    - index: The segment/chunk number within the uploaded data source.
    - absolute_directory_path: The full path to the file that contains the text segment.
    - file_name: The name of the file in which the text segment was found.
    - full_path: The full path to the file.
    - file_type: The file/source type.
    - source: The file path set by cloud storage services (e.g. Amazon S3).
    - url: The web page URL when processing file type url.
    - title: The web page title.

Attributes

DocumentResponseAttributes:
- fileType: Contains the type of the document to be ingested into the embedding store.
- contextPath: Behaviour changes based on storage type.
