Document Operations

Supported Storage Options

Use the document operations to load a single document or a list of documents. Each document is first parsed and then optionally split into chunks of the provided size.

These operations are intended to be followed by an [Embedding] Generate from document operation. Their output payload can be passed to that operation without any transformation.

Document | Load single

The [Document] Load single operation parses a document and optionally splits it into text chunks based on the provided size.

Document Load Single

How to Use

Add Document to Store

The [Document] Load single operation should be followed by an [Embedding] Generate from document operation. The output payload is ready to be used by the [Embedding] Generate from document operation without any transformation.

Document Load Single Use Case

Input Fields

Module Configuration

This refers to the MuleSoft Vectors Document Configuration set up in the Getting Started section.

Document Fields

  • File Type: Contains the type of the document to be ingested into the embedding store. Currently, four file types are supported:

    • any: Any type except txt, url or crawl
    • text: Any type of text files (json, xml, txt, csv, etc.)
    • url: Only a single URL supported.
    • crawl: The file type created by the webcrawler connector.
  • Context Path: The location of the document to load. The expected format depends on the storage type.

    • Local: Contains the path for the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
    • AZURE_BLOB: Contains the container name and blob item name in the format <container-name>/<blob-item-name> (e.g., ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...)
    • S3: Contains the AWS S3 bucket and AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g., s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...)
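The three Context Path formats above can be illustrated with a small Python sketch. It is not part of the connector; the helper names (local_context_path, azure_blob_context_path, s3_context_path) are hypothetical and only show how a caller might assemble the string that is later passed in as contextPath.

```python
def local_context_path(base_dir: str, relative_path: str) -> str:
    """Join a base directory and a relative path with exactly one slash."""
    return base_dir.rstrip("/") + "/" + relative_path.lstrip("/")


def azure_blob_context_path(container: str, blob_name: str) -> str:
    """Build <container-name>/<blob-item-name>."""
    return f"{container}/{blob_name.lstrip('/')}"


def s3_context_path(bucket: str, object_key: str) -> str:
    """Build s3://<s3-bucket>/<s3-object-key>."""
    return f"s3://{bucket}/{object_key.lstrip('/')}"


if __name__ == "__main__":
    # Sample values taken from the examples in this section.
    print(azure_blob_context_path("ms-vectors-container", "folder/invoicesample.pdf"))
    print(s3_context_path("ms-vectors-bucket", "folder/setup.adoc"))
```

A value built this way could be placed in the incoming payload so that an expression such as #[payload.contextPath] resolves to it.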

Segmentation Fields

  • Max Segment Size (Characters): The maximum size, in characters, of each segment the document is split into.
  • Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments. Use it to fine-tune similarity search.
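To make the interaction between the two segmentation fields concrete, here is a minimal Python sketch of fixed-size splitting with overlap. It is an assumption-based illustration, not the connector's implementation: this page does not describe the actual splitting strategy, which may for example prefer paragraph or sentence boundaries. The function name split_with_overlap is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, max_segment_size: int, max_overlap_size: int) -> List[str]:
    """Naive sliding-window split: each segment has at most max_segment_size
    characters and repeats the last max_overlap_size characters of the
    previous segment."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if not 0 <= max_overlap_size < max_segment_size:
        raise ValueError("max_overlap_size must be >= 0 and smaller than max_segment_size")

    step = max_segment_size - max_overlap_size  # how far the window advances
    segments: List[str] = []
    start = 0
    while start < len(text):
        segments.append(text[start:start + max_segment_size])
        if start + max_segment_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return segments


if __name__ == "__main__":
    for segment in split_with_overlap("abcdefghij", max_segment_size=4, max_overlap_size=1):
        print(segment)
```

With a segment size of 4 and an overlap of 1, the 10-character sample yields "abcd", "defg", and "ghij": each segment starts with the last character of the previous one. A larger overlap keeps more context at segment boundaries at the cost of more segments to embed.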

XML Configuration

Below is the XML configuration for this operation:

<ms-vectors:document-load-single
  doc:name="[Document] Load single"
  doc:id="9d197b8b-6ea7-46b6-9ed2-bdc9d7ed3c4f"
  config-ref="MuleSoft_Vectors_Connector_Document_config"
  fileType="any"
  contextPath="#[payload.contextPath]"
  maxSegmentSizeInChar="3000"
  maxOverlapSizeInChars="300"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example

Here is an example of the JSON output.

{
    "text-segments": [
        {
            "metadata": {
                "index": "0",
                "source": "s3://ms-vectors/invoicesample.pdf",
                "file_type": "any",
                "file_name": "invoicesample.pdf"
            },
            "text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
        },
        ...
    ]
}
  • text-segments: The segments of the text of the document / file.
    • list-item (text-segment):
      • text: The text segment
      • metadata: The metadata key-value pairs.
        • index: The segment/chunk number for the uploaded data source.
        • absolute_directory_path: The absolute path of the directory that contains the file.
        • file_name: The name of the file in which the text segment was found.
        • full_path: The full path to the file.
        • file_type: The file/source type.
        • source: File path set by cloud storage services (e.g., Amazon S3)
        • url: Web page URL when processing file type url
        • title: Web page title
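As a rough illustration of the payload structure described above, the following Python sketch reads a payload shaped like the example and summarizes each segment. The sample text is shortened for brevity, and the function name summarize_segments is hypothetical. Optional metadata keys are read with .get because, judging by the example, not every key is present for every storage type.

```python
import json
from typing import Any, Dict, List

# Abbreviated sample shaped like the example output of this operation.
PAYLOAD_JSON = """
{
  "text-segments": [
    {
      "metadata": {
        "index": "0",
        "source": "s3://ms-vectors/invoicesample.pdf",
        "file_type": "any",
        "file_name": "invoicesample.pdf"
      },
      "text": "Invoice Number: #20130304"
    }
  ]
}
"""


def summarize_segments(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return index, file name, source, and text length for each segment."""
    summaries = []
    for segment in payload.get("text-segments", []):
        metadata = segment.get("metadata", {})
        summaries.append({
            "index": int(metadata["index"]),  # index is a string in the example
            "file_name": metadata.get("file_name"),
            "source": metadata.get("source"),
            "length": len(segment["text"]),
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_segments(json.loads(PAYLOAD_JSON)):
        print(summary)
```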

Attributes

  • DocumentResponseAttributes:
    • fileType: Contains the type of the document to be ingested into the embedding store.
    • contextPath: The context path the document was loaded from. Its format depends on the storage type.

Document | Load list

The [Document] Load list operation parses a list of documents and optionally splits them into text chunks based on the provided size.

Document Load List

How to Use

Add Folder to Store

The [Document] Load list operation can be followed by a Batch Job, For Each, or Parallel For Each scope that contains an [Embedding] Generate from document operation. Each item of the output payload is ready to be used by the [Embedding] Generate from document operation without any transformation.

Document Load List Use Case For Each

Input Fields

Module Configuration

This refers to the MuleSoft Vectors Document Configuration set up in the Getting Started section.

Document Fields

  • File Type: Contains the type of the documents to be ingested into the embedding store. Currently, four file types are supported:

    • any: Any type except txt, url or crawl
    • text: Any type of text files (json, xml, txt, csv, etc.)
    • url: Only a single URL supported.
    • crawl: The file type created by the webcrawler connector.
  • Context Path: The location of the documents to load. The expected format depends on the storage type.

    • Local: Contains the path for the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
    • AZURE_BLOB: Contains the container name and blob item name in the format <container-name>/<blob-item-name> (e.g., ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...)
    • S3: Contains the AWS S3 bucket and AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g., s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...)

Segmentation Fields

  • Max Segment Size (Characters): The maximum size, in characters, of each segment the documents are split into.
  • Max Overlap Size (Characters): The maximum number of characters shared between consecutive segments. Use it to fine-tune similarity search.

XML Configuration

Below is the XML configuration for this operation:

<ms-vectors:document-load-list
  doc:name="[Document] Load list"
  doc:id="9d197b8b-6ea7-46b6-9ed2-bdc9d7ed3c4f"
  config-ref="MuleSoft_Vectors_Connector_Document_config"
  fileType="any"
  contextPath="#[payload.contextPath]"
  maxSegmentSizeInChar="3000"
  maxOverlapSizeInChars="300"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example

Here is an example of the JSON output.

[
  {
      "text-segments": [
          {
              "metadata": {
                  "index": "0",
                  "source": "s3://ms-vectors/invoicesample.pdf",
                  "file_type": "any",
                  "file_name": "invoicesample.pdf"
              },
              "text": "Denny Gunawan\n\n221 Queen St\nMelbourne VIC 3000\n\n$39.60123 Somewhere St, Melbourne VIC 3000\n(03) 1234 5678\n\nInvoice Number: #20130304\n\nOrganic Items Price/kg Quantity(kg) Subtotal\n\nApple $5.00 1 $5.00\n\nOrange $1.99 2 $3.98\n\nWatermelon $1.69 3 $5.07\n\nMango $9.56 2 $19.12\n\nPeach $2.99 1 $2.99\n\nSubtotal..."
          },
          ...
      ]
  }
]
  • list-item (document):
    • text-segments: The segments of the text of the document / file.
      • list-item (text-segment):
        • text: The text segment
        • metadata: The metadata key-value pairs.
          • index: The segment/chunk number for the uploaded data source.
          • absolute_directory_path: The absolute path of the directory that contains the file.
          • file_name: The name of the file in which the text segment was found.
          • full_path: The full path to the file.
          • file_type: The file/source type.
          • source: File path set by cloud storage services (e.g., Amazon S3)
          • url: Web page URL when processing file type url
          • title: Web page title
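Because the Load list payload is an array of documents, each with its own text-segments array, a consumer often needs one flat sequence of segments. The Python sketch below shows one way to do that under the structure documented above. The function name flatten_segments and the sample data are hypothetical and abbreviated.

```python
from typing import Any, Dict, List, Optional, Tuple

# Abbreviated sample shaped like the Load list output: a list of documents.
DOCUMENTS: List[Dict[str, Any]] = [
    {"text-segments": [
        {"metadata": {"index": "0", "file_name": "invoicesample.pdf"}, "text": "first chunk"},
        {"metadata": {"index": "1", "file_name": "invoicesample.pdf"}, "text": "second chunk"},
    ]},
    {"text-segments": [
        {"metadata": {"index": "0", "file_name": "setup.adoc"}, "text": "only chunk"},
    ]},
]


def flatten_segments(documents: List[Dict[str, Any]]) -> List[Tuple[int, Optional[str], str]]:
    """Return (document position, file name, text) for every segment, in order."""
    flat = []
    for position, document in enumerate(documents):
        for segment in document.get("text-segments", []):
            file_name = segment.get("metadata", {}).get("file_name")
            flat.append((position, file_name, segment["text"]))
    return flat


if __name__ == "__main__":
    for position, file_name, text in flatten_segments(DOCUMENTS):
        print(position, file_name, text)
```

Keeping the document position next to each segment makes it possible to regroup segments per document later, for example when reporting which files were processed.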

Attributes

  • DocumentResponseAttributes:
    • fileType: Contains the type of the document to be ingested into the embedding store.
    • contextPath: The context path the documents were loaded from. Its format depends on the storage type.