Embedding Operations

Embedding | Add Document to Store

The Add Document to Store operation adds a document to an embedding store. The document is ingested into the configured external vector database.

Add Document to Store

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the store in the external Vector Database.
  • Context Path: Contains the full file path for the document to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/customer-service.pdf".
  • Max segment size: The maximum size, in characters, of the segments the document is split into.
  • Max overlap size: The number of characters that consecutive segments share, used to fine-tune the similarity search.

Context

  • File Type: Contains the type of the document to be ingested into the embedding store. Currently, the following file types are supported:
    • any:
    • crawl:
    • text: Any type of text file (json, xml, txt, csv, etc.)
    • url: Only a single URL is supported.

Storage

  • Storage type: Defines the type of storage from which data is loaded.
    • Local: Loads data from the application's local storage
    • AZURE_BLOB: Loads data from Azure Blob Storage
    • S3: Loads data from AWS S3 buckets

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-add-document-to-store
   doc:name="Embedding add document to store"
   doc:id="bad88ec3-4239-4f9d-9e99-7eba0456b799"
   config-ref="<YOUR_CONFIG>"
   fileType="any"
   storeName="mschainaicollection"
   contextPath="#[payload.filePath]"
   maxSegmentSizeInChars="3000"
   maxOverlapSizeInChars="150"
   embeddingModelName="text-embedding-3-small"/>

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "filePath": "/Users/tbolis/batch 1/docs-accelerators__accelerators-cim_1.3_modules_ROOT_pages_cim-setup.adoc",
    "storeName": "gettingstarted",
    "fileType": "any",
    "status": "updated"
}
  • filePath: The file path or URL of the ingested document.
  • storeName: The name of the vector store.
  • fileType: The file type selected on the operation.
  • status: The status of the operation.
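
As an illustrative sketch (not part of the connector), a DataWeave transform downstream of this operation could turn the payload into a simple ingestion report. The field names are the ones shown in the example output above; "updated" is the status value from that example.

%dw 2.0
output application/json
---
// Minimal ingestion report built from the operation's JSON output.
{
    document: payload.filePath,
    store: payload.storeName,
    succeeded: payload.status == "updated"
}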

Embedding | Add Folder to Store

The Add Folder to Store operation adds a complete folder of documents to an embedding store. The documents are ingested into the configured external vector database.

Add Folder to Store

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the store in the external Vector Database.
  • Folder Path: Contains the path to the folder whose documents are to be ingested into the embedding store. Ensure the folder path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
  • Max segment size: The maximum size, in characters, of the segments each document is split into.
  • Max overlap size: The number of characters that consecutive segments share, used to fine-tune the similarity search.

Context

  • File Type: Contains the type of the document to be ingested into the embedding store. Currently, the following file types are supported:
    • any:
    • crawl:
    • text: Any type of text file (json, xml, txt, csv, etc.)
    • url: Only a single URL is supported.

Storage

  • Storage type: Defines the type of storage from which data is loaded.
    • Local: Loads data from the application's local storage
    • AZURE_BLOB: Loads data from Azure Blob Storage
    • S3: Loads data from AWS S3 buckets

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-add-folder-to-store
  doc:name="Embedding add folder to store"
  doc:id="9b60357c-e8e5-42ee-a75e-054cf97bb674"
  config-ref="<YOUR_CONFIG>"
  storeName="mschainaicollection"
  folderPath="#[payload.folderPath]"
  maxSegmentSizeInChars="3000"
  maxOverlapSizeInChars="300"
  fileType="any"
  embeddingModelName="text-embedding-3-small"/>

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "folderPath": "/Users/amir.khan/Documents/Downloads/patch 8",
    "filesCount": 6,
    "storeName": "gettingstarted",
    "status": "updated"
}
  • folderPath: The folder path to the files to be ingested.
  • storeName: The name of the vector store.
  • filesCount: The number of files ingested.
  • status: The status of the operation.
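
As a rough sketch (again using only the field names shown in the example output above), a DataWeave transform after this operation could validate that at least one file was ingested before continuing the flow:

%dw 2.0
output application/json
---
{
    folder: payload.folderPath,
    filesIngested: payload.filesCount,
    // Flag empty folders so a downstream router or validator can act on them.
    isEmpty: payload.filesCount == 0,
    succeeded: payload.status == "updated"
}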

Embedding | Add Text to Store

The Add Text to Store operation adds text to an embedding store.

Embedding Add Text to Store

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the store in the external Vector Database.
  • Text: Contains text to be vectorised.

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-add-text-to-store
  doc:name="Embedding add text to store" 
  doc:id="9ab96d9c-51fd-4c3e-9669-3de4624fa2c0" 
  config-ref="<YOUR_CONFIG>" 
  storeName="mschainaicollection" 
  textToAdd="#[payload.text]"/>
 

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "textSegment": "TextSegment { text = \"The capital of Switzerland is Bern\" metadata = {} }",
    "textEmbedding": "Embedding { vector = [0.0064053377, .[..]., -0.01793744] }",
    "storeName": "gettingstarted",
    "status": "added"
}
  • textSegment: The text segment of the request.
  • textEmbedding: The vectorised embeddings for the text segment.
  • storeName: The name of the vector store.
  • status: The status of the operation.

Embedding | Generate Embeddings from Text

The Generate Embeddings from Text operation creates numeric vectors from text (without ingesting it into the vector database).

Generate Embeddings from Text

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Text: The text to generate embeddings for.

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-generate-from-text
  doc:name="Embedding generate from text"
  doc:id="eacf5862-2b9a-4602-b7d7-0358885986c3"
  config-ref="<YOUR_CONFIG>"
  textToAdd="#[payload.text]"/>

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "Embedding": "Embedding { vector = [0.0064053377, ..., -0.01793744] }",
    "Dimension": 1536,
    "Segment": "TextSegment { text = \"The capital of Switzerland is Bern\" metadata = {} }"
}
  • Embedding: The vector embeddings of the input Text.
  • Dimension: The dimension of the selected embedding model.
  • Segment: The text segment of the input Text.
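
For illustration only, a DataWeave transform could pick out the dimension and original segment from this payload, for example to verify that the vector size matches what the target store expects. The value 1536 below is taken from the example output, not a guaranteed default; adjust it for the embedding model you use.

%dw 2.0
output application/json
// Assumed expected dimension, based on the example output above.
var expectedDimension = 1536
---
{
    segment: payload.Segment,
    dimension: payload.Dimension,
    matchesStore: payload.Dimension == expectedDimension
}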

Embedding | Query from Store

The Query from Store operation retrieves information from an embedding store based on a plain-text prompt.

Query from store

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the vector collection in the Vector database.
  • Question: The prompt to query the embedding store with; the retrieved content is typically sent to an LLM together with this question.
  • Max results: The maximum number of results to return (default: 3).
  • Min Score: The minimum score for the similarity search (0 - 1; default: 0.8).

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-query-from-store
  doc:name="Embedding query from store"
  doc:id="f245c837-cb2c-4807-b0a3-4ca4ad0d522b"
  config-ref="MAC_Vectors_Config"
  storeName="mulechaindemo"
  question="#[payload.question]"
  maxResults="5"
  minScore="0.5"
  embeddingModelName="text-embedding-3-small"/>

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "question": "Tell me more about Cloudhub High Availability Feature",
    "sources": [
        {
            "embeddingId": "",
            "text": "= CloudHub High Availability Features\nifndef::env-site,env-github[]\ninclude::_attributes.adoc[]\nendif::[]\n:page-aliases: runtime-manager::cloudhub-fabric.adoc,\....\n\n== Worker Scale-out",
            "score": 0.9282029356714594,
            "metadata": {
                "source_Id": "",
                "index": "0",
                "absolute_directory_path": "/Users/<user>/Documents/Downloads/patch 8",
                "file_name": "docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
                "full_path": "/Users/<user>/Documents/Downloads/patch 8docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
                "file_type": "any",
                "ingestion_datetime": ""
            }
        },
        {
          ...
        },
        {
          ...
        }
    ],
    "response": "= CloudHub High Availability Features\.. (...) \..distributes HTTP requests among your assigned workers.\n. Persistent message queues (see below)",
    "maxResults": 3,
    "storeName": "gettingstarted",
    "minimumScore": 0.7
}
  • question: The question of the request.
  • sources: The sources identified by the similarity search.
    • embeddingId: The embedding UUID.
    • text: The relevant text segment.
    • score: The score of the similarity search based on the question.
    • metadata: The metadata key-value pairs.
      • source_id: The source UUID.
      • index: The segment/chunk number for the uploaded data source.
      • absolute_directory_path: The full path to the file which contains relevant text segment.
      • file_name: The name of the file, where the text segment was found.
      • full_path: The full path to the file.
      • file_type: The file/source type.
      • ingestion_datetime: The source ingestion datetime.
  • response: The collected response of all relevant text segments. This is the response that is sent to the LLM.
  • maxResults: The maximum number of text segments considered.
  • storeName: The name of the vector store.
  • minimumScore: The minimum score for the result.
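
As a hedged example of post-processing this payload (field names taken from the example output above, and not part of the connector itself), a DataWeave transform could rank the sources by score and assemble a compact context block for a follow-up LLM prompt:

%dw 2.0
output application/json
---
{
    question: payload.question,
    // Highest-scoring sources first.
    rankedSources: (payload.sources orderBy -($.score)) map {
        file: $.metadata.file_name,
        score: $.score
    },
    // Join the retrieved segments into one context string for the LLM prompt.
    context: (payload.sources map $.text) joinBy "\n\n"
}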

Example Use Cases

This operation can be particularly useful in scenarios such as:

  • Knowledge Management Systems: Adding new documents to an organizational knowledge base.
  • Customer Support: Storing customer interaction documents for quick retrieval and analysis.
  • Content Management: Ingesting various types of documents (text, PDF, URL) into a centralized repository for easy access and searchability.

Embedding | Query from Store with Filter

The Query from Store with Filter operation retrieves information from an embedding store based on a filter and a plain-text prompt.

Query from Store with Filter

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the vector collection in the Vector database.
  • Question: The prompt to query the embedding store with; the retrieved content is typically sent to an LLM together with this question.
  • Max results: The maximum number of results to return (default: 3).
  • Min Score: The minimum score for the similarity search (0 - 1; default: 0.8).

Filter

  • Metadata key: The metadata key used for filtering results.
  • Filter method: The conditional operator to use for filtering.
  • Metadata value: The metadata value to evaluate.

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-query-from-store-with-filter
  doc:name="Embedding query from store with filter"
  doc:id="f245c837-cb2c-4807-b0a3-4ca4ad0d522b"
  config-ref="MAC_Vectors_Config"
  storeName="mulechaindemo"
  question="#[payload.question]"
  maxResults="5"
  minScore="0.5"
  embeddingModelName="text-embedding-3-small"
  metadataKey='filename'
  filterMethod='isEqualTo'
  metadataValue='sample.pdf' />

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "question": "Tell me more about Cloudhub High Availability Feature",
    "sources": [
        {
            "embeddingId": "",
            "text": "= CloudHub High Availability Features\nifndef::env-site,env-github[]\ninclude::_attributes.adoc[]\nendif::[]\n:page-aliases: runtime-manager::cloudhub-fabric.adoc,\....\n\n== Worker Scale-out",
            "score": 0.9282029356714594,
            "metadata": {
                "source_Id": "",
                "index": "0"
                "absolute_directory_path": "/Users/<user>/Documents/Downloads/patch 8",
                "file_name": "docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
                "full_path": "/Users/<user>/Documents/Downloads/patch 8docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
                "file_type": "any",
                "ingestion_datetime": ""
            }
        },
        {
          ...
        },
        {
          ...
        }
    ],
    "filter": {
      "metadataKey": "file_name",
      "filterMethod": "isEqualTo",
      "metadataValue": "docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc"
    },
    "response": "= CloudHub High Availability Features\.. (...) \..distributes HTTP requests among your assigned workers.\n. Persistent message queues (see below)",
    "maxResults": 3,
    "storeName": "gettingstarted",
    "minimumScore": 0.7
}
  • question: The question of the request.
  • sources: The sources identified by the similarity search.
    • embeddingId: The embedding UUID.
    • text: The relevant text segment.
    • score: The score of the similarity search based on the question.
    • metadata: The metadata key-value pairs.
      • source_id: The UUID for the uploaded data source.
      • index: The segment/chunk number for the uploaded data source.
      • absolute_directory_path: The full path to the file which contains relevant text segment.
      • file_name: The name of the file, where the text segment was found.
      • full_path: The full path to the file.
      • file_type: The file type.
      • ingestion_datetime: The source ingestion datetime.
  • filter: The filter applied to the query.
    • metadataKey: The metadata key used for filtering results.
    • filterMethod: The conditional operator to use for filtering.
    • metadataValue: The metadata value to evaluate.
  • response: The collected response of all relevant text segments. This is the response that is sent to the LLM.
  • maxResults: The maximum number of text segments considered.
  • storeName: The name of the vector store.
  • minimumScore: The minimum score for the result.
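
Along the same lines, and purely as a sketch, a DataWeave transform could echo the applied filter and keep only the sources above a stricter score threshold. The threshold of 0.9 is an arbitrary illustration value, not a connector default.

%dw 2.0
output application/json
// Illustrative cut-off; tune per use case.
var scoreThreshold = 0.9
---
{
    appliedFilter: payload.filter,
    strongMatches: payload.sources filter ($.score >= scoreThreshold) map {
        file: $.metadata.file_name,
        score: $.score,
        text: $.text
    }
}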

Embedding | List Sources

The List Sources operation lists all sources in an embedding store.

List Sources

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the vector collection in the Vector database.

Querying Strategy

  • Embedding Page Size: Page size to use when querying the store.

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-list-sources
  doc:name="Embedding list sources"
  doc:id="dcd57b22-914d-44a8-96f3-2c916e996393"
  storeName="mulechaindemo"
  embeddingModelName="text-embedding-3-small"
  embeddingPageSize="5000"
  config-ref="MAC_Vectors_Config"/>

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "sourceCount": 3,
    "sources": [
        {
            "absolute_directory_path": "/Users/tbolis/Downloads/RFP Docs/batch 1",
            "file_name": "docs-accelerators__financial-services_1.11_modules_ROOT_pages_prerequisites.adoc",
            "source_id": "d6d2e426-8da6-4454-a723-202e1bfb1114",
            "full_path": "/Users/tbolis/Downloads/RFP Docs/batch 1/docs-accelerators__financial-services_1.11_modules_ROOT_pages_prerequisites.adoc",
            "segmentCount": 1,
            "ingestion_datetime": "2024-11-12T14:28:04.189Z"
        },
        {
            "absolute_directory_path": "/Users/tbolis/Downloads/RFP Docs/batch 1",
            "file_name": "docs-accelerators__healthcare_2.20_modules_ROOT_pages_fhir-r4-us-core-profiles.adoc",
            "source_id": "37789839-7685-46b5-bc39-6f47db3e2921",
            "full_path": "/Users/tbolis/Downloads/RFP Docs/batch 1/docs-accelerators__healthcare_2.20_modules_ROOT_pages_fhir-r4-us-core-profiles.adoc",
            "segmentCount": 3,
            "ingestion_datetime": "2024-11-12T14:28:17.274Z"
        },
        {
          ...
        },
        {
          ...
        }
    ],
    "storeName": "gettingstarted"
}
  • sourceCount: The number of sources within the embedding store.
  • sources: The list of sources within the embedding store.
    • absolute_directory_path: The full path to the file which contains relevant text segment.
    • file_name: The name of the file, where the text segment was found.
    • source_id: The source UUID.
    • full_path: The full path to the file.
    • segmentCount: The number of segments/chunks the source is split into.
    • ingestion_datetime: The source ingestion datetime.
  • storeName: The name of the vector store.
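
As an illustrative follow-up (using the field names from the example output above), a DataWeave transform could summarize the store contents:

%dw 2.0
output application/json
---
{
    store: payload.storeName,
    sourceCount: payload.sourceCount,
    // Total number of segments across all sources.
    totalSegments: sum(payload.sources map $.segmentCount),
    files: payload.sources map $.file_name
}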

Embedding | Remove from Store by Filter

The Remove from Store by Filter operation removes all embeddings matching the filter from the store.

Remove from Store by Filter

Input Fields

Module Configuration

This refers to the MAC Vectors Configuration set up in the Getting Started section.

General

  • Store Name: The name of the vector collection in the Vector database.

Filter

  • Metadata key: The metadata key used for filtering results.
  • Filter method: The conditional operator to use for filtering.
  • Metadata value: The metadata value to evaluate.

Additional Properties

  • Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).

XML Configuration

Below is the XML configuration for this operation:

<vectors:embedding-remove-from-store-by-filter
  doc:name="Embedding remove documents by filter"
  doc:id="c6b9ec97-1224-445e-ab02-f598d6fff7d7"
  config-ref="MAC_Vectors_Config"
  storeName="mulechaindemo"
  metadataKey="file_name"
  filterMethod="isEqualTo"
  metadataValue="docs-accelerators__accelerators-cim_1.3_modules_ROOT_pages_cim-setup.adoc"
  embeddingModelName="text-embedding-3-small"/>

Output Field

This operation responds with a JSON payload.

Example Output

This output has been converted to JSON.

{
    "filter": {
        "filterMethod": "isEqualTo",
        "metadataKey": "file_name",
        "metadataValue": "docs-accelerators__accelerators-cim_1.3_modules_ROOT_pages_cim-setup.adoc"
    },
    "storeName": "gettingstarted",
    "status": "deleted"
}
  • filter: The filter used to identify the embeddings to delete.
    • metadataKey: The metadata key used for filtering results.
    • filterMethod: The conditional operator to use for filtering.
    • metadataValue: The metadata value to evaluate.
  • storeName: The name of the vector store.
  • status: The operation status.
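
Finally, as a minimal sketch based on the example output above ("deleted" is the status value shown there), a DataWeave transform downstream could confirm the deletion and surface which filter was applied:

%dw 2.0
output application/json
---
{
    store: payload.storeName,
    // Human-readable summary of the filter that drove the deletion.
    removedBy: payload.filter.metadataKey ++ " " ++ payload.filter.filterMethod ++ " " ++ payload.filter.metadataValue,
    succeeded: payload.status == "deleted"
}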