Embedding Operations
Embedding | Generate Embeddings from Text
The Generate Embeddings from Text operation creates numeric vectors from a text, without ingesting them into the vector database.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Text: The text to generate embeddings for.
Segmentation Fields
- Max Segment Size (Characters): The maximum size, in characters, of each segment the text is split into.
- Max Overlap Size (Characters): The maximum number of characters shared by consecutive segments, used to fine-tune the similarity search.
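To make the interaction between the two segmentation fields concrete, here is a minimal Python sketch of character-based splitting with overlap. It is an illustration only: the splitter the connector actually uses is not described on this page and may, for example, prefer sentence or paragraph boundaries. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, max_segment_size: int, max_overlap_size: int) -> list[str]:
    """Split text into segments of at most max_segment_size characters.

    Each segment after the first starts max_overlap_size characters before
    the end of the previous one. Hypothetical helper for illustration; it is
    not the connector's own splitter.
    """
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if not 0 <= max_overlap_size < max_segment_size:
        raise ValueError("max_overlap_size must be >= 0 and smaller than max_segment_size")

    step = max_segment_size - max_overlap_size  # how far each new segment advances
    segments = []
    start = 0
    while start < len(text):
        segments.append(text[start:start + max_segment_size])
        if start + max_segment_size >= len(text):
            break  # the last segment already reaches the end of the text
        start += step
    return segments


# Same values as the XML sample: segments of 3000 characters, 300 shared.
segments = split_with_overlap("x" * 7000, 3000, 300)
print([len(s) for s in segments])
```

Under these assumptions, a larger overlap keeps more context around segment boundaries at the cost of producing more segments for the same text.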
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-generate-from-text
    doc:name="Embedding generate from text"
    doc:id="f59dafce-4af1-4ca1-b0e2-38f2945290fd"
    config-ref="<YOUR_CONFIG>"
    embeddingModelName="text-embedding-ada-002"
    text="#[payload.text]"
    maxSegmentSizeInChar="3000"
    maxOverlapSizeInChars="300"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "segments": [
    {
      "index": 0,
      "text": "In the modern world, technological advancements have become .",
      "embedding": "[-0.00683132, -0.0033572172, 0.02698761, -0.01291587, ...]"
    },
    {
      "index": 1,
      "text": "E-commerce giants like Amazon and Alibaba have redefined ..",
      "embedding": "[-0.0047172513, -0.03481483, 0.02046227, -0.037395656, ...]"
    }
  ],
  "dimensions": 1536
}
- segments: The list of segments.
  - index: The index of the segment.
  - text: The text of the segment.
  - embedding: The embedding generated from the text segment.
- dimensions: The number of dimensions of the selected embedding model.
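As an illustration of how a consumer might read this payload, the following Python sketch converts each embedding into a list of floats and checks its length against dimensions. It assumes, based only on the example above, that an embedding may arrive as a string containing a JSON array; the function name parse_segments is hypothetical.

```python
import json


def parse_segments(payload: dict) -> list[dict]:
    """Return the segments with their embeddings as lists of floats.

    Assumption: an embedding is either a JSON array or a string that
    contains one, as in the example payload on this page.
    """
    parsed = []
    for segment in payload["segments"]:
        raw = segment["embedding"]
        vector = json.loads(raw) if isinstance(raw, str) else list(raw)
        if len(vector) != payload["dimensions"]:
            raise ValueError(
                f"segment {segment['index']}: expected {payload['dimensions']} "
                f"values, got {len(vector)}"
            )
        parsed.append({"index": segment["index"], "text": segment["text"], "embedding": vector})
    return parsed


# Tiny made-up payload with 3 dimensions, for illustration only.
sample = {
    "segments": [{"index": 0, "text": "hello", "embedding": "[0.1, -0.2, 0.3]"}],
    "dimensions": 3,
}
print(parse_segments(sample)[0]["embedding"])
```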
Attributes
Example
Embedding | Add Text to Store
The Add Text to Store operation adds a text to an embedding store.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Text: Contains text to be vectorised.
- Store Name: The name of the collection in the external Vector Database.
Segmentation Fields
- Max Segment Size (Characters): The maximum size, in characters, of each segment the text is split into.
- Max Overlap Size (Characters): The maximum number of characters shared by consecutive segments, used to fine-tune the similarity search.
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-add-text-to-store
    doc:name="Embedding add text to store"
    doc:id="26aa6893-5ce8-4a69-8a04-131cb3890fd2"
    config-ref="<YOUR_CONFIG>"
    storeName="msvectorscollection"
    text="#[payload.text]"
    embeddingModelName="text-embedding-ada-002"
    maxSegmentSizeInChar="3000"
    maxOverlapSizeInChars="300"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "storeName": "gettingstarted",
  "status": "updated"
}
- textSegment: The text segment of the request.
- textEmbedding: The vectorised embeddings for the text segment.
- storeName: The name of the vector store.
- status: The status of the operation.
Attributes
Example
Embedding | Add Document to Store
The Add Document to Store operation adds a document to an embedding store. The document is ingested into the external vector database.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Store Name: The name of the collection in the external Vector Database.
- Storage (Override Module Configuration): Based on the selected storage option, you are presented with the related required parameters.
  - None: Select this when there is no need to define or override the storage configuration at the operation level. Note: when no storage configuration is defined at either the module or the operation level, the connector behaves as per the Local configuration.
  - Expression or Bean reference: Allows you to define the storage using a DataWeave expression. This is particularly helpful when the storage needs to be defined dynamically. More details on how to do it are available here
  - AWS S3: Allows you to load data from AWS S3 buckets.
  - Azure Blob: Allows you to load data from Azure Blob Storage.
  - Local: Allows you to load data from the application's local storage.
Document Fields
- File Type: The type of the document to be ingested into the embedding store. The following file types are supported:
  - any: Any type except text, url, or crawl.
  - text: Any type of text file (json, xml, txt, csv, etc.).
  - url: Only a single URL is supported.
  - crawl: The file type created by the webcrawler connector.
- Context Path: Behaviour changes based on storage type.
  - Local: Contains the path for the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
  - AZURE_BLOB: Contains the container name and blob item name in the format <container-name>/<blob-item-name> (e.g., ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...).
  - S3: Contains the AWS S3 bucket and the AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g., s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...).
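The context path formats listed above can be checked before calling the operation. The following Python sketch splits a context path into its parts for each storage type. It is a hedged illustration of the documented formats, not the connector's own parsing logic; the function name parse_context_path and the returned dictionary keys are hypothetical.

```python
def parse_context_path(storage: str, context_path: str) -> dict:
    """Split a context path according to the documented format for a storage type.

    Hypothetical helper: storage uses the labels from this page
    ("Local", "AZURE_BLOB", "S3"); the returned keys are illustrative.
    """
    if storage == "S3":
        prefix = "s3://"
        if not context_path.startswith(prefix):
            raise ValueError("S3 context path must start with s3://")
        bucket, _, key = context_path[len(prefix):].partition("/")
        if not bucket or not key:
            raise ValueError("expected s3://<s3-bucket>/<s3-object-key>")
        return {"bucket": bucket, "key": key}
    if storage == "AZURE_BLOB":
        container, _, blob = context_path.partition("/")
        if not container or not blob:
            raise ValueError("expected <container-name>/<blob-item-name>")
        return {"container": container, "blob": blob}
    if storage == "Local":
        return {"path": context_path}
    raise ValueError(f"unknown storage type: {storage}")


print(parse_context_path("S3", "s3://ms-vectors-bucket/folder/setup.adoc"))
print(parse_context_path("AZURE_BLOB", "ms-vectors-container/folder/invoicesample.pdf"))
```

Only the first slash separates the bucket or container from the rest, so nested folders stay part of the object key or blob item name.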
Segmentation Fields
- Max Segment Size (Characters): The maximum size, in characters, of each segment the document is split into.
- Max Overlap Size (Characters): The maximum number of characters shared by consecutive segments, used to fine-tune the similarity search.
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-add-document-to-store
    doc:name="Embedding add document to store"
    doc:id="bad88ec3-4239-4f9d-9e99-7eba0456b799"
    config-ref="<YOUR_CONFIG>"
    storeName="#[payload.storeName]"
    fileType="#[payload.fileType]"
    contextPath="#[payload.contextPath]"
    maxSegmentSizeInChar="3000"
    maxOverlapSizeInChars="300"
    embeddingModelName="text-embedding-3-small"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "storeName": "gettingstarted",
  "status": "updated"
}
- storeName: The name of the vector store.
- status: The status of the operation.
Attributes
Example
Embedding | Add Folder to Store
The Add Folder to Store operation adds a complete folder to an embedding store. The documents are ingested into the external vector database.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Store Name: The name of the collection in the external Vector Database.
- Storage (Override Module Configuration): Based on the selected storage option, you are presented with the related required parameters.
  - None: Select this when there is no need to define or override the storage configuration at the operation level. Note: when no storage configuration is defined at either the module or the operation level, the connector behaves as per the Local configuration.
  - Expression or Bean reference: Allows you to define the storage using a DataWeave expression. This is particularly helpful when the storage needs to be defined dynamically. More details on how to do it are available here
  - AWS S3: Allows you to load data from AWS S3 buckets.
  - Azure Blob: Allows you to load data from Azure Blob Storage.
  - Local: Allows you to load data from the application's local storage.
Document Fields
- File Type: The type of the documents to be ingested into the embedding store. The following file types are supported:
  - any: Any type except text, url, or crawl.
  - text: Any type of text file (json, xml, txt, csv, etc.).
  - url: Only a single URL is supported.
  - crawl: The file type created by the webcrawler connector.
- Context Path: Behaviour changes based on storage type.
  - Local: Contains the path for the documents to be ingested into the embedding store. Ensure the file path is accessible. You can also use a DataWeave expression for this field, e.g., mule.home ++ "/apps/" ++ app.name ++ "/".
  - AZURE_BLOB: Contains the container name and blob item name in the format <container-name>/<blob-item-name> (e.g., ms-vectors-container/invoicesample.pdf, ms-vectors-container/folder/invoicesample.pdf, ...).
  - S3: Contains the AWS S3 bucket and the AWS S3 object key in the format s3://<s3-bucket>/<s3-object-key> (e.g., s3://ms-vectors-bucket/setup.adoc, s3://ms-vectors-bucket/folder/setup.adoc, ...).
Segmentation Fields
- Max Segment Size (Characters): The maximum size, in characters, of each segment a document is split into.
- Max Overlap Size (Characters): The maximum number of characters shared by consecutive segments, used to fine-tune the similarity search.
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-add-folder-to-store
    doc:name="Embedding add folder to store"
    doc:id="9b60357c-e8e5-42ee-a75e-054cf97bb674"
    config-ref="<YOUR_CONFIG>"
    storeName="#[payload.storeName]"
    fileType="#[payload.fileType]"
    maxSegmentSizeInChars="3000"
    maxOverlapSizeInChars="300"
    embeddingModelName="text-embedding-3-small">
    <ms-vectors:storage>
        <ms-vectors:aws-s3
            awsRegion="${s3.awsDefaultRegion}"
            awsAccessKeyId="${s3.awsAccessKeyId}"
            awsSecretAccessKey="${s3.awsSecretAccessKey}" />
    </ms-vectors:storage>
</ms-vectors:embedding-add-folder-to-store>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "storeName": "gettingstarted",
  "status": "updated"
}
- folderPath: The folder path to the files to be ingested.
- storeName: The name of the vector store.
- filesCount: The number of files ingested.
- status: The status of the operation.
Attributes
Example
Embedding | Query from Store
The Query from Store operation retrieves information from an embedding store based on a plain-text prompt.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Store Name: The name of the vector collection in the Vector database.
- Question: The plain-text prompt used to search the embedding store.
- Max results: The maximum number of results to return (default is 3).
- Min Score: The minimum score for the similarity search, between 0 and 1 (default is 0.8).
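To show how Max results and Min Score shape a result set, here is a minimal Python sketch of a similarity search over in-memory vectors. It rests on assumptions that this page does not confirm: the score is taken to be cosine similarity rescaled from [-1, 1] to [0, 1], and the names cosine_similarity and query_store are hypothetical. The actual scoring depends on the configured vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_store(question_vector, stored, max_results=3, min_score=0.8):
    """Return up to max_results entries whose score is at least min_score.

    stored is a list of (text, vector) pairs. The score is assumed to be
    cosine similarity mapped onto the 0..1 range.
    """
    scored = []
    for text, vector in stored:
        score = (cosine_similarity(question_vector, vector) + 1) / 2
        if score >= min_score:  # drop weak matches first
            scored.append({"text": text, "score": score})
    scored.sort(key=lambda entry: entry["score"], reverse=True)  # best match first
    return scored[:max_results]  # then cap the number of results


# Made-up two-dimensional vectors, for illustration only.
stored = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [-1.0, 0.0]), ("d", [1.0, 1.0])]
print([entry["text"] for entry in query_store([1.0, 0.0], stored)])
```

In this sketch, raising the minimum score narrows the candidates before the result cap is applied, so a query can return fewer entries than Max results, or none at all.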
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-query-from-store
    doc:name="Embedding query from store"
    doc:id="f245c837-cb2c-4807-b0a3-4ca4ad0d522b"
    config-ref="<YOUR_CONFIG>"
    storeName="mulechaindemo"
    question="#[payload.question]"
    maxResults="5"
    minScore="0.5"
    embeddingModelName="text-embedding-3-small"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "question": "Tell me more about Cloudhub High Availability Feature",
  "sources": [
    {
      "embeddingId": "",
      "text": "= CloudHub High Availability Features\nifndef::env-site,env-github[]\ninclude::_attributes.adoc[]\nendif::[]\n:page-aliases: runtime-manager::cloudhub-fabric.adoc,\....\n\n== Worker Scale-out",
      "score": 0.9282029356714594,
      "metadata": {
        "source_id": "c426a871-1a6e-4a47-a8ab-027eec9303e1",
        "index": "0",
        "absolute_directory_path": "/Users/<user>/Documents/Downloads/patch 8",
        "file_name": "docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
        "full_path": "/Users/<user>/Documents/Downloads/patch 8docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
        "file_type": "any",
        "ingestion_datetime": "2024-11-20T20:34:41.691Z",
        "ingestion_timestamp": "1732134881691"
      }
    },
    {
      ...
    },
    {
      ...
    }
  ],
  "response": "= CloudHub High Availability Features\.. (...) \..distributes HTTP requests among your assigned workers.\n. Persistent message queues (see below)",
  "maxResults": 3,
  "storeName": "gettingstarted",
  "minScore": 0.7
}
- question: The question of the request.
- sources: The sources identified by the similarity search.
  - embeddingId: The embedding UUID.
  - text: The relevant text segment.
  - score: The score of the similarity search based on the question.
  - metadata: The metadata key-value pairs.
    - source_id: The source UUID.
    - index: The segment/chunk number for the uploaded data source.
    - absolute_directory_path: The full path to the directory of the file that contains the relevant text segment.
    - file_name: The name of the file in which the text segment was found.
    - full_path: The full path to the file.
    - file_type: The file/source type.
    - ingestion_datetime: The ingestion date and time in ISO 8601 format (UTC).
    - ingestion_timestamp: The ingestion time in milliseconds.
- response: The collected response of all relevant text segments. This is the response that is sent to the LLM.
- maxResults: The maximum number of text segments considered.
- storeName: The name of the vector store.
- minScore: The minimum score for the result.
Attributes
Example
Example Use Cases
This operation can be particularly useful in scenarios such as:
- Knowledge Management Systems: Adding new documents to an organizational knowledge base.
- Customer Support: Storing customer interaction documents for quick retrieval and analysis.
- Content Management: Ingesting various types of documents (text, PDF, URL) into a centralized repository for easy access and searchability.
Embedding | Query from Store with Filter
The Query from Store with Filter operation retrieves information from an embedding store based on a metadata filter and a plain-text prompt.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Store Name: The name of the vector collection in the Vector database.
- Question: The plain-text prompt used to search the embedding store.
- Max results: The maximum number of results to return (default is 3).
- Min Score: The minimum score for the similarity search, between 0 and 1 (default is 0.8).
Filter
- Metadata key: The metadata key used for filtering results.
- Filter method: The conditional operator to use for filtering.
- Metadata value: The metadata value to evaluate.
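The three filter fields can be read as a single predicate applied to each entry's metadata. The Python sketch below illustrates that reading with isEqualTo, the only filter method that appears on this page; other methods may exist but are not listed here. The names FILTER_METHODS and filter_sources are hypothetical, and the connector may well delegate filtering to the vector store rather than filter in memory.

```python
import operator

# Only isEqualTo appears on this page; further methods are not assumed here.
FILTER_METHODS = {"isEqualTo": operator.eq}


def filter_sources(sources, metadata_key, filter_method, metadata_value):
    """Keep the sources whose metadata[metadata_key] satisfies the filter.

    Hypothetical in-memory illustration of the metadata key, filter method
    and metadata value fields.
    """
    try:
        compare = FILTER_METHODS[filter_method]
    except KeyError:
        raise ValueError(f"unsupported filter method: {filter_method}") from None
    return [
        source
        for source in sources
        if metadata_key in source["metadata"]
        and compare(source["metadata"][metadata_key], metadata_value)
    ]


# Made-up sources, for illustration only.
sources = [
    {"text": "first", "metadata": {"file_name": "sample.pdf"}},
    {"text": "second", "metadata": {"file_name": "other.pdf"}},
]
print(filter_sources(sources, "file_name", "isEqualTo", "sample.pdf"))
```

Entries that lack the metadata key are treated as non-matching in this sketch.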
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-query-from-store-with-filter
    doc:name="Embedding query from store with filter"
    doc:id="f245c837-cb2c-4807-b0a3-4ca4ad0d522b"
    config-ref="<YOUR_CONFIG>"
    storeName="mulechaindemo"
    question="#[payload.question]"
    maxResults="5"
    minScore="0.5"
    metadataKey="filename"
    filterMethod="isEqualTo"
    metadataValue="sample.pdf"
    embeddingModelName="text-embedding-3-small" />
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "question": "Tell me more about Cloudhub High Availability Feature",
  "sources": [
    {
      "embeddingId": "",
      "text": "= CloudHub High Availability Features\nifndef::env-site,env-github[]\ninclude::_attributes.adoc[]\nendif::[]\n:page-aliases: runtime-manager::cloudhub-fabric.adoc,\....\n\n== Worker Scale-out",
      "score": 0.9282029356714594,
      "metadata": {
        "source_id": "c426a871-1a6e-4a47-a8ab-027eec9303e1",
        "index": "0",
        "absolute_directory_path": "/Users/<user>/Documents/Downloads/patch 8",
        "file_name": "docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
        "full_path": "/Users/<user>/Documents/Downloads/patch 8docs-runtime-manager__cloudhub_modules_ROOT_pages_cloudhub-fabric.adoc",
        "file_type": "any",
        "ingestion_datetime": "2024-11-20T20:34:41.691Z",
        "ingestion_timestamp": "1732134881691"
      }
    },
    {
      ...
    },
    {
      ...
    }
  ],
  "response": "= CloudHub High Availability Features\.. (...) \..distributes HTTP requests among your assigned workers.\n. Persistent message queues (see below)",
  "maxResults": 3,
  "storeName": "gettingstarted",
  "minimumScore": 0.7
}
- question: The question of the request.
- sources: The sources identified by the similarity search.
  - embeddingId: The embedding UUID.
  - text: The relevant text segment.
  - score: The score of the similarity search based on the question.
  - metadata: The metadata key-value pairs.
    - source_id: The UUID for the uploaded data source.
    - index: The segment/chunk number for the uploaded data source.
    - absolute_directory_path: The full path to the directory of the file that contains the relevant text segment.
    - file_name: The name of the file in which the text segment was found.
    - full_path: The full path to the file.
    - file_type: The file type.
    - ingestion_datetime: The ingestion date and time in ISO 8601 format (UTC).
    - ingestion_timestamp: The ingestion time in milliseconds.
- response: The collected response of all relevant text segments. This is the response that is sent to the LLM.
- maxResults: The maximum number of text segments considered.
- storeName: The name of the vector store.
- minimumScore: The minimum score for the result.
Attributes
Example Output Attributes
Example Use Cases
This operation can be particularly useful in scenarios such as:
- Knowledge Management Systems: Adding new documents to an organizational knowledge base.
- Customer Support: Storing customer interaction documents for quick retrieval and analysis.
- Content Management: Ingesting various types of documents (text, PDF, URL) into a centralized repository for easy access and searchability.
Embedding | List Sources
The List Sources operation lists all sources in an embedding store.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Store Name: The name of the vector collection in the Vector database.
Querying Strategy
- Embedding Page Size: Page size to use when querying the store.
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-list-sources
    doc:name="Embedding list sources"
    doc:id="dcd57b22-914d-44a8-96f3-2c916e996393"
    config-ref="<YOUR_CONFIG>"
    storeName="mulechaindemo"
    embeddingPageSize="5000"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "sourceCount": 3,
  "sources": [
    {
      "absolute_directory_path": "/Users/tbolis/Downloads/RFP Docs/batch 1",
      "file_name": "docs-accelerators__financial-services_1.11_modules_ROOT_pages_prerequisites.adoc",
      "source_id": "d6d2e426-8da6-4454-a723-202e1bfb1114",
      "full_path": "/Users/tbolis/Downloads/RFP Docs/batch 1/docs-accelerators__financial-services_1.11_modules_ROOT_pages_prerequisites.adoc",
      "segmentCount": 1,
      "ingestion_datetime": "2024-11-20T20:34:41.691Z",
      "ingestion_timestamp": "1732134881691"
    },
    {
      "absolute_directory_path": "/Users/tbolis/Downloads/RFP Docs/batch 1",
      "file_name": "docs-accelerators__healthcare_2.20_modules_ROOT_pages_fhir-r4-us-core-profiles.adoc",
      "source_id": "37789839-7685-46b5-bc39-6f47db3e2921",
      "full_path": "/Users/tbolis/Downloads/RFP Docs/batch 1/docs-accelerators__healthcare_2.20_modules_ROOT_pages_fhir-r4-us-core-profiles.adoc",
      "segmentCount": 3,
      "ingestion_datetime": "2024-11-12T14:28:17.274Z",
      "ingestion_timestamp": "1732134881691"
    },
    {
      ...
    },
    {
      ...
    }
  ]
}
- sourceCount: The number of sources within the embedding store.
- sources: The list of sources within the embedding store.
  - absolute_directory_path: The full path to the directory that contains the source file.
  - file_name: The name of the source file.
  - source_id: The source UUID.
  - full_path: The full path to the file.
  - segmentCount: The number of segments (chunks) the source is split into.
  - ingestion_datetime: The ingestion date and time in ISO 8601 format (UTC).
  - ingestion_timestamp: The ingestion time in milliseconds.
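As one way to consume this payload, the Python sketch below totals the segment counts and finds the most recent ingestion time. It assumes, based on the example values, that ingestion_timestamp is a string holding milliseconds since the Unix epoch; the function names ingestion_time and summarize_sources are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ingestion_time(source: dict) -> datetime:
    """Convert ingestion_timestamp (milliseconds since the epoch, as a string) to UTC."""
    millis = int(source["ingestion_timestamp"])
    # Integer arithmetic avoids float rounding of the millisecond part.
    return datetime.fromtimestamp(millis // 1000, tz=timezone.utc) + timedelta(
        milliseconds=millis % 1000
    )


def summarize_sources(payload: dict) -> dict:
    """Summarize a List Sources payload. Hypothetical helper for illustration."""
    sources = payload["sources"]
    return {
        "listed_sources": len(sources),
        "total_segments": sum(source["segmentCount"] for source in sources),
        "latest_ingestion": max((ingestion_time(s) for s in sources), default=None),
    }


# Reduced payload with made-up source ids, for illustration only.
payload = {
    "sourceCount": 2,
    "sources": [
        {"source_id": "source-1", "segmentCount": 1, "ingestion_timestamp": "1732134881691"},
        {"source_id": "source-2", "segmentCount": 3, "ingestion_timestamp": "1731421697274"},
    ],
}
print(summarize_sources(payload))
```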
Attributes
Example
Embedding | Remove from Store by Filter
The Remove from Store by Filter operation removes all embeddings that match the filter from the store.
Input Fields
Module Configuration
This refers to the MAC Vectors Configuration set up in the Getting Started section.
General
- Store Name: The name of the collection in the Vector database.
Filter
- Metadata key: The metadata key used for filtering results.
- Filter method: The conditional operator to use for filtering.
- Metadata value: The metadata value to evaluate.
Embedding Model
- Embedding Model Name: Indicates the embedding model to be used (default is text-embedding-ada-002).
XML Configuration
Below is the XML configuration for this operation:
<ms-vectors:embedding-remove-from-store-by-filter
    doc:name="Embedding remove documents by filter"
    doc:id="c6b9ec97-1224-445e-ab02-f598d6fff7d7"
    config-ref="MAC_Vectors_Config"
    storeName="mulechaindemo"
    metadataKey="file_name"
    filterMethod="isEqualTo"
    metadataValue="docs-accelerators__accelerators-cim_1.3_modules_ROOT_pages_cim-setup.adoc"
    embeddingModelName="text-embedding-3-small"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
This output has been converted to JSON.
{
  "status": "deleted"
}
- status: The operation status.