Crawl Operations
Crawl | Website
The Crawl website operation crawls a website for content, up to a specified depth. It also lets you:
- set a crawl delay so that you are not overloading the webserver with requests
- download images from the crawled web pages during the crawl
Input Fields
Module Configuration
This refers to the MAC Web Crawler Configuration set up in the Getting Started section.
General
- Website URL: The website to be crawled. The crawl starts from this URL and, by default, also follows any links found in pages that match the base URL, up to the specified Maximum Depth.
- Maximum Depth: The crawl is limited to the specified maximum depth.
- Delay (millisecs): To avoid overloading a website, you can add a delay to your crawl. This is the time to wait between crawling successive pages on the website. Specify 0 for no delay.
- Restrict Crawl under URL: If set to True, the crawler only crawls and fetches content from pages that match the specified Website URL.
- Retrieve Meta Tags: If set to True, the crawler also retrieves metadata from each crawled page, including the title, description, keywords, and other SEO-related information that the page contains.
- Download Documents: If set to True, the crawler also downloads documents found on each crawled page.
- Download Images: If set to True, the crawler also downloads images found on each crawled page.
- Download Location: The path where the crawler saves retrieved webpage content, including any images.
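To make the interaction between Maximum Depth, Delay, and Restrict Crawl under URL more concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the connector's implementation: the function `crawl`, its parameter names, the `get_links` callback, and the in-memory `SITE` link graph are all hypothetical stand-ins, and the simple prefix check is only one plausible reading of "pages that match the specified Website URL".

```python
import time
from collections import deque

# Hypothetical in-memory link graph standing in for real pages (assumption).
SITE = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://example.com/about",
        "https://example.com/docs",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/b"],
    "https://example.com/docs/a/b": [],
    "https://example.com/about": ["https://example.com/docs/a"],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return SITE.get(url, [])


def crawl(start_url, get_links, max_depth, delay_millis=0, restrict_to_path=True):
    """Return the URLs crawled, in breadth-first order.

    Links are followed at most `max_depth` hops away from `start_url`.
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        # Pause between successive pages, but not before the first one.
        if order and delay_millis > 0:
            time.sleep(delay_millis / 1000)
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            if link in seen:
                continue  # already queued or crawled
            # Simplistic prefix match (assumption about what "match" means).
            if restrict_to_path and not link.startswith(start_url):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return order
```

With the sample graph, `crawl("https://example.com/docs", get_links, max_depth=1)` visits only the start page and `/docs/a`; turning the restriction off also brings in `/about`.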
XML Configuration
Below is the XML configuration for this operation:
<ms-webcrawler:crawl-website
    config-ref="MuleSoft_WebCrawler_Config"
    doc:id="2b0e71c5-123e-4c01-8e9f-3d5a128bd86d"
    doc:name="[Crawl] Website"
    url="#[payload.url]"
    maxDepth="#[payload.depth]"
    delayMillis="#[payload.delay]"
    restrictToPath="true"
    getMetaTags="true"
    downloadDocuments="true"
    downloadImages="true"
    downloadPath="#[payload.downloadLocation]"/>
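The expressions in this configuration read `url`, `depth`, `delay`, and `downloadLocation` from the incoming payload. As an illustration only, the sketch below builds a JSON body with those four fields. The helper `build_crawl_payload`, its validation rules, and the sample values (including the download path) are assumptions, not part of the connector.

```python
import json


def build_crawl_payload(url, depth, delay_millis, download_location):
    """Build the payload fields that the XML configuration above reads."""
    # Basic sanity checks (assumed, not enforced by the connector docs).
    if depth < 0:
        raise ValueError("depth must be zero or greater")
    if delay_millis < 0:
        raise ValueError("delay must be zero or greater")
    return {
        "url": url,
        "depth": depth,
        "delay": delay_millis,
        "downloadLocation": download_location,
    }


# Hypothetical sample values for illustration.
payload = build_crawl_payload("https://mac-project.ai/docs", 1, 500, "/tmp/crawl")
body = json.dumps(payload)
```

How the body reaches the flow (for example, through an HTTP listener) depends on your application and is not shown here.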
Output Fields
Payload
This operation responds with a JSON payload.
Example
{
"url": "https://mac-project.ai/docs",
"children": [
{
"url": "https://mac-project.ai/docs/mulechain-ai/showcase",
"children": [],
"fileName": "ExampleShowcases_20241022152911020.json"
},
{
"url": "https://mac-project.ai/docs/ms-webcrawler/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152911045.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/agent",
"children": [],
"fileName": "[Agent]DefinePromptTemplate_20241022152911072.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/sentiment-analysis",
"children": [],
"fileName": "[Sentiment]Analyzer_20241022152911098.json"
},
{
"url": "https://mac-project.ai/docs/contribute",
"children": [],
"fileName": "Contribute_20241022152911127.json"
},
{
"url": "https://mac-project.ai/docs/ms-webcrawler/supported-operations",
"children": [],
"fileName": "MACWebCrawlerConnectorOperations_20241022152911152.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/embedding",
"children": [],
"fileName": "[Embedding]Generatefromtext_20241022152911189.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/sentiment-analysis",
"children": [],
"fileName": "SentimentOperations_20241022152911218.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/embeddings",
"children": [],
"fileName": "[Embedding]Operations_20241022152911246.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai",
"children": [],
"fileName": "MACEinsteinAIConnector_20241022152911271.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/supported-operations/speech",
"children": [],
"fileName": "[Speech]toText_20241022152911515.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/image-generation",
"children": [],
"fileName": "[Image]Generate_20241022152911765.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/connector-overview",
"children": [],
"fileName": "MACWhispererConnectorOverview_20241022152911791.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations",
"children": [],
"fileName": "MACEinsteinAIConnectorOperations_20241022152912422.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/chat",
"children": [],
"fileName": "[Chat]Operations_20241022152912669.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/connector-overview",
"children": [],
"fileName": "MuleSoftAIChain(MAC)ConnectorOverview_20241022152912954.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/tools",
"children": [],
"fileName": "ToolsOperations_20241022152913214.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/chat",
"children": [],
"fileName": "ChatOperations_20241022152913456.json"
},
{
"url": "https://mac-project.ai/docs/ms-webcrawler/connector-overview",
"children": [],
"fileName": "MACWebCrawlerConnectorOverview_20241022152913493.json"
},
{
"url": "https://mac-project.ai/docs",
"children": [],
"fileName": "Duplicate."
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/rag",
"children": [],
"fileName": "[RAG]Operations_20241022152913543.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/",
"children": [],
"fileName": "MuleSoftAIChainConnector_20241022152913902.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/supported-operations",
"children": [],
"fileName": "MACVectorsConnectorOperations_20241022152913929.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/agent",
"children": [],
"fileName": "AgentOperations_20241022152914173.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152914422.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152914668.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/supported-operations",
"children": [],
"fileName": "MACWhispererConnectorOperations_20241022152914938.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations",
"children": [],
"fileName": "MuleSoftAIChainConnectorOperations_20241022152915197.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/connector-overview",
"children": [],
"fileName": "MACVectorsConnectorOverview_20241022152915441.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/agent",
"children": [],
"fileName": "[Agent]DefinePromptTemplate_20241022152915682.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai",
"children": [],
"fileName": "MuleSoftAIChainConnector_20241022152915707.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/platform",
"children": [],
"fileName": "[Agent]List_20241022152915977.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152916005.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/chat",
"children": [],
"fileName": "[Chat]Answerprompt_20241022152916058.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/connector-overview",
"children": [],
"fileName": "AWSBedrockOverview_20241022152916302.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152916552.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/supported-operations/embeddings",
"children": [],
"fileName": "[Embedding]generatefromtext_20241022152916802.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/rag",
"children": [],
"fileName": "RAGOperations_20241022152917050.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/supported-operations/documents",
"children": [],
"fileName": "[Document]parser_20241022152917082.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152917110.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-vectors",
"children": [],
"fileName": "MACVectorsConnector_20241022152917138.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/image-generation",
"children": [],
"fileName": "ImageOperations_20241022152917392.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/",
"children": [],
"fileName": "MACEinsteinAIConnector_20241022152917696.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/connector-overview",
"children": [],
"fileName": "MACEinsteinAIConnectorOverview_20241022152917938.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/embeddings",
"children": [],
"fileName": "EmbeddingOperations_20241022152918185.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/supported-operations/text",
"children": [],
"fileName": "[Text]tospeech_20241022152918213.json"
}
],
"fileName": "Introduction_20241022152910742.json"
}
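The response is a tree in which every node has a `url`, a `children` list, and a `fileName`. The following sketch flattens such a tree into a list of saved pages. It assumes that a `fileName` of `"Duplicate."` marks a URL that was already crawled and not saved again, which is an inference from the example above rather than documented behavior. The function names and the shortened `sample` are illustrative.

```python
import json


def iter_pages(node):
    """Yield (url, fileName) for a node and all of its descendants."""
    yield node["url"], node.get("fileName")
    for child in node.get("children", []):
        yield from iter_pages(child)


def saved_pages(result):
    """Keep only pages that have a real file name (assumed: not 'Duplicate.')."""
    return [(url, name) for url, name in iter_pages(result)
            if name and name != "Duplicate."]


# Shortened copy of the example response above.
sample = json.loads("""
{
  "url": "https://mac-project.ai/docs",
  "children": [
    {"url": "https://mac-project.ai/docs/mulechain-ai/showcase",
     "children": [],
     "fileName": "ExampleShowcases_20241022152911020.json"},
    {"url": "https://mac-project.ai/docs",
     "children": [],
     "fileName": "Duplicate."},
    {"url": "https://mac-project.ai/docs/contribute",
     "children": [],
     "fileName": "Contribute_20241022152911127.json"}
  ],
  "fileName": "Introduction_20241022152910742.json"
}
""")

pages = saved_pages(sample)
```

Here `pages` holds three entries: the root page and the two children that were saved, with the duplicate entry left out.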
Crawl | Get Links as Sitemap
This operation creates a sitemap of a website to aid in SEO and site structure analysis. You can use the results to customize how you crawl the website (e.g., crawling only specific pages).
Input Fields
General
- Website URL: The website to generate a sitemap for.
- Maximum Depth: The generated sitemap is limited to this depth.
- Delay (millisecs): To avoid overloading a website, you can add a delay to your crawl. This is the time to wait between crawling successive pages on the website. Specify 0 for no delay.
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:crawl-links-as-sitemap doc:name="[Crawl] Get links as sitemap"
    doc:id="6410aae9-21d4-4005-8d86-ac9a136657b4"
    url="#[payload.url]"
    maxDepth="#[payload.maxDepth]"
    delayMillis="#[payload.delay]"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
{
"url": "https://mac-project.ai/",
"children": [
{
"url": "https://www.mac-project.ai/",
"children": []
},
{
"url": "https://mac-project.ai/docs",
"children": []
},
{
"url": "https://mac-project.ai/",
"children": []
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/getting-started",
"children": []
},
{
"url": "https://mac-project.ai/docs/contribute",
"children": []
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations",
"children": []
},
{
"url": "https://mac-project.ai/about",
"children": []
}
]
}
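One way to use the sitemap for a targeted crawl is to pick out the URLs under a given path and drop repeats. The sketch below does this for the example response. The function `sitemap_urls`, its `prefix` parameter, and the plain string-prefix match are illustrative assumptions, not connector features.

```python
def sitemap_urls(node, prefix=""):
    """Return unique URLs in the sitemap tree that start with `prefix`.

    URLs are returned in the order they first appear (parent before children).
    """
    seen = set()
    result = []
    stack = [node]
    while stack:
        current = stack.pop()
        url = current["url"]
        if url.startswith(prefix) and url not in seen:
            seen.add(url)
            result.append(url)
        # Push children in reverse so the first child is processed first.
        stack.extend(reversed(current.get("children", [])))
    return result


# The example response above, as a Python dictionary.
sitemap = {
    "url": "https://mac-project.ai/",
    "children": [
        {"url": "https://www.mac-project.ai/", "children": []},
        {"url": "https://mac-project.ai/docs", "children": []},
        {"url": "https://mac-project.ai/", "children": []},
        {"url": "https://mac-project.ai/docs/mulechain-ai/getting-started", "children": []},
        {"url": "https://mac-project.ai/docs/contribute", "children": []},
        {"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations", "children": []},
        {"url": "https://mac-project.ai/about", "children": []},
    ],
}

docs_urls = sitemap_urls(sitemap, prefix="https://mac-project.ai/docs")
```

Each URL in `docs_urls` could then be passed as the Website URL of a separate crawl, so that only the documentation pages are fetched.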