Crawl Operations
Crawl | Website
The Crawl website
operation allows you to easily crawl for website content, at a specified depth. This operation allows you to additionally:
- set a crawl delay so that you are not overloading the webserver with requests
- download images from the crawled web pages during the crawl
data:image/s3,"s3://crabby-images/073ff/073fff2d14eb205a4fb2463336583fbb91037343" alt="Crawl a website for content"
Input Fields
Module Configuration
This refers to the MAC Web Crawler Configuration set up in the Getting Started section.
General
- Website URL: The website to be crawled. Crawl will start from this URL, and by default, based on the specified Maximum Depth, any links found in pages that match the base-url will also be crawled.
- Download Location : The path where the crawler will download retrieved webpage content, including any images.
Target Pages
- Restrict Crawl under URL : If set to True, then the crawler will only crawl and fetch contents from those pages that match the specified Website URL
- Maximum Depth : Crawl will be limited to the specified maximum depth.
Target Content
- Tag list: The list of tags to extract content from.
- Retrieve Meta Tags : If set to True, then the crawler will also retrieve metadata from each crawled page, including title, description, keywords, and other SEO-related information that the page contains.
- Download Documents : If set to True, then the crawler will also download documents found on each crawled page.
- Max document number : The maximum number of document to be downloaded.
- Download Images : If set to True, then the crawler will also download images found on each crawled page.
- Max image number : The maximum number of image to be downloaded.
XML Configuration
Below is the XML configuration for this operation:
<ms-webcrawler:crawl-website
config-ref="MuleSoft_WebCrawler_Config"
doc:id="2b0e71c5-123e-4c01-8e9f-3d5a128bd86d"
doc:name="[Crawl] Website"
url="#[payload.url]"
maxDepth="#[payload.depth]"
delayMillis="#[payload.delay]"
restrictToPath="true"
getMetaTags="true"
downloadDocuments="true"
downloadImages="true"
downloadPath="#[payload.downloadLocation]"/>
Output Fields
Payload
This operation responds with a json
payload.
Example
{
"url": "https://mac-project.ai/docs",
"children": [
{
"url": "https://mac-project.ai/docs/mulechain-ai/showcase",
"children": [],
"fileName": "ExampleShowcases_20241022152911020.json"
},
{
"url": "https://mac-project.ai/docs/ms-webcrawler/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152911045.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/agent",
"children": [],
"fileName": "[Agent]DefinePromptTemplate_20241022152911072.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/sentiment-analysis",
"children": [],
"fileName": "[Sentiment]Analyzer_20241022152911098.json"
},
{
"url": "https://mac-project.ai/docs/contribute",
"children": [],
"fileName": "Contribute_20241022152911127.json"
},
{
"url": "https://mac-project.ai/docs/ms-webcrawler/supported-operations",
"children": [],
"fileName": "MACWebCrawlerConnectorOperations_20241022152911152.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/embedding",
"children": [],
"fileName": "[Embedding]Generatefromtext_20241022152911189.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/sentiment-analysis",
"children": [],
"fileName": "SentimentOperations_20241022152911218.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/embeddings",
"children": [],
"fileName": "[Embedding]Operations_20241022152911246.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai",
"children": [],
"fileName": "MACEinsteinAIConnector_20241022152911271.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/supported-operations/speech",
"children": [],
"fileName": "[Speech]toText_20241022152911515.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/image-generation",
"children": [],
"fileName": "[Image]Generate_20241022152911765.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/connector-overview",
"children": [],
"fileName": "MACWhispererConnectorOverview_20241022152911791.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations",
"children": [],
"fileName": "MACEinsteinAIConnectorOperations_20241022152912422.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/chat",
"children": [],
"fileName": "[Chat]Operations_20241022152912669.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/connector-overview",
"children": [],
"fileName": "MuleSoftAIChain(MAC)ConnectorOverview_20241022152912954.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/tools",
"children": [],
"fileName": "ToolsOperations_20241022152913214.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/chat",
"children": [],
"fileName": "ChatOperations_20241022152913456.json"
},
{
"url": "https://mac-project.ai/docs/ms-webcrawler/connector-overview",
"children": [],
"fileName": "MACWebCrawlerConnectorOverview_20241022152913493.json"
},
{
"url": "https://mac-project.ai/docs",
"children": [],
"fileName": "Duplicate."
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/rag",
"children": [],
"fileName": "[RAG]Operations_20241022152913543.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/",
"children": [],
"fileName": "MuleSoftAIChainConnector_20241022152913902.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/supported-operations",
"children": [],
"fileName": "MACVectorsConnectorOperations_20241022152913929.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/agent",
"children": [],
"fileName": "AgentOperations_20241022152914173.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152914422.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152914668.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/supported-operations",
"children": [],
"fileName": "MACWhispererConnectorOperations_20241022152914938.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations",
"children": [],
"fileName": "MuleSoftAIChainConnectorOperations_20241022152915197.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/connector-overview",
"children": [],
"fileName": "MACVectorsConnectorOverview_20241022152915441.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/supported-operations/agent",
"children": [],
"fileName": "[Agent]DefinePromptTemplate_20241022152915682.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai",
"children": [],
"fileName": "MuleSoftAIChainConnector_20241022152915707.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/platform",
"children": [],
"fileName": "[Agent]List_20241022152915977.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152916005.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/chat",
"children": [],
"fileName": "[Chat]Answerprompt_20241022152916058.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/connector-overview",
"children": [],
"fileName": "AWSBedrockOverview_20241022152916302.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152916552.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/supported-operations/embeddings",
"children": [],
"fileName": "[Embedding]generatefromtext_20241022152916802.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/rag",
"children": [],
"fileName": "RAGOperations_20241022152917050.json"
},
{
"url": "https://mac-project.ai/docs/ms-vectors/supported-operations/documents",
"children": [],
"fileName": "[Document]parser_20241022152917082.json"
},
{
"url": "https://mac-project.ai/docs/amazon-bedrock/getting-started",
"children": [],
"fileName": "GettingStarted_20241022152917110.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-vectors",
"children": [],
"fileName": "MACVectorsConnector_20241022152917138.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/image-generation",
"children": [],
"fileName": "ImageOperations_20241022152917392.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/",
"children": [],
"fileName": "MACEinsteinAIConnector_20241022152917696.json"
},
{
"url": "https://mac-project.ai/docs/einstein-ai/connector-overview",
"children": [],
"fileName": "MACEinsteinAIConnectorOverview_20241022152917938.json"
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations/embeddings",
"children": [],
"fileName": "EmbeddingOperations_20241022152918185.json"
},
{
"url": "https://mac-project.ai/docs/mac-whisperer/supported-operations/text",
"children": [],
"fileName": "[Text]tospeech_20241022152918213.json"
}
],
"fileName": "Introduction_20241022152910742.json"
}
Crawl | Get Links as Sitemap
This operation allows you to create a sitemap of a website to aid in SEO and site structure analysis. The results of this operation can be used to customize the way you want to crawl the website (eg specific pages)
data:image/s3,"s3://crabby-images/b48e8/b48e848c7a646e885e43f4aa12ac411b7e073eb4" alt="Get links as sitemap"
Input Fields
General
- Website URL: The website to generate a sitemap for.
Target Pages
- Restrict Crawl under URL : If set to True, then the crawler will only crawl and fetch contents from those pages that match the specified Website URL
- Maximum Depth : Crawl will be limited to the specified maximum depth.
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:crawl-links-as-sitemap doc:name="[Crawl] Get links as sitemap"
doc:id="6410aae9-21d4-4005-8d86-ac9a136657b4"
url="#[payload.url]"
maxDepth="#[payload.maxDepth]"
delayMillis="#[payload.delay]"/>
Output Fields
Payload
This operation responds with a json
payload.
Example
{
"url": "https://mac-project.ai/",
"children": [
{
"url": "https://www.mac-project.ai/",
"children": []
},
{
"url": "https://mac-project.ai/docs",
"children": []
},
{
"url": "https://mac-project.ai/",
"children": []
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/getting-started",
"children": []
},
{
"url": "https://mac-project.ai/docs/contribute",
"children": []
},
{
"url": "https://mac-project.ai/docs/mulechain-ai/supported-operations",
"children": []
},
{
"url": "https://mac-project.ai/about",
"children": []
}
]
}