Crawl Operations

Crawl | Website (Full Scan)

The [Crawl] Website (Full Scan) operation lets you crawl a website for content down to a specified depth. This operation additionally allows you to:

  • set a crawl delay so that you are not overloading the webserver with requests
  • download images from the crawled web pages during the crawl

Input Fields

Module Configuration

This refers to the MAC Web Crawler Configuration set up in the Getting Started section.

General

  • Website URL: The website to be crawled. The crawl starts from this URL and, by default, follows any links found on pages that match this base URL, up to the specified Maximum Depth.
  • Output format: The format of the output.
    • HTML: The output will be in HTML format.
    • TEXT: The output will be in TEXT format.
    • MARKDOWN: The output will be in MARKDOWN format.
  • Download Location: The path where the crawler will download retrieved webpage content, including any images.

Target Pages

  • Maximum Depth: The crawl is limited to the specified maximum depth.
  • Restrict Crawl under URL: If set to True, the crawler only crawls and fetches content from pages that match the specified Website URL.
  • Regex URLs filter logic (illustrated in the sketch after this list):
    • INCLUDE: The crawler only crawls and fetches content from pages that match the specified Regular Expressions.
    • EXCLUDE: The crawler does not crawl or fetch content from pages that match the specified Regular Expressions.
  • Regex URLs List: The list of Regular Expressions used to filter URLs.
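
As a rough illustration of the INCLUDE and EXCLUDE semantics, the DataWeave sketch below applies a hypothetical pattern to a few hypothetical links; the connector performs the same kind of matching against every URL it discovers:

%dw 2.0
output application/json
// Hypothetical links and pattern, purely to illustrate INCLUDE vs EXCLUDE semantics.
var discoveredLinks = [
    "https://mac-project.ai/docs/ms-webcrawler/getting-started",
    "https://mac-project.ai/blog/release-notes",
    "https://mac-project.ai/pricing"
]
var docsPattern = /docs/
---
{
    // INCLUDE: only pages matching the pattern would be crawled
    includeWouldCrawl: discoveredLinks filter ((link) -> link contains docsPattern),
    // EXCLUDE: pages matching the pattern would be skipped
    excludeWouldCrawl: discoveredLinks filter ((link) -> not (link contains docsPattern))
}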

Target Content

  • Tag list: The list of tags to extract content from.
  • Retrieve Meta Tags: If set to True, the crawler also retrieves metadata from each crawled page, including the title, description, keywords, and other SEO-related information the page contains.
  • Download Documents: If set to True, the crawler also downloads documents found on each crawled page.
  • Max document number: The maximum number of documents to be downloaded.
  • Download Images: If set to True, the crawler also downloads images found on each crawled page.
  • Max image number: The maximum number of images to be downloaded.

Page Load Options (WebDriver)

ℹ️

Page load options apply only to the WebDriver connection and are relevant only when retrieving dynamic content.

  • Wait on page load (millisec): The time to wait, in milliseconds, for the page to load.
  • Wait for XPath: An XPath expression to wait for before continuing. The process continues in any case once the Wait on page load (millisec) timeout is reached.
  • Extract Shadow DOM: Extract the shadow DOM content.
  • Shadow Host(s) XPath: The XPath of the shadow host(s) to extract. If not set, the whole shadow DOM is extracted. If set, all shadow DOM content nested inside the specified shadow host(s) is extracted and merged in the order the hosts appear on the page. Example values for these XPath fields are sketched after this list.
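
As a minimal sketch of the kind of values these XPath fields expect, the snippet below builds a payload with hypothetical selectors; the field names and selectors here are illustrative only and depend entirely on the page being crawled:

%dw 2.0
output application/json
---
{
    // Hypothetical selector: wait until the main content container has rendered.
    waitForXPath: "//main[@id='content']",
    // Hypothetical selector: extract only the shadow DOM hosted by this custom element.
    shadowHostXPath: "//custom-widget[@data-section='docs']"
}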

XML Configuration

Below is the XML configuration for this operation:

<ms-webcrawler:crawl-website
    config-ref="MuleSoft_WebCrawler_Config"
    doc:id="2b0e71c5-123e-4c01-8e9f-3d5a128bd86d"
    doc:name="[Crawl] Website"
    url="#[payload.url]"
    maxDepth="#[payload.depth]"
    delayMillis="#[payload.delay]"
    restrictToPath="true"
    getMetaTags="true"
    downloadDocuments="true"
    downloadImages="true"
    downloadPath="#[payload.downloadLocation]"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example
{
    "url": "https://mac-project.ai/docs",
    "children": [
        {
            "url": "https://mac-project.ai/docs/mulechain-ai/showcase",
            "fileName": "ExampleShowcases_20241022152911020.json"
        },
        {
            "url": "https://mac-project.ai/docs/ms-webcrawler/getting-started",
            "fileName": "GettingStarted_20241022152911045.json"
        },
        {
            "url": "https://mac-project.ai/docs/amazon-bedrock/supported-operations/agent",
            "fileName": "[Agent]DefinePromptTemplate_20241022152911072.json"
        },
        ...
    ],
    "fileName": "Introduction_20241022152910742.json"
}
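
If you want to work with this result in a flow, a minimal DataWeave sketch (assuming the response is read as JSON) can flatten it into one list of page URLs and the files written to the download location:

%dw 2.0
output application/json
---
// Top-level page first, then each child page and the file it was written to.
[{ url: payload.url, fileName: payload.fileName }] ++
    ((payload.children default []) map ((child) -> { url: child.url, fileName: child.fileName }))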

Crawl | Website (Streaming)

The [Crawl] Website (Streaming) operation lets you crawl a website for content down to a specified depth. This operation additionally allows you to:

  • set a crawl delay so that you are not overloading the webserver with requests
  • download images from the crawled web pages during the crawl

Input Fields

Module Configuration

This refers to the MAC Web Crawler Configuration set up in the Getting Started section.

General

  • Website URL: The website to be crawled. The crawl starts from this URL and, by default, follows any links found on pages that match this base URL, up to the specified Maximum Depth.
  • Output format: The format of the output.
    • HTML: The output will be in HTML format.
    • TEXT: The output will be in TEXT format.
    • MARKDOWN: The output will be in MARKDOWN format.
  • Download Location: The path where the crawler will download retrieved webpage content, including any images.

Target Pages

  • Maximum Depth: The crawl is limited to the specified maximum depth.
  • Restrict Crawl under URL: If set to True, the crawler only crawls and fetches content from pages that match the specified Website URL.
  • Regex URLs filter logic:
    • INCLUDE: The crawler only crawls and fetches content from pages that match the specified Regular Expressions.
    • EXCLUDE: The crawler does not crawl or fetch content from pages that match the specified Regular Expressions.
  • Regex URLs List: The list of Regular Expressions used to filter URLs.

Page Load Options (WebDriver)

ℹ️

Page load options apply only to the WebDriver connection and are relevant only when retrieving dynamic content.

  • Wait on page load (millisec): The time to wait, in milliseconds, for the page to load.
  • Wait for XPath: An XPath expression to wait for before continuing. The process continues in any case once the Wait on page load (millisec) timeout is reached.
  • Extract Shadow DOM: Extract the shadow DOM content.
  • Shadow Host(s) XPath: The XPath of the shadow host(s) to extract. If not set, the whole shadow DOM is extracted. If set, all shadow DOM content nested inside the specified shadow host(s) is extracted and merged in the order the hosts appear on the page.

XML Configuration

Below is the XML configuration for this operation:

<ms-webcrawler:crawl-website
    config-ref="MuleSoft_WebCrawler_Config"
    doc:id="2b0e71c5-123e-4c01-8e9f-3d5a128bd86d"
    doc:name="[Crawl] Website"
    url="#[payload.url]"
    maxDepth="#[payload.depth]"
    delayMillis="#[payload.delay]"
    restrictToPath="true"
    getMetaTags="true"
    downloadDocuments="true"
    downloadImages="true"
    downloadPath="#[payload.downloadLocation]"/>

Output Fields

Payload

This operation responds with a JSON payload.

Example
[
    {
        "url": "https://mac-project.ai/docs",
        "title": "",
        "content": ""
    },
    {
        "url": "https://mac-project.ai/docs/mulechain-ai/showcase",
        "title": "",
        "content": ""
    },
    {
        "url": "https://mac-project.ai/docs/ms-webcrawler/getting-started",
        "title": "",
        "content": ""
    }
]
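
A minimal DataWeave sketch for consuming the streamed result (assuming it is read as JSON) might drop pages that came back without content and keep a compact structure for downstream processing, for example indexing or embedding:

%dw 2.0
output application/json
---
// Keep only pages that came back with content, reduced to a compact structure.
payload
    filter ((page) -> not isEmpty(page.content default ""))
    map ((page) -> { url: page.url, title: page.title, text: page.content })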

Crawl | Get Sitemap

This operation creates a sitemap of a website to aid in SEO and site-structure analysis. The results of this operation can be used to customize the way you crawl the website (e.g., targeting specific pages).


Input Fields

General

  • Website URL: The website to generate a sitemap for.

Target Pages

  • Maximum Depth: The crawl is limited to the specified maximum depth.
  • Restrict Crawl under URL: If set to True, the crawler only crawls and fetches content from pages that match the specified Website URL.
  • Regex URLs filter logic:
    • INCLUDE: The crawler only crawls and fetches content from pages that match the specified Regular Expressions.
    • EXCLUDE: The crawler does not crawl or fetch content from pages that match the specified Regular Expressions.
  • Regex URLs List: The list of Regular Expressions used to filter URLs.

Page Load Options (WebDriver)

ℹ️

Page load options apply only to the WebDriver connection and are relevant only when retrieving dynamic content.

  • Wait on page load (millisec): The time to wait, in milliseconds, for the page to load.
  • Wait for XPath: An XPath expression to wait for before continuing. The process continues in any case once the Wait on page load (millisec) timeout is reached.
  • Extract Shadow DOM: Extract the shadow DOM content.
  • Shadow Host(s) XPath: The XPath of the shadow host(s) to extract. If not set, the whole shadow DOM is extracted. If set, all shadow DOM content nested inside the specified shadow host(s) is extracted and merged in the order the hosts appear on the page.

XML Configuration

Below is the XML configuration for this operation:

<mac-web-crawler:crawl-links-as-sitemap
    doc:name="[Crawl] Get links as sitemap"
    doc:id="6410aae9-21d4-4005-8d86-ac9a136657b4"
    url="#[payload.url]"
    maxDepth="#[payload.maxDepth]"
    delayMillis="#[payload.delay]"/>

Output Fields

Payload

This operation responds with an XML payload based on the sitemap protocol specification.

ℹ️

The priority reflects the depth of the page in the website structure: depth 0 corresponds to priority 1.0, depth 1 to priority 0.9, and so on (a depth greater than 10 always corresponds to priority 0.0).
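
In other words, the mapping is priority = 1.0 - 0.1 * depth, floored at 0.0; the DataWeave sketch below simply spells out that rule for the first few depth levels:

%dw 2.0
output application/json
// Depth-to-priority rule from the note above: 1.0 at depth 0, minus 0.1 per level, never below 0.0.
fun priorityFor(depth: Number) = max([0.0, 1.0 - (0.1 * depth)])
---
(0 to 11) map ((depth) -> { depth: depth, priority: priorityFor(depth) })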

Example
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://mac-project.ai/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://mac-project.ai/docs</loc>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://mac-project.ai/docs/ms-vectors/supported-operations/document</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://mac-project.ai/docs/einstein-ai/supported-operations/tools</loc>
    <priority>0.8</priority>
  </url>
</urlset>
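
To drive a targeted crawl of specific pages, a minimal DataWeave sketch (assuming the sitemap payload is read as XML) can pull out just the page URLs:

%dw 2.0
output application/json
---
// Collect every <loc> value from the sitemap into a plain list of URLs.
payload.urlset.*url map ((entry) -> entry.loc)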