
Page Operations

Page | Download Document

The Download document operation allows you to download all documents found on a specified webpage. You can also use it to download a single document by providing the direct URL of that document.

Download documents from a webpage

Input Fields

General

  • Page or Document URL: Specify the URL of the webpage that contains the documents you wish to download. This will download all documents found on the webpage. Alternatively, you can provide a direct link to a document if you want to download a single document only.
  • Download Location: The path where the crawler will save the downloaded documents.

XML Configuration

Below is the XML configuration for this operation:

<mac-web-crawler:page-download-document
doc:name="[Page] Download document"
doc:id="426d217d-74c5-4c9d-9bdb-74f075fc1f26" 
url="#[payload.url]" 
downloadPath="#[payload.downloadLocation]"/>

Output Fields

Payload

This operation responds with a JSON payload: an array with one entry per downloaded document.

Example

[
    {
        "fileName": "versioning-back-support-policy.pdf",
        "mimeType": "application/pdf",
        "url": "https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/versioning-back-support-policy.pdf"
    }
]
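
A downstream step often needs to pick specific file types out of this payload. The following is a minimal Python sketch, assuming the payload has the shape shown in the example above. The function name `filter_by_mime_type` is hypothetical and is not part of the connector.

```python
import json


def filter_by_mime_type(payload_json, mime_type):
    """Return the file names of entries whose mimeType equals mime_type."""
    entries = json.loads(payload_json)
    return [e["fileName"] for e in entries if e.get("mimeType") == mime_type]


# Sample payload copied from the example above.
sample = """[
    {
        "fileName": "versioning-back-support-policy.pdf",
        "mimeType": "application/pdf",
        "url": "https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/versioning-back-support-policy.pdf"
    }
]"""

print(filter_by_mime_type(sample, "application/pdf"))
```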

Page | Download Image

The Download image operation allows you to download all images found on a specified webpage. You can also use it to download a single image by providing the direct URL of that image.

Download images from a webpage

Input Fields

General

  • Page or Image URL: Specify the URL of the webpage that contains the images you wish to download. This will download all images found on the webpage. Alternatively, you can provide a direct link to an image if you want to download a single image only.
  • Download Location: The path where the crawler will save the downloaded images.

XML Configuration

Below is the XML configuration for this operation:

<mac-web-crawler:page-download-image
doc:name="[Page] Download image"
doc:id="426d217d-74c5-4c9d-9bdb-74f075fc1f26" 
url="#[payload.url]" 
downloadPath="#[payload.downloadLocation]"/>

Output Fields

Payload

This operation responds with a JSON payload: an array with one entry per downloaded image.

Example

[
    {
        "fileName": "mulechain-project-desc.97862945.png",
        "mimeType": "image/png",
        "url": "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fmulechain-project-desc.97862945.png&w=3840&q=75"
    },
    {
        "fileName": "mulechain-project-model-support.1411a98a.png",
        "mimeType": "image/png",
        "url": "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fmulechain-project-model-support.1411a98a.png&w=3840&q=75"
    },
    {
        "fileName": "mulechain-crawler-logo.png",
        "mimeType": "image/png",
        "url": "https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-crawler-logo.png&w=128&q=75"
    },
    ...
]
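
The `url` values in this example look like image-optimizer URLs that carry the source path in a `url` query parameter. That is an observation about this sample page, not documented connector behavior. Under that assumption, a minimal Python sketch for recovering the underlying image address could look like this (`original_image_url` is a hypothetical helper):

```python
from urllib.parse import parse_qs, urljoin, urlparse


def original_image_url(proxied_url):
    """Resolve the `url` query parameter against the proxied URL.

    Returns the input unchanged when no `url` parameter is present.
    """
    query = parse_qs(urlparse(proxied_url).query)
    target = query.get("url", [None])[0]
    if target is None:
        return proxied_url
    # parse_qs has already percent-decoded the value (e.g. %2F -> /).
    return urljoin(proxied_url, target)


proxied = "https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-crawler-logo.png&w=128&q=75"
print(original_image_url(proxied))
```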

Page | Get Content

Allows you to retrieve the contents of a specified webpage.

Get contents of a webpage

Input Fields

Module Configuration

This refers to the MuleSoft Web Crawler Configuration set up in the Getting Started section.

General

  • Page URL: The webpage to fetch the contents for.

XML Configuration

Below is the XML configuration for this operation:

<mac-web-crawler:get-page-content doc:name="[Page] Get content"
doc:id="8f436a64-1795-42a0-9fe7-69be66db235b" 
url="#[payload.url]" 
config-ref="MuleSoft_WebCrawler_Config"/>

Output Fields

Payload

This operation responds with a JSON payload containing the page title, URL, and text content.

Example

{
    "title": "The MuleSoft AI Chain (MAC) Project",
    "url": "https://mac-project.ai/",
    "content": "The MuleSoft AI Chain (MAC) Project DocsDocsAboutAbout GitHub GitHub (opens in a new tab) Light Build powerful AI Agents with MuleSoft AI Chain Seamlessly integrate cutting-edge AI capabilities into your MuleSoft ecosystem, enabling smarter automation and enhanced decision-making. Go to Docs→ Full-powered AI Agents with MuleSoft No Code / Low Code Development MS AI Chain enables MuleSoft developers to leverage powerful AI capabilities with minimal coding. Easily configure and manage AI agents directly within Anypoint Studio, streamlining the development process. MuleSoft AI Chain Project (MAC) is OpenSource MAC Project is an open-source project empowering developers to integrate advanced AI capabilities into the MuleSoft ecosystem. Join our community to collaborate, innovate, and drive the future of technology. Leverage AI Capabilities, Seamlessly Integrated with MuleSoft Enhance your MuleSoft ecosystem with advanced AI functionalities from multiple Large Language Models. Light We Are OpenSource © 2024 The MuleSoft AI Chain Project."
}
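
The `content` field is returned as a single string, which can be long for larger pages. If a later step needs smaller pieces, one option is to split the text into fixed-size word chunks. The sketch below is only an illustration in Python; `chunk_words` and the chunk size are assumptions, not connector features.

```python
def chunk_words(content, max_words):
    """Split content into chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = content.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Shortened stand-in for the `content` value of a Get Content payload.
content = "Build powerful AI Agents with MuleSoft AI Chain"
print(chunk_words(content, 3))
```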

Page | Get Insights

Allows you to fetch insights from a webpage, such as:

  • word count on the page
  • counts of elements or tags such as H1, H2, DIV, and P. You can also request counts for your own tags by specifying them in the configuration of the operation.
  • link structures (broken down into internal, external, and reference)
  • image links

These insights can be used to build your own custom crawl by combining this operation with the other operations provided by this connector.

Get insights of a webpage

Input Fields

Module Configuration

This refers to the MuleSoft Web Crawler Configuration set up in the Getting Started section.

General

  • Page URL: The webpage to fetch the insights for.

XML Configuration

Below is the XML configuration for this operation:

<mac-web-crawler:page-get-insights
doc:name="Get page insights"
doc:id="08dd9627-5220-41df-a7cf-869bff4eee91"
config-ref="MuleSoft_WebCrawler_Config"
url="#[payload.url]"/>

Output Fields

Payload

This operation responds with a JSON payload containing the page statistics and the links found on the page.

Example

{
    "pageStats": {
        "div": 36,
        "p": 6,
        "reference": 0,
        "internal": 7,
        "external": 2,
        "images": 4,
        "wordCount": 147,
        "h1": 1,
        "h2": 0,
        "h3": 4,
        "h4": 0,
        "h5": 0
    },
    "links": {
        "reference": [],
        "internal": [
            "https://www.mac-project.ai/",
            "https://mac-project.ai/docs",
            "https://mac-project.ai/",
            "https://mac-project.ai/docs/mulechain-ai/getting-started",
            "https://mac-project.ai/docs/contribute",
            "https://mac-project.ai/docs/mulechain-ai/supported-operations",
            "https://mac-project.ai/about"
        ],
        "external": [
            "https://www.linkedin.com/groups/13047000/",
            "https://github.com/MuleSoft-AI-Chain-Project"
        ],
        "images": [
            "https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-project-logo.png&w=96&q=75",
            "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcard-1.b6224663.png&w=3840&q=75",
            "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcard-operations.d3098f38.png&w=1920&q=75",
            "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcard-1.dark.fd8b5613.png&w=3840&q=75"
        ],
        "documents": [
        ]
    },
    "title": "The MuleSoft AI Chain (MAC) Project",
    "url": "https://mac-project.ai/"
}
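
One way to use this payload for a custom crawl is to treat `links.internal` as the list of candidate pages to visit next. The Python sketch below assumes the payload shape shown in the example above. The helper `next_pages` and its trailing-slash normalization are illustrative choices, not connector behavior.

```python
def next_pages(insights, visited):
    """Return internal links not yet visited, de-duplicated, in page order.

    Links that differ only by a trailing slash are treated as the same page.
    """
    seen = {v.rstrip("/") for v in visited}
    queue = []
    for link in insights.get("links", {}).get("internal", []):
        key = link.rstrip("/")
        if key not in seen:
            seen.add(key)
            queue.append(link)
    return queue


# Trimmed copy of the example payload above.
insights = {
    "links": {
        "internal": [
            "https://mac-project.ai/docs",
            "https://mac-project.ai/",
            "https://mac-project.ai/docs/contribute",
            "https://mac-project.ai/docs",
        ]
    },
    "url": "https://mac-project.ai/",
}

print(next_pages(insights, visited=[insights["url"]]))
```

Each returned URL could then be passed to Get Content, Get Insights, or one of the download operations.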

Page | Get Meta Tags

Allows you to retrieve metadata from a webpage, including title, description, keywords, and other SEO-related information.

Get meta tags of a webpage

Input Fields

General

  • Page URL: The webpage to fetch the meta tags for.

XML Configuration

Below is the XML configuration for this operation:

<mac-web-crawler:page-get-meta-tags
 doc:name="[Page] Get meta tags"
 doc:id="2cd02e46-5607-43f1-9e2b-3ce624a6a68a"
 url="#[payload.url]"/>

Output Fields

Payload

This operation responds with a JSON payload: an array with one entry per meta tag, keyed by either "name" or "property".

Example

[
    {
        "name": "robots",
        "content": "index,follow"
    },
    {
        "property": "og:title",
        "content": "Introduction"
    },
    {
        "name": "msapplication-TileColor",
        "content": "#fff"
    },
    {
        "name": "theme-color",
        "content": "#fff"
    },
    {
        "name": "viewport",
        "content": "width=device-width, initial-scale=1.0"
    },
    ...
]
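
Because each entry identifies itself by either `name` or `property`, it can be convenient to flatten the array into a single lookup table. The following Python sketch assumes the entry shape shown in the example above. `meta_tags_to_dict` is a hypothetical helper, and when a key repeats the last value wins.

```python
def meta_tags_to_dict(tags):
    """Map each tag's `name` (or `property` if no name) to its `content`."""
    result = {}
    for tag in tags:
        key = tag.get("name") or tag.get("property")
        if key is not None:
            result[key] = tag.get("content")
    return result


# A few entries copied from the example above.
tags = [
    {"name": "robots", "content": "index,follow"},
    {"property": "og:title", "content": "Introduction"},
    {"name": "theme-color", "content": "#fff"},
]

lookup = meta_tags_to_dict(tags)
print(lookup["og:title"])
```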