Page Operations
Page | Download Document
The Download document operation lets you download documents from a specified webpage. You can also use it to download a single document by providing the direct URL of that document.
[Figure: Download documents from a webpage]
Input Fields
General
- Page or document URL: Specify the URL of the webpage that contains the documents you wish to download. This will download all documents found on the webpage. Alternatively, you can provide a direct link to a document if you want to download a single document only.
- Max document number: The maximum number of documents to be downloaded.
- Download location: The path where the crawler will save the downloaded documents.
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:page-download-document
doc:name="[Page] Download document"
doc:id="426d217d-74c5-4c9d-9bdb-74f075fc1f26"
url="#[payload.url]"
downloadPath="#[payload.downloadLocation]"
maxDocumentNumber="50"/>
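The configuration above reads its inputs from the incoming payload through the expressions `#[payload.url]` and `#[payload.downloadLocation]`. As a rough sketch, a caller could build a matching JSON body as shown below. The helper name `build_download_request` and the sample values are hypothetical and not part of the connector.

```python
import json

def build_download_request(url: str, download_location: str) -> str:
    """Build a JSON body with the two keys the XML configuration reads."""
    # Key names mirror the expressions #[payload.url] and
    # #[payload.downloadLocation] used in the configuration above.
    return json.dumps({"url": url, "downloadLocation": download_location})

# Sample values are illustrative only (hypothetical URL and path).
body = build_download_request("https://example.com/reports", "/tmp/downloads")
print(body)
```

The same payload shape would also fit the Download image configuration, which uses the same two expressions.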
Output Fields
Payload
This operation responds with a JSON payload.
Example
[
{
"fileName": "versioning-back-support-policy.pdf",
"mimeType": "application/pdf",
"url": "https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/versioning-back-support-policy.pdf"
}
]
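One possible way to consume this response is to group the reported file names by MIME type. The sketch below assumes the payload always has the shape shown in the example (a list of objects with `fileName`, `mimeType`, and `url`); the function name `group_by_mime_type` is hypothetical.

```python
import json
from collections import defaultdict

def group_by_mime_type(payload: str) -> dict:
    """Map each mimeType to the list of file names reported for it."""
    grouped = defaultdict(list)
    for entry in json.loads(payload):
        grouped[entry["mimeType"]].append(entry["fileName"])
    return dict(grouped)

# Sample copied from the example response above.
sample = """[
  {
    "fileName": "versioning-back-support-policy.pdf",
    "mimeType": "application/pdf",
    "url": "https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/versioning-back-support-policy.pdf"
  }
]"""
print(group_by_mime_type(sample))
```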
Page | Download Image
The Download image operation lets you download images from a specified webpage. You can also use it to download a single image by providing the direct URL of that image.
[Figure: Download images from a webpage]
Input Fields
General
- Page or image URL: Specify the URL of the webpage that contains the images you wish to download. This will download all images found on the webpage. Alternatively, you can provide a direct link to an image if you want to download a single image only.
- Max image number: The maximum number of images to be downloaded.
- Download location: The path where the crawler will save the downloaded images.
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:page-download-image
doc:name="[Page] Download image"
doc:id="426d217d-74c5-4c9d-9bdb-74f075fc1f26"
url="#[payload.url]"
downloadPath="#[payload.downloadLocation]"
maxImageNumber="50"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
[
{
"fileName": "mulechain-project-desc.97862945.png",
"mimeType": "image/png",
"url": "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fmulechain-project-desc.97862945.png&w=3840&q=75"
},
{
"fileName": "mulechain-project-model-support.1411a98a.png",
"mimeType": "image/png",
"url": "https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fmulechain-project-model-support.1411a98a.png&w=3840&q=75"
},
{
"fileName": "mulechain-crawler-logo.png",
"mimeType": "image/png",
"url": "https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-crawler-logo.png&w=128&q=75"
},
...
]
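In the example above, each `url` value points at an image-resizing endpoint and carries the source path in its own `url` query parameter. This looks specific to the example site rather than to the connector. Under that assumption, a minimal sketch for recovering the source path could look like this; `original_image_path` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

def original_image_path(image_url: str) -> Optional[str]:
    """Return the decoded `url` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(image_url).query)
    values = query.get("url")
    return values[0] if values else None

# URL copied from the example response above.
print(original_image_path(
    "https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-crawler-logo.png&w=128&q=75"
))
```

`parse_qs` percent-decodes the parameter, so the returned path uses plain `/` separators.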
Page | Get Content
Allows you to retrieve the contents of a specified webpage.
[Figure: Get contents of a webpage]
Input Fields
Module Configuration
This refers to the MuleSoft Web Crawler Configuration set up in the Getting Started section.
General
- Page URL: The webpage to fetch the contents for.
Target Content
- Tag list: The list of tags to extract content from.
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:get-page-content doc:name="[Page] Get content"
doc:id="8f436a64-1795-42a0-9fe7-69be66db235b"
url="#[payload.url]"
config-ref="MuleSoft_WebCrawler_Config"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
{
"title": "The MuleSoft AI Chain (MAC) Project",
"url": "https://mac-project.ai/",
"content": "The MuleSoft AI Chain (MAC) Project DocsDocsAboutAbout GitHub GitHub (opens in a new tab) Light Build powerful AI Agents with MuleSoft AI Chain Seamlessly integrate cutting-edge AI capabilities into your MuleSoft ecosystem, enabling smarter automation and enhanced decision-making. Go to Docs→ Full-powered AI Agents with MuleSoft No Code / Low Code Development MS AI Chain enables MuleSoft developers to leverage powerful AI capabilities with minimal coding. Easily configure and manage AI agents directly within Anypoint Studio, streamlining the development process. MuleSoft AI Chain Project (MAC) is OpenSource MAC Project is an open-source project empowering developers to integrate advanced AI capabilities into the MuleSoft ecosystem. Join our community to collaborate, innovate, and drive the future of technology. Leverage AI Capabilities, Seamlessly Integrated with MuleSoft Enhance your MuleSoft ecosystem with advanced AI functionalities from multiple Large Language Models. Light We Are OpenSource © 2024 The MuleSoft AI Chain Project."
}
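Because `content` is returned as one long string, a short summary can be handy for logging or quick inspection. The sketch below assumes the three keys shown in the example (`title`, `url`, `content`); the function `summarize_page` and its `preview_words` parameter are hypothetical, and the sample content is shortened from the example.

```python
import json

def summarize_page(payload: str, preview_words: int = 5) -> dict:
    """Return the title, URL, whitespace-based word count, and a short preview."""
    page = json.loads(payload)
    words = page["content"].split()
    return {
        "title": page["title"],
        "url": page["url"],
        "wordCount": len(words),
        "preview": " ".join(words[:preview_words]),
    }

# Content shortened from the example response above.
sample = json.dumps({
    "title": "The MuleSoft AI Chain (MAC) Project",
    "url": "https://mac-project.ai/",
    "content": "Build powerful AI Agents with MuleSoft AI Chain",
})
print(summarize_page(sample))
```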
Page | Get Insights
Allows you to fetch insights from a webpage, such as:
- word count on the page
- counts of elements or tags such as H1, H2, DIV, and P. You can also retrieve insights for your own tags by specifying them in the configuration of the operation.
- link structures (broken down into internal, external and reference)
- image links
[Figure: Get insights of a webpage]
Input Fields
Module Configuration
This refers to the MAC Web Crawler Configuration set up in the Getting Started section.
General
- Page URL: The webpage to fetch the insights for.
Target Content
- Tag list: The list of tags to extract content from.
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:page-get-insights
doc:name="Get page insights"
doc:id="08dd9627-5220-41df-a7cf-869bff4eee91"
config-ref="MuleSoft_WebCrawler_Config"
url="#[payload.url]"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
{
"pageStats": {
"div": 36,
"p": 6,
"reference": 0,
"internal": 7,
"external": 2,
"images": 4,
"wordCount": 147,
"h1": 1,
"h2": 0,
"h3": 4,
"h4": 0,
"h5": 0
},
"links": {
"reference": [],
"internal": [
"https://www.mac-project.ai/",
"https://mac-project.ai/docs",
"https://mac-project.ai/",
"https://mac-project.ai/docs/mulechain-ai/getting-started",
"https://mac-project.ai/docs/contribute",
"https://mac-project.ai/docs/mulechain-ai/supported-operations",
"https://mac-project.ai/about"
],
"external": [
"https://www.linkedin.com/groups/13047000/",
"https://github.com/MuleSoft-AI-Chain-Project"
],
"images": [
"https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-project-logo.png&w=96&q=75",
"https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcard-1.b6224663.png&w=3840&q=75",
"https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcard-operations.d3098f38.png&w=1920&q=75",
"https://mac-project.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcard-1.dark.fd8b5613.png&w=3840&q=75"
],
"documents": [
]
},
"title": "The MuleSoft AI Chain (MAC) Project",
"url": "https://mac-project.ai/"
}
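In the example above, the link counts in `pageStats` match the lengths of the corresponding lists under `links`. Whether that always holds is an assumption, so a quick consistency check can be useful when consuming the response. The helper `check_link_counts` is hypothetical, and the sample is a trimmed version of the example with counts adjusted to match.

```python
import json

LINK_KINDS = ("internal", "external", "reference", "images")

def check_link_counts(payload: str) -> dict:
    """For each link kind, report whether pageStats agrees with the list length."""
    insights = json.loads(payload)
    stats, links = insights["pageStats"], insights["links"]
    return {kind: stats[kind] == len(links[kind]) for kind in LINK_KINDS}

# Trimmed sample; counts were adjusted to fit the shortened lists.
sample = json.dumps({
    "pageStats": {"reference": 0, "internal": 2, "external": 2, "images": 1},
    "links": {
        "reference": [],
        "internal": ["https://mac-project.ai/docs", "https://mac-project.ai/about"],
        "external": [
            "https://www.linkedin.com/groups/13047000/",
            "https://github.com/MuleSoft-AI-Chain-Project",
        ],
        "images": [
            "https://mac-project.ai/_next/image?url=%2Flogos%2Fmulechain-project-logo.png&w=96&q=75"
        ],
    },
})
print(check_link_counts(sample))
```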
Page | Get Meta Tags
Allows you to retrieve metadata from a webpage, including title, description, keywords, and other SEO-related information.
[Figure: Get meta tags of a webpage]
Input Fields
XML Configuration
Below is the XML configuration for this operation:
<mac-web-crawler:page-get-meta-tags
doc:name="[Page] Get meta tags"
doc:id="2cd02e46-5607-43f1-9e2b-3ce624a6a68a"
url="#[payload.url]"/>
Output Fields
Payload
This operation responds with a JSON payload.
Example
[
{
"name": "robots",
"content": "index,follow"
},
{
"property": "og:title",
"content": "Introduction"
},
{
"name": "msapplication-TileColor",
"content": "#fff"
},
{
"name": "theme-color",
"content": "#fff"
},
{
"name": "viewport",
"content": "width=device-width, initial-scale=1.0"
},
...
]
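Each entry in the example is keyed by either `name` or `property`, so a flat lookup table can be easier to work with. The sketch below assumes every entry follows that pattern; `meta_tags_to_dict` is a hypothetical helper, and the sample reuses a few entries from the example.

```python
import json

def meta_tags_to_dict(payload: str) -> dict:
    """Flatten the tag list into {name-or-property: content}."""
    tags = {}
    for tag in json.loads(payload):
        key = tag.get("name") or tag.get("property")
        if key:  # skip entries that carry neither attribute
            tags[key] = tag.get("content", "")
    return tags

# Entries copied from the example response above.
sample = json.dumps([
    {"name": "robots", "content": "index,follow"},
    {"property": "og:title", "content": "Introduction"},
    {"name": "viewport", "content": "width=device-width, initial-scale=1.0"},
])
print(meta_tags_to_dict(sample))
```

If two entries share the same key, the later one overwrites the earlier one in this sketch.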