Getting Started
System Requirements
Before you start, ensure you have the following prerequisites:
- Java Development Kit (JDK) 11 or 17
- Apache Maven
- MuleSoft Anypoint Studio
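The command-line prerequisites can be checked before you continue. The following is a minimal sketch, not part of the connector: the helper name check_tool is hypothetical, and it only tests whether the java, mvn, and git launchers are on your PATH. It does not verify versions or detect Anypoint Studio, which is a desktop application.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
# Returns 0 if the command is found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# java and mvn are the launchers for the JDK and Apache Maven;
# git is needed for the clone step later in this guide.
for tool in java mvn git; do
    check_tool "$tool" || true
done
```

If a tool is reported as missing, install it or add it to your PATH before building the connector.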
Download the MAC WebCrawler Connector
Clone the MAC WebCrawler Connector repository from GitHub:
git clone https://github.com/MuleSoft-AI-Chain-Project/mac-web-crawler.git
cd mac-web-crawler
Install the Connector with Java 8
mvn clean install -Dmaven.test.skip=true -DskipTests
Install the Connector with Java 11, 17, 21, 22, etc.
Step 1
export MAVEN_OPTS="--add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.util.regex=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.xml/javax.xml.namespace=ALL-UNNAMED"
Step 2
For Java 17
mvn clean install -Dmaven.test.skip=true -DskipTests -Djdeps.multiRelease=17
For Java 21
mvn clean install -Dmaven.test.skip=true -DskipTests -Djdeps.multiRelease=21
For Java 22
mvn clean install -Dmaven.test.skip=true -DskipTests -Djdeps.multiRelease=22
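The per-version commands above differ only in the -Djdeps.multiRelease value, so they can be generated from the Java major version. The sketch below assumes a hypothetical helper named mac_build_cmd. It covers only the versions for which this guide lists a command (8, 17, 21, and 22) and reports an error for anything else rather than guessing.

```shell
#!/bin/sh
# Hypothetical helper: print the Maven build command documented in this
# guide for a given Java major version.
mac_build_cmd() {
    base="mvn clean install -Dmaven.test.skip=true -DskipTests"
    case "$1" in
        8)
            # Java 8 needs no extra flag.
            echo "$base"
            ;;
        17|21|22)
            # Newer JDKs pass their own major version to jdeps.multiRelease.
            echo "$base -Djdeps.multiRelease=$1"
            ;;
        *)
            echo "no documented build command for Java $1" >&2
            return 1
            ;;
    esac
}

# Print the command for Java 17.
mac_build_cmd 17
```

The helper only prints the command. For Java 17 and later, MAVEN_OPTS from Step 1 still has to be exported in the same shell before you run it, for example with sh -c "$(mac_build_cmd 21)".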
Add the Connector to Your Project
Add the following dependency to your pom.xml file:
<dependency>
<groupId>com.mule.mulechain</groupId>
<artifactId>mac-web-crawler</artifactId>
<version>0.1.0</version>
<classifier>mule-plugin</classifier>
</dependency>
The MAC Project connectors are updated frequently, so the version number changes regularly.
Make sure to replace the version in the dependency above with the latest release from our GitHub repository.
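If the dependency fails to resolve, it can help to confirm that the install step placed the connector in your local Maven repository. The sketch below relies on Maven's standard repository layout (groupId with dots turned into directories, then artifactId, then version). The helper name artifact_dir and the M2_REPO override variable are hypothetical, and the default location ~/.m2/repository is assumed.

```shell
#!/bin/sh
# Hypothetical helper: build the local-repository directory for an artifact.
# Layout: <repo>/<groupId with dots as slashes>/<artifactId>/<version>
# M2_REPO is an illustrative override; the default is ~/.m2/repository.
artifact_dir() {
    group_path=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s\n' "${M2_REPO:-$HOME/.m2/repository}" "$group_path" "$2" "$3"
}

# Coordinates taken from the dependency snippet in this guide.
dir=$(artifact_dir com.mule.mulechain mac-web-crawler 0.1.0)
if [ -d "$dir" ]; then
    echo "installed: $dir"
else
    echo "not found: $dir"
fi
```

Adjust the version argument to match the release you actually built.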
Configuration
The configuration is applicable to the Crawl website and Get page insights operations.
The configuration for the MAC WebCrawler connector is simple to create.
Go to the Global Elements in your MuleSoft project and create a new configuration. Under Connector Configuration, you will find the MAC WebCrawler Configuration. Select it and press OK.
To restrict content retrieval to specific elements or tags, enter them in the Tag List. For example, if the Tag List contains p.text-primary and h1.heading-primary, only text from those HTML elements will be retrieved (or analysed if using the Get page insights operation).