Nutch and Elasticsearch for Crawling Websites

Ravin Sher
11 min read · Jan 17, 2021

Before we start, make sure you have Java and Apache Ant installed on your machine. You can check the Ant installation by running ant -version

If Ant is not already installed, your best bet is to use Homebrew (brew install ant) or MacPorts (sudo port install apache-ant) to install Apache Ant.
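For example, checking the prerequisites and installing Ant with Homebrew might look like this (assuming Homebrew itself is already set up):

# Check what is already installed
java -version
ant -version

# Install Ant with Homebrew if it is missing
brew install ant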

Alternatively, though I would highly advise using Homebrew or MacPorts instead, you can install Apache Ant manually. To do so, you would need to:

  1. Decompress the .tar.gz file.
  2. Move it somewhere permanent (for example, /usr/local).
  3. Put the “bin” subdirectory in your path.

The commands that you would need, assuming apache-ant-1.8.1-bin.tar.gz (replace 1.8.1 with the actual version) were still in your Downloads directory, would be the following (explanatory comments included):

cd ~/Downloads # Let's get into your downloads folder.
tar -xvzf apache-ant-1.8.1-bin.tar.gz # Extract the archive (creates apache-ant-1.8.1/)
sudo mkdir -p /usr/local # Ensure that /usr/local exists
sudo cp -rf apache-ant-1.8.1 /usr/local/apache-ant # Copy the extracted folder into /usr/local
# Add the new version of Ant to current terminal session
export PATH=/usr/local/apache-ant/bin:"$PATH"
# Add the new version of Ant to future terminal sessions
echo 'export PATH=/usr/local/apache-ant/bin:"$PATH"' >> ~/.profile
# Verify new version of ant
ant -version

Create a folder on your Desktop (I named mine crawler), then go into the crawler directory and clone the Nutch repository:

mkdir -p ~/Desktop/crawler
cd ~/Desktop/crawler
git clone https://github.com/apache/nutch.git

Change into the cloned nutch directory and use ant to compile Nutch:

cd nutch
ant clean runtime

Check that the build succeeded by running the following command:

runtime/local/bin/nutch

Output will look like this:

nutch 1.17-SNAPSHOT
Usage: nutch COMMAND
where COMMAND is one of:
readdb              read / dump crawl db
mergedb             merge crawldb-s, with optional filtering
readlinkdb          read / dump link db
inject              inject new urls into the database
generate            generate new segments to fetch from crawl db
freegen             generate new segments to fetch from text files
fetch               fetch a segment's pages
parse               parse a segment's pages
readseg             read / dump segment data
mergesegs           merge several segments, with optional filtering and slicing
updatedb            update crawl db from segments after fetching
invertlinks         create a linkdb from parsed segments
mergelinkdb         merge linkdb-s, with optional filtering
index               run the plugin-based indexer on parsed segments and linkdb

Next, we need to configure the Nutch parameters in crawler/nutch/runtime/local/conf/nutch-site.xml before starting the crawler. I only added the parameters needed for this crawl; they go inside the file's <configuration> element.

(Here is the full list of parameters/properties that can be added: https://github.com/ravishersingh2/CU-Denver-Search-Engine-TGIHR-/blob/main/nutch-site1.xml)

<property>
<name>http.agent.name</name>
<value>SICrawler</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, outlinks leading from a page to external hosts or domain
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts or domains, without creating complex URLFilters.
See 'db.ignore.external.links.mode'.
</description>
</property>
<property>
<name>elastic.host</name>
<value>localhost</value>
<description>The hostname to send documents to using TransportClient.
Either host and port must be defined or cluster.
</description>
</property>
<property>
<name>elastic.port</name>
<value>9300</value>
<description>
The port to connect to using TransportClient.
</description>
</property>
<property>
<name>elastic.cluster</name>
<value>elasticsearch</value>
<description>The cluster name to discover. Either host and port must
be defined.
</description>
</property>
<property>
<name>elastic.index</name>
<value>nutch</value>
<description>
The name of the elasticsearch index. Will normally be autocreated if it
doesn't exist.
</description>
</property>
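Optionally, you can sanity-check that the edited file is still well-formed XML (xmllint ships with macOS and most Linux distributions):

xmllint --noout runtime/local/conf/nutch-site.xml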

Next, create a urls folder where we are going to put the list of websites to crawl:

mkdir runtime/local/urls

Next we need to install Elasticsearch and Kibana:

Go to the Elasticsearch website, download the archive for your platform, and extract it into the crawler folder (do the same for Kibana).
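If you prefer the command line, downloading and extracting Elasticsearch might look roughly like this (the version and platform in the URL are assumptions, so adjust them to the release you actually want):

cd ~/Desktop/crawler
# Hypothetical release/platform -- change to match your system
curl -LO https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.4.2-darwin-x86_64.tar.gz
tar -xzf elasticsearch-7.4.2-darwin-x86_64.tar.gz
cd elasticsearch-7.4.2

Once installed, start Elasticsearch and check that it is running: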

bin/elasticsearch

We can test whether it is running at http://localhost:9200:

curl http://localhost:9200

If you get output like the following, Elasticsearch is running:

{
  "name" : "Machine-Name",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "tnuE5mQTS_2e725s3epVdg",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "22e1767283e61a198cb4db791ea66e3f11ab9910",
    "build_date" : "2019-09-27T08:36:48.569419Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Next, start Kibana:

bin/kibana

then open http://localhost:5601 in your browser.
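Optionally, you can also check Kibana from the command line once it has started, using its status API:

curl http://localhost:5601/api/status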

Now we are ready to crawl the websites:

Go to runtime/local/urls and create a new file named seed.txt; this is where you list all the resources/websites/PDFs that you want to crawl. Second, edit regex-urlfilter.txt in runtime/local/conf to specify which URLs Nutch should include or skip during crawling.
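For example, a seed list might look like this (the URLs below are placeholders; use the sites you actually want to crawl):

cat > runtime/local/urls/seed.txt <<'EOF'
https://www.example.com/
https://www.example.org/docs/
EOF

The regex-urlfilter.txt file looks like this: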

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
#-^(?:file|ftp|mailto):

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
# can be leaked by placing links pointing to web interfaces of services
# running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
# or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
# http://localhost:8080
# http://127.0.0.1/ .. http://127.255.255.255/
# http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
# 10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
# 192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# 172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)

# accept anything else
+.
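If you want to restrict the crawl to specific domains instead of accepting everything, you could replace the final +. rule with something like this (a sketch; example.com is a placeholder for your own domain):

# accept only pages under example.com
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.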

Crawling the website

The first step is to read the seed file and create or update the crawldb directory; this is called the injection process.

cd runtime/local
bin/nutch inject crawl/crawldb urls

“The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.”- cwiki.apache.org

Injector: starting at 2019-11-25 13:29:49
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/crawler/nutch/runtime/local/urls/seed.txt
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2019-11-25 13:29:51, elapsed: 00:00:02

The second step is to take the URLs from the crawldb and create a new segment ready for fetching; this is called the generation process.

bin/nutch generate crawl/crawldb crawl/segments

“A set of segments. Each segment is a set of URLs that are fetched as a unit.”- cwiki.apache.org

Generator: starting at 2019-11-25 13:30:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: number of items rejected during selection:
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20191125133101
Generator: finished at 2019-11-25 13:31:02, elapsed: 00:00:03

For easy access, we store the name of the segment in a variable:

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

crawl/segments/20191125133101

The third step is to actually retrieve the content of each URL and store it in the corresponding segment folder:

bin/nutch fetch $s1

Fetcher: starting at 2019-11-25 13:32:25
Fetcher: segment: crawl/segments/20191125133101
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records hit by time limit : 0
FetcherThread 36 Using queue mode : byHost
FetcherThread 40 fetching https://www.ejemplo.com/ (queue crawl delay=5000ms)
FetcherThread 36 Using queue mode : byHost
FetcherThread 41 has no more work available
FetcherThread 41 -finishing thread FetcherThread, activeThreads=1
robots.txt whitelist not configured.
FetcherThread 36 Using queue mode : byHost
FetcherThread 42 has no more work available
FetcherThread 42 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
FetcherThread 43 has no more work available
FetcherThread 43 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
FetcherThread 44 has no more work available
FetcherThread 44 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
FetcherThread 45 has no more work available
FetcherThread 45 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
FetcherThread 46 has no more work available
FetcherThread 46 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
FetcherThread 47 has no more work available
FetcherThread 47 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
FetcherThread 48 has no more work available
FetcherThread 48 -finishing thread FetcherThread, activeThreads=1
FetcherThread 36 Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
FetcherThread 49 has no more work available
FetcherThread 49 -finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
FetcherThread 40 has no more work available
FetcherThread 40 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2019-11-25 13:32:29, elapsed: 00:00:03

The fourth step is to parse the content we just retrieved, generating more useful information and giving us more URLs to fetch:

bin/nutch parse $s1

ParseSegment: starting at 2019-11-25 13:33:19
ParseSegment: segment: crawl/segments/20191125133101
Parsed (189ms): https://www.example.com/
ParseSegment: finished at 2019-11-25 13:33:21, elapsed: 00:00:01

Finally, we update crawldb with all the above information.

bin/nutch updatedb crawl/crawldb $s1

CrawlDb update: starting at 2019-11-25 13:33:56
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20191125133101]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2019-11-25 13:33:57, elapsed: 00:00:01

The following steps are only needed if you decide to keep crawling later on, for example to add more websites or follow newly discovered URLs:

We can repeat the whole process, this time taking the newly discovered URLs into account and creating a new segment with the top 1000 of them. We can keep iterating as long as we like; in this case we did three rounds.

Here the limit is 1000 URLs per segment; it can be increased as required.

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

Generator: starting at 2019-11-25 13:35:15
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: running in local mode, generating exactly one partition.
Generator: number of items rejected during selection:
Generator: 1 SCHEDULE_REJECTED
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20191125133518
Generator: finished at 2019-11-25 13:35:19, elapsed: 00:00:03

Next, store the name of the new segment and run the second round:

s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

crawl/segments/20191125133518

bin/nutch fetch $s2

bin/nutch parse $s2

bin/nutch updatedb crawl/crawldb $s2

Then generate and process a third segment in the same way:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3

bin/nutch parse $s3

bin/nutch updatedb crawl/crawldb $s3
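If you plan to run many rounds, the same cycle can be scripted as a loop (a rough sketch; the number of rounds and the -topN value are arbitrary):

# Run three generate/fetch/parse/updatedb rounds in a row
for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment
done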

Now we update the linkdb:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

“The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.”- cwiki.apache.org

LinkDb: starting at 2019-11-25 16:40:30
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/crawler/nutch/runtime/local/crawl/segments/20191125133518
LinkDb: adding segment: file:/crawler/nutch/runtime/local/crawl/segments/20191125133101
LinkDb: adding segment: file:/crawler/nutch/runtime/local/crawl/segments/20191125134502
LinkDb: finished at 2019-11-25 16:40:33, elapsed: 00:00:02
Finally, we are ready to index all the content we just crawled with Nutch into Elasticsearch with the following commands (make sure the Elasticsearch service is running):

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1 -filter -normalize -deleteGone
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s2 -filter -normalize -deleteGone
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s3 -filter -normalize -deleteGone

“The index command takes the content from one or multiple segments and passes it to all enabled IndexWriter plugins which send the documents to Solr, Elasticsearch, and various other index back-ends.” — cwiki.apache.org
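To confirm that documents actually reached Elasticsearch, you can also query the index directly (the index name nutch comes from the elastic.index property we set earlier):

curl 'http://localhost:9200/nutch/_count?pretty'
curl 'http://localhost:9200/nutch/_search?size=1&pretty'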

A couple of things to remember.

When you are parsing, some websites may have pages larger than the default content limit. You will get a message like "Content of X was truncated to 65536".

To fix this error:

In Nutch, open nutch/conf/nutch-default.xml (or, better, override the property in nutch-site.xml), find the property named file.content.limit, and change the value to whatever size limit you need. For pages fetched over HTTP, the analogous property is http.content.limit:

<property>
  <name>file.content.limit</name>
  <value>1048576</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is non-negative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

Second, you may want to limit the number of results returned when making a request to the server. This is what I found in the Elasticsearch documentation, but I did not try it:

The following expert settings can be set to manage global search and aggregation limits.

indices.query.bool.max_clause_count

(Static, integer) Maximum number of clauses a Lucene BooleanQuery can contain. Defaults to 1024.

This setting limits the number of clauses a Lucene BooleanQuery can have. The default of 1024 is quite high and should normally be sufficient. This limit does not only affect Elasticsearch's bool query, but many other queries are rewritten to Lucene's BooleanQuery internally. The limit is in place to prevent searches from becoming too large and taking up too much CPU and memory. In case you're considering increasing this setting, make sure you've exhausted all other options to avoid having to do this. Higher values can lead to performance degradations and memory issues, especially in clusters with a high load or few resources.

search.max_buckets

(Dynamic, integer) Maximum number of aggregation buckets allowed in a single response. Defaults to 65,535.

Requests that attempt to return more than this limit will return an error.

indices.query.bool.max_nested_depth

(Static, integer) Maximum nested depth of bool queries. Defaults to 20.

This setting limits the nesting depth of bool queries. Deep nesting of boolean queries may lead to stack overflow.
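For example, since search.max_buckets is a dynamic setting, it could be raised at runtime through the cluster settings API (untested here; the value is arbitrary):

curl -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "search.max_buckets": 100000 } }'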

Step 3: Indexing with Elasticsearch

At this point, we can check Elasticsearch for our newly created nutch index just by going to the management gear icon in the bottom-left corner of Kibana (http://localhost:5601) and then selecting Index Management under Elasticsearch.
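If you prefer the command line to the Kibana UI, the same check can be done with the cat indices API:

curl 'http://localhost:9200/_cat/indices/nutch?v'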

To interact with our new index inside Kibana, we first need to create an index pattern, so we select Index Patterns under Kibana.

“Index patterns allow you to bucket disparate data sources together so their shared fields may be queried in Kibana.” — kibana

We create a new index pattern nutch*

We can specify a timestamp field.

Here we see which fields are searchable and aggregatable from our nutch index.

Now we can click on the Discover (compass) icon to search the documents indexed in Elasticsearch; you will see all the fields of the index.

Now we can run simple queries, for example title: "Spider-Man", and we get back all the results where the title contains Spider-Man.
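The same query can also be sent straight to Elasticsearch; title is one of the fields written by the index-basic plugin, and the search term here is just an example:

curl 'http://localhost:9200/nutch/_search?q=title:%22Spider-Man%22&pretty'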
