Crawler

Liberate your web content with our crawler

Help users find your content easily by using a customizable and hosted web crawler to catalog and store your site's web pages.

Start building for free
Get a demo
Algolia website crawler

How our website crawler works

A site crawler tool that uncovers all your content, no matter where it's stored

Unify your site search experience and eliminate data silos with a crawler that can collect data no matter where it's stored.

Provide your users with great site search

Is your website content siloed in separate systems and managed by different teams? The first step in providing a high-quality site search experience is implementing a first-rate crawling process.

Our web spider can save your company time and lower your expenses by eliminating the need to build data pipelines between each of your content repositories and your site search software, as well as the project management that entails.

A site search crawler you can program to accurately understand website content and ensure it's all available to searchers.

Turn your site into structured content

You can tell our website crawler exactly how to operate so that it accurately interprets your content. For example, in addition to standard web pages, you can ensure that it lets users search for and navigate news articles, job postings, and financial reports, including information that's in documents, PDFs, HTML, and JavaScript.

Unlock the efficiency of content crawling and extraction without the need for meta tags.

You don't need to add meta tags

You can have your content extracted without first adding meta tags to your site. Our web crawler doesn't rely on custom metadata. Instead, it provides your technical team with an easy-to-use editor for defining which content you want to extract and how to structure it.
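To make this concrete, here is a minimal sketch of what an extraction rule defined in that editor might look like, written in the JavaScript/TypeScript style the editor uses. The index name, URL pattern, and CSS selectors are hypothetical, and the exact configuration fields are described in the Crawler documentation.

```typescript
// Illustrative extraction rule: turn each news page into one structured record.
// The index name, URL pattern, and selectors below are placeholders.
const newsAction = {
  indexName: "news_articles",
  pathsToMatch: ["https://www.example.com/news/**"],
  recordExtractor: ({ url, $ }: { url: URL; $: any }) => {
    // `$` is a Cheerio-like handle on the fetched page's DOM.
    return [
      {
        objectID: url.href,
        title: $("h1").first().text().trim(),
        summary: $("meta[name='description']").attr("content") ?? "",
        publishedAt: $("time").attr("datetime") ?? null,
      },
    ];
  },
};
```

Each object returned by the extractor becomes one searchable record, which is what turns a free-form page into structured content.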

A site crawler that incorporates business analytics data into its extraction process.

Enrich your content to make it more relevant

To enhance search-result relevance for your users, you can enrich your extracted content with business web data, including data from Google Analytics and Adobe Analytics. With Algolia Crawler, you can use data about visitor behavior and page performance to adjust your search engine rankings, attach categories to your content to power advanced navigation, and more.
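As a rough illustration of what enrichment means in practice, the sketch below merges a page-view count (exported from an analytics tool) into each crawled record so ranking can favor popular pages. The record shape, function, and data source here are hypothetical, not the Crawler's actual API.

```typescript
// Hypothetical enrichment step: merge an analytics metric into each record
// before indexing, so search ranking can take page popularity into account.
type CrawledRecord = { objectID: string; title: string; pageviews?: number };

function enrichWithAnalytics(
  records: CrawledRecord[],
  pageviewsByUrl: Map<string, number> // e.g. exported from Google Analytics
): CrawledRecord[] {
  return records.map((record) => ({
    ...record,
    pageviews: pageviewsByUrl.get(record.objectID) ?? 0,
  }));
}
```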

Configure your crawling as needed

Customize and automate crawling sessions.

Schedule automatic crawling sessions

You can configure our site crawler tool to crawl your web data on a set schedule, such as every night at 9 p.m., with a recrawl at noon the next day.
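In the crawler configuration this is typically a one-line setting. The sketch below shows the idea; the schedule expression and other fields are illustrative, and the exact syntax accepted is described in the Crawler documentation.

```typescript
// Illustrative configuration with a recurring nightly crawl.
// The schedule string is an example expression, not guaranteed syntax.
const crawlerConfig = {
  startUrls: ["https://www.example.com"],
  schedule: "every day at 9:00 pm",
  // ...extraction actions, index settings, etc.
};
```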

Manually trigger a website crawl.

Manually set up a crawl

If necessary, you can manually trigger crawling of a particular section of your website, or even the whole thing.
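As an illustration only, a manual recrawl of a handful of URLs might be triggered over HTTP along these lines. The host, endpoint path, credentials, and payload shape below are assumptions, so check the Crawler API reference for the real routes.

```typescript
// Hypothetical example of triggering a crawl of specific URLs via HTTP.
// Host, path, credentials, and payload are placeholders.
async function recrawlUrls(crawlerId: string, urls: string[]) {
  const response = await fetch(
    `https://crawler.example.com/api/1/crawlers/${crawlerId}/urls/crawl`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Basic " + btoa("CRAWLER_USER_ID:CRAWLER_API_KEY"),
      },
      body: JSON.stringify({ urls }),
    }
  );
  if (!response.ok) throw new Error(`Crawl request failed: ${response.status}`);
  return response.json();
}
```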

Tell your web crawler what parts of your website need to be crawled.

Tell it where to go

You can define which parts of your site, or which web pages, you want crawled (or avoided) by our web spider, or you can let it automatically crawl everywhere.
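A minimal sketch of how that scoping is usually expressed: include patterns for the sections you want crawled and exclusion patterns for everything to skip. The field names and URL patterns below are illustrative; the real configuration keys are listed in the Crawler documentation.

```typescript
// Illustrative scoping rules: crawl the blog and docs sections,
// skip search-result and account pages. Patterns are placeholders.
const scope = {
  startUrls: ["https://www.example.com"],
  pathsToMatch: [
    "https://www.example.com/blog/**",
    "https://www.example.com/docs/**",
  ],
  exclusionPatterns: [
    "https://www.example.com/**/search?*",
    "https://www.example.com/account/**",
  ],
};
```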

Enable your web crawler to index login-protected URLs.

Give permission

Configure our crawler to explore and index login-protected pages.
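For example, a login step can be described as a request the crawler performs before fetching protected pages. The sketch below is a rough outline under that assumption; the option names, URL, and credentials are placeholders, and real credentials should come from a secret store rather than a configuration file.

```typescript
// Hypothetical login configuration: submit a login form once, then crawl
// protected pages with the resulting session. All values are placeholders.
const loginConfig = {
  login: {
    fetchRequest: {
      url: "https://www.example.com/login",
      requestOptions: {
        method: "POST",
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: "username=crawler-bot&password=SECRET_FROM_VAULT",
      },
    },
  },
};
```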

Keep your searchable content up to date

Real-time website crawl data at your fingertips.

URL Inspector

On the Inspector tab, you can see and inspect all your crawled URLs, noting whether each crawl succeeded, when it was completed, and the records that were generated.

Monitor the successes or failures of your latest site crawls.

Monitoring

On the Monitoring tab, you can view the details on the latest crawl, plus sort your crawled URLs by status (success, ignored, failed).

Analyze crawl data to determine the accuracy and quality of crawl performance.

Data Analysis

On the Data Analysis tab, you can assess the quality of your web-crawler-generated index and see whether any records are missing attributes.

Get detailed reports on crawl paths, extracted records, and errors encountered.

Path Explorer

On the Path Explorer tab, you can see which paths the crawler has explored and, for each path, how many URLs were crawled, how many records were extracted, and how many errors were encountered during the crawling process.

The most advanced companies experiment every day with the crawler

Legalzoom
We realized that search should be a core competence of the LegalZoom enterprise, and we see Algolia as a revenue generating product.

Mrinal Murari

Tools team lead & senior software engineer @ LegalZoom
Read their story

Recommended content

What is a web crawler?

A web crawler is a bot—a software program—that systematically visits a website, or sites, and catalogs the data it finds.

30 days to improve our Crawler performance by 50%

How we reworked the internals of our app crawler, looked for bottlenecks, and streamlined tasks to optimize this complex piece of parallel and distributed software.

Algolia Crawler

An overview of what the Algolia Crawler can do for your website.

See more

Website Crawler FAQ

  • A web crawler (or "web spider") is a bot (a software program) that gathers and indexes web data so it can be made available to people using a search engine to find information.

    A website crawler achieves this by visiting a website (or multiple sites), downloading web pages, and diligently following links on sites to discover newly created content. The site crawler tool catalogs the information it discovers in a searchable index.

    There are several types of website crawlers. Some find and index data across the entire Internet (the global system of interlinked web pages known as the World Wide Web). Large-scale, well-known web crawlers include Googlebot, Bingbot (for Microsoft Bing's search engine), Baidu Spider (China), and Yandex (Russia). In addition, many smaller and lesser-known web crawlers focus their crawling on certain types of web data, such as images, videos, or email.

  • A database crawler is a specific type of web crawler that parses and catalogs information stored in tables in a database. Once this information is cataloged, people can find it by using search engines.

    Different types of databases require different configurations in order for the crawler to extract their information intelligently. You specify the type of data and the fields you want crawled and determine a crawling schedule.

    A database crawler treats each row in a table as a separate document, parsing and indexing column values as searchable fields. 

    A database crawler can also be set up to crawl multiple tables by using a plug-in. In a relational database, this allows rows from different tables that share key fields to be joined and treated as one document; when that document is displayed in search results, the data from the joined tables appears as additional fields (a minimal sketch of this row-to-document mapping appears after this FAQ).

  • Like other web content, a website's XML sitemap can be crawled by a web crawler. If a website lists a sitemap URL in its robots.txt file, the sitemap is crawled automatically. However, you can also separately download and crawl the XML sitemap URLs with a tool such as Screaming Frog.

    To convert a sitemap file into a format that a program like Screaming Frog can crawl, you import the file into Microsoft Excel and copy the URLs to a text file.

    If a sitemap has any "dirt" in it, that is, if it references outdated pages that return error response codes (such as 404), redirects, or application errors, the data a crawler turns up, indexes, and makes available to search engines can be error prone. This is why it makes sense to spend the effort needed to crawl a sitemap and then correct any issues.

    How do you know if your sitemap is dirty? In Google Webmaster Tools, the "Sitemaps" section shows you both the number of pages submitted in the sitemap and the number of pages indexed. The ratio should be roughly 1 to 1. If the number of indexed pages is low relative to the number of submitted pages, there could be errors with the URLs in the sitemap.

  • The goal of a web crawler software program (a.k.a. "web spider") is to explore web pages, discover and fetch data, and index it so that it can be accessed by people using a search engine. A website crawler completes this mission by systematically examining a website (or multiple sites), downloading its web pages, and following its links to identify new content. The site crawler tool then catalogs the information it uncovers in a searchable index for quick retrieval.

  • Web crawling is the process of having a software program (a "bot") systematically explore websites and index the data it finds, making that data easy for people to locate with a search engine.

    Web scraping, a slightly different form of gathering web data, involves collecting (downloading) specific types of information, for instance, pricing data.

    In ecommerce, both of these types of data gathering are especially valuable because the data collected and analyzed can lead marketers to data-based decisions that boost sales.

    Marketers can, for instance, compare data about products sold on other sites with data about the same products they're selling.

    If they find out that shoppers are routinely entering certain keywords in a search engine to locate a given product, they might decide to add those words to the product description to attract potential buyers to the product listing.

    Consumers typically want the best deals, and they can easily search for the lowest prices on the web. If a company sees that a competitor has a lower price on a product they offer, they can lower their own price to ensure that prospective customers won't choose the competitor's due solely to a lower cost. 

    By gathering product review and ranking data, marketers and businesspeople can uncover information about flaws in their own and competitors' products.

    They can also use crawler technology to monitor product reviews and rankings so that they can swiftly respond when people post negative comments, thereby improving their customer service.

    They can find out which products are bestsellers and potentially identify hot new markets.

    All of this revenue-impacting activity makes ecommerce an especially important and lucrative domain for web crawling and web scraping.
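To make the database-crawler answer above more concrete, here is a minimal sketch of how rows from two joined tables might be flattened into a single searchable document. The table layout, field names, and types are hypothetical.

```typescript
// Hypothetical mapping from relational rows to one searchable document.
type ProductRow = { id: number; name: string; price: number };
type ReviewRow = { product_id: number; rating: number; comment: string };

function rowsToDocument(product: ProductRow, reviews: ReviewRow[]) {
  return {
    objectID: String(product.id), // each product row becomes one document
    name: product.name,
    price: product.price,
    // Joined rows from the reviews table appear as additional fields.
    reviewCount: reviews.length,
    averageRating:
      reviews.reduce((sum, r) => sum + r.rating, 0) /
      Math.max(reviews.length, 1),
  };
}
```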