Crawler

Liberate your web content with our crawler

Help users find your content easily by using a customizable and hosted web crawler to catalog and store your site's web pages.

Start building for free
Get a demo
Algolia website crawler

How our website crawler works

A site crawler tool that uncovers all your content, no matter where it's stored

Unify your site search experience and eliminate data silos with a crawler that can collect data no matter where it's stored.

Provide your users with great site search

Is your website content siloed in separate systems and managed by different teams? The first step in providing a high-quality site search experience is implementing a first-rate crawling process.

Our web spider can save your company time and lower your expenses by eliminating the need to build data pipelines between each of your content repositories and your site search software, as well as the project management that entails.

A site search crawler you can program to accurately understand website content and ensure it's all available to searchers.

Turn your site into structured content

You can tell our website crawler exactly how to operate so that it accurately interprets your content. For example, in addition to standard web pages, you can ensure that it lets users search for and navigate news articles, job postings, and financial reports, including information that's in documents, PDFs, HTML, and JavaScript.

Unlock the efficiency of content crawling and extraction without the need for meta tags.

You don't need to add meta tags

You can have your content extracted without first adding meta tags to your site. Our web crawler doesn't rely on custom metadata. Instead, it provides your technical team with an easy-to-use editor for defining which content you want to extract and how to structure it.
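To make this concrete, here is a minimal sketch of what an extraction rule defined in that editor might look like, written in the JavaScript/TypeScript style the editor uses. The index name, URL pattern, and CSS selectors are hypothetical, and the exact configuration fields are described in the Crawler documentation.

```typescript
// Illustrative extraction rule: turn each news page into one structured record.
// The index name, URL pattern, and selectors below are placeholders.
const newsAction = {
  indexName: "news_articles",
  pathsToMatch: ["https://www.example.com/news/**"],
  recordExtractor: ({ url, $ }: { url: URL; $: any }) => {
    // `$` is a Cheerio-like handle on the fetched page's DOM.
    return [
      {
        objectID: url.href,
        title: $("h1").first().text().trim(),
        summary: $("meta[name='description']").attr("content") ?? "",
        publishedAt: $("time").attr("datetime") ?? null,
      },
    ];
  },
};
```

Each object returned by the extractor becomes one searchable record, which is what turns a free-form page into structured content.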

A site crawler that incorporates business analytics data into its extraction process.

Enrich your content to make it more relevant

To enhance search-result relevance for your users, you can enrich your extracted content with business web data, including data from Google Analytics and Adobe Analytics. With Algolia Crawler, you can use data about visitor behavior and page performance to adjust your search engine rankings, attach categories to your content to power advanced navigation, and more.
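As a rough illustration of what enrichment means in practice, the sketch below merges a page-view count (exported from an analytics tool) into each crawled record so ranking can favor popular pages. The record shape, function, and data source here are hypothetical, not the Crawler's actual API.

```typescript
// Hypothetical enrichment step: merge an analytics metric into each record
// before indexing, so search ranking can take page popularity into account.
type CrawledRecord = { objectID: string; title: string; pageviews?: number };

function enrichWithAnalytics(
  records: CrawledRecord[],
  pageviewsByUrl: Map<string, number> // e.g. exported from Google Analytics
): CrawledRecord[] {
  return records.map((record) => ({
    ...record,
    pageviews: pageviewsByUrl.get(record.objectID) ?? 0,
  }));
}
```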

Configure your crawling as needed

Customize and automate crawling sessions.

Schedule automatic crawling sessions

You can configure our site crawler tool to crawl your web data on a set schedule, such as every night at 9 p.m., with a recrawl at noon the next day.
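In the crawler configuration this is typically a one-line setting. The sketch below shows the idea; the schedule expression and other fields are illustrative, and the exact syntax accepted is described in the Crawler documentation.

```typescript
// Illustrative configuration with a recurring nightly crawl.
// The schedule string is an example expression, not guaranteed syntax.
const crawlerConfig = {
  startUrls: ["https://www.example.com"],
  schedule: "every day at 9:00 pm",
  // ...extraction actions, index settings, etc.
};
```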

Manually trigger a website crawl.

Manually set up a crawl

If necessary, you can manually trigger crawling of a particular section of your website, or even the whole thing.
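As an illustration only, a manual recrawl of a handful of URLs might be triggered over HTTP along these lines. The host, endpoint path, credentials, and payload shape below are assumptions, so check the Crawler API reference for the real routes.

```typescript
// Hypothetical example of triggering a crawl of specific URLs via HTTP.
// Host, path, credentials, and payload are placeholders.
async function recrawlUrls(crawlerId: string, urls: string[]) {
  const response = await fetch(
    `https://crawler.example.com/api/1/crawlers/${crawlerId}/urls/crawl`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Basic " + btoa("CRAWLER_USER_ID:CRAWLER_API_KEY"),
      },
      body: JSON.stringify({ urls }),
    }
  );
  if (!response.ok) throw new Error(`Crawl request failed: ${response.status}`);
  return response.json();
}
```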

Tell your web crawler what parts of your website need to be crawled.

Tell it where to go

You can define which parts of your site, or which web pages, you want crawled (or avoided) by our web spider, or you can let it automatically crawl everywhere.
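A minimal sketch of how that scoping is usually expressed: include patterns for the sections you want crawled and exclusion patterns for everything to skip. The field names and URL patterns below are illustrative; the real configuration keys are listed in the Crawler documentation.

```typescript
// Illustrative scoping rules: crawl the blog and docs sections,
// skip search-result and account pages. Patterns are placeholders.
const scope = {
  startUrls: ["https://www.example.com"],
  pathsToMatch: [
    "https://www.example.com/blog/**",
    "https://www.example.com/docs/**",
  ],
  exclusionPatterns: [
    "https://www.example.com/**/search?*",
    "https://www.example.com/account/**",
  ],
};
```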

Enable your web crawler to index login-protected URLs.

Give permission

Configure our crawler to explore and index login-protected pages.
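For example, a login step can be described as a request the crawler performs before fetching protected pages. The sketch below is a rough outline under that assumption; the option names, URL, and credentials are placeholders, and real credentials should come from a secret store rather than a configuration file.

```typescript
// Hypothetical login configuration: submit a login form once, then crawl
// protected pages with the resulting session. All values are placeholders.
const loginConfig = {
  login: {
    fetchRequest: {
      url: "https://www.example.com/login",
      requestOptions: {
        method: "POST",
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: "username=crawler-bot&password=SECRET_FROM_VAULT",
      },
    },
  },
};
```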

Keep your searchable content up to date

Real-time website crawl data at your fingertips.

URL Inspector

On the Inspector tab, you can see and inspect all your crawled URLs, noting whether each crawl succeeded, when it was completed, and the records that were generated.

Monitor the successes or failures of your latest site crawls.

Monitoring

On the Monitoring tab, you can view the details on the latest crawl, plus sort your crawled URLs by status (success, ignored, failed).

Analyze crawl data to determine the accuracy and quality of crawl performance.

Data Analysis

On the Data Analysis tab, you can assess the quality of your web-crawler-generated index and see whether any records are missing attributes.

Get detailed reports on crawl paths, extracted records, and errors encountered.

Path Explorer

On the Path Explorer tab, you can see which paths the crawler has explored and, for each path, how many URLs were crawled, how many records were extracted, and how many errors were encountered during the crawling process.

The most advanced companies experiment every day with the crawler

Legalzoom
We realized that search should be a core competence of the LegalZoom enterprise, and we see Algolia as a revenue generating product.

Mrinal Murari

Tools team lead & senior software engineer @ LegalZoom
Read their story

Recommended content

What is a web crawler?

A web crawler is a bot—a software program—that systematically visits a website, or sites, and catalogs the data it finds.

30 days to improve our Crawler performance by 50%

How we reworked the internals of our app crawler, looked for bottlenecks, and streamlined tasks to optimize this complex piece of parallel and distributed software.

Algolia Crawler

An overview of what the Algolia Crawler can do for your website.

See more

Website Crawler FAQ

  • A web crawler (or "web spider") is a bot (a software program) that gathers and indexes web data so it can be made available to people using a search engine to find information.

    A website crawler achieves this by visiting a website (or multiple sites), downloading web pages, and diligently following links on sites to discover newly created content. The site crawler tool catalogs the information it discovers in a searchable index.

    There are several types of website crawlers. Some find and index data across the entire Internet (the global system of interlinked web pages known as the World Wide Web). Large-scale, well-known web crawlers include Googlebot, Bingbot (for Microsoft Bing's search engine), Baidu Spider (China), and Yandex (Russia). In addition, many smaller and lesser-known web crawlers focus their crawling on certain types of web data, such as images, videos, or email.

  • A database crawler is a specific type of web crawler that parses and catalogs information stored in tables in a database. Once this information is cataloged, people can find it by using search engines.

    Different types of databases require different configurations in order for the crawler to extract their information intelligently. You specify the type of data and the fields you want crawled and determine a crawling schedule.

    A database crawler treats each row in a table as a separate document, parsing and indexing column values as searchable fields. 

    A database crawler can also be set up to crawl multiple tables by using a plug-in. In a relational database, this allows rows from different tables that share key fields to be joined and treated as one document; when that document is displayed in search results, the data from the joined tables appears as additional fields (a minimal sketch of this row-to-document mapping appears after this FAQ).

  • Like other web content, a website's XML sitemap can be crawled by a web crawler. If a website lists a sitemap URL in its robots.txt file, the sitemap is crawled automatically. However, you can also separately download and crawl the XML sitemap URLs with a tool such as Screaming Frog.

    To convert a sitemap file into a format that a program like Screaming Frog can crawl, you import the file into Microsoft Excel and copy the URLs to a text file.

    If a sitemap has any "dirt" in it, that is, if it references outdated pages that return error response codes (such as 404), redirects, or application errors, the data a crawler turns up, indexes, and makes available to search engines can be error prone. This is why it makes sense to spend the effort needed to crawl a sitemap and then correct any issues.

    How do you know if your sitemap is dirty? In Google Webmaster Tools, the "Sitemaps" section shows you both the number of pages submitted in the sitemap and the number of pages indexed. The ratio should be roughly 1 to 1. If the number of indexed pages is low relative to the number of submitted pages, there could be errors with the URLs in the sitemap.

  • The goal of a web crawler software program (a.k.a. "web spider") is to explore web pages, discover and fetch data, and index it so that it can be accessed by people using a search engine. A website crawler completes this mission by systematically examining a website (or multiple sites), downloading its web pages, and following its links to identify new content. The site crawler tool then catalogs the information it uncovers in a searchable index for quick retrieval.

  • Web crawling is the process of having a software program (a "bot") systematically explore websites and index the data it finds, making that data easy for people to locate with a search engine.

    Web scraping, a slightly different form of gathering web data, involves collecting (downloading) specific types of information, for instance, pricing data.

    In ecommerce, both of these types of data gathering are especially valuable because the data collected and analyzed can lead marketers to data-based decisions that boost sales.

    Marketers can, for instance, compare data about products sold on other sites with data about the same products they're selling.

    If they find out that shoppers are routinely entering certain keywords in a search engine to locate a given product, they might decide to add those words to the product description to attract potential buyers to the product listing.

    Consumers typically want the best deals, and they can easily search for the lowest prices on the web. If a company sees that a competitor has a lower price on a product they offer, they can lower their own price to ensure that prospective customers won't choose the competitor's due solely to a lower cost. 

    By gathering product review and ranking data, marketers and businesspeople can uncover information about flaws in their own and competitors' products.

    They can also use crawler technology to monitor product reviews and rankings so that they can swiftly respond when people post negative comments, thereby improving their customer service.

    They can find out which products are bestsellers and potentially identify hot new markets.

    All of this revenue-impacting activity makes ecommerce an especially important and lucrative domain for web crawling and web scraping.
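To make the database-crawler answer above more concrete, here is a minimal sketch of how rows from two joined tables might be flattened into a single searchable document. The table layout, field names, and types are hypothetical.

```typescript
// Hypothetical mapping from relational rows to one searchable document.
type ProductRow = { id: number; name: string; price: number };
type ReviewRow = { product_id: number; rating: number; comment: string };

function rowsToDocument(product: ProductRow, reviews: ReviewRow[]) {
  return {
    objectID: String(product.id), // each product row becomes one document
    name: product.name,
    price: product.price,
    // Joined rows from the reviews table appear as additional fields.
    reviewCount: reviews.length,
    averageRating:
      reviews.reduce((sum, r) => sum + r.rating, 0) /
      Math.max(reviews.length, 1),
  };
}
```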