A search engine is a web-based tool that enables users to locate information on the World Wide Web. Popular examples of search engines are Google, Yahoo, and MSN Search. The information gathered by automated programs known as spiders, or web crawlers, is used to create a searchable index of the Web.
How Search Engines Work Step by Step
A web crawler, or spider, is a type of bot that is typically operated by search engines such as Google and Bing. Its purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results. This is how search engines discover what has been published on the World Wide Web.
How does a web crawler work –
A web crawler copies webpages so that they can be processed later by the search engine, which indexes the downloaded pages. This allows users of the search engine to find webpages quickly. The web crawler also validates links and HTML code, and sometimes it extracts other information from the website.
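To make this concrete, here is a minimal sketch of that fetch-parse-follow loop, written in Python using only the standard library. The seed URL, page limit, and the LinkExtractor helper are illustrative assumptions, not the workings of any particular search engine.

```python
# A minimal sketch of a crawler's fetch-parse-follow loop (standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Fetch pages breadth-first, store copies, and follow discovered links."""
    queue, seen, copies = deque([seed_url]), {seed_url}, {}
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                       # skip unreachable or malformed pages
        copies[url] = html                 # the stored copy is what gets indexed later
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links against the page URL
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return copies
```

A real crawler adds politeness rules (robots.txt, rate limiting), deduplication, and distributed storage, but the fetch, copy, extract-links, and enqueue steps are the same.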
Web crawler examples –
Focused crawlers, for example, concentrate on topical, content-relevant websites when indexing. Web analytics tools use crawlers or spiders to collect data on page views and on incoming or outbound links. Crawlers also supply data to information hubs such as news sites.
Once a spider has crawled a web page, the copy that is made is returned to the search engine and stored in a data center.
Types of indexing
A primary index is an ordered file of fixed-length records, each with two fields: a search-key value and a pointer to the corresponding data block. Primary indexing is further divided into two types –
a – Dense Index
In a dense index, there is an index record for every search-key value in the data file. This makes searching faster but requires more space to store the index records themselves. Each index record contains a search-key value and a pointer to the actual record on disk.
b – Sparse Index
In a sparse index, index records are created for only some of the search-key values, typically one per data block. To find a record, the entry with the largest search-key value less than or equal to the target is located, and the block it points to is then scanned sequentially. This keeps the index much smaller than a dense index, at the cost of a slightly slower lookup.
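The contrast between the two is easy to see in a toy example. The sketch below builds a dense index (one entry per key) and a sparse index (one entry per block) over the same small, sorted record list; the records and block size are invented purely for illustration.

```python
# Toy dense vs. sparse indexing over a sorted data file split into fixed-size blocks.
records = [(101, "alice"), (205, "bob"), (309, "carol"), (412, "dave")]
BLOCK_SIZE = 2

# Dense index: one entry per search-key value, pointing straight at the record.
dense_index = {key: pos for pos, (key, _) in enumerate(records)}

# Sparse index: one entry per block (the first key in each block).
sparse_index = {records[i][0]: i // BLOCK_SIZE
                for i in range(0, len(records), BLOCK_SIZE)}

def lookup_sparse(key):
    """Find the block whose anchor key is <= key, then scan that block sequentially."""
    block = 0
    for anchor in sorted(sparse_index):
        if anchor <= key:
            block = sparse_index[anchor]
    start = block * BLOCK_SIZE
    for k, value in records[start:start + BLOCK_SIZE]:
        if k == key:
            return value
    return None

print(dense_index)          # {101: 0, 205: 1, 309: 2, 412: 3}
print(lookup_sparse(309))   # 'carol'
```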
A secondary index, put simply, is a way to efficiently access records in a database by means of some piece of information other than the usual (primary) key. Secondary indexes can be created and maintained manually by the application; the costs of doing so are added complexity, extra storage, and the work of keeping the index consistent with the data.
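As a rough illustration, an application-maintained secondary index can be as simple as a second mapping from a non-key field back to primary keys. The users table and the city field below are hypothetical.

```python
# Records keyed by a primary id, plus a secondary index on a non-key field.
users = {
    1: {"name": "alice", "city": "london"},
    2: {"name": "bob",   "city": "paris"},
    3: {"name": "carol", "city": "london"},
}

# Secondary index: city -> list of primary keys. The application must keep this
# mapping in sync whenever a record is inserted, updated, or deleted.
by_city = {}
for user_id, record in users.items():
    by_city.setdefault(record["city"], []).append(user_id)

print(by_city["london"])   # [1, 3]
```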
Google’s algorithm does the work for you by searching out web pages that contain the keywords you used to search, then assigning a rank to each page based on several factors, including how many times the keywords appear on the page.
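A deliberately oversimplified sketch of that idea: score each page by how often the query keywords occur in its text, then sort. Real ranking combines many more signals; the pages and query below are invented.

```python
# Rank pages by raw keyword frequency (a toy stand-in for real relevance scoring).
def keyword_score(text, keywords):
    words = text.lower().split()
    return sum(words.count(k.lower()) for k in keywords)

pages = {
    "page_a": "web crawlers index the web so search engines can rank pages",
    "page_b": "search engines rank pages by relevance and rank signals",
}
query = ["rank", "pages"]
ranked = sorted(pages, key=lambda p: keyword_score(pages[p], query), reverse=True)
print(ranked)   # ['page_b', 'page_a'] — 'rank' appears twice in page_b
```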
PageRank (PR) is an algorithm used by Google Search to rank web pages in its search results. PageRank was named after Larry Page, one of the founders of Google.
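The core of the basic PageRank formula, PR(p) = (1 − d)/N + d · Σ PR(q)/outdegree(q) over pages q that link to p, can be approximated with a few lines of power iteration. The sketch below uses the usual damping factor d = 0.85 and a made-up three-page link graph; it assumes every page has at least one outgoing link.

```python
# Power-iteration sketch of basic PageRank on a tiny, invented link graph.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}          # start with a uniform distribution
    for _ in range(iterations):
        new_ranks = {}
        for p in pages:
            # Sum the rank flowing in from every page q that links to p.
            incoming = sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            new_ranks[p] = (1 - d) / n + d * incoming
        ranks = new_ranks
    return ranks

# a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))   # c ends up with the highest rank in this graph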
The fastest searching algorithm
Binary search is faster than linear search except for small arrays. However, the array must be sorted first to be able to apply binary search. There are specialized data structures designed for fast searching, such as hash tables, that can be searched more efficiently than binary search.
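The sketch below contrasts the two approaches on a small sorted list; note that the data must be sorted before binary search can be applied.

```python
# Linear vs. binary search on a sorted list.
def linear_search(items, target):
    """O(n): check every element in turn."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(items, target):
    """O(log n): repeatedly halve the sorted search range."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

data = sorted([42, 7, 19, 88, 3, 56])   # binary search requires sorted input
print(linear_search(data, 56), binary_search(data, 56))   # both print 4
```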
Most common search algorithms
- Linear Search.
- Binary Search.
- Jump Search.
- Interpolation Search.
- Exponential Search.
- Sublist Search.
- Fibonacci Search.
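As a sample of one of the less familiar entries in this list, here is a sketch of jump search on a sorted list: it steps ahead in blocks of roughly √n, then scans the candidate block linearly. The example data is invented.

```python
# Jump search: block-wise jumps over a sorted list, then a short linear scan.
import math

def jump_search(items, target):
    n = len(items)
    if n == 0:
        return -1
    step = int(math.sqrt(n)) or 1
    prev = 0
    # Jump ahead block by block until the block's last element reaches the target.
    while prev < n and items[min(prev + step, n) - 1] < target:
        prev += step
    # Linear scan inside the candidate block.
    for i in range(prev, min(prev + step, n)):
        if items[i] == target:
            return i
    return -1

data = [3, 7, 19, 42, 56, 88, 91]
print(jump_search(data, 56))   # 4
```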