
How Computers Work, Part I
August 2001 • Vol. 5 Issue 3
Page(s) 194-199 in print issue

Searching The Internet
Explore How Search Engines & Web Directories Work
Millions of Web pages are out there, many of which contain the information you’re looking for. But finding exactly what you want isn’t always easy. Fortunately, you can rely on search engines to help you in your quest for information.

Search engines and Web directories are comparable to digital card catalogs, but the organization and presentation of information isn’t a precise system; there are no librarians inserting new sites in alphabetical order. Instead, the process involves digital robots or even humans scouring millions of Web pages and attempting to categorize the information on the page by topic, coupled with software that uses a complex set of criteria and algorithms to try to determine the most relevant sites based on keywords input by the user.

Steve Lawrence, a research scientist at the NEC Research Institute and co-author of “Accessibility of Information on the Web,” a recent study on search engines, offers this view: “The current state of search engines can be compared to a phone book that is updated irregularly, is biased toward listing more popular information, and has most of the pages ripped out.”

But despite their shortcomings, search engines do serve a vital role on the Internet; they’re often the first tools users turn to when looking for specific information. We’ll explore how search engines and Web directories gather and store information and show how search engines arrive at their results.



  Under The Hood. Search engines basically consist of three elements: an automatic site searcher (also called a robot, bot, spider, or crawler), the index, and the software that breaks it all down and presents the search results to the user.
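
To make those three elements concrete, here is a compact sketch in Python. The class, the method names, and the data are purely hypothetical, and a real engine is vastly more elaborate, but the division of labor mirrors the one described above.

# A toy search engine: a gathering step (the bot's job), an index (the
# catalog), and a query step (the software that matches keywords).
class ToySearchEngine:
    def __init__(self):
        self.index = {}  # word -> set of URLs that contain it

    def gather(self, url, page_text):
        # The bot's job: record which words appear on a collected page.
        for word in page_text.lower().split():
            self.index.setdefault(word, set()).add(url)

    def search(self, keyword):
        # The query software's job: look the keyword up in the index.
        return self.index.get(keyword.lower(), set())

engine = ToySearchEngine()
engine.gather("http://example.com/cars", "classic car page")
print(engine.search("classic"))  # {'http://example.com/cars'}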

It’s important to note, however, that although Web directories are often referred to as search engines, Web directories do not feature automatic data gathering. Instead, human beings compile lists of URLs (uniform resource locators; Web addresses) and page data based on sites that have been formally submitted for review. The information in Web directories, such as Yahoo! (http://www.yahoo.com), Looksmart (http://www.Looksmart.com), Excite’s Web Site Guide (http://www.excite.com/guide), and dmoz (Open Directory Project; http://www.dmoz.com), is organized in hierarchical tiers to help guide users to the desired information. Like search engines, many Web directories also include the ability to perform keyword searches of the contents.

Because of the popularity of Yahoo! and other Web directories, many search engines, such as Lycos (http://www.lycos.com) and GO Network (http://www.go.com), also offer information in tiers. Unlike pure Web directories, however, these hybrid directories can offer even more extensive Web coverage through the use of data collecting bots.



  Information Gathering. Robots, or bots, as they’re often called, do the actual dirty work of sifting through millions of Web pages to collect data. These autonomous programs perform their collection duties automatically, “crawling” from Web server to Web server and the sites contained therein, gathering URLs and other information for the search engine to use in keyword searches.

Bots typically begin their gathering process with a predetermined list of URLs. These are usually URLs that contain several links, such as those found on servers or collections of popular or best sites. The bots follow the links to these sites, adding more URLs to the list as they go.
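
As a rough illustration of that process, the Python sketch below walks a small, made-up in-memory link graph instead of live Web servers; the max_depth parameter hints at the difference between shallow and deep crawls discussed next. All URLs here are hypothetical.

from collections import deque

# Hypothetical link graph standing in for live Web servers:
# each URL maps to the URLs its page links to.
LINKS = {
    "http://example.com/start": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": [],
}

def crawl(seed_urls, max_depth=2):
    # Follow links breadth-first from the seed list, collecting every URL
    # found; a larger max_depth approximates a "deeper" crawl.
    found = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for linked in LINKS.get(url, []):
            if linked not in found:
                found.add(linked)
                queue.append((linked, depth + 1))
    return found

print(crawl(["http://example.com/start"]))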

Before some bots can visit a page, though, the site’s creator must submit the site and its linked pages to the search engine. Most engines and Web directories include a link for URL submission on their home pages. However, the time lapse between the site submission and when a bot actually visits and collects the information can be weeks or even months.

Each search engine’s bot uses different criteria and processes for gathering information. While some bots gather only address links, others also collect page titles, and some even collect the entire text on the pages. The amount of information collected is determined by the type of crawl the bot performs.

Bot crawling. Individual robots crawl the Internet for information at different levels. “Deep crawling” is performed by bots that collect information from Web sites that are linked from submitted URLs, whether the linked sites were submitted or not.

Some bots, such as those used by AltaVista (http://www.altavista.com), HotBot (http://hotbot.lycos.com), and Northern Light (http://www.northernlight.com), will perform extremely thorough crawls through linked page paths. Others, such as those used by Google (http://www.google.com) and GO Network, don’t go as deep through link paths when collecting data, returning sooner to a submitted URL to begin another crawl.

Link popularity. Crawls also can be influenced by link popularity, or the number of Web sites that link to a given page. The more popular a Web page is, the more likely it is to be indexed by the bot. Excite (http://www.excite.com), HotBot, and Lycos have used link popularity to determine the routes their bots take. Some criticize this method, however, saying that pages that contain valuable information are often missed and not included in the search engine’s index.
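
One simple way to model link popularity is to count inbound links and visit the most-linked-to pages first. The sketch below does exactly that over a made-up link graph; real engines weigh popularity in far more involved ways.

from collections import Counter

def rank_by_link_popularity(link_graph):
    # Count how many pages link to each URL; the most-linked-to pages
    # are the ones the bot would visit or index first.
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in targets:
            inbound[target] += 1
    return [url for url, count in inbound.most_common()]

# Hypothetical graph: two pages link to /popular, only one links to /niche.
graph = {
    "http://example.com/a": ["http://example.com/popular", "http://example.com/niche"],
    "http://example.com/b": ["http://example.com/popular"],
    "http://example.com/popular": [],
    "http://example.com/niche": [],
}
print(rank_by_link_popularity(graph))  # /popular comes before /niche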



  Index The Data. Once the data has been collected, either by bots or by humans, all the information is placed in the index (or catalog). While humans update directories manually, search engines rely on bot software to record new information and refresh the old data. Once the bot returns home from its Web excursion, it downloads into the engine’s database the titles, URLs, text, and other information it was programmed to collect. The updated information replaces the old data, and new site keywords are placed within the index to be called up in future search results.

Some sites, such as AltaVista, Infoseek, and MSN Search (http://search.msn.com), offer instant indexing services, enabling new or updated pages to be indexed within a couple of days, while other search engines may take weeks or months before updating their search results to display the latest information online.

Most of the major search engines claim they index all of a page’s visible text, but not all store every single word. To prevent unfair techniques used by some Webmasters and to preserve server space, some engines will exclude text that attempts to manipulate ranking results (see the Web Games section for details), while others will leave out stop words, which are words that appear so frequently that to store them would just take up unnecessary server space and slow down the search process. Stop words are rarely used as keywords in searches and include common articles such as a, an, and the.
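
As a rough model, an index can be thought of as a mapping from each word to the set of pages containing it, with stop words skipped along the way. The Python sketch below uses a tiny, hypothetical stop-word list and page set; real engines work from far larger, carefully tuned lists.

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def build_index(pages):
    # pages: URL -> page text. Returns a word -> set-of-URLs index,
    # leaving out stop words to save space.
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            index.setdefault(word, set()).add(url)
    return index

pages = {
    "http://example.com/cars": "the classic car page",
    "http://example.com/dogs": "a page about dogs",
}
index = build_index(pages)
print(index["page"])   # both URLs
print("the" in index)  # False: the stop word was dropped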

After the information is indexed, the sites will appear in search results. However, without a way to access, prioritize, and present ranked findings to the user, the search results would be nothing more than a random listing of millions of URLs. That’s where the search software comes into play.



  Site Sorting Software. Search software uses algorithmic functions and criteria to find keyword matches among the massive compilation of data indexed in search engine databases and Web directories and presents the results in some semblance of relevant order.



Both HotBot and Northern Light have bots that collect information by deep crawling through the Web.

The search software runs in the background. When you head to a search engine’s or Web directory’s site, you are presented with a field in which you can enter keywords for the search you wish to perform. After you enter your keywords and press the Search button, the software combs through the index for occurrences of the keywords. Each engine and directory handles its data differently. Some programs look through all the words on each page, while others only search through URLs or titles. Each handles plural words, misspellings, and truncation in a different way.
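
Against an index shaped like the earlier sketch (each word mapped to the set of URLs containing it), a bare-bones keyword lookup might look like this; the data is again hypothetical.

def lookup(index, query):
    # Return only the URLs whose indexed text contains every query keyword.
    results = None
    for word in query.lower().split():
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return results or set()

index = {
    "classic": {"http://example.com/cars"},
    "car": {"http://example.com/cars", "http://example.com/rentals"},
}
print(lookup(index, "classic car"))  # only the page containing both words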

Relevance determination. What’s important to users, though, is how the information is sorted to produce relevant search results.

“People don’t say, ‘I can’t find anything,’ because they had no listings,” says Danny Sullivan, editor of Search Engine Watch. “Instead they say, ‘I found way too much; help me find what is best.’ ”

To organize results, the earliest search software relied exclusively on default algorithms that broke down and presented the data to the user alphabetically. Without some kind of method for ranking the relevancy of each item in a result list, users would have to look through pages of alphabetical listings and determine for themselves which sites best match the queries.

Software has since been developed to get around this default. Each engine’s and directory’s software is different, and its method of ranking relevancy varies. The software for search engines such as AltaVista, GO Network, and Excite, for example, looks through every word on every page during searches, counting the number of keyword occurrences. The HotBot software gives priority to sites that were accessed frequently in past results, and GO Network and other Web directories give consideration to sites reviewed by employees.
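
Counting keyword occurrences, the first of those methods, is easy to sketch; the pages below are hypothetical, and real ranking formulas blend many more signals.

def rank_by_keyword_count(pages, query):
    # pages: URL -> page text. Sort pages by how often the query terms occur.
    terms = query.lower().split()
    scores = {}
    for url, text in pages.items():
        words = text.lower().split()
        scores[url] = sum(words.count(term) for term in terms)
    return sorted(scores, key=scores.get, reverse=True)

pages = {
    "http://example.com/one": "chevrolet classic chevrolet restoration",
    "http://example.com/two": "classic car restoration",
}
print(rank_by_keyword_count(pages, "chevrolet"))  # /one ranks ahead of /two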

Meta tags also are used in determining relevancy and ranking. Keywords placed inside meta tags are not visible to the Web site viewer but contain words relating to the page’s content that may help the URL rise through the ranks of search results. For example, a Web page for older cars may have words such as “classic,” “car,” “1950,” and “Chevrolet” listed in a meta tag. When a search is conducted for “1957 Chevrolet,” the engine’s software will detect the keyword(s) contained within the keyword meta tag and present that page higher in the list of results.

Description meta tags are used by Webmasters to dictate what a Web page’s description will say in the search result list. If a description meta tag is not included in a Web page, the search results will usually display the first hundred or so words of the page as the site’s description. Some search software also factors description meta tags into relevancy ranking.
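
To give a sense of how an engine might read those tags, the sketch below pulls keyword and description meta tags out of a page with Python’s standard HTML parser; the page source is invented for the example.

from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    # Collects the content of keyword and description meta tags.
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name", "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content", "")

page_source = """<html><head>
<meta name="keywords" content="classic, car, 1950, Chevrolet">
<meta name="description" content="A page about classic Chevrolets.">
</head><body>...</body></html>"""

parser = MetaTagParser()
parser.feed(page_source)
print(parser.meta)  # {'keywords': '...', 'description': '...'}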

While search engines and World Wide Web directories count keywords in their own way to determine relevancy, there are techniques users can employ to help engines narrow down search results and provide the most relevant matches.

The Boolean way. To refine search results, many search engines let users include Boolean operators (words such as AND, OR, and NOT) between keywords. The operators let you quickly reduce the number of results an engine will return by including or excluding particular words from search results.

For example, putting AND between two keywords in a query tells the engine to look for documents that contain both keywords. The more frequently those words are repeated, the higher the page will appear in the result list. The OR operator works in a similar manner; it tells the engine to look for documents containing one word or the other. The NOT operator instructs the engine to return documents that contain the first keyword but not the second.
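
Because an index maps each word to the set of pages containing it, the three operators boil down to simple set operations. The sketch below borrows the Tina Turner example from the Terms To Know sidebar, with made-up page names.

# Each entry lists the (hypothetical) pages containing that word.
index = {
    "tina":      {"page1", "page2", "page3"},
    "turner":    {"page2", "page3", "page4"},
    "australia": {"page3"},
}

tina_and_turner = index["tina"] & index["turner"]       # AND: pages with both words
tina_or_turner  = index["tina"] | index["turner"]       # OR: pages with either word
not_australia   = tina_and_turner - index["australia"]  # NOT: drop pages with "australia"

print(tina_and_turner)  # page2 and page3
print(not_australia)    # page2 only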

Beyond the use of Boolean operators, many engines and directories provide advanced search forms for refined searches. Look for links such as Advanced Search, Search Options, or More Search on engines’ and directories’ home pages.



  Web Games. The goal of any Webmaster is to get his or her site to rank at the top of search results. And, of course, some search engines and directories have realized they can put a price on ranking. The result is that Webmasters resort to trickery to improve rankings, and engines and directories sell the top spots in search results.



Yahoo! was one of the first sites to use people to review Web pages prior to categorizing and activating them.

Some Webmasters have been known to stoop to low levels to enhance their ranking in search results. Some of the tactics include peppering a Web page with keywords disguised in the same color as the background, rendering the keywords invisible to the viewer while remaining detectable to the search engine. Others sprinkle small keyword text at the bottom of each page. To curtail these schemes, many search engines now limit the number of keywords allowed on a page, or program their software to simply shove offending sites to the bottom of the search results, condemning them to virtual oblivion.

However, recognizing the power of the almighty dollar and the importance of top rankings, many engines and directories, such as GoTo, AltaVista, and Yahoo!, charge for better placement in results. Most engines and directories that participate in the practice are open about it, indicating the sites that have paid for higher rankings or putting the fee-based rankings into a separate results list. While Webmasters and search engines vie to one-up each other, search engine technology continues to advance.



  In The Beginning. Since the Internet’s burgeoning years, search engines have helped users to locate data. The need for such a tool was apparent even in the early stages of development.

Early methods of sharing information involved using FTP (File Transfer Protocol), a means of exchanging files between computers. Machines had to be FTP-compliant in order for the data sharing to work. Eventually, servers were set up with the sole purpose of storing these FTP files, but information was slow to spread among users, who notified each other of files on various servers through message boards and word of mouth.

All this changed in 1990 with the birth of Archie. Archie was an engine programmed to scour FTP servers and download their file listings into an index. Users could then search Archie instead of each FTP server for the desired file.

The Veronica engine appeared in 1992 and was designed to enable searches for documents on Gopher servers. Gopher servers organize information into menus; these menus are then itemized according to collections of information and stored databases. Gopher was developed at the University of Minnesota in 1991 and named after the school’s mascot, the Golden Gopher.

These first engines caught on quickly with users, and technological innovations improving upon the basic search engine concept began to appear. The data-collecting bots used by today’s search engines soon debuted, and other engines taking advantage of this technology followed suit, introducing simple, user-friendly interfaces, search result rankings, and regular-expression keyword searches.

Web directories such as Galaxy (http://www.galaxy.com) and Yahoo!, which both appeared in early 1994, took Internet searching in a different direction, using humans instead of bots to compile data. Sites were organized hierarchically into categories that users could browse through to find what they were looking for.

Meanwhile, the WebCrawler (http://www.webcrawler.com) search engine began making waves in April 1994. WebCrawler offered keyword matching of entire Web pages, not just URLs, descriptions, or titles, and provided information from more than 6,000 World Wide Web servers.

Hoping to draw upon the success of WebCrawler, several new Web search engines were launched in the following months, each trying to capitalize on the engine’s popularity by offering twists on its basic search technology.

Lycos relied on its huge database to attract searchers, while Excite featured software that combined basic search features with automatic hypertext linking and subject grouping. This was done in an effort to increase the efficiency of the resulting list.

AltaVista soon stormed into the picture, offering unprecedented search speed and index size. This engine also offered other extras, such as natural language queries (question queries) and Boolean operator search capabilities.

HotBot upped the ante in the quest for the largest database when it appeared in 1996. Its owner, Inktomi, claims HotBot can index 110 million Web documents every three or four days.

But with all those indexed pages floating around among various search engines, a solution was soon needed that could compress the different results into one, easy-to-navigate interface. So, metasearch engines soon began to dot the Information Highway landscape.

In 1995, the first metasearch engine, MetaCrawler (http://www.metacrawler.com), appeared. Metasearch engines scour several individual search engines at once and present the results on a single page.
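
In outline, the metasearch approach looks something like the sketch below; the two engine functions are hypothetical stand-ins for live queries to real services.

def engine_a(query):
    # Stand-in for querying one search engine.
    return ["http://example.com/1", "http://example.com/2"]

def engine_b(query):
    # Stand-in for querying another search engine.
    return ["http://example.com/2", "http://example.com/3"]

def metasearch(query, engines):
    # Query every engine and merge the results, dropping duplicates.
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("classic cars", [engine_a, engine_b]))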

Many of today’s search tools are new and improved versions of the first search engines, which have evolved over the years to mirror the complexity of the Web. In addition, private engines for company databases, personal engines for private Web pages, and specialized searchers for narrowly defined topics have also surfaced.



  What’s In Store. Lawrence’s study, “Accessibility of Information on the Web,” found that Northern Light can catalog only about 16% of the Web. And although the largest databases offer millions of indexed pages, they also tend to contain the most dead links (typically older links to Web sites that are inactive or no longer exist) in search results.

Although freshness of engine data could remain an issue for some time, the shortage of Web coverage could soon be a thing of the past.

Two search engines, Excite@Home and Alltheweb, unveiled technology that enables them to catalog most or all Web sites. Excite@Home officials say their “scalable” search engine architecture, designed specifically to keep pace with the growth of the Internet, will soon have most of the Web cataloged. Officials from Alltheweb expect to catalog the entire Web using improved search algorithms and architectures.

But more search results do not necessarily translate into comprehensive, fresh Web coverage. Lawrence says that what will benefit users most is the opportunity for search engines to pull sources from a larger pool of information and provide better relevancy in search results.

Forecasters say the future will most likely include Internet search engines that specialize in specific areas, such as science, health, and government, and provide more comprehensive, up-to-date coverage than today’s search engines. That objective is closer every day.



  Handy Tools. While search engines may not always be the most up-to-date resources, they are still the best tools for finding information on the Internet. Whether you’re searching through information in a library or online, finding exactly what you’re looking for is going to take some work; you won’t walk into any library to find the best books open to the pages that contain the best information.

As with most technologies, search engines will continue to evolve and improve over time, and they most certainly will continue to play a pivotal role on the Information Superhighway.  

by Lori Robison



Terms To Know

Boolean-(Pronounced BOO-lee-un.) An adjective describing an expression that results in a value of either TRUE or FALSE. Named for mathematician George Boole, the word describes a common system of logic using mathematical expressions. Boolean expressions are used extensively in search engines on the World Wide Web. For example, if users are searching the Web for information on singer Tina Turner, they might type "Tina AND Turner" into the search box. This is a Boolean expression that will retrieve only documents containing both the words Tina and Turner. If the user does not want to read about Turner's Australian tour, the Boolean expression to be entered might be "Tina AND Turner NOT Australia." Documents that meet these criteria would be "true" and all others would be "false."

bot-Abbreviation for robot. Bot usually refers to software that executes some function automatically. Search engines typically use bots to seek out Web sites and record information about the sites for future search purposes.

FTP (File Transfer Protocol)-Standard way to transfer files between computers. The method has built-in error checking. It is frequently used as a way of transferring many types of files over the Internet.

keyword-When using a search function, a keyword is the word the user wants to find in a document or documents. For example, to find all documents about dogs in a folder, a good keyword might be "dog." Some word processing and database programs let the user attach certain keywords to specific documents to make searching faster; rather than searching the entire file, the search program might only look at lists of user-defined keywords for each file.

meta tag-HTML (Hypertext Markup Language) code used to index pages. The tag includes such things as keywords and page descriptions for a Web site.

search engine-A program that lets users locate specified information from a database or mass of data. Search engine sites are extremely popular on the World Wide Web because they let users quickly sift through millions of documents on the Internet. AltaVista (http://www.altavista.com) is one example.

spider-A program that "crawls" across the World Wide Web, automatically collecting Web pages. A spider follows every link on a page and catalogs each page until it comes to a dead end. Then it will start over on a new page. Spiders are used primarily by Web search engines to gather data for the search engine's database. Search engines don't actually search the entire Internet when a user enters a search term. Instead, they look at the database of Web pages collected by their spider. Spiders are also known as crawlers and bots.

URL (uniform resource locator)-Previously known as "universal resource locator." A standardized naming, or "addressing," system for documents and media accessible over the Internet. The URL http://www.smartcomputing.com, for example, includes the type of document (http, Hypertext Transfer Protocol), and the address of the computer on which it can be found (www.smartcomputing.com). FTP (File Transfer Protocol) sites, newsgroups, Gopher pages, and other sites all can be named with URLs.









© Copyright by Sandhills Publishing Company 2001. All rights reserved.