Evaluating internet sources

Evaluating internet sources requires a healthy dose of common sense and skepticism about the validity of what they report. From web directories to search engines to metasearch engines, searching the internet can be an overwhelming experience, and in some cases, such as the invisible web and sites that prohibit spiders from indexing them, parts of the internet are simply impenetrable. While search engines continue to develop, becoming more effective and smarter, they are also increasingly plagued with advertisements and with companies paying for the number one spot in a search. Internet users must therefore learn to search effectively. The balance between searching effectively and maintaining a reasonable level of skepticism is the user's best tool for finding information on today's internet.

History of the Internet
In the 19th century, the world was brought together by the telegraph, which allowed the quick dissemination of information, changing the way people thought and the way business was conducted. It was referred to as the "highway of thought" (Standage VIII). Before this, as Tom Standage points out in his book "The Victorian Internet", the fastest way to send information had not changed since the horse was first tamed. The telegraph was eventually replaced by the telephone, but it opened people's minds to a new, faster-paced way of life.

The Internet is a set of networking protocols that allows one computer to connect to and communicate with other computers (Sherman and Price 1). It was developed in the 1960s by the Department of Defense (DoD) to connect universities and laboratories so they could more easily share research and data, increasing productivity and reducing unnecessary duplication. The result, first successful in 1969, was ARPANET, which evolved into the Internet.

There was no easy way to search for files on the early Internet. Searching really meant sending e-mails to people with access to other computers and their files, and waiting for a reply. Gopher changed all this by creating the first menu-style index of files that could be shared on the Internet. At first it was a closed system, available only at the University of Minnesota, and one could not read the files online but had to download them to a home computer. Soon, Gopher servers appeared in other places, and they were subsequently joined together. Archie and, later, Veronica were software programs created to search these collections. With Archie (a play on "archive"), one could search for anonymous FTP files, while Veronica (yes - Archie's girlfriend) could perform a keyword search of Gopher menus. Veronica initiated the practice of Boolean searching: she was the first to allow a user to limit a search by using "AND", "NOT", "OR", or parentheses (Burke 66), as well as the first to allow wild cards, or word truncation (Burke 67). The World Wide Web, a set of software protocols that runs on the Internet, was conceived for the same reasons: to allow users to easily access files (Sherman and Price 1).

In 1989, physicists at CERN in Switzerland wanted a way to share information and data more completely than before. Tim Berners-Lee developed the Hypertext Markup Language, or HTML (the language most websites are written in); together with HTTP, it allowed users to share not just indexes of files but entire documents, which could be linked to each other by hypertext.

Subject Directories
Subject directories (or web directories) list the names and addresses of web pages and work like a telephone book. The earliest ones relied on web page authors to submit their websites to the directory (Sherman and Price 13). Each site was handpicked, usually annotated, and classified by subject; staff at these directories reviewed the listed sites and selected them for relevance and quality. The first web directory was "The Project" (Sherman and Price 12).

Yahoo! is an example of a subject directory. In 1994, Jerry Yang and David Filo created "Jerry's Guide to the Internet", which used spiders to find sites on the web and then grouped them manually into hierarchical lists. "Jerry's Guide" became Yahoo!, an acronym for "Yet Another Hierarchical Officious Oracle" (Sherman and Price 15); the authors did not say which came first, the name or the acronym. Another, more recent web directory, Beaucoup.com, categorizes the sites it lists and reviews some new ones; it is useful when the web user has an idea of what kind of sites he is searching for. When the user wants to search for a word or topic, he would use a search engine.

Search Engines
In 1994, Brian Pinkerton, a graduate student at the University of Washington, created a web crawler to search for cool and unusual pages. He posted it to the Internet with an interactive interface, creating the first search engine (Sherman and Price 14). A search engine is a site that "allow(s) you to find specific documents through keyword and menu choices" (Sawyer). Search engines allow you to search a greater number of sites than directories do. Spiders, or web crawlers, go from link to link, indexing key words from the header of each page as well as words from its text. Some spiders, like AltaVista's, index every word on a site; others index only the 100 most frequently used words (Franklin). When a user types in a request, the engine matches the keywords to the meta tags in the header and to the words indexed from the document. It does not evaluate sites for content or relevance.
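
To make the crawling and indexing just described concrete, here is a minimal sketch in Python, using only standard-library modules. It is an illustration under stated assumptions, not any real engine's code: the seed URL and the ten-page limit are placeholders.

    # Minimal sketch of a spider: follow links from page to page and build an
    # inverted index mapping each word to the pages it appears on.
    # The seed URL and page limit are placeholders for illustration.
    from collections import defaultdict, deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkAndTextParser(HTMLParser):
        """Collects the hyperlinks and the visible text of one HTML page."""
        def __init__(self):
            super().__init__()
            self.links, self.words = [], []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

        def handle_data(self, data):
            self.words.extend(data.lower().split())

    def crawl(seed, max_pages=10):
        """Travel from link to link, indexing the words of every page reached."""
        index, queue, seen = defaultdict(set), deque([seed]), set()
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
            except (OSError, ValueError):
                continue  # unreachable or non-HTTP links are simply skipped
            parser = LinkAndTextParser()
            parser.feed(html)
            for word in parser.words:
                index[word].add(url)
            queue.extend(urljoin(url, link) for link in parser.links)
        return index

    if __name__ == "__main__":
        index = crawl("https://example.com/")  # placeholder seed page
        print(sorted(index)[:20])              # a sample of the indexed words

Note that a page with no inbound links never enters the queue, which is exactly why, as discussed later, such pages stay invisible to spiders.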

GoTo.com (now "Overture") has an unusual way of ranking sites: not by relevance or popularity, but by payment. It opened in 1998 with the concept that whoever paid the most would be the first company listed in search returns (Weiss). AltaVista also gives top ranking to paying sites, but notes which sites have paid for their rank (Weiss). More recently, search engines have been striving to deliver more user-friendly results. At the 12th International World Wide Web Conference in Budapest, breakthroughs in searching technology were showcased, among them improved techniques for specifying the type of data one needs. TimeSearcher, a new search engine, will allow results to "be confined to data created or changed on specific dates" (Delio). According to Wired News's Michelle Delio, we can also "(e)xpect, for example, to be able to sift through search results geographically, or to personalize Google results" (Delio). Sorting results geographically will be a great advantage in future web searching, especially for someone searching for a florist online. Search engines are always changing: they add new pages even as they go back to update the sites already indexed, and improvements to the way spiders crawl the web let them recognize a greater variety of pages. AlltheWeb.com was the first search engine to index Macromedia Flash files; it now also indexes Word and PDF files (Lackie).

Search engines and directories continue to improve and expand, and can now access more documents than ever. With these improvements, there are fewer and fewer differences between search engines and directories. Many search engines have even added directories to their sites. Despite the improvements in search techniques, search engines cannot keep pace with the growing number of sites on the Internet.

Google vs. Metasearch Engines
Google is the search engine with the largest number of indexed sites, around 3 billion pages (Hardy 100). It has become the standard against which other search engines are measured. It runs on over 10,000 networked computers and can handle seven million queries an hour (Hardy 100). Google has the following capabilities:
* It rates pages by how popular they are and tells the user how many hits his request had; the more visitors a page has, the higher it appears in the list of returned hits.
* It tells the user how long it took to compile the list.
* It corrects spelling errors.
* It can translate to almost every language (including Klingon).
* It serves as a reverse directory and a street map, and it can search just for images.
* It offers a directory of online catalogs.
* It indexes every word on a page except the articles ("a", "an" and "the") (Franklin); a minimal sketch of this kind of indexing appears just after this list.
* It can recognize and index pages that are not HTML. Google recognizes PDF, PostScript, Excel, PowerPoint and Rich Text documents, so it can offer more returns per request. Not only does Google recognize PDF documents, it also converts them into HTML (Lackie).
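
Here is the indexing sketch referred to above: a toy full-text index that records every word on a page except the articles. The sample pages are invented for illustration and are not drawn from any real engine.

    # A toy full-text index that, like the behavior described above, records
    # every word on a page except the articles "a", "an", and "the".
    # The sample pages below are made up for illustration.
    from collections import defaultdict

    ARTICLES = {"a", "an", "the"}

    def build_index(pages):
        """pages: {url: text}. Returns {word: set of urls}, skipping the articles."""
        index = defaultdict(set)
        for url, text in pages.items():
            for raw in text.lower().split():
                word = raw.strip(".,;:!?\"'()")
                if word and word not in ARTICLES:
                    index[word].add(url)
        return index

    pages = {
        "http://example.com/films":  "A guide to the best French films",
        "http://example.com/travel": "The travel page for France and Canada",
    }
    index = build_index(pages)
    print(index["french"])   # {'http://example.com/films'}
    print("the" in index)    # False: articles are never indexed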

Google is renowned for its relevancy and its simplicity. However, as some of Google's competitors approach the same level of relevancy, it will have to continue to make itself "smarter", learning what the user tends to want and providing similar responses each time. One alternative is the metasearch engine. As Arnaud Fischer asks, "Why wouldn't the most relevant results from several of the best engines not be more relevant than the results of a single-, even the best-, crawler-based engine?" (Fischer). Metasearch engines, however, do not store or index pages of their own; they rely on the resources gathered and indexed by the search engines they query. Many metasearch engines, like Dogpile.com and Info.com, have reputations for being more advertisement-driven than regular search engines (Fischer), which is often not what the user wants. Weiss says of metasearch engines, "Such tools are parasitic, in that they share none of the database and indexing development overhead, and instead take away advertising revenues from the search tools they use" (Weiss). For example, when "Portland, Oregon" was plugged into Info.com, the first 15 sites that came up were hotels, airlines or apartments, followed only then by the city's homepage. On Google, though, the city's homepage was first on the results list, next to Yahoo!'s city map.
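
The merging step a metasearch engine performs can be sketched very simply. The two engine functions below are hypothetical stand-ins, not real APIs; a real metasearch site would forward the query to the engines' own interfaces and combine whatever comes back.

    # Sketch of the metasearch idea: keep no index of your own, just merge and
    # de-duplicate the results returned by other engines.
    # search_engine_a and search_engine_b are hypothetical stand-ins, not real APIs.
    def search_engine_a(query):
        return ["http://example.org/portland", "http://example.org/hotels"]

    def search_engine_b(query):
        return ["http://example.org/hotels", "http://example.org/city-map"]

    def metasearch(query):
        """Merge results from several engines, dropping duplicates but keeping order."""
        merged, seen = [], set()
        for engine in (search_engine_a, search_engine_b):
            for url in engine(query):
                if url not in seen:
                    seen.add(url)
                    merged.append(url)
        return merged

    print(metasearch("Portland, Oregon"))
    # ['http://example.org/portland', 'http://example.org/hotels', 'http://example.org/city-map']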

The Invisible Web
The "Invisible Web" (or "Hidden Web", or "Deep Web", or Dark Matter) and "How to Search It") contains information that spiders or web crawlers cannot or will not access, so are usually excluded from general purpose search engines and directories. The sites are not accessible to the web crawlers for a variety of reasons. Search engine technology, while improving, is limited. Web crawlers work by traveling from one hypertext link to another. If there are no links to that page, or if the links are broken, the web crawlers cannot find them. However, the web designers can still submit their URL to individual directories, or search engines.

Another reason web crawlers miss certain sites is the formatting of the pages. Search engines can index text documents and can locate most image, audio, and video files, but many cannot read PDF, Flash, Shockwave, .EXE, PostScript or .ZIP files, because these formats do not contain HTML text.

Web crawlers are not good with databases either; databases are effectively incomprehensible to them (Sherman and Price 59). Databases are commonly found on library, university, business, and association web sites. The Educator's Reference Desk is the world's largest educational database (Sherman and Price 83). It is filled with archived journals, citations, and archives from education and library electronic mailing lists. Because of its format, however, it is invisible to web crawlers and is usually missing from standard search engine results.

Websites can also deliberately block spiders or web crawlers from accessing them. Designers can do this by placing exclusion instructions in the meta tags in the head of a page (or in a site-wide robots.txt file), or by requiring a password before the site can be accessed.
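
As an illustration of the first technique, here is a minimal sketch of how a polite crawler can check a site's exclusion rules before fetching a page, using Python's standard urllib.robotparser module; the site URL and user-agent name are placeholders.

    # A well-behaved crawler consults the site's robots rules before downloading.
    # The site below is a placeholder. A single page can also opt out with a meta
    # tag in its head, e.g. <meta name="robots" content="noindex, nofollow">.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()                                     # fetch and parse the rules

    for url in ("https://example.com/", "https://example.com/private/report.html"):
        if rp.can_fetch("MyCrawler", url):        # "MyCrawler" is a made-up user agent
            print("allowed:", url)
        else:
            print("blocked:", url)                # the page stays invisible to this spider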

Better Searching
When a user types "movies" into Google, a googol of different answers will come up. There are ways of narrowing the search so that only the relevant sites are selected. The first step is to evaluate what the user really wants, or, as Mary Ellen Bates puts it, "Who cares?" (Bates). The user should first try to state his request as a clear question, and then select the keywords in that question as the keywords for his entry. Words such as "the", "and", "I" and "to" should be omitted, and "movies", "French" and "foreign" entered. The sites that come up will be much more relevant. Entering synonyms of the keywords will add more sites to the search; having chosen "movies", for example, the user could add "films" or "motion pictures" to his entry, and the search engine would bring up yet another list. Ways have also been developed to improve search results through the intervention of human search assistants (human search engines) who can understand synonyms and abstract expressions.

Boolean search terms can be used to further refine the search. The user can enter "+", "-", "AND", "NOT" and "OR" to limit the number of hits. If he uses "AND" or "+", the search engine will bring up only those sites which contain all of the joined words. Some search engines (Google, for example) use the Boolean "AND" by default; thus, if you enter "Movies Foreign French", the search engine will bring up only those pages that contain all three words. If he wants French movies that do not feature Gerard Depardieu, he would limit his search with a Boolean "NOT": entering "French AND foreign AND movies NOT Depardieu" tells the search engine to weed out all pages mentioning Depardieu. If he wants films in French but does not care whether they were made in Canada or France, he would use the Boolean "OR" and enter "French foreign films OR France OR Canada." Most search engines also have an "Advanced Search" feature which helps the user refine his search even further; if something specific like a review or history is needed, this feature would be the most appropriate one to use.
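
The set logic behind these operators can be shown with a toy index. The pages and words below are invented for illustration and simply reuse the examples from the paragraph above.

    # Boolean operators as set operations over a toy index (word -> pages).
    # The index below is made up for illustration.
    index = {
        "french":    {"pageA", "pageB", "pageC"},
        "movies":    {"pageA", "pageB", "pageD"},
        "foreign":   {"pageA", "pageC", "pageD"},
        "depardieu": {"pageB"},
        "canada":    {"pageC"},
    }

    def pages(word):
        return index.get(word, set())

    # "French AND foreign AND movies": only pages containing all three words.
    all_three = pages("french") & pages("foreign") & pages("movies")
    print(all_three)                         # {'pageA'}

    # "... NOT Depardieu": drop the pages that mention Depardieu.
    print(all_three - pages("depardieu"))    # {'pageA'}

    # "French OR Canada": pages containing either word.
    print(pages("french") | pages("canada"))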

When a user finds a site that he thinks he might use, he should bookmark it under "Favorites", creating a folder for his search. Just because the user has found the site once does not mean he will find it again, so bookmarking is a great way to save time in relocating a site. Another option is to save the entire page to his hard drive and delete it when he is finished with it. The advantage to this is that he does not have to be online to read the document and does not tie up his phone line (for those who still use dial-up connections rather than broadband).

There are sites specifically set up to help people choose a search engine. Debbie Abilock's guide is a well-known favorite (Sherman). Even typing "choose search engine" into Google will bring up many sites that guide users through their searches. These sites help identify what kind of information the user needs and suggest search engines that will provide it.

Evaluating a Web Site
Here are some things for the user to consider when evaluating the materials in a site:
* How did you find the site? Was it the 100th site listed by the search engine? Was it recommended by a colleague, a friend, or a journal you are familiar with? Was it a link from another page? Was it included in the bibliography of another web site?
* Is it related to your topic?
* Is the information accurate? One quick way to check for accuracy is to compare the information on this site with that of other sites on the same subject. A better way is to compare it with information you have found in other sources. Inaccurate information, however, will sometimes appear on multiple websites. If information presented on two different websites has identical wording, the repetition of the information should not be taken as a confirmation of factuality. It is likely that both webpages simply copied the text from the same source. Thus, if the original source provided incorrect information, then all websites that copy that information are just as incorrect.
* Is the site someone's home page or pet project? Or is it affiliated with an organization such as a university, a business, or a government agency? Is there a way to contact the author or the organization by email or even snail mail? Is there a link to the organization's home page so that you can find out more about them?
* Does the organization have an agenda or a bias? Does it offer a variety of viewpoints?
* What is the purpose of the site? Is it for entertainment, or news, or editorial? Is it an advertisement?
* Is there detailed information on the page? Is it an in-depth discussion on the topic, or is it a "My dog, Fido" project? Or is it primarily a list of links to other sites?
* Is the information easy to read? Are the grammar and spelling correct? Are there maps, or graphs, or charts?
* How old is the page? Is there a date at the bottom of the page? Some sites will have the date, showing when the page was posted, and the date of the most recent update.
 