Web integrity tool

Igor Maslov
Columbia University
New York, NY 10027
USA
ivm3@columbia.edu

Abstract

The toolset is implemented as a Java application. The program opens a Web page and downloads its content. If the page has references to other Web pages, the program parses the HTML text and creates a list of all links contained in the page. After the list has been created, the program goes through all links, trying to establish a connection with each corresponding server. If a connection with a server cannot be established, the link is treated as invalid, and the program displays the list of invalid links. The user can send e-mail to the author or webmaster to notify them about the invalid links.

Introduction

The program WebLinkChecker (WLC) is designed to provide an easy and effective tool for checking the integrity of Web pages (HTML documents) that contain references (links) to other Web pages. WLC establishes an HTTP connection with the server that holds the required HTML document, downloads the content of the document, and displays it on the screen. The user can request to show all links that the document contains. The program parses the document and extracts all references that appear in the following tags:

    <A HREF>   <a href>   <IMG SRC>   <img src>   <ACTION>    <action>

During parsing, the program analyzes each link and, if it is local, turns it into a global one (using the current document's URL as a prefix). As one type of link, the program extracts the e-mail address (or addresses) of the author of the document and/or the webmaster. All links are displayed on the screen as a tree with the current document as the root and the links as leaves.
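The conversion of a local link into a global one can be illustrated with a short sketch. It is not the WLC source: it assumes a modern Java runtime and uses java.net.URI.resolve, which applies standard relative-reference resolution rather than plain string prefixing. The class name LinkResolver, the method toGlobal, and the example.com addresses are hypothetical.

```java
import java.net.URI;

class LinkResolver {
    /** Resolves a link found in a page against the URL of that page. */
    static String toGlobal(String pageUrl, String link) {
        URI base = URI.create(pageUrl);
        // resolve() returns absolute links and opaque links such as
        // mailto: addresses unchanged; relative links are made absolute.
        return base.resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/docs/index.html";
        System.out.println(toGlobal(page, "images/logo.gif"));
        System.out.println(toGlobal(page, "/top.html"));
        System.out.println(toGlobal(page, "mailto:author@example.com"));
    }
}
```

For a page at http://www.example.com/docs/index.html, the local link images/logo.gif becomes http://www.example.com/docs/images/logo.gif, while absolute links and mailto: addresses are returned unchanged.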

After all links have been displayed, the user can select any HTTP link and request to check its validity. WLC attempts to establish a connection with the server that holds the link. Establishing the connection goes through the following steps:
-- the program checks the validity of the specified URL,
-- opens a connection to the URL,
-- opens an input stream from the URL,
-- reads the content (MIME) type of the document,
-- if it is an HTML document, downloads the content of the document.

If the program fails at any of these stages, it displays the corresponding error message and marks the link as invalid. If the link is not an HTML document, the program considers it valid if the MIME type has been read successfully. For an HTML document, however, the program has to check the content of the returned document. If it contains any of the following patterns:
-- "file not found",
-- "access forbidden",
-- "access denied",
the program marks the link as invalid. The user can request to check all links in one pass. In that case, the program displays the list of all invalid links.
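The decision rule described above can be written as a small pure function. This is a sketch, not the WLC code: the class name ValidityRule and the method isValid are hypothetical, and case-insensitive matching of the three phrases is an assumption, since the text does not say how the comparison is done.

```java
class ValidityRule {
    // Error phrases named in the text; lower-case matching is an assumption.
    static final String[] ERROR_PATTERNS = {
        "file not found", "access forbidden", "access denied"
    };

    /**
     * contentType: MIME type reported by the server, or null if it could
     *              not be read.
     * body:        document text; only consulted for HTML documents.
     */
    static boolean isValid(String contentType, String body) {
        if (contentType == null) {
            return false; // the MIME type could not be read
        }
        if (!contentType.toLowerCase().startsWith("text/html")) {
            return true; // non-HTML: a readable MIME type is enough
        }
        String text = (body == null) ? "" : body.toLowerCase();
        for (String pattern : ERROR_PATTERNS) {
            if (text.contains(pattern)) {
                return false; // the server returned an error page as HTML
            }
        }
        return true;
    }
}
```

The content check matters because a server may answer with an ordinary HTML page that merely describes the error, so a successful download alone does not show that the link is valid.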

If invalid links have been found, WLC offers to send e-mail to the server. The user can choose from the list of e-mail addresses to whom she wants to send the message. The message contains the list of invalid links.

Architecture

The program is implemented as a Java application with JDK 1.1.5 and Swing 1.0.1. The entire program is contained in the public class LinkChecker, which extends the GUI class JFrame. Links are implemented as the class LinkObj, which has the following structure:

    String linkAddr -- URL of the link,
    boolean isValid -- "true" if the link is valid, otherwise "false",
    int servHits -- equal to 0 if e-mail has not yet been sent to the server; used to keep track of servers that were notified about invalid links.
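The structure above might look as follows in Java. Only the three fields come from the text; the constructor and the initial values (a link counts as valid until a check fails) are assumptions made for this sketch.

```java
class LinkObj {
    String linkAddr;   // URL of the link
    boolean isValid;   // true if the link is valid, otherwise false
    int servHits;      // 0 until e-mail about this link has been sent

    // Constructor and defaults are assumptions; the text lists only the fields.
    LinkObj(String linkAddr) {
        this.linkAddr = linkAddr;
        this.isValid = true;
        this.servHits = 0;
    }

    public static void main(String[] args) {
        LinkObj link = new LinkObj("http://www.example.com/index.html");
        link.isValid = false; // e.g. after a failed check
        System.out.println(link.linkAddr + " valid=" + link.isValid
                + " servHits=" + link.servHits);
    }
}
```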

The program is menu-driven and uses the event mechanism of Action Listeners. There are six commands: Open Page, Exit, Show All, Check All, Check Link, and Send Mail, which are grouped into the menus File, Link, and Mail. A corresponding Action Listener is associated with each command.

The command Open Page (menu File) has the class OpenL as its Action Listener. When the user clicks on this command, OpenL asks for the URL of the document and spawns a thread, OpenThread. This thread does the following:
-- checks the validity of the URL,
-- opens a connection to the URL,
-- opens an input stream from the URL,
-- reads the MIME type of the document from the server,
-- reads and displays the content of the HTML document.
If OpenThread fails at any of these stages, it displays the corresponding error message, and the user can repeat the command with corrected input.

The command Exit (menu File) has the class ExitL as its Action Listener. ExitL simply closes the application's window and shuts down the program.

The command Show All (menu Link) has the class ShowL as its Action Listener. This class parses the HTML document and extracts all links that appear in the tags:

    <A HREF>   <a href>   <IMG SRC>   <img src>   <ACTION>    <action>

During parsing, ShowL converts all local links into the corresponding global ones, adding the URL of the current HTML document as a prefix. It also converts DOS-style paths (using \) into HTTP-style paths (using /). As one type of link, the e-mail address (or addresses) is extracted from the document, in case the user wants to notify the owner of the Web site about invalid links. ShowL creates objects of type LinkObj and builds a tree with the current document as the root and the links as its leaves. The expanded tree is displayed on the screen.
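The extraction step can be sketched with a regular expression. This is an illustration only: it relies on java.util.regex, which is not part of JDK 1.1.5, so it cannot be how ShowL itself is written. The class name LinkExtractor and the sample HTML are hypothetical, and the pattern is a simplification that does not handle every form of HTML markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches the value of an HREF, SRC, or ACTION attribute, quoted or
    // unquoted, in upper or lower case.
    private static final Pattern LINK = Pattern.compile(
        "(?i)\\b(?:href|src|action)\\s*=\\s*[\"']?([^\"'\\s>]+)");

    /** Returns the links in document order, with DOS-style \ turned into /. */
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1).replace('\\', '/'));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"page2.html\">next</A> "
                    + "<img src=pics\\logo.gif> "
                    + "<a href='mailto:author@example.com'>mail</a>";
        System.out.println(extract(html));
    }
}
```

E-mail addresses appear in the result as mailto: links, so they can be separated from the other links by their prefix.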

The command Check Link (menu Link) has the class CheckL as its Action Listener. The user selects the link she wants to check for validity and clicks on the command. CheckL spawns a thread, CheckLink, which does the following:
-- checks whether the link is an HTTP link,
-- checks the validity of the URL,
-- opens a connection to the URL,
-- opens an input stream from the URL,
-- reads the content type from the URL,
-- if it is an HTML document, reads the content,
-- parses the document and checks that it does not contain the patterns "file not found", "access forbidden", or "access denied".
If any of these steps fails, the corresponding error message is displayed and the link is marked as invalid.
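The steps listed above can be sketched with the standard java.net classes. This is a minimal sketch under stated assumptions, not the CheckLink source: the class name CheckLinkSketch, the error messages, the timeouts, the ISO-8859-1 decoding, and the choice to treat only the http: scheme as an HTTP link are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;

class CheckLinkSketch {
    /** Returns null if the link passes every step, otherwise an error message. */
    static String check(String link) {
        // Step 1: only HTTP links are checked (scheme test is an assumption).
        if (!link.toLowerCase().startsWith("http:")) {
            return "not an HTTP link";
        }
        // Step 2: validate the URL syntax.
        URL url;
        try {
            url = new URI(link).toURL();
        } catch (URISyntaxException | MalformedURLException
                | IllegalArgumentException e) {
            return "malformed URL: " + e.getMessage();
        }
        try {
            // Steps 3-5: open the connection and the stream, read the MIME type.
            URLConnection conn = url.openConnection();
            conn.setConnectTimeout(5000); // timeouts are an addition
            conn.setReadTimeout(5000);
            try (InputStream in = conn.getInputStream()) {
                String type = conn.getContentType();
                if (type == null) {
                    return "content type could not be read";
                }
                if (!type.toLowerCase().startsWith("text/html")) {
                    return null; // non-HTML: a readable MIME type is enough
                }
                // Steps 6-7: read the HTML and look for the error phrases.
                StringBuilder body = new StringBuilder();
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, "ISO-8859-1"));
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line.toLowerCase()).append(' ');
                }
                String[] patterns =
                        {"file not found", "access forbidden", "access denied"};
                for (String p : patterns) {
                    if (body.indexOf(p) >= 0) {
                        return "page reports: " + p;
                    }
                }
                return null;
            }
        } catch (IOException e) {
            return "connection failed: " + e.getMessage();
        }
    }
}
```

A caller would mark the link as invalid whenever check returns a message, for example after check("http://www.example.com/missing.html").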

The command Check All (menu Link) has the class AllL as its Action Listener. This class spawns a CheckLink thread for each link in the document. All threads run concurrently and produce the list of invalid links, which is displayed on the screen. The user is then asked whether she wants to send e-mail to the Web site to notify the owner about the invalid links. The user can send the e-mail immediately or postpone it and send it later with the command Send Mail. An invalid link is included in the message only if it has not been sent before, that is, if its field servHits is equal to 0. If the link was already sent to the Web site, it is not included in the list. This eliminates duplicate messages to the Web site.
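The one-thread-per-link scheme can be sketched as follows. This is an illustration rather than the AllL source: the class name CheckAllSketch and the method findInvalid are hypothetical, the per-link check is passed in as a stand-in predicate so the example needs no network, and the lambda syntax assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class CheckAllSketch {
    /**
     * Runs one thread per link and returns the links the checker rejected.
     * The checker stands in for the real per-link validity check.
     */
    static List<String> findInvalid(List<String> links, Predicate<String> isValid) {
        // The threads add to the list concurrently, so it must be synchronized.
        List<String> invalid = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (String link : links) {
            Thread t = new Thread(() -> {
                if (!isValid.test(link)) {
                    invalid.add(link);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // wait for every check before reporting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while checking links", e);
            }
        }
        List<String> sorted = new ArrayList<>(invalid);
        Collections.sort(sorted); // completion order is not deterministic
        return sorted;
    }
}
```

Waiting for all threads with join before displaying the list is a design choice of this sketch; the text does not say how the program detects that all checks have finished.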

The command Send Mail (menu Mail) has the class SendL as its Action Listener. SendL compiles the list of invalid links that have not yet been sent to the Web site and sends the e-mail.

The function SendMail prompts the user for her e-mail address and for the e-mail address at the Web site to which she wants to send the message about invalid links. There might be several e-mail addresses in the document, including those of the author and the webmaster. SendMail then uses the "mailto" protocol to send the message. If it succeeds, it sets the servHits field of the links to 1, to prevent duplicate notifications.
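The bookkeeping around servHits can be sketched without any mail transport. The example below only selects the links to report, builds a message body, and marks the links afterwards; the actual sending over "mailto" is left out. The class SendMailSketch, the nested class Link (a stand-in for LinkObj), and the message wording are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SendMailSketch {
    // Minimal stand-in for the LinkObj class described in the text.
    static class Link {
        final String linkAddr;
        final boolean isValid;
        int servHits = 0; // 0 means no e-mail has been sent for this link yet

        Link(String linkAddr, boolean isValid) {
            this.linkAddr = linkAddr;
            this.isValid = isValid;
        }
    }

    /** Returns the invalid links that have not been reported yet. */
    static List<Link> unsent(List<Link> links) {
        List<Link> result = new ArrayList<>();
        for (Link l : links) {
            if (!l.isValid && l.servHits == 0) {
                result.add(l);
            }
        }
        return result;
    }

    /** Builds a plain-text message body with one invalid link per line. */
    static String buildMessage(List<Link> toReport) {
        StringBuilder sb =
                new StringBuilder("The following links appear to be invalid:\n");
        for (Link l : toReport) {
            sb.append(l.linkAddr).append('\n');
        }
        return sb.toString();
    }

    /** Marks links as reported; meant to be called only after a successful send. */
    static void markSent(List<Link> reported) {
        for (Link l : reported) {
            l.servHits = 1;
        }
    }
}
```

Calling markSent only after the message has gone out keeps a failed send from suppressing a later notification, which matches the condition "if it succeeds" in the text.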

Installation instructions

The class LinkChecker has to be installed on a system that has the JDK 1.1.5 and Swing 1.0.1 libraries.
To start the program, the user has to type the command:

    java LinkChecker

References

David Flanagan, Java Examples in a Nutshell, O'Reilly, 1997.
David Flanagan, Java in a Nutshell, O'Reilly, 1997.
Elliotte Rusty Harold, Java Network Programming, O'Reilly, 1997.