Abstract:
This paper describes a formalized
procedure for exploring a site using webometrics methods. The
procedure involves gathering details on a site's structure,
constructing and exploring the resulting webgraph, defining the
correctness criterion, identifying control actions that would
improve the structure under the given criterion, testing the
correctness criterion on real-world examples and developing
recommendations on improving the structure. PageRank is used to
evaluate web pages, and a page is considered valuable if the
homepage of the site links to it. Under the correctness criterion,
the valuable pages of a site should have the highest PageRank among
all pages of that site. The control action consists of removing
non-valuable directories whose root page has a high PageRank and
transforming them into independent sites. Experiments are
conducted on three faculty sites of major universities in the USA,
Russia and Nigeria. The approach is shown to be applicable and
reasonable in all cases.
Keywords: website, graph, PageRank, universities, data mining, website structure, web harvesting, web mining, URL.
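
The correctness criterion and control action summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy webgraph `site`, the function names `pagerank`, `violations` and `detach_directory`, the damping factor of 0.85, the exclusion of the homepage from the comparison, and the modelling of "transforming a directory into an independent site" as deleting its pages and inbound links are all assumptions made for illustration.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank on an adjacency-list webgraph."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def violations(graph: Dict[str, List[str]], home: str,
               rank: Dict[str, float]) -> List[str]:
    """Non-valuable pages that outrank at least one valuable page.

    A page counts as valuable if the homepage links to it. The homepage
    itself is left out of the comparison (an assumption of this sketch).
    """
    valuable = set(graph.get(home, [])) - {home}
    if not valuable:
        return []
    threshold = min(rank[p] for p in valuable)
    return sorted(p for p in rank
                  if p != home and p not in valuable and rank[p] > threshold)


def detach_directory(graph: Dict[str, List[str]],
                     root: str) -> Dict[str, List[str]]:
    """Drop every page under `root` together with all links pointing to it."""
    prefix = root.rstrip("/") + "/"

    def inside(page: str) -> bool:
        return page == root or page.startswith(prefix)

    return {source: [t for t in targets if not inside(t)]
            for source, targets in graph.items() if not inside(source)}


# Hypothetical site: "/archive" is not linked from the homepage,
# yet it collects links from several other pages.
site = {
    "/": ["/about", "/research"],
    "/about": ["/", "/archive"],
    "/research": ["/", "/archive"],
    "/archive": ["/archive/a", "/archive/b"],
    "/archive/a": ["/archive"],
    "/archive/b": ["/archive"],
}

print(violations(site, "/", pagerank(site)))
fixed = detach_directory(site, "/archive")
print(violations(fixed, "/", pagerank(fixed)))
```

In this toy graph the pages under `/archive` outrank the pages linked from the homepage, so the criterion flags them. After the directory is detached, no violations remain.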