The large variety of information available on the Web is highly dynamic.
Web pages are created, modified and eventually deleted by their owners.
A major challenge is therefore to locate the most relevant and
up-to-date information on a rapidly evolving Web. Search engines and news
alerting/publishing services have to cope with this challenge.
The goal of this project is to analyze the dynamic behavior of Web
content to assess how often and to what extent the contents of a site
change. We monitored the MSNBC news Web site for
19 weeks, starting in mid-November 2004. We collected snapshots of the site
every 15 minutes by downloading the pages belonging to five major
news categories, that is, business, entertainment, health, news and
sports. To model the evolution of the site and in particular its
growth, we applied numerical fitting techniques to the daily changes of
the site, that is, the creation of new Web pages and the updates of existing
pages. We also investigated the evolution of the site in terms of
content changes. To this end, we applied techniques typical of
the Information Retrieval domain, such as the cosine distance.
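
As an illustration of how the cosine distance can quantify the content change between two snapshots of the same page, the following is a minimal sketch, not the implementation used in the study. It assumes each snapshot has already been reduced to plain text, represents it as a vector of raw term frequencies, and uses a simple alphanumeric tokenizer; the function names, the tokenizer, and the absence of any term weighting are assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(text_a: str, text_b: str) -> float:
    """Return 1 - cosine similarity of the term-frequency vectors of two texts.

    0.0 means the two snapshots use the same terms in the same proportions;
    1.0 means they share no terms at all.
    """
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)

    # Dot product over the terms that appear in both snapshots.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: two empty snapshots are identical,
        # an empty and a non-empty snapshot are maximally different.
        return 0.0 if norm_a == norm_b else 1.0

    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


if __name__ == "__main__":
    earlier = "Stocks rise as markets open"
    later = "Stocks fall as markets close"
    print(f"cosine distance: {cosine_distance(earlier, later):.3f}")
```

Applied to consecutive snapshots of a page, a distance close to 0 suggests a minor edit, whereas a distance close to 1 suggests that the content was largely replaced.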