Computer Architecture and VLSI Systems Division Operating Systems and HPCN
Resource Discovery on the WWW
Information retrieval on the Web is like searching for a needle in a haystack: one needs the right tools to separate the needle from the hay. In this project we develop tools that help users identify and track the information they are interested in.
When a user wants to find information about a specific topic, he/she sends a query to a search engine (e.g. Alta-Vista), which replies with several URLs. Every time the user looks for new information about the same topic, Alta-Vista returns the same URLs, flooding the user with unnecessary information. USEwebNET is designed to relieve users from the long waits and the information flood associated with the traditional search model. Specifically, USEwebNET is a network tool with a user-friendly interface that retrieves documents about selected subjects (or updated versions of selected documents) from the Web and presents them to the user, together with information about each document, according to the user's preferences.
On a daily basis, USEwebNET contacts the search engines selected in the user's preferences (currently Yahoo, Alta-Vista and Hot-Bot are supported) and downloads all documents that match the specified keywords and were not downloaded on previous days.
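The daily filtering step can be illustrated with a minimal Python sketch. It is not USEwebNET's actual code: the function name `new_urls`, the shape of `results_by_engine`, and the in-memory `seen` set are all assumptions made for illustration, and the querying of the search engines themselves is left out.

```python
from typing import Dict, Iterable, List, Set


def new_urls(results_by_engine: Dict[str, Iterable[str]], seen: Set[str]) -> List[str]:
    """Return URLs that were not downloaded on earlier days.

    `results_by_engine` maps a (hypothetical) engine name to the URLs it
    returned today. `seen` holds every URL downloaded so far and is updated
    in place, so the next daily run skips today's URLs as well.
    """
    fresh: List[str] = []
    for engine in sorted(results_by_engine):  # fixed order for reproducible output
        for url in results_by_engine[engine]:
            if url not in seen:
                seen.add(url)      # also removes duplicates across engines
                fresh.append(url)
    return fresh
```

Calling `new_urls` twice with the same results yields an empty list the second time, which is the behaviour the paragraph above describes: a document is downloaded once and not again on later days.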
For each user, USEwebNET keeps a database of his/her preferences. These include the search engines from which documents are retrieved, the keywords used for each search, and the time period after which documents that the user has not read are considered no longer valid and are deleted. USEwebNET also keeps track of the documents each user has read. Thus, each time a user accesses USEwebNET, he/she is presented only with new or updated pages.
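One possible shape for such a per-user record is sketched below in Python. The class and field names (`UserPreferences`, `Document`, `expiry_days`) are hypothetical, and the sketch assumes that documents already read are kept when unread ones expire, which the description above does not specify.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class Document:
    url: str
    fetched_on: date
    read: bool = False


@dataclass
class UserPreferences:
    engines: List[str]    # search engines to query
    keywords: List[str]   # keywords used for each search
    expiry_days: int      # unread documents older than this are deleted
    documents: Dict[str, Document] = field(default_factory=dict)

    def purge_expired(self, today: date) -> None:
        """Delete unread documents fetched before the expiry cutoff.

        Assumption: documents the user has read are never purged here.
        """
        cutoff = today - timedelta(days=self.expiry_days)
        self.documents = {
            url: doc
            for url, doc in self.documents.items()
            if doc.read or doc.fetched_on >= cutoff
        }

    def unread(self) -> List[Document]:
        """Documents to present on the user's next visit."""
        return [doc for doc in self.documents.values() if not doc.read]
```

A real deployment would presumably persist this record in a database rather than in memory; the dataclass only shows which pieces of state the paragraph implies.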
In the first mode, interested users supply PaperFinder with a few keywords that describe their field of interest, like ``digital libraries'' or ``process scheduling''. Along with these keywords, users specify a number of on-line digital libraries that PaperFinder should search for papers. PaperFinder then queries each digital library for papers matching these keywords. All replies are merged and presented to the user via a USENET-based interface. Once the user views some papers, PaperFinder marks them as ``read'' and does not present them again. Users may also choose to ``save'' or ``delete'' a paper. Thus, users can focus on ``new'' papers that they have not previously seen or processed.
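The merge-and-mark behaviour of this mode might look roughly like the following Python sketch. The names `PaperList` and `Status`, and the use of plain string identifiers for papers, are illustrative assumptions rather than PaperFinder's actual design.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    NEW = "new"
    READ = "read"
    SAVED = "saved"
    DELETED = "deleted"


class PaperList:
    """Tracks, for one user, which papers are still new."""

    def __init__(self) -> None:
        self.status: Dict[str, Status] = {}

    def merge(self, replies: Iterable[Iterable[str]]) -> None:
        """Merge replies from several digital libraries.

        A paper returned by more than one library is stored once, and a
        paper the user has already processed keeps its earlier status.
        """
        for reply in replies:
            for paper_id in reply:
                self.status.setdefault(paper_id, Status.NEW)

    def mark(self, paper_id: str, status: Status) -> None:
        self.status[paper_id] = status

    def to_present(self) -> List[str]:
        """Papers the user has not read, saved, or deleted yet."""
        return [p for p, s in self.status.items() if s is Status.NEW]
```

Because `merge` never overwrites an existing status, a paper that comes back from a later query stays hidden once it has been read, saved, or deleted.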
In the resource-discovery mode, PaperFinder sets out to discover papers that may match a user's interest but do not necessarily match some predefined keywords. In this mode, users specify some ``seed papers'' (or ``seed authors''), and PaperFinder searches the digital libraries for papers similar to them. Defining the best similarity metrics is an open and interesting issue. We favor simple metrics that can be easily calculated. For example, papers that have a similar set of references may be close to each other. As another example, papers that have an overlapping set of co-authors, or several common keywords in their title/abstract, may also be similar.
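As one concrete instance of a simple, easily calculated metric, the overlap between two reference lists can be measured with the Jaccard index (size of the intersection divided by size of the union). The text above does not name a specific metric, so the choice of Jaccard, and the function names `jaccard` and `rank_by_references`, are assumptions made for this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard index of two collections: |A & B| / |A | B| (0.0 if both empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_by_references(
    seed_refs: Set[str], candidates: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank candidate papers by reference overlap with a seed paper.

    `candidates` maps a (hypothetical) paper identifier to the set of
    references that paper cites. Ties are broken by identifier.
    """
    scored = [(pid, jaccard(seed_refs, refs)) for pid, refs in candidates.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

The same `jaccard` function applies unchanged to sets of co-authors or to sets of title/abstract keywords, so the other metrics mentioned above could be tried by swapping the input sets.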
|Sponsors - Affiliations|
|GSRT||The USENIX Association||University of Crete|