MSCI 720

Monday Jan 9, Lec 3/36

  • Syllabus

  • TREC Agreement

  • Summary-Review

  • Architecture

  • TREC text format

  • Tokenization

Tutorials will be used to help students complete the assignments.

Outcomes

  • Identify, explain, and implement the key components of a search engine

  • Explain the advantages and disadvantages of in-situ, online and offline evaluation methods

  • Implement and compute offline effectiveness measures using a custom or existing test collection

  • Make and justify decisions based on the outcome of experiments

  • Diagnose search quality problems and suggest areas for future engine improvement

Information Representation

Information retrieval (IR) system purpose: To help people satisfy their information needs.
The way we represent documents is the first step towards obtaining a high quality retrieval system. We begin by discussing issues in representing text items that pertain to both manual and automatic representation techniques.

Text Representation

The items in our collection vary in length and structure and may contain non-text items such as images. In all cases, we will refer to the text items in a collection as documents or items, but it is important to remember that there are a large variety of possible text items.

Our focus will be on representations that utilize words or tokens derived from words. The process of deciding which words to use to describe a document is called indexing and the chosen words are called index terms. Sometimes we want to represent documents with more than words and then it makes sense to talk about the use of features, which are more generic than index terms. An example of a non-word feature could be the number of words in a document.

When we automatically index, we write computer algorithms to process digital forms of the documents and make the decisions about index terms.
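As a minimal sketch of this idea, the Python below builds an inverted index from a toy collection; the sample documents, the regex tokenizer, and the lowercasing rule are illustrative assumptions, not a prescribed design.

    # A minimal sketch of automatic indexing: process each document's text and
    # decide its index terms, here by lowercasing and splitting on non-word
    # characters. The toy documents and the tokenizer are illustrative assumptions.
    import re
    from collections import defaultdict

    documents = {
        1: "Information retrieval systems help people find documents.",
        2: "An index term describes the content of a document.",
    }

    def tokenize(text):
        """Lowercase the text and split it on non-alphanumeric characters."""
        return [tok for tok in re.split(r"\W+", text.lower()) if tok]

    # Inverted index: each index term maps to the set of documents containing it.
    inverted_index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            inverted_index[term].add(doc_id)

    print(sorted(inverted_index["index"]))      # [2]
    print(sorted(inverted_index["documents"]))  # [1]  ("documents" and "document" stay distinct without stemming)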

We want to index in a manner that will help provide the best interactive retrieval experience for the user. Defining what makes one retrieval experience better than another is complex and is addressed later.

We simplify our notion of evaluation in this chapter to the act of a single retrieval for a single user query.

Our goal in indexing is twofold. First, we want to assign features to a document that make it easy to find given some similarity measure between the user's query and the document. Second, we want the features to have enough discriminatory power that not all documents look similar to the query. A user's query is not restricted to the keyword queries used with web search engines; a query can be anything that the user formulates given a retrieval technology. However, at some level all queries need to be converted into a form that allows similarity to be measured between the query and the documents in the collection. Such similarity measures are discussed in Chapter 10.

Let’s assume the user’s query is a single index term and that the similarity measure is simple word matching. Given a single index term, simple matching will retrieve all documents that match the index term. For the set of documents retrieved, the user would like them all to be relevant to the user’s information need.
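The following sketch illustrates single-term retrieval by simple word matching; the inverted index here is a made-up example mapping each index term to the ids of the documents that contain it.

    # Simple matching for a single-index-term query: return every document
    # indexed under that term. The index contents are made-up values.
    inverted_index = {
        "retrieval": {1, 3},
        "index": {2},
        "document": {1, 2, 3},
    }

    def retrieve(term, index):
        """Return the ids of all documents indexed under the given term."""
        return sorted(index.get(term.lower(), set()))

    print(retrieve("retrieval", inverted_index))  # [1, 3]
    # Whether these matching documents are actually relevant to the user's
    # information need is the question precision and recall address.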

Precision and Recall

Ideally, the user would want perfect precision and recall.
Precision is the fraction of items found by the user that are relevant.
Recall is the fraction of relevant items that the user is able to find. It is well established that precision and recall are inversely related.
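As a small worked example, precision and recall for a single retrieval can be computed from the set of retrieved documents and the set of relevant documents; the document-id sets below are made-up values for illustration.

    # Precision and recall for one query, using made-up document-id sets.
    retrieved = {1, 2, 3, 4}       # documents the system returned
    relevant = {2, 4, 5, 6, 7}     # documents that actually satisfy the information need

    relevant_retrieved = retrieved & relevant  # relevant documents that were found

    precision = len(relevant_retrieved) / len(retrieved)  # fraction of retrieved items that are relevant
    recall = len(relevant_retrieved) / len(relevant)      # fraction of relevant items that were retrieved

    print(f"precision = {precision:.2f}")  # 0.50
    print(f"recall    = {recall:.2f}")     # 0.40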

Specificity

The degree to which a term is broad or narrow is called its specificity.

Exhaustivity

In addition to specificity, there is the exhaustivity of indexing. The more exhaustive the indexing, the more index terms are used for each document.

Manual Indexing

    Original author: tony
    Original URL: https://segmentfault.com/a/1190000008064734