Document delineation (document segmentation) and character sequence decoding
Determine the vocabulary of terms
Tokenization
Dropping common terms: stop words
Normalization
Faster postings list intersection via skip pointers
Positional postings and phrase queries.
Biword indexes
Handles most phrase queries and gives good results for them.
Still cannot handle long phrase queries, for example those with more than three query terms.
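A minimal sketch of the biword idea, assuming whitespace tokenization and toy documents (the names `build_biword_index` and `phrase_query` are illustrative, not from the source). Every pair of consecutive terms becomes one vocabulary entry, and a longer phrase is answered by intersecting the postings of its biwords. The intersection only yields candidates: a document can contain every biword without containing the whole phrase, which is one reason long phrase queries remain a problem.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map each pair of consecutive terms to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = text.lower().split()
        for first, second in zip(terms, terms[1:]):
            index[(first, second)].add(doc_id)
    return index

def phrase_query(index, phrase):
    """Return candidate doc IDs for a phrase of two or more terms."""
    terms = phrase.lower().split()
    biwords = list(zip(terms, terms[1:]))
    if not biwords:
        raise ValueError("phrase needs at least two terms")
    result = None
    for biword in biwords:
        postings = index.get(biword, set())
        result = set(postings) if result is None else result & postings
    return result

# Toy collection (hypothetical data).
docs = {
    1: "stanford university palo alto",
    2: "university palo and palo alto stanford university",
    3: "alto university",
}
index = build_biword_index(docs)
```

With this data, the four-term query "stanford university palo alto" returns documents 1 and 2, although only document 1 contains the phrase. Document 2 is a false positive that a positional index would filter out.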
Positional indexes
Combination schemes
References and further reading
Dictionaries and tolerant retrieval
Index construction
Hardware basics
Blocked sort-based indexing
Single-pass in-memory indexing
Dynamic indexing
A large main index and a small auxiliary index (in memory)
Invalidation bit vector.
A deletion just sets the invalidation bit.
Documents are updated by deleting and reinserting them.
Merge the auxiliary index into the main index.
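The scheme above can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the class name `DynamicIndex`, the `aux_limit` merge threshold, and the use of a Python list of booleans in place of a real bit vector are all hypothetical choices, and the "main" index would live on disk in a real system.

```python
from collections import defaultdict

class DynamicIndex:
    def __init__(self, aux_limit=2):
        self.main = defaultdict(set)  # large main index (on disk in practice)
        self.aux = defaultdict(set)   # small auxiliary index for new documents
        self.invalid = []             # invalid[doc_id] is True once deleted
        self.aux_docs = 0             # documents currently held in aux
        self.aux_limit = aux_limit    # assumed threshold that triggers a merge

    def add(self, text):
        """Index a new document in the auxiliary index and return its ID."""
        doc_id = len(self.invalid)
        self.invalid.append(False)
        for term in text.lower().split():
            self.aux[term].add(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.aux_limit:
            self.merge()
        return doc_id

    def delete(self, doc_id):
        """A deletion only sets the invalidation bit; postings stay in place."""
        self.invalid[doc_id] = True

    def update(self, doc_id, text):
        """An update is a delete followed by a reinsert under a new ID."""
        self.delete(doc_id)
        return self.add(text)

    def search(self, term):
        """Query both indexes, then filter out invalidated documents."""
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(d for d in hits if not self.invalid[d])

    def merge(self):
        """Fold the auxiliary index into the main index and empty it."""
        for term, postings in self.aux.items():
            self.main[term] |= postings
        self.aux.clear()
        self.aux_docs = 0
```

In this sketch invalidated postings are kept in the main index and filtered at query time. Physically dropping them during the merge is a natural refinement.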
Other types of indexes
References and further reading
Index compression
Statistical properties of terms in information retrieval
Dictionary compression
Postings file compression
References and further reading
Scoring, term weighting and the vector space model
(assigning a score to a (query, document) pair)
Parametric and zone indexes
(can be understood as attributes and labels;
a zone is arbitrary, unbounded free text, such as the title or the body)
Index and retrieve documents by metadata such as the language in which a document is written.
Weighted zone scoring
(ranked boolean retrieval)
Gives a simple means for scoring documents in response to a query.
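A minimal sketch of weighted zone scoring, assuming the usual formulation in which each zone has a weight, the weights sum to 1, and a document's score is the sum of the weights of the zones that match the query. The zone names, the weight values, and the function name `weighted_zone_score` are illustrative assumptions; a single query term and whitespace tokenization keep the example short.

```python
def weighted_zone_score(doc, query_term, weights):
    """Sum the weights of all zones whose text contains the query term."""
    score = 0.0
    for zone, weight in weights.items():
        zone_terms = doc.get(zone, "").lower().split()
        if query_term.lower() in zone_terms:
            score += weight
    return score

# Hypothetical zone weights; they are chosen to sum to 1.
weights = {"title": 0.5, "author": 0.2, "body": 0.3}

doc = {
    "title": "Hamlet",
    "author": "William Shakespeare",
    "body": "to be or not to be",
}
```

Here a match in the title alone scores 0.5 and a match in the author zone alone scores 0.2, so documents can be ranked even though each zone test is Boolean.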
Term frequency and weighting
Collection frequency
Document frequency
Inverse document frequency
idf = log(N / df), where N is the number of documents in the collection and df is the document frequency of the term
Tf-idf weighting (tf is the term frequency within the document)
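A small worked example of these quantities, using the standard definition idf = log(N / df), where df is the number of documents that contain the term, and tf-idf = tf × idf. The base-10 logarithm, the toy documents, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency: log10(N / df)."""
    n = len(docs)
    df = sum(1 for terms in docs if term in terms)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log10(n / df)

def tf_idf(term, doc_terms, docs):
    """Term frequency within one document times the collection-wide idf."""
    tf = Counter(doc_terms)[term]
    return tf * idf(term, docs)

# Toy collection of four tokenized documents (hypothetical data).
docs = [
    "car insurance car".split(),
    "auto insurance".split(),
    "best car".split(),
    "best auto insurance".split(),
]
```

In this collection "car" occurs in 2 of 4 documents, so its idf is log10(2), and its tf-idf weight in the first document (tf = 2) is twice that. "insurance" occurs in 3 of 4 documents and gets the lower idf log10(4/3).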
Topic
The vector space model for scoring
Variant tf-idf functions
References and further reading
Computing scores in a complete search system
Efficient scoring and ranking
Components of an information retrieval system
Vector space scoring and query operator interaction
References and further reading
Evaluation in information retrieval
Information retrieval system evaluation
Standard test collections
Evaluation of unranked retrieval sets
Evaluation of ranked retrieval results
Assessing relevance
A broader perspective: System quality and user utility
Results snippets
References and further reading
Relevance feedback and query expansion
Relevance feedback and pseudo relevance feedback
Global methods for query reformulation
References and further reading
XML retrieval
Basic XML concepts
Challenges in XML retrieval
A vector space model for XML retrieval
Evaluation of XML retrieval
Text-centric vs. data-centric XML retrieval
References and further reading
Probabilistic information retrieval
Review of basic probability theory
The Probability Ranking Principle
The Binary Independence Model
An appraisal and some extensions
References and further reading
Language models for information retrieval
Language models
The query likelihood model
Language modeling versus other approaches in IR
Extended language modeling approaches
References and further reading
Text classification and Naive Bayes
Vector space classification
Support vector machines and machine learning on documents
Flat clustering
Clustering in information retrieval
Problem statement
Evaluation of clustering
K-means
Model-based clustering
References and further reading
Exercises
Hierarchical clustering
Hierarchical agglomerative clustering
Single-link and complete-link clustering
Group-average agglomerative clustering
Centroid clustering
Optimality of HAC
Divisive clustering
Cluster labeling
Implementation notes
References and further reading
Exercises
Matrix decompositions and latent semantic indexing
Linear algebra review
Term-document matrices and singular value decompositions
Low-rank approximations
Latent semantic indexing
References and further reading
Web search basics
Background and history
Web characteristics
Advertising as the economic model
The search user experience
Index size and estimation
Near duplicates and shingling
Search engines try to avoid indexing multiple copies of the same content
to keep down storage and processing overheads.
fingerprint
articles
Near duplication detection
Information fingerprints and their applications (信息指纹及其应用)
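A minimal sketch of shingling with fingerprints, assuming word-level 4-shingles, whitespace tokenization, and a 64-bit fingerprint taken from a SHA-1 digest (the hash choice, the value of `k`, and the function names are illustrative assumptions). Two documents are compared by the Jaccard coefficient of their fingerprint sets; a value near 1 suggests near duplicates.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of k-word shingles of a text."""
    terms = text.lower().split()
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def fingerprint(shingle):
    """Reduce a shingle to a 64-bit integer fingerprint."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fingerprints(text, k=4):
    """Fingerprint every shingle of a text."""
    return {fingerprint(s) for s in shingles(text, k)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
similarity = jaccard(fingerprints(d1), fingerprints(d2))
```

The first text has three distinct 4-shingles and the second has four, three of which are shared, so the coefficient is 3/4. Storing fingerprints instead of the shingles themselves keeps the per-document footprint small, which matches the stated goal of keeping down storage and processing overheads.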
Web crawling and indexes
Overview
Crawling
Crawler architecture
DNS resolution
DNS resolution is a well-known bottleneck in web crawling.
The lookup implementations in standard libraries are generally synchronous.
This means that once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed.
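One common way around a blocking lookup is to hand it to a pool of worker threads, so the caller receives a handle immediately and carries on. The sketch below assumes this approach; the class name `AsyncResolver`, the pluggable `lookup` parameter, and the per-host cache of pending results are illustrative choices, not something the source prescribes. `socket.gethostbyname` is the standard library's synchronous lookup.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

def blocking_lookup(host):
    """Synchronous lookup from the standard library (blocks the caller)."""
    return socket.gethostbyname(host)

class AsyncResolver:
    def __init__(self, lookup=blocking_lookup, workers=8):
        self._lookup = lookup
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache = {}               # host -> Future holding the address
        self._lock = threading.Lock()  # protects the cache across threads

    def submit(self, host):
        """Start (or reuse) a lookup and return a Future without blocking."""
        with self._lock:
            future = self._cache.get(host)
            if future is None:
                future = self._pool.submit(self._lookup, host)
                self._cache[host] = future
            return future

    def close(self):
        """Wait for pending lookups and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A crawler thread would call `submit(host)` for every newly discovered host and ask for `future.result()` only when it is about to fetch, by which time the answer is often already available. Passing a different `lookup` function makes the class usable without network access.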
Url frontier
In essence, this is traversing the web graph.
basic rules
High quality pages that change frequently should be prioritized for crawling.
Politeness: the crawler must avoid repeated fetch requests to a host within a short time.
urldb, used to store URL-related information.
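The two basic rules can be combined in a toy frontier: a priority queue decides which URL is most worth fetching, and a per-host timestamp enforces politeness. This is a simplified sketch under stated assumptions; the class name `Frontier`, the fixed `delay` between requests to one host, and the numeric priorities are hypothetical, and the current time is passed in explicitly so the behavior is deterministic.

```python
import heapq
from urllib.parse import urlsplit

class Frontier:
    def __init__(self, delay=2.0):
        self.delay = delay   # assumed minimum gap between fetches per host
        self._heap = []      # entries: (-priority, insertion order, url)
        self._next_ok = {}   # host -> earliest time of its next fetch
        self._count = 0

    def add(self, url, priority):
        """Queue a URL; a higher priority is served first."""
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def next_url(self, now):
        """Return the best URL whose host may be contacted at time `now`."""
        skipped, chosen = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).hostname
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.delay
                chosen = item[2]
                break
            skipped.append(item)  # host contacted too recently
        for item in skipped:      # put the postponed URLs back
            heapq.heappush(self._heap, item)
        return chosen
```

A production frontier would typically keep separate queues per host instead of rescanning one heap, but the sketch shows the interaction: a high-priority URL is postponed, not dropped, when its host was contacted too recently.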
Distributing indexes
Connectivity servers
References and further reading
Link analysis
The web as a graph
link
anchor text
The web is full of instances where a page does not provide an accurate description of itself.
The window of text surrounding the anchor text is often usable in the same manner as the anchor text itself.
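A minimal sketch of collecting anchor text together with a window of surrounding words, using the standard library's `html.parser`. The window size of three words on each side, the class and function names, and the sample HTML are illustrative assumptions; an indexer would then credit both strings to the target page.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []     # all visible words, in document order
        self.anchors = []   # (href, start, end) word indices of each link
        self._open = None   # (href, start) of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.anchors.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())

def anchor_contexts(html, window=3):
    """Return (href, anchor text, surrounding window) for every link."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for href, start, end in parser.anchors:
        anchor = " ".join(parser.words[start:end])
        before = parser.words[max(0, start - window):start]
        after = parser.words[end:end + window]
        results.append((href, anchor, " ".join(before + after)))
    return results

# Hypothetical page fragment with an uninformative anchor text.
html = ('For the latest product news from IBM '
        '<a href="http://example.com/ibm">click here</a> or visit a store')
contexts = anchor_contexts(html)
```

In this fragment the anchor text "click here" says nothing about the target, while the surrounding window contains "IBM", which is the situation the note above describes.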