Information retrieval

boolean retrieval
the term vocabulary and posting lists
1. Document delineation (splitting text into documents) and character sequence decoding
2. Determine the vocabulary of terms
  1. Tokenization
  2. Dropping common terms: stop words
  3. Normalization
3. Faster postings list intersection via skip pointers
4. Positional postings and phrase queries.
  1. Biword indexes
    1. Handles most phrase queries and gives good results.
    2. Still does not solve the problem of long phrase queries, for example those with more than three query terms.
  2. Positional indexes
  3. Combination schemes
5. References and further reading
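The biword notes above (good for most phrase queries, weak for long ones) can be illustrated with a minimal sketch. The helper names `build_biword_index` and `biword_phrase_query` and the toy documents are assumptions made for this example, not part of the notes. A long phrase is split into overlapping biwords whose postings are intersected, so a document can match every biword without containing the full phrase.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Map each pair of consecutive tokens to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index


def biword_phrase_query(index, phrase):
    """Intersect the postings of every biword in the phrase (hypothetical helper)."""
    tokens = phrase.lower().split()
    biwords = list(zip(tokens, tokens[1:]))
    if not biwords:
        raise ValueError("a phrase query needs at least two terms")
    result = None
    for biword in biwords:
        postings = index.get(biword, set())
        result = set(postings) if result is None else result & postings
    return result


# Toy collection (illustrative only).
docs = {
    1: "stanford university palo alto",
    2: "university palo verde near stanford university",
}
index = build_biword_index(docs)

print(biword_phrase_query(index, "palo alto"))                 # {1}
# Document 2 contains both biwords but not the three-word phrase:
print(biword_phrase_query(index, "stanford university palo"))  # {1, 2}
```

The second query returns document 2 as a false positive, which is the kind of problem that positional indexes or combination schemes are meant to address.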
Dictionaries and tolerant retrieval
index construction
1. Hardware basics
2. Blocked sort-based indexing
3. Single-pass in-memory indexing
4. Dynamic indexing
  1. A large main index and a small auxiliary index (in memory).
  2. Invalidation bit vector. A delete just sets the invalidation bit. Documents are updated by deleting and reinserting them.
  3. Merge the auxiliary index into the main index.
5. Other types of indexes
6. References and further reading
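The dynamic indexing scheme noted above (main index, in-memory auxiliary index, invalidation bits, periodic merge) might be sketched as follows. The class name `DynamicIndex`, the `merge_threshold` parameter, and the use of a Python set in place of a real bit vector are all assumptions made for illustration; a real system would keep the main index on disk.

```python
from collections import defaultdict


class DynamicIndex:
    """Toy main + auxiliary index with an invalidation set (illustrative only)."""

    def __init__(self, merge_threshold=3):
        self.main = defaultdict(set)   # stands in for the large main index
        self.aux = defaultdict(set)    # small in-memory auxiliary index
        self.invalid = set()           # stands in for the invalidation bit vector
        self.aux_docs = 0
        self.next_id = 0
        self.merge_threshold = merge_threshold  # hypothetical merge trigger

    def add(self, text):
        """Index a new document in the auxiliary index and return its id."""
        doc_id = self.next_id
        self.next_id += 1
        for term in set(text.lower().split()):
            self.aux[term].add(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()
        return doc_id

    def delete(self, doc_id):
        """Deleting only marks the document as invalid."""
        self.invalid.add(doc_id)

    def update(self, doc_id, text):
        """Update = delete the old version, then reinsert under a new id."""
        self.delete(doc_id)
        return self.add(text)

    def merge(self):
        """Fold the auxiliary index into the main index, dropping invalid docs."""
        for term, postings in self.aux.items():
            self.main[term] |= postings
        for term in list(self.main):
            self.main[term] -= self.invalid
            if not self.main[term]:
                del self.main[term]
        self.invalid.clear()
        self.aux.clear()
        self.aux_docs = 0

    def search(self, term):
        """Query both indexes and filter out invalidated documents."""
        term = term.lower()
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return hits - self.invalid


idx = DynamicIndex(merge_threshold=3)
a = idx.add("new york times")
b = idx.add("york minster")
idx.delete(a)
c = idx.update(b, "york city walls")  # third insert triggers a merge
print(idx.search("york"))
```

Queries consult both indexes, so newly added documents are searchable before any merge happens.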
index compression
1. Statistical properties of terms in information retrieval
2. Dictionary compression
3. Postings file compression
4. References and further reading
scoring, term weighting and the vector space model (assigning a score to a (query, document) pair)
1. Parametric and zone indexes (these can be thought of as attributes and tags; a zone is arbitrary, unbounded free text, such as the title or the body)
  1. Index and retrieve documents by metadata such as the language in which a document is written.
  2. Weighted zone scoring (ranked boolean retrieval)
  3. Give a simple means for scoring documents in response to a query.
2. Term frequency and weighting
  1. Collection frequency
  2. Document frequency
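  3. Inverse document frequency: idf = log(N / df), where N is the number of documents in the collection and df is the document frequency of the term
  4. Tf-idf weighting (tf here is the term's frequency within the document)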
    1. Topic
3. The vector space model for scoring
4. Variant tf-idf functions
5. References and further reading
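As a small worked example of the term weighting items above, the following sketch computes df, idf = log(N / df), and tf-idf = tf × idf for a toy collection. The function name `tf_idf`, the base-10 logarithm, whitespace tokenization, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (weights, idf): weights[doc][term] = tf * log10(N / df)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    idf = {term: math.log10(n_docs / count) for term, count in df.items()}

    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {term: tf[term] * idf[term] for term in tf}
    return weights, idf


# Toy collection (illustrative only).
docs = {
    "d1": "the car the road",
    "d2": "the truck",
}
weights, idf = tf_idf(docs)

print(idf["the"])            # 0.0, because "the" occurs in every document
print(weights["d1"]["car"])  # 1 * log10(2 / 1)
```

A term that appears in every document gets idf 0 and so contributes nothing to the score, however often it occurs in a single document.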
computing scores in a complete search system
1. Efficient scoring and ranking
2. Components of an information retrieval system
3. Vector space scoring and query operator interaction
4. References and further reading
evaluation in information retrieval
1. Information retrieval system evaluation
2. Standard test collections
3. Evaluation of unranked retrieval sets
4. Evaluation of ranked retrieval results
5. Assessing relevance
6. A broader perspective: system quality and user utility
7. Results snippets
8. References and further reading
relevance feedback and query expansion
1. Relevance feedback and pseudo relevance feedback
2. Global methods for query reformulation
3. References and further reading
XML retrieval
1. Basic XML concepts
2. Challenges in XML retrieval
3. A vector space model for XML retrieval
4. Evaluation of XML retrieval
5. Text-centric vs. data-centric XML retrieval
6. References and further reading
Probabilistic information retrieval
1. Review of basic probability theory
2. The Probability Ranking Principle
3. The Binary Independence Model
4. An appraisal and some extensions
5. References and further reading
Language models for information retrieval
1. Language models
2. The query likelihood model
3. Language modeling versus other approaches in IR
4. Extended language modeling approaches
5. References and further reading
Text classification and Naive Bayes
Vector space classification
Support vector machines and machine learning on documents
Flat clustering
1. Clustering in information retrieval
2. Problem statement
3. Evaluation of clustering
4. K-means
5. Model-based clustering
6. References and further reading
7. Exercises
Hierarchical clustering
1. Hierarchical agglomerative clustering
2. Single-link and complete-link clustering
3. Group-average agglomerative clustering
4. Centroid clustering
5. Optimality of HAC
6. Divisive clustering
7. Cluster labeling
8. Implementation notes
9. References and further reading
10. Exercises
Matrix decompositions and latent semantic indexing
1. Linear algebra review
2. Term-document matrices and singular value decompositions
3. Low-rank approximations
4. Latent semantic indexing
5. References and further reading
Web search basics
1. Background and history
2. Web characteristics
3. Advertising as the economic model
4. The search user experience
5. Index size and estimation
6. Near duplicates and shingling
  1. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads.
  2. fingerprint
  3. articles
    1. Near duplication detection
    2. Information fingerprints and their applications
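The fingerprint and shingling items above can be sketched roughly as follows: a hash of the normalized text catches exact copies, while the Jaccard overlap of word k-shingle sets gives a similarity score for near duplicates. The function names, the choice of SHA-1, and k = 4 are assumptions made for this example; production systems typically sample or sketch the shingle sets rather than comparing them in full.

```python
import hashlib


def fingerprint(text):
    """Hash of whitespace- and case-normalized text; equal for exact copies."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=4):
    """Set of all k-word shingles (as tuples) in the text."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose is a flower")

print(len(s1))          # 3 distinct 4-shingles
print(jaccard(s1, s2))  # 0.75
```

Two documents would be flagged as near duplicates when their Jaccard coefficient exceeds some threshold; the threshold itself is a tuning decision not covered in these notes.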
Web crawling and indexes
1. Overview
2. Crawling
  1. Crawler architecture
  2. DNS resolution
    1. DNS resolution is the well known bottleneck in web crawling
    2. The lookup implementations in standard libraries are generally synchronous. This means that once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed.
  3. URL frontier: in essence, this is traversing the web graph.
    1. basic rules
      1. High quality pages that change frequently should be prioritized for crawling.
      2. Politeness: avoid repeated fetch requests to a host within a short time.
    2. urldb: stores URL-related information.
3. Distributing indexes
4. Connectivity servers
5. References and further reading
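The two URL frontier rules noted above (prioritize high-quality, frequently changing pages; stay polite to each host) can be combined in a small sketch. This is a simplified single-heap design written for illustration, not the frontier architecture described in the book. The class name `Frontier`, the `priority` score, and the `delay` parameter are hypothetical, and time is passed in explicitly so that the behavior is easy to follow.

```python
import heapq
from urllib.parse import urlparse


class Frontier:
    """Toy URL frontier: highest priority first, one fetch per host per delay."""

    def __init__(self, delay=2.0):
        self.delay = delay          # minimum seconds between fetches to a host
        self.heap = []              # entries are (-priority, sequence, url)
        self.next_allowed = {}      # host -> earliest time it may be fetched
        self.seq = 0                # tie-breaker that preserves insertion order

    def add(self, url, priority):
        """Queue a URL; a higher priority means it is crawled sooner."""
        heapq.heappush(self.heap, (-priority, self.seq, url))
        self.seq += 1

    def next_url(self, now):
        """Return the best URL whose host may be contacted at time `now`, else None."""
        deferred = []
        chosen = None
        while self.heap:
            item = heapq.heappop(self.heap)
            host = urlparse(item[2]).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                chosen = item[2]
                break
            deferred.append(item)   # host contacted too recently, try later
        for item in deferred:
            heapq.heappush(self.heap, item)
        return chosen


frontier = Frontier(delay=2.0)
frontier.add("http://a.example/news", priority=0.9)
frontier.add("http://a.example/about", priority=0.8)
frontier.add("http://b.example/", priority=0.5)

first = frontier.next_url(now=0.0)    # http://a.example/news
second = frontier.next_url(now=0.5)   # http://b.example/ (a.example must wait)
third = frontier.next_url(now=1.0)    # None, every remaining host is cooling down
fourth = frontier.next_url(now=2.0)   # http://a.example/about
print(first, second, third, fourth)
```

The lower-priority page on the second host is fetched before the second page on the first host, which shows politeness overriding strict priority order.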
Link analysis
1. The web as a graph
  1. link
  2. anchor text
    1. The web is full of instances where a page does not provide an accurate description of itself.
    2. The window of text surrounding the anchor text is often usable in the same manner as the anchor text itself.
2. PageRank
3. Hubs and authorities
4. References and further reading
Web page analysis
1. Distinguishing content pages from index pages
2. Title and content extraction