-
introduction
-
Modern Human Knowledge Base
- Library? no
- The Internet
- Especially Wikipedia
-
How do we get cross-site information?
-
search
- visit multiple webpages and extract/compare by hand
-
Current cross-page info presentation (handmade)
-
Show wikipedia list of XXX
- man-made information gathered from across many entries
-
slow
- needs constant updates for new/modified entries
-
Show product comparison page
- product properties on individual pages
- opinions on a separate set of pages
- new products come out very often
- Too weak, and only a few list pages available
-
The corpus will be Wikipedia instead of the whole Net
-
Pros:
- very high quality
- good grammar and vocabulary
- easy to get data set
- small data set runs faster
-
Cons:
-
Conventional facts are not included
- fun!
- lyrics
- conspiracy
- jokes
- idioms
- curiosity killed the cat
- may contain much false information (rumors)
- not much redundant data for verification
-
development framework
-
MapReduce (by Hadoop)
- used by Yahoo!, a free implementation of Google's backbone
-
Wikipedia is still big
- 2,869,045 articles
- 2.1 GB for the abstracts alone
- 19 GB for the full text (XML format)
-
Programming language
- (mainly) Python
- data streams are modularized, so mixing languages is easy
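A minimal sketch of why stream-based stages make language mixing easy: each stage only reads text lines and writes tab-separated `key<TAB>value` lines, which is the convention Hadoop Streaming uses. The word-count task and the function names `mapper` and `reducer` are illustrative assumptions, not part of the original design.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per token in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts per key. Input must already be sorted by key,
    which is how a MapReduce framework delivers it to a reducer."""
    pairs = (record.rstrip("\n").split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort -> reduce on two toy lines.
records = sorted(mapper(["cats eat fish", "Cats sleep"]))
for out in reducer(records):
    print(out)
```

Because each stage is just a filter over line-oriented text, either one could be swapped for a program in another language without touching the other.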
-
method
-
What do facts look like in natural language (NL)?
-
a method: Howto ...
- step(Topic, (s1,s2,s3,s4....))
-
a description: A is B's friend
- (A) -> (friend of B)
-
a relation: cats eat fish
- (cat) --(eat)--> (fish)
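The three fact shapes above could be held in small record types. This is only a sketch; the class and field names (`Method`, `Description`, `Relation`, `obj`, and so on) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Method:
    """step(Topic, (s1, s2, s3, ...))"""
    topic: str
    steps: Tuple[str, ...]

@dataclass(frozen=True)
class Description:
    """(A) -> (friend of B)"""
    subject: str
    description: str

@dataclass(frozen=True)
class Relation:
    """(cat) --(eat)--> (fish)"""
    subject: str
    verb: str
    obj: str

    def __str__(self):
        return f"({self.subject}) --({self.verb})--> ({self.obj})"

print(Relation("cat", "eat", "fish"))
```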
-
main method
-
Extraction
-
POS Tagging
- mark noun phrases and verb phrases
- Tools
- YamCha
-
Facts extraction
- NVN
- solid, but too narrow
- NPVPNP
- better
- NP*VP*NP
- limit length
- NP+beV+VBN+preposition
- p.p. (passive) reverses the relation direction
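A rough sketch of the NP VP NP pattern with a length limit, assuming the chunker has already produced `(text, tag)` pairs. The input format, the function name `extract_triples`, and the `max_gap` parameter are hypothetical; this does not reflect the actual output format of any particular tool.

```python
import re

# One letter per chunk type so a regex can scan the chunk sequence.
CODES = {"NP": "N", "VP": "V"}

def extract_triples(chunks, max_gap=0):
    """chunks: list of (text, tag) pairs, tag being 'NP', 'VP' or anything else.
    Returns (NP, VP, NP) text triples. Up to max_gap other chunks may sit
    between neighbouring slots (the 'limit length' idea for NP*VP*NP)."""
    tags = "".join(CODES.get(tag, "O") for _, tag in chunks)
    gap = "O{0,%d}" % max_gap
    # The lookahead lets overlapping patterns be found as well.
    pattern = re.compile("(?=(N)%s(V)%s(N))" % (gap, gap))
    return [tuple(chunks[m.start(g)][0] for g in (1, 2, 3))
            for m in pattern.finditer(tags)]

chunks = [("the cat", "NP"), ("often", "ADVP"), ("eats", "VP"), ("fresh fish", "NP")]
print(extract_triples(chunks, max_gap=0))  # strict NP VP NP: no match here
print(extract_triples(chunks, max_gap=1))  # one intervening chunk allowed
```

With `max_gap=0` this behaves like the strict NP VP NP pattern; raising it trades precision for coverage.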
-
Normalization
- form A --B--> C
- part of speech
- passive
- phrases
- NER Tagging?
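A small sketch of normalizing to the A --B--> C form, including the passive flip. It assumes Penn-style POS tags on the verb phrase, handles only "by" passives, and uses a toy lemma table; `BE_FORMS`, `LEMMAS`, and `normalize` are hypothetical names, and a real system would use a proper lemmatizer.

```python
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}
# Hypothetical toy lemma table standing in for a real lemmatizer.
LEMMAS = {"eaten": "eat", "eats": "eat", "written": "write", "cats": "cat"}

def lemma(word):
    return LEMMAS.get(word.lower(), word.lower())

def normalize(subject, verb_phrase, obj):
    """verb_phrase: list of (word, POS) pairs,
    e.g. [('are', 'VBP'), ('eaten', 'VBN'), ('by', 'IN')].
    Returns (A, B, C), read as A --B--> C."""
    words = [w.lower() for w, _ in verb_phrase]
    tags = [t for _, t in verb_phrase]
    is_passive = (len(verb_phrase) == 3 and words[0] in BE_FORMS
                  and tags[1] == "VBN" and words[2] == "by")
    if is_passive:
        # Passive: the grammatical subject is the logical object, so swap.
        return (lemma(obj), lemma(words[1]), lemma(subject))
    # Active: assume the last word of the phrase is the head verb.
    return (lemma(subject), lemma(words[-1]), lemma(obj))

print(normalize("fish", [("are", "VBP"), ("eaten", "VBN"), ("by", "IN")], "cats"))
print(normalize("cats", [("eat", "VBP")], "fish"))
```

Both calls yield the same triple, which is the point of normalization: active and passive phrasings of one fact collapse into a single record.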
-
Presenting
-
Thesaurus on N
- for query expansion
- Tools:
- WordNet
-
Thesaurus on V
- for query expansion
- Tools:
- WordNet
- VerbOcean
- AutoGen from N?
-
Thesaurus on VP?
- AutoGen from N/NP?
-
presentation
-
A limited search form
- 5W + Verb + NP
- NP + Verb + 5W
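The two form shapes can be read as a triple query with one open slot. A minimal sketch, assuming facts are stored as `(subject, verb, object)` tuples; the `FACTS` list, the `WH` marker, and the `search` function are made up for illustration.

```python
# Hypothetical toy fact store.
FACTS = [("cat", "eat", "fish"), ("dog", "chase", "cat"), ("cat", "chase", "mouse")]

WH = None  # marks the open 5W slot in a query

def search(facts, subject=WH, verb=WH, obj=WH):
    """Return every fact that agrees with the filled-in slots."""
    query = (subject, verb, obj)
    return [fact for fact in facts
            if all(q is WH or q == f for q, f in zip(query, fact))]

# "5W + Verb + NP": what eats fish?
print(search(FACTS, verb="eat", obj="fish"))
# "NP + Verb + 5W": cats chase what?
print(search(FACTS, subject="cat", verb="chase"))
```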
-
Problems:
- 5W need NER tagging
-
Query Expansion
- N expansion using a thesaurus
- VP expansion ??
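A sketch of N expansion on a triple query. The in-memory `THESAURUS` dict is a hypothetical stand-in for a real resource such as WordNet, and the verb is left unexpanded since VP expansion is still an open question here.

```python
# Hypothetical toy thesaurus; a real system would look synonyms up in WordNet.
THESAURUS = {"cat": {"feline", "kitty"}, "fish": {"seafood"}}

def expand(term, thesaurus=THESAURUS):
    """Return the term together with its listed synonyms."""
    return {term} | thesaurus.get(term, set())

def expand_query(subject, verb, obj, thesaurus=THESAURUS):
    """Return every (subject, verb, obj) variant after noun expansion."""
    return sorted((s, verb, o)
                  for s in expand(subject, thesaurus)
                  for o in expand(obj, thesaurus))

for variant in expand_query("cat", "eat", "fish"):
    print(variant)
```

The number of variants is the product of the synonym set sizes, so some cap on synonyms per term is likely needed in practice.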
-
Related work & topics
-
related works
-
TextRunner
- Comparison with the Net Corpus
-
Comparison with other QA Systems
-
traditional "Q&A" list
- man-made
- static
- works well in small scopes
- usually indexed, unsearchable
-
forums
-
not only contains QAs
- but a lot of them are
-
social "Q&A" (Yahoo! Answers)
- an alternate form of forums
- man-made (user-contributed)
- potentially bad quality
-
smarter ones
-
Answers dot com
- not smart enough