Created: September 21, 2025 11:29 Class: TTDS Content: Chapter 2 Learning time: 2.5
An example of an architecture used to provide a standard for integrating search and related language technology components is UIMA (Unstructured Information Management Architecture).
- Effectiveness/quality
- Efficiency/speed
Two major functions:
- Text Acquisition: identifies the documents that will be searched (sources of information: crawling or scanning the Web, a corporate intranet, or a desktop)
- Document Data Store: contains the text and metadata (type, structure, document length)
- Text Transformation component transforms documents into index terms or features. Output: index vocabulary
- Index Creation: the index needs to be efficiently updated. The most common method is inverted indexes (also called inverted files)
- user interaction
- accepts the user’s query and transforms it into index terms
- takes the ranked list of documents from the search engine and organizes it into the results shown to the user (snippets are used to summarize documents)
- provides techniques for refining the query
- ranking(core)
- takes the transformed query from the user interaction component and generates a ranked list of documents using scores based on a retrieval model.
- The efficiency of ranking depends on the indexes, and the effectiveness depends on the retrieval model.
- evaluation
- record and analyze user behavior using log data
- tune and improve the ranking component
-
Crawler/general web crawler
There are significant challenges in designing a web crawler that can efficiently handle the huge volume of new pages on the Web while at the same time ensuring that pages that may have changed since the last crawl are kept fresh.
-
Feeds
- real-time stream of documents, e.g. news stories and updates
- RSS is a common standard and actually refers to a family of standards with similar names (and the same initials), such as Really Simple Syndication or Rich Site Summary
-
Conversion
- utilities are available to convert various formats into text
- text is converted into a consistent encoding scheme: ASCII (7 bits, or 8 bits for extended ASCII) or Unicode (16 bits in its original design)
-
Document data store
- documents are stored in compressed form for efficiency and organized so they can be retrieved quickly
- structured data consists of metadata and other information extracted from the documents, such as links and anchor text
- Parser
- recognizes structural elements (titles, figures, links) from the sequence of text tokens in the document
- Tokenizing is the first step, and how it is done can affect retrieval
- The document parser uses knowledge of the syntax of the markup language (e.g. XML or HTML tags) to identify the structure
- Stopping
- removes common words (e.g. function words) from the stream of tokens that become index terms
- it is difficult to decide how many words to include on the stopword list
⚠️ the query “to be or not to be” consists entirely of stopwords, so a short stopword list can be used for document text and longer lists for the default processing of query text
- Stemming
- groups words that are derived from a common stem, e.g. fish, fishes, fishing
- increases the likelihood that words used in queries and documents will match
- generally yields small improvements in effectiveness; of little use for languages with little word variation, such as Chinese
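The tokenizing, stopping, and stemming steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stopword list `STOPWORDS`, the function names, and the crude suffix rule in `toy_stem` are made up for this example and stand in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear on the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def to_index_terms(text):
    """Tokenize, stop, and stem a piece of text into index terms."""
    return [toy_stem(t) for t in remove_stopwords(tokenize(text))]

print(to_index_terms("The fishes are fishing in the river"))
# ['fish', 'fish', 'river']
```

The same function would be applied to query text, so that query terms are comparable to the document terms.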
- link extraction and analysis
- link analysis algorithms such as PageRank provide the authority of a page
- Anchor text
- Information extraction
- identify index terms
- technique: named entity recognizers
- Classifier
- assigns predefined labels to documents
- clustering groups documents without predefined categories
- Document statistics
- counts of index term occurrences, positions of index terms, length of documents
- stored in lookup tables
- Weighting
- reflect the importance of words in documents
- tf.idf: combines the frequency or count of index term occurrences in a document (the term frequency, or tf) with the frequency of index term occurrence over the entire collection of documents (the inverse document frequency, or idf).
- A typical formula for idf is $\log(N/n)$, where $N$ is the total number of documents indexed by the search engine and $n$ is the number of documents that contain a particular term.
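A minimal sketch of the tf.idf weighting described above, using raw counts for tf and $\log(N/n)$ for idf. The function name `tf_idf_weights`, the toy documents, and the choice of the natural logarithm are assumptions for illustration; real systems use many variants of this formula.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / n)."""
    N = len(docs)
    # n[term] = number of documents that contain the term at least once.
    n = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(N / n[t]) for t in tf})
    return weights

# Hypothetical documents, already reduced to index terms.
docs = [
    ["tropical", "fish", "fish"],
    ["fish", "tank"],
    ["tropical", "plants"],
]
for w in tf_idf_weights(docs):
    print(w)
```

A term that occurs in every document gets weight 0 under this formula, since $\log(N/N) = 0$.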
- Inversion
- converts document-term information into term-document information to create inverted indexes
- challenge: efficiency (large numbers of documents, plus updates as new documents arrive)
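The inversion step can be sketched as follows: a mapping from documents to their terms is turned into a mapping from terms to posting lists. The function name `invert`, the `(doc_id, count)` posting format, and the toy data are assumptions for this example; it works in memory only and ignores the efficiency and update problems noted above.

```python
from collections import defaultdict

def invert(doc_terms):
    """Turn {doc_id: [terms]} into {term: [(doc_id, count), ...]}."""
    index = defaultdict(dict)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Sort each posting list by doc_id so lists can be merged in order.
    return {term: sorted(postings.items()) for term, postings in index.items()}

# Hypothetical document-term information.
doc_terms = {
    1: ["tropical", "fish", "fish"],
    2: ["fish", "tank"],
    3: ["tropical", "plants"],
}
inverted = invert(doc_terms)
print(inverted["fish"])      # [(1, 2), (2, 1)]
print(inverted["tropical"])  # [(1, 1), (3, 1)]
```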
- Index distribution
- parallel: document distribution and term distribution
- replication
- peer-to-peer search
- Query input
- operators: clarify the meaning of the query, e.g. quotes
- keywords: “search engines” may produce a better result with a web search engine than the query “what are typical implementation techniques and data structures used in search engines”.
- Boolean query language
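As a rough illustration of a Boolean query language, the operators AND, OR, and NOT can be mapped onto set operations over the documents that contain each term. The `postings` data and the helper `docs_for` are hypothetical names for this sketch, not part of any particular engine.

```python
# Hypothetical posting sets: term -> ids of the documents that contain it.
postings = {
    "tropical": {1, 3},
    "fish": {1, 2},
    "tank": {2},
}
all_docs = {1, 2, 3}

def docs_for(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return postings.get(term, set())

# tropical AND fish
print(sorted(docs_for("tropical") & docs_for("fish")))           # [1]
# fish OR tank
print(sorted(docs_for("fish") | docs_for("tank")))               # [1, 2]
# fish AND NOT tank
print(sorted(docs_for("fish") & (all_docs - docs_for("tank"))))  # [1]
```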
- Query transformation
- improve initial query
- Tokenizing, stopping, and stemming must be done on the query text to produce index terms that are comparable to the document terms.
- Spell checking and query suggestion, which often leverage query logs.
- query expansion
- relevance feedback
- Results output
- generating snippets, highlighting important words and passages…
-
Scoring/query processing
- related to topic and user relevance
- basic form of the document score:
$$ \sum_i q_i \cdot d_i $$
$q_i$ is the query term weight of the $i$th term, and $d_i$ is the document term weight (generally similar to tf.idf weights).
- BM25 & query likelihood
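The basic score $\sum_i q_i \cdot d_i$ can be sketched directly. The weights below are invented numbers standing in for tf.idf-like values, and `score` is a hypothetical helper; this shows the generic form only, not BM25 or query likelihood.

```python
def score(query_weights, doc_weights):
    """Sum q_i * d_i over the terms the query and the document share."""
    return sum(
        q * doc_weights[t] for t, q in query_weights.items() if t in doc_weights
    )

# Hypothetical query term weights q_i and document term weights d_i.
query = {"tropical": 1.0, "fish": 1.0}
docs = {
    "d1": {"tropical": 0.4, "fish": 0.8},
    "d2": {"fish": 0.4, "tank": 1.1},
    "d3": {"tropical": 0.3, "plants": 1.1},
}

# Rank documents by descending score.
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Terms that appear in only one of the two (such as "tank" or "plants" here) contribute nothing to the sum.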
-
Optimization: throughput
- term-at-a-time scoring
- document-at-a-time scoring
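The two scoring strategies can be contrasted on a toy index. This is a simplified sketch: the index layout (posting lists of `(doc_id, weight)` pairs sorted by `doc_id`), the weights, and the function names are assumptions, and neither function includes the optimizations a real engine would apply.

```python
from collections import defaultdict

# Hypothetical index: term -> posting list of (doc_id, d_i), sorted by doc_id.
index = {
    "tropical": [(1, 0.4), (3, 0.3)],
    "fish": [(1, 0.8), (2, 0.4)],
}
query = {"tropical": 1.0, "fish": 1.0}  # term -> query term weight q_i

def term_at_a_time(query, index):
    """Process one posting list fully before the next, using accumulators."""
    acc = defaultdict(float)
    for term, q in query.items():
        for doc_id, d in index.get(term, []):
            acc[doc_id] += q * d
    return dict(acc)

def document_at_a_time(query, index):
    """Walk all posting lists in parallel, finishing one document at a time."""
    lists = [(q, index.get(t, [])) for t, q in query.items()]
    pos = [0] * len(lists)  # current position in each posting list
    scores = {}
    while True:
        # Smallest doc_id not yet scored across all posting lists.
        heads = [pl[p][0] for (_, pl), p in zip(lists, pos) if p < len(pl)]
        if not heads:
            break
        doc_id = min(heads)
        total = 0.0
        for i, (q, pl) in enumerate(lists):
            if pos[i] < len(pl) and pl[pos[i]][0] == doc_id:
                total += q * pl[pos[i]][1]
                pos[i] += 1
        scores[doc_id] = total
    return scores

print(term_at_a_time(query, index))
print(document_at_a_time(query, index))
```

Both functions produce the same scores; they differ in the order in which postings are read and in how much partial state (accumulators versus list positions) is held during processing.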
-
Distribution
- query broker
- caching
- Logging
- query logs are used for spell checking, query suggestion, and advertising
- dwell time
- Ranking analysis
- emphasize the quality of the top-ranked documents
- performance analysis
- response time, throughput
- distribution: network usage
- mathematical simulations (the counterpart of test collections in ranking analysis)


