开源软件网
当前位置:文章首页 >> 开源搜索引擎 >> Lucene lecture at Pisa

Lucene lecture at Pisa
2008-04-23 22:24:06  作者:hauver  来源:互连网  浏览次数:0  网友评论0  文字大小:【】【】【】 评分等级:5

Prelude

  • my background..
  • please interrupt with questions
  • blog this talk now so that we can search for it later
    (using a Lucene-based blog search engine, of course)
  • In this course, Paolo and Antonio have presented many techniques.
  • I present real software that uses many of these techniques.

Lucene is


Lucene Architecture


[draw on whiteboard for reference throughout talk]

Lucene API

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search


Package: org.apache.lucene.document

  • A Document is a sequence of Fields.
  • A Field is a <name, value> pair.
    • name is the name of the field, e.g., title, body, subject, date, etc.
    • value is text.
  • Field values may be stored, indexed or analyzed (and, now, vectored).

Example

 public Document makeDocument(File f) throws FileNotFoundException {
Document doc = new Document();
doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
Store.YES, Index.UN_TOKENIZED));

// Reader implies Store.NO and Index.TOKENIZED
doc.add(new Field("contents", new FileReader(f)));

return doc;
}

 

Example (continued)

field
stored
indexed
analyzed
path
yes
yes
yes
modified
yes
yes
no
content
no
yes
yes


Package: org.apache.lucene.analysis

  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens.
    • input is a character iterator (Reader)
  • A Token is tuple <text, type, start, length, positionIncrement>
    • text (e.g., “pisa”).
    • type (e.g., “word”, “sent”, “para”).
    • start & length offsets, in characters (e.g, <5,4>)
    • positionIncrement (normally 1)
  • standard TokenStream implementations are
    • Tokenizers, which divide characters into tokens and
    • TokenFilters, e.g., stop lists, stemmers, etc.

Example

public class ItalianAnalyzer extends Analyzer {
  private Set stopWords = 
StopFilter.makeStopSet(new String[] {"il", "la", "in"};

public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new WhitespaceTokenizer(reader);
 result = new LowerCaseFilter(result);
  result = new StopFilter(result, stopWords);
  result = new SnowballFilter(result, "Italian");
 return result;
 }
}

Package: org.apache.lucene.index

  • Term is <fieldName, text>
  • index maps Term → <df, <docNum, <position>* >*>
  • e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
  • new: term vectors!

Example

IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
writer.addDocument(makeDocument(files[i]));
}
writer.close();

Some Inverted Index Strategies

  1. batch-based: use file-sorting algorithms (textbook)
      + fastest to build
      + fastest to search
      - slow to update
  2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
      + fast to search
      - update/build does not scale
      - complex implementation
  3. segment based: lots of small indexes (Verity)
      + fast to build
      + fast to update
      - slower to search

Lucene's Index Algorithm

  • two basic algorithms:
    1. make an index for a single document
    2. merge a set of indices
  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=∞
      for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
         pop them off the stack;
         merge them into a single index;
         push the merged index onto the stack;
       } else {
         break;
       }
      }
  • optimization: single-doc indexes kept in RAM, saves system calls
  • notes:
    • average b*logb(N)/2 indexes
      • N=1M, b=2 gives just 20 indexes
      • fast to update and not too slow to search
    • batch indexing  w/ M=∞, merge all at end
      • equivalent to external merge sort, optimal
    • segment indexing w/ M<∞

Indexing Diagram

  • b = 3
  • 11 documents indexed
  • stack has four indexes
  • grayed indexes have been deleted
  • 5 merges have occurred


Index Compression

For keys in Term -> ... map, use technique from Paolo's slides:


For values in Term -> ... map, use technique from Paolo's slides:



VInt Encoding Example

Value

First byte

Second byte

Third byte

0

00000000



1

00000001



2

00000010



...




127

01111111



128

10000000

00000001


129

10000001

00000001


130

10000010

00000001


...




16,383

11111111

01111111


16,384

10000000

10000000

00000001

16,385

10000001

10000000

00000001

...




This provides compression while still being efficient to decode.


Package: org.apache.lucene.search

  • primitive queries:
    • TermQuery: match docs containing a Term
    • PhraseQuery: match docs w/ sequence of Terms
    • BooleanQuery: match docs matching other queries.
      e.g., +path:pisa +content:“Doug Cutting” -path:nutch
  • new: SpansQuery
  • derived queries:
    • PrefixQuery, WildcardQuery, etc.

Example

Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);

Search Algorithms

From Paolo's slides:


Lucene's Disjunctive Search Algorithm

  • described in http://lucene.sf.net/papers/riao97.ps
  • since all postings must be processed
    • goal is to minimize per-posting computation
  • merges postings through a fixed-size array of accumulator buckets
  • performs boolean logic with bit masks
  • scales well with large queries

[ draw a diagram to illustrate? ]

Lucene's Conjunctive Search Algorithm

From Paolo's slides:


Algorithm
  • use linked list of pointers to doc list
  • initially sorted by doc
  • loop
    • if all are at same doc, record hit
    • skip first to-or-past last and move to end of list

Scoring

From Paolo's slides:

Is very much like Lucene's Similarity.

Lucene's Phrase Scoring

  • approximate phrase IDF with sum of terms
  • compute actual tf of phrase
  • slop penalizes slight mismatches by edit-distance

Thanks!

And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

Finally, search for this talk on Technorati.


本文引用地址:http://www.open-soft.com.cn/article/2008/0423/article_110.html

责任编辑:hauver

发表评论】 【加入收藏】 【告诉好友】 【打印本页】 【关闭窗口】 【返回顶部
相关评论 0条评论  发表/查看更多评论 
发表评论  【返回顶部】【关闭窗口】 
评分: 1 2 3 4 5

    
  • 请遵守《互联网电子公告服务管理规定》及中华人民共和国其他各项有关法律法规。
  • 严禁发表危害国家安全、损害国家利益、破坏民族团结、破坏国家宗教政策、破坏社会稳定、侮辱、诽谤、教唆、淫秽等内容的评论 。
  • 用户需对自己在使用本站服务过程中的行为承担法律责任(直接或间接导致的)。
  • 本站管理员有权保留或删除评论内容。
 
[推荐]Lucene的系统结构 [2008-04-23]
[推荐]Lucene 全文检索实践 [2008-04-23]
[推荐]Lucene.Net 系列一本... [2008-04-23]
[推荐]构建基于词典的Lu... [2008-04-23]
Lucene中文分词 使用Log4... [2008-04-23]
[注意]lucene中文分词器--... [2008-04-23]
[推荐]第四节 Lucene索引构... [2008-04-23]
[推荐]第三节 Lucene索引文... [2008-04-23]
[推荐]第三节 Lucene索引文... [2008-04-23]
[推荐]第二节 Lucene系统结... [2008-04-23]