Package org.apache.lucene.codecs.memory
Class FSTTermsWriter
- java.lang.Object
  - org.apache.lucene.codecs.FieldsConsumer
    - org.apache.lucene.codecs.memory.FSTTermsWriter
- All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
public class FSTTermsWriter extends FieldsConsumer
FST-based term dictionary, using metadata as FST output. The FST directly holds the mapping between <term, metadata>. Term metadata consists of three parts: 1. term statistics: docFreq, totalTermFreq; 2. monotonic long[], e.g. the pointer to the postings list for that term; 3. generic byte[], e.g. other information needed by the postings reader.
File:
- .tst: Term Dictionary
Term Dictionary
The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its corresponding statistics (e.g. docFreq) and metadata (e.g. information for the postings list reader, such as the file pointer to the postings list).
Typically the metadata is separated into two parts:
- Monotonic long array: Some metadata always ascends in term order. The FST uses this part to share outputs between arcs.
- Generic byte array: Used to store non-monotonic metadata.
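The output sharing mentioned for the monotonic long array can be sketched as follows. This is an illustration only, not the Lucene implementation: the class `MonotonicShareSketch`, its methods, and the sample values are hypothetical. It assumes the shared part of two outputs is their element-wise minimum, which can be stored once on an earlier arc, with each term keeping only its remainder.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Lucene: shows how monotonic long[] outputs can share a common part. */
final class MonotonicShareSketch {

    /** Element-wise minimum: the portion both outputs have in common. */
    static long[] common(long[] a, long[] b) {
        long[] shared = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            shared[i] = Math.min(a[i], b[i]);
        }
        return shared;
    }

    /** Remainder that stays on the deeper arc once the shared part is stored earlier. */
    static long[] subtract(long[] full, long[] shared) {
        long[] rest = new long[full.length];
        for (int i = 0; i < full.length; i++) {
            rest[i] = full[i] - shared[i];
        }
        return rest;
    }

    /** Recombines the shared part and a remainder into the full output. */
    static long[] add(long[] shared, long[] rest) {
        long[] full = new long[shared.length];
        for (int i = 0; i < shared.length; i++) {
            full[i] = shared[i] + rest[i];
        }
        return full;
    }

    public static void main(String[] args) {
        // Made-up metadata for two terms with a common prefix; values ascend with term order.
        long[] first = {100, 7};
        long[] second = {160, 9};

        long[] shared = common(first, second);
        System.out.println("shared      = " + Arrays.toString(shared));
        System.out.println("rest first  = " + Arrays.toString(subtract(first, shared)));
        System.out.println("rest second = " + Arrays.toString(subtract(second, shared)));
    }
}
```

Because the values ascend with the terms, the smaller term's array tends to be the shared part in full, so its remainder is all zeros.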
- TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
- FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
- TermFST --> FST<TermData>
- TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
- Header --> IndexHeader
- DirOffset --> Uint64
- DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> VInt
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> VLong
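The VInt and VLong types above are variable-length integers: each byte carries seven bits of the value, lowest bits first, and the high bit marks that another byte follows. The sketch below shows that encoding in isolation. The class `VLongSketch` and its methods are hypothetical stand-ins written for this page, not the Lucene I/O classes.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-alone illustration of variable-length integer encoding. */
final class VLongSketch {

    /** Encodes a non-negative value, 7 bits per byte, low-order bits first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            // More bits remain: emit the low 7 bits with the continuation bit set.
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by {@link #encode(long)}. */
    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= ((long) (b & 0x7F)) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: last byte of this value
            }
        }
        return value;
    }

    public static void main(String[] args) {
        for (long v : new long[] {0L, 127L, 128L, 300L, 1L << 40}) {
            byte[] encoded = encode(v);
            System.out.println(v + " -> " + encoded.length + " byte(s), decoded " + decode(encoded));
        }
    }
}
```

Small values such as a typical DocFreq fit in a single byte, which is why counts and deltas are stored this way rather than as fixed-width integers.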
Notes:
- The formats of PostingsHeader and the generic meta bytes are defined by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic data such as pulsed postings).
- The format of TermData is determined by the FST. Typically, monotonic metadata is dense around shallow arcs, while deeper arcs hold only generic bytes and term statistics.
- The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part is omitted when it is an array of 0s.
- Since LongsSize is per-field fixed, it is only written once in field summary.
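To make the role of the Flag byte concrete, here is a minimal sketch of how such a flag could be derived from the three metadata parts. The bit assignments, the class `TermDataFlagSketch`, and the use of `docFreq > 0` as the "statistics present" test are all assumptions made for illustration; the actual bit layout used by the codec is not specified on this page.

```java
/** Hypothetical illustration of a presence flag for per-arc term metadata. */
final class TermDataFlagSketch {

    // Assumed bit assignments, chosen only for this example.
    static final int HAS_LONGS = 1;      // monotonic long[] part is present
    static final int HAS_BYTES = 1 << 1; // generic byte[] part is present
    static final int HAS_STATS = 1 << 2; // term statistics are present

    /** Returns true when every element is 0, in which case the long[] part is omitted. */
    static boolean allZero(long[] longs) {
        for (long v : longs) {
            if (v != 0) {
                return false;
            }
        }
        return true;
    }

    /** Builds the flag describing which parts would be written for this arc. */
    static int flagFor(long[] longs, byte[] bytes, int docFreq) {
        int flag = 0;
        if (!allZero(longs)) {
            flag |= HAS_LONGS;
        }
        if (bytes != null && bytes.length > 0) {
            flag |= HAS_BYTES;
        }
        if (docFreq > 0) {
            flag |= HAS_STATS;
        }
        return flag;
    }

    public static void main(String[] args) {
        // A shallow arc carrying only monotonic data.
        System.out.println(flagFor(new long[] {100, 7}, new byte[0], 0));
        // A deeper arc carrying only generic bytes and statistics.
        System.out.println(flagFor(new long[] {0, 0}, new byte[] {1, 2}, 3));
    }
}
```

A reader would test the same bits before decoding each optional part, so an arc whose long array is all zeros costs no bytes for that part.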
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class)
- private static class FSTTermsWriter.FieldMetaData
- (package private) class FSTTermsWriter.TermsWriter
-
Field Summary
Fields (Modifier and Type, Field)
- (package private) FieldInfos fieldInfos
- (package private) java.util.List<FSTTermsWriter.FieldMetaData> fields
- (package private) int maxDoc
- (package private) IndexOutput out
- (package private) PostingsWriterBase postingsWriter
- (package private) static java.lang.String TERMS_CODEC_NAME
- (package private) static java.lang.String TERMS_EXTENSION
- static int TERMS_VERSION_CURRENT
- static int TERMS_VERSION_START
-
Constructor Summary
Constructors
- FSTTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter)
-
Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method, Description)
- void close()
- void write(Fields fields, NormsProducer norms): Write all fields, terms and postings.
- private void writeTrailer(IndexOutput out, long dirStart)
-
Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge
-
-
-
-
Field Detail
-
TERMS_EXTENSION
static final java.lang.String TERMS_EXTENSION
- See Also:
- Constant Field Values
-
TERMS_CODEC_NAME
static final java.lang.String TERMS_CODEC_NAME
- See Also:
- Constant Field Values
-
TERMS_VERSION_START
public static final int TERMS_VERSION_START
- See Also:
- Constant Field Values
-
TERMS_VERSION_CURRENT
public static final int TERMS_VERSION_CURRENT
- See Also:
- Constant Field Values
-
postingsWriter
final PostingsWriterBase postingsWriter
-
fieldInfos
final FieldInfos fieldInfos
-
out
IndexOutput out
-
maxDoc
final int maxDoc
-
fields
final java.util.List<FSTTermsWriter.FieldMetaData> fields
-
-
Constructor Detail
-
FSTTermsWriter
public FSTTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter) throws java.io.IOException
- Throws: java.io.IOException
-
-
Method Detail
-
writeTrailer
private void writeTrailer(IndexOutput out, long dirStart) throws java.io.IOException
- Throws: java.io.IOException
-
write
public void write(Fields fields, NormsProducer norms) throws java.io.IOException
Description copied from class: FieldsConsumer
Write all fields, terms and postings. This is the "pull" API, allowing you to iterate more than once over the postings, somewhat analogous to using a DOM API to traverse an XML tree.
Notes:
- You must compute index statistics, including each Term's docFreq and totalTermFreq, as well as the summary sumTotalTermFreq, sumDocFreq and docCount.
- You must skip terms that have no docs and fields that have no terms, even though the provided Fields API will expose them; this typically requires deferring the write of a field or term until you've actually seen its first term or document.
- The provided Fields instance is limited: you cannot call any methods that return statistics/counts; you cannot pass non-null live docs when pulling docs/positions enums.
- Specified by: write in class FieldsConsumer
- Throws: java.io.IOException
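The notes for write require the consumer to compute statistics itself and to skip terms that have no documents. The sketch below shows that bookkeeping on plain Java collections rather than on the Lucene Fields API. The class `FieldStatsSketch`, the `{docId, freq}` pair representation, and the sample postings are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical per-field statistics collector; postings are {docId, freq} pairs. */
final class FieldStatsSketch {
    long numTerms;
    long sumTotalTermFreq;
    long sumDocFreq;
    private final Set<Integer> docsSeen = new HashSet<>();

    /** Number of distinct documents that have at least one term in this field. */
    int docCount() {
        return docsSeen.size();
    }

    static FieldStatsSketch collect(SortedMap<String, int[][]> termToPostings) {
        FieldStatsSketch stats = new FieldStatsSketch();
        for (Map.Entry<String, int[][]> entry : termToPostings.entrySet()) {
            int docFreq = 0;
            long totalTermFreq = 0;
            for (int[] posting : entry.getValue()) {
                docFreq++;
                totalTermFreq += posting[1];
                stats.docsSeen.add(posting[0]);
            }
            if (docFreq == 0) {
                continue; // term has no docs: it must not be written or counted
            }
            stats.numTerms++;
            stats.sumDocFreq += docFreq;
            stats.sumTotalTermFreq += totalTermFreq;
        }
        return stats;
    }

    public static void main(String[] args) {
        SortedMap<String, int[][]> postings = new TreeMap<>();
        postings.put("apple", new int[][] {{0, 2}, {3, 1}});
        postings.put("ghost", new int[][] {}); // exposed by the source but has no docs
        postings.put("pear", new int[][] {{3, 4}});

        FieldStatsSketch stats = collect(postings);
        System.out.println("numTerms=" + stats.numTerms
            + " sumDocFreq=" + stats.sumDocFreq
            + " sumTotalTermFreq=" + stats.sumTotalTermFreq
            + " docCount=" + stats.docCount());
    }
}
```

A field for which `numTerms` stays at 0 would likewise be skipped entirely, which is why writing is usually deferred until the first real term has been seen.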
-
close
public void close() throws java.io.IOException
- Specified by: close in interface java.lang.AutoCloseable
- Specified by: close in interface java.io.Closeable
- Specified by: close in class FieldsConsumer
- Throws: java.io.IOException
-
-