Class FSTCompiler<T>

java.lang.Object
org.apache.lucene.util.fst.FSTCompiler<T>

public class FSTCompiler<T> extends Object
Builds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).

NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

The parameterized type T is the output type. See the subclasses of Outputs.

FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.

It now supports 3 different workflows:

- Build FST and use it immediately entirely in RAM and then discard it

- Build FST and use it immediately entirely in RAM and also save it to other DataOutput, and load it later and use it

- Build FST but stream it immediately to disk (except the FSTMetaData, to be saved at the end). In order to use it, you need to construct the corresponding DataInput and use the FST constructor to read it.

  • Field Details

  • Constructor Details

    • FSTCompiler

      private FSTCompiler(FST.INPUT_TYPE inputType, double suffixRAMLimitMB, Outputs<T> outputs, boolean allowFixedLengthArcs, DataOutput dataOutput, float directAddressingMaxOversizingFactor, int version)
  • Method Details

    • getOnHeapReaderWriter

      public static DataOutput getOnHeapReaderWriter(int blockBits)
      Get an on-heap DataOutput that allows the FST to be read immediately after writing, and also optionally saved to an external DataOutput.
      Parameters:
      blockBits - how many bits wide to make each block of the DataOutput
      Returns:
      the DataOutput
    • getFSTReader

      public FSTReader getFSTReader()
      Get the respective FSTReader of the DataOutput. To call this method, you need to use the default DataOutput or getOnHeapReaderWriter(int), otherwise we will throw an exception.
      Returns:
      the DataOutput as FSTReader
      Throws:
      IllegalStateException - if the DataOutput does not implement FSTReader
    • getDirectAddressingMaxOversizingFactor

      public float getDirectAddressingMaxOversizingFactor()
    • getNodeCount

      public long getNodeCount()
    • getArcCount

      public long getArcCount()
    • compileNode

      private FSTCompiler.CompiledNode compileNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws IOException
      Throws:
      IOException
    • addNode

      long addNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws IOException
      Throws:
      IOException
    • writePaddingByte

      private void writePaddingByte() throws IOException
      Write the padding byte, ensure no node gets address 0 which is reserved to mean the stop state w/ no arcs
      Throws:
      IOException
    • writeLabel

      private void writeLabel(DataOutput out, int v) throws IOException
      Throws:
      IOException
    • shouldExpandNodeWithFixedLengthArcs

      private boolean shouldExpandNodeWithFixedLengthArcs(FSTCompiler.UnCompiledNode<T> node)
      Returns whether the given node should be expanded with fixed length arcs. Nodes will be expanded depending on their depth (distance from the root node) and their number of arcs.

      Nodes with fixed length arcs use more space, because they encode all arcs with a fixed number of bytes, but they allow either binary search or direct addressing on the arcs (instead of linear scan) on lookup by arc label.

    • shouldExpandNodeWithDirectAddressing

      private boolean shouldExpandNodeWithDirectAddressing(FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange)
      Returns whether the given node should be expanded with direct addressing instead of binary search.

      Prefer direct addressing for performance if it does not oversize binary search byte size too much, so that the arcs can be directly addressed by label.

      See Also:
    • writeNodeForBinarySearch

      private void writeNodeForBinarySearch(FSTCompiler.UnCompiledNode<T> nodeIn, int maxBytesPerArc)
    • reverseScratchBytes

      private void reverseScratchBytes()
      Reverse the scratch bytes in place. This operation does not affect scratchBytes.getPosition().
    • writeScratchBytes

      private void writeScratchBytes(int destPos, byte[] bytes, int offset, int length)
      Write bytes from a source byte[] to the scratch bytes. The written bytes must fit within what was already written in the scratch bytes.

      This operation does not affect scratchBytes.getPosition().

      Parameters:
      destPos - the position in the scratch bytes
      bytes - the source byte[]
      offset - the offset inside the source byte[]
      length - the number of bytes to write
    • writeNodeForDirectAddressingOrContinuous

      private void writeNodeForDirectAddressingOrContinuous(FSTCompiler.UnCompiledNode<T> nodeIn, int maxBytesPerArcWithoutLabel, int labelRange, boolean continuous)
    • writePresenceBits

      private void writePresenceBits(FSTCompiler.UnCompiledNode<T> nodeIn)
    • freezeTail

      private void freezeTail(int prefixLenPlus1) throws IOException
      Throws:
      IOException
    • add

      public void add(IntsRef input, T output) throws IOException
      Add the next input/output pair. The provided input must be sorted after the previous one according to IntsRef.compareTo(org.apache.lucene.util.IntsRef). It's also OK to add the same input twice in a row with different outputs, as long as Outputs implements the Outputs.merge(T, T) method. Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not. So if your outputs are changeable (eg ByteSequenceOutputs or IntSequenceOutputs) then you cannot reuse across calls.
      Throws:
      IOException
    • setEmptyOutput

      void setEmptyOutput(T v)
    • finish

      void finish(long newStartNode)
    • validOutput

      private boolean validOutput(T output)
    • compile

      public FST.FSTMetadata<T> compile() throws IOException
      Returns the metadata of the final FST. NOTE: this will return null if nothing is accepted by the FST themselves.

      To create the FST, you need to:

      - If a FSTReader DataOutput was used, such as the one returned by getOnHeapReaderWriter(int)

           fstMetadata = fstCompiler.compile();
           fst = FST.fromFSTReader(fstMetadata, fstCompiler.getFSTReader());
       

      - If a non-FSTReader DataOutput was used, such as IndexOutput, you need to first create the corresponding DataInput, such as IndexInput then pass it to the FST construct

           fstMetadata = fstCompiler.compile();
           fst = new FST<>(fstMetadata, dataInput, new OffHeapFSTStore());
       
      Throws:
      IOException
    • fstRamBytesUsed

      public long fstRamBytesUsed()
    • fstSizeInBytes

      public long fstSizeInBytes()