public abstract class TrecDocParser
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class | TrecDocParser.ParsePathType: Types of trec parse paths. |
Modifier and Type | Field and Description |
---|---|
static TrecDocParser.ParsePathType | DEFAULT_PATH_TYPE: trec parser type used for unknown extensions |
Constructor and Description |
---|
TrecDocParser() |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | extract(java.lang.StringBuilder buf, java.lang.String startTag, java.lang.String endTag, int maxPos, java.lang.String[] noisePrefixes): Extract from buf the text of interest within specified tags. |
abstract DocData | parse(DocData docData, java.lang.String name, TrecContentSource trecSrc, java.lang.StringBuilder docBuf, TrecDocParser.ParsePathType pathType): Parse the text prepared in docBuf into a result DocData; no synchronization is required. |
static TrecDocParser.ParsePathType | pathType(java.nio.file.Path f): Compute the path type of a file by inspecting the name of the file and its parents. |
static java.lang.String | stripTags(java.lang.StringBuilder buf, int start): Strip tags from buf: each tag is replaced by a single blank. |
static java.lang.String | stripTags(java.lang.String buf, int start): Strip tags from input. |
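public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
trec parser type used for unknown extensions
public static TrecDocParser.ParsePathType pathType(java.nio.file.Path f)
Compute the path type of a file by inspecting the name of the file and its parents.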
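The lookup that pathType describes (inspect the file name, then each parent directory name) can be sketched as a standalone class. This is a minimal sketch and not the library's code: the class PathTypeSketch, the enum SketchPathType, its constant names, the choice of default, and the case-insensitive matching are all assumptions made for illustration, since this page does not list the ParsePathType values or the matching rule.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

final class PathTypeSketch {
    // Hypothetical stand-in for TrecDocParser.ParsePathType; constant names are assumed.
    enum SketchPathType { GOV2, FBIS, LATIMES }

    // Assumed fallback, playing the role of DEFAULT_PATH_TYPE.
    static final SketchPathType DEFAULT = SketchPathType.GOV2;

    // Walk from the file up through its parents; the first path element whose
    // name matches a type name (ignoring case) decides the result.
    static SketchPathType pathType(Path f) {
        for (Path p = f; p != null && p.getFileName() != null; p = p.getParent()) {
            String name = p.getFileName().toString().toUpperCase(Locale.ROOT);
            for (SketchPathType t : SketchPathType.values()) {
                if (t.name().equals(name)) {
                    return t;
                }
            }
        }
        return DEFAULT; // no path element matched a known type
    }

    public static void main(String[] args) {
        System.out.println(pathType(Paths.get("trec", "latimes", "la010189.gz"))); // LATIMES
        System.out.println(pathType(Paths.get("data", "unknown", "doc.txt")));     // GOV2
    }
}
```

In this sketch an unrecognized path falls back to the default rather than returning null, which mirrors the role the field summary gives to DEFAULT_PATH_TYPE.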
public abstract DocData parse(DocData docData, java.lang.String name, TrecContentSource trecSrc, java.lang.StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws java.io.IOException
Parse the text prepared in docBuf into a result DocData; no synchronization is required.
Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling trec content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown - may be used by parsers to alter their behavior according to the file path type.
Throws:
java.io.IOException
public static java.lang.String stripTags(java.lang.StringBuilder buf, int start)
Strip tags from buf: each tag is replaced by a single blank.
The input StringBuilder buf is unmodified.
public static java.lang.String stripTags(java.lang.String buf, int start)
Strip tags from input.
See Also:
stripTags(StringBuilder, int)
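The behavior described for stripTags (every tag replaced by a single blank, input left unmodified) can be illustrated with a standalone sketch. It assumes that a tag is any run of characters between < and >, and that start is the offset at which stripping begins, with the text before it dropped; neither detail is confirmed by this page. StripTagsSketch is a hypothetical class, not the library implementation.

```java
final class StripTagsSketch {
    // Replace each "<...>" run with one blank, starting at offset 'start'.
    // Assumption: text before 'start' is not part of the result.
    static String stripTags(String buf, int start) {
        String tail = start > 0 ? buf.substring(start) : buf;
        return tail.replaceAll("<[^>]*>", " ");
    }

    // StringBuilder overload: substring() copies, so 'buf' itself is never changed.
    static String stripTags(StringBuilder buf, int start) {
        return stripTags(buf.substring(start), 0);
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("a<b>c");
        System.out.println(stripTags(doc, 0)); // a c
        System.out.println(doc);               // a<b>c
    }
}
```

Because each tag becomes a blank rather than being deleted, words on either side of a tag stay separated, which matters when the result is later tokenized.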
public static java.lang.String extract(java.lang.StringBuilder buf, java.lang.String startTag, java.lang.String endTag, int maxPos, java.lang.String[] noisePrefixes)
Extract from buf the text of interest within specified tags.
Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0 sets a limit on start of text of interest
Copyright © 2000–2019 The Apache Software Foundation. All rights reserved.
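The tag-delimited lookup that extract describes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's method: ExtractSketch is a hypothetical class, the noisePrefixes parameter is left out because this page does not describe it, returning null when nothing is found is an assumption, and maxPos is applied only to the position of the start tag, following the parameter description above.

```java
final class ExtractSketch {
    // Return the trimmed text between startTag and endTag, or null if not found.
    // maxPos >= 0 limits where the start tag may appear; a negative value means no limit.
    static String extract(StringBuilder buf, String startTag, String endTag, int maxPos) {
        int start = buf.indexOf(startTag);
        if (start < 0 || (maxPos >= 0 && start >= maxPos)) {
            return null; // start tag missing or beyond the allowed position
        }
        start += startTag.length(); // skip over the start tag itself
        int end = buf.indexOf(endTag, start);
        if (end < 0) {
            return null; // no closing tag after the start tag
        }
        return buf.substring(start, end).trim();
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("<DOC><TITLE> Hello TREC </TITLE></DOC>");
        System.out.println(extract(doc, "<TITLE>", "</TITLE>", -1)); // Hello TREC
        System.out.println(extract(doc, "<TITLE>", "</TITLE>", 3));  // null
    }
}
```

A position limit of this kind lets a caller restrict the search to a document header, so that a tag appearing later in the body is not mistaken for metadata.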