public abstract class WarcReader extends Object implements Closeable
| Modifier and Type | Field and Description |
|---|---|
| protected boolean | bBlockDigest - Block digest enabled/disabled. |
| protected boolean | bIsCompliant - Compliance status for records parsed up to now. |
| protected String | blockDigestAlgorithm - Default block digest algorithm to use if none is present in the record. |
| protected String | blockDigestEncoding - Default encoding scheme used to encode the block digest into a string, if none is detected from the record. |
| protected boolean | bPayloadDigest - Payload digest enabled/disabled. |
| protected long | consumed - Number of bytes consumed by this reader. |
| protected WarcRecord | currentRecord - Current WARC record object. |
| Diagnostics<Diagnosis> | diagnostics - Reader-level errors and warnings, or those raised when no record is available. |
| protected int | errors - Aggregate number of errors encountered while parsing. |
| protected WarcFieldParsers | fieldParsers - WARC field parser used. |
| protected HeaderLineReader | headerLineReader - Header line reader used to read the WARC headers. |
| protected Exception | iteratorExceptionThrown - Exception thrown while using the iterator. |
| protected HeaderLineReader | lineReader - Line reader used to read version lines. |
| protected String | payloadDigestAlgorithm - Default payload digest algorithm to use if none is present in the record. |
| protected String | payloadDigestEncoding - Default encoding scheme used to encode the payload digest into a string, if none is detected from the record. |
| protected int | payloadHeaderMaxSize - Max size allowed for a payload header. |
| protected int | recordHeaderMaxSize - Max size allowed for a record header. |
| protected int | records - Number of records parsed. |
| protected UriProfile | uriProfile - URI profile. |
| protected UriProfile | warcTargetUriProfile - WARC-Target-URI profile. |
| protected int | warnings - Aggregate number of warnings encountered while parsing. |

| Constructor and Description |
|---|
| WarcReader() |

| Modifier and Type | Method and Description |
|---|---|
| abstract void | close() - Close current record resource(s) and input stream(s). |
| String | getBlockDigestAlgorithm() - Get the default block digest algorithm. |
| boolean | getBlockDigestEnabled() - Get the reader's block digest on/off status. |
| String | getBlockDigestEncoding() - Get the default block digest encoding scheme. |
| abstract long | getConsumed() - Get the number of bytes consumed by this reader. |
| Exception | getIteratorExceptionThrown() - Get the exception thrown in the iterator, if any, or null. |
| abstract WarcRecord | getNextRecord() - Parses and gets the next record. |
| abstract WarcRecord | getNextRecordFrom(InputStream in, long offset) - Parses and gets the next record from an InputStream. |
| abstract WarcRecord | getNextRecordFrom(InputStream in, long offset, int buffer_size) - Parses and gets the next record from an InputStream wrapped by a BufferedInputStream. |
| abstract long | getOffset() - Get the current offset in the WARC InputStream. |
| String | getPayloadDigestAlgorithm() - Get the default payload digest algorithm. |
| boolean | getPayloadDigestEnabled() - Get the reader's payload digest on/off status. |
| String | getPayloadDigestEncoding() - Get the default payload digest encoding scheme. |
| int | getPayloadHeaderMaxSize() - Get the max size allowed for a payload header. |
| int | getRecordHeaderMaxSize() - Get the max size allowed for a record header. |
| abstract long | getStartOffset() - Get the offset of the current WARC record, or -1 if none has been read. |
| UriProfile | getUriProfile() - Get the URI profile used to validate URIs. |
| UriProfile | getWarcTargetUriProfile() - Get the URI profile used to validate WARC-Target-URI values. |
| protected void | init() - Method used to initialize a reader's internal state. |
| boolean | isCompliant() - Returns a boolean indicating whether all records parsed so far are compliant. |
| abstract boolean | isCompressed() - Indicates whether this reader assumes GZip-compressed input. |
| Iterator<WarcRecord> | iterator() - Returns an Iterator over the records as they are being parsed. |
| protected abstract void | recordClosed() - Callback method called when the payload has been processed. |
| void | reset() - Reset reader for reuse. |
| boolean | setBlockDigestAlgorithm(String digestAlgorithm) - Tries to set the default block digest algorithm and returns a boolean indicating whether the algorithm was accepted. |
| void | setBlockDigestEnabled(boolean enabled) - Set the reader's block digest on/off status. |
| void | setBlockDigestEncoding(String encodingScheme) - Set the default block digest encoding scheme. |
| boolean | setPayloadDigestAlgorithm(String digestAlgorithm) - Tries to set the default payload digest algorithm and returns a boolean indicating whether the algorithm was accepted. |
| void | setPayloadDigestEnabled(boolean enabled) - Set the reader's payload digest on/off status. |
| void | setPayloadDigestEncoding(String encodingScheme) - Set the default payload digest encoding scheme. |
| void | setPayloadHeaderMaxSize(int size) - Set the max size allowed for a payload header. |
| void | setRecordHeaderMaxSize(int size) - Set the max size allowed for a record header. |
| void | setUriProfile(UriProfile uriProfile) - Set the URI profile used to validate URIs. |
| void | setWarcTargetUriProfile(UriProfile uriProfile) - Set the URI profile used to validate WARC-Target-URI values. |
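
The iterator() and getIteratorExceptionThrown() entries above describe a common pattern: Iterator methods cannot throw checked exceptions, so the reader records the exception and exposes it afterwards. The following is a minimal, self-contained sketch of that pattern. LineRecordReader and its line-based "records" are hypothetical stand-ins, not JWAT classes, and stopping at the first failure is an assumption of this sketch rather than documented WarcReader behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a record reader; it is NOT the JWAT WarcReader.
class LineRecordReader {
    private final BufferedReader in;
    // Mirrors the field name listed in the summary above.
    private Exception iteratorExceptionThrown;

    LineRecordReader(BufferedReader in) {
        this.in = in;
    }

    // Like getNextRecord(), this may throw a checked IOException.
    // Assumption for the demo: a line starting with '!' is a malformed record.
    String getNextRecord() throws IOException {
        String line = in.readLine();
        if (line != null && line.startsWith("!")) {
            throw new IOException("malformed record: " + line);
        }
        return line; // null signals end of input
    }

    Exception getIteratorExceptionThrown() {
        return iteratorExceptionThrown;
    }

    Iterator<String> iterator() {
        return new Iterator<String>() {
            private String next;
            private boolean done;

            @Override
            public boolean hasNext() {
                if (next == null && !done) {
                    try {
                        next = getNextRecord();
                    } catch (IOException e) {
                        // hasNext() cannot rethrow a checked exception, so store it.
                        iteratorExceptionThrown = e;
                    }
                    if (next == null) {
                        done = true; // end of input, or stopped by an exception
                    }
                }
                return next != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String record = next;
                next = null;
                return record;
            }
        };
    }

    public static void main(String[] args) {
        LineRecordReader reader = new LineRecordReader(
                new BufferedReader(new StringReader("a\nb\n!bad\nc\n")));
        Iterator<String> it = reader.iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        Exception e = reader.getIteratorExceptionThrown();
        if (e != null) {
            System.out.println("stopped early: " + e.getMessage());
        }
    }
}
```

After an iterator loop ends, checking getIteratorExceptionThrown() is what distinguishes a clean end of input from a parse failure.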
protected UriProfile warcTargetUriProfile
protected UriProfile uriProfile
protected String blockDigestAlgorithm
protected String blockDigestEncoding
protected boolean bPayloadDigest
protected String payloadDigestAlgorithm
protected String payloadDigestEncoding
protected boolean bBlockDigest
protected int recordHeaderMaxSize
protected int payloadHeaderMaxSize
protected HeaderLineReader lineReader
protected HeaderLineReader headerLineReader
protected WarcFieldParsers fieldParsers
public final Diagnostics<Diagnosis> diagnostics
protected boolean bIsCompliant
protected long consumed
protected int records
protected int errors
protected int warnings
protected WarcRecord currentRecord
protected Exception iteratorExceptionThrown
public WarcReader()
protected void init()
public void reset()
public boolean isCompliant()
public abstract boolean isCompressed()
public void setWarcTargetUriProfile(UriProfile uriProfile)
uriProfile - URI profile to use
public UriProfile getWarcTargetUriProfile()
public void setUriProfile(UriProfile uriProfile)
uriProfile - URI profile to use
public UriProfile getUriProfile()
public boolean getBlockDigestEnabled()
public void setBlockDigestEnabled(boolean enabled)
enabled - boolean indicating block digest on/off
public boolean getPayloadDigestEnabled()
public void setPayloadDigestEnabled(boolean enabled)
enabled - boolean indicating payload digest on/off
public String getBlockDigestAlgorithm()
public boolean setBlockDigestAlgorithm(String digestAlgorithm)
digestAlgorithm - block digest algorithm (null means no default block digest algorithm is selected)
public String getPayloadDigestAlgorithm()
public boolean setPayloadDigestAlgorithm(String digestAlgorithm)
digestAlgorithm - payload digest algorithm (null means no default payload digest algorithm is selected)
public String getBlockDigestEncoding()
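
The boolean returned by setBlockDigestAlgorithm and setPayloadDigestAlgorithm reports whether the algorithm was accepted. One plausible way to implement such a check, sketched here as an assumption and not taken from the JWAT source, is to ask the JVM for a MessageDigest with that name. DigestDefaults is a hypothetical class, and treating null as an accepted value (meaning "no default selected") is likewise an assumption based on the parameter notes above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical holder for a default digest algorithm; NOT the JWAT implementation.
class DigestDefaults {
    // null means no default block digest algorithm is selected.
    private String blockDigestAlgorithm;

    // Returns true if the algorithm was accepted, false otherwise.
    boolean setBlockDigestAlgorithm(String digestAlgorithm) {
        if (digestAlgorithm == null) {
            blockDigestAlgorithm = null;
            return true; // assumption: clearing the default counts as accepted
        }
        try {
            // Assumption: "accepted" means the JVM can supply this digest.
            MessageDigest.getInstance(digestAlgorithm);
            blockDigestAlgorithm = digestAlgorithm;
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false; // unknown algorithm: the previous default is kept
        }
    }

    String getBlockDigestAlgorithm() {
        return blockDigestAlgorithm;
    }

    public static void main(String[] args) {
        DigestDefaults defaults = new DigestDefaults();
        System.out.println(defaults.setBlockDigestAlgorithm("SHA-1"));
        System.out.println(defaults.setBlockDigestAlgorithm("no-such-digest"));
        System.out.println(defaults.getBlockDigestAlgorithm());
    }
}
```

Callers of a setter shaped like this should check the return value, since a rejected name leaves the earlier default in place.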
public void setBlockDigestEncoding(String encodingScheme)
encodingScheme - encoding scheme (null means default block digest is not encoded)
public String getPayloadDigestEncoding()
public void setPayloadDigestEncoding(String encodingScheme)
encodingScheme - encoding scheme (null means default payload digest is not encoded)
public int getRecordHeaderMaxSize()
public void setRecordHeaderMaxSize(int size)
size - max size allowed
public int getPayloadHeaderMaxSize()
public void setPayloadHeaderMaxSize(int size)
size - max size allowed
public abstract void close()
close in interface Closeable
close in interface AutoCloseable
protected abstract void recordClosed()
public abstract long getStartOffset()
public abstract long getOffset()
Get the current offset in the WARC InputStream.
public abstract long getConsumed()
public abstract WarcRecord getNextRecord() throws IOException
IOException - I/O exception in parsing process
public abstract WarcRecord getNextRecordFrom(InputStream in, long offset) throws IOException
Parses and gets the next record from an InputStream. This method is mainly for random access use since there are serious side-effects involved in using multiple PushbackInputStream instances.
in - InputStream used to read next record
offset - offset provided by caller
IOException - I/O exception in parsing process
public abstract WarcRecord getNextRecordFrom(InputStream in, long offset, int buffer_size) throws IOException
Parses and gets the next record from an InputStream wrapped by a BufferedInputStream. This method is mainly for random access use since there are serious side-effects involved in using multiple PushbackInputStream instances.
in - InputStream used to read next record
offset - offset provided by caller
buffer_size - buffer size to use
IOException - I/O exception in parsing process
public Exception getIteratorExceptionThrown()
public Iterator<WarcRecord> iterator()
Returns an Iterator over the records as they are being parsed. Any exception thrown during parsing is accessible through the getIteratorExceptionThrown method.
Returns: Iterator over the records
Copyright © 2011–2015. All rights reserved.