public class WarcHeader extends Object
| Modifier and Type | Field and Description |
|---|---|
| boolean | bMagicIdentified: Was "WARC/" identified while looking for the version string. |
| boolean | bMandatoryMissing: Is the header missing one of the mandatory headers. |
| boolean | bValidVersion: Is the version recognized. |
| boolean | bValidVersionFormat: Is the version format valid. |
| boolean | bVersionParsed: Did the version string include between 2 and 4 substrings delimited by ".". |
| Long | contentLength: Content-Length converted to a Long object, if valid. |
| String | contentLengthStr: Content-Length field string value. |
| ContentType | contentType: Content-Type converted to a ContentType object, if valid. |
| String | contentTypeStr: Content-Type field string value. |
| protected Diagnostics<Diagnosis> | diagnostics: Diagnostics used to report diagnoses. |
| protected WarcFieldParsers | fieldParsers: WARC field parser used. |
| byte[] | headerBytes: Raw WARC header byte array. |
| protected ByteArrayOutputStream | headerBytesOut: Raw WARC header output stream. |
| protected List<HeaderLine> | headerList: List of parsed header fields. |
| protected Map<String,HeaderLine> | headerMap: Map of parsed header fields. |
| int | major: Major version number from WARC header. |
| int | minor: Minor version number from WARC header. |
| protected WarcReader | reader: Associated WarcReader context. |
| protected boolean[] | seen: Array used for duplicate header detection. |
| protected long | startOffset: WARC record starting offset relative to the source WARC file input stream. |
| static boolean | URI_LTGT: A URI enclosed in <> characters. |
| static boolean | URI_NAKED: A URI without enclosing <> characters. |
| protected UriProfile | uriProfile: URI profile. |
| int[] | versionArr: Array based on the version string split by the "." delimiter and converted to integers. |
| String | versionStr: Raw version string. |
| WarcDigest | warcBlockDigest: WARC-Block-Digest converted to a WarcDigest object, if valid. |
| String | warcBlockDigestStr: WARC-Block-Digest field string value. |
| List<WarcConcurrentTo> | warcConcurrentToList: List of WARC-Concurrent-To field string values and converted URI objects, if valid. |
| Date | warcDate: WARC-Date converted to a Date object, if valid. |
| protected DateFormat | warcDateFormat: WARC DateFormat as specified by the WARC ISO standard. |
| String | warcDateStr: WARC-Date field string value. |
| String | warcFilename: WARC-Filename field string value. |
| ContentType | warcIdentifiedPayloadType: WARC-Identified-Payload-Type converted to a ContentType object, if valid. |
| String | warcIdentifiedPayloadTypeStr: WARC-Identified-Payload-Type field string value. |
| InetAddress | warcInetAddress: WARC-IP-Address converted to an InetAddress object, if valid. |
| String | warcIpAddress: WARC-IP-Address field string value. |
| WarcDigest | warcPayloadDigest: WARC-Payload-Digest converted to a WarcDigest object, if valid. |
| String | warcPayloadDigestStr: WARC-Payload-Digest field string value. |
| Integer | warcProfileIdx: WARC-Profile converted to an integer id, if valid. |
| String | warcProfileStr: WARC-Profile field string value. |
| Uri | warcProfileUri: WARC-Profile field converted to a Uri object, if valid. |
| String | warcRecordIdStr: WARC-Record-Id field string value. |
| Uri | warcRecordIdUri: WARC-Record-Id converted to a Uri object, if valid. |
| Date | warcRefersToDate: WARC-Refers-To-Date converted to a Date object, if valid. |
| String | warcRefersToDateStr: WARC-Refers-To-Date field string value. |
| String | warcRefersToStr: WARC-Refers-To field string value. |
| String | warcRefersToTargetUriStr: WARC-Refers-To-Target-URI field string value. |
| Uri | warcRefersToTargetUriUri: WARC-Refers-To-Target-URI converted to a Uri object, if valid. |
| Uri | warcRefersToUri: WARC-Refers-To converted to a Uri object, if valid. |
| Integer | warcSegmentNumber: WARC-Segment-Number converted to an Integer object, if valid. |
| String | warcSegmentNumberStr: WARC-Segment-Number field string value. |
| String | warcSegmentOriginIdStr: WARC-Segment-Origin-Id field string value. |
| Uri | warcSegmentOriginIdUrl: WARC-Segment-Origin-Id converted to a Uri object, if valid. |
| Long | warcSegmentTotalLength: WARC-Segment-Total-Length converted to a Long object, if valid. |
| String | warcSegmentTotalLengthStr: WARC-Segment-Total-Length field string value. |
| protected UriProfile | warcTargetUriProfile: WARC-Target-URI profile. |
| String | warcTargetUriStr: WARC-Target-URI field string value. |
| Uri | warcTargetUriUri: WARC-Target-URI converted to a Uri object, if valid. |
| Integer | warcTruncatedIdx: WARC-Truncated converted to an integer id, if valid. |
| String | warcTruncatedStr: WARC-Truncated field string value. |
| Integer | warcTypeIdx: WARC-Type converted to an integer id, if identified. |
| String | warcTypeStr: WARC-Type field string value. |
| String | warcWarcinfoIdStr: WARC-Warcinfo-Id field string value. |
| Uri | warcWarcinfoIdUri: WARC-Warcinfo-Id converted to a Uri object, if valid. |
| Modifier | Constructor and Description |
|---|---|
| protected | WarcHeader(): Non-public constructor to allow unit testing. |
| Modifier and Type | Method and Description |
|---|---|
| protected void | addErrorDiagnosis(DiagnosisType type, String entity, String... information): Add an error diagnosis of the given type on a specific entity with optional extra information. |
| protected void | addHeader(HeaderLine headerLine): Identify a (WARC) header name, validate the value and set the header. |
| HeaderLine | addHeader(String fieldName, ContentType contentTypeFieldValue, String fieldValueStr): Add a Content-Type header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, Date dateFieldValue, String fieldValueStr): Add a Date header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, InetAddress inetAddrFieldValue, String fieldValueStr): Add an InetAddress header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, Integer integerFieldValue, String fieldValueStr): Add an Integer header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, Long longFieldValue, String fieldValueStr): Add a Long header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, String fieldValue): Add a String header using the supplied string and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, String fieldValueStr, int dt, Integer integerFieldValue, Long longFieldValue, WarcDigest digestFieldValue, ContentType contentTypeFieldValue, Date dateFieldValue, InetAddress inetAddrFieldValue, Uri uriFieldValue): Add a header with the supplied field name, data type and value and return a HeaderLine corresponding to how the header will be read. |
| HeaderLine | addHeader(String fieldName, Uri uriFieldValue, String fieldValueStr): Add a URI header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| HeaderLine | addHeader(String fieldName, WarcDigest digestFieldValue, String fieldValueStr): Add a Digest header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read. |
| protected void | addWarningDiagnosis(DiagnosisType type, String entity, String... information): Add a warning diagnosis of the given type on a specific entity with optional extra information. |
| protected void | checkFieldPolicy(int recordType, int fieldType, Object fieldObj, String valueStr): Given a WARC record type and a WARC field, look up the policy in a matrix built from the WARC ISO standard. |
| protected void | checkFields(): Validate the WARC header relative to the WARC-Type and according to the WARC ISO standard. |
| HeaderLine | getHeader(String field): Get a header line structure or null, if no header line structure is stored with the given header name. |
| List<HeaderLine> | getHeaderList(): Get a List of all the headers found during parsing. |
| long | getStartOffset(): Returns the starting offset of the record in the containing WARC. |
| static WarcHeader | initHeader(WarcReader reader, long startOffset, Diagnostics<Diagnosis> diagnostics): Create and initialize a new WarcHeader for reading. |
| static WarcHeader | initHeader(WarcWriter writer, Diagnostics<Diagnosis> diagnostics): Create and initialize a new WarcHeader for writing. |
| boolean | parseHeader(ByteCountingPushBackInputStream in): Try to parse a WARC header and return a boolean indicating whether parsing succeeded. |
| protected void | parseHeaders(ByteCountingPushBackInputStream in): Reads WARC header lines one line at a time until an empty line is encountered. |
| protected boolean | parseVersion(ByteCountingPushBackInputStream in): Looks forward in the input stream for a valid WARC version line. |
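
Most rows in the method summary are typed addHeader overloads that take both a string value and an object value. The method details document three cases: both non-null (used as is), string null (the object's toString is used), and object null (the string is parsed and validated). The following is a minimal sketch of those rules for a Long-valued field. HeaderValueSketch, Resolved and resolveLong are hypothetical names for illustration and are not part of the WarcHeader API; diagnostics reporting is omitted, and the behaviour when both values are null is an assumption.

```java
/** Hypothetical illustration of the documented null-handling rules; not library code. */
final class HeaderValueSketch {

    /** String form and object form of one Long-valued header field. */
    static final class Resolved {
        final String str;  // value as it would appear in the header line
        final Long value;  // typed value, or null if the string did not parse

        Resolved(String str, Long value) {
            this.str = str;
            this.value = value;
        }
    }

    static Resolved resolveLong(Long value, String str) {
        if (value != null && str != null) {
            return new Resolved(str, value);                 // both supplied: used as is
        }
        if (value != null) {
            return new Resolved(value.toString(), value);    // string derived via toString
        }
        if (str != null) {
            try {
                return new Resolved(str, Long.valueOf(str)); // string parsed and validated
            } catch (NumberFormatException e) {
                return new Resolved(str, null);              // invalid string: no object
            }
        }
        return new Resolved(null, null); // both null: not documented, assumed here
    }

    public static void main(String[] args) {
        System.out.println(resolveLong(42L, null).str);      // string taken from the object
        System.out.println(resolveLong(null, "1024").value); // object parsed from the string
        System.out.println(resolveLong(null, "abc").value);  // unparseable string leaves no object
    }
}
```

Keeping the original string next to the parsed object mirrors the paired fields of this class, such as contentLengthStr and contentLength.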
public static final boolean URI_LTGT
public static final boolean URI_NAKED
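
URI_LTGT and URI_NAKED distinguish a URI written inside <> characters from one written without them. As a rough sketch of what that distinction means for a parser, the code below accepts either form depending on a boolean flag. UriFormSketch and its parse method are hypothetical, java.net.URI stands in for the library's own Uri type, and the flag is a plain boolean, not the actual constant values.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch: parse a URI that is either enclosed in <> or written bare. */
final class UriFormSketch {

    /** Returns the parsed URI, or null if the value does not match the expected form. */
    static URI parse(String value, boolean bracketed) {
        if (value == null) {
            return null;
        }
        String s = value;
        if (bracketed) {
            // The enclosing <> characters are required and then stripped.
            if (s.length() < 2 || !s.startsWith("<") || !s.endsWith(">")) {
                return null;
            }
            s = s.substring(1, s.length() - 1);
        }
        try {
            return new URI(s); // a bare value that still has <> fails here
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<urn:uuid:12345678-1234-1234-1234-123456789abc>", true));
        System.out.println(parse("http://example.org/", false));
    }
}
```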
protected WarcReader reader
protected Diagnostics<Diagnosis> diagnostics
protected UriProfile warcTargetUriProfile
protected UriProfile uriProfile
protected WarcFieldParsers fieldParsers
protected DateFormat warcDateFormat
WARC DateFormat as specified by the WARC ISO standard.
protected long startOffset
public boolean bMagicIdentified
public boolean bVersionParsed
public boolean bValidVersionFormat
public boolean bValidVersion
public String versionStr
public int[] versionArr
public int major
public int minor
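
The version-related fields above (bMagicIdentified through minor) record how far parsing of the version line got. The sketch below shows one plausible way such flags could be filled in from a single line. VersionSketch is a hypothetical class, the mapping of each check to each flag is an assumption based on the field descriptions, and treating only 1.0 as a recognized version is an assumption made to keep the example short.

```java
/** Hypothetical sketch of version-line parsing; flag semantics are assumed from the field docs. */
final class VersionSketch {
    boolean bMagicIdentified;
    boolean bVersionParsed;
    boolean bValidVersionFormat;
    boolean bValidVersion;
    String versionStr;
    int[] versionArr;
    int major = -1;
    int minor = -1;

    static VersionSketch parse(String line) {
        VersionSketch v = new VersionSketch();
        if (line == null || !line.startsWith("WARC/")) {
            return v; // magic string not found
        }
        v.bMagicIdentified = true;
        v.versionStr = line.substring("WARC/".length());

        // Between 2 and 4 substrings delimited by "." are required.
        String[] parts = v.versionStr.split("\\.", -1);
        if (parts.length < 2 || parts.length > 4) {
            return v;
        }
        v.bVersionParsed = true;

        // Convert each substring to an integer; any failure makes the format invalid.
        v.versionArr = new int[parts.length];
        boolean numeric = true;
        for (int i = 0; i < parts.length; i++) {
            try {
                v.versionArr[i] = Integer.parseInt(parts[i]);
                if (v.versionArr[i] < 0) {
                    numeric = false;
                }
            } catch (NumberFormatException e) {
                v.versionArr[i] = -1;
                numeric = false;
            }
        }
        v.bValidVersionFormat = numeric;
        if (numeric) {
            v.major = v.versionArr[0];
            v.minor = v.versionArr[1];
            v.bValidVersion = (v.major == 1 && v.minor == 0); // assumption: only 1.0 recognized
        }
        return v;
    }

    public static void main(String[] args) {
        VersionSketch v = parse("WARC/1.0");
        System.out.println(v.major + "." + v.minor + " recognized=" + v.bValidVersion);
    }
}
```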
protected boolean[] seen
public boolean bMandatoryMissing
public String warcTypeStr
public Integer warcTypeIdx
public String warcFilename
public String warcRecordIdStr
public Uri warcRecordIdUri
WARC-Record-Id converted to a Uri object, if valid.
public String warcDateStr
public String contentLengthStr
public Long contentLength
Content-Length converted to a Long object, if valid.
public String contentTypeStr
public ContentType contentType
Content-Type converted to a ContentType object, if valid.
public String warcTruncatedStr
public Integer warcTruncatedIdx
public String warcIpAddress
public InetAddress warcInetAddress
WARC-IP-Address converted to an InetAddress object, if valid.
public List<WarcConcurrentTo> warcConcurrentToList
List of WARC-Concurrent-To field string values and converted URI objects, if valid.
public String warcRefersToStr
public Uri warcRefersToUri
WARC-Refers-To converted to a Uri object, if valid.
public String warcTargetUriStr
public Uri warcTargetUriUri
WARC-Target-URI converted to a Uri object, if valid.
public String warcWarcinfoIdStr
public Uri warcWarcinfoIdUri
WARC-Warcinfo-Id converted to a Uri object, if valid.
public String warcBlockDigestStr
public WarcDigest warcBlockDigest
WARC-Block-Digest converted to a WarcDigest object, if valid.
public String warcPayloadDigestStr
public WarcDigest warcPayloadDigest
WARC-Payload-Digest converted to a WarcDigest object, if valid.
public String warcIdentifiedPayloadTypeStr
public ContentType warcIdentifiedPayloadType
WARC-Identified-Payload-Type converted to a ContentType object, if valid.
public String warcProfileStr
public Uri warcProfileUri
WARC-Profile field converted to a Uri object, if valid. (revisit record only)
public Integer warcProfileIdx
public String warcSegmentNumberStr
public Integer warcSegmentNumber
WARC-Segment-Number converted to an Integer object, if valid.
public String warcSegmentOriginIdStr
public Uri warcSegmentOriginIdUrl
WARC-Segment-Origin-Id converted to a Uri object, if valid. (continuation record only)
public String warcSegmentTotalLengthStr
public Long warcSegmentTotalLength
WARC-Segment-Total-Length converted to a Long object, if valid. (continuation record only)
public String warcRefersToTargetUriStr
public Uri warcRefersToTargetUriUri
WARC-Refers-To-Target-URI converted to a Uri object, if valid.
public String warcRefersToDateStr
public Date warcRefersToDate
WARC-Refers-To-Date converted to a Date object, if valid.
protected ByteArrayOutputStream headerBytesOut
public byte[] headerBytes
protected List<HeaderLine> headerList
protected Map<String,HeaderLine> headerMap
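
headerList and headerMap hold the parsed header fields, in order of appearance and by name respectively, and parseHeaders is described as reading header lines one at a time until an empty line is encountered. The sketch below illustrates that loop under simplifying assumptions: a BufferedReader stands in for ByteCountingPushBackInputStream, Line stands in for HeaderLine, map keys are lower-cased, a repeated name simply overwrites the earlier map entry, and malformed or folded lines are not handled. None of these choices are confirmed by this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of reading "Name: value" lines into a list and a map. */
final class HeaderLinesSketch {

    /** Stand-in for a parsed header line. */
    static final class Line {
        final String name;
        final String value;

        Line(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    final List<Line> headerList = new ArrayList<>();
    final Map<String, Line> headerMap = new HashMap<>();

    /** Reads lines until an empty line or end of input. */
    void parseHeaders(BufferedReader in) throws IOException {
        String raw;
        while ((raw = in.readLine()) != null && !raw.isEmpty()) {
            int colon = raw.indexOf(':');
            if (colon <= 0) {
                continue; // malformed line: skipped here, a real parser would report it
            }
            Line line = new Line(raw.substring(0, colon).trim(),
                                 raw.substring(colon + 1).trim());
            headerList.add(line);                                    // keeps every line in order
            headerMap.put(line.name.toLowerCase(Locale.ROOT), line); // last occurrence wins
        }
    }

    /** Returns the stored line for the given name, or null if none is stored. */
    Line getHeader(String field) {
        return headerMap.get(field.toLowerCase(Locale.ROOT));
    }

    /** Convenience wrapper for parsing from an in-memory string. */
    static HeaderLinesSketch parse(String text) {
        HeaderLinesSketch h = new HeaderLinesSketch();
        try {
            h.parseHeaders(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return h;
    }

    public static void main(String[] args) {
        HeaderLinesSketch h = parse("WARC-Type: response\r\nContent-Length: 42\r\n\r\npayload");
        System.out.println(h.headerList.size() + " headers, Content-Length="
                + h.getHeader("Content-Length").value);
    }
}
```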
protected WarcHeader()
public static WarcHeader initHeader(WarcWriter writer, Diagnostics<Diagnosis> diagnostics)
Create and initialize a new WarcHeader for writing.
Parameters:
writer - writer which shall be used
diagnostics - diagnostics object used by writer
Returns:
WarcHeader prepared for writing

public static WarcHeader initHeader(WarcReader reader, long startOffset, Diagnostics<Diagnosis> diagnostics)
Create and initialize a new WarcHeader for reading.
Parameters:
reader - reader which shall be used
startOffset - start offset of header
diagnostics - diagnostics object used by reader
Returns:
WarcHeader prepared for reading

protected void addErrorDiagnosis(DiagnosisType type, String entity, String... information)
Add an error diagnosis of the given type on a specific entity with optional extra information.
Parameters:
type - diagnosis type
entity - entity examined
information - optional extra information

protected void addWarningDiagnosis(DiagnosisType type, String entity, String... information)
Add a warning diagnosis of the given type on a specific entity with optional extra information.
Parameters:
type - diagnosis type
entity - entity examined
information - optional extra information

public long getStartOffset()
Returns the starting offset of the record in the containing WARC.

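public boolean parseHeader(ByteCountingPushBackInputStream in) throws IOException
Try to parse a WARC header and return a boolean indicating whether parsing succeeded.
Parameters:
in - input stream with WARC data
Throws:
IOException - if an I/O exception occurs while parsing for a header

protected boolean parseVersion(ByteCountingPushBackInputStream in) throws IOException
Looks forward in the input stream for a valid WARC version line.
Parameters:
in - data input stream
Throws:
IOException - if an error occurs while reading version data

protected void parseHeaders(ByteCountingPushBackInputStream in) throws IOException
Reads WARC header lines one line at a time until an empty line is encountered.
Parameters:
in - header input stream
Throws:
IOException - if an error occurs while reading the WARC header

protected void addHeader(HeaderLine headerLine)
Identify a (WARC) header name, validate the value and set the header.
Parameters:
headerLine - the headerLine

public List<HeaderLine> getHeaderList()
Get a List of all the headers found during parsing.
Returns:
List of HeaderLine

public HeaderLine getHeader(String field)
Get a header line structure or null, if no header line structure is stored with the given header name.
Parameters:
field - header name
Returns:
HeaderLine structure or null

public HeaderLine addHeader(String fieldName, String fieldValue)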
Add a String header using the supplied string and return a HeaderLine object corresponding to how the header would be read.
Parameters:
fieldName - name of field to add
fieldValue - field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, Integer integerFieldValue, String fieldValueStr)
Add an Integer header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
integerFieldValue - Integer field value object
fieldValueStr - Integer field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, Long longFieldValue, String fieldValueStr)
Add a Long header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
longFieldValue - Long field value object
fieldValueStr - Long field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, WarcDigest digestFieldValue, String fieldValueStr)
Add a Digest header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
digestFieldValue - Digest field value object
fieldValueStr - Digest field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, ContentType contentTypeFieldValue, String fieldValueStr)
Add a Content-Type header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
contentTypeFieldValue - ContentType field value object
fieldValueStr - Content-Type field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, Date dateFieldValue, String fieldValueStr)
Add a Date header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
dateFieldValue - Date field value object
fieldValueStr - Date field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, InetAddress inetAddrFieldValue, String fieldValueStr)
Add an InetAddress header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
inetAddrFieldValue - InetAddress field value object
fieldValueStr - IP-Address field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, Uri uriFieldValue, String fieldValueStr)
Add a URI header using the supplied string and object values and return a HeaderLine object corresponding to how the header would be read.
If both string and object values are not null, they are used as is.
If the string value is null and the object is not null, the object's toString method is called.
If the object is null and the string is not null, the string is parsed and validated resulting in an object, if valid.
Parameters:
fieldName - name of field to add
uriFieldValue - URI field value object
fieldValueStr - URI field value string
Returns:
HeaderLine object corresponding to what would have been read

public HeaderLine addHeader(String fieldName, String fieldValueStr, int dt, Integer integerFieldValue, Long longFieldValue, WarcDigest digestFieldValue, ContentType contentTypeFieldValue, Date dateFieldValue, InetAddress inetAddrFieldValue, Uri uriFieldValue)
Add a header with the supplied field name, data type and value and return a HeaderLine corresponding to how the header will be read. The data type is validated against the field data type. The values used are the field value string and the parameter corresponding to the data type.
Parameters:
fieldName - header field name
fieldValueStr - field value in string form
dt - data type of the field value string when converted to an object
integerFieldValue - Integer object field value
longFieldValue - Long object field value
digestFieldValue - Digest object field value
contentTypeFieldValue - ContentType object field value
dateFieldValue - Date object field value
inetAddrFieldValue - InetAddress object field value
uriFieldValue - URI object field value
Returns:
HeaderLine object corresponding to what would have been read

protected void checkFields()
Validate the WARC header relative to the WARC-Type and according to the WARC ISO standard.
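
checkFields and checkFieldPolicy validate header fields against a policy matrix indexed by record type and field, built from the WARC ISO standard. The sketch below shows the general shape of such a lookup. FieldPolicySketch, the id constants, the Policy enum and the returned message strings are all hypothetical. The matrix covers only two fields that this page marks as "revisit record only" and "continuation record only"; treating them as mandatory there and disallowed elsewhere is an illustrative assumption, not a transcription of the standard.

```java
/** Hypothetical sketch of a record-type by field-type policy lookup. */
final class FieldPolicySketch {
    // Hypothetical record type ids (row index into the matrix).
    static final int RT_RESPONSE = 0;
    static final int RT_REVISIT = 1;
    static final int RT_CONTINUATION = 2;

    // Hypothetical field type ids (column index into the matrix).
    static final int FT_PROFILE = 0;
    static final int FT_SEGMENT_ORIGIN_ID = 1;

    enum Policy { MANDATORY, MAY, SHALL_NOT }

    // Illustrative entries only; a real matrix would cover every type and field.
    static final Policy[][] MATRIX = {
        /* response     */ { Policy.SHALL_NOT, Policy.SHALL_NOT },
        /* revisit      */ { Policy.MANDATORY, Policy.SHALL_NOT },
        /* continuation */ { Policy.SHALL_NOT, Policy.MANDATORY },
    };

    /** Returns a diagnosis message, or null if the field satisfies the policy. */
    static String checkFieldPolicy(int recordType, int fieldType, Object fieldObj) {
        Policy policy = MATRIX[recordType][fieldType];
        if (policy == Policy.MANDATORY && fieldObj == null) {
            return "required field missing";
        }
        if (policy == Policy.SHALL_NOT && fieldObj != null) {
            return "field not allowed for this record type";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(checkFieldPolicy(RT_REVISIT, FT_PROFILE, null));
        System.out.println(checkFieldPolicy(RT_RESPONSE, FT_PROFILE, "some-profile"));
        System.out.println(checkFieldPolicy(RT_CONTINUATION, FT_SEGMENT_ORIGIN_ID, "some-id"));
    }
}
```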
protected void checkFieldPolicy(int recordType, int fieldType, Object fieldObj, String valueStr)
Given a WARC record type and a WARC field, look up the policy in a matrix built from the WARC ISO standard.
Parameters:
recordType - WARC record type id
fieldType - WARC field type id
fieldObj - WARC field
valueStr - WARC raw field value

Copyright © 2011–2015. All rights reserved.