Package Torello.HTML
Interface HTMLPage.Parser
- Enclosing class:
- HTMLPage
- Functional Interface:
- This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.
@FunctionalInterface public static interface HTMLPage.Parser
A function-pointer (lambda target) that can be used to replace this library's current regular-expression based parser with another implementation, for instance a faster one.
This Functional Interface is identical to QuintFunction<A, B, C, D, E, X> in the 'Java.Additional' package, but adds the ability to throw an IOException. Having the ability to "swap parsers" is not a very important feature, unless one has identified a way to optimize past the abilities of the current parser, or desires something different altogether. The feature remains in place because it incurs essentially zero overhead. To see the actual parser code used by this package, view the documentation for class HTMLPage, and scroll to 'View Source Files'.
A replacement parser may, for instance, skip the debugging log-files feature simply by ignoring the three file-name parameters. Note, however, that the same result is available without writing a parser: invoke one of the methods in class HTMLPage that accept those log file-names, and pass null for each of them.
- See Also:
HTMLPage.parser
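The shape of this interface can be illustrated with a minimal, self-contained sketch. Everything below is a stand-in written for illustration only: the nested Parser interface merely mirrors the signature documented on this page, and the HTMLNode and TextNode classes are simplified placeholders, not the library's real classes. The WHOLE_TEXT lambda is a hypothetical, deliberately trivial "parser" that ignores the three log file-names.

```java
import java.io.IOException;
import java.util.Vector;

class ParserLambdaSketch {
    // ASSUMPTION: simplified placeholders for the library's node classes.
    static class HTMLNode {
        final String str;
        HTMLNode(String str) { this.str = str; }
    }

    static class TextNode extends HTMLNode {
        TextNode(String str) { super(str); }
    }

    // ASSUMPTION: a local copy of the documented signature, so the sketch compiles alone.
    // Five inputs, one result, and permission to throw IOException.
    @FunctionalInterface
    interface Parser {
        Vector<HTMLNode> parse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        ) throws IOException;
    }

    // A trivial lambda target: returns the whole input as a single text node,
    // and ignores the flag and the three log file-names entirely.
    static final Parser WHOLE_TEXT =
        (html, eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile) -> {
            Vector<HTMLNode> ret = new Vector<>();
            ret.add(new TextNode(html.toString()));
            return ret;
        };

    public static void main(String[] args) throws IOException {
        Vector<HTMLNode> page = WHOLE_TEXT.parse("<p>hi</p>", false, null, null, null);
        System.out.println(page.size() + " node(s): " + page.get(0).str);
    }
}
```

Because the interface declares IOException, the lambda body may perform file output without wrapping checked exceptions, which a plain five-argument function type could not allow.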
Hi-Lited Source-Code:
- View Here: Torello/HTML/HTMLPage.java
File Size: 1,920 Bytes Line Count: 40 '\n' Characters Found
Method Detail
parse
java.util.Vector<HTMLNode> parse(java.lang.CharSequence html, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
Parse html source-text into a Vector<HTMLNode>.
- Parameters:
html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain the HTML that needs to be parsed and vectorized.
eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned instead.
rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw HTML discarded). If you implement your own parser and do not want to output such a file, you may ignore this parameter and skip this step.
matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the regular-expression match data will simply be discarded by the parser after use. As above, you may skip implementing this.
justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored. As above, you may skip implementing this.
- Returns:
A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed content provided by the input source.
- Throws:
java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or while writing output, if any).
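The parameter contract above can be sketched as follows. This is a naive illustration under stated assumptions, not the library's parser: the HTMLNode, TagNode and TextNode classes are simplified placeholders, the class name NaiveParserSketch is hypothetical, and the scanner simply treats anything between '<' and the next '>' as a tag (so comments are handled as tag-like nodes, and a '>' inside an attribute or comment would break it). It shows the null checks on the log file-names and the effect of eliminateHTMLTags.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Vector;

class NaiveParserSketch {
    // ASSUMPTION: simplified placeholders for the library's node classes.
    static class HTMLNode {
        final String str;
        HTMLNode(String str) { this.str = str; }
    }
    static class TagNode extends HTMLNode { TagNode(String s) { super(s); } }
    static class TextNode extends HTMLNode { TextNode(String s) { super(s); } }

    static Vector<HTMLNode> parse(
        CharSequence html, boolean eliminateHTMLTags,
        String rawHTMLFile, String matchesFile, String justTextFile
    ) throws IOException {
        String s = html.toString();

        // Save an identical copy of the input only when a file-name was supplied.
        if (rawHTMLFile != null) write(rawHTMLFile, s);

        Vector<HTMLNode> ret = new Vector<>();
        StringBuilder text = new StringBuilder();
        int pos = 0;

        while (pos < s.length()) {
            int open = s.indexOf('<', pos);
            int close = (open < 0) ? -1 : s.indexOf('>', open);

            if (close < 0) {
                // No further complete tag: the remainder is plain text.
                ret.add(new TextNode(s.substring(pos)));
                text.append(s, pos, s.length());
                break;
            }
            if (open > pos) {
                ret.add(new TextNode(s.substring(pos, open)));
                text.append(s, pos, open);
            }
            // Tag-like nodes are dropped when the caller asked for text only.
            if (!eliminateHTMLTags) ret.add(new TagNode(s.substring(open, close + 1)));
            pos = close + 1;
        }

        // matchesFile is deliberately ignored: this sketch produces no regex matches.
        if (justTextFile != null) write(justTextFile, text.toString());
        return ret;
    }

    private static void write(String fileName, String content) throws IOException {
        Files.write(Paths.get(fileName), content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        for (HTMLNode n : parse("<p>Hello</p>", false, null, null, null))
            System.out.println(n.getClass().getSimpleName() + ": " + n.str);
    }
}
```

Passing null for all three file-names, as in main, performs no disk output at all, which matches the documented behaviour for null parameters.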