Package Torello.HTML

Interface HTMLPage.Parser

  • Enclosing class:
    HTMLPage
    Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    @FunctionalInterface
    public static interface HTMLPage.Parser
    A function-pointer / lambda-target that could be used to replace this library's current regular-expression based parser with a faster or more efficient one.

    This Functional Interface is identical to QuintFunction<A, B, C, D, E, X> in the 'Java.Additional' package, but adds the ability to throw an IOException. The ability to "swap parsers" is not a very important feature, unless one has identified a way to optimize past the abilities of the current parser, or desires something different altogether. This feature shall remain in place since it incurs essentially zero overhead. To see the actual parser code used by this package, view the documentation for class HTMLPage, and scroll to 'View Source Files'.

    If one desired, for instance, to ignore the debugging log-files feature, a replacement parser could simply ignore the three file-name parameters. However, the same result can be achieved in class HTMLPage, without replacing the parser, by invoking one of the methods where null is passed for those log file-names.
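
    A minimal sketch of a lambda that ignores the three file-name parameters. The `SketchParser` interface and the `IgnoreLogFilesSketch` class are hypothetical stand-ins that mirror the documented signature, with `String` in place of `HTMLNode` so the sketch compiles without the library. The white-space splitting is a placeholder only and is not how this package parses HTML.

```java
import java.io.IOException;
import java.util.Vector;

// Hypothetical stand-in that mirrors the shape of HTMLPage.Parser.
// The real interface returns Vector<HTMLNode>; String is used here only
// so that the sketch compiles without the library on the class-path.
@FunctionalInterface
interface SketchParser
{
    Vector<String> parse(
        CharSequence html, boolean eliminateHTMLTags,
        String rawHTMLFile, String matchesFile, String justTextFile
    ) throws IOException;
}

class IgnoreLogFilesSketch
{
    // A lambda that never touches the three log file-name parameters.
    // Splitting on white-space is a placeholder, not real HTML parsing.
    static final SketchParser NO_LOGS = (html, eliminateHTMLTags, raw, matches, text) ->
    {
        Vector<String> ret = new Vector<>();

        for (String token : html.toString().trim().split("\\s+"))
            if (!token.isEmpty()) ret.add(token);

        return ret;
    };
}
```

    Because the file-name arguments are never read, passing non-null names to such a parser produces no files on disk.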
    See Also:
    HTMLPage.parser


    • Method Summary

       
      @FunctionalInterface: (Lambda) Method
      Modifier and Type Method
      Vector<HTMLNode> parse​(CharSequence html, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
    • Method Detail

      • parse

        java.util.Vector<HTMLNode> parse(java.lang.CharSequence html,
                                         boolean eliminateHTMLTags,
                                         java.lang.String rawHTMLFile,
                                         java.lang.String matchesFile,
                                         java.lang.String justTextFile)
                                  throws java.io.IOException
        Parse html source-text into a Vector<HTMLNode>.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).

        If you implement your own parser and do not want to output such a file, you may ignore this parameter and skip this step.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vectors. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.

        As above, you may skip implementing this.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.

        As above, you may skip implementing this.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
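
        The null-handling contract described for 'rawHTMLFile' can be sketched as follows. This is an illustration under stated assumptions, not the package's parser: the class `RawFileSketch` is hypothetical, `Vector<String>` stands in for `Vector<HTMLNode>`, the UTF-8 encoding is an assumption, and the "parsing" step simply returns the whole input as one element.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Vector;

class RawFileSketch
{
    // Hypothetical method with the same parameter list as the documented
    // parse method, but returning Vector<String> instead of Vector<HTMLNode>.
    static Vector<String> parse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        )
        throws IOException
    {
        // Any CharSequence is accepted, and converted to a String first.
        String s = html.toString();

        // Save a copy of the input only when a file-name was provided.
        // A failure to write propagates to the caller as an IOException.
        // UTF-8 is an assumption made for this sketch.
        if (rawHTMLFile != null)
            Files.write(Paths.get(rawHTMLFile), s.getBytes(StandardCharsets.UTF_8));

        // matchesFile and justTextFile are deliberately ignored here,
        // which the parameter notes above explicitly permit.

        Vector<String> ret = new Vector<>();
        ret.add(s); // Placeholder: the entire input as a single element.
        return ret;
    }
}
```

        Declaring `throws IOException` on the method is what lets the file-writing call propagate its failure directly, which is the one difference the interface has from a plain five-argument function.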