Interface ArticleGet

  • All Superinterfaces:
    java.io.Serializable

  • Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    @FunctionalInterface
    public interface ArticleGet
    extends java.io.Serializable
    A function-pointer / lambda target for extracting an article's content from the web-page whence it was downloaded; it also includes several static builder methods for the most common means of finding the HTML-Tags that wrap article-HTML on news-media websites.

    The purpose of this Functional Interface is to "get" an article's content-body out of the complete HTML page in which it resides. For all intents and purposes, this is a trivial coding problem, but one that is different for just about every news-site on the internet.

    Generally, all that is needed to implement an instance of class ArticleGet is to provide the needed parameters to one of several factory-methods in this class. This class contains several 'static' methods named 'usual(...)' that accept typical NodeSearch-Package parameters for extracting a partial Web-Page out of a complete one.

    Primarily, the use of class 'ArticleGet' is such that after a list of News-Article URL's has been built from an online News-Based Web-Site, those Articles can be processed quickly. This is accomplished by immediately removing all extraneous HTML and concentrating only on the Article and its Header.

    For example, on Yahoo! News, downloading any one of the myriad Yahoo! Articles, one will encounter lists upon lists of "related news", advertisements, links to other sections of the site, and even User-Comments. The Article-Body itself - usually including the Title, Author and Story-Photos - is easily retrieved by looking for the HTML Tag "<ARTICLE ...>".

    To retrieve the contents of the <ARTICLE> ... </ARTICLE> construct, simply make a call to the NodeSearch-Package method TagNodeGetInclusive.first(fullPage, "article"). It will retrieve the entire Article Content in a single Line of Code!
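    For instance, a minimal sketch of that one-line retrieval (the variable name 'fullPage' is illustrative only):

    // 'fullPage' is assumed to be the Vector<HTMLNode> produced by downloading & parsing the page.
    // TagNodeGetInclusive.first returns the sub-Vector spanning <ARTICLE ...> ... </ARTICLE>
    Vector<HTMLNode> article = TagNodeGetInclusive.first(fullPage, "article");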

    Below are some examples of how to build an instance of 'ArticleGet' such that it may be passed to the method ScrapeArticles.download(...). Generally, this automates the sometimes laborious process of scraping an entire News Web-Site for a day's entire set of articles.

    Example:
    // This example takes a page copied from a URL on a news-site, and eliminates everything except
    // the HTMLNode's that lie between the DIV whose class attribute is:
    // <DIV ... class="body-content"> article ... [HTMLNodes] ... </DIV>
    
    // This uses java's lambda syntax to build the ArticleGet instance
    ArticleGet ag = (URL url, Vector<HTMLNode> page) ->
        InnerTagGetInclusive.first
            (page, "div", "class", TextComparitor.C, "body-content");
    
    // The behaviour of this ArticleGetter will be identical to the one manually built above.
    // Here, a pre-defined "factory builder" method is used instead:
    
    ArticleGet ag2 = ArticleGet.usual("div", "class", TextComparitor.C, "body-content");
    


    • Field Detail

      • serialVersionUID

        static final long serialVersionUID
        This fulfils the SerialVersion UID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "hibernate" to store data.

        Functional Interfaces are usually not thought of as Data Objects that need to be saved, stored and retrieved; however, having the ability to store intermediate results along with the lambda-functions that helped get those results can make debugging easier.
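        For example, a minimal sketch of saving a getter to disk with standard java.io serialization (the file-name is illustrative, and this assumes the factory-built getter captures only serializable state):

         ArticleGet getter = ArticleGet.usual("article");

         // Because ArticleGet extends java.io.Serializable, a getter may be written out with a
         // standard ObjectOutputStream (IOException handling omitted for brevity)
         try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("getter.ser")))
             { oos.writeObject(getter); }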
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
         public static final long serialVersionUID = 1;
        
    • Method Detail

      • apply

        java.util.Vector<HTMLNode> apply(java.net.URL url,
                                         java.util.Vector<HTMLNode> page)
                                  throws ArticleGetException
        FunctionalInterface Target-Method:
        This method corresponds to the @FunctionalInterface Annotation's method requirement. It is the only non-default, non-static method in this interface, and may be the target of a Lambda-Expression or '::' (double-colon) Function-Pointer.

        This method's purpose is to take a "Scraped HTML Page" (stored as a Vectorized-HTML Web-Page), and return an HTML Vector that contains only the "Article Content" - which is usually just called the "Article Body." Perhaps it seems daunting, but the usual way to get the actual article-body of an HTML News-Website Page is to simply identify an HTML <DIV ID="..." CLASS="..."> surrounding element.

        This class has several different static-methods called "usual" which automatically create a page-getter. The example at the top of this class should highlight how this works. Extracting news-content from a page that has already been downloaded is usually trivial. The point really becomes identifying the <DIV>'s class=... or id=... attributes & page-structure to find the article-body. Generally, just click "View Source" in your browser and inspect the HTML manually to find the attributes used. Using the myriad Get methods from Torello.HTML.NodeSearch usually boils down to code that looks suspiciously like JavaScript:


        JavaScript:
         var articleHTML = document.getElementById("article-body").innerHTML;
        
         // or...
         var articleHTML = document.getElementByClassName("article-body").innerHTML;
        

        Using the NodeSearch package, the above DOM-Tree Java-Script is easily written in Java as below:
         // For articles with HTML divider elements having an "ID" attribute to specify the article
         // body, get the article using the code below.  In this example, the particular newspaper
         // web-site has articles whose content ("Article Body") is simply wrapped in an HTML
         // HTML Divider Element: <DIV ID="article-body"> ... </DIV>
        
         // For extracting that content use the NodeSearch Package Class: InnerTagGetInclusive
        
         Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
             (page, "div", "id", TextComparitor.EQ_CI, "article-body");
        
         // To use this NodeSearch Package Class with the NewsSite Package, simply use one of the
         // 'usual' methods in class ArticleGet, and the lambda Functional Interface "ArticleGet"
         // will be built automatically as such:
        
         ArticleGet getter = ArticleGet.usual("div", "id", TextComparitor.EQ_CI, "article-body");
        
         // For articles with HTML divider elements having an "CLASS" attribute to specify
         // the article body, get the article with the following code.  Note that in this example
         // the article body is wrapped in an HTML Divider Element that has the characteristics
         // <DIV CLASS="article-body"> ... </DIV>.  The content of a Newspaper Article can be easily
         // extracted with just one line of code using the methods in the NodeSearch Package as
         // follows: 
        
         Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
             (page, "div", "class", TextComparitor.C, "article-body");
        
         // For use with the ScrapeArticles class, the same retrieval is written using the 'usual'
         // methods in ArticleGet as such:
        
         ArticleGet getter = ArticleGet.usual(TextComparitor.EQ_CI, "article-body");
        


        Note: For all examples above, the text-string "article-body" is an attribute-value that was chosen by the HTML news-website, or content-website, you want to scrape.

        Furthermore: One might have to be careful about modifying the input to this function. Each of the NodeSearch classes retrieves a copy (read: a clone) of the input Vector - other than the classes whose names actually contain the word "Remove". However, if you write an ArticleGet lambda of your own (rather than using the "usual" methods), make sure you know whether you intend to modify the input-page, and if so, remember that you have.

        Additionally: There are many content-based web-sites that have some (even "a lot") of spurious HTML information inside the primary article body, even after the header & footer information has been eliminated. It may be necessary to do some vector-cleaning later on. For example: getting rid of "Post to Facebook", "Post to Twitter" or "E-Mail Link" buttons.
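        A minimal cleanup sketch (this assumes the node's text is reachable via toString(), and the button captions shown are illustrative only):

         // Remove any TextNode that is merely a social-media button caption left inside
         // the 'articleBody' Vector retrieved in the examples above
         articleBody.removeIf((HTMLNode n) ->
             (n instanceof TextNode) &&
             (n.toString().contains("Post to Facebook") || n.toString().contains("Post to Twitter"))
         );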
        Throws:
        ArticleGetException
      • usual

        static ArticleGet usual​(java.lang.String htmlTag)
        This is a static, factory method for building ArticleGet.

        This builds an "Article Getter" based on a parameter-specified HTML Tag. Two or three common HTML "semantic elements" used for wrapping newspaper article-content include these:

        • <ARTICLE ...> article-body </ARTICLE>
        • <MAIN ...> article-body </MAIN>
        • <SECTION ...> article-body </SECTION>

        Identifying which tag to use can be accomplished by going to the main-page of an internet news web-site, selecting a news-article, and then using "View Source" or "View Page Source" (depending upon which browser you are using), and then scanning the HTML to find which elements are used to wrap the article-body.

        Call this method, and use the ArticleGet that it generates/returns with the class NewsSiteScrape. As long as the news or content website that you are scraping has its article-body wrapped inside the HTML element you have uncovered by inspecting the page manually, the ArticleGet produced by this factory-method will retrieve your page content appropriately.
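        For example, for a site whose stories are wrapped in <ARTICLE> ... </ARTICLE> (as in the Yahoo! News description at the top of this page), a one-line sketch would be:

         ArticleGet getter = ArticleGet.usual("article");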
        Parameters:
        htmlTag - This should be the HTML element that is used to wrap the actual news-content article-body of an HTML news web-site page.
        Returns:
        This returns an "Article Getter" that just picks out the part of a news-website article that lies between the open and closed version of the specified htmlTag.
        Code:
        Exact Method Body:
         return Usual_htmlTag.generate(htmlTag);
        
      • usual

        static ArticleGet usual​(TextComparitor tc,
                                java.lang.String... cssClassCompareStrings)
        This is a static, factory method for building ArticleGet.

        This builds an "Article Getter" for you, using the most common way to get an article - specifically via the HTML <DIV CLASS="..."> element and it's CSS 'class' selector.

        Call this method, and use the ArticleGet that it generates/returns with the class NewsSiteScrape. As long as the news or content website that you are scraping has its page-body wrapped inside of an HTML <DIV> element whose CSS 'class' specifier is one you have uncovered by inspecting the page manually, then the ArticleGet produced by this factory-method will retrieve your page content appropriately.
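        For instance, to target <DIV CLASS="article-body"> ... </DIV> (the class-name here is only an example), a sketch mirroring the apply(...) documentation above would be:

         ArticleGet getter = ArticleGet.usual(TextComparitor.C, "article-body");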
        Parameters:
        tc - This should be any of the pre-instantiated TextComparitor's. Again, a TextComparitor is just a String compare function like: equals, contains, StrCmpr.containsIgnoreCase(...), etc...
        cssClassCompareStrings - These are the values to be used by the TextComparitor when comparing with the value of the CSS-Selector "Class" from the list of DIV elements on the page.
        Returns:
        This returns an "Article Getter" that just picks out the part of a news-website article that lies between the HTML-DIV Element nodes whose class is identified by the "CSS (Cascading Style Sheets) 'class' identifier, and the TextComparitor parameter that you have chosen.
        Code:
        Exact Method Body:
         return Usual_tc.generate(tc, cssClassCompareStrings);
        
      • usual

        static ArticleGet usual​(java.lang.String htmlTag,
                                java.lang.String innerTag,
                                TextComparitor tc,
                                java.lang.String... attributeValueCompareStrings)
        This is a static, factory method for building ArticleGet.

        This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body sits between an open and close HTML DIV element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. However, this factory method allows a programmer to select article content for cases other than that 95%, where you specify the HTML-token, the attribute-name, and use the usual TextComparitor to find the article.
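        For example (the attribute-values are taken from the examples earlier on this page):

         // Match on the 'class' attribute of a DIV, as in the example at the top of this page
         ArticleGet ag1 = ArticleGet.usual("div", "class", TextComparitor.C, "body-content");

         // Match on the 'id' attribute instead
         ArticleGet ag2 = ArticleGet.usual("div", "id", TextComparitor.EQ_CI, "article-body");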
        Parameters:
        htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
        innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag, then use the innerTag / attribute-name of your choice.
        tc - This should be any of the pre-instantiated TextComparitor's. Again, a TextComparitor is just a String compare function like: equals, contains, StrCmpr.containsIgnoreCase(...).
        attributeValueCompareStrings - These are the Strings compared against the innerTag's value using the TextComparitor.
        Returns:
        This returns an "Article Getter" that picks out the part of a news-website article that lies between the HTML element which matches the 'htmlTag', 'innerTag' (id, class, or "other"), and whose attribute-value of the specified inner-tag can be matched by the TextComparitor and the compare-String's.
        Code:
        Exact Method Body:
         return Usual_htmlTag_tc.generate(htmlTag, innerTag, tc, attributeValueCompareStrings);
        
      • usual

        static ArticleGet usual​(java.lang.String htmlTag,
                                java.lang.String innerTag,
                                java.util.regex.Pattern innerTagValuePattern)
        This is a static, factory method for building ArticleGet.

        This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body sits between an open and close HTML DIV element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. However, this factory method allows a programmer to select article content for cases other than that 95%. Here, you may specify the HTML-token, the attribute-name, and a Java Regular-Expression to test the value of the attribute - no matter how complicated or bizarre.
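        A brief sketch (the regular-expression shown is only an assumption about what a site might use):

         // Accept any DIV whose 'id' attribute-value contains "article-body" or "article_body",
         // ignoring case.  Pattern is java.util.regex.Pattern
         ArticleGet getter = ArticleGet.usual
             ("div", "id", Pattern.compile("article[-_]body", Pattern.CASE_INSENSITIVE));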
        Parameters:
        htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
        innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag, then use the innerTag / attribute-name of your choice.
        innerTagValuePattern - Any regular-expression. It will be used to PASS or FAIL the attribute-value (a name that is used interchangeably in this scrape/search package for "inner-tag-value") when compared against this regular-expression parameter.

        HELP: This would be like saying:
        // Pick some random HTML TagNode
        TagNode aTagNode = (TagNode) page.elementAt(index_to_test);
        
        // Gets the attribute value of "innerTag"
        String  attributeValue  = aTagNode.AV(innerTag);
        
        // Make sure the HTML-token is as specified
        // calls to: java.util.regex.*;
        boolean passFail = aTagNode.tok.equals(htmlTag) &&
             innerTagValuePattern.matcher(attributeValue).find();
        
        Returns:
        This returns an "Article Getter" that picks out the part of a news-website article that lies between the HTML element which matches the htmlTag, innerTag and value-testing regex Pattern "innerTagValuePattern".
        Code:
        Exact Method Body:
         return Usual_innerTagValuePattern.generate(htmlTag, innerTag, innerTagValuePattern);
        
      • usual

        static ArticleGet usual​(java.lang.String htmlTag,
                                java.lang.String innerTag,
                                java.util.function.Predicate<java.lang.String> p)
        This is a static, factory method for building ArticleGet.

        This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body sits between an open and close HTML 'DIV' element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. However, this factory method allows a programmer to select article content for cases other than that 95%, where you specify the HTML-token, the attribute-name, and a Predicate<String> for finding the page-body.
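        A short sketch, with an illustrative predicate (the class-name test is an assumption, not taken from any particular site):

         // Accept any DIV whose 'class' attribute-value mentions "article"
         ArticleGet getter = ArticleGet.usual
             ("div", "class", (String classValue) -> (classValue != null) && classValue.contains("article"));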
        Parameters:
        htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
        innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag, then use the innerTag / attribute-name of your choice.
        p - This java "lambda Predicate" will just receive the attribute-value from the "inner-tag" and provide a yes/no answer.
        Returns:
        This returns an "Article Getter" that matches an HTML element specified by 'htmlTag', 'innerTag' and the result of the String-Predicate parameter 'p' on the value of that inner-tag.
        Code:
        Exact Method Body:
         return Usual_p.generate(htmlTag, innerTag, p);
        
      • usual

        static ArticleGet usual​(java.lang.String startTextTag,
                                java.lang.String endTextTag)
        This is a static, factory method for building ArticleGet.

        This factory method generates an "ArticleGet" that will retrieve news-article body-content based on a "start-tag" and an "end-tag." It is very important to note that the text can only match a single text-node; it may not span multiple text-nodes, or be within TagNode's at all! Such text should be easy to find: print the HTML page as a Vector, and inspect it!
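        A sketch (the two text-snippets are purely hypothetical, and would be copied from actual TextNodes observed when printing the page):

         // The article-body is assumed to begin at a TextNode containing "Story Highlights"
         // and end at a TextNode containing "Recommended for you"
         ArticleGet getter = ArticleGet.usual("Story Highlights", "Recommended for you");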
        Parameters:
        startTextTag - This must be text from an HTML TextNode that is contained within one (single) TextNode of the vectorized-HTML page.
        endTextTag - This must be text from an HTML TextNode that is also contained in a single TextNode of the vectorized-HTML page.
        Returns:
        This will return an "Article Getter" that looks for non-HTML Text in the article, specified by the text-tag parameters, and gets it.
        Code:
        Exact Method Body:
         return Usual_textTag.generate(startTextTag, endTextTag);
        
      • usual

        static ArticleGet usual​(java.util.regex.Pattern startPattern,
                                java.util.regex.Pattern endPattern)
        This is a static, factory method for building ArticleGet. This factory method generates an "ArticleGet" that will retrieve news-article body-content based on starting and ending regular-expressions. The matches performed by the Regular Expression checker are performed on TextNode's, not on TagNode's, nor on the page itself. It is very important to note that the text can only match a single TextNode; it may not span multiple TextNode's, or be within TagNode's at all! Such text should be easy to find: print the HTML page as a Vector, and inspect it!
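        A sketch with illustrative patterns (these regular-expressions are assumptions, not taken from a real site):

         // The article-body is assumed to start at a TextNode matching the first Pattern
         // and end at a TextNode matching the second.  Pattern is java.util.regex.Pattern
         ArticleGet getter = ArticleGet.usual
             (Pattern.compile("Story\\s+Highlights"), Pattern.compile("Related\\s+Articles"));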
        Parameters:
        startPattern - This must be a regular expression Pattern that matches an HTML TextNode that is contained within one (single) TextNode of the vectorized-HTML page.
        endPattern - This must be a regular expression Pattern that matches an HTML TextNode that is also contained in a single TextNode of the vectorized-HTML page.
        Returns:
        This will return an "Article Getter" that looks for non-HTML Text in the article, specified by the regular-expression pattern-matching parameters, and gets it.
        Code:
        Exact Method Body:
         return Usual_pattern.generate(startPattern, endPattern);
        
      • branch

        static ArticleGet branch​(URLFilter[] urlSelectors,
                                 ArticleGet[] getters)
        This is a static, factory method for building ArticleGet. This is just a way to put a list of article-parse objects into a single "branching" article-parse Object. The two parameters must be equal-length arrays, with non-null elements. Each 'urlSelector' will be tested, and when a selector passes, the ArticleGet that is created will use the "parallel getter" from the parallel array "getters."

        LAY-SPEAK: The best way to summarize this: if a programmer using the NewsSiteScrape class plans to scrape a site that carries different types of news-articles, he will need differing "ArticleGet" methods. This method takes two arrays that match the URL from which an article was retrieved with the particular "getter" you have provided. When scraping the address http://www.baidu.com/ - a Chinese News Web-Site - it links to at least three primary domains:

        1. http://...chinesenews.com/director.../article...
        2. http://...xinhuanet.com/director.../article...
        3. http://...cctv.com/director.../article...

        Results from each of these sites need to be "handled" just ever-so-slightly differently.
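        For example, a minimal sketch of building such a dispatcher (the host-name tests and getter choices are illustrative assumptions, and this presumes a URLFilter may be written as a lambda over a java.net.URL, as the 'urlSelectors' description below suggests):

         URLFilter[] selectors =
         {
             (URL url) -> url.getHost().contains("xinhuanet.com"),
             (URL url) -> url.getHost().contains("cctv.com")
         };

         ArticleGet[] getters =
         {
             ArticleGet.usual("div", "class", TextComparitor.C, "article-body"),
             ArticleGet.usual("article")
         };

         // Articles from xinhuanet.com use the first getter; articles from cctv.com use the second
         ArticleGet dispatcher = ArticleGet.branch(selectors, getters);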
        Parameters:
        urlSelectors - This is a list of Predicate<URL> elements. When one of these returns TRUE for a particular URL, then the index of that URL-selector in its array will be used to call the appropriate getter from the parallel-array input-parameter 'getters'.
        getters - This is a list of getter elements. These should be tailored to the particular news-website source that are chosen/selected by the 'urlSelectors' parallel array.
        Returns:
        This will be a "master ArticleGet" or a "dispatch ArticleGet." All it does is simply traverse the first array looking for a Predicate-match from the 'urlSelectors', and then calls the getter in the parallel array.

        NOTE: If none of the 'urlSelectors' match when this "dispatch" (or rather "branch") is called by class NewsSiteScrape, the function/getter that is returned will throw an ArticleGetException. It is important that the programmer only allow article URL's that he can capably handle to pass to class NewsSiteScrape.
        Throws:
        java.lang.IllegalArgumentException - Will throw this exception if:

        • Either of these parameters are null
        • If they are not parallel, i.e. of differing lengths.
        • If either contain a null value.
        Code:
        Exact Method Body:
         return Branch.generate(urlSelectors, getters);
        
      • andThen

        default ArticleGet andThen​(ArticleGet after)
        This is the standard-java Function 'andThen' method.
        Parameters:
        after - This is the ArticleGet that will be (automatically) applied after 'this' function.
        Returns:
        A new, composite ArticleGet that performs both operations. It will:

        1. Run 'this' function's 'apply' method to a URL, Vector<HTMLNode>, and return a Vector<HTMLNode>.

        2. Then it will run the 'after' function's 'apply' method to the results of 'this.apply(...)' and return the result.
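        A brief usage sketch (the class-name "story-text" is hypothetical):

         // First narrow the page to its <ARTICLE> element, then narrow further to a DIV
         // whose 'class' attribute contains "story-text"
         ArticleGet getBody   = ArticleGet.usual("article");
         ArticleGet getStory  = ArticleGet.usual("div", "class", TextComparitor.C, "story-text");
         ArticleGet combined  = getBody.andThen(getStory);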
        Code:
        Exact Method Body:
         return (URL url, Vector<HTMLNode> page) -> after.apply(url, this.apply(url, page));
        
      • compose

        default ArticleGet compose​(ArticleGet before)
        This is the standard-java Function 'compose' method.
        Parameters:
        before - This is the ArticleGet that is performed first, whose results are sent to 'this' function.
        Returns:
        A new composite ArticleGet that performs both operations. It will:

        1. Run the 'before' function's 'apply' method to a URL, Vector<HTMLNode>, and return a Vector<HTMLNode>.

        2. Then it will run 'this' function's 'apply' method to the results of the before.apply(...) and return the result.
        Code:
        Exact Method Body:
         return (URL url, Vector<HTMLNode> page) -> this.apply(url, before.apply(url, page));
        
      • identity

        static ArticleGet identity()
        The identity function will always return the same Vector<HTMLNode> as output that it receives as input. This mirrors the standard identity() factory found on Java's java.util.function interfaces.
        Returns:
        a new ArticleGet which (it should be obvious) behaves like a java.util.function.Function<Vector<HTMLNode>, Vector<HTMLNode>>

        ... where the returned Vector is always identical to the input Vector.
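        For instance, identity() can serve as a pass-through getter, e.g. as one entry of the parallel 'getters' array passed to branch(...) when a site's pages need no trimming at all:

         // Validates its input (see the method body below) and returns the page unchanged
         ArticleGet passThrough = ArticleGet.identity();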
        Code:
        Exact Method Body:
         return (URL url, Vector<HTMLNode> page) ->
         {
             ArticleGetException.check(url, page);
             return page;
         };