Extract hidden text from NY Times #31

@jice-lavocat

Here is my test URL:
http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html

The code:

package main

import (
    "github.com/advancedlogic/GoOse"
)

func main() {
    g := goose.New()
    article := g.ExtractFromUrl("http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html")
    println("title : ", article.Title)
    println("content : ", article.CleanedText[0:150])
}

The output:

title :  2 Indicted in George Washington Bridge Case; Ally of Christie Pleads Guilty
content :  Continue reading the main story

Continue reading the main story

After a 16-month federal investigation into the George Washington Bridge lane closin

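The extracted content starts with "Continue reading the main story" twice. That text comes from visually hidden links in the NY Times page source, which contains the following: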

<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
<span class="sharetools-label visually-hidden">Share This Page</span>
<div class="ad sharetools-inline-article-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
<div id="MiddleLeft" class="ad middle-left-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
</div>

Could we come up with a regexp that removes the text of elements whose class contains "hidden"?
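To make the idea concrete, here is a minimal sketch using only the Go standard library. It is not GoOse code: `hiddenLeaf`, `hasHiddenToken` and `stripHiddenLeaves` are hypothetical names, and the rule "a class token is `hidden` or ends in `-hidden`" is an assumption based on the snippet above. Go's `regexp` package has no backreferences, so this only handles `<a>` and `<span>` elements without nested tags.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hiddenLeaf matches an <a> or <span> element that carries a class attribute
// and contains no nested tags. Without backreferences the closing tag is
// matched loosely (either </a> or </span>).
var hiddenLeaf = regexp.MustCompile(`(?is)<(?:a|span)\b[^>]*\bclass="([^"]*)"[^>]*>[^<]*</(?:a|span)>`)

// hasHiddenToken reports whether any class token is "hidden" or ends in
// "-hidden" (assumed convention, e.g. "visually-hidden").
func hasHiddenToken(classAttr string) bool {
	for _, tok := range strings.Fields(classAttr) {
		if tok == "hidden" || strings.HasSuffix(tok, "-hidden") {
			return true
		}
	}
	return false
}

// stripHiddenLeaves drops matched elements whose class list marks them as
// hidden and leaves every other match untouched.
func stripHiddenLeaves(html string) string {
	return hiddenLeaf.ReplaceAllStringFunc(html, func(m string) string {
		sub := hiddenLeaf.FindStringSubmatch(m)
		if sub != nil && hasHiddenToken(sub[1]) {
			return ""
		}
		return m
	})
}

func main() {
	in := `<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>` +
		`<p class="story-body-text">After a 16-month federal investigation</p>`
	fmt.Println(stripHiddenLeaves(in))
}
```

Applied to the snippet above, this would remove the three skip links and the "Share This Page" span, but it would leave the now-empty `hidden` divs in place. Removing nested containers reliably probably needs a DOM-based pass rather than a regexp.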
