Here is my test URL:
http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html
The code:
package main

import (
	"github.com/advancedlogic/GoOse"
)

func main() {
	g := goose.New()
	article := g.ExtractFromUrl("http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html")
	println("title : ", article.Title)
	println("content : ", article.CleanedText[0:150])
}
The output:
title : 2 Indicted in George Washington Bridge Case; Ally of Christie Pleads Guilty
content : Continue reading the main story
Continue reading the main story
After a 16-month federal investigation into the George Washington Bridge lane closin
The NY Times page source contains the following:
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
<span class="sharetools-label visually-hidden">Share This Page</span>
<div class="ad sharetools-inline-article-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
<div id="MiddleLeft" class="ad middle-left-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
</div>
Could we come up with a regexp that removes the text of elements whose class attribute contains "hidden"?
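To make the idea concrete, here is a minimal sketch of such a regexp applied to raw HTML before extraction. It is not GoOse code: the names `hiddenInline` and `stripHidden` are hypothetical, and the pattern assumes double-quoted class attributes and elements with no nested tags. It only covers `<a>` and `<span>`, which is enough for the markup quoted above. Go's regexp engine has no backreferences, so it cannot pair arbitrary nested open/close tags, and a DOM-based filter would likely be more robust.

```go
package main

import (
	"fmt"
	"regexp"
)

// hiddenInline matches an <a> or <span> element that has no nested tags and
// whose double-quoted class attribute contains the substring "hidden"
// (for example "hidden" or "visually-hidden"). Assumption: a substring match
// is acceptable, so a class such as "unhidden" would also be stripped.
var hiddenInline = regexp.MustCompile(
	`(?is)<(?:a|span)\b[^>]*\bclass="[^"]*hidden[^"]*"[^>]*>[^<]*</(?:a|span)>`)

// stripHidden removes every element matched by hiddenInline from the HTML.
func stripHidden(html string) string {
	return hiddenInline.ReplaceAllString(html, "")
}

func main() {
	src := `<p><a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>After a 16-month federal investigation</p>`
	fmt.Println(stripHidden(src))
	// prints: <p>After a 16-month federal investigation</p>
}
```

With the snippet quoted above, the inner "Continue reading the main story" links are removed and the surrounding hidden `<div>` wrappers are left empty, so they no longer contribute text.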