Extract hidden text from NY Times

Here is my test URL:
http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html

The code:

```go
package main

import (
    "github.com/advancedlogic/GoOse"
)

func main() {
    g := goose.New()
    article := g.ExtractFromUrl("http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html")
    println("title : ", article.Title)
    println("content : ", article.CleanedText[0:150])
}
```

The output:

```
title :  2 Indicted in George Washington Bridge Case; Ally of Christie Pleads Guilty
content :  Continue reading the main story

Continue reading the main story

After a 16-month federal investigation into the George Washington Bridge lane closin
```

The repeated "Continue reading the main story" lines come from hidden elements in the NY Times page source, which contains the following:

```html
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
<span class="sharetools-label visually-hidden">Share This Page</span>
<div class="ad sharetools-inline-article-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
<div id="MiddleLeft" class="ad middle-left-ad hidden nocontent robots-nocontent">
<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
</div>
</div>
```

Could we come up with a regexp that removes the text of elements whose class contains "hidden"?
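
To make the idea concrete, here is a minimal sketch of such a regexp, applied to the raw HTML before extraction. It is not GoOse's implementation: the names `hiddenLeaf` and `stripHiddenLeaves` are hypothetical, and the pattern only covers `<a>` and `<span>` elements that have no nested tags and a double-quoted `class` attribute containing the substring "hidden". Go's regexp engine has no backreferences, so nested containers such as the `div` blocks above cannot be matched reliably this way; a DOM-based filter would probably be more robust.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hiddenLeaf is a hypothetical pattern, not part of GoOse. It matches an <a>
// or <span> element that contains no nested tags and whose double-quoted
// class attribute contains the substring "hidden" (e.g. "visually-hidden").
var hiddenLeaf = regexp.MustCompile(
	`(?is)<(a|span)\b[^>]*\bclass="[^"]*hidden[^"]*"[^>]*>[^<]*</(a|span)>`)

// stripHiddenLeaves removes every element matched by hiddenLeaf, including
// its text, and leaves the rest of the HTML untouched.
func stripHiddenLeaves(html string) string {
	return hiddenLeaf.ReplaceAllString(html, "")
}

func main() {
	src := `<a class="visually-hidden skip-to-text-link" href="#story-continues-1">Continue reading the main story</a>
<p>After a 16-month federal investigation</p>`

	// The hidden skip link is dropped; the visible paragraph remains.
	fmt.Println(strings.TrimSpace(stripHiddenLeaves(src)))
	// prints: <p>After a 16-month federal investigation</p>
}
```

On the snippet above, this would remove the three skip links and the "Share This Page" label while leaving the now-empty `div` containers in place, which should not contribute any text. Matching on the substring "hidden" is deliberately loose and may also catch unrelated class names.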

