A repository of tools for converting several content types to text.

To install the command-line tool, run:
go get github.com/zbioe/grapnel

Pass the content type with the -t flag to parse the input directly as that type:

cat pdf/testdata/valid.pdf | grapnel -t pdf
cat html/testdata/valid.html | grapnel -t html

Alternatively, omit the type; grapnel then reads the whole input and tries to detect the type:

cat pdf/testdata/valid.pdf | grapnel
cat html/testdata/valid.html | grapnel

The pdf package receives a PDF as []byte or io.Reader and converts it to text with pdftotext.
Create a file main.go with the following content:
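package main

import (
	"fmt"
	"os"

	"github.com/zbioe/grapnel/pdf"
)

func main() {
	// Convert the PDF read from standard input to plain text.
	text, err := pdf.ToTextFromReader(os.Stdin)
	if err != nil {
		// Report the failure on stderr and exit with a non-zero status.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}

Run it on the command line: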
go run main.go < pdf/test_files/valid.pdf
curl -Ls "http://www.orimi.com/pdf-test.pdf" | go run main.go

The html package receives HTML as []byte or io.Reader and converts it to text.
Create a file main.go with the following content:
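package main

import (
	"fmt"
	"os"

	"github.com/zbioe/grapnel/html"
)

func main() {
	// Convert the HTML read from standard input to plain text.
	text, err := html.ToTextFromReader(os.Stdin)
	if err != nil {
		// Report the failure on stderr and exit with a non-zero status.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}

Run it on the command line: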
go run main.go < html/testdata/valid.html
curl -Ls "https://reddit.com/" | go run main.go
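
The README does not say how the CLI detects the content type when -t is omitted. One possible approach, shown here purely as an illustration and not as grapnel's actual logic, is to sniff the first bytes of the input with the standard library's http.DetectContentType. The function name detectType and the returned labels "pdf" and "html" are assumptions chosen to match the -t values above.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// detectType (hypothetical, not grapnel's API) maps the sniffed MIME type of
// the leading bytes to a grapnel-style type name, or "" if unrecognized.
func detectType(head []byte) string {
	ct := http.DetectContentType(head) // considers at most the first 512 bytes
	switch {
	case ct == "application/pdf":
		return "pdf"
	case strings.HasPrefix(ct, "text/html"):
		return "html"
	}
	return ""
}

func main() {
	// Peek leaves the bytes in the buffer, so the same reader could still be
	// handed to a converter afterwards.
	br := bufio.NewReader(os.Stdin)
	head, err := br.Peek(512)
	if err != nil && err != io.EOF {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if t := detectType(head); t != "" {
		fmt.Println(t)
	} else {
		fmt.Println("unknown")
	}
}
```

Sniffing only the leading bytes keeps detection cheap, but it can miss HTML fragments that do not start with a recognizable tag, so a real tool may need additional heuristics.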