Subtitle fuzzy matching using keyword extraction?

Thanks a lot for the tool! substudy had some major pain points that ruined about 80% of the stuff I tried to extract (duplicate timestamps in .ass files extracted by ffmpeg). The thing I spend the most time on now is renaming files.

Currently I'm renaming each subtitle file to `filename.mkv.jp.srt` and then running:
```
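# extract the subtitle stream ffmpeg selects by default from each video as <video>.en.ass
ls -1 *.{mkv,avi,mp4} | parallel -j4 'ffmpeg -i {} {}.en.ass'
# build cards from each video plus its matching .jp and .en subtitle files
ls -1 *.{mp4,avi,mkv} | parallel -j16 'bunkai extract cards -m {} {}.jp.* {}.en.*'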
```
It would be great to have some automation here. I think this could be done with NLP keyword extraction. Example libraries that do this are [retext-keywords](https://github.com/retextjs/retext-keywords) or, for Go, [RAKE](https://github.com/afjoseph/RAKE.Go). I doubt these keyword extractors are good at picking out episode numbers, though, so they may need to be tweaked a bit. I've tried this kind of thing in the past using regex heuristics, but they get crazy and buggy due to edge cases. Maybe that's fine though.
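To make the idea a bit more concrete, here is a rough Python sketch of the simpler regex-heuristic route (not real NLP keyword extraction, and not how bunkai works today). It pulls an episode number out of each filename and pairs every video with the most similar subtitle name that has the same number. The names `episode_number` and `match_subtitles` are made up for illustration, and the patterns are assumptions that will certainly miss some naming schemes.

```python
import re
from difflib import SequenceMatcher

# Explicit markers such as "S01E05", "Ep05" or "Episode 5" (assumed naming schemes).
MARKER = re.compile(
    r"(?<![a-z])(?:s\d{1,2})?(?:episode|ep|e)[ ._-]*(\d{1,3})(?!\d)",
    re.IGNORECASE,
)
# Fallback: 1-3 digit numbers not glued to letters or other digits,
# which skips things like "720p", "x264", "v2" and years.
STANDALONE = re.compile(r"(?<![A-Za-z\d])\d{1,3}(?![A-Za-z\d])")


def episode_number(name):
    """Return a best-guess episode number for a filename, or None."""
    marked = MARKER.search(name)
    if marked:
        return int(marked.group(1))
    numbers = STANDALONE.findall(name)
    # Heuristic: the episode number usually comes after the title.
    return int(numbers[-1]) if numbers else None


def match_subtitles(videos, subtitles):
    """Pair each video with the most similar subtitle sharing its episode number."""
    pairs = {}
    for video in videos:
        episode = episode_number(video)
        candidates = [s for s in subtitles if episode_number(s) == episode]
        if candidates:
            pairs[video] = max(
                candidates,
                key=lambda s: SequenceMatcher(None, video.lower(), s.lower()).ratio(),
            )
    return pairs


if __name__ == "__main__":
    videos = ["[Group] Show - 01 [720p].mkv", "[Group] Show - 02 [720p].mkv"]
    subtitles = ["Show.S01E02.jp.srt", "Show.S01E01.jp.srt"]
    for video, subtitle in match_subtitles(videos, subtitles).items():
        print(video, "->", subtitle)
```

Keyword extraction could replace the `SequenceMatcher` similarity step for telling different shows apart, while the episode number would probably still need a dedicated heuristic like the one above.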

Also, some pre-parsing is needed because brackets in filenames break the sound field in Anki. Escaping them in Anki doesn't do anything, and removing them makes Anki do a fuzzy search, which can take 5+ seconds to find the file when there is a large amount of media. The stuff inside brackets is almost always extraneous data that can be ignored, so it can be stripped with `/ *\[[^\]]*\] */g`

Regex to remove everything within brackets, parens, and quotes, along with the separators around them:
`/[ _\.\-]*(\[[^\]]*\]|\([^\)]*\)|'[^']*'|"[^"]*")[ _\-]*/g`
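As a sketch of how that pre-parsing step might look in practice, here is the second regex ported to Python, assuming the double-quote branch is meant to have a closing `"`. The function name `clean_name` is hypothetical. Replacing each match with a single space (rather than nothing) is my own choice so that `Show (2019) - 01` doesn't collapse into `Show01`.

```python
import re

# Bracketed, parenthesised, or quoted segments plus the separators around them.
BRACKETED = re.compile(r"""[ _.\-]*(\[[^\]]*\]|\([^)]*\)|'[^']*'|"[^"]*")[ _\-]*""")


def clean_name(filename):
    """Strip bracketed/quoted segments from a filename, keeping its extension."""
    cleaned = BRACKETED.sub(" ", filename)
    cleaned = re.sub(r" +(?=\.)", "", cleaned)  # no space left before a dot
    cleaned = re.sub(r" {2,}", " ", cleaned)    # collapse runs of spaces
    return cleaned.strip(" _-")


if __name__ == "__main__":
    print(clean_name("[Group] Show - 01 [720p].mkv"))
    print(clean_name("Show (2019) - 01.mkv"))
```

One caveat: the quote branches will also eat text between two apostrophes in a title (e.g. `Don't Stop 'Til`), so it may be safer to drop them and keep only the bracket and paren branches.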
