
Subtitle fuzzy matching using keyword extraction? #1

@setreadygo


Thanks a lot for the tool! substudy had some major pain points that ruined 80% of what I tried to extract (duplicate timestamps in .ass files extracted by ffmpeg). The thing I spend the most time on now is renaming files.

Currently I'm renaming each subtitle file to filename.mkv.jp.srt and then running:

ls -1 *.{mkv,avi,mp4} | parallel -j4 'ffmpeg -i {} {}.en.ass'
ls -1 *.{mp4,avi,mkv} | parallel -j16 'bunkai extract cards -m {} {}.jp.* {}.en.*'

It would be great to have some automation here. I think this could be done with NLP keyword extraction. Example libraries are retext-keywords or, for Go, RAKE. I doubt these keyword extractors are good at picking out episode numbers, though, so they may need some tweaking. I've tried this kind of thing in the past with regex heuristics, but they get complicated and buggy because of edge cases. Maybe that's fine, though.

Also, some pre-parsing is needed because brackets in filenames break the sound field in Anki. Escaping them in Anki doesn't do anything, and removing them makes Anki fall back to a fuzzy search, which can take 5+ seconds to find the file when there is a lot of media. The stuff inside brackets is almost always erroneous data that can be ignored. Regex to remove bracketed groups: / *\[[^\]]*\] */g

Regex to remove everything within brackets, parens, and quotes:
/[ _\.\-]*(\[[^\]]*\]|\([^\)]*\)|'[^']*'|"[^"]*")[ _\-]*/g
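As a quick illustration of what this pattern does to real-looking names, here is a small Python sketch. It assumes the last alternative is meant to end with a closing double quote, and the helper name `strip_junk` is invented for the example.

```python
import re

# Python translation of the pattern above. The closing double quote in the
# last alternative is an assumption about the intended behaviour.
JUNK = re.compile(r"""[ _.\-]*(\[[^\]]*\]|\([^)]*\)|'[^']*'|"[^"]*")[ _\-]*""")


def strip_junk(name: str) -> str:
    """Remove bracketed, parenthesized, and quoted groups from a filename."""
    return JUNK.sub("", name)


if __name__ == "__main__":
    for name in ["[Group] Show - 01 (1080p).mkv", 'Movie "Extended Cut".mkv']:
        print(name, "->", strip_junk(name))
```

Because the match is replaced with an empty string, a group in the middle of a name also swallows the separators around it, so "Show [x] 01.mkv" becomes "Show01.mkv". Replacing with a single space and trimming afterwards might be a safer variant.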
