Thanks a lot for the tool! substudy had some major pain points that ruined 80% of stuff I tried to extract (duplicate timestamps in .ass files extracted by ffmpeg). The thing I spend the most time on now is rewriting filenames.
Currently I'm rewriting each file as filename.mkv.jp.srt and then running:
ls -1 *.{mkv,avi,mp4} | parallel -j4 'ffmpeg -i {} {}.en.ass'
ls -1 *.{mp4,avi,mkv} | parallel -j16 'bunkai extract cards -m {} {}.jp.* {}.en.*'
It would be great to have some automation here. This could be done with NLP keyword extraction, I think. Example libraries that do this are retext-keywords or, for Go, RAKE. I doubt these keyword extractors are good at picking out episode numbers, though, so they may need to be tweaked a bit. I've tried this kind of thing in the past using regex heuristics, but they get crazy and buggy due to edge cases. Maybe that's fine though.
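To make the regex-heuristic idea concrete, here is a minimal Python sketch that pulls an episode number out of a filename and pairs video files with subtitle files on that number. It is not part of bunkai; the names `EPISODE_PATTERNS`, `episode_number`, and `pair_by_episode` are hypothetical, and the patterns only cover a few common naming schemes, so the edge cases mentioned above still apply.

```python
import re

# Hypothetical patterns, tried in order from most to least specific.
EPISODE_PATTERNS = [
    # S01E02 style
    re.compile(r"[Ss]\d{1,2}[Ee](\d{1,3})"),
    # Ep02, Episode 2, E02 (the marker must not be preceded by a letter)
    re.compile(r"(?<![A-Za-z])(?:[Ee]p(?:isode)?|[Ee])[ ._]?(\d{1,3})(?!\d)"),
    # Bare 1-3 digit number, e.g. "Show - 02.mkv"; skips 1080p, x264, mp4
    re.compile(r"(?<![\dA-Za-z])(\d{1,3})(?![\dA-Za-z])"),
]


def episode_number(name):
    """Return the first episode number found in name, or None."""
    for pattern in EPISODE_PATTERNS:
        match = pattern.search(name)
        if match:
            return int(match.group(1))
    return None


def pair_by_episode(videos, subtitles):
    """Map each video filename to the subtitle with the same episode number."""
    subs_by_episode = {}
    for sub in subtitles:
        episode = episode_number(sub)
        if episode is not None:
            subs_by_episode[episode] = sub
    # Videos without a detectable episode number map to None.
    return {video: subs_by_episode.get(episode_number(video)) for video in videos}


if __name__ == "__main__":
    videos = ["Show - 01.mkv", "Show - 02.mkv"]
    subs = ["[Subs] Show ep02 jp.srt", "[Subs] Show ep01 jp.srt"]
    for video, sub in pair_by_episode(videos, subs).items():
        print(video, "<-", sub)
```

The bare-number fallback is the fragile part: a title that itself contains a number (a sequel, a year written with three digits or fewer) would be picked up before the real episode number, so a real implementation would probably need a way to override the guess.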
Also, some pre-parsing is needed because brackets in filenames are breaking the sound field in Anki. Escaping them in Anki doesn't do anything, and removing them causes Anki to do a fuzzy search, which can take 5+ seconds to find the file when there is a large amount of media. Almost always the stuff inside brackets is extraneous data that can be ignored. Regex to remove just the bracketed parts: / *\[[^\]]*\] */g
Regex to remove everything within brackets, parens, and quotes:
/[ _\.\-]*(\[[^\]]*\]|\([^\)]*\)|'[^']*'|"[^"]*")[ _\-]*/g
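As a rough illustration of how that combined regex behaves, here is a Python translation. It is written so the double-quoted alternative has a closing quote, on the assumption that quoted segments were meant to be matched symmetrically. The names `BRACKETED` and `strip_bracketed` are hypothetical and not part of bunkai.

```python
import re

# Python translation of the combined regex. The trailing character class
# leaves out "." on purpose so the file extension's dot is preserved.
BRACKETED = re.compile(
    r"""[ _.\-]*(\[[^\]]*\]|\([^)]*\)|'[^']*'|"[^"]*")[ _\-]*"""
)


def strip_bracketed(name):
    """Remove bracketed, parenthesized, and quoted segments from a filename."""
    return BRACKETED.sub("", name)


if __name__ == "__main__":
    print(strip_bracketed("[Group] Show - 01 (1080p) [ABCD1234].mkv"))
    # prints: Show - 01.mkv
```

One caveat with the single-quote alternative: a title containing two apostrophes will lose everything between them, so it may be safer to drop that alternative for titles that contain contractions or possessives.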