Extract hardcoded subtitles from videos using machine learning

# videocr

Extract hardcoded (burned-in) subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.

Input a video with hardcoded subtitles:

<p float="left">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873658-3b76dd00-6a34-11e9-95c6-cd6edc721f58.png">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873659-3b76dd00-6a34-11e9-97aa-2c3e96fe3a97.png">
</p>

```python
# print_sub.py

import videocr

if __name__ == '__main__':
    print(videocr.get_subtitles('video.avi', lang='chi_sim+eng', sim_threshold=70))
```

`$ python3 print_sub.py`

Output:

```
0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.

2
00:00:08,091 --> 00:00:10,635
休闲 时 光 …
For relaxing times, make it...

3
00:00:10,677 --> 00:00:12,595
三 得 利 时 光
Bartender, Bob Suntory time.

4
00:00:14,472 --> 00:00:17,142
我 要 一 杯 伏特 加
Un, I'll have a vodka tonic.

5
00:00:18,059 --> 00:00:19,019
谢谢
Laughs Thanks.
```
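
If you want to post-process the returned string instead of printing it, the SRT text can be split into entries. The following is a minimal sketch and not part of videocr: `SubtitleEntry` and `parse_srt` are hypothetical names, and it assumes entries are separated by blank lines, as in the output above.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SubtitleEntry:
    index: int
    start: str
    end: str
    text: str


def parse_srt(srt: str) -> List[SubtitleEntry]:
    """Split an SRT string into entries (hypothetical helper, not part of videocr)."""
    entries = []
    # Entries are separated by one or more blank lines.
    for block in re.split(r'\n\s*\n', srt.strip()):
        lines = [line.strip() for line in block.strip().splitlines()]
        # Skip anything that does not look like "index / timing / text".
        if len(lines) < 2 or not lines[0].isdigit() or '-->' not in lines[1]:
            continue
        start, end = (part.strip() for part in lines[1].split('-->'))
        entries.append(SubtitleEntry(int(lines[0]), start, end, '\n'.join(lines[2:])))
    return entries


sample = """0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.
"""

for entry in parse_srt(sample):
    # Print the timing and the last (English) line of each entry.
    print(entry.index, entry.start, entry.end, entry.text.splitlines()[-1])
```

In practice you would pass the string returned by `videocr.get_subtitles(...)` to `parse_srt` in place of `sample`.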

## Performance

The OCR process runs in parallel and is CPU-intensive. Extracting subtitles from a 20-second video takes about 3 minutes on my dual-core laptop, so you may want more cores for longer videos.
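
As a rough planning aid for longer videos, the figure above (3 minutes for 20 seconds of video on two cores) can be scaled up. The sketch below assumes that processing time grows linearly with video length and shrinks linearly with core count. Both are assumptions rather than measured behaviour, and `estimate_minutes` is a hypothetical helper, not part of videocr.

```python
def estimate_minutes(video_seconds: float, cores: int,
                     ref_minutes: float = 3.0, ref_video_seconds: float = 20.0,
                     ref_cores: int = 2) -> float:
    """Scale the reference measurement; linear scaling is an assumption."""
    if video_seconds < 0 or cores < 1:
        raise ValueError('video_seconds must be >= 0 and cores >= 1')
    minutes_per_video_second = ref_minutes / ref_video_seconds
    return video_seconds * minutes_per_video_second * ref_cores / cores


# A 10-minute video on 8 cores, under the linear-scaling assumption.
print(round(estimate_minutes(600, cores=8), 1))
```

Real throughput depends on resolution, frame rate, and language data, so treat the result as an order-of-magnitude guess.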

## Installation

1. Install [Tesseract](https://github.com/tesseract-ocr/tesseract/wiki) and make sure it is in your `$PATH`.

2. `$ pip install videocr`
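
To confirm the first step worked, you can check from Python whether the `tesseract` executable is reachable. This is a small sketch using only the standard library; `find_tesseract` is a hypothetical helper name, not something videocr provides.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which('tesseract')


path = find_tesseract()
if path is None:
    print('tesseract was not found on PATH')
else:
    print('tesseract found at', path)
```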

## API

```python
videocr.get_subtitles(
        video_path: str, lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)
```
Returns the subtitles as a string in SRT format.


```python
videocr.save_subtitles_to_file(
        video_path: str, file_path='subtitle.srt', lang='eng', time_start='0:00',
        time_end='', conf_threshold=65, sim_threshold=90, use_fullframe=False)
```
Writes the subtitles to `file_path`. If the file does not exist, it is created automatically.

### Parameters

- `lang`

  The language of the subtitles. You can extract subtitles in almost any language. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.
  
  Note that you can use more than one language. For example, `'hin+eng'` means using Hindi and English together for recognition. More details are available in the [Tesseract documentation](https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#using-multiple-languages).
  
  Language data files are downloaded automatically to your `$HOME/tessdata` directory when necessary. You can read more about Tesseract language data files on their [wiki page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).

- `time_start` and `time_end`

  Extract subtitles from only part of the video. Subtitle timestamps are still relative to the start of the full video.

- `conf_threshold`

  Confidence threshold for word predictions. Words with lower confidence than this threshold are discarded. The default value is fine for most cases.

  Lower it toward 0 if you get too few words in the predictions, or raise it toward 100 if you get too many spurious words.

- `sim_threshold`

  Similarity threshold for subtitle lines. Neighbouring subtitles whose [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance) ratio exceeds this threshold are merged. The default value is fine for most cases.

  Lower it toward 0 if you get too many duplicated subtitle lines, or raise it toward 100 if you get too few subtitle lines.

- `use_fullframe`

  By default, only the bottom half of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom half of each frame.
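
To make `conf_threshold` and `sim_threshold` concrete, here is a minimal sketch of the filtering and merging they describe. It is not videocr's implementation: `filter_words`, `similarity`, and `merge_neighbours` are hypothetical names, the confidence values are made-up sample data, and `difflib.SequenceMatcher` from the standard library stands in for a Levenshtein ratio, so exact scores may differ from what videocr computes.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def filter_words(words: List[Tuple[str, float]], conf_threshold: float = 65) -> str:
    """Discard words whose confidence (0-100) is below conf_threshold."""
    return ' '.join(word for word, conf in words if conf >= conf_threshold)


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (a stand-in for a Levenshtein ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def merge_neighbours(lines: List[str], sim_threshold: float = 90) -> List[str]:
    """Collapse consecutive lines whose similarity exceeds sim_threshold."""
    merged: List[str] = []
    for line in lines:
        if merged and similarity(merged[-1], line) > sim_threshold:
            continue  # treat it as the same subtitle seen in a later frame
        merged.append(line)
    return merged


# Made-up (word, confidence) pairs for one frame; '|' is OCR noise.
words = [('What', 96.0), ('can', 91.5), ('|', 12.0),
         ('I', 88.0), ('get', 93.0), ('you?', 90.0)]
print(filter_words(words))

# Made-up per-frame results; the second is a slightly misread repeat.
frames = ['What can I get you?', 'What can I get you7', "Um, I'm not sure."]
print(merge_neighbours(frames, sim_threshold=90))
print(merge_neighbours(frames, sim_threshold=99))
```

With the sample data, a threshold of 90 merges the misread repeat into the first line, while 99 keeps all three lines apart. This mirrors the advice above: lower `sim_threshold` to remove duplicates, raise it to keep more lines.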