Extract hardcoded subtitles from videos using machine learning

# videocr

Extract hardcoded (burned-in) subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.

Input a video with hardcoded subtitles:

<p float="left">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873658-3b76dd00-6a34-11e9-95c6-cd6edc721f58.png">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873659-3b76dd00-6a34-11e9-97aa-2c3e96fe3a97.png">
</p>

```python
# example.py

from videocr import get_subtitles

if __name__ == '__main__':  # This check is mandatory for Windows.
    print(get_subtitles('video.mp4', lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
```

`$ python3 example.py`

Output:

```
0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.

2
00:00:08,091 --> 00:00:10,635
休闲 时 光 …
For relaxing times, make it...

3
00:00:10,677 --> 00:00:12,595
三 得 利 时 光
Bartender, Bob Suntory time.

4
00:00:14,472 --> 00:00:17,142
我 要 一 杯 伏特 加
Un, I'll have a vodka tonic.

5
00:00:18,059 --> 00:00:19,019
谢谢
Laughs Thanks.
```
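
If you want to work with the output programmatically rather than print it, the SRT text can be split into cues. The following is a minimal sketch, not part of videocr: `Cue` and `parse_srt` are hypothetical names, and it assumes the layout shown above (an index line, a `start --> end` line, then one or more text lines, with cues separated by blank lines).

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One subtitle entry (hypothetical helper type, not from videocr)."""
    index: int
    start_ms: int
    end_ms: int
    text: str


_TIME = re.compile(r'(\d+):(\d{2}):(\d{2}),(\d{3})')


def _to_ms(stamp: str) -> int:
    """Convert an SRT timestamp such as '00:00:01,042' to milliseconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f'not an SRT timestamp: {stamp!r}')
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(srt: str) -> List[Cue]:
    """Split an SRT string into cues; cues are separated by blank lines."""
    cues = []
    for block in re.split(r'\n\s*\n', srt.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # skip fragments without a timing line
        start, end = lines[1].split('-->')
        cues.append(Cue(int(lines[0]), _to_ms(start), _to_ms(end),
                        '\n'.join(lines[2:])))
    return cues


# The first two cues from the output above.
sample = """0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.
"""
cues = parse_srt(sample)
```

Each `Cue` keeps both language lines together in `text`, so splitting a bilingual subtitle into separate tracks is a matter of splitting on the newline.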

## Performance

The OCR process is CPU intensive. It takes 3 minutes on my dual-core laptop to extract subtitles from a 20-second video. More CPU cores will make it faster.

## Installation

1. Install [Tesseract](https://github.com/tesseract-ocr/tesseract/wiki) and make sure it is in your `$PATH`.

2. `$ pip install videocr`

## API

1. Return the subtitles as a string in SRT format:
    ```python
    get_subtitles(
        video_path: str, lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)
    ```

2. Write the subtitles to `file_path`:
    ```python
    save_subtitles_to_file(
        video_path: str, file_path='subtitle.srt', lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)
    ```
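
If you post-process the string returned by `get_subtitles` before saving it, you may prefer to write the file yourself. The sketch below is an illustration only: `write_srt` is a hypothetical helper, and this README does not say which encoding `save_subtitles_to_file` uses. Passing `encoding='utf-8'` explicitly avoids errors on platforms whose default encoding cannot represent non-Latin subtitles.

```python
import os
import tempfile


def write_srt(srt_text: str, file_path: str = 'subtitle.srt') -> None:
    """Write an SRT string to disk as UTF-8 (hypothetical helper, not part of videocr)."""
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(srt_text)


# In real use, srt_text would be the return value of get_subtitles(...).
srt_text = '0\n00:00:18,059 --> 00:00:19,019\n谢谢\nLaughs Thanks.\n'

# Write to a temporary directory and read the file back to check the round trip.
out_path = os.path.join(tempfile.mkdtemp(), 'subtitle.srt')
write_srt(srt_text, out_path)
with open(out_path, encoding='utf-8') as f:
    round_trip = f.read()
```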

### Parameters

- `lang`

  The language of the subtitles. You can extract subtitles in almost any language. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for Simplified Chinese) are supported.

  You can use more than one language, e.g. `lang='hin+eng'` for Hindi and English together.

  Language files will be automatically downloaded to your `~/tessdata`. You can read more about Tesseract language data files on their [wiki page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).

- `conf_threshold`

  Confidence threshold for word predictions. Words whose confidence is below this value are discarded. The default value `65` is fine for most cases.

  Make it closer to 0 if you get too few words in each line, or make it closer to 100 if there are too many excess words in each line.

- `sim_threshold`

  Similarity threshold for subtitle lines. Subtitle lines whose [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance) ratio is larger than this threshold are merged together. The default value `90` is fine for most cases.

  Make it closer to 0 if you get too many duplicated subtitle lines, or make it closer to 100 if you get too few subtitle lines.

- `time_start` and `time_end`

  Extract subtitles from only a clip of the video. The subtitle timestamps are still relative to the start of the full video, not to the start of the clip.

- `use_fullframe`

  By default, only the bottom half of each frame is used for OCR. Set this to `True` to use the full frame if your subtitles are not within the bottom half of each frame.
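
To make `conf_threshold` and `sim_threshold` more concrete, here is a rough, self-contained sketch of the two steps they control. It is not videocr's actual code: the function names are hypothetical, the per-word confidences are invented for illustration, and the similarity formula (100 × (1 − distance / longer length)) is only one common way to normalize a Levenshtein distance, which may differ from the ratio videocr computes.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (assumed normalization, see lead-in)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def filter_words(words: List[Tuple[str, float]], conf_threshold: float = 65) -> str:
    """Drop words whose OCR confidence is below the threshold."""
    return ' '.join(word for word, conf in words if conf >= conf_threshold)


def merge_similar(lines: List[str], sim_threshold: float = 90) -> List[str]:
    """Collapse each run of near-identical consecutive lines into its first line."""
    merged: List[str] = []
    for line in lines:
        if merged and similarity(merged[-1], line) > sim_threshold:
            continue  # treated as the same subtitle as the previous line
        merged.append(line)
    return merged


# Invented (word, confidence) pairs for three consecutive frames.
frames = [
    [('What', 96), ('can', 93), ('I', 90), ('get', 95), ('you?', 91), ('|', 20)],
    [('What', 95), ('can', 94), ('I', 88), ('get', 95), ('you7', 70)],
    [('Um,', 89), ("I'm", 92), ('not', 97), ('sure.', 94)],
]
lines = [filter_words(frame) for frame in frames]
result = merge_similar(lines, sim_threshold=90)
```

In this toy input the stray `|` is dropped by the confidence filter, and the first two lines differ by one character out of 19 (similarity about 94.7), so they merge at the default threshold of `90` but stay separate at `95`. That mirrors the tuning advice above: a higher `sim_threshold` yields more, possibly duplicated, subtitle lines.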