# videocr

Extract hardcoded (burned-in) subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.

Input a video with hardcoded subtitles:

<p float="left">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873658-3b76dd00-6a34-11e9-95c6-cd6edc721f58.png">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873659-3b76dd00-6a34-11e9-97aa-2c3e96fe3a97.png">
</p>

```python
# example.py

from videocr import get_subtitles

if __name__ == '__main__':  # This check is mandatory for Windows.
    print(get_subtitles('video.mp4', lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
```

`$ python3 example.py`

Output:

```
0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.

2
00:00:08,091 --> 00:00:10,635
休闲 时 光 …
For relaxing times, make it...

3
00:00:10,677 --> 00:00:12,595
三 得 利 时 光
Bartender, Bob Suntory time.

4
00:00:14,472 --> 00:00:17,142
我 要 一 杯 伏特 加
Un, I'll have a vodka tonic.

5
00:00:18,059 --> 00:00:19,019
谢谢
Laughs Thanks.
```
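
If you want to post-process the result instead of printing it, the SRT text can be split into individual cues. The snippet below is a minimal sketch that assumes the layout shown above (index line, time range line, then one or more text lines, with blank lines between cues). `Cue` and `parse_srt` are hypothetical helper names for illustration and are not part of videocr.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One subtitle entry (hypothetical helper type, not part of videocr)."""
    index: int
    start: str
    end: str
    text: str


def parse_srt(srt: str) -> List[Cue]:
    """Split SRT text into cues, assuming blank lines separate the entries."""
    cues = []
    for block in re.split(r'\n\s*\n', srt.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # skip anything that lacks an index and a time range
        start, end = (part.strip() for part in lines[1].split('-->'))
        cues.append(Cue(int(lines[0]), start, end, '\n'.join(lines[2:])))
    return cues


# The first two cues of the sample output, standing in for a real result.
sample = """0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.
"""

cues = parse_srt(sample)
for cue in cues:
    print(cue.index, cue.start, cue.end, repr(cue.text))
```

In real use, `sample` would be replaced by the string returned from `get_subtitles(...)`.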

## Performance

The OCR process is CPU-intensive. Extracting subtitles from a 20-second video takes 3 minutes on my dual-core laptop. More CPU cores will make it faster.

## Installation

1. Install [Tesseract](https://github.com/tesseract-ocr/tesseract/wiki) and make sure it is in your `$PATH`.

2. `$ pip install videocr`

## API

1. Return subtitle string in SRT format
    ```python
    get_subtitles(
        video_path: str, lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)
    ```

2. Write subtitles to `file_path`
    ```python
    save_subtitles_to_file(
        video_path: str, file_path='subtitle.srt', lang='eng', time_start='0:00', time_end='',
        conf_threshold=65, sim_threshold=90, use_fullframe=False)
    ```

### Parameters

- `lang`

  The language of the subtitles. You can extract subtitles in almost any language. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.

  Note that you can use more than one language, e.g. `lang='hin+eng'` for Hindi and English together.

  Language files will be automatically downloaded to your `~/tessdata`. You can read more about Tesseract language data files on their [wiki page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).

- `conf_threshold`

  Confidence threshold for word predictions. Words with lower confidence than this value will be discarded. The default value `65` is fine for most cases.

  Make it closer to 0 if you get too few words in each line, or make it closer to 100 if there are too many excess words in each line.

- `sim_threshold`

  Similarity threshold for subtitle lines. Subtitle lines with larger [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance) ratios than this threshold will be merged together. The default value `90` is fine for most cases.

  Make it closer to 0 if you get too many duplicated subtitle lines, or make it closer to 100 if you get too few subtitle lines.

- `time_start` and `time_end`

  Extract subtitles from only a clip of the video. The subtitle timestamps are still calculated according to the full video length.

- `use_fullframe`

  By default, only the bottom half of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom half of each frame.
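
To get a feel for what `conf_threshold` and `sim_threshold` control, here is a rough pure-Python sketch of the two ideas: dropping low-confidence words, then collapsing consecutive lines that are nearly identical. It is an illustration only, not videocr's actual code. The helper names (`levenshtein`, `similarity`, `filter_words`, `merge_similar`), the sample confidences, and the way the edit distance is normalized to a 0-100 score are all assumptions made for this example.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute if different
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """0-100 score, 100 for identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def filter_words(words: List[Tuple[str, int]], conf_threshold: int = 65) -> str:
    """Discard words whose confidence is lower than conf_threshold."""
    return ' '.join(word for word, conf in words if conf >= conf_threshold)


def merge_similar(lines: List[str], sim_threshold: int = 90) -> List[str]:
    """Collapse consecutive lines whose similarity exceeds sim_threshold."""
    merged: List[str] = []
    for line in lines:
        if merged and similarity(merged[-1], line) > sim_threshold:
            continue  # treated as a repeat of the previously kept line
        merged.append(line)
    return merged


# Made-up (word, confidence) pairs for one frame; '|' is OCR noise.
words = [('What', 96), ('can', 91), ('|', 12), ('I', 88), ('get', 93), ('you?', 90)]
print(filter_words(words))

# Made-up per-frame lines; the third has a one-character OCR error.
frames = ['What can I get you?', 'What can I get you?',
          'What can I get you7', "Um, I'm not sure."]
print(merge_similar(frames))                    # near-duplicates collapse
print(merge_similar(frames, sim_threshold=97))  # stricter: the OCR variant stays
```

In this sketch, lowering `sim_threshold` merges more aggressively and raising it keeps more distinct lines, which matches the tuning advice above.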