# videocr

Extract hardcoded (burned-in) subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.

Input a video with hardcoded subtitles:

<p float="left">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873658-3b76dd00-6a34-11e9-95c6-cd6edc721f58.png">
  <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873659-3b76dd00-6a34-11e9-97aa-2c3e96fe3a97.png">
</p>

```python
# print_sub.py

import videocr

if __name__ == '__main__':
    print(videocr.get_subtitles('video.avi', lang='chi_sim+eng', sim_threshold=70))
```

`$ python3 print_sub.py`

Output:

```
0
00:00:01,042 --> 00:00:02,877
喝 点 什么 ?
What can I get you?

1
00:00:03,044 --> 00:00:05,463
我 不 知道
Um, I'm not sure.

2
00:00:08,091 --> 00:00:10,635
休闲 时 光 …
For relaxing times, make it...

3
00:00:10,677 --> 00:00:12,595
三 得 利 时 光
Bartender, Bob Suntory time.

4
00:00:14,472 --> 00:00:17,142
我 要 一 杯 伏特 加
Un, I'll have a vodka tonic.

5
00:00:18,059 --> 00:00:19,019
谢谢
Laughs Thanks.
```

## Performance

The OCR process runs in parallel and is CPU intensive. Extracting subtitles from a 20-second video takes about 3 minutes on my dual-core laptop, so you may want more cores for longer videos.
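
For a rough idea of what that figure implies for longer videos, here is a back-of-the-envelope sketch. It scales the single data point above (3 minutes for 20 seconds of video on 2 cores) and assumes throughput grows linearly with core count, which is an assumption and not something measured here. The function name `estimate_minutes` is hypothetical and not part of videocr.

```python
def estimate_minutes(video_seconds: float, cores: int) -> float:
    """Rough OCR time estimate scaled from one reference measurement.

    Assumption (not measured): throughput scales linearly with core count.
    """
    # Reference figure from this README: 3 minutes for 20 s of video on 2 cores.
    ref_minutes, ref_seconds, ref_cores = 3.0, 20.0, 2
    core_minutes_per_second = ref_minutes * ref_cores / ref_seconds
    return core_minutes_per_second * video_seconds / cores


if __name__ == '__main__':
    # A 2-minute clip on 8 cores, under the linear-scaling assumption.
    print(f'{estimate_minutes(120, 8):.1f} minutes')
```

Treat the result as an order-of-magnitude guide only; frame rate, resolution and language data all affect the real running time.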

## Installation

1. Install [Tesseract](https://github.com/tesseract-ocr/tesseract/wiki) and make sure it is in your `$PATH`.

2. `$ pip install videocr`
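
To confirm that step 1 worked, you can check from Python whether a `tesseract` executable is visible on your `$PATH`. This is a minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical and not part of videocr, and it assumes the binary is called `tesseract`.

```python
import shutil


def find_tesseract(executable='tesseract'):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == '__main__':
    path = find_tesseract()
    if path is None:
        print('tesseract not found on PATH; install it before using videocr')
    else:
        print(f'tesseract found at {path}')
```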

## API

```python
get_subtitles(
    video_path: str, lang='eng', time_start='0:00', time_end='',
    conf_threshold=65, sim_threshold=90, use_fullframe=False)
```
Returns the subtitles as a string in SRT format.
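
If you want to work with individual subtitles rather than one long string, the SRT text can be split into cues. The sketch below is one possible way to do that with the standard library; `parse_srt` and `Cue` are hypothetical names and not part of videocr, and `SAMPLE` is trimmed from the output shown earlier in this README (English lines only). In real use you would pass the return value of `get_subtitles` instead.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: str
    end: str
    text: str


def parse_srt(srt: str) -> List[Cue]:
    """Split an SRT string into cues (index, start, end, text)."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r'\n\s*\n', srt.strip()):
        lines = block.splitlines()
        # Skip anything that does not look like "index / timing / text".
        if len(lines) < 2 or not lines[0].strip().isdigit() or '-->' not in lines[1]:
            continue
        start, end = (part.strip() for part in lines[1].split('-->'))
        cues.append(Cue(int(lines[0]), start, end, '\n'.join(lines[2:])))
    return cues


SAMPLE = """0
00:00:01,042 --> 00:00:02,877
What can I get you?

1
00:00:03,044 --> 00:00:05,463
Um, I'm not sure.
"""

if __name__ == '__main__':
    for cue in parse_srt(SAMPLE):
        print(cue.index, cue.start, cue.end, cue.text)
```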

```python
save_subtitles_to_file(
    video_path: str, file_path='subtitle.srt', lang='eng', time_start='0:00', time_end='',
    conf_threshold=65, sim_threshold=90, use_fullframe=False)
```
Writes the subtitles to `file_path`. The file is created automatically if it does not exist.

### Parameters

- `lang`

  The language of the subtitles. You can extract subtitles in almost any language. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.

  You can use more than one language. For example, `'hin+eng'` uses Hindi and English together for recognition. More details are available in the [Tesseract documentation](https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#using-multiple-languages).

  Language data files are downloaded automatically to your `$HOME/tessdata` directory when necessary. You can read more about Tesseract language data files on their [wiki page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).

- `time_start` and `time_end`

  Extract subtitles from only part of the video. The subtitle timestamps are still calculated against the full video's timeline, not relative to `time_start`.

- `conf_threshold`

  Confidence threshold for word predictions. Words with confidence lower than this threshold are discarded. The default value is fine for most cases.

  Move it closer to 0 if the predictions contain too few words, or closer to 100 if they contain too many spurious words.

- `sim_threshold`

  Similarity threshold for subtitle lines. Neighbouring subtitles whose [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance) ratio is larger than this threshold are merged together. The default value is fine for most cases.

  Move it closer to 0 if you get too many duplicated subtitle lines, or closer to 100 if you get too few subtitle lines.

- `use_fullframe`

  By default, only the bottom half of each frame is used for OCR. Set this to `True` to use the full frame if your subtitles are not within the bottom half of each frame.
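
To make the interplay of `conf_threshold` and `sim_threshold` more concrete, here is a toy sketch of the two filtering steps described above. It is not videocr's actual implementation: the function names are hypothetical, the per-frame OCR results are made up, the similarity is normalised as `100 * (1 - distance / longest_length)` (an assumption; the library may compute its Levenshtein ratio differently), and merging simply keeps the first of two similar lines.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Cheapest of: deletion, insertion, substitution.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (assumed normalisation, see the note above)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def filter_words(words, conf_threshold=65):
    """Keep only words whose confidence is not lower than the threshold."""
    return ' '.join(word for word, conf in words if conf >= conf_threshold)


def merge_neighbours(lines, sim_threshold=90):
    """Drop a line when it is more similar to the last kept line than the threshold."""
    merged = []
    for line in lines:
        if merged and similarity(merged[-1], line) > sim_threshold:
            continue  # treated as a duplicate of the previous subtitle
        merged.append(line)
    return merged


if __name__ == '__main__':
    # Hypothetical per-frame OCR results: (word, confidence) pairs.
    frames = [
        [('What', 95), ('can', 91), ('I', 88), ('get', 93), ('you?', 90)],
        [('What', 94), ('can', 90), ('I', 40), ('get', 92), ('you?', 91)],
        [('Um,', 89), ("I'm", 92), ('not', 95), ('sure.', 90)],
    ]
    lines = [filter_words(frame) for frame in frames]
    print(merge_neighbours(lines, sim_threshold=90))
    print(merge_neighbours(lines, sim_threshold=80))
```

In this made-up data the second frame loses the low-confidence word `I`, so the first two lines differ slightly. At the default `sim_threshold=90` they are kept as separate subtitles, while lowering the threshold to 80 merges them, which is the "too many duplicated subtitle lines" case described above.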
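
The note on `time_start` and `time_end` can be illustrated with a small sketch that maps a time window to frame indexes and stamps frames by their absolute position in the video. Everything here is an assumption for illustration: the helper names are hypothetical, the accepted time formats (`'M:SS'` and `'H:MM:SS'`) are guessed from the `'0:00'` default, and an empty `time_end` is taken to mean "until the end of the video".

```python
def parse_time(time_str: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' to seconds (assumed formats)."""
    seconds = 0
    for part in time_str.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'


def frame_window(time_start: str, time_end: str, fps: float, total_frames: int):
    """Return the (first, last) frame indexes to process."""
    first = int(parse_time(time_start) * fps)
    # Assumption: an empty time_end means "until the last frame".
    last = int(parse_time(time_end) * fps) if time_end else total_frames - 1
    return first, min(last, total_frames - 1)


if __name__ == '__main__':
    fps, total = 24.0, 24 * 60  # hypothetical one-minute clip at 24 fps
    first, last = frame_window('0:10', '0:20', fps, total)
    # The timestamp is derived from the absolute frame index, so the first
    # processed frame is stamped at 10 seconds, not at zero.
    print(first, last, format_srt_time(first / fps))
```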