周伟奇 / pdf_to_img
ff70b617 authored 2020-08-06 14:27:27 +0800 by 周伟奇
update extract model
1 parent 77026d8c
Showing 2 changed files with 33 additions and 23 deletions
README.md
pdf_to_img.py
README.md
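 # PDF to Image Script
-## Main processing logic
+## Two conversion modes
+- Save the whole page as a PNG image
+- Extract the image objects from the PDF page
     - 0 image objects (e.g. an electronic statement): save the whole page as a PNG image
     - 1 image object
         - Large image: save the image object
         - Small image (e.g. a stamp on an electronic statement): save the whole page as a PNG image
     - More than 1 image object
-        - Multiple large images: save the image objects
+        - Multiple whole images: save the image objects
         - Multiple fragmented images: group them at the positions where width/height change abruptly, stitch each group together, then save
         - Other special cases: save the whole page as a PNG image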
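 ## Known issues
 - In extraction mode, whole images and fragmented images are told apart by width/height thresholds, which cannot fit every PDF. In some PDFs a very small whole image is treated as a fragment and merged, while a very large fragment is treated as a whole image and left unmerged.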
 ## Usage
 - python3.6+
 - `pip install -r requirements.txt`
-- `python pdf_to_img.py pdf_path [img_path]`
-| Parameter | Required | Description | Default |
-| ---- | ---- | ---- | ---- |
-| pdf_path | Yes | PDF file or directory path | - |
-| img_path | No | Image output path | Directory of the PDF file |
\ No newline at end of file
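+- `python pdf_to_img.py [-h] -i INPUT [-o OUTPUT] [-e]`
+```
+Optional arguments:
+-h, --help                    Show the help message and exit
+-i INPUT, --input INPUT       PDF file or directory path (required)
+-o OUTPUT, --output OUTPUT    Image output path (optional; defaults to the directory of the PDF file)
+-e, --extract                 By default each page is saved whole as a PNG image; pass this flag to extract embedded images instead
+```
\ No newline at end of file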
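The decision tree in the README can be read as a small routing function. The following is a minimal sketch under stated assumptions, not the code from pdf_to_img.py: `choose_strategy`, `is_large`, and the `MIN_WIDTH`/`MIN_HEIGHT` thresholds are hypothetical names and values, and routing a mix of large and small images to the "other special cases" branch is a guess. Only the tuple layout is taken from the comment in the script.

```python
# Image-list entries follow the tuple layout noted in pdf_to_img.py:
# (xref, smask, width, height, bpc, colorspace, alt.colorspace, name, filter, invoker)

# Hypothetical thresholds; the script's real values are not shown in this diff.
MIN_WIDTH = 500
MIN_HEIGHT = 500


def is_large(img, min_w=MIN_WIDTH, min_h=MIN_HEIGHT):
    """Treat an image object as a whole (large) image if both sides reach the thresholds."""
    return img[2] >= min_w and img[3] >= min_h


def choose_strategy(images, extract):
    """Return 'page', 'objects' or 'merge' for one page.

    'page'    -> render the whole page to PNG
    'objects' -> save the image objects as they are
    'merge'   -> stitch fragmented images together before saving
    """
    if not extract or not images:
        return "page"  # default mode, or no image objects on the page
    if len(images) == 1:
        return "objects" if is_large(images[0]) else "page"
    large_flags = [is_large(img) for img in images]
    if all(large_flags):
        return "objects"  # several whole images
    if not any(large_flags):
        return "merge"  # several fragments
    return "page"  # assumed handling of mixed pages ("other special cases")


scan = (5, 0, 1200, 1600, 8, "DeviceRGB", "", "Im0", "DCTDecode", 0)
stamp = (6, 0, 120, 120, 8, "DeviceRGB", "", "Im1", "FlateDecode", 0)
print(choose_strategy([scan], extract=True))
print(choose_strategy([stamp], extract=True))
```

A fixed threshold like this is exactly what the "Known issues" entry describes: a small whole image falls below it and a large fragment rises above it.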
...
...
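The README's "group at the positions where width/height change abruptly" step is not visible in this diff. One way it could work is sketched below, assuming the fragments are horizontal strips listed in page order and that a change in width marks the start of a new picture. `group_by_abrupt_change`, `merged_size`, and the `tolerance` parameter are hypothetical and are not taken from the script's `merge_il` method.

```python
def group_by_abrupt_change(sizes, tolerance=0):
    """Split consecutive (width, height) pairs into groups.

    A new group starts whenever the width differs from the previous
    fragment's width by more than `tolerance` pixels.
    """
    groups = []
    for size in sizes:
        if groups and abs(size[0] - groups[-1][-1][0]) <= tolerance:
            groups[-1].append(size)  # same picture: width is unchanged
        else:
            groups.append([size])  # abrupt change: start a new picture
    return groups


def merged_size(group):
    """Size of the image obtained by stacking one group vertically."""
    return max(w for w, _ in group), sum(h for _, h in group)


fragments = [(800, 100), (800, 100), (800, 50), (600, 200), (600, 200)]
groups = group_by_abrupt_change(fragments)
print([merged_size(g) for g in groups])
```

With the sample input the first three strips form one picture and the last two form another. Actual stitching would paste each group's strips onto a canvas of `merged_size`, for example with PIL's `Image.new` and `Image.paste`.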
pdf_to_img.py
 import os
 import sys
 import fitz
+import argparse
 from PIL import Image
 from io import BytesIO

 if sys.version_info[0] < 3:
     raise Exception("This program requires at least python3.6")

-if len(sys.argv) < 2:
-    print('Usage: python pdf_to_img.py <PDF file or directory path> [image output path]\n')
-    sys.exit(0)
-if not os.path.exists(sys.argv[1]):
-    print('PDF file or directory does not exist: {0}'.format(sys.argv[1]))
-    sys.exit(0)
+parser = argparse.ArgumentParser(description='Convert PDF to images\n')
+parser.add_argument('-i', '--input', help='PDF file or directory path (required)', required=True)
+parser.add_argument('-o', '--output', help='Image output path (optional; defaults to the directory of the PDF file)')
+parser.add_argument('-e', '--extract', help='By default each page is saved whole as a PNG image; pass this flag to extract embedded images instead', action="store_true")
+args = parser.parse_args()

 LOG_BASE = '[pdf to img]'
...
...
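The commit builds the parser and calls `parse_args()` at module level, so arguments are parsed as soon as the file is imported. A possible alternative, shown only as a sketch, wraps the same three options in functions so they can be exercised without running the script. `build_parser` and `parse_cli` are hypothetical names that do not appear in the commit; the options themselves mirror the diff.

```python
import argparse


def build_parser():
    """Create a parser with the same three options the commit introduces."""
    parser = argparse.ArgumentParser(description="Convert PDF to images")
    parser.add_argument("-i", "--input", required=True,
                        help="PDF file or directory path (required)")
    parser.add_argument("-o", "--output",
                        help="Image output path (defaults to the directory of the PDF file)")
    parser.add_argument("-e", "--extract", action="store_true",
                        help="Extract embedded images instead of saving whole pages")
    return parser


def parse_cli(argv=None):
    """Parse argv (or sys.argv[1:] when argv is None) into a namespace."""
    return build_parser().parse_args(argv)


args = parse_cli(["-i", "statements", "-e"])
print(args.input, args.output, args.extract)
```

Because `-e` uses `action="store_true"`, `args.extract` is `False` unless the flag is given, which matches the README's statement that whole-page rendering is the default.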
@@ -190,13 +191,13 @@ class PDFHandler:
                 page = pdf.loadPage(pno)
                 self.page_to_png(page)

-    def extract_image(self):
+    def extract_image(self, is_extract):
         os.makedirs(self.img_dir_path, exist_ok=True)
         with fitz.Document(self.path) as pdf:
             print('++++++++++' * 5)
             print('{0} [start] [pdf_path={1}] [metadata={2}]'.format(LOG_BASE, self.path, pdf.metadata))
             for pno in range(pdf.pageCount):
-                il = pdf.getPageImageList(pno)  # Get the page's image objects
+                il = pdf.getPageImageList(pno) if is_extract else []  # Get the page's image objects
                 # (xref, smask, width, height, bpc, colorspace, alt.colorspace, name, filter, invoker)
                 print('---------- page: {0} ----------'.format(pno))
                 print('img_object_list: {0}'.format(il))
...
...
@@ -230,26 +231,29 @@ class PDFHandler:
                     self.merge_il(pdf, pno, il)


-def extract_image(pdf_path, target_path):
+def extract_image(pdf_path, target_path, is_extract):
     pdf_handler = PDFHandler(pdf_path, target_path)
-    pdf_handler.extract_image()
+    pdf_handler.extract_image(is_extract)


 def main():
-    pdf_path = os.path.realpath(sys.argv[1])
+    if not os.path.exists(args.input):
+        print('PDF file or directory does not exist: {0}'.format(args.input))
+        return
+    pdf_path = os.path.realpath(args.input)
     # Directory: walk it and process every PDF file
     if os.path.isdir(pdf_path):
         completed_count = 0
         failed_list = []
         for parent, dirnames, filenames in os.walk(pdf_path):
             # Image output directory
-            target_path = os.path.realpath(sys.argv[2]) if len(sys.argv) > 2 else parent
+            target_path = os.path.realpath(args.output) if args.output else parent
             for pdf_file in filenames:
                 if not pdf_file.endswith('pdf') and not pdf_file.endswith('PDF'):
                     continue
                 pdf_file_path = os.path.join(parent, pdf_file)
                 try:
-                    extract_image(pdf_file_path, target_path)
+                    extract_image(pdf_file_path, target_path, args.extract)
                 except Exception as e:
                     print('{0} [failed] [err={1}] [pdf_path={2}]'.format(LOG_BASE, e, pdf_file_path))
                     failed_list.append(pdf_file_path)
...
...
@@ -261,9 +265,9 @@ def main():
     # File: process a single PDF file
     else:
         # Image output directory
-        target_path = os.path.realpath(sys.argv[2]) if len(sys.argv) > 2 else os.path.dirname(pdf_path)
+        target_path = os.path.realpath(args.output) if args.output else os.path.dirname(pdf_path)
         try:
-            extract_image(pdf_path, target_path)
+            extract_image(pdf_path, target_path, args.extract)
         except Exception as e:
             print('{0} [failed] [err={1}] [pdf_path={2}]'.format(LOG_BASE, e, pdf_path))
         else:
...
...
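Two details of `main()` can be isolated for illustration: how PDF files are discovered and how the output directory falls back to a default. The sketch below is an assumption-labelled variant, not the commit's code; `iter_pdf_files` and `resolve_target` are hypothetical helpers. It also deviates from the diff on purpose: the commit's `endswith('pdf')` / `endswith('PDF')` test accepts a name such as `notapdf` and skips `c.Pdf`, whereas this version compares the lower-cased name against `.pdf`.

```python
import os


def iter_pdf_files(root):
    """Yield the path of every file under root whose name ends in .pdf (any case)."""
    for parent, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(parent, name)


def resolve_target(output, default_dir):
    """Return the real path of `output` if given, otherwise the default directory."""
    return os.path.realpath(output) if output else default_dir
```

In directory mode the default would be the folder currently being walked (`parent` in the diff), and in single-file mode it would be `os.path.dirname(pdf_path)`.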