Wednesday 3 April 2024

Convert PDF files using Python


from gtts import gTTS
from PyPDF2 import PdfReader

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, 'rb') as f:
        reader = PdfReader(f)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
    return text

def text_to_audio(text, output_file):
    tts = gTTS(text)
    tts.save(output_file)

# Example usage:
pdf_file = "clcoding.pdf"
output_audio_file = "clcoding_audio.mp3"

text = pdf_to_text(pdf_file)
text_to_audio(text, output_audio_file)

#clcoding.com

Explanation: 

This Python script converts the text of a PDF file into an audio file, using the gTTS (Google Text-to-Speech) library to generate the speech and the PdfReader class from PyPDF2 to read the PDF.

Here's a breakdown of what each part of the code does:

Importing Required Libraries:

from gtts import gTTS: This imports the gTTS class from the gtts module, which converts text to speech by calling Google Translate's text-to-speech service (so an internet connection is required).
from PyPDF2 import PdfReader: This imports the PdfReader class from the PyPDF2 library, which is used to read PDF files.

pdf_to_text(pdf_file) Function:

This function takes a PDF file path (pdf_file) as input.
It opens the PDF file in binary mode and creates a PdfReader object to read the content of the PDF.
It iterates through each page of the PDF (reader.pages) and extracts text from each page using the extract_text() method.
It concatenates all the extracted text from each page into a single string (text).
Finally, it returns the concatenated text.
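
Two details of the concatenation step are easy to miss: pages are joined with no separator, so the last word of one page can fuse with the first word of the next, and a page with no extractable text (a scanned image, for example) contributes nothing useful. The following is a minimal sketch of a more defensive join. The name join_page_texts and its separator parameter are hypothetical, not part of PyPDF2, and the sketch assumes only that each page object exposes an extract_text() method; treating a None result as an empty page is a precaution, not documented behaviour.

```python
def join_page_texts(pages, separator="\n"):
    """Join the text of each page, skipping pages that yield nothing.

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader(...).pages.
    """
    parts = []
    for page in pages:
        page_text = page.extract_text() or ""  # treat None like an empty page
        if page_text.strip():
            parts.append(page_text)
    return separator.join(parts)
```

Inside pdf_to_text, the loop could then be replaced by a single call such as join_page_texts(reader.pages).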

text_to_audio(text, output_file) Function:

This function takes two arguments: the text to convert to audio (text) and the file path where the audio will be saved (output_file).
It creates a gTTS object (tts) by passing the input text.
It saves the generated audio file to the specified output file path.
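
Text extracted from a PDF usually carries hard line breaks and words hyphenated across lines, which can make the speech sound choppy, and a PDF made only of images yields no text to speak at all. As a sketch under those assumptions, the text could be tidied before it is handed to gTTS. The helper name prepare_for_speech is hypothetical and is not part of gTTS.

```python
import re

def prepare_for_speech(text):
    """Tidy extracted PDF text before passing it to a text-to-speech engine."""
    # Rejoin words split across lines with a hyphen: "exam-\nple" -> "example"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) into one space
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        raise ValueError("No text to speak: the PDF may contain only images")
    return text
```

One trade-off: a genuinely hyphenated compound that happens to break at a line end (for example "well-" followed by "known") is also joined, so the hyphen rule is a heuristic rather than a guarantee.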

Example Usage:

It defines the input PDF file (pdf_file) as "clcoding.pdf".
It defines the output audio file path (output_audio_file) as "clcoding_audio.mp3".
It calls the pdf_to_text() function to extract text from the PDF file.
It calls the text_to_audio() function to convert the extracted text to audio and save it to the specified output file.
The comment #clcoding.com is unrelated to the code and appears to be a note or reference to a website.

This script essentially converts the text content of a PDF file into audio, which could be useful for tasks such as creating audiobooks, generating voiceovers, or assisting users with reading disabilities.

import os
from PyPDF2 import PdfReader
from pdf2image import convert_from_path

def pdf_to_images(pdf_file, output_dir):
    images = []
    with open(pdf_file, 'rb') as f:
        reader = PdfReader(f)
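        for page_num, _ in enumerate(reader.pages):
            # Convert each PDF page to image (pdf2image needs Poppler installed;
            # its page numbers are 1-based)
            page_image = convert_from_path(
                pdf_file, first_page=page_num + 1, last_page=page_num + 1
            )[0]
            img_path = os.path.join(output_dir, f"page_{page_num}.png")
            page_image.save(img_path, "PNG")
            images.append(img_path)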
    return images

# Example usage:
pdf_file = "clcoding.pdf"
output_dir = "output_images"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

image_paths = pdf_to_images(pdf_file, output_dir)

#clcoding.com

Explanation: 

This Python script converts each page of a PDF file into an image format (PNG) using the PyPDF2 library to read the PDF and the pdf2image library to perform the conversion.

Here's a breakdown of each part of the code:

Importing Required Libraries:

import os: This imports the os module, which provides functions for interacting with the operating system, such as creating directories.
from PyPDF2 import PdfReader: This imports the PdfReader class from the PyPDF2 library, which is used to read PDF files.
from pdf2image import convert_from_path: This imports the convert_from_path function from the pdf2image library, which is used to convert PDF pages to images.

pdf_to_images(pdf_file, output_dir) Function:

This function takes two arguments: the path to the input PDF file (pdf_file) and the directory where the output images will be saved (output_dir).
It initializes an empty list called images to store the file paths of the converted images.
It opens the PDF file in binary mode ('rb') using a context manager (with open(...) as f) and creates a PdfReader object (reader) to read the content of the PDF.
It iterates through each page of the PDF (reader.pages) using enumerate to get the page number; the page object itself is not needed here and is discarded as _.
For each page, it generates a file path for the corresponding image in the output directory (output_dir) using os.path.join and appends it to the images list.
Finally, it returns the list of image file paths.
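
The file names above start at page_0.png and are not zero-padded, so in a plain alphabetical listing page_10.png sorts before page_2.png. The sketch below shows one alternative naming scheme. The helper image_path_for_page and its parameters are hypothetical and belong to neither PyPDF2 nor pdf2image.

```python
import os

def image_path_for_page(output_dir, page_index, page_count):
    """Return a zero-padded, 1-based PNG path for the page at `page_index` (0-based)."""
    width = len(str(page_count))  # e.g. 3 digits for a 120-page PDF
    return os.path.join(output_dir, f"page_{page_index + 1:0{width}d}.png")
```

Inside the loop, this could stand in for the os.path.join call, with len(reader.pages) supplying page_count.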

Example Usage:

It defines the input PDF file (pdf_file) as "clcoding.pdf".
It defines the output directory (output_dir) as "output_images".
It checks if the output directory does not exist (if not os.path.exists(output_dir)), and if not, it creates the directory using os.makedirs(output_dir).
It calls the pdf_to_images() function to convert the PDF pages to images and stores the list of image file paths.
The comment #clcoding.com is unrelated to the code and appears to be a note or reference to a website.
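
The check-then-create pattern for the output directory can be folded into one call, which also avoids an error if something else creates the directory between the check and the makedirs call. This is a small sketch, and ensure_dir is a hypothetical helper name.

```python
import os

def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not already exist."""
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    return path
```

Calling ensure_dir("output_images") more than once is harmless.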

This script can be useful for tasks such as converting PDF pages to images for further processing or display, such as in document management systems, image processing pipelines, or for creating thumbnails of PDF documents.

import os
from PyPDF2 import PdfReader
import docx

def pdf_to_text():
    pdf_file = "clcoding.pdf"
    text = ""
    with open(pdf_file, 'rb') as f:
        reader = PdfReader(f)
        for page_num in range(len(reader.pages)):
            page_text = reader.pages[page_num].extract_text()
            text += page_text
    return text

def pdf_to_docx(output_file):
    text = pdf_to_text()
    doc = docx.Document()
    doc.add_paragraph(text)
    doc.save(output_file)

# Example usage:
output_docx_file = "output_docx.docx"

pdf_to_docx(output_docx_file)

#clcoding.com

Explanation:

This Python script converts the text content of a PDF file ("clcoding.pdf") into a Microsoft Word document (.docx) using the PyPDF2 library to extract text from the PDF and the python-docx library to create and save the Word document.

Here's a breakdown of each part of the code:

Importing Required Libraries:

import os: This imports the os module, which provides functions for interacting with the operating system, such as creating directories or checking file paths. It is not actually used in this script and could be removed.
from PyPDF2 import PdfReader: This imports the PdfReader class from the PyPDF2 library, which is used to read PDF files.
import docx: This imports the docx module, which is part of the python-docx library, used for creating and manipulating Word documents.

pdf_to_text() Function:

This function reads the text content of the PDF file ("clcoding.pdf").
It initializes an empty string called text to store the extracted text.
It opens the PDF file in binary mode ('rb') using a context manager (with open(...) as f) and creates a PdfReader object (reader) to read the content of the PDF.
It iterates through each page of the PDF (reader.pages) using range(len(reader.pages)) to get the page number.
For each page, it extracts the text using the extract_text() method of the page object and appends it to the text string.
Finally, it returns the concatenated text.

pdf_to_docx(output_file) Function:

This function converts the extracted text from the PDF to a Word document.
It calls the pdf_to_text() function to get the text content of the PDF.
It creates a new docx.Document() object (doc) to represent the Word document.
It adds a paragraph containing the extracted text to the document using the add_paragraph() method.
It saves the document to the specified output file path using the save() method.
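
Because the whole text goes into a single add_paragraph() call, the resulting document contains exactly one paragraph. If the extracted text separates paragraphs with blank lines (which depends on the PDF and is an assumption here), it could be split first. The helper split_paragraphs is a hypothetical name and is not part of python-docx.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines and unwrap the hard line breaks inside each block."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        unwrapped = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if unwrapped:
            paragraphs.append(unwrapped)
    return paragraphs
```

Each returned string could then be passed to doc.add_paragraph() in a loop, giving one Word paragraph per block of text.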

Example Usage:

It defines the output Word document file path (output_docx_file) as "output_docx.docx".
It calls the pdf_to_docx() function to convert the PDF text content to a Word document and save it to the specified output file.
The comment #clcoding.com is unrelated to the code and appears to be a note or reference to a website.

This script is useful for converting the text content of a PDF file to a Word document, which can be helpful for further editing or formatting. Make sure you have the necessary libraries installed (PyPDF2 and python-docx).

import os
from PyPDF2 import PdfReader
import pandas as pd

def pdf_to_text():
    pdf_file = "clcoding.pdf"
    text = ""
    with open(pdf_file, 'rb') as f:
        reader = PdfReader(f)
        for page_num in range(len(reader.pages)):
            page_text = reader.pages[page_num].extract_text()
            text += page_text
    return text

def pdf_to_excel(output_file):
    text = pdf_to_text()
    lines = text.split('\n')
    df = pd.DataFrame(lines)
    df.to_excel(output_file, index=False, header=False)

# Example usage:
output_excel_file = "output_excel.xlsx"

pdf_to_excel(output_excel_file)

#clcoding.com

Explanation:

This Python script reads the text content of a PDF file ("clcoding.pdf") and then converts it into an Excel file (.xlsx) using the Pandas library.

Here's a breakdown of each part of the code:

Importing Required Libraries:

import os: This imports the os module, which provides functions for interacting with the operating system, such as creating directories or checking file paths. It is not actually used in this script and could be removed.
from PyPDF2 import PdfReader: This imports the PdfReader class from the PyPDF2 library, which is used to read PDF files.
import pandas as pd: This imports the Pandas library, often used for data manipulation and analysis.

pdf_to_text() Function:

This function reads the text content of the PDF file ("clcoding.pdf").
It initializes an empty string called text to store the extracted text.
It opens the PDF file in binary mode ('rb') using a context manager (with open(...) as f) and creates a PdfReader object (reader) to read the content of the PDF.
It iterates through each page of the PDF (reader.pages) using range(len(reader.pages)) to get the page number.
For each page, it extracts the text using the extract_text() method of the page object and appends it to the text string.
Finally, it returns the concatenated text.

pdf_to_excel(output_file) Function:

This function converts the extracted text from the PDF to an Excel file.
It calls the pdf_to_text() function to get the text content of the PDF.
It splits the text into lines using the newline character ('\n') and creates a list of lines.
It creates a Pandas DataFrame (df) from the list of lines.
It saves the DataFrame to an Excel file specified by the output_file parameter using the to_excel() method, without including the index or header.
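
With this approach every line of text lands in a single cell of column A. Where the PDF contains tabular text whose columns are separated by tabs or by runs of two or more spaces (an assumption that will not hold for every PDF), each line could be split into cells first. The helper lines_to_rows is a hypothetical name and is not part of pandas.

```python
import re

def lines_to_rows(text):
    """Split each non-empty line into cells wherever a tab or 2+ spaces occur."""
    rows = []
    for line in text.split("\n"):
        line = line.strip()
        if line:
            rows.append(re.split(r"\t| {2,}", line))
    return rows
```

The resulting list of lists could be passed to pd.DataFrame() in place of the flat list of lines, so that each cell gets its own column in the spreadsheet.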

Example Usage:

It defines the output Excel file path (output_excel_file) as "output_excel.xlsx".
It calls the pdf_to_excel() function to convert the PDF text content to an Excel file and save it to the specified output file.
The comment #clcoding.com is unrelated to the code and appears to be a note or reference to a website.

This script can be useful for extracting text data from PDF files and converting it into a structured format like an Excel spreadsheet for further analysis or manipulation. Make sure you have the necessary libraries installed (PyPDF2, pandas, and openpyxl, which pandas uses to write .xlsx files).



import os
from PyPDF2 import PdfReader
from pptx import Presentation

def pdf_to_text():
    pdf_file = "clcoding.pdf"  # Using "clcoding.pdf"
    text = ""
    with open(pdf_file, 'rb') as f:
        reader = PdfReader(f)
        for page_num in range(len(reader.pages)):
            page_text = reader.pages[page_num].extract_text()
            text += page_text
    return text

def pdf_to_ppt(output_file):
    text = pdf_to_text()
    prs = Presentation()
    slides = text.split('\n\n')
    for slide_content in slides:
        slide = prs.slides.add_slide(prs.slide_layouts[1])
        slide.shapes.title.text = slide_content
    prs.save(output_file)

# Example usage:
output_ppt_file = "output_ppt.pptx"

pdf_to_ppt(output_ppt_file)

#clcoding.com

Explanation:

This Python script converts the text content of a PDF file ("clcoding.pdf") into a PowerPoint presentation (.pptx) using the PyPDF2 library to extract text from the PDF and the python-pptx library to create and save the PowerPoint presentation.

Here's a breakdown of each part of the code:

Importing Required Libraries:

import os: This imports the os module, which provides functions for interacting with the operating system, such as creating directories or checking file paths. It is not actually used in this script and could be removed.
from PyPDF2 import PdfReader: This imports the PdfReader class from the PyPDF2 library, which is used to read PDF files.
from pptx import Presentation: This imports the Presentation class from the pptx module, which is part of the python-pptx library used for creating and manipulating PowerPoint presentations.

pdf_to_text() Function:

This function reads the text content of the PDF file ("clcoding.pdf").
It initializes an empty string called text to store the extracted text.
It opens the PDF file in binary mode ('rb') using a context manager (with open(...) as f) and creates a PdfReader object (reader) to read the content of the PDF.
It iterates through each page of the PDF (reader.pages) using range(len(reader.pages)) to get the page number.
For each page, it extracts the text using the extract_text() method of the page object and appends it to the text string.
Finally, it returns the concatenated text.

pdf_to_ppt(output_file) Function:

This function converts the extracted text from the PDF to a PowerPoint presentation.
It calls the pdf_to_text() function to get the text content of the PDF.
It creates a new Presentation() object (prs) to represent the PowerPoint presentation.
It splits the text into slides based on double newline characters ('\n\n').
For each slide content, it adds a new slide to the presentation using the add_slide() method, specifying the layout of the slide.
It sets the title of each slide to the slide content using the shapes.title.text property.
Finally, it saves the presentation to the specified output file path using the save() method.
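
The loop puts an entire chunk of text into the slide title, and it also creates a slide for any empty chunk produced by the split. A sketch of one way to prepare the chunks first is shown below: the first line of each chunk becomes the title, the remaining lines become the body, and empty chunks are skipped. The helper split_slide_chunks is a hypothetical name and is not part of python-pptx.

```python
def split_slide_chunks(text):
    """Turn blank-line-separated chunks into (title, body) pairs, skipping empty chunks."""
    slides = []
    for chunk in text.split("\n\n"):
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if not lines:
            continue  # nothing to show, so no slide
        title, body = lines[0], "\n".join(lines[1:])
        slides.append((title, body))
    return slides
```

In pdf_to_ppt, the title could then go to slide.shapes.title.text and the body to slide.placeholders[1].text, on the assumption that layout 1 of the template in use is the usual "Title and Content" layout.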

Example Usage:

It defines the output PowerPoint file path (output_ppt_file) as "output_ppt.pptx".
It calls the pdf_to_ppt() function to convert the PDF text content to a PowerPoint presentation and save it to the specified output file.
The comment #clcoding.com is unrelated to the code and appears to be a note or reference to a website.

This script can be useful for converting the text content of a PDF file into a structured format like a PowerPoint presentation, which can be helpful for presentations or sharing information in a visually appealing format. Make sure you have the necessary libraries installed (PyPDF2 and python-pptx).
