The Complete PDF Guide

Cornerstone Guide · 10 min read · January 15, 2025
PDF (Portable Document Format) was designed to solve one problem: a document should look exactly the same regardless of who opens it, on what device, with what software. In 1993, when Adobe released the format, this was revolutionary. Thirty years later, PDF is the global standard for contracts, invoices, reports, forms, and archival documents — precisely because it solved that problem.

But PDFs come with friction. They are often too large to email. They resist editing. Converting them to other formats loses formatting. Protecting them adds a layer that frustrates legitimate use. Combining multiple PDFs into a coherent document requires software most people do not have.

This guide covers every major PDF operation: what is happening inside the file during each one, which settings matter, and how to do it without desktop software. Everything described here works directly in your browser — no installation, no subscription required for the most common tasks.

What Is Inside a PDF File

A PDF file is a structured container. Inside it, you find several distinct components: page descriptions (instructions for drawing each page), embedded fonts (so text renders identically regardless of what fonts are installed on the reader's device), embedded images (raster graphics like photos and scanned content), and metadata (author, creation date, software used, thumbnail preview).

Understanding these components explains why PDFs behave the way they do and why certain operations work the way they do.

The page descriptions are stored as content streams, written with a compact set of drawing operators derived from the PostScript imaging model (unlike PostScript, PDF is not a full programming language). They describe where every character, line, and image appears on the page using coordinates. This is why PDFs have such consistent layout: the exact position of every element is specified numerically.
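To make the idea of a coordinate-based page description concrete, here is a minimal sketch that assembles a one-page PDF by hand. The function name `build_minimal_pdf` and the sample text are illustrative assumptions, and real PDF creators add far more (embedded fonts, compression, metadata).

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> bytes:
    """Assemble a tiny one-page PDF to expose its page description."""
    # Page description: begin text, select font F1 at 24 pt, move to
    # x=72, y=720 (points, measured from the bottom-left corner), draw text.
    # Assumes plain ASCII text without parentheses or backslashes.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the returned bytes to a file should give a page that a PDF viewer can open. The only line that describes what appears on the page is the `content` string; everything else is container structure.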

Fonts are embedded either in full (the complete font file is included, regardless of which characters appear in the document) or as subsets (only the characters actually used in the document are included). Full embedding is larger; subsetting is smaller. Most modern PDF creators subset automatically, but older tools and some enterprise software embed full fonts.
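One practical way to tell the two embedding modes apart: the PDF specification has subset fonts carry a tag of six uppercase letters and a plus sign in front of the font name. The sketch below checks for that tag; the helper name `is_subset_font` is an assumption for illustration, and the font names would come from whatever tool you use to list a file's fonts.

```python
import re

# Subset fonts are named like "ABCDEF+Helvetica": six uppercase letters,
# a plus sign, then the original PostScript font name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_subset_font(base_font_name: str) -> bool:
    """Return True if a /BaseFont name carries a subset tag."""
    # PDF names are written with a leading slash; tolerate it if present.
    return bool(SUBSET_TAG.match(base_font_name.lstrip("/")))
```

For example, `is_subset_font("/ABCDEF+Helvetica")` is true while `is_subset_font("/Helvetica")` is false.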

Images within PDFs are usually already compressed using JPEG for photographs or CCITT Group 4 (fax compression) for scanned black-and-white documents. The quality of embedded images is baked in at the time the PDF is created — this is why compressing a high-quality-image PDF achieves significant size reduction, while compressing an already-compressed one achieves little.

PDF Compression: What Actually Happens

Compressing a PDF reduces its file size by targeting the three main contributors: image data, font data, and structural overhead.

Image data is the largest contributor in most PDFs, especially anything created from scans or screenshots. PDF compression downsamples embedded images to a lower resolution and recompresses them at a lower quality level, discarding fine detail that is imperceptible at normal viewing distances. At 150 DPI (the "ebook" compression level), an A4 page viewed at full size on a typical monitor is generally indistinguishable from the 300 DPI version. The difference only appears when you zoom in beyond roughly 200% or print at large format.
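The size effect of halving the resolution can be worked out with simple arithmetic. This sketch assumes an uncompressed 24-bit RGB scan of a full A4 page; actual savings also depend on the JPEG quality chosen, so treat the numbers as an illustration of scale rather than a prediction.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)


def pixel_dimensions(dpi: int, page_mm=A4_MM) -> tuple:
    """Pixel width and height of a full-page scan at the given DPI."""
    width_mm, height_mm = page_mm
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def raw_rgb_bytes(dpi: int, page_mm=A4_MM) -> int:
    """Uncompressed size of that scan at 3 bytes (RGB) per pixel."""
    width_px, height_px = pixel_dimensions(dpi, page_mm)
    return width_px * height_px * 3
```

Halving the DPI halves both dimensions, so the 150 DPI page holds a quarter of the pixels of the 300 DPI page before any JPEG recompression is applied.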

Font subsetting removes unused characters from embedded fonts. A document using only English text from a font that includes 3,000 characters (Latin, Greek, Cyrillic, symbols) can reduce its font data by 90% by keeping only the 95 or so characters that actually appear. This is a lossless operation — the document looks and reads identically.
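The arithmetic behind that estimate can be sketched as follows. It counts distinct characters and assumes, as a simplification, that font data scales with the number of glyphs kept; real fonts have fixed tables and glyphs of varying size, so the true saving will differ.

```python
def subset_estimate(text: str, glyphs_in_font: int) -> dict:
    """Estimate how much of a font survives subsetting for this text."""
    used = set(text)  # each distinct character needs one glyph (simplified)
    kept = len(used)
    return {
        "glyphs_kept": kept,
        "glyphs_dropped": glyphs_in_font - kept,
        "fraction_dropped": 1 - kept / glyphs_in_font,
    }
```

With 95 distinct characters and a 3,000-glyph font, the fraction dropped is 1 - 95/3000, or about 97% of the glyphs.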

Structural overhead includes metadata, object streams, cross-reference tables, and thumbnail images embedded in the PDF. Stripping metadata is also a privacy operation — it removes information like the author's name, the software that created the file, and the creation timestamp. Content stream compression re-encodes the drawing instructions more efficiently.
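Content stream compression in PDF typically uses the Flate filter, which is the same deflate algorithm exposed by Python's `zlib` module. The sketch below compresses a made-up page description and confirms that the round trip is lossless; the sample stream and the helper name `flate_roundtrip` are assumptions for illustration.

```python
import zlib


def flate_roundtrip(content_stream: bytes) -> tuple:
    """Return (original size, compressed size, round trip is lossless)."""
    compressed = zlib.compress(content_stream, level=9)
    restored = zlib.decompress(compressed)
    return len(content_stream), len(compressed), restored == content_stream


# Hypothetical page description: 50 similar text-drawing instructions.
sample = b"".join(
    b"BT /F1 12 Tf 72 %d Td (Line %d) Tj ET\n" % (760 - 14 * i, i)
    for i in range(50)
)
original_size, compressed_size, lossless = flate_roundtrip(sample)
```

Drawing instructions are highly repetitive, which is why this step shrinks them substantially without changing a single character of the page description.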

Realistic expectations by PDF type: scanned documents typically compress 60-80%; Word-exported PDFs compress 20-40%; professionally typeset PDFs from design software compress 5-15%.

Compress a PDF: Reduce file size without changing content

Merging and Splitting PDFs

Merging combines multiple PDF files into a single document. Splitting divides one PDF into multiple smaller ones. Both operations work at the page boundary level — they never touch the content of any page.

When merging, the page order is the only decision you control. The merged document concatenates pages in the order you arrange the source files. Fonts, images, and formatting from each source file are preserved exactly. Bookmarks from individual source files are typically preserved but may point to incorrect pages if page numbers change significantly.
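The bookmark problem is an offset problem: a bookmark that pointed at page 5 of the second file must point further into the merged document. A minimal sketch, assuming bookmarks are tracked as zero-based page indices (the tuple layout and the name `merge_plan` are hypothetical, not any particular tool's API):

```python
def merge_plan(sources):
    """Compute merged page count and shifted bookmark targets.

    sources: list of (name, page_count, bookmarks), where bookmarks is a
    list of (title, zero_based_page_index) within that source file.
    """
    merged_bookmarks = []
    offset = 0  # number of pages contributed by earlier source files
    for name, page_count, bookmarks in sources:
        for title, page_index in bookmarks:
            if not 0 <= page_index < page_count:
                raise ValueError(f"{name}: bookmark {title!r} is out of range")
            merged_bookmarks.append((title, page_index + offset))
        offset += page_count
    return offset, merged_bookmarks
```

Merging a 3-page file and a 5-page file this way moves a bookmark on the last page of the second file from index 4 to index 7.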

Splitting has three common modes: by page range (pages 1-10 go to file A, 11-20 to file B), by interval (split every N pages automatically), and by extraction (select specific pages to keep, discard the rest). Extraction is the most powerful — it lets you cherry-pick individual pages from anywhere in the document without affecting the source.
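The three modes differ only in how they choose page numbers. The following sketch expresses each as a function over 1-based page numbers; the function names are illustrative, and a real tool would then copy the selected pages into new files.

```python
def split_by_ranges(page_count, ranges):
    """Each (first, last) pair, inclusive, becomes one output file."""
    for first, last in ranges:
        if not 1 <= first <= last <= page_count:
            raise ValueError(f"invalid range {first}-{last}")
    return [list(range(first, last + 1)) for first, last in ranges]


def split_by_interval(page_count, every_n):
    """Start a new output file every `every_n` pages."""
    if every_n < 1:
        raise ValueError("interval must be at least 1")
    return [
        list(range(start, min(start + every_n, page_count + 1)))
        for start in range(1, page_count + 1, every_n)
    ]


def extract_pages(page_count, wanted):
    """Keep only the listed pages, in the order given."""
    for page in wanted:
        if not 1 <= page <= page_count:
            raise ValueError(f"page {page} does not exist")
    return list(wanted)
```

For a 25-page document, `split_by_interval(25, 10)` yields three files: pages 1-10, 11-20, and 21-25.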

A common workflow: receive a large combined PDF from a scanner or a reporting system, split it by extraction to pull out the relevant sections, process each section independently, then merge the results back into a final delivery document. All of these operations are structural — no content is ever degraded.

Merge PDFs: Combine multiple PDF files into one

PDF Conversion: When It Works and When It Doesn't

PDF conversion covers three major categories: PDF to image (raster export), PDF to editable text formats (Word, plain text), and image to PDF (creating PDFs from image files).

PDF to image exports each page as a pixel-based image file. DPI is the critical setting — 72 DPI is screen quality and looks blurry when printed, 150 DPI is suitable for web publishing, 300 DPI is print quality. For JPG export, a secondary quality setting controls how aggressively the JPEG algorithm compresses the image data. This conversion is always high-fidelity in one direction: the PDF renders exactly as it would in a viewer, but the result is a static image with no text selectability.
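The DPI setting translates directly into output pixel dimensions, because PDF page sizes are measured in points and one point is 1/72 of an inch. A small sketch of that conversion (the function name `render_size_px` is an assumption):

```python
POINTS_PER_INCH = 72  # default PDF user-space unit


def render_size_px(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Pixel size of a page exported as an image at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)
```

A US Letter page (612 x 792 points) exports at 612 x 792 pixels at 72 DPI, 1275 x 1650 at 150 DPI, and 2550 x 3300 at 300 DPI.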

PDF to Word is the conversion that generates the most confusion. Its success depends entirely on whether the source PDF contains real text or images of text. Digital PDFs — created in Word, InDesign, or any software that outputs real text — convert well: text is extracted, headings are identified, simple tables are reconstructed. Scanned PDFs require OCR (optical character recognition) before conversion is meaningful. Even with OCR, complex layouts (multi-column text, tables with merged cells, text overlaid on images) need manual cleanup after conversion.

Image to PDF wraps image files in a PDF container. The key decisions are page size (match the image dimensions or standardise to A4/Letter), margins, and image quality preservation. A good converter embeds the original image data without recompressing it.

Convert PDF to JPG: Export PDF pages as images at any resolution

PDF Security: Passwords, Permissions, and Redaction

PDF security has two distinct layers that are frequently confused.

The open password (user password) prevents the file from being opened without entering the correct password. This is strong protection: with AES-256 encryption, the current standard, brute-forcing the encryption key is computationally infeasible, so the practical strength comes down to choosing a long, hard-to-guess password. Anyone who receives the file without the password cannot view it.

The permissions password (owner password) allows the file to be opened freely but restricts operations: editing, copying text, printing, and form completion can each be allowed or denied. This is weaker protection in practice — some PDF readers and tools bypass permissions restrictions without the owner password, treating them as advisory rather than enforced. Rely on the open password for genuine access control.
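Under the hood, these restrictions are a set of bit flags stored in the `/P` entry of the file's encryption dictionary, which is why a tool can choose to ignore them. The sketch below decodes such a value using the bit positions defined in the PDF specification; the dictionary keys are illustrative labels, not official names.

```python
# Bit positions (1-based, lowest bit first) from the PDF specification.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict:
    """Decode a /P value (a signed 32-bit integer) into named permissions."""
    unsigned = p & 0xFFFFFFFF  # view the signed value as raw bits
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }
```

A value of -1 has every bit set and allows everything; -3904 clears all eight permission bits; -3884 allows only printing and copying.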

Removing a password requires knowing it. This bears repeating: removing PDF encryption is not cracking; it is decryption using the correct key. Modern AES-256 PDF encryption protected by a strong password cannot be bypassed by brute force, and no legitimate online tool can crack it. Any service claiming to remove a password from a strongly encrypted PDF without the password is either exploiting a weakness in an older encryption standard (RC4-40, RC4-128), guessing a weak password, or misleading users.
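A rough calculation shows why brute force is hopeless against the key but not necessarily against a short password. The guess rates below are hypothetical round numbers chosen for illustration, not measurements of any real tool.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_exhaust(keyspace: int, guesses_per_second: float) -> float:
    """Years needed to try every candidate at a constant guess rate."""
    return keyspace / guesses_per_second / SECONDS_PER_YEAR


# Every possible 256-bit key at an assumed trillion guesses per second.
key_years = years_to_exhaust(2**256, 1e12)

# Every 8-letter lowercase password at an assumed million guesses per second.
weak_password_years = years_to_exhaust(26**8, 1e6)
```

Under these assumptions the full key space takes more than 10^50 years to search, while the 8-letter lowercase password space is exhausted in a few days. The encryption is only as strong as the password behind it.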

Redaction is the permanent removal of sensitive content from a PDF. True redaction deletes the underlying text and image data from the page and then marks the area, usually with a black rectangle. Covering text with a black rectangle drawn on top of the page is not redaction: the original text remains in the PDF and can be extracted. Use a proper redaction tool, not an annotation or drawing tool, for legally sensitive content.
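The difference is easy to see in a simplified page description. In this sketch the streams are uncompressed, made-up examples and the extraction is a naive regular expression (real files use compressed streams and more string encodings), but the principle carries over: drawing a box adds instructions, it does not remove any.

```python
import re

# Matches a literal string followed by the Tj (show text) operator.
TEXT_SHOWN = re.compile(rb"\((.*?)\)\s*Tj")


def extractable_strings(content_stream: bytes) -> list:
    """Return text strings that a text extractor could still read."""
    return [m.decode("latin-1") for m in TEXT_SHOWN.findall(content_stream)]


BLACK_BOX = b"0 0 0 rg 70 715 140 16 re f"  # fill a black rectangle

original = b"BT /F1 12 Tf 72 720 Td (Account 12345678) Tj ET"
covered = original + b"\n" + BLACK_BOX  # box drawn on top, text still there
redacted = BLACK_BOX                    # text instructions removed entirely
```

Here `extractable_strings(covered)` still returns the account number, while `extractable_strings(redacted)` returns an empty list.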

Protect a PDF: Add AES-256 password encryption

Common PDF Workflows

Some of the most common multi-step PDF workflows appear in business, legal, and academic contexts. Understanding the right sequence of operations saves time and avoids quality loss.

Invoice package assembly: create or export each component as a separate PDF (invoice, purchase order, delivery note), merge them in the correct order, add page numbers for consistent pagination, compress the combined file for email attachment, then optionally add a watermark for draft tracking.

Scanned document digitisation: scan physical pages to image files (JPEG or PNG), convert them to PDF, run OCR to make the text searchable, compress the OCR-processed PDF (OCR typically increases size), and archive or distribute the result.

Contract distribution: start with the final contract PDF, add your digital signature, encrypt with an open password communicated separately, and distribute. The recipient unlocks, signs, and returns the encrypted signed copy.

Large report splitting: receive a 200-page combined report PDF, split it by page range into individual section files, distribute each section to the relevant stakeholders, then merge the feedback documents you receive back into a single review package.

Each workflow involves a sequence of structural operations on the PDF — none of them require modifying the content of any page directly. All can be performed using browser-based tools on documents of any size.

Frequently Asked Questions

What is the difference between a PDF and a DOCX file?
PDF uses fixed-position layout — every element is placed at exact coordinates, rendering identically on any device. DOCX uses flow layout — text reflows based on page size, fonts, and margins, which means it may look different on different devices. Use PDF for documents where layout must be preserved; use DOCX for documents that need to be edited.
Why is my PDF so large?
The most common causes are embedded high-resolution images (especially in scanned PDFs), unsubsetted fonts that include thousands of unused characters, and metadata including embedded thumbnails. Scan-based PDFs are typically the largest. Use the Compress PDF tool to address all three causes at once.
Can I edit a PDF directly?
Yes, but with significant limitations. The PDF Editor tool allows annotation, text insertion, and drawing. Full-text editing that reflows paragraphs requires conversion to Word format first. Direct PDF editing works best for adding signatures, stamps, comments, and filling form fields — not for rewriting body text.
What is the difference between PDF/A and regular PDF?
PDF/A is a subset of PDF designed for long-term archiving. It requires all fonts to be embedded (no external font dependencies), prohibits encryption, disallows audio and video content, and includes colour profile information. Regulatory and legal archiving requirements often specify PDF/A.
Are my PDFs private when I use browser-based PDF tools?
For tools that run in the browser using WebAssembly (compress, merge, split, rotate, protect, unlock, extract text), yes — your files never leave your device. For server-side tools (PDF to Word, OCR, advanced conversion), files are processed on secure servers and deleted after processing.
Why can't I copy text from my PDF?
Either the PDF is scanned (images of text rather than real text — use OCR), or the owner password restricts text copying. Try unlocking the PDF first. If it is scanned, use the OCR tool to create a searchable text layer, then extract text.
What happens to PDF forms after compression?
Interactive PDF form fields are preserved during compression. The form fields are part of the document structure, not the page content — compression operates on images and fonts, not form elements. You can fill form fields in a compressed PDF exactly as in the original.

Summary

PDF is the most durable and universal document format in existence. Understanding what is inside a PDF file — and which operations affect which components — makes every PDF task faster and more predictable.

Compress to target images and fonts; merge and split at page boundaries without touching content; convert with awareness of whether real text or scanned images are present; protect with an open password for genuine security and permissions for advisory restrictions. These operations cover the large majority of everyday PDF workflows.

All tools described in this guide are available in the PDF tools section. Every browser-based operation processes your files locally — your documents stay on your device.
