About This Archive - MediaOCU Archive

Archive Overview

This archive preserves content from MediaOCU, Oklahoma City University's student newspaper, covering the period from 2010 to 2024. The original WordPress site was compromised in an attack, and this static archive was created using data recovered from the Internet Archive's Wayback Machine.

Years Archived: 15 (2010-2024)

Full Articles: 1,043

Article Headers: 2,023

Total Article Pages: 3,066

Purpose: This archive serves as a historical record of OCU student journalism and campus life. While the university has rebuilt its live site at mediaocu.com, this archive preserves the earlier content for research, reference, and remembrance.

How This Archive Was Built

1. Data Recovery (Wayback Machine)

All content was recovered from the Internet Archive's Wayback Machine using the open-source tool wayback-machine-downloader by Hartator.

Recovery Process:

Identified snapshots of mediaocu.com in the Wayback Machine (2010-2024)
Downloaded HTML, CSS, JavaScript, and image files using wayback-machine-downloader
Retrieved approximately 1.4GB of data with 10,595 HTML files
Extracted WordPress database exports when available
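The snapshot-identification step can be illustrated with the Wayback Machine's CDX API. This is a minimal sketch, not the project's method: the recovery itself used wayback-machine-downloader, and the function names here are hypothetical. It only builds the query URL and parses a response, so no network access is needed to try it.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, year_from, year_to):
    """Build a CDX query URL listing successful captures under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains and all paths
        "from": str(year_from),
        "to": str(year_to),
        "output": "json",
        "fl": "timestamp,original",  # only the fields used below
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "digest",        # drop adjacent identical captures
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload):
    """Turn a CDX JSON payload (header row, then data rows) into dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def captures_per_year(captures):
    """Count captures by the year prefix of the 14-digit timestamp."""
    counts = {}
    for capture in captures:
        year = capture["timestamp"][:4]
        counts[year] = counts.get(year, 0) + 1
    return counts
```

Fetching the URL from build_cdx_query("mediaocu.com", 2010, 2024) and passing the body through parse_cdx_rows and captures_per_year would give a rough picture of how densely each year was crawled.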

2. Data Cleanup (Automated + AI)

The recovered WordPress site contained significant bloat and artifacts. Cleanup was performed using custom Python scripts and AI assistance (Claude by Anthropic).

Cleanup Tasks:

Removed WordPress artifacts: 4,175 feed directories, 2,443 query string directories, PHP files
Stripped tracking code: Google Analytics, SEO plugins, monitoring scripts (saved 72.6MB)
Fixed broken links: Converted 18,680 absolute URLs to relative paths
Cleaned filenames: Removed Wayback Machine query strings from 34 asset files
Rebuilt navigation: Created year archive pages with proper article listings
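The link-fixing task above can be sketched as follows. This is an illustration under stated assumptions, not the project's fix-article-assets.py: the host list, the index.html convention for directory URLs, and the helper names are all hypothetical, and the regex only handles double-quoted href and src attributes.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Assumption: hosts that count as "this site" for rewriting purposes.
SITE_HOSTS = {"mediaocu.com", "www.mediaocu.com"}

# Simplification: only double-quoted href/src attributes are matched.
ATTR = re.compile(r'(\b(?:href|src)=")([^"]+)(")')


def to_relative(url, page_path):
    """Rewrite a same-site absolute URL relative to page_path.

    External, already-relative, and non-HTTP URLs are returned unchanged.
    Query strings are dropped, since a static archive cannot serve them.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or parts.hostname not in SITE_HOSTS:
        return url
    target = parts.path or "/"
    if target.endswith("/"):
        target += "index.html"  # assumed layout: directories hold index.html
    start = posixpath.dirname(page_path) or "/"
    rel = posixpath.relpath(target, start=start)
    if parts.fragment:
        rel += "#" + parts.fragment
    return rel


def rewrite_html(html, page_path):
    """Apply to_relative to every href/src value in an HTML string."""
    return ATTR.sub(
        lambda m: m.group(1) + to_relative(m.group(2), page_path) + m.group(3),
        html,
    )
```

For a page stored at /2012/index.html, a link to http://mediaocu.com/2011/03/story/ becomes ../2011/03/story/index.html, which keeps working when the archive is opened from disk or hosted under a different domain.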

3. Redesign (AI-Assisted)

The original WordPress theme was stripped away and replaced with a clean, minimal design created with assistance from Claude (AI). The new design emphasizes:

Newspaper typography (serif fonts, traditional layout)
OCU blue branding (#0066cc)
Timeline-style article presentation
Maximum readability and content density
Minimal whitespace and distraction-free reading

Data Coverage by Year

The archive contains varying levels of completeness for each year, depending on what was captured by the Wayback Machine and how the content was stored in WordPress.

Year	Full Articles	Headers Only	Total Pages	Status
2010	64	139	203	✓ Good
2011	219	459	678	✓ Excellent
2012	244	218	462	✓ Excellent
2013	175	259	434	✓ Excellent
2014	29	97	126	◐ Partial
2015	55	136	191	✓ Good
2016	206	132	338	✓ Excellent
2017	51	104	155	✓ Good
2018	0	128	128	✗ Headers Only
2019	0	28	28	✗ Headers Only
2020	0	102	102	✗ Headers Only
2021	0	154	154	✗ Headers Only
2022	0	46	46	✗ Headers Only
2023	0	10	10	✗ Headers Only
2024	0	11	11	✗ Headers Only

Status Definitions

✓ Excellent: 100+ full articles with complete content
✓ Good: 50-99 full articles with complete content
◐ Partial: Fewer than 50 full articles; limited coverage
✗ Headers Only: Only article titles and metadata, no body content

Known Limitations & Missing Data

What We Know Is Missing

Images: Zero images recovered. The Wayback Machine may not have captured them, or they were stored on a separate CDN/server
2018-2024 article content: Only headers/titles captured, likely due to WordPress changes or less frequent Wayback crawling
Comments: User comments on articles were not preserved
Dynamic content: Interactive elements, forms, and JavaScript-dependent features lost
Multimedia: Videos, audio files, and embedded content not preserved
Author profiles: Individual author pages and bios incomplete or missing

What Might Be Missing (Unknown Unknowns)

Wayback Machine gaps: Articles published between Wayback crawls may never have been captured
Deleted/unpublished content: Draft articles, scheduled posts, or content removed before archiving
Database-only content: Data stored in WordPress database but not rendered as HTML pages
Premium/paywalled content: If any content required login, it wouldn't have been archived
Category/tag relationships: Full taxonomy relationships may be incomplete
Custom post types: Special content types beyond standard articles (portfolios, galleries, etc.)

Note on Completeness: This archive contains everything that could be recovered from the Wayback Machine data available at the time of the rebuild. The Internet Archive's crawlers visit sites only periodically, so content that was published and removed between crawls, or never crawled at all, cannot be recovered from this source.

Percentage Analysis

Based on article file analysis:

34% of article pages contain full content (1,043 / 3,066)
66% of article pages are headers only (2,023 / 3,066)
Years 2010-2017: 40% full content recovery rate (1,043 / 2,587)
Years 2018-2024: 0% full content recovery rate (0 / 479)
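These shares can be recomputed directly from the per-year coverage table. A minimal sketch, assuming the table is transcribed into a hypothetical COVERAGE dictionary; it is not the project's analyze-archive-data.py, which works from the article files themselves.

```python
# (full articles, headers only) per year, transcribed from the coverage table.
COVERAGE = {
    2010: (64, 139), 2011: (219, 459), 2012: (244, 218), 2013: (175, 259),
    2014: (29, 97),  2015: (55, 136),  2016: (206, 132), 2017: (51, 104),
    2018: (0, 128),  2019: (0, 28),    2020: (0, 102),   2021: (0, 154),
    2022: (0, 46),   2023: (0, 10),    2024: (0, 11),
}


def recovery_rate(years):
    """Return (full, total, full/total) summed over the given years."""
    full = sum(COVERAGE[year][0] for year in years)
    headers = sum(COVERAGE[year][1] for year in years)
    total = full + headers
    return full, total, (full / total if total else 0.0)


if __name__ == "__main__":
    spans = {
        "All years": COVERAGE,
        "2010-2017": range(2010, 2018),
        "2018-2024": range(2018, 2025),
    }
    for label, years in spans.items():
        full, total, rate = recovery_rate(years)
        print(f"{label}: {full:,} / {total:,} = {rate:.0%}")
```

Keeping the counts in one structure means the summary figures and the table cannot drift apart when a year's numbers are revised.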

Tools & Technology Used

Data Recovery

wayback-machine-downloader by Hartator - Ruby gem for downloading websites from Wayback Machine
Internet Archive Wayback Machine - Source of all archived content

Processing & Cleanup (Python Scripts)

The project includes 16 scripts, mostly Python with at least one shell script, organized in the maintenance/ directory:

Analysis Scripts (maintenance/analysis/)
- analyze-archive-data.py - Generates statistics about articles per year
- analyze-homepage-links.py - Identifies broken links
Cleanup Scripts (maintenance/cleanup/)
- cleanup-html.py - Removed WordPress tracking/SEO metadata from 4,966 HTML files
- cleanup-homepage.py - Cleaned homepage HTML and removed duplicates
Fix Scripts (maintenance/fixes/)
- fix-article-assets.py - Converted 18,680 absolute URLs to relative paths
- fix-homepage-links.py - Repaired broken article links
- fix-css-filenames.sh - Cleaned 34 CSS/JS files with query string artifacts
Generation Scripts (maintenance/generation/)
- rebuild-year-archives-clean.py - Rebuilt all 15 year pages with clean HTML
- generate-archive-homepage-v2.py - Generated new homepage
- generate-year-archives.py - Created year archive index pages
- generate-plain-site.py - Generated plain-text version for analysis
- update-year-footers.py - Added navigation links to all pages
Styling Scripts (maintenance/styling/)
- apply-spacing.py - Switches between spacing options (minimal/balanced/tight)
- apply-design.py - Applies design templates to pages
- 4 design options + 3 spacing variations (CSS files)

All scripts are documented in maintenance/README.md and in the subfolder READMEs.
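As an illustration of the filename cleanup, the sketch below renames downloaded assets such as style.css?ver=4.9.8 to style.css. It is a Python approximation written for this page, not the project's fix-css-filenames.sh (whose contents are not reproduced here); the function names and the choice to limit renames to .css and .js files are assumptions.

```python
from pathlib import Path, PurePosixPath

# Assumption: only stylesheet and script files are renamed.
ASSET_SUFFIXES = {".css", ".js"}


def clean_name(name):
    """Drop everything from the first '?' onward."""
    return name.split("?", 1)[0]


def plan_renames(root):
    """Return (old_path, new_path) pairs for assets named with a query string."""
    planned, taken = [], set()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or "?" not in path.name:
            continue
        new_name = clean_name(path.name)
        if PurePosixPath(new_name).suffix.lower() not in ASSET_SUFFIXES:
            continue
        target = path.with_name(new_name)
        # Never overwrite an existing file or a target already planned.
        if target.exists() or target in taken:
            continue
        taken.add(target)
        planned.append((path, target))
    return planned


def apply_renames(root, dry_run=True):
    """Rename the planned files (unless dry_run) and report the name pairs."""
    renamed = []
    for old, new in plan_renames(root):
        if not dry_run:
            old.rename(new)
        renamed.append((old.name, new.name))
    return renamed
```

Running apply_renames with dry_run=True first lists what would change without touching any files. HTML pages that reference the old names would still need their links updated separately.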

AI Assistance (Claude by Anthropic)

Claude, an AI assistant by Anthropic, was used extensively throughout this project:

Code Generation: Created all Python cleanup and generation scripts
Design: Developed the clean, newspaper-inspired CSS design
Problem Solving: Diagnosed and fixed broken links, layout issues, and WordPress artifacts
Data Analysis: Analyzed archive completeness and identified gaps
Documentation: Created this overview page and project documentation

Project Timeline: The entire recovery, cleanup, and redesign process was completed in a single collaborative session between the archive maintainer and Claude, demonstrating the power of AI-assisted archival work.

Documentation

Comprehensive documentation is organized in the docs/ directory:

Project Documentation (docs/project/)
- PROJECT-OVERVIEW.md - Complete project summary and goals
- CLEANUP-SUMMARY.md - Detailed log of all cleanup operations
- claude.md - Notes on AI collaboration methodology
Design Documentation (docs/design/)
- design-options.md - Comparison of 4 design templates
- SPACING-OPTIONS.md - Analysis of spacing variations
- spacing-analysis.md - Detailed spacing measurements
Analysis Reports (docs/reports/)
- broken-links-report.txt - List of broken links identified
- missing-homepage-links.txt - Missing articles analysis

Hosting & Deployment

Firebase Hosting - Static file hosting with CDN
Git - Version control for all changes
GitHub - Repository hosting and code management

Future Improvements

Potential enhancements for this archive:

Search functionality across all articles
Attempt to recover images from alternative sources
Check for additional Wayback Machine snapshots for 2018-2024 content
Create category/topic organization beyond chronological
Export archive data to structured formats (JSON, CSV) for research
OCR scanning of any available print editions to fill gaps

Contact & Contributing

This archive is maintained as a service to the OCU community. If you have:

Original MediaOCU articles or images not in this archive
Information about missing content or dates
Technical improvements or corrections

Please contact the archive maintainer or submit contributions via the project repository.

Acknowledgments: Thank you to the Internet Archive for preserving web history, to Hartator for the wayback-machine-downloader tool, and to Anthropic for Claude AI assistance. Most importantly, thank you to all MediaOCU student journalists whose work is preserved here.