Archive Overview
This archive preserves content from MediaOCU, Oklahoma City University's student newspaper, covering the period from 2010 to 2024. The original WordPress site was compromised in an attack, and this static archive was created using data recovered from the Internet Archive's Wayback Machine.
How This Archive Was Built
1. Data Recovery (Wayback Machine)
All content was recovered from the Internet Archive's Wayback Machine using the open-source tool wayback-machine-downloader by Hartator.
Recovery Process:
- Identified snapshots of mediaocu.com in the Wayback Machine (2010-2024)
- Downloaded HTML, CSS, JavaScript, and image files using wayback-machine-downloader
- Retrieved approximately 1.4GB of data with 10,595 HTML files
- Extracted WordPress database exports when available
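The snapshot-identification step can be illustrated with the Wayback Machine's public CDX API. This is a minimal sketch, not the method used for this archive (the recovery itself relied on wayback-machine-downloader). The function name `build_cdx_query` and the choice of filters are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str, start_year: int, end_year: int) -> str:
    """Return a CDX API URL listing captures of `domain` between two years.

    Hypothetical helper: it only builds the URL and performs no request.
    """
    params = {
        "url": f"{domain}/*",        # every path under the domain
        "from": str(start_year),     # partial timestamps (a bare year) are accepted
        "to": str(end_year),
        "output": "json",            # JSON rows instead of space-separated text
        "filter": "statuscode:200",  # skip redirects and error captures
        "collapse": "urlkey",        # one row per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cdx_query("mediaocu.com", 2010, 2024))
```

Fetching the printed URL returns one row per captured URL, which gives a rough picture of what the Wayback Machine holds before any download is attempted.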
2. Data Cleanup (Automated + AI)
The recovered WordPress site contained significant bloat and artifacts. Cleanup was performed using custom Python scripts and AI assistance (Claude by Anthropic).
Cleanup Tasks:
- Removed WordPress artifacts: 4,175 feed directories, 2,443 query string directories, PHP files
- Stripped tracking code: Google Analytics, SEO plugins, monitoring scripts (saved 72.6MB)
- Fixed broken links: Converted 18,680 absolute URLs to relative paths
- Cleaned filenames: Removed Wayback Machine query strings from 34 asset files
- Rebuilt navigation: Created year archive pages with proper article listings
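The link-fixing task above can be sketched as follows. This is an illustrative example only, not the project's actual fix-article-assets.py. The host list, the function names `to_relative` and `rewrite_html`, and the regular expression are all assumptions.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Hosts treated as "this site" (assumed; adjust to the recovered domain names).
SITE_HOSTS = {"mediaocu.com", "www.mediaocu.com"}

# Matches href="..." and src="..." attributes holding an absolute http(s) URL.
ATTR_RE = re.compile(
    r"""(?P<attr>\b(?:href|src)=)(?P<q>["'])(?P<url>https?://[^"']+)(?P=q)""",
    re.IGNORECASE,
)


def to_relative(url: str, page_path: str) -> str:
    """Rewrite an absolute same-site URL relative to the page that links to it.

    `page_path` is the linking page's path from the site root,
    for example "/2012/index.html".
    """
    parts = urlsplit(url)
    if parts.netloc.lower() not in SITE_HOSTS:
        return url  # external or already-relative link: leave untouched
    target = parts.path or "/"
    page_dir = posixpath.dirname(page_path) or "/"
    rel = posixpath.relpath(target, start=page_dir)
    if target.endswith("/") and not rel.endswith("/"):
        rel += "/"  # relpath drops the trailing slash of directory URLs
    if parts.fragment:
        rel += "#" + parts.fragment
    return rel  # query strings are dropped: a static site cannot serve them


def rewrite_html(html: str, page_path: str) -> str:
    """Apply to_relative() to every absolute href/src attribute in `html`."""
    def fix(match: re.Match) -> str:
        new_url = to_relative(match["url"], page_path)
        return f"{match['attr']}{match['q']}{new_url}{match['q']}"
    return ATTR_RE.sub(fix, html)
```

A regular expression is enough for a sketch; a production script would more likely use an HTML parser so that unusual attribute quoting is handled correctly.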
3. Redesign (AI-Assisted)
The original WordPress theme was stripped away and replaced with a clean, minimal design created with assistance from Claude (AI). The new design emphasizes:
- Newspaper typography (serif fonts, traditional layout)
- OCU blue branding (#0066cc)
- Timeline-style article presentation
- Maximum readability and content density
- Minimal whitespace and distraction-free reading
Data Coverage by Year
The archive contains varying levels of completeness for each year, depending on what was captured by the Wayback Machine and how the content was stored in WordPress.
| Year | Full Articles | Headers Only | Total Pages | Status |
|---|---|---|---|---|
| 2010 | 64 | 139 | 203 | ✓ Good |
| 2011 | 219 | 459 | 678 | ✓ Excellent |
| 2012 | 244 | 218 | 462 | ✓ Excellent |
| 2013 | 175 | 259 | 434 | ✓ Excellent |
| 2014 | 29 | 97 | 126 | ◐ Partial |
| 2015 | 55 | 136 | 191 | ✓ Good |
| 2016 | 206 | 132 | 338 | ✓ Excellent |
| 2017 | 51 | 104 | 155 | ✓ Good |
| 2018 | 0 | 128 | 128 | ✗ Headers Only |
| 2019 | 0 | 28 | 28 | ✗ Headers Only |
| 2020 | 0 | 102 | 102 | ✗ Headers Only |
| 2021 | 0 | 154 | 154 | ✗ Headers Only |
| 2022 | 0 | 46 | 46 | ✗ Headers Only |
| 2023 | 0 | 10 | 10 | ✗ Headers Only |
| 2024 | 0 | 11 | 11 | ✗ Headers Only |
Status Definitions
- ✓ Excellent: 100+ full articles with complete content
- ✓ Good: 20-99 full articles with complete content
- ◐ Partial: Some full articles, but limited coverage
- ✗ Headers Only: Only article titles and metadata, no body content
Known Limitations & Missing Data
What We Know Is Missing
- Images: Zero images recovered. The Wayback Machine may not have captured them, or they were stored on a separate CDN/server
- 2018-2024 article content: Only headers/titles captured, likely due to WordPress changes or less frequent Wayback crawling
- Comments: User comments on articles were not preserved
- Dynamic content: Interactive elements, forms, and JavaScript-dependent features lost
- Multimedia: Videos, audio files, and embedded content not preserved
- Author profiles: Individual author pages and bios incomplete or missing
What Might Be Missing (Unknown Unknowns)
- Wayback Machine gaps: Articles published between Wayback crawls may never have been captured
- Deleted/unpublished content: Draft articles, scheduled posts, or content removed before archiving
- Database-only content: Data stored in WordPress database but not rendered as HTML pages
- Premium/paywalled content: If any content required login, it wouldn't have been archived
- Category/tag relationships: Full taxonomy relationships may be incomplete
- Custom post types: Special content types beyond standard articles (portfolios, galleries, etc.)
Percentage Analysis
Based on article file analysis:
- 34% of article pages contain full content (1,043 / 3,066)
- 66% of article pages are headers only (2,023 / 3,066)
- Years 2010-2017: 40% full content recovery rate (1,043 / 2,587)
- Years 2018-2024: 0% full content recovery rate
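These rates can be recomputed directly from the coverage table. The sketch below copies the per-year counts from that table; the names `COVERAGE` and `recovery_rate` are illustrative and do not come from the project's analysis scripts.

```python
# (full articles, headers only) per year, copied from the coverage table.
COVERAGE = {
    2010: (64, 139), 2011: (219, 459), 2012: (244, 218), 2013: (175, 259),
    2014: (29, 97), 2015: (55, 136), 2016: (206, 132), 2017: (51, 104),
    2018: (0, 128), 2019: (0, 28), 2020: (0, 102), 2021: (0, 154),
    2022: (0, 46), 2023: (0, 10), 2024: (0, 11),
}


def recovery_rate(first_year: int, last_year: int) -> tuple[int, int, float]:
    """Return (full articles, total pages, percent full) for an inclusive year range."""
    rows = [COVERAGE[year] for year in range(first_year, last_year + 1)]
    full = sum(f for f, _ in rows)
    total = sum(f + h for f, h in rows)
    return full, total, round(100 * full / total, 1)


if __name__ == "__main__":
    for span in [(2010, 2024), (2010, 2017), (2018, 2024)]:
        full, total, pct = recovery_rate(*span)
        print(f"{span[0]}-{span[1]}: {full} / {total} = {pct}%")
```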
Tools & Technology Used
Data Recovery
- wayback-machine-downloader by Hartator - Ruby gem for downloading websites from Wayback Machine
- Internet Archive Wayback Machine - Source of all archived content
Processing & Cleanup (Python Scripts)
The project includes 16 Python scripts organized in the maintenance/ directory:
- Analysis Scripts (maintenance/analysis/)
  - analyze-archive-data.py - Generates statistics about articles per year
  - analyze-homepage-links.py - Identifies broken links
- Cleanup Scripts (maintenance/cleanup/)
  - cleanup-html.py - Removed WordPress tracking/SEO metadata from 4,966 HTML files
  - cleanup-homepage.py - Cleaned homepage HTML and removed duplicates
- Fix Scripts (maintenance/fixes/)
  - fix-article-assets.py - Converted 18,680 absolute URLs to relative paths
  - fix-homepage-links.py - Repaired broken article links
  - fix-css-filenames.sh - Cleaned 34 CSS/JS files with query string artifacts
- Generation Scripts (maintenance/generation/)
  - rebuild-year-archives-clean.py - Rebuilt all 15 year pages with clean HTML
  - generate-archive-homepage-v2.py - Generated new homepage
  - generate-year-archives.py - Created year archive index pages
  - generate-plain-site.py - Generated plain-text version for analysis
  - update-year-footers.py - Added navigation links to all pages
- Styling Scripts (maintenance/styling/)
  - apply-spacing.py - Switches between spacing options (minimal/balanced/tight)
  - apply-design.py - Applies design templates to pages
  - 4 design options + 3 spacing variations (CSS files)
All scripts are documented in maintenance/README.md and in the README of each subfolder.
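To give a sense of what the filename fix involves, here is a small Python sketch of stripping query-string residue from saved asset names. It is not the project's fix-css-filenames.sh (a shell script). The helper names, and the assumption that residue shows up as a literal `?` or an encoded `%3F` in the file name, are hypothetical.

```python
from pathlib import Path


def clean_asset_name(name: str) -> str:
    """Drop a trailing query string such as '?ver=4.9.8' from a saved file name."""
    for separator in ("?", "%3F"):  # literal and percent-encoded question mark (assumed forms)
        head, found, _ = name.partition(separator)
        if found:
            return head
    return name


def rename_assets(root: Path) -> list[tuple[Path, Path]]:
    """Rename every file under `root` whose name carries query-string residue.

    Returns the (old, new) pairs that were renamed. A file is skipped when the
    cleaned name already exists, so nothing is overwritten.
    """
    renamed = []
    for path in sorted(root.rglob("*")):  # sorted() snapshots the tree before renaming
        if not path.is_file():
            continue
        target = path.with_name(clean_asset_name(path.name))
        if target != path and not target.exists():
            path.rename(target)
            renamed.append((path, target))
    return renamed
```

Any HTML that references the old names would need the same cleanup applied to its `href` and `src` attributes, otherwise the renamed files become unreachable.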
AI Assistance (Claude by Anthropic)
Claude, an AI assistant by Anthropic, was used extensively throughout this project:
- Code Generation: Created all Python cleanup and generation scripts
- Design: Developed the clean, newspaper-inspired CSS design
- Problem Solving: Diagnosed and fixed broken links, layout issues, and WordPress artifacts
- Data Analysis: Analyzed archive completeness and identified gaps
- Documentation: Created this overview page and project documentation
Documentation
Comprehensive documentation is organized in the docs/ directory:
- Project Documentation (docs/project/)
  - PROJECT-OVERVIEW.md - Complete project summary and goals
  - CLEANUP-SUMMARY.md - Detailed log of all cleanup operations
  - claude.md - Notes on AI collaboration methodology
- Design Documentation (docs/design/)
  - design-options.md - Comparison of 4 design templates
  - SPACING-OPTIONS.md - Analysis of spacing variations
  - spacing-analysis.md - Detailed spacing measurements
- Analysis Reports (docs/reports/)
  - broken-links-report.txt - List of broken links identified
  - missing-homepage-links.txt - Missing articles analysis
Hosting & Deployment
- Firebase Hosting - Static file hosting with CDN
- Git - Version control for all changes
- GitHub - Repository hosting and code management
Future Improvements
Potential enhancements for this archive:
- Add search functionality across all articles
- Attempt to recover images from alternative sources
- Check for additional Wayback Machine snapshots of 2018-2024 content
- Organize articles by category or topic, not only chronologically
- Export archive data to structured formats (JSON, CSV) for research
- Scan any available print editions with OCR to fill gaps
Contact & Contributing
This archive is maintained as a service to the OCU community. Contributions are welcome if you have:
- Original MediaOCU articles or images not in this archive
- Information about missing content or dates
- Technical improvements or corrections
To contribute, contact the archive maintainer or submit changes via the project repository.