Project Summary:
Seeking a skilled data specialist or programmer experienced in text parsing and data extraction to process a large text file containing compiled VS Battles Wiki character profiles. The goal is to accurately extract key data points (Character Name, Combat Tier, and Origin/Franchise) for each character and deliver the results in a clean, structured Excel spreadsheet.

Project Description:
I have compiled the raw text content from numerous individual character profiles on the VS Battles Wiki into a single, large text file. This file is the primary source for this data extraction project. The final deliverable is a single Microsoft Excel file (.xlsx) that contains one row for each reliably identifiable unique character in the text dump.

The output file should include the following columns:
* Character Name: The primary name for the character. The script needs logic to identify and select the most appropriate name, potentially handling common aliases or variations if present within a profile block.
* Tier: The character's combat or power Tier. This information appears in the profile text in various formats (e.g., "High 6-C", "Low 2-C | 1-C", "Varies from...", "Unknown"). The extraction needs to be robust enough to capture these variations accurately and to handle cases where the Tier is missing or explicitly marked as "Unknown".
* Origin: The franchise, series, or source material the character belongs to. This information also appears in the profile text in different formats (e.g., "Origin: [[Franchise Name]]", "Origin: Franchise Name", "Origin:" followed by text on the next line). The extraction should identify the specific franchise name and handle cases where the origin is missing, unclear, or listed generically (e.g., "Characters", "Video Game", "Female"). Prioritize specific franchise names over generic terms or "Unknown".
* (Optional but preferred) URL: If the URL of the character's profile page can be reliably extracted or constructed from the data within the text dump, include it in a separate column.

Input Files I Will Provide:
* Primary Source: A single, large text file (.txt) containing the combined raw text content of all character profiles from the VS Battles Wiki. This file is comprehensive and contains the data that needs to be parsed.
* Supplementary Files (For Reference): I also have two Excel files (.xlsx) that are results of previous partial extraction attempts focusing on different ways Tier and Origin information can be formatted in the profiles. These files can serve as helpful examples of the data variations you will encounter and demonstrate the kind of specific origin/tier values I am looking for. They are supplementary and not the primary source for extraction.

Key Requirements & Expectations:
* Develop and use a script (likely in Python with libraries like re for regex parsing, pandas for data handling) to read and parse the large text file.
* Implement robust parsing logic to extract Character Name, Tier, and Origin based on the diverse formats within the text.
* Apply logic to consolidate data for the same character if they appear multiple times or with slight name variations in the text dump (grouping similar names if necessary).
* Handle missing data or generic origins/tiers appropriately (e.g., mark as "Unknown").
* Ensure all identifiable characters from the text dump are included in the output (aiming for a number potentially over 31,000 unique characters).
* Output a clean, well-organized Excel (.xlsx) file with the specified columns.
* (Optional but preferred) Provide the source code of the extraction script used.
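
To illustrate the kind of parsing and consolidation logic described above, here is a minimal sketch in Python using only the standard re module. It is not a finished solution: the function names, the GENERIC_ORIGINS set, and the assumption that each profile block contains lines beginning with "Tier:" and "Origin:" are hypothetical, and the real text dump may be laid out differently.

```python
import re

# Assumption: generic values that should not count as a franchise
# (the examples come from the posting; a real list would be longer).
GENERIC_ORIGINS = {"", "characters", "video game", "female", "unknown"}

# Assumption: each field sits on its own line, e.g. "Tier: High 6-C".
TIER_RE = re.compile(r"^[ \t]*Tier:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)
ORIGIN_RE = re.compile(r"^[ \t]*Origin:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def clean_wiki_markup(value):
    """Turn [[Target|Label]] into Label, [[Target]] into Target, drop bold quotes."""
    value = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", value)
    return value.replace("'''", "").strip()


def extract_tier(block):
    """Return the Tier text of one profile block, or "Unknown" if absent or empty."""
    match = TIER_RE.search(block)
    if not match:
        return "Unknown"
    return clean_wiki_markup(match.group(1)) or "Unknown"


def extract_origin(block):
    """Return the franchise name, falling back to the next line, else "Unknown"."""
    match = ORIGIN_RE.search(block)
    if not match:
        return "Unknown"
    value = clean_wiki_markup(match.group(1))
    if not value:
        # Handles "Origin:" followed by the franchise on the next line.
        rest = block[match.end():].lstrip("\r\n")
        value = clean_wiki_markup(rest.splitlines()[0]) if rest.strip() else ""
    return "Unknown" if value.lower() in GENERIC_ORIGINS else value


def name_key(name):
    """Normalize a name for grouping: lowercase, punctuation and extra spaces removed."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def consolidate(records):
    """Merge records whose names normalize to the same key, preferring known values."""
    merged = {}
    for rec in records:
        current = merged.setdefault(name_key(rec["name"]), dict(rec))
        for field in ("tier", "origin"):
            if current[field] == "Unknown" and rec[field] != "Unknown":
                current[field] = rec[field]
    return list(merged.values())
```

Grouping on the normalized name alone would also merge different characters who share a name across franchises, so a real script would probably need to include the origin in the grouping key or review such collisions by hand. The consolidated records could then be loaded into a pandas DataFrame and written out with DataFrame.to_excel.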

Skills Preferred:
* Data Extraction
* Text Parsing / Data Parsing
* Python
* Regular Expressions (Regex)
* Pandas (or similar data handling library)
* Excel

I will share the input text file and the supplementary files privately with freelancers who send promising proposals or whom I invite to interview.
Keyword: Content Developer
Price: $100.0
Data Extraction Python Microsoft Excel