Python Automation Engineer Needed – Web Scraping + Data Extraction + Spreadsheet + Basecamp Calendar Integration (MVP 2–


TYPE OF WORK

Gig

WAGE / SALARY

150-300

HOURS PER WEEK

TBD

DATE UPDATED

May 31, 2026

JOB OVERVIEW

---

# Job Title

**Python Automation Engineer Needed – Web Scraping + Data Extraction + Spreadsheet + Basecamp Calendar Integration (MVP 2–4 Days)**

---

# Project Overview

Looking for a Python developer to build an automated system that:

1. Scrapes structured data from multiple websites (HTML, JS-rendered, and PDF sources)
2. Extracts key information from legal notices / listings
3. Stores the data in a structured spreadsheet
4. Automatically creates scheduled events in Basecamp calendar

This is a **fast MVP build (2–4 days)** focused on working automation, not enterprise-level architecture.

---

# Core Objective

Build an end-to-end automation pipeline that:

* Collects data from multiple web sources
* Converts unstructured legal notice content into structured records
* Writes cleaned data into a spreadsheet (Google Sheets or CSV)
* Pushes relevant entries into Basecamp as calendar events or to-dos
* Runs automatically on a daily schedule

---

# Required Features

## 1. Web Scraping Engine

* Python-based implementation
* Must use **Playwright** for JavaScript-heavy sites
* Support:
  * Login/session handling where required
  * Pagination and infinite scroll
  * Modular per-source scraper design
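
To illustrate what "modular per-source scraper design" could look like, here is a minimal sketch, not a required implementation. The registry, the function names, the URL, and the CSS selector are all placeholders invented for this example. Playwright is imported lazily so the registry itself works without it installed.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a source name to the function that scrapes it.
SCRAPERS: Dict[str, Callable[[], List[str]]] = {}


def register(name: str):
    """Decorator that adds a per-source scraper function to the registry."""
    def wrap(func: Callable[[], List[str]]):
        SCRAPERS[name] = func
        return func
    return wrap


def fetch_pages(start_url: str, next_selector: str, max_pages: int = 20) -> List[str]:
    """Return the rendered HTML of each page, following a 'next' link."""
    # Lazy import: only needed when a JS-rendered source is actually scraped.
    from playwright.sync_api import sync_playwright

    pages: List[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url, wait_until="networkidle")
        for _ in range(max_pages):
            pages.append(page.content())
            next_link = page.locator(next_selector)
            if next_link.count() == 0:
                break  # no further pages
            next_link.first.click()
            page.wait_for_load_state("networkidle")
        browser.close()
    return pages


@register("example_source")
def scrape_example_source() -> List[str]:
    # Placeholder URL and selector, not a real source.
    return fetch_pages("https://example.com/notices", "a.next")


def run_all() -> Dict[str, List[str]]:
    """Run every registered scraper and collect its pages by source name."""
    return {name: scraper() for name, scraper in SCRAPERS.items()}
```

Under this design, adding a new source means adding one decorated function; the rest of the pipeline only calls `run_all()`.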

---

## 2. Data Extraction Layer

From each notice/listing, extract where available:

* Title / notice type
* Property or listing name
* Address or location
* Auction or event date/time
* Case or reference number
* Parties involved (if present)
* Source URL
* Raw text backup
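
As a rough illustration of the target record shape, the sketch below pulls a case number and an event date/time out of notice text with regular expressions. The `NoticeRecord` fields and both patterns are assumptions made for this example; real notices will need per-source tuning.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class NoticeRecord:
    title: str
    event_datetime: Optional[datetime]
    case_number: Optional[str]
    source_url: str
    raw_text: str  # kept as a backup of the original text


# Illustrative patterns only; they cover formats like
# "Case No. 2026-CV-0142" and "06/15/2026 at 10:00 AM".
CASE_RE = re.compile(r"Case\s+No\.?\s*([A-Z0-9-]+)", re.IGNORECASE)
DATE_RE = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{4})\s+at\s+(\d{1,2}:\d{2}\s*[AP]M)", re.IGNORECASE
)


def extract_record(title: str, raw_text: str, source_url: str) -> NoticeRecord:
    """Build a record, leaving fields as None when nothing matches."""
    case_match = CASE_RE.search(raw_text)
    date_match = DATE_RE.search(raw_text)

    event_datetime = None
    if date_match:
        # Normalize "10:00 am" to "10:00AM" before parsing.
        clock = date_match.group(2).upper().replace(" ", "")
        event_datetime = datetime.strptime(
            f"{date_match.group(1)} {clock}", "%m/%d/%Y %I:%M%p"
        )

    return NoticeRecord(
        title=title.strip(),
        event_datetime=event_datetime,
        case_number=case_match.group(1) if case_match else None,
        source_url=source_url,
        raw_text=raw_text,
    )
```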

---

## 3. Data Normalization + Spreadsheet Output

* Convert all scraped data into a unified schema
* Write to:
  * Google Sheets (preferred), or
  * CSV file

The spreadsheet must contain one structured row per record.
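
For the CSV option, a minimal sketch of "one row per record" using only the standard library might look like the following. The column names are assumptions chosen for this example, not a required schema.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical unified schema; adjust to the fields actually extracted.
COLUMNS = ["title", "address", "event_datetime", "case_number", "source_url"]


def append_rows(path: str, rows: Iterable[Dict[str, str]]) -> int:
    """Append records to a CSV file, writing the header only once."""
    file = Path(path)
    is_new = not file.exists() or file.stat().st_size == 0
    count = 0
    with file.open("a", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not in the schema;
        # missing keys are written as empty cells.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

The same row dictionaries could be sent to Google Sheets instead; only the writer would change.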

---

## 4. Basecamp Integration (IMPORTANT)

For each valid record:

* Create a **Basecamp calendar event or to-do item**
* Include:
  * Title (cleaned notice title or property name)
  * Date/time (auction or event date)
  * Description (key extracted details)
  * Link back to source

Must use the **Basecamp API** (authentication via token).
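
A hedged sketch of creating a calendar (schedule) entry follows. It assumes the Basecamp 3-style REST API, where schedule entries are created with a POST to `/buckets/{bucket_id}/schedules/{schedule_id}/entries.json` using a bearer token. The endpoint path, the payload field names, and the User-Agent string should be checked against the current Basecamp API documentation before use. All IDs and function names here are placeholders.

```python
import json
import urllib.request
from datetime import datetime, timedelta
from html import escape


def build_schedule_entry(title: str, starts_at: datetime, details: str,
                         source_url: str, duration_minutes: int = 60) -> dict:
    """Build the JSON payload for one event (field names are assumptions)."""
    ends_at = starts_at + timedelta(minutes=duration_minutes)
    description = (
        f"<div>{escape(details)}<br>"
        f'<a href="{escape(source_url, quote=True)}">Source</a></div>'
    )
    return {
        "summary": title,
        "starts_at": starts_at.isoformat(),
        "ends_at": ends_at.isoformat(),
        "description": description,
    }


def create_schedule_entry(account_id: str, bucket_id: str, schedule_id: str,
                          token: str, entry: dict) -> dict:
    """POST the entry to Basecamp and return the parsed JSON response."""
    # Assumed Basecamp 3-style URL layout; verify against the official docs.
    url = (
        f"https://3.basecampapi.com/{account_id}"
        f"/buckets/{bucket_id}/schedules/{schedule_id}/entries.json"
    )
    request = urllib.request.Request(
        url,
        data=json.dumps(entry).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # Placeholder app identification string.
            "User-Agent": "NoticePipeline (you@example.com)",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping payload construction separate from the HTTP call makes the event contents easy to check without touching a live Basecamp account.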

---

## 5. PDF Parsing Support

* Extract data from PDF notices
* Convert unstructured text into structured rows
* Tools: pdfplumber or PyMuPDF
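
As one possible approach with pdfplumber, the sketch below extracts the text of every page and then splits it into one chunk per notice. The `"NOTICE OF"` marker and the function names are assumptions for illustration; actual PDFs may need a different splitting rule.

```python
from typing import List


def extract_pdf_text(path: str) -> str:
    """Return the text of all pages in a PDF, joined by newlines."""
    # Lazy import so the splitting logic below works without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def split_notices(text: str, marker: str = "NOTICE OF") -> List[str]:
    """Start a new chunk at each line that begins with the marker."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip().upper().startswith(marker) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Each chunk can then go through the same field extraction used for HTML sources, so PDF notices end up as ordinary rows.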

---

## 6. Deduplication

* Prevent duplicate entries across all sources
* Use hash/composite key (address + date + case number or similar)
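
A minimal sketch of the hash/composite-key idea, assuming records are dictionaries with `address`, `event_date`, and `case_number` keys (hypothetical names):

```python
import hashlib
import re
from typing import Dict, List, Optional, Set


def record_key(address: Optional[str], event_date: Optional[str],
               case_number: Optional[str]) -> str:
    """Hash the normalized composite key so formatting differences collapse."""
    parts = [address, event_date, case_number]
    # Lowercase, trim, and collapse whitespace before hashing.
    normalized = "|".join(
        re.sub(r"\s+", " ", (part or "").strip().lower()) for part in parts
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: List[Dict[str, str]],
                seen: Optional[Set[str]] = None) -> List[Dict[str, str]]:
    """Keep the first record for each key; `seen` can carry keys across runs."""
    seen = set() if seen is None else seen
    unique = []
    for record in records:
        key = record_key(
            record.get("address"), record.get("event_date"), record.get("case_number")
        )
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Persisting the `seen` set (for example, as a key column in the spreadsheet) would let daily runs skip records added on earlier days.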

---

## 7. Automation / Scheduling

* System must run daily automatically
* Can use:
  * cron job, or
  * Python scheduler script
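
For the Python scheduler option, a minimal sketch could look like this. The 06:00 run time and the function names are assumptions. The cron equivalent would be a single crontab line such as `0 6 * * * python3 /path/to/run_pipeline.py`, where the script path is a placeholder.

```python
import time
from datetime import datetime, timedelta
from typing import Callable


def seconds_until(hour: int, minute: int, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return (target - now).total_seconds()


def run_daily(job: Callable[[], None], hour: int = 6, minute: int = 0) -> None:
    """Block forever, running `job` once a day at the given local time."""
    while True:
        time.sleep(seconds_until(hour, minute, datetime.now()))
        job()
```

A cron job is usually the more robust choice on a server, since it survives process crashes and reboots without extra supervision.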

---

# Optional Enhancements (if time allows)

* LLM-based extraction to clean legal text into structured fields
* Logging system for failures/retries
* Docker containerization
* Simple admin config file for adding new sources
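
As an example of the logging/retry enhancement, here is a small sketch of a retry helper that logs each failure. The helper name, the logger name, and the linear backoff are assumptions for illustration.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("pipeline")


def with_retries(func: Callable[[], T], attempts: int = 3,
                 delay_seconds: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func`, retrying on any exception and logging each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries; surface the last error
            sleep(delay_seconds * attempt)  # wait longer after each failure
    raise ValueError("attempts must be at least 1")
```

Wrapping each per-source scraper call in a helper like this keeps one flaky site from aborting the whole daily run.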

---

# Tech Stack

Required:

* Python
* Playwright
* BeautifulSoup
* pdfplumber / PyMuPDF
* Pandas

Integration:

* Google Sheets API (or CSV fallback)
* Basecamp API (mandatory)

Optional:

* Cron / scheduling tools

---

# Timeline

This is a **fast MVP build (2–4 days)**:

* Day 1: Core scraping framework + 1–2 sources
* Day 2: Expand sources + spreadsheet output
* Day 3: Basecamp integration + PDF parsing
* Day 4: Testing, cleanup, automation

---

# Key Challenges

* Mixed data formats (HTML, JS, PDFs)
* Some sites may require authentication
* Extracting consistent structured data from legal text
* Reliable Basecamp API event creation
* Avoiding duplicates across multiple sources

---

# Deliverables

* Working Python project
* Modular scraping system
* Spreadsheet integration (Google Sheets or CSV)
* Basecamp automation (calendar/tasks creation)
* PDF parsing module
* Setup instructions

---

# Ideal Candidate

* Strong Python automation experience
* Expert in Playwright scraping
* Experience with APIs (especially Basecamp or similar project tools)
* Comfortable handling messy/unstructured data
* Able to deliver quickly with minimal supervision

---

# Budget Guidance (global contractors)

* MVP range: **$ ---------- total**
* Higher end only if:
  * Basecamp integration is fully working
  * Multiple dynamic sources are stable
  * PDF extraction is reliable

---

# Application Requirements

Applicants must include:

* Relevant scraping + automation experience
* API integration examples (especially task/calendar systems)
* Confirmation of 2–4 day delivery capability
* Brief approach to handling scraping + scheduling pipeline
