Python Data Engineer – Build Data Validation Tool (CSV Comparison + AI Summary)

TYPE OF WORK

Part Time

WAGE / SALARY

3.75/hour

HOURS PER WEEK

10

DATE UPDATED

Apr 28, 2026

JOB OVERVIEW

Description:

I’m looking for a strong Python data engineer to build a simple MVP tool that compares two datasets and identifies data mismatches.

This is NOT a chatbot or marketing automation project.
This is a structured data validation and comparison tool.

What the tool should do:

* Accept 2 datasets (CSV/Excel): Source and Target
* Allow simple field mapping (e.g., source.emp_id → target.worker_id)
* Compare data and identify:
  * Missing records
  * Extra records
  * Duplicate records
  * Field mismatches
  * Null / blank values
* Output a clean error report
* Provide summary metrics (% match, total errors, etc.)
* Generate a short AI summary of findings (using OpenAI API)
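
As a rough illustration of the checks listed above (not a required design), the comparison step could be sketched in Pandas along these lines. The function names `compare_datasets` and `blank_rows`, the `name` / `full_name` columns, and the sample rows are hypothetical and exist only for this example; only the `emp_id` to `worker_id` mapping comes from the list above.

```python
import pandas as pd


def blank_rows(df, cols):
    """Return rows where any of `cols` is null or only whitespace."""
    mask = pd.Series(False, index=df.index)
    for c in cols:
        mask |= df[c].isna() | df[c].astype(str).str.strip().eq("")
    return df[mask]


def compare_datasets(source, target, key_map, field_map):
    """Compare Source and Target using {source_column: target_column} maps."""
    # Rename target columns so both frames use the source's column names.
    rename = {t: s for s, t in {**key_map, **field_map}.items()}
    target = target.rename(columns=rename)
    keys, fields = list(key_map), list(field_map)

    # Duplicate records: keys that occur more than once in either dataset.
    dup_source = source[source.duplicated(subset=keys, keep=False)]
    dup_target = target[target.duplicated(subset=keys, keep=False)]

    # Null / blank values in the key and compared columns.
    blanks = pd.concat(
        [blank_rows(source, keys + fields), blank_rows(target, keys + fields)],
        keys=["source", "target"],
    )

    # Missing / extra records: outer join on the keys, one row per key.
    merged = source.drop_duplicates(subset=keys).merge(
        target.drop_duplicates(subset=keys),
        on=keys, how="outer", suffixes=("_src", "_tgt"), indicator=True,
    )
    missing = merged.loc[merged["_merge"] == "left_only", keys]
    extra = merged.loc[merged["_merge"] == "right_only", keys]
    both = merged[merged["_merge"] == "both"]

    # Field mismatches: compare each mapped field on records present in both.
    rows = []
    for f in fields:
        a, b = both[f + "_src"], both[f + "_tgt"]
        differs = (a != b) & ~(a.isna() & b.isna())
        for _, r in both[differs].iterrows():
            rows.append({**{k: r[k] for k in keys}, "field": f,
                         "source": r[f + "_src"], "target": r[f + "_tgt"]})
    mismatches = pd.DataFrame(rows, columns=keys + ["field", "source", "target"])

    # Summary metrics: a key counts as matched if it is in both datasets
    # and none of its mapped fields differ.
    matched = len(both) - len(mismatches[keys].drop_duplicates())
    summary = {
        "missing": len(missing),
        "extra": len(extra),
        "duplicates": len(dup_source) + len(dup_target),
        "field_mismatches": len(mismatches),
        "blank_rows": len(blanks),
        "match_pct": round(100 * matched / len(merged), 1) if len(merged) else 100.0,
    }
    return {"missing": missing, "extra": extra, "mismatches": mismatches,
            "blanks": blanks, "summary": summary}


# Hypothetical sample data, for illustration only.
source = pd.DataFrame({"emp_id": [1, 2, 3, 3, 4],
                       "name": ["Ann", "Bob", "Cy", "Cy", " "]})
target = pd.DataFrame({"worker_id": [1, 2, 4, 5],
                       "full_name": ["Ann", "Bobby", "Dan", "Eve"]})

report = compare_datasets(source, target,
                          key_map={"emp_id": "worker_id"},
                          field_map={"name": "full_name"})
print(report["summary"])
```

With these sample rows, key 3 is reported as missing from Target, key 5 as extra, and keys 2 and 4 as field mismatches. The `summary` dictionary is the kind of compact input that could then be passed to the OpenAI API for the short written summary.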

Tech Requirements:

* Python (Pandas) – REQUIRED
* Experience with data comparison, ETL, or reconciliation
* Ability to build a simple UI (Streamlit preferred)
* Experience integrating APIs (OpenAI is a plus)

What I am NOT looking for:

* Chatbot developers
* Marketing automation specialists (Zapier-only)
* Prompt engineers without data experience

Deliverable:

A working MVP that:

* Runs end-to-end
* Produces accurate validation results
* Has a simple interface for demo purposes

Timeline:

7–10 days for MVP

Budget:

Open to fixed price based on experience

To Apply:

Please include:

1. How you would approach comparing two datasets with different field names and formats
2. Example of similar work (data validation, ETL, reconciliation)
3. The word “validation” in your response (to confirm you read this)

Bonus:

Experience working with HR, payroll, or enterprise datasets is a plus

---

This is a focused MVP build with potential for ongoing work if this goes well.
