Job Title: Data Engineer with Cloud Data Integration & Transformation
Location: Remote (with some travel to NC; client will pay for travel)
Duration: 12 months
Note:
- Candidates must use their own laptop (minimum 16 GB RAM).
- Some travel to NC will be required; the client will pay for travel expenses.
About the Role:
We are seeking a hands-on Data Engineer to develop and maintain scalable data pipelines and transformation routines within a modern Azure + Databricks environment. The role focuses on ingesting, cleansing, standardizing, matching, merging, and enriching complex legacy datasets into a governed data lakehouse architecture.
The ideal candidate brings deep experience with Spark (PySpark), Delta Lake, Azure Data Factory, and data wrangling techniques — and is comfortable working in a structured, code-managed, team-based delivery environment.
Key Responsibilities:
Pipeline Development & Maintenance:
- Build and maintain reusable data pipelines using Databricks, PySpark, and SQL.
- Implement full and incremental loads from sources including VSAM, Db2 (LUW and z/OS), SQL Server, and flat files.
- Use Delta Lake on ADLS Gen2 to support ACID transactions, scalable upserts/merges, and time travel (see the sketch after this list).
- Leverage Azure Data Factory for orchestration and triggering of Delta Live Tables and Databricks Jobs as part of nightly pipeline execution.
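As an illustration of the incremental upsert/merge responsibility above, a minimal PySpark sketch might look like the following. The table names and the `customer_id` key are hypothetical, and the cluster is assumed to have Delta Lake available (as on Databricks).

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # provided automatically in a Databricks notebook

# Hypothetical incremental batch already landed in the bronze layer.
updates_df = spark.read.table("bronze.customer_increment")

# Upsert into the silver table on a hypothetical business key, relying on
# Delta Lake's ACID MERGE semantics on ADLS Gen2.
target = DeltaTable.forName(spark, "silver.customer")
(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Time travel then remains available against the same table via Delta's `VERSION AS OF` / `TIMESTAMP AS OF` queries.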
Data Cleansing & Transformation:
- Apply cleansing logic for deduplication, parsing, standardization, and enrichment based on business rule definitions.
- Use the Spark-Cobol (Cobrix) library to parse EBCDIC/COBOL-formatted VSAM files into structured DataFrames (see the sketch after this list).
- Maintain the bronze → silver → gold layer structure and ensure data quality during transformations.
- Support classification and mapping logic in collaboration with analysts and architects.
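A minimal sketch of the VSAM parsing step, assuming the Cobrix `spark-cobol` package is attached to the cluster; the copybook path, data path, and target table are placeholders, and real files often need additional options (for example a record-format setting for variable-length records).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # ambient session on Databricks

# Parse an EBCDIC/COBOL-formatted VSAM extract into a DataFrame using the
# Cobrix "cobol" data source (za.co.absa.cobrix:spark-cobol on the cluster).
raw_df = (
    spark.read.format("cobol")
    .option("copybook", "/mnt/landing/copybooks/customer.cpy")  # hypothetical copybook path
    .option("encoding", "ebcdic")
    .load("/mnt/landing/vsam/customer_extract.dat")             # hypothetical data path
)

# Land the parsed records in the bronze layer as Delta for downstream cleansing.
raw_df.write.format("delta").mode("append").saveAsTable("bronze.customer_raw")
```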
Observability, Testing & Validation:
- Integrate robust logging and exception handling to enable observability and pipeline traceability.
- Monitor job performance and cost with Azure Monitor and Log Analytics.
- Support validation and testing using frameworks such as Great Expectations or dbt tests to enforce expectations on nulls, ranges, and referential integrity (see the sketch after this list).
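A rough sketch of the kind of check involved, using plain PySpark and Python logging; in practice Great Expectations or dbt tests would formalize these rules, and the column names and thresholds here are assumptions.

```python
import logging

from pyspark.sql import DataFrame, functions as F

logger = logging.getLogger("silver_customer_quality")

def run_basic_checks(df: DataFrame) -> None:
    """Illustrative null and range checks on hypothetical columns."""
    null_keys = df.filter(F.col("customer_id").isNull()).count()
    bad_years = df.filter(~F.col("birth_year").between(1900, 2025)).count()

    logger.info("quality check: %d null keys, %d out-of-range birth years", null_keys, bad_years)

    if null_keys or bad_years:
        # Fail fast so the orchestrating ADF/Databricks job surfaces the problem
        # instead of silently promoting bad data to the next layer.
        raise ValueError(f"data quality checks failed: {null_keys} null keys, {bad_years} bad years")
```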
Security, DevOps & Deployment:
- Store and manage credentials securely using Azure Key Vault during pipeline execution (see the sketch after this list).
- Maintain pipeline code using Azure DevOps Repos and participate in peer reviews and promotion workflows via Azure DevOps Pipelines.
- Deploy notebooks, configurations, and transformations using CI/CD best practices in repeatable environments.
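For example, run-time secret retrieval combined with a Db2 JDBC ingest might look like the sketch below in a Databricks notebook, where `dbutils` and `spark` are provided by the runtime. The secret scope, key names, host, database, and table are all hypothetical.

```python
# Read Db2 credentials from an Azure Key Vault-backed Databricks secret scope.
# Scope and key names are assumptions for illustration.
db2_user = dbutils.secrets.get(scope="kv-dataplatform", key="db2-ingest-user")
db2_password = dbutils.secrets.get(scope="kv-dataplatform", key="db2-ingest-password")

# Ingest a Db2 table over JDBC using the IBM Data Server driver.
db2_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host.example.com:50000/SAMPLEDB")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "CRM.CUSTOMER")
    .option("user", db2_user)
    .option("password", db2_password)
    .load()
)

db2_df.write.format("delta").mode("overwrite").saveAsTable("bronze.crm_customer")
```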
Collaboration & Profiling:
- Collaborate with architects to ensure alignment with data platform standards and governance models.
- Work with analysts and SMEs to profile data, refine cleansing logic, and conduct variance analysis using Databricks Notebooks and Databricks SQL Warehouse (see the profiling sketch after this list).
- Support metric publication and lineage registration using Microsoft Purview and Unity Catalog, and contribute to profiling datasets for Power BI consumption.
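A lightweight profiling pass of the kind run interactively in a Databricks notebook might look like this; the table name is a placeholder, and real profiling would typically add frequency and variance breakdowns driven by the business rules under review.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # ambient session on Databricks

df = spark.read.table("silver.customer")  # hypothetical table

# Per-column null and distinct counts: a common first pass before refining
# cleansing rules or investigating variances with analysts and SMEs.
profile = df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(f"{c}_nulls") for c in df.columns]
    + [F.countDistinct(c).alias(f"{c}_distinct") for c in df.columns]
)

print(f"rows: {df.count()}")
profile.show(truncate=False)
```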
Required Skills & Experience:
- 5+ years of experience in data engineering or ETL development roles.
Proficiency in:
- Databricks, PySpark, and SQL
- Delta Lake and Azure Data Lake Storage Gen2
- Azure Data Factory for orchestration and event-driven workflows
Experience with:
- Cleansing, deduplication, parsing, and merging of high-volume datasets
- Parsing EBCDIC/COBOL-formatted VSAM files using the Spark-Cobol library
- Connecting to Db2 databases using JDBC drivers for ingestion
Familiarity with:
- Git and Azure DevOps Repos & Pipelines
- Great Expectations or dbt for validation
- Azure Monitor and Log Analytics for job tracking and alerting
- Azure Key Vault for secrets and credentials
- Microsoft Purview and Unity Catalog for metadata and lineage registration