TRAINING STREAMSETS DATA COLLECTOR (SDC) ~ Purnama Academy

Purnama Academy 0838-0838-0001, Training Syllabus : STREAMSETS DATA COLLECTOR (SDC)

Level of Knowledge : StreamSets Fundamentals (Basic to Intermediate)

Duration : 5 Days (09.00 – 16.00)

Class Method : Offline Class Only

Prerequisites :

- Ubuntu OS & LDAP Fundamental Knowledge

- Java / Python Basic Knowledge

- Vagrant / Docker Basic Knowledge

DESCRIPTIONS :

What is StreamSets?

StreamSets is a system for creating, executing, and operating continuous dataflows that connect the various parts of your data infrastructure. It comprises two complementary products: StreamSets Data Collector (SDC) and StreamSets Dataflow Performance Manager (DPM).

StreamSets Data Collector (SDC)

SDC is the workhorse of the system. It implements your data plane, that is, the actual physical movement of data from one place to another. It provides a pipeline authoring environment for building any-to-any data movement pipelines, either through a drag-and-drop graphical interface or programmatically using Python or Java. Pipelines can work with minimal or no schema/structure specification, and they can filter, decorate, or transform data as it flows through.

These pipelines can run in standalone mode, cluster streaming mode, or cluster batch mode. The SDC instance that runs them can be installed on free-standing dedicated nodes or on edge, gateway, or cluster nodes. SDC only needs direct access to the data sources and destinations it operates on, and sufficient resources to run the dataflow.
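As a rough illustration of the filter, decorate, and transform flow described above, here is a minimal sketch in plain Python. It does not use the SDC API; the stage functions (`origin`, `drop_unwanted`, `decorate`, `transform`), the field names, and the sample rows are all hypothetical and exist only to show records moving from an origin through processors to a destination without a fixed schema.

```python
from typing import Any, Dict, Iterable, Iterator, List

# A record is just a dictionary: no schema is declared up front.
Record = Dict[str, Any]


def origin(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical origin: emits a copy of each input row as a record."""
    for row in rows:
        yield dict(row)


def drop_unwanted(records: Iterable[Record], required_field: str) -> Iterator[Record]:
    """Filter: pass on only the records that contain the required field."""
    for record in records:
        if required_field in record:
            yield record


def decorate(records: Iterable[Record], source: str) -> Iterator[Record]:
    """Decorate: add a field describing where the record came from."""
    for record in records:
        record["source"] = source
        yield record


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Transform: upper-case the 'name' field when it is present."""
    for record in records:
        if "name" in record:
            record["name"] = str(record["name"]).upper()
        yield record


def run_pipeline(rows: Iterable[Record]) -> List[Record]:
    """Chain the stages and collect the output in a list acting as the destination."""
    destination: List[Record] = []
    stream = transform(decorate(drop_unwanted(origin(rows), "id"), "demo"))
    for record in stream:
        destination.append(record)
    return destination


rows = [
    {"id": 1, "name": "alice"},
    {"name": "no id"},           # dropped by the filter: no "id" field
    {"id": 2, "extra": True},    # different shape, still processed
]
result = run_pipeline(rows)
```

The stages are generators, so each record flows through the whole chain one at a time. This is only an analogy for continuous data movement, not a description of how SDC is implemented.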

OVERVIEW

INSTALLATION

Installation

Full Installation and Launch (Manual Start)

Full Installation and Launch (Service Start)

Core Installation

Install Additional Stage Libraries

Run Data Collector from Docker

Installation with Cloudera Manager

MapR Prerequisites

Creating Another Data Collector Instance

Uninstallation

CONFIGURATION

User Authentication

Roles and Permissions

Data Collector Configuration

Data Collector Environment Configuration

Install External Libraries

Custom Stage Libraries

Accessing Hashicorp Vault Secrets

Enabling External JMX Tools

PIPELINE CONCEPTS AND DESIGN

What is a Pipeline?

Data in Motion

Single and Multithreaded Pipelines

Delivery Guarantee

Designing the Data Flow

Branching Streams

Merging Streams

Dropping Unwanted Records

Required Fields

Preconditions

Error Record Handling

Pipeline Error Record Handling

Stage Error Record Handling

Example

Record Header Attributes

Working with Header Attributes

Viewing Attributes in Data Preview

Header Attribute-Generating Stages

Record Header Attributes for Record-Based Writes

Field Attributes

Field Attribute-Generating Stages

Processing Changed Data

CRUD Operation Header Attribute

CDC-Enabled Origins

CRUD-Enabled Stages

Processing the Record

Use Cases

Delimited Data Root Field Type

Protobuf Data Format Prerequisites

SDC Record Data Format

Text Data Format with Custom Delimiters

Processing XML Data with Custom Delimiters

Whole File Data Format

Basic Pipeline

Whole File Records

Additional Processors

Defining the Transfer Rate

Writing Whole Files

XML Data Format and Data Processing

Creating Multiple Records with an XML Element

Creating Multiple Records with an XPath Expression

Including Field XPaths and Namespaces

XML Attributes and Namespace Declarations

Parsed XML

Control Character Removal

Development Stages

PIPELINE CONFIGURATION

Data Collector Console - Edit Mode

Retrying the Pipeline

Pipeline Memory

Rate Limit

Runtime Values

Using Runtime Parameters

Using Runtime Properties

Using Runtime Resources

Webhooks

Request Method

Payload and Parameters

Examples

Notifications

SSL/TLS Configuration

Keystore and Truststore Configuration

Transport Protocols

Cipher Suites

Implicit and Explicit Validation

Expression Configuration

Basic Syntax

Using Field Names in Expressions

Referencing Field Names and Field Paths

Expression Completion in Properties

Data Type Coercion

Configuring a Pipeline

ORIGINS

Elasticsearch

Hadoop FS

HTTP Client

HTTP Server

HTTP to Kafka

JDBC Multitable Consumer

JDBC Query Consumer

Kafka Consumer

MySQL Binary Log

SFTP/FTP Client

UDP Source

UDP to Kafka

WebSocket Server

PROCESSORS

Processors

Base64 Field Decoder

Base64 Field Encoder

Expression Evaluator

Field Flattener

Field Hasher

Field Masker

Field Merger

Field Order

Field Pivoter

Field Remover

Field Renamer

Field Splitter

Field Type Converter

Field Zip

Geo IP

Groovy Evaluator

HBase Lookup

Hive Metadata

HTTP Client

JavaScript Evaluator

JDBC Lookup

JDBC Tee

JSON Parser

Jython Evaluator

Log Parser

Record Deduplicator

Spark Evaluator

Static Lookup

Stream Selector

Value Replacer

XML Flattener

XML Parser

DESTINATIONS

Elasticsearch

Hadoop FS

HBase

Hive Metastore

Hive Streaming

HTTP Client

Kafka Producer

MapR DB

WebSocket Client

EXECUTORS

Executors

HDFS File Metadata Executor

Hive Query Executor

JDBC Query Executor

MapReduce Executor

Pipeline Finisher Executor

Shell Executor

Spark Executor

DATAFLOW TRIGGERS (A.K.A. EVENT FRAMEWORK)

Dataflow Triggers Overview

Event Streams

Event Records

Case Study: Parquet Conversion

Case Study: Impala Metadata Updates for DDS for Hive

Case Study: Output File Management

Case Study: Stop the Pipeline

Event Records in Data Preview, Monitor, and Snapshot

Summary

MULTITHREADED PIPELINES

Multithreaded Pipeline Overview

How It Works

Monitoring

Tuning Threads and Runners

Resource Usage

Multithreaded Pipeline Summary

SDC RPC PIPELINES

SDC RPC Pipeline Overview

Deployment Architecture

Configuring the Delivery Guarantee

Defining the RPC ID

Enabling Encryption

Configuration Guidelines for SDC RPC Pipelines

CLUSTER PIPELINES

Cluster Pipeline Overview

Kafka Cluster Requirements

MapR Requirements

HDFS Requirements

Stage Limitations

DATA PREVIEW

Data Preview Overview

Data Collector Console - Preview Mode

Previewing a Single Stage

Previewing Multiple Stages

Editing Preview Data

Editing Properties

RULES AND ALERTS

Rules and Alerts Overview

Metric Rules and Alerts

Data Rules and Alerts

Data Drift Rules and Alerts

Alert Webhooks

Configuring Email for Alerts

PIPELINE MONITORING

Pipeline Monitoring Overview

Data Collector Console - Monitor Mode

Viewing Pipeline and Stage Statistics

Monitoring Errors

Snapshots

Viewing the Run History

PIPELINE MAINTENANCE

Data Collector Console - All Pipelines on the Home Page

Understanding Pipeline States

Starting Pipelines

Stopping Pipelines

Importing Pipelines

Sharing Pipelines

Adding Labels to Pipelines

Exporting Pipelines

Duplicating a Pipeline

Deleting Pipelines

Next Training Topic Recommendation : STREAMSETS DPM
