Starting with Kafka – Distributed Messaging System

Apache Kafka is an open source, distributed publish-subscribe messaging system that enables you to build real-time applications using streaming data. It provides a solution for handling activity stream data, and it also supports the Hadoop platform with a mechanism for parallel loads into Hadoop.

Apache Kafka Overview


Kafka maintains feeds of messages in categories called topics. We'll call processes that publish messages to a Kafka topic producers, and processes that subscribe to topics and process the feed of published messages consumers. Kafka runs as a cluster comprised of one or more servers, each of which is called a broker. We can also run a Kafka deployment on Amazon EC2, which provides a high-performance, scalable solution for ingesting streaming data. To deploy Kafka on Amazon EC2, you need to select and provision your EC2 instance types, install and configure the software components, including Kafka and Apache ZooKeeper, and then provision the block storage required to accommodate your streaming data throughput using Amazon Elastic Block Store (EBS).
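
To make the producer/topic/broker vocabulary concrete, here is a minimal producer sketch using the Kafka Java client. The broker address ("localhost:9092") and the topic name ("page-views") are placeholder assumptions, not values from this article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "page-views" is a hypothetical topic; the key ("user-42") decides the partition
            producer.send(new ProducerRecord<>("page-views", "user-42", "clicked home page"));
        }
    }
}
```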

 

Key Concepts –

 

  • A topic is divided into partitions.
  • Message order is only guaranteed inside a partition.
  • Consumer offsets are persisted by Kafka with a commit/auto-commit mechanism.
  • Consumers subscribe to topics.
  • Consumers with different group-ids each receive all messages of the topics they subscribe to, and they consume the messages at their own speed.
  • Consumers sharing the same group-id are each assigned one (or several) partitions of the topics they subscribe to, and they only receive messages from their own partitions. So a constraint appears here: the number of partitions in a topic gives the maximum number of parallel consumers.
  • The assignment of partitions to consumers can be automatic, performed by Kafka (through ZooKeeper). If a consumer stops polling or is too slow, a process called "rebalancing" is performed and its partitions are reassigned to other consumers (see the consumer sketch after this list).
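
As a rough illustration of the group-id, subscription, and offset auto-commit bullets above, here is a minimal consumer sketch with the Kafka Java client. It assumes a reasonably recent client (2.0 or later) for poll(Duration); the broker address, group id, and topic name are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "page-view-processors");    // consumers sharing this id split the partitions
        props.put("enable.auto.commit", "true");          // offsets committed back to Kafka automatically
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views")); // hypothetical topic
            while (true) {
                // Each poll returns records only from the partitions assigned to this consumer
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

Starting a second copy of this class with the same group-id triggers exactly the rebalancing described above: the topic's partitions are split between the two instances.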

 

 

Kafka normally divides a topic into multiple partitions. Each partition is an ordered, immutable sequence of messages that is continually appended to.


Source: http://kafka.apache.org/documentation.html

 

  • A message in a partition is identified by a sequence number called the offset.
  • FIFO ordering is only guaranteed inside a partition.
  • When a topic is created, the number of partitions must be given, for instance as in the sketch shown after this list.
  • The producer can choose which partition gets the message, or let Kafka decide based on a hash of the message key (recommended). The message key is therefore important and will be used to ensure message order.
  • Moreover, as each consumer is assigned one or several partitions, the key also "groups" messages onto the same consumer.
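
Here is one way the partition count can be given at creation time, a minimal sketch using the Kafka Java AdminClient (available since Kafka 0.11; older installations typically use the kafka-topics.sh script instead). The broker address, topic name, and counts are placeholder assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1: at most 3 parallel consumers per group
            NewTopic topic = new NewTopic("page-views", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

With three partitions, at most three consumers in the same group can read the topic in parallel, which is exactly the constraint described in the key concepts above.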

Advantages –

  • With Kafka we can easily handle hundreds of thousands of messages per second, which makes Kafka a high-throughput messaging system.
  • Kafka's cluster can be expanded with no downtime, making Kafka highly scalable.
  • Messages are replicated, which provides reliability and durability.
  • It is fault tolerant.
  • Kafka also supports micro-batching.

Kafka Components

  • A producer is a process that can publish messages to a topic.
  • A consumer is a process that can subscribe to one or more topics and consume the messages published to them.
  • A topic is the name of the feed category to which messages are published.
  • A broker is a Kafka server process running on a single machine.
  • A cluster is a group of brokers working together.
  • Broker management is done by ZooKeeper.


 

Kafka Cluster Management – ZooKeeper

The entire cluster management of Kafka is done by ZooKeeper. Hey wait, why the name ZooKeeper? Because Hadoop is the elephant, and then there are Hive and Pig, so it's the zookeeper who manages all these animals 🙂. ZooKeeper is an open source Apache project. It provides a centralized infrastructure and services that enable synchronization across a cluster, and common objects used across large cluster environments are maintained in ZooKeeper. It is not limited to Kafka; Storm and Spark also use it. Its services are used by large-scale applications to coordinate distributed processing across large clusters.

ZooKeeper handles leader election and resource management in Kafka. It also tracks when a node (broker) goes down or comes up, i.e. configuration management, and it keeps the state of the last message processed. ZooKeeper is used for managing and coordinating Kafka brokers: its service is mainly used to notify producers and consumers about the presence of any new broker in the Kafka system, or about the failure of a broker. Based on the notification received from ZooKeeper regarding the presence or failure of a broker, producers and consumers take a decision and start coordinating their work with some other broker.

 

Cluster Management by ZooKeeper


The above picture says it all. Producers connect to a Kafka broker, and consumers connect to a broker as well. New brokers can be added, which makes the system more scalable. ZooKeeper does configuration management with the brokers.

Apache Kafka also fits well with Oracle products: Oracle SOA, Oracle API Management, and other Fusion Middleware products. Soon I will publish an article on how we can connect Kafka with Oracle FMW. Stay tuned.

 

Happy distributed messaging in Kafka with Vinay 🙂

 

 

Asset developments in WebCenter Portal 12c

Hi All,

 

In WebCenter Portal 12c, asset development uses a separate project for creating and deploying assets. Create a new WebCenter Portal Asset application.

 

Click Next.

 

Select the default values and click Next. In the Portal Assets step, select the asset type; for example, to create a template, select Template and click Finish.

 

 

Now you can see that the project is created and its structure is as follows:

 

 

We have PortalTemplate.jspx to serve as the template entry. Now the problem starts: if we want to create a skin or any other asset, there is no option for adding a skin here. Besides a page template, we would also like to define a skin for our portal. However, a WebCenter Portal Asset is an application, so we have to create a totally new application, just as we did with the page template.

 

The application name doesn't really matter here. Just create a Portal Asset application and select Skin in the last step.

 

Provide the name of the skin.

 

 

Choose the default values; in the Portal Asset step choose Skin, and in Directory add one more directory level, so that we can later copy the whole folder.

 

 

Click Finish. The project structure will be as below:

 

Now we have two applications, but we can merge them by copying the project structure of one application into the other. So close JDeveloper and go to the file system. Go to the file system location of the skin project and copy the PortalSkin folder into the PortalAssetBlog application folder.

 


 

Now open JDeveloper, right-click the application, choose Open Project, and go to the PortalSkin folder in the PortalAsset application.

 

 

Select PortalSkin.jpr and open it.

 

 

Now the project structure is as follows:

 

 

You can deploy these assets separately. If you want to deploy the template, select PortalAsset and deploy it. If you want to deploy the skin, select PortalSkin and deploy it to an AAR file or to WebCenter Portal.

 

That's all. Happy portal asset development with Vinay.

 

Learn More …

There are many more points like this. If you are eager to learn WebCenter Portal 12c, the following book is for you:
Beginning Oracle WebCenter Portal 12c. More details about this book can be found in this post: https://www.techartifact.com/blogs/2016/11/pre-oreder-your-copy-of-beginning-oracle-webcenter-portal-12c-today.html


Shards and Replicas in Elasticsearch

Shards in Elasticsearch – When we have a large number of documents, we may come to a point where a single node is not enough, for example because of RAM limitations, hard disk capacity, insufficient processing power, or the inability to respond to client requests fast enough. In such a case, data can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the results in such a way that your application doesn't know about the shards. In addition to this, having multiple shards can speed up indexing.

Clustering allows us to store data volumes that exceed the capacity of a single server. To achieve this, Elasticsearch spreads data across several physical Lucene indices. Those Lucene indices are called shards, and the process of spreading them is called sharding. Elasticsearch can do this automatically, and all parts of the index (the shards) are visible to the user as one big index. Note that besides this automation, it is crucial to tune the mechanism for the particular use case, because the number of shards is configured during index creation and cannot be changed later, at least currently.

So if you have an index with 100 documents and a cluster with 2 nodes, each node will hold 50 documents if the shard number is 2 (ignoring replicas, of course).
That's a little of the "infinite scaling magic": each machine in your cluster only has to deal with a piece of your data.
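
To make this concrete, here is a minimal sketch that creates an index with an explicit shard count through Elasticsearch's REST API, using only the JDK's HttpURLConnection so no client library is needed. The host, index name, and the shard/replica numbers are placeholder assumptions:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder host and index name; adjust for your own cluster
        URL url = new URL("http://localhost:9200/page-views");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Primary shard count is fixed at creation time; the replica count can change later
        String settings = "{ \"settings\": {"
                + " \"number_of_shards\": 2,"
                + " \"number_of_replicas\": 1 } }";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(settings.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```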

 

Replica

In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of a shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards, and one of them is automatically chosen as the place where operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote a replica to be the new primary shard.

Sharding allows us to push more data into Elasticsearch than is possible for a single node to handle. Replicas can help where the load increases and a single node is not able to handle all the requests. The idea is simple: create an additional copy of a shard, which can be used for queries just like the original primary shard. Note that we get safety for free: if the server holding a shard is gone, Elasticsearch can use a replica and no data is lost. Replicas can be added and removed at any time, so you can adjust their number when needed.

Replicas can be added or removed at runtime; primaries can't. You can change the number of replicas per shard at any time, because replicas can always be created or removed. This doesn't apply to the number of primary shards an index is divided into; you have to decide on the number of shards before creating the index. Keep in mind that too few shards limit how much you can scale, but too many shards impact performance. The default setting of five is typically a good start. A runtime replica change is shown in the sketch below.
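
Since replicas, unlike primary shards, can be changed at runtime, here is a matching sketch that raises the replica count of a live index through the _settings endpoint (the host and index name are again placeholders):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UpdateReplicas {
    public static void main(String[] args) throws Exception {
        // _settings can be updated on a live index; placeholder host and index name
        URL url = new URL("http://localhost:9200/page-views/_settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Raising replicas from 1 to 2; number_of_shards could NOT be changed this way
        String body = "{ \"index\": { \"number_of_replicas\": 2 } }";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```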

 

A node is an instance of Elasticsearch. When you start Elasticsearch on your server, you have a node. If you start Elasticsearch on another server, it's another node. You can even have more nodes on the same server by starting multiple Elasticsearch processes. Multiple nodes can join the same cluster; starting nodes with the same cluster name and otherwise default settings is enough to make a cluster.

With a cluster of multiple nodes, the same data can be spread across multiple servers. This helps performance, because Elasticsearch has more resources to work with. It also helps reliability: if you have at least one replica per shard, any node can disappear and Elasticsearch will still serve you all the data. For an application that's using Elasticsearch, having one or more nodes in a cluster is transparent; by default, you can connect to any node in the cluster and work with the whole data set just as if you had a single node.

Although clustering is good for performance and availability, it has its disadvantages: you have to make sure nodes can communicate with each other quickly enough and that you won't have a split brain (two parts of the cluster that can't communicate, each thinking the other part dropped out). Such issues have to be addressed through careful cluster configuration.
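
One quick way to inspect the cluster your nodes have formed is the _cluster/health endpoint, which reports the cluster name, status, and node count; a minimal sketch with a placeholder host:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ClusterHealth {
    public static void main(String[] args) throws Exception {
        // Placeholder host; any node in the cluster can answer for the whole cluster
        URL url = new URL("http://localhost:9200/_cluster/health?pretty");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON with status, number_of_nodes, active shards
            }
        }
    }
}
```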

WHAT HAPPENS WHEN YOU SEARCH AN INDEX?

 

When you search an index, Elasticsearch has to look at a complete set of shards for that index. Those shards can be either primaries or replicas, because primary and replica shards typically contain the same documents. Elasticsearch distributes the search load between the primary and replica shards of the index you're searching, making replicas useful for both search performance and fault tolerance.
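
To tie this back to the API: a search is just a request against the index, and Elasticsearch fans it out to one copy (primary or replica) of each shard behind the scenes. A minimal sketch with a placeholder host, index name, and query:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SearchIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder host and index; "q" takes a Lucene query string
        URL url = new URL("http://localhost:9200/page-views/_search?q=user:42");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // hits from all shards, already merged
            }
        }
    }
}
```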


 

Happy sharding in Elasticsearch with Vinay 🙂