Streaming Audio: Apache Kafka® & Real-Time Data

Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community, from Kafka connectors and distributed systems to data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join our hosts as they stream through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Tracks

Apache Kafka® 3.5 is here, with a preview of migrating existing ZooKeeper clusters to KRaft mode. Follow along as Danica Fine highlights key release updates.

Kafka Core:
KIP-833 provides an updated timeline for KRaft.
KIP-866 is now in preview and allows migration from an existing ZooKeeper cluster to KRaft mode.
KIP-900 introduces a way to bootstrap the KRaft controllers with SCRAM credentials.
KIP-903 prevents a data loss scenario by preventing replicas with stale broker epochs from joining the ISR list.
KIP-915 streamlines the process of downgrading Kafka's transaction and group coordinators by introducing tagged fields.

Kafka Connect:
KIP-710 provides the option to use a REST API for internal server communication, enabled by setting `dedicated.mode.enable.internal.rest` to true.
KIP-875 offers support for native offset management in Kafka Connect. Connect cluster administrators can now read offsets for both source and sink connectors. This KIP also adds a new STOPPED state for connectors, enabling users to shut down connectors and maintain connector configurations without utilizing resources.
KIP-894 makes the `IncrementalAlterConfigs` API available for use in MirrorMaker 2 (MM2), adding a new `use.incremental.alter.config` configuration which takes the values “requested,” “never,” and “required.”
KIP-911 adds a new source tag for metrics generated by the `MirrorSourceConnector` to help monitor mirroring deployments.

Kafka Streams:
KIP-399 improves Kafka Streams' error-handling capabilities by addressing serialization errors that occur before message production and extending the interface for custom error handling.
KIP-889 introduces versioned state stores in Kafka Streams for temporal join semantics in stream-to-table joins.
KIP-904 simplifies table aggregation in Kafka Streams by changing the serialization format to enable one-step aggregation and reduce noise from events with old and new keys/values.
KIP-914 modifies how versioned state stores are used in Kafka Streams. Versioned state stores may impact different DSL processors in varying ways; see the documentation for details.

Kafka Client:
KIP-881 is now complete and introduces new client-side assignor logic for rack-aware partition assignment for Kafka consumers.
KIP-887 adds the `EnvVarConfigProvider` implementation to Kafka so custom configurations stored in environment variables can be injected into the system by providing the map returned by `System.getenv()`.
KIP-641 introduces the `RecordReader` interface to Kafka's clients module, replacing the deprecated `MessageReader` Scala trait.

EPISODE LINKS
See release notes for Apache Kafka 3.5
Read the blog to learn more
Download and get started with Apache Kafka 3.5
Watch the video version of this podcast
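The temporal join semantics that KIP-889 mentions above can be pictured with a small in-memory model. This is only a sketch in plain Python, not the Kafka Streams API: the `VersionedStore` class, its `put` and `get_as_of` methods, and the sample data are all hypothetical names chosen for illustration. The idea it shows is that a versioned store keeps older values per key, so a stream record can be joined against the table value that was current at the record's own timestamp.

```python
import bisect


class VersionedStore:
    """Toy versioned key-value store (hypothetical; not the Kafka Streams API)."""

    def __init__(self):
        # key -> (sorted list of timestamps, list of values in the same order)
        self._history = {}

    def put(self, key, value, timestamp):
        """Record that `key` took `value` starting at `timestamp`."""
        timestamps, values = self._history.setdefault(key, ([], []))
        idx = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(idx, timestamp)
        values.insert(idx, value)

    def get_as_of(self, key, timestamp):
        """Return the value that was current for `key` at `timestamp`, or None."""
        timestamps, values = self._history.get(key, ([], []))
        idx = bisect.bisect_right(timestamps, timestamp)
        return values[idx - 1] if idx > 0 else None


# Table side: the price of an item changes over time.
prices = VersionedStore()
prices.put("widget", 10, timestamp=100)
prices.put("widget", 12, timestamp=200)

# Stream side: an order stamped 150 joins against the price valid at 150,
# even if it happens to be processed after the later price update arrived.
order = {"item": "widget", "ts": 150}
joined_price = prices.get_as_of(order["item"], order["ts"])
```

A store that kept only the latest value would join this order against the newer price, which is the situation versioned stores are meant to avoid.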

6/15/23 • 11:25

After recording 64 episodes and featuring 58 amazing guests, the Streaming Audio podcast series has amassed over 130,000 plays on YouTube in the last year. We're extremely proud of these achievements and feel that it's time to take a well-deserved break. Streaming Audio will be taking a vacation! We want to express our gratitude to you, our valued listeners, for spending 10,000 hours with us on this incredible journey.

Rest assured, we will be back with more episodes! In the meantime, feel free to revisit some of our previous episodes. For instance, you can listen to Anna McDonald share her stories about the worst Apache Kafka® bugs she’s ever seen, or listen to Jun Rao offer his expert advice on running Kafka in production. And who could forget the charming backstory behind Mitch Seymour's Kafka storybook, Gently Down the Stream?

These memorable episodes brought us joy, and we're thrilled to have shared them with you. As we reflect on our accomplishments with pride, we also look forward to an exciting future. Until we meet again, happy listening!

EPISODE LINKS
Top 6 Worst Apache Kafka JIRA Bugs
Running Apache Kafka in Production
Learn How Stream-Processing Works The Simplest Way Possible
Watch the video version of this podcast
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

4/13/23 • 01:18

Have you ever struggled with managing data long term, especially as the schema changes over time? To manage and leverage data across an organization, it’s essential to have well-defined guidelines and standards in place around data quality, enforcement, and data transfer. To get started, Abraham Leal (Customer Success Technical Architect, Confluent) suggests that organizations associate their Apache Kafka® data with a data contract (schema). A data contract is an agreement between a service provider and data consumers. It defines the management and intended usage of data within an organization. In this episode, Abraham talks to Kris about how to use data contracts and schema enforcement to ensure long-term data management.

When an organization sends and stores critical data in Kafka, more often than not it would like to leverage that data in various ways for multiple business units. Kafka is particularly suited for this use case, but it can be problematic later on if the governance rules aren’t established up front.

With Schema Registry, evolution is easy due to its robust security guarantees. When managing data pipelines, you can also use GitOps automation features for an extra control layer. This allows you to be creative with topic versioning, upcasting/downcasting the data collected, and adding quality assurance steps at the end of each run to ensure your project remains reliable.

Abraham explains that Protobuf and Avro are better formats to use than XML or JSON because they are built to handle schema evolution. In addition, they have a much lower per-record overhead, so you can save bandwidth and data storage costs by adopting them.

There’s so much more to consider, but if you are thinking about implementing data contracts or integrating with your data quality team, Abraham suggests that you use Schema Registry heavily from the beginning.

If you have more questions, Kris invites you to join the conversation. You can also watch the KOR Financial Current talk Abraham mentions or take Danica Fine’s free course on how to use Schema Registry on Confluent Developer.

EPISODE LINKS
OS project
KOR Financial Current Talk
The Key Concepts of Schema Registry
Schema Evolution and Compatibility
Schema Registry Made Simple by Confluent Cloud ft. Magesh Nandakumar
Kris Jenkins’ Twitter
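To make the idea of schema evolution under a data contract more concrete, here is a minimal sketch of a backward-compatibility check. It is a deliberate simplification and not how Schema Registry is implemented: schemas are modeled as plain dictionaries, the function name `is_backward_compatible` and the sample schemas are hypothetical, and type promotion rules are ignored. The rule it illustrates is the common one that a new reader schema can still decode old records if every newly added field carries a default.

```python
def is_backward_compatible(old_schema, new_schema):
    """Return True if a reader using new_schema could decode records written with old_schema.

    Simplified rule set (an assumption for illustration):
    - a field present in both schemas must keep the same type
    - a field added in new_schema must declare a default
    - fields removed in new_schema are simply ignored by the reader
    """
    for name, spec in new_schema.items():
        if name in old_schema:
            if old_schema[name]["type"] != spec["type"]:
                return False  # type changed under the reader
        elif "default" not in spec:
            return False  # old records have no value for this new required field
    return True


# Version 1 of a hypothetical payment contract.
v1 = {"id": {"type": "string"}, "amount": {"type": "double"}}

# Adding an optional field with a default keeps old data readable.
v2 = dict(v1, currency={"type": "string", "default": "USD"})

# Adding a required field without a default breaks readers of old data.
v3 = dict(v1, region={"type": "string"})

v2_ok = is_backward_compatible(v1, v2)
v3_ok = is_backward_compatible(v1, v3)
```

Running a check like this in CI before a schema change is merged is one way to picture the "extra control layer" the episode describes.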

3/21/23 • 57:28

Can you use Apache Kafka® and Python together? What’s the current state of Python support? And what are the best options to get started? In this episode, Dave Klein joins Kris to talk about all things Kafka and Python: the libraries, the tools, and the pros and cons. He also talks about the new course he just launched to support Python programmers entering the event-streaming world.

Dave has been an active member of the Kafka community for many years and noticed that there were a lot of Kafka resources for Java but few for Python. So he decided to create a course to help people get started using Python and Kafka together.

Historically, Java has had the most documentation, and people have often missed how good the Python support is for Kafka users. Python and Kafka are an ideal fit for machine learning applications and data engineering in general, and there are many use cases for building streaming and machine learning pipelines. In fact, someone conducted a survey to find out what languages were most popular in the Kafka community, and Python came in second after Java. That’s how Dave got the idea to create a course for newbies.

In this course, Dave combines video lectures with code-heavy exercises to give developers a taste of what the code looks like and how to structure its classes and functions, so you can get hands-on practice using the library. He also covers building a producer and a consumer and using the admin client. And, of course, there is a module that covers working with the schemas supported by the Kafka library.

Dave explains that Python opens up a world of opportunity and is ripe for expansion. So if you are ready to dive in, head over to developer.confluent.io to learn more about Dave’s course.

EPISODE LINKS
Blog: Getting Started with Python for Apache Kafka
Course: Introduction to Apache Kafka for Python Developers
Step-by-step guide: Building a Python client application for Kafka
Coding in Motion
Building and Designing Events and Event Streams with Apache Kafka

3/14/23 • 31:57

In this episode, Kris interviews Doron Porat, Director of Infrastructure at Yotpo, and Liran Yogev, Director of Engineering at ZipRecruiter (formerly at Yotpo), about their experiences and strategies in dealing with data modeling at scale.

Yotpo has a vast and active data lake, comprising thousands of datasets that are processed by different engines, primarily Apache Spark™. They wanted to provide users with self-service tools for generating and utilizing data with maximum flexibility, but encountered difficulties, including poor standardization, low data reusability, limited data lineage, and unreliable datasets.

The team realized that Yotpo's modeling layer, which defines the structure and relationships of the data, needed to be separated from the execution layer, which defines and processes operations on the data.

This separation would give programmers better visibility into data pipelines across all execution engines, storage methods, and formats, as well as more governance control for exploration and automation.

To address these issues, they developed YODA, an internal tool that combines an excellent developer experience with dbt, Databricks, Airflow, Looker, and more, on top of a strong CI/CD and orchestration layer.

Yotpo is a B2B SaaS e-commerce marketing platform that provides businesses with the necessary tools for accurate customer analytics, remarketing, support messaging, and more.

ZipRecruiter is a job site that utilizes AI matching to help businesses find the right candidates for their open roles.

EPISODE LINKS
Current 2022 Talk: Next Gen Data Modeling in the Open Data Platform
Data Mesh 101
Data Mesh Architecture: A Modern Distributed Data Model

3/7/23 • 55:55

Migrating Apache Kafka® clusters can be challenging, especially when moving large amounts of data while minimizing downtime. Michael Dunn (Solutions Architect, Confluent) has worked in the data space for many years, designing and managing systems to support high-volume applications. He has helped many organizations strategize, design, and implement successful Kafka cluster migrations between different environments. In this episode, Michael shares some tips about Kafka cluster migration with Kris, including the pros and cons of the different tools he recommends.

Michael explains that there are many reasons why companies migrate their Kafka clusters. For example, they may want to modernize their platforms, move to a self-hosted cloud server, or consolidate clusters. He tells Kris that creating a plan and selecting the right tool before getting started is critical for reducing downtime and minimizing migration risks.

The good news is that a few tools can facilitate moving large amounts of data, topics, schemas, applications, connectors, and everything else from one Apache Kafka cluster to another.

Kafka MirrorMaker/MirrorMaker2 (MM2) is a stand-alone tool for copying data between two Kafka clusters. It uses source and sink connectors to replicate topics from a source cluster into the destination cluster.

Confluent Replicator allows you to replicate data from one Kafka cluster to another. Replicator is similar to MM2; the difference is that it has been battle-tested.

Cluster Linking is a powerful tool offered by Confluent that allows you to mirror topics from an Apache Kafka 2.4/Confluent Platform 5.4 source cluster to a Confluent Platform 7+ cluster in a read-only state, and is available as a fully-managed service in Confluent Cloud.

At the end of the day, Michael stresses that with a well-thought-out strategy and the right tool, Kafka cluster migration can be relatively painless. Following his advice, you should be able to keep your system healthy and stable before and after the migration is complete.

EPISODE LINKS
MirrorMaker 2
Replicator
Cluster Linking
Schema Migration
Multi-Cluster Apache Kafka with Cluster Linking

3/1/23 • 61:30

dbt is known as being part of the Modern Data Stack (MDS) for ELT processes. Being in the MDS, dbt Labs believes in having the best of breed for every part of the stack. Oftentimes folks are using an EL tool like Fivetran to pull data from the database into the warehouse, then using dbt to manage the transformations in the warehouse. Analysts can then build dashboards on top of that data, or execute tests.

It’s possible for an analyst to adapt this process for use with a microservice application using Apache Kafka® and the same method to pull batch data out of each and every database; however, in this episode, Amy Chen (Partner Engineering Manager, dbt Labs) tells Kris about a better way forward for analysts willing to adopt the streaming mindset: reusable pipelines using dbt models that immediately pull events into the warehouse and materialize as materialized views by default.

dbt Labs is the company that makes and maintains dbt. dbt Core is the open-source data transformation framework that allows data teams to operate with software engineering’s best practices. dbt Cloud is the fastest and most reliable way to deploy dbt.

Inside the world of event streaming, there is a push to expand data access beyond the programmers writing the code, and towards everyone involved in the business. Over at dbt Labs they’re attempting something of the reverse: getting data analysts to adopt the best practices of software engineers, and more recently, of streaming programmers. They’re improving the process of building data pipelines while empowering businesses to bring more contributors into the analytics process, with an easy-to-deploy, easy-to-maintain platform. It offers version control to analysts who traditionally don’t have access to git, along with the ability to easily automate testing, all in the same place.

In this episode, Kris and Amy explore:
How to revolutionize testing for analysts with two of dbt’s core functionalities
What streaming in a batch-based analytics world should look like
What can be done to improve workflows
How to democratize access to data for everyone in the business

EPISODE LINKS
Learn more about dbt Labs
An Analytics Engineer’s Guide to Streaming
Panel discussion: If Streaming Is the Answer, Why Are We Still Doing Batch?
All Current 2022 sessions and slides

2/22/23 • 43:41

What’s the next big thing in the future of streaming data? In this episode, Greg DeMichillie (VP of Product and Solutions Marketing, Confluent) talks to Kris about the future of stream processing in environments where the value lies in the ability to intercept and interpret data.

Greg explains that organizations typically focus on the infrastructure containers themselves, and not on the thousands of data connections that form within. When they finally realize that they don't have a way to manage the complexity of these connections, a new problem arises: how do they approach managing such complexity? That’s where Confluent and Apache Kafka® come into play: they offer a consistent way to organize this seemingly endless web of data so organizations don't have to face the daunting task of figuring out how to connect their shopping portals or jump through hoops trying different ETL tools on various systems.

As more companies seek ways to manage this data, they are asking some basic questions:
How to do it?
Do best practices exist?
How can we get help?

The next question for companies who have already adopted Kafka is a bit more complex: “What about my partners?” For example, companies with inventory management systems use supply chain systems to track product creation and shipping. As a result, they need to decide which emails to update, if they need to write custom REST APIs to sit in front of Kafka topics, etc. Advanced use cases like this raise additional questions about data governance, security, data policy, and PII, forcing companies to think differently about data.

Greg predicts this is the next big frontier as more companies adopt Kafka internally. And because they will have to think less about where the data is stored and more about how data moves, they will have to solve problems to make managing all that data easier. If you're an enthusiast of real-time data streaming, Greg invites you to attend the Kafka Summit (London) in May and Current (Austin, TX) for a deeper dive into the world of Apache Kafka-related topics now and beyond.

EPISODE LINKS
What’s Ahead of the Future of Data Streaming?
If Streaming Is the Answer, Why Are We Still Doing Batch?
All Current 2022 sessions and slides
Kafka Summit London 2023
Current 2023

2/15/23 • 41:29

What can online gaming teach us about making large-scale event management more collaborative in real-time? Ben Gamble (Developer Relations Manager, Aiven) has come to the world of real-time event streaming from an unusual source: the video games industry. And if you stop to think about it, modern online games are complex, distributed real-time data systems with decades of innovative techniques to teach us.

In this episode, Ben talks with Kris about integrating gaming concepts with Apache Kafka®. Using Kafka’s state management stream processing, Ben has built systems that can handle real-time event processing at a massive scale, including interesting approaches to conflict resolution and collaboration.

Building latency into a system is one way to mask data processing time. Ben says that you can efficiently hide latency issues and prioritize performance improvements by setting an initial target and then optimizing from there. If you measure before optimizing, you can add an extra layer to manage user expectations better. Tricks like adding a visual progress bar give the appearance of progress but actually hide latency and improve the overall user experience.

To effectively handle challenging activities, like resolving conflicts and atomic edits, Ben suggests “slicing” (or nano batching) to break down tasks into small, related chunks. Slicing allows each task to be evaluated separately, thus producing timely outcomes that resolve potential background conflicts without the user knowing.

Ben also explains how he uses pooling to make collaboration seamless. Pooling is a process that links open requests with potential matches. Similar to booking seats on an airplane, seats are assigned when requests are made. As these types of connections are handled through a Kafka event stream, the initial open requests are eventually fulfilled when seats become available.

According to Ben, real-world tools that facilitate collaboration (such as Google Docs and Slack) work similarly. Just like multi-player gaming systems, multiple users can comment or chat in real-time and users perceive instant responses because of the techniques ported over from the gaming world.

As Ben sees it, the spread of these concepts across disciplines will benefit a larger number of collaborative systems. Despite being long established for gamers, these patterns can be implemented in more business applications to improve the user experience significantly.

EPISODE LINKS
Going Multiplayer With Kafka (Current 2022)
Building a Dependable Real-Time Betting App with Confluent Cloud and Ably
Event Streaming Patterns
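The “slicing” (nano batching) idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Ben's actual implementation: the event shape (`ts`, `field`, `value`), the 50 ms slice width, and the last-writer-wins rule inside each slice are all hypothetical choices made for the example.

```python
from collections import defaultdict

SLICE_MS = 50  # hypothetical slice width in milliseconds


def slice_events(events, slice_ms=SLICE_MS):
    """Group events into small batches keyed by the time slice they fall into."""
    slices = defaultdict(list)
    for event in events:
        slices[event["ts"] // slice_ms].append(event)
    return [slices[key] for key in sorted(slices)]


def resolve_slice(batch):
    """Resolve conflicting edits inside one slice: the latest edit per field wins."""
    resolved = {}
    for event in sorted(batch, key=lambda e: e["ts"]):
        resolved[event["field"]] = event["value"]
    return resolved


# Two users edit the same document title within the same 50 ms slice.
events = [
    {"ts": 10, "field": "title", "value": "Draft"},
    {"ts": 30, "field": "title", "value": "Draft v2"},
    {"ts": 40, "field": "body", "value": "Hello"},
    {"ts": 70, "field": "title", "value": "Final"},
]

results = [resolve_slice(batch) for batch in slice_events(events)]
```

Each slice is resolved on its own, so the conflicting title edits in the first slice collapse into a single outcome before anything is shown to the user, and the later edit is handled in the next slice.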

2/8/23 • 55:32

Apache Kafka® 3.4 is released! In this special episode, Danica Fine (Senior Developer Advocate, Confluent) shares highlights of the Apache Kafka 3.4 release. This release introduces new KIPs in Kafka Core, Kafka Streams, and Kafka Connect.

In Kafka Core:
KIP-792 expands the metadata each group member passes to the group leader in its JoinGroup subscription to include the highest stable generation that consumer was a part of.
KIP-830 includes a new configuration setting that allows you to disable the JMX reporter for environments where it’s not being used.
KIP-854 introduces changes to clean up producer IDs more efficiently, to avoid excess memory usage. It introduces a new timeout parameter that affects the expiry of producer IDs and updates the old parameter to only affect the expiry of transaction IDs.
KIP-866 (early access) provides a bridge to migrate from existing ZooKeeper clusters to new KRaft mode clusters, enabling the migration of existing metadata from ZooKeeper to KRaft.
KIP-876 adds a new property that defines the maximum amount of time that the server will wait to generate a snapshot; the default is 1 hour.
KIP-881, an extension of KIP-392, makes it so that consumers can now be rack-aware when it comes to partition assignments and consumer rebalancing.

In Kafka Streams:
KIP-770 updates some Kafka Streams configs and metrics related to the record cache size.
KIP-837 allows users to multicast result records to every partition of downstream sink topics and adds functionality for users to choose to drop result records without sending.

And finally, for Kafka Connect:
KIP-787 allows users to run MirrorMaker2 with custom implementations for the Kafka resource manager and makes it easier to integrate with your ecosystem.

Tune in to learn more about the Apache Kafka 3.4 release!

EPISODE LINKS
See release notes for Apache Kafka 3.4
Read the blog to learn more
Download Apache Kafka 3.4 and get started

2/7/23 • 05:13

How can you use OpenTelemetry to gain insight into your Apache Kafka® event systems? Roman Kolesnev, Staff Customer Innovation Engineer at Confluent, is a member of the Customer Solutions & Innovation Division Labs team working to build business-critical OpenTelemetry applications so companies can see what’s happening inside their data pipelines. In this episode, Roman joins Kris to discuss tracing and monitoring in distributed systems using OpenTelemetry. He talks about how monitoring each step of the process individually is critical to discovering potential delays or bottlenecks before they happen, including keeping track of timestamps, latency information, exceptions, and other data points that could help with troubleshooting.

Tracing each request and its journey to completion in Kafka gives companies access to invaluable data that provides insight into system performance and reliability. Furthermore, using this data allows engineers to quickly identify errors or anticipate potential issues before they become significant problems. With greater visibility comes better control over application health, all made possible by OpenTelemetry's unified APIs and services.

As described on the OpenTelemetry.io website, "OpenTelemetry is a Cloud Native Computing Foundation incubating project. Formed through a merger of the OpenTracing and OpenCensus projects." It provides a vendor-agnostic way for developers to instrument their applications across different platforms and programming languages while adhering to standard semantic conventions so the traces/information can be streamed to compatible systems following similar specs.

By leveraging OpenTelemetry, organizations can ensure their applications and systems are secure and perform optimally. It will quickly become an essential tool for large-scale organizations that need to efficiently process massive amounts of real-time data. With its ability to scale independently, robust analytics capabilities, and powerful monitoring tools, OpenTelemetry is set to become the go-to platform for stream processing in the future.

Roman explains that the OpenTelemetry APIs for Kafka are still in development and not yet available as open source. The code is complete and tested but has never run in production. But if you want to learn more about the nuts and bolts, he invites you to connect with him on the Confluent Community Slack channel. You can also check out Monitoring Kafka without instrumentation with eBPF by Antón Rodríguez to learn more about a similar approach for domain monitoring.

EPISODE LINKS
OpenTelemetry java instrumentation
OpenTelemetry collector
Distributed Tracing for Kafka with OpenTelemetry
Monitoring Kafka without instrumentation with eBPF
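Tracing a request's journey through Kafka, as discussed above, depends on carrying trace context along with each message. The sketch below shows that idea using only the Python standard library; it does not use the OpenTelemetry SDK or Roman's unreleased code. It builds a W3C Trace Context `traceparent` value (version, trace id, span id, flags) and carries it in message headers modeled as a list of `(key, bytes)` tuples, a shape commonly used by Python Kafka clients. The helper names `new_traceparent`, `child_traceparent`, `inject`, and `extract` are hypothetical.

```python
import secrets


def new_traceparent():
    """Build a W3C `traceparent` value: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 marks the trace as sampled


def child_traceparent(parent):
    """Keep the trace id but mint a new span id for the next hop."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def inject(headers, traceparent):
    """Return a copy of the headers with the trace context set (producer side)."""
    kept = [(key, value) for key, value in headers if key != "traceparent"]
    return kept + [("traceparent", traceparent.encode("ascii"))]


def extract(headers):
    """Read the trace context back from the headers (consumer side), or None."""
    for key, value in headers:
        if key == "traceparent":
            return value.decode("ascii")
    return None


# Producer attaches the context; consumer reads it and continues the same trace.
produced_context = new_traceparent()
message_headers = inject([("content-type", b"application/json")], produced_context)
consumed_context = extract(message_headers)
next_hop_context = child_traceparent(consumed_context)
```

Because every hop shares the same trace id, a tracing backend can stitch the producer, broker hop, and consumer spans into one end-to-end view of the request.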

2/1/23 • 50:01

Data democratization allows everyone in an organization to have access to the data they need, and the tools needed to use this data effectively. In short, data democratization enables better business decisions. In this episode, Rama Ryali, a Senior IT and Data Executive, chats with Kris Jenkins about the importance of data democratization in modern systems.

Rama explains that tech has unprecedented control over data and ignores basic business needs. Tech’s influence has largely gone unchecked and has led to a disconnect that often forces businesses to hire outside vendors for help turning their data into information they can use. In his role at RightData, Rama worked closely with Marketing, Sales, Customers, and Leadership to develop a no-code unified data platform that is accessible to everyone and fosters data democratization.

So what is data democracy anyway? Rama explains that data democratization is the process of making data more accessible and open to a wider audience in a unified, no-code UI. It involves making sure that data is available to people who need it, regardless of their technical expertise or background. This enables businesses to make data-driven decisions faster and reduces the costs associated with acquiring, processing, and storing information. In addition, by allowing more people access to data, organizations can better collaborate and access tools that allow them to gain valuable insights into their operations and gain a competitive edge in the marketplace.

In a perfect world, complicated tools supported by SQL, Excel, etc., with static views of data, will be replaced by a UI that anyone can use to analyze real-time streaming data. Kris coined a phrase, “data socialization,” which describes the way that these types of tools can enable human connections across all areas of the organization, not just tech.

Rama acknowledges that Excel, SQL, and other dev-heavy platforms will never go away, but the future of data democracy will allow businesses to unlock the maximum value of data through an iterative, democratic process where people talk about what the data is, what matters to other people, and how to transmit it in a way that makes sense.

EPISODE LINKS
RightData LinkedIn
The 5 W’s of Metadata by Rama Ryali
Real-Time Machine Learning and Smarter AI with Data Streaming

1/26/23 • 47:27

Is it possible to manage and test data like code? lakeFS is an open-source data version control tool that transforms object storage into Git-like repositories, offering teams a way to use the same workflows for code and data. In this episode, Kris sits down with guest Adi Polak, VP of DevX at Treeverse, to discuss how lakeFS can be used to facilitate better management and testing of data.

At its core, lakeFS provides teams with better data management. A theoretical data engineer on a large team runs a script to delete some data, but a bug in the script accidentally deletes a lot more data than intended. Application engineers can check out the main branch, effectively erasing their mistakes, but without a tool like lakeFS, this data engineer would be in a lot of trouble.

Polak is quick to explain that lakeFS isn’t built on Git. The source code behind an application is usually a few dozen megabytes, while lakeFS is designed to handle petabytes of data; however, it does use Git-like semantics to create and access versions, so adoption is quick and simple.

Another big challenge that lakeFS helps teams tackle is reproducibility. Troubleshooting when and where a corruption in the data first appeared can be a tricky task for a data engineer when data is constantly updating. With lakeFS, engineers can refer to snapshots to see where the product was corrupted, and roll back to that exact state.

lakeFS also assists teams with reprocessing of historical data. With lakeFS, data can be reprocessed on an isolated branch before merging, to ensure the reprocessed data is exposed atomically. It also makes it easier to access the different versions of reprocessed data using any tag or a historical commit ID.

Tune in to hear more about the benefits of lakeFS.

EPISODE LINKS
Adi Polak's Twitter
lakeFS Git-for-data GitHub repo
What is a Merkle Tree?
If Streaming Is the Answer, Why Are We Still Doing Batch?
Current 2022 sessions and slides
Sign up for updates on Current 2023
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
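To make the rollback and isolated-branch ideas from this episode concrete, here is a minimal toy sketch in Python. It is not the lakeFS API: `ToyRepo`, its method names, and the file paths are all hypothetical, and the merge is assumed to be a simple fast-forward. It only illustrates why pointer-based versions make an accidental delete recoverable and a reprocessing run atomic.

```python
class ToyRepo:
    """Toy in-memory model of Git-like data versioning (not the lakeFS API)."""

    def __init__(self):
        self.commits = {"c0": {}}       # commit id -> snapshot {path: content}
        self.branches = {"main": "c0"}  # branch name -> commit id it points at
        self._counter = 0

    def create_branch(self, name, source="main"):
        # A branch is just a pointer, so no data is copied.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # Build a new snapshot from the branch head; a value of None deletes a path.
        snapshot = dict(self.commits[self.branches[branch]])
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        self._counter += 1
        commit_id = f"c{self._counter}"
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        return commit_id

    def rollback(self, branch, commit_id):
        # Point the branch back at an earlier snapshot.
        self.branches[branch] = commit_id

    def merge(self, source, target="main"):
        # Fast-forward only (an assumption): the target adopts the source head in one step.
        self.branches[target] = self.branches[source]

    def read(self, branch):
        return self.commits[self.branches[branch]]


repo = ToyRepo()
good = repo.commit("main", {"events/2023-01.parquet": "v1",
                            "events/2023-02.parquet": "v1"})

# A buggy script deletes more than intended, then the engineer rolls back.
repo.commit("main", {"events/2023-01.parquet": None,
                     "events/2023-02.parquet": None})
repo.rollback("main", good)

# Reprocess on an isolated branch; main sees nothing until the merge.
repo.create_branch("reprocess")
repo.commit("reprocess", {"events/2023-01.parquet": "v2"})
before_merge = dict(repo.read("main"))
repo.merge("reprocess")
after_merge = dict(repo.read("main"))
```

Readers of `main` see either the old data (`before_merge`) or the fully reprocessed data (`after_merge`), never a half-written state, which is the property the episode calls exposing data atomically.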

1/19/23 • 30:42

How does leader election work in Apache Kafka®? For the past 2 ½ years, Adithya Chandra, Staff Software Engineer at Confluent, has been working on Kafka scalability and performance, specifically partition leader election. In this episode, he gives Kris Jenkins a deep dive into the power of leader election in Kafka replication, why we need it, how it works, what can go wrong, and how it's being improved.

Adithya explains that you can configure a certain number of replicas to be distributed across Kafka brokers and then set one of them as the elected leader, while the others become followers. This leader-based model proves efficient because clients only have to write to the leader, which handles the replication process internally.

But what happens when a broker goes offline, when a replica reassignment occurs, or when a broker shuts down? Adithya explains that when these triggers occur, one of the followers becomes the elected leader, and all the other replicas take their cue from the new leader. This failover reassignment ensures that messages are replicated effectively and efficiently with multiple copies across different brokers.

Adithya explains how you can select a broker as the preferred leader. The preferred leader then becomes the new leader in failure events. This reduces latency and ensures messages consistently write to the same broker for easier tracking and debugging.

Leader failover cannot cover all failures, Adithya says. If a broker can’t be reached externally but can talk to other brokers in the cluster, leader failover won’t be triggered. If a broker experiences transient disk or network issues, the leader election process might fail, and the broker will not be elected as a leader. In both cases, manual intervention is required.

Leadership priority is an important feature of Confluent Cloud that allows you to prioritize certain brokers over others and specify which broker is most likely to become the leader in case of a failover. This way, we can prioritize certain brokers to ensure that the most reliable broker handles more important and sensitive replication tasks. Additionally, this feature ensures that replication remains consistent and available even in an unexpected failure event.

Improvements to this component of Kafka will enable it to be applied to a wide variety of scenarios. On-call engineers can use it to mitigate single-broker performance issues while debugging. Network and storage health solutions can use it to prioritize brokers. Adithya explains that preferred leader election and leadership failover ensure data is available and consistent during failure scenarios so that Kafka replication can run smoothly and efficiently.

EPISODE LINKS
Data Plane: Replication Protocol
Optimizing Cloud-Native Apache Kafka Performance ft. Alok Nikhil and Adithya Chandra
Watch the video
Kris Jenkins’ Twitter
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
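The failover and preferred-leader behavior described in this episode can be sketched as a small simulation. This is a simplified model rather than Kafka's controller code: the function names are hypothetical, and it assumes the preferred leader is the first broker in a partition's replica list and that only live, in-sync replicas are eligible.

```python
def elect_leader(replicas, in_sync, live_brokers):
    """Return the first assigned replica that is both live and in sync, else None.

    Assumption: `replicas` is ordered and replicas[0] is the preferred leader.
    """
    for broker in replicas:
        if broker in in_sync and broker in live_brokers:
            return broker
    return None


def failover(replicas, in_sync, live_brokers, current_leader):
    """Keep a healthy leader; otherwise promote an eligible follower."""
    if current_leader in live_brokers and current_leader in in_sync:
        return current_leader
    return elect_leader(replicas, in_sync, live_brokers)


replicas = [1, 2, 3]  # hypothetical broker ids; broker 1 is preferred

# Healthy cluster: the preferred replica leads.
leader = elect_leader(replicas, in_sync={1, 2, 3}, live_brokers={1, 2, 3})

# Broker 1 goes offline: a follower takes over.
after_failure = failover(replicas, in_sync={2, 3}, live_brokers={2, 3},
                         current_leader=leader)

# Broker 1 returns and catches up: failover alone leaves leadership where it is.
unchanged = failover(replicas, in_sync={1, 2, 3}, live_brokers={1, 2, 3},
                     current_leader=after_failure)

# A preferred leader election moves leadership back to broker 1.
restored = elect_leader(replicas, in_sync={1, 2, 3}, live_brokers={1, 2, 3})

# No in-sync replica is live: nobody is eligible in this model.
no_leader = elect_leader(replicas, in_sync=set(), live_brokers={2, 3})
```

The model does not cover the harder cases from the episode, such as a broker that is unreachable from clients but still visible to the rest of the cluster, which is where manual intervention or a leadership-priority mechanism comes in.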

1/12/23 • 51:06

Are bad customer experiences really just data integration problems? Can real-time data streaming and machine learning be democratized in order to deliver a better customer experience? Airy, an open-source data-streaming platform, uses Apache Kafka® to help business teams deliver better results to their customers. In this episode, Airy CEO and co-founder Steffen Hoellinger explains how his company is expanding the reach of stream-processing tools and ideas beyond the world of programmers.

Airy originally built Conversational AI (chatbot) software and other customer support products for companies to engage with their customers in conversational interfaces. Asynchronous messaging created a large amount of traffic, so the company adopted Kafka to ingest and process all messages and events in real time.

In 2020, the co-founders decided to open source the technology, positioning Airy as an open-source app framework for conversational teams at large enterprises to ingest and process conversational and customer data in real time. The decision was rooted in their belief that all bad customer experiences are really data integration problems, especially at large enterprises where data is often siloed and not accessible to machine learning models and human agents in real time.

(Who hasn’t had the experience of entering customer data into an automated system, only to have the same data requested eventually by a human agent?)

Airy is making data streaming universally accessible by supplying its clients with real-time data and offering integrations with standard business software. For engineering teams, Airy can reduce development time and increase the robustness of solutions they build.

Data is now the cornerstone of most successful businesses, and real-time use cases are becoming more and more important. Open-source app frameworks like Airy are poised to drive massive adoption of event streaming over the years to come, across companies of all sizes, and maybe, eventually, down to consumers.

EPISODE LINKS
Learn how to deploy Airy Open Source - or sign up for an Airy Cloud test instance
Google Case Study about Airy & TEDi, a 2,000 store retailer
Become an Expert in Conversational Engineering
Supercharging conversational AI with human agent feedback loops
Integrating all Communication and Customer Data with Airy and Confluent
How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka
Real-Time Threat Detection Using Machine Learning and Apache Kafka
Watch the video
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get $100 of free Confluent Cloud usage (details)

1/5/23 • 38:56

The past year saw new trends emerge in the world of data streaming technologies, as well as some unexpected and novel use cases for Apache Kafka®. New reflections on the future of stream processing and when companies should adopt microservice architecture inspired several talks at this year’s industry conferences. In this episode, Kris is joined by his colleagues Danica Fine, Senior Developer Advocate, and Robin Moffatt, Principal Developer Advocate, for an end-of-year roundtable on this year’s developments and what they want to see in the year to come.

Robin and Danica kick things off with a discussion of the year’s memorable conferences. The topics of talk submissions for Kafka Summit London and Current 2022 were noticeably more varied than in previous years, with fewer talks focused on the basics of Kafka implementation. Many abstracts featured interesting and unusual use cases, in addition to detailed explanations of what went wrong and how others could avoid the same issues.

The conferences also made clear that a lot of companies are adopting or considering stream-processing solutions. Are we close to a future where streaming is a part of everything we do? Is there anything helping streaming become more mainstream? Will stream processing replace batch?

On the other hand, a lot of in-demand talks focused on the importance of understanding the best practices supporting data mesh and understanding the nuances of the system and configurations. Danica identifies this as her big hope for next year: No more Kafka developers pursuing quick fixes. “No more band aid fixes. I want as many people as possible to understand the nuances of the levers that they're pulling for Kafka, whatever project they're building.”

Kris and Robin agree that what will make them happy in 2023 is seeing broader, more diverse client libraries for Kafka. “Getting away from this idea that Kafka is largely a Java shop, which is nonsense, but there is that perception.”

Streaming Audio returns in January 2023.

EPISODE LINKS
Put Your Data To Work: Top 5 Data Technology Trends for 2023
Write What You Know: Turning Your Apache Kafka Knowledge into a Technical Talk
Common Apache Kafka Mistakes to Avoid
Practical Data Pipeline: Build a Plant Monitoring System with ksqlDB
If Streaming Is the Answer, Why Are We Still Doing Batch?
View sessions and slides from Current 2022
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

12/28/22 • 31:19

Entomophiliac Anna McDonald (Principal Customer Success Technical Architect, Confluent) has seen her fair share of Apache Kafka® bugs. For her annual holiday roundup of the most noteworthy Kafka bugs, Anna tells Kris Jenkins about some of the scariest, most surprising, and most enlightening corner cases that make you ask, “Ah, so that’s how it really works?”

She shares a lot of interesting details about how batching works, the replication protocol, how Kafka’s networking stack dances with Linux’s, and which is the most important Scala class to read, if you’re only going to read one.

In particular, Anna gives Kris details about a bug that he’s been thinking about lately: the sticky partitioner (KAFKA-10888). When a Kafka producer sends several records to the same partition at around the same time, the partition can get overloaded. As a result, if too many records get processed at once, they can get stuck, causing an unbalanced workload. Anna goes on to explain that the fix required keeping track of the number of offsets/messages written to each partition, and then batching to force more balanced distributions.

She found another bug that occurs when the Kafka server triggers TCP Congestion Control in some conditions (KAFKA-9648). Anna explains that when a Kafka server restarts and then executes the preferred replica leader election, lots of replica leaders trigger cluster metadata updates. Then, all clients establish a server connection at the same time, while lots of TCP requests are waiting in the TCP SYN queue.

The third bug she talks about (KAFKA-9211) may cause TCP delays after upgrading… Oh, that’s a nasty one. She goes on to tell Kris about a rare bug (KAFKA-12686) in Partition.scala where there’s a race condition between the handling of an AlterIsrResponse and a LeaderAndIsrRequest. This rare scenario involves the delay of AlterIsrResponse when lots of ISR and leadership changes occur due to broker restarts.

Bugs five (KAFKA-12964) and six (KAFKA-14334) are no better, but you’ll have to plug in your headphones and listen in to explore the ghoulish adventures of Anna McDonald as she gives a nightmarish peek into her world of JIRA bugs. It’s just what you might need this holiday season!

EPISODE LINKS
KAFKA-10888: Sticky partition leads to uneven product msg, resulting in abnormal delays in some partitions
KAFKA-9648: Add configuration to adjust listen backlog size for Acceptor
KAFKA-9211: Kafka upgrade 2.3.0 may cause tcp delay ack(Congestion Control)
KAFKA-12686: Race condition in AlterIsr response handling
KAFKA-12964: Corrupt segment recovery can delete new producer state snapshots
KAFKA-14334: DelayedFetch purgatory not completed when appending as follower
Optimizing for Low Latency and High Throughput
Diagnose and Debug Apache Kafka Issues
Watch the video
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Use PODCAST100 to get $100 of free Confluent Cloud usage (details)

12/21/22 • 70:58

Could you explain Apache Kafka® in ways that a small child could understand? When Mitch Seymour, author of Mastering Kafka Streams and ksqlDB, wanted a way to communicate the basics of Kafka and event-based stream processing, he decided to author a children’s book on the subject, but it turned into something with a far broader appeal.

Mitch conceived the idea while writing a traditional manuscript for engineers and technicians interested in building stream processing applications. He wished he could explain what he was writing about to his 2-year-old daughter, and contemplated the best way to introduce the concepts in a way anyone could grasp.

Four months later, he had completed the illustration book: Gently Down the Stream: A Gentle Introduction to Apache Kafka. It tells the story of a family of forest-dwelling Otters, who discover that they can use a giant river to communicate with each other. When more Otter families move into the forest, they must learn to adapt their system to handle the increase in activity.

This accessible metaphor for how streaming applications work is accompanied by Mitch’s warm, painterly illustrations.

For his second book, Seymour collaborated with the researcher and software developer Martin Kleppmann, author of Designing Data-Intensive Applications. Kleppmann admired the illustration book and proposed that the next book tackle a gentle introduction to cryptography. Specifically, it would introduce the concepts behind symmetric-key encryption, key exchange protocols, and the Diffie-Hellman algorithm, a method for exchanging secret information over a public channel.

Secret Colors tells the story of a pair of Bunnies preparing to attend a school dance, who eagerly exchange notes on potential dates. They realize they need a way of keeping their messages secret, so they develop a technique that allows them to communicate without any chance of other Bunnies intercepting their messages.

Mitch’s latest illustration book is A Walk to the Cloud: A Gentle Introduction to Fully Managed Environments. In the episode, Seymour discusses his process of creating the books from concept to completion, the decision to create his own publishing company to distribute these books, and whether a fourth book is on the way. He also discusses the experience of illustrating the books side by side with his wife, shares his insights on how editing is similar to coding, and explains why a concise set of commands is equally desirable in SQL queries and children’s literature.

EPISODE LINKS
Minimizing Software Speciation with ksqlDB and Kafka Streams
Gently Down the Stream: A Gentle Introduction to Apache Kafka
Secret Colors
A Walk to the Cloud: A Gentle Introduction to Fully Managed Environments
Apache Kafka On the Go: Kafka Concepts for Beginners
Apache Kafka 101 course
Watch the video
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
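For readers curious about the Diffie-Hellman exchange that Secret Colors introduces, here is a minimal sketch in Python. The tiny prime and the fixed secrets are textbook toy values chosen for readability and are an assumption of this example, not something taken from the book; real deployments use very large primes (or elliptic curves) and randomly generated secrets.

```python
def public_value(generator, secret, prime):
    # What each party sends over the public channel: g^secret mod p.
    return pow(generator, secret, prime)


def shared_secret(their_public, my_secret, prime):
    # What each party computes privately: (their public value)^my_secret mod p.
    return pow(their_public, my_secret, prime)


# Publicly agreed parameters (toy-sized; far too small to be secure).
prime, generator = 23, 5

# Private values, never transmitted.
alice_secret, bob_secret = 6, 15

# Values exchanged in the open.
alice_public = public_value(generator, alice_secret, prime)
bob_public = public_value(generator, bob_secret, prime)

# Each side combines its own secret with the other's public value.
alice_key = shared_secret(bob_public, alice_secret, prime)
bob_key = shared_secret(alice_public, bob_secret, prime)
```

Both parties arrive at the same key even though an eavesdropper only ever sees `prime`, `generator`, `alice_public`, and `bob_public`; recovering the secrets from those values is the discrete logarithm problem, which is believed to be hard at realistic sizes.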

12/20/22 • 31:29

What are the key factors to consider when developing event-driven architecture? When properly designed, events can connect existing systems with a common language and allow data exchange in near real time. They also help reduce complexity by providing a single source of truth that eliminates the need to synchronize data between different services or applications. They enable dynamic behavior, allowing each service or application to respond quickly to changes in its environment. Using events, developers can create systems that are more reliable, responsive, and easier to maintain.

In this podcast, Adam Bellemare, staff technologist at Confluent, discusses the four dimensions of events and designing event streams along with best practices, and an overview of a new course he just authored. This course, called Introduction to Designing Events and Event Streams, walks you through the process of properly designing events and event streams in any event-driven architecture.

Adam explains that the goal of the course is to provide you with a foundation for designing events and event streams. Along with hands-on exercises and best practices, the course explores the four dimensions of events and event stream design and applies them to real-world problems. Most importantly, he talks to Kris about the key factors to consider when deciding what events to write, what events to publish, and how to structure and design them to trigger actions like broadcasting messages to other services or storing results in a database.

How you design and implement events and event streams significantly affects not only what you can do today, but how you scale in the future. Head over to Introduction to Designing Events and Event Streams to learn everything you need to know about building an event-driven architecture.

EPISODE LINKS
Introduction to Designing Events and Event Streams
Practical Data Mesh: Building Decentralized Data Architecture with Event Streams
The Data Dichotomy: Rethinking the Way We Treat Data and Services
Coding in Motion: Sound & Vision—Build a Data Streaming App with JavaScript and Confluent Cloud
Using Event-Driven Design with Apache Kafka Streaming Applications ft. Bobby Calderwood
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

12/15/22 • 53:06

Is there a better way to manage access to resources without compromising security? New employees need access to a variety of resources within a company's tech stack. But manually granting access can be error-prone. And when employees leave, their access must be revoked, potentially introducing security risks if an admin misses one. In this podcast, Kris Jenkins talks to Anuj Sawani (Security Product Manager, Confluent) about the centralized identity management system he helped build to integrate with Apache Kafka® to prevent common identity management headaches and security risks.

With 12+ years of experience building cybersecurity products for enterprise companies, Anuj Sawani explains how he helped build out KIP-768 (Secured OAuth support in Kafka), which supports a unified identity mechanism that spans cloud and on-premises (hybrid scenarios).

Confluent Cloud customers wanted a single identity to access all their services. The manual process required managing different sets of identity stores across the ecosystem. Anuj goes on to explain how Identity and Access Management (IAM) using cloud-native authentication protocols, such as OAuth or OpenID Connect, solves this problem by centralizing identity and minimizing security risks.

Anuj emphasizes that sticking with industry standards is key because it makes integrating with other systems easy. With OAuth now supported in Kafka, this means performing client upgrades, configuring identity providers, etc. to ensure the applications can leverage new capabilities. One example is using centralized identities for client/broker connections.

As Anuj continues to build and enhance features, he hopes to recommend this unified solution to other technology vendors because it makes integration much easier. The goal is to create a web of connectors that support the same standards. The future is bright, as other organizations are exploring support for OAuth and similar industry standards. Anuj is looking forward to the evolution and applying it to other use cases and scenarios.

EPISODE LINKS
Introduction to Confluent Cloud Security
KIP-768: Secured OAuth support in Apache Kafka
Confluent Cloud Documentation: OAuth 2.0 Support
Apache Kafka Security Best Practices
Security for Real-Time Data Stream Processing with Confluent Cloud
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

12/8/22 • 41:23

Can we use machine learning to detect security threats in real time? As organizations increasingly rely on distributed systems, it is becoming more important to analyze the traffic that passes through those systems quickly. Confluent Hackathon ’22 finalist, Géraud Dugé de Bernonville (Data Consultant, Zenika Bordeaux), shares how his team used TensorFlow (machine learning) and Neo4j (graph database) to analyze network traffic data and detect anomalies in real time. What started as a research and development exercise turned into ZIEM, a full-blown internal project using ksqlDB to manipulate, export, and visualize data from Apache Kafka®.

Géraud and his team noticed that large amounts of data passed through their network, and they were curious to see if they could detect threats as they happened. As a hackathon project, they built ZIEM, a network mapping and intrusion detection platform that quickly generates network diagrams. Using Kafka, the system captures network packets, processes the data in ksqlDB, and uses a Neo4j Sink Connector to send it to a Neo4j instance. Using the Neo4j browser, users can see instant network diagrams showing who's on the network, allowing them to detect anomalies quickly in real time.

The ZIEM project was initially conceived as an experiment to explore the potential of using Kafka for data processing and manipulation. However, it soon became apparent that there was great potential for broader applications (banking, security, etc.). As a result, the focus shifted to developing a tool for exporting data from Kafka, which is helpful in transforming data for deeper analysis, moving it from one database to another, or creating powerful visualizations.

Géraud goes on to talk about how the success of this project has helped them better understand the potential of using Kafka for data processing. Zenika plans to continue working to build a pipeline that can handle more robust visualizations, expose more learning opportunities, and detect patterns.

EPISODE LINKS
Ziem Project on GitHub
ksqlDB 101 course
ksqlDB Fundamentals: How Apache Kafka, SQL, and ksqlDB Work together ft. Simon Aubury
Real-Time Stream Processing, Monitoring, and Analytics with Apache Kafka
Application Data Streaming with Apache Kafka and Swim
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

11/29/22 • 29:18

What happens when you need to store more than a few petabytes of data? Rittika Adhikari (Software Engineer, Confluent) discusses how her team implemented tiered storage, a method for improving the scalability and elasticity of data storage in Apache Kafka®. She also explores the motivating factors for building it in the first place: cost, performance, and manageability.

Before tiered storage, there was no real way to retain Kafka data indefinitely. Because of the tight coupling between compute and storage, users were forced to use different tools to access cold and hot data. Additionally, the cost of re-replication was prohibitive because Kafka had to process large amounts of data rather than small hot sets.

As a member of the Kafka Storage Foundations team, Rittika explains to Kris Jenkins how her team initially considered a Kafka data lake but settled on a more cost-effective method: tiered storage. With tiered storage, one tier handles elasticity and throughput for long-term storage, while the other tier is dedicated to high-cost, low-latency, short-term storage. Before, re-replication impacted all brokers, slowing down performance because it required more replication cycles. By decoupling compute and storage, they now only replicate the hot set rather than weeks of data. Ultimately, this tiered storage method broke down the barrier between compute and storage by separating data into multiple tiers across the cloud. This allowed for better scalability and elasticity that reduced operational toil.

In preparation for a broader rollout to customers who heavily rely on compacted topics, Rittika’s team will be implementing tier compaction to support tiering of compacted topics. The goal is to have the partition leader perform compaction. This will substantially reduce compaction costs (CPU/disk) because the number of replicas compacting is significantly smaller. It also protects the broker resource consumption through a new compaction algorithm and throttling.

EPISODE LINKS
Jun Rao explains: What is Tiered Storage?
Enabling Tiered Storage
Infinite Storage in Confluent Platform
Kafka Storage and Processing Fundamentals
KIP-405: Kafka Tiered Storage
Optimizing Apache Kafka’s Internals with Its Co-Creator Jun Rao
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

11/22/22 • 29:32

In principle, data mesh architecture should liberate teams to build their systems and gather data in a distributed way, without having to explicitly coordinate. Data is the thing that can and should decouple teams, but proper implementation has its challenges.

In this episode, Kris talks to Florian Albrecht (Solution Architect, Hermes Germany) about Galapagos, an open-source DevOps software tool for Apache Kafka® that Albrecht created with his team at Hermes, a German parcel delivery company. After Hermes chose Kafka to implement company-wide event-driven architecture, Albrecht’s team created rules and guidelines on how to use and really make the most out of Kafka. But the hands-off approach wasn’t leading to greater independence, so Albrecht’s team tried something other than documentation: they encoded the rules as software.

This method pushed the teams to stop thinking in terms of data and to start thinking in terms of events. Previously, applications copied data from one point to another, with slight changes each time. In the end, teams with conflicting data were left asking when the data changed and why, with a real impact on customers who might be left wondering when their parcel was redirected and how. Every application would then have to be checked to find out when exactly the data was changed.

Event architecture terminates this cycle. Events are immutable and changes are registered as new domain-specific events. Packaged together as event envelopes, they can be safely copied to other applications, and can provide significant insights. There is no need to check each application to find out when manually entered or imported data was changed, because the complete history exists in the event envelope. More importantly, no more time-consuming collaborations where teams help each other to interpret the data. Using Galapagos helped the teams at Hermes to switch their thought process from raw data to events.

Galapagos also empowers business teams to take charge of their own data needs by providing a protective buffer. When specific teams, providers of data or events, want to change something, Galapagos enforces a method which will not kill the production applications already reading the data. Teams can add new fields which existing applications can ignore, but a previously required field that an application could be relying on won’t be changeable. Business partners using Galapagos found they were better prepared to give answers to their developer colleagues, allowing different parts of the business to communicate in ways they hadn’t before. Through Galapagos, Hermes saw better success decoupling teams.

EPISODE LINKS
A Guide to Data Mesh
Practical Data Mesh ebook
Galapagos GitHub
Florian Albrecht GitHub
Watch the video
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get $100 of free Confluent Cloud usage (details)
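The schema rule described in this episode (new fields are fine, but a required field that consumers may rely on cannot change) can be illustrated with a small check. This is a simplified sketch, not how Galapagos itself is implemented: the function name, the dictionary-based schema format, and the field names are all hypothetical.

```python
def breaking_changes(old_fields, new_fields):
    """List changes that could break consumers reading the old schema.

    Each schema is a dict: field name -> {"type": str, "required": bool}.
    Added fields are never reported, since existing consumers can ignore them.
    """
    problems = []
    for name, old in old_fields.items():
        if not old["required"]:
            continue  # only required fields are treated as a contract here
        new = new_fields.get(name)
        if new is None:
            problems.append(f"required field removed: {name}")
        elif new["type"] != old["type"]:
            problems.append(f"required field changed type: {name}")
    return problems


# Hypothetical parcel event schema, version 1.
v1 = {
    "parcelId": {"type": "string", "required": True},
    "note": {"type": "string", "required": False},
}

# Adding an optional field is a compatible change.
v2_ok = {**v1, "redirectedAt": {"type": "string", "required": False}}

# Dropping a required field would break existing readers.
v2_bad = {"note": v1["note"]}

ok_result = breaking_changes(v1, v2_ok)
bad_result = breaking_changes(v1, v2_bad)
```

A tool that runs a check like this before a schema change is published can reject `v2_bad` automatically, which is the kind of protective buffer the episode describes.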

11/15/22 • 38:38

Is real-time data streaming the future, or will batch processing always be with us? Interest in streaming data architecture is booming, but just as many teams are still happily batching away. Batch processing is still simpler to implement than stream processing, and successfully moving from batch to streaming requires a significant change to a team’s habits and processes, as well as a meaningful upfront investment. Some are even running dbt in micro batches to simulate an effect similar to streaming, without having to make the full transition. Will streaming ever fully take over?

In this episode, Kris talks to a panel of industry experts with decades of experience building and implementing data systems. They discuss the state of streaming adoption today, whether streaming will ever fully replace batch, and whether it even could (or should). Is micro batching the natural stepping stone between batch and streaming? Will there ever be a unified understanding on how data should be processed over time? Is the lack of agreement on best practices for data streaming an insurmountable obstacle to widespread adoption? What exactly is holding teams back from fully adopting a streaming model?

Recorded live at Current 2022: The Next Generation of Kafka Summit, the panel includes Adi Polak (Vice President of Developer Experience, Treeverse), Amy Chen (Partner Engineering Manager, dbt Labs), Eric Sammer (CEO, Decodable), and Tyler Akidau (Principal Software Engineer, Snowflake).

EPISODE LINKS
dbt Labs
Decodable
lakeFS
Snowflake
View sessions and slides from Current 2022
Stream Processing vs. Batch Processing: What to Know
From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

11/9/22 • 43:58

Streaming real-time data at scale and processing it efficiently is critical to cybersecurity organizations like SecurityScorecard. Jared Smith, Senior Director of Threat Intelligence, and Brandon Brown, Senior Staff Software Engineer, Data Platform at SecurityScorecard, discuss their journey from using RabbitMQ to open-source Apache Kafka® for stream processing, as well as why turning to fully managed Kafka on Confluent Cloud is the right choice for building real-time data pipelines at scale.

SecurityScorecard mines data from dozens of digital sources to discover security risks and flaws with the potential to expose their clients’ data. This includes scanning and ingesting data from a large number of ports to identify suspicious IP addresses, exposed servers, out-of-date endpoints, malware-infected devices, and other potential cyber threats for more than 12 million companies worldwide.

To allow real-time stream processing for the organization, the team moved away from using RabbitMQ to open-source Kafka for processing a massive amount of data in a matter of milliseconds, instead of weeks or months. This lets the team detect risks to a website’s security posture quickly, even as security threats constantly evolve. The team had relied on batch pipelines to push data to and from Amazon S3, as well as expensive REST API-based communication carrying data between systems. They also spent significant time and resources on open-source Kafka upgrades on Amazon MSK.

Self-maintaining the Kafka infrastructure increased operational overhead with escalating costs. In order to scale faster, govern data better, and ultimately lower the total cost of ownership (TCO), Brandon, lead of the organization’s Pipeline team, pivoted towards a fully managed, cloud-native approach for more scalable streaming data pipelines, and for the development of a new Automatic Vendor Detection (AVD) product. Jared and Brandon continue to leverage the Cloud for use cases including using PostgreSQL and pushing data to downstream systems using CSC connectors, increasing data governance and security for streaming scalability, and more.

EPISODE LINKS
SecurityScorecard Case Study
Building Data Pipelines with Apache Kafka and Confluent
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

11/3/22 • 48:33

What are some recommendations to consider when running Apache Kafka® in production? Jun Rao, one of the original Kafka creators, as well as an ongoing committer and PMC member, shares the essential wisdom he's gained from developing Kafka and dealing with a large number of Kafka use cases.
Here are 6 recommendations for maximizing Kafka in production:
1. Nail Down the Operational Part
When setting up your cluster, in addition to dealing with the usual architectural issues, make sure to also invest time into alerting, monitoring, logging, and other operational concerns. Managing a distributed system can be tricky and you have to make sure that all of its parts are healthy together. This will give you a chance at catching cluster problems early, rather than after they have become full-blown crises.
2. Reason Properly About Serialization and Schemas Up Front
At the Kafka API level, events are just bytes, which gives your application the flexibility to use various serialization mechanisms. Avro has the benefit of decoupling schemas from data serialization, whereas Protobuf is often preferable to those experienced with remote procedure calls; JSON Schema is user friendly but verbose. When you are choosing your serialization, it's a good time to reason about schemas, which should be well-thought-out contracts between your publishers and subscribers. You should know who owns a schema as well as the path for evolving that schema over time.
3. Use Kafka As a Central Nervous System Rather Than As a Single Cluster
Teams typically start out with a single, independent Kafka cluster, but they could benefit, even from the outset, by thinking of Kafka more as a central nervous system that they can use to connect disparate data sources. This enables data to be shared among more applications.
4. Utilize Dead Letter Queues (DLQs)
DLQs can keep service delays from blocking the processing of your messages. For example, instead of using a unique topic for each customer to which you need to send data (potentially millions of topics), you may prefer to use a shared topic, or a series of shared topics that contain all of your customers. But if you are sending to multiple customers from a shared topic and one customer's REST API is down, you can divert that customer's events into a dead letter queue instead of delaying the whole process. You can then process them later from that queue.
5. Understand Compacted Topics
By default, Kafka topics retain data by time. But there is also another type of topic, a compacted topic, which stores data by key and replaces old data with new data as it comes in. This is particularly useful for working with data that is updateable, for example, data that may be coming in through a change-data-capture log. A practical example of this would be a retailer that needs to update prices and product descriptions to send out to all of its locations.
6. Imagine New Use Cases Enabled by Kafka's Recent Evolution
The biggest recent change in Kafka's history is its migration to the cloud. By using Kafka there, you can reserve your engineering talent for business logic. The unlimited storage enabled by the cloud also means that you can truly keep data forever at reasonable cost, and thus you don't have to build a separate system for your historical data needs.
EPISODE LINKS
Kafka Internals 101
Watch in video
Kris Jenkins' Twitter
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
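Recommendations 4 and 5 above can be illustrated with a small in-memory sketch. It does not use a Kafka client; plain Python lists stand in for topics, and the names `route_with_dlq`, `compact`, and `flaky_deliver`, as well as the sample customers and SKUs, are hypothetical. It only shows the two ideas: diverting events for an unreachable customer into a dead letter queue, and keeping the latest value per key the way a compacted topic eventually does.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (key, value), e.g. (customer id, payload)


def route_with_dlq(
    events: Iterable[Event], deliver: Callable[[Event], None]
) -> Tuple[List[Event], List[Event]]:
    """Deliver events from a shared topic; divert failed ones to a DLQ.

    A failing customer does not block events for the other customers.
    """
    delivered: List[Event] = []
    dead_letters: List[Event] = []
    for event in events:
        try:
            deliver(event)
            delivered.append(event)
        except ConnectionError:
            # Assumption: an unreachable customer endpoint raises ConnectionError.
            dead_letters.append(event)
    return delivered, dead_letters


def compact(log: Iterable[Event]) -> Dict[str, str]:
    """Keep only the most recent value for each key."""
    latest: Dict[str, str] = {}
    for key, value in log:
        latest[key] = value  # newer records replace older ones with the same key
    return latest


def flaky_deliver(event: Event) -> None:
    """Hypothetical delivery function: customer-b's REST API is down."""
    customer, _ = event
    if customer == "customer-b":
        raise ConnectionError("REST API unavailable")


shared_topic = [
    ("customer-a", "invoice-1"),
    ("customer-b", "invoice-2"),
    ("customer-a", "invoice-3"),
]
delivered, dead_letters = route_with_dlq(shared_topic, flaky_deliver)
print(dead_letters)  # events to retry later from the DLQ

price_updates = [("sku-1", "9.99"), ("sku-2", "4.50"), ("sku-1", "8.99")]
print(compact(price_updates))  # one current price per product
```

In a real deployment the dead letter queue would itself be a Kafka topic with its own consumer, and compaction is performed by the broker in the background rather than by application code.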

10/27/22 • 58:44

Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, imparting insights into the future valuations of raw materials for users. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.
Ralph explains that Forecasty.ai was initially built on top of batch processing. However, updating the models with batch-data syncs was costly and environmentally taxing. There was also the question of scalability: progressing from 60 commodities on offer to their eventual plan of over 200 commodities. Ralph observed that most real-time systems are non-batch, streaming-based real-time data platforms with stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing requires resources, such as teams of stream processing specialists.
With the existing team, Ralph decided to build a real-time data platform without using any sort of stateful stream processing. They strictly keep to the out-of-the-box components, such as Kafka topics, the Kafka Producer API, the Kafka Consumer API, and Kafka connectors, along with a real-time database to process data streams and implement the necessary joins inside the database.
Additionally, Ralph shares the tool he built to handle historical data, kash.py, a Kafka shell based on Python. He also discusses the issues the platform needed to overcome for success, and how they can make the migration from batch processing to stream processing painless for the data science team.
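The approach described above, where consumers stay stateless and the joins happen inside the database, might look roughly like the following sketch. It is an illustration under stated assumptions, not Forecasty.ai's implementation: SQLite stands in for the unnamed real-time database, a plain list stands in for records read through the Kafka Consumer API, and the topic names, table names, and record fields are all hypothetical.

```python
import sqlite3
from typing import Any, Dict, List, Tuple


def build_store() -> sqlite3.Connection:
    """Create an in-memory database (a stand-in for the real-time database)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE commodities (symbol TEXT PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE prices (symbol TEXT, ts INTEGER, price REAL)")
    return db


def handle_record(db: sqlite3.Connection, topic: str, record: Dict[str, Any]) -> None:
    """Stateless consumer callback: write each record straight to its table."""
    if topic == "commodities":
        db.execute(
            "INSERT OR REPLACE INTO commodities VALUES (?, ?)",
            (record["symbol"], record["name"]),
        )
    elif topic == "prices":
        db.execute(
            "INSERT INTO prices VALUES (?, ?, ?)",
            (record["symbol"], record["ts"], record["price"]),
        )


def latest_prices(db: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Join reference data with the newest price per symbol, in SQL."""
    return db.execute(
        """
        SELECT c.name, p.price
        FROM prices AS p
        JOIN commodities AS c ON c.symbol = p.symbol
        WHERE p.ts = (SELECT MAX(ts) FROM prices WHERE symbol = p.symbol)
        ORDER BY c.name
        """
    ).fetchall()


# Simulated stream of (topic, record) pairs, in arrival order.
stream = [
    ("commodities", {"symbol": "CU", "name": "Copper"}),
    ("prices", {"symbol": "CU", "ts": 1, "price": 8200.0}),
    ("prices", {"symbol": "CU", "ts": 2, "price": 8250.0}),
]
db = build_store()
for topic, record in stream:
    handle_record(db, topic, record)
print(latest_prices(db))
```

The consumer code holds no state between records, so it can be restarted or scaled without a state store; the cost is that join performance depends entirely on the database.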
EPISODE LINKS
Kafka Streams 101 course
The Difference Engine for Unlocking the Kafka Black Box
GitHub repo: kash.py
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

10/20/22 • 37:18

Java Virtual Machines (JVMs) impact Apache Kafka® performance in production. How can you optimize your event-streaming architectures so they process more Kafka messages using the same number of JVMs? Gil Tene (CTO and Co-Founder, Azul) delves into JVM internals and how developers and architects can use Java and optimized JVMs to make real-time data pipelines more performant and more cost effective, with use cases.
Gil has deep roots in Java optimization, having started out building large data centers for parallel processing, where the goal was to get a finite set of hardware to run the largest possible number of JVMs. As the industry evolved, Gil switched his primary focus to software, and throughout the years, has gained particular expertise in garbage collection (the C4 collector) and JIT compilation. The OpenJDK distribution Gil's company Azul releases, Zulu, is widely used throughout the Java world, although Azul's Prime build version can run Kafka up to 40 percent faster than the open version on identical hardware.
Gil relates that improvements in JVMs aren't achieved in a single stroke or in one day, but are rather the result of many smaller incremental optimizations over time, i.e. "half-percent" improvements that accumulate. Improving a JVM starts with a good engineering team, one that has thought significantly about how to make JVMs better. The team must continuously monitor metrics, and Gil mentions that his team tests optimizations against 400-500 different workloads (one of his favorite things to get into the lab is a new customer's workload). The quality of a JVM can be measured by its response times, the consistency of those response times including outliers, and the level and number of machines that are needed to run it. A balance between performance and cost efficiency is usually a sweet spot for customers.
Throughout the podcast, Gil goes into depth on optimization in theory and practice, as well as Azul's use of JIT compilers, as they play a key role in improving JVMs. There are always tradeoffs when using them: You want a JIT compiler to strike a balance between the work expended optimizing and the benefits that come from that work. Gil also mentions a new innovation Azul has been working on that moves JIT compilation to the cloud, where it can be applied to numerous JVMs simultaneously.
EPISODE LINKS
A Guide on Increasing Kafka Event Streaming Performance
Better Kafka Performance Without Changing Any Code
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

10/13/22 • 71:42

Apache Kafka® 3.3 is released! With over two years of development, KIP-833 marks KRaft as production ready for new AK 3.3 clusters only. On behalf of the Kafka community, Danica Fine (Senior Developer Advocate, Confluent) shares highlights of this release, with KIPs from Kafka Core, Kafka Streams, and Kafka Connect.
To reduce request overhead and simplify client-side code, KIP-709 extends the OffsetFetch API requests to accept multiple consumer group IDs. This update comprises three changes: extending the wire protocol, changing response handling, and enhancing the AdminClient to use the new protocol.
Log recovery is an important process that is triggered whenever a broker starts up after an unclean shutdown. Since there was no way to know the log recovery progress other than checking if the broker log is busy, KIP-831 adds metrics for the log recovery progress with `RemainingLogsToRecover` and `RemainingSegmentsToRecover` for each recovery thread. These metrics allow the admin to monitor the progress of the log recovery.
Additionally, updates on Kafka Core also include KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft. KIP-835: Monitor KRaft Controller Quorum Health. KIP-859: Add metadata log processing error-related metrics.
KIP-834 for Kafka Streams added the ability to pause and resume topologies. This feature lets you reduce resource usage when processing is not required, when modifying the logic of Kafka Streams applications, or when responding to operational issues. KIP-820, meanwhile, extends the KStream process with a new processor API.
Previously, KIP-98 added support for exactly-once delivery guarantees with Kafka and its Java clients. In the AK 3.3 release, KIP-618 brings exactly-once semantics support to Kafka Connect source connectors. To accomplish this, a number of new connector- and worker-level configurations have been introduced, including `exactly.once.source.support`, `transaction.boundary`, and more.
Image attribution: Apache ZooKeeper™: https://zookeeper.apache.org/ and Raft logo: https://raft.github.io/
EPISODE LINKS
See release notes for Apache Kafka 3.3.0 and Apache Kafka 3.3.1 for the full list of changes
Read the blog to learn more
Download Apache Kafka 3.3 and get started
Watch the video version of this podcast

10/3/22 • 06:42

How do you set data applications in motion by running stateful business logic on streaming data? Capturing key stream processing events and cumulative statistics that necessitate real-time data assessment, migration, and visualization remains a gap for event-driven systems and stream processing frameworks, according to Fred Patton (Developer Evangelist, Swim Inc.). In this episode, Fred explains streaming applications and how they contrast with stream processing applications. Fred and Kris also discuss how you can use Apache Kafka® and Swim for a real-time UI for streaming data.
Swim's technology facilitates relationships between streaming data from distributed sources and complex UIs, managing backpressure cumulatively, so that front ends don't get overwhelmed. They are focused on real-time, actionable insights, as opposed to those derived from historical data. Fred compares Swim's functionality to the speed layer in the Lambda architecture model, which is specifically concerned with serving real-time views. For this reason, when sending your data to Swim, it is common to also send a copy to a data warehouse that you control.
A web agent, a data entity in the Swim ecosystem, can be as small as a single cellphone or as large as a whole cellular network. Web agents communicate with one another as well as with their subscribers, and each one is a URI that can be called by a browser or the command line. Swim has been designed to instantaneously accommodate requests at widely varying levels of granularity, each of which demands a completely different volume of data. Thus, as you drill down, for example, from a city view on a map into a neighborhood view, the Swim system figures out which web agent is responsible for the view you are requesting, as well as the other web agents needed to show it.
Fred also shares an example where they work with a telephony company that requires real-time statuses for a network infrastructure with thousands of cell towers servicing millions of devices, along with a use case for a transportation company needing to transform raw edge data into actionable insights for its connected vehicle customers. Future plans for Swim include porting more functionality to the cloud, which will enable additional automation, so that, for example, a customer just has to provide database and Kafka cluster connections, and Swim can automatically build out infrastructure.
EPISODE LINKS
Swim Cellular Network Simulator
Continuous Intelligence - Streaming Apps That Are Always in Sync
Using Swim with Apache Kafka
Swim Developer
Watch the video version of this podcast
Kris Jenkins’ Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

10/3/22 • 39:10

Similar podcasts