
FLOSS Weekly 743, Transcript

Please be advised this transcript is AI-generated and may not be word for word. Time codes refer to the approximate times in the ad-supported version of the show.

Doc Searls (00:00:00):
This is Floss Weekly. I'm Doc Searls. This week, Shawn Powers and I talk with William Kwok of Apache SeaTunnel. Apache SeaTunnel is a way to make all your databases, your multiple databases, work together in a synchronized way, and it has many more implications than I, or Shawn, ever thought possible. And it turned into a very exciting show. And that is coming up next.

Leo Laporte (00:00:30):
Podcasts you love From people you trust. This is TWiT.

Doc Searls (00:00:38):
This is Floss Weekly, episode 743, recorded Wednesday, August 2nd, 2023. Data is surprisingly exciting.

Leo Laporte (00:00:51):
Listeners of this program get an ad-free version if they're members of Club TWiT. $7 a month gives you ad-free versions of all of our shows, plus membership in the Club TWiT Discord, a great clubhouse for TWiT listeners. And finally, the TWiT+ feed with shows like Stacey's Book Club, the Untitled Linux Show, the Giz Fiz and more. Go to twit.tv/clubtwit and thanks for your support.

Doc Searls (00:01:18):
Hello again, everyone everywhere. This is Floss Weekly. I am Doc Searls, and this week joined by Shawn Powers himself. Hey, and there he is. We're actually relatively close. We're one state apart. We're in adjacent states. I'm in Indiana, you're in Michigan, and you're in green and I'm in orange. For those of you not watching, is

Shawn Powers (00:01:41):
That orange or is that red? It looks

Doc Searls (00:01:43):
Red to me. It's orange. It's actually orange. Okay. It's a Firefox shirt that appeared with an older Firefox logo, one of several older Firefox logos.

Shawn Powers (00:01:52):
All right. Maybe it's because, maybe it's the contrast with the, with the fox, it makes it look red.

Doc Searls (00:01:56):
It could be. Well, I think every screen is wrong also, you know <laugh>, so that may be another thing <laugh>. There are ways of calibrating them. You know, I paid extra for one that said it was pre-calibrated, but I don't know, it doesn't look that orange there either. I look at the shirt and I look at the screen and go, ah, I dunno. So our guest today is William Kwok from Apache's SeaTunnel project, which I have compiled a whole lot of stuff on and don't understand well enough yet. So, have you done your homework on this thing?

Shawn Powers (00:02:30):
I mean, yes and no. Big data is big, right? I mean, there's a lot to figure out there. I have questions from my, in my youth, I don't know, in my yesteryears. You're still

Doc Searls (00:02:46):
In it, trust me.

Shawn Powers (00:02:47):
Okay. All right. Well, I was a database manager at a university, and we had this incredibly archaic database that we had to tie in with another, more modern SQL database. Basically all my questions are gonna be based around: would this have made my job easier back in the day? And I'm pretty sure the answer is yes. But that's the extent of my knowledge of big data, is that it's a big pain in the butt. So hopefully this makes it less painful. Yeah,

Doc Searls (00:03:15):
I'm intrigued by some of the claims I saw, or some of the stories I saw about it, saying that it's actually cheap to run and you can use it on smaller projects. I'm very interested in using it personally <laugh>. So that's an interesting thing. We could go in lots of different directions with this. So let me introduce our guest. It's William Kwok. I'm hoping I get the pronunciation right. He's an Apache Software Foundation member, a mentor of the Apache SeaTunnel project, and an Apache DolphinScheduler PMC member. DolphinScheduler is another topic of conversation today. He's the initiator of the ClickHouse Chinese community, a graduate of Peking University, used to work as big data director for the Lenovo Research Institute and general manager of the Wanda e-Commerce Data Department, and he's a visiting researcher at the Big Data Business Analysis Research Center of Renmin University. He has been committed to promoting the democratization of data capabilities and the development of open source projects globally. And so, with that as an inadequate introduction, you should

Shawn Powers (00:04:27):
Do, like, you should do like 12 truths and a lie <laugh> and like, and he has 27 toes on one foot. <Laugh>.

Doc Searls (00:04:37):
Hey, could be. So welcome, William. You're there?

William Kwok (00:04:42):
Hello. Hello, everyone. <Laugh>.

Doc Searls (00:04:44):
Yeah.

William Kwok (00:04:44):
Hello. Glad to see you. Yeah, you there? I'm William <laugh>.

Doc Searls (00:04:47):
So tell us a bit about Apache SeaTunnel and what led to it, 'cause I'm reading that DolphinScheduler is part of it. So give us kind of the overall thing, and we'll dive down into parts of it.

William Kwok (00:05:03):
Okay. So, Apache SeaTunnel is a project for big data integration. Actually, we can extract data from different databases such as MySQL, Oracle, DB2, or AWS Aurora, or even some SaaS sources or cloud databases. And then you can load the data into another database such as Hive, Hadoop, Redshift, or ClickHouse, or any database you want. So I think it is a very, very good open source project that can help you extract data from your database into another database. So that is the Apache SeaTunnel project <laugh>.
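
To make the extract-and-load idea concrete, here is a minimal Python sketch of the pattern William describes: pull rows out of a source database and load them into a sink. It uses SQLite on both ends purely for illustration; it is not SeaTunnel's API, where sources and sinks are configured as connectors rather than written as code like this.

```python
import sqlite3

def extract_rows(source_conn, table):
    """Pull every row from the source table (the 'extract' step)."""
    cur = source_conn.execute(f"SELECT id, name FROM {table}")
    return cur.fetchall()

def load_rows(sink_conn, table, rows):
    """Write the rows into the sink table (the 'load' step)."""
    sink_conn.executemany(
        f"INSERT OR REPLACE INTO {table} (id, name) VALUES (?, ?)", rows)
    sink_conn.commit()

if __name__ == "__main__":
    # Two throwaway SQLite databases stand in for, say, MySQL and Redshift.
    source = sqlite3.connect(":memory:")
    sink = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    source.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
    sink.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    load_rows(sink, "users", extract_rows(source, "users"))
    print(sink.execute("SELECT * FROM users").fetchall())  # [(1, 'Ada'), (2, 'Grace')]
```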

Doc Searls (00:05:58):
So I am not a database expert at all, but I do know that companies have always had a hard time integrating these things, because there are many different fields, many different variables, many different conventions involved in them, many different ways of querying them. And I wonder how you pull all of those together and don't end up with a monster of some sort that is too hard to get into. I mean, how does that look when you're done with it and you're ready to query it? Yeah,

William Kwok (00:06:31):
Actually, yeah. I ran into a problem before, because I just wanted to synchronize data from AWS Aurora to AWS Redshift. And I used to use a tool called AWS DMS. That's a tool AWS offered to me, but I don't think it was workable at that time. And also, I found that there are many, many databases, not only AWS Aurora; we also have MongoDB, we have Neo4j, other kinds of databases. And we also have many, many data warehouses such as Snowflake, as you know, and also <inaudible>, Teradata, and Oracle. So there are so many databases, and what I just wanted to do was synchronize data from one database to the other, and I could not find a very good tool to do that.

(00:07:42):
So we had to build a synchronization tool that we called Waterdrop; that's the former SeaTunnel project. And then we found that everyone needs a tool that synchronizes data from different data sources, perhaps Kafka, or perhaps MongoDB, or perhaps MySQL. So then we created an open source project called SeaTunnel that lets you synchronize data very easily. You can even use drag-and-drop visualization tools to do the synchronization. So that's why we created SeaTunnel. I think it's easy for people who are not from a technical background but want to synchronize data from one database to the other. You can even synchronize data from, for example, Notion to Google Docs <laugh>, if you want. Those are different data sources. So SeaTunnel will help you do the synchronization from any data source to any other data source <laugh>.

Shawn Powers (00:09:11):
So, okay. And this is, yeah, again, I wish that I would've known you 10 years ago, because we could have solved some major problems for my job. But when you talk about synchronizing two different database types, is it only a one-way synchronization? And if so, does that just mean you set up, like, two synchronizations, you know, one each way? Like, let's say I have a SQL database and I want to sync it to FoxPro. Again, databases aren't what I do right now, but you want any changes on the other side to also be reflected, you know, you just want to keep them in sync with each other. Is that, like, a one-process thing, or do you have to set up two jobs, and whichever happens first, you know what I mean? Is it two-way, or is it two one-ways, I guess, is the short version? Yeah.

William Kwok (00:10:05):
Very, very good question. For now, it is one-way synchronization, but we have two kinds of synchronization. One kind we call the batch job; that means you extract data and load data one time. The other we call real-time synchronization; that means you can read the data from MySQL, for example, and we call it CDC, that's change data capture, and then you can load the data into AWS Redshift or Snowflake in real time. So that is another type of data synchronization. It is still one-way synchronization, but we can do it in real time now, not only as batch synchronization. I think that's why many users use Apache SeaTunnel to do the synchronization. Yeah, so it's a very good question. And we don't have a FoxPro connector now, but people use Snowflake, and you can use SeaTunnel to extract data from Notion or Google Docs or Excel <laugh> and load it into Snowflake very easily <laugh>.
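
As a rough illustration of the two modes William contrasts, here is a small Python sketch: a batch job copies everything once, while a CDC-style job consumes a stream of change events as they arrive. The change stream here is just an in-memory list standing in for a database's change feed; real CDC, as discussed later in the show, reads the database's binlog.

```python
from collections.abc import Iterable
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    op: str          # "insert", "update", or "delete"
    key: int
    value: str | None

def batch_sync(source_rows: dict[int, str], target: dict[int, str]) -> None:
    """One-shot copy: read the whole source and write it to the target."""
    target.clear()
    target.update(source_rows)

def cdc_sync(events: Iterable[ChangeEvent], target: dict[int, str]) -> None:
    """Streaming copy: apply each change event to the target as it arrives."""
    for ev in events:
        if ev.op == "delete":
            target.pop(ev.key, None)
        else:                          # insert or update
            target[ev.key] = ev.value

target: dict[int, str] = {}
batch_sync({1: "Ada", 2: "Grace"}, target)              # initial full load
cdc_sync([ChangeEvent("insert", 3, "Linus"),
          ChangeEvent("update", 1, "Ada L."),
          ChangeEvent("delete", 2, None)], target)      # real-time tail
print(target)   # {1: 'Ada L.', 3: 'Linus'}
```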

Shawn Powers (00:11:41):
Okay. So they're basically one-way, but you can set up multiples. Is that a fair answer to that question? Yes. Okay.

William Kwok (00:11:50):
Yeah, you can have multiple ones.

Shawn Powers (00:11:52):
Yeah. Okay. And then in the back channel, Jonathan Bennett and I were both thinking the same sort of thing: when there are conflicts with that sort of, like, if there are two separate synchronizations, you know, passing in the night, how do you deal with conflicts? Is it timestamp based, or how do those conflicts get handled?

William Kwok (00:12:14):
Yeah, it is a very good question. Sometimes we just want to load the data into the target database, but there is some data there already. So we have a mode called save mode, and you can choose: you can replace the record, or just update the record, or just delete the record. So you have what we call the save mode to handle the issue you mentioned. And I think it is very easy for you to choose that kind of mode. So I think that's your answer <laugh>, that's my answer.
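
Here is a hedged sketch of the save-mode idea: when a record being loaded already exists in the target, a configurable policy decides what happens to it. The mode names below are made up for illustration; SeaTunnel's actual save-mode options and their semantics are defined by the project, not by this snippet.

```python
def apply_with_save_mode(target: dict, incoming: dict, mode: str = "replace") -> None:
    """Load `incoming` key/value records into `target`, resolving collisions per `mode`."""
    for key, value in incoming.items():
        if key not in target:
            target[key] = value                     # new record: always written
        elif mode == "replace":
            target[key] = value                     # overwrite the existing record wholesale
        elif mode == "update":
            target[key] = {**target[key], **value}  # merge new fields, keep the rest
        elif mode == "skip":
            pass                                    # leave the existing record untouched
        else:
            raise ValueError(f"unknown save mode: {mode}")

existing = {1: {"name": "Ada", "city": "London"}}
apply_with_save_mode(existing,
                     {1: {"name": "Ada Lovelace"}, 2: {"name": "Grace"}},
                     mode="update")
print(existing)  # {1: {'name': 'Ada Lovelace', 'city': 'London'}, 2: {'name': 'Grace'}}
```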

Shawn Powers (00:13:03):
Okay. I guess that makes sense, and it probably depends on the use case, like, you know, if changes are made on both databases and the hope is to get the data current, mm-hmm <affirmative>, in both, I assume timestamps must be at play, like, okay, which gets the preference for being stored. Doc, do you have anything? I have so many questions, but I don't wanna dominate all the,

Doc Searls (00:13:26):
All the, no, those are all good. What I'm not clear on, again, I'm not a database person, is what is the user looking at? I mean, if you're used to your Oracle database, your Mongo, or your MySQL, what are you looking at? Are you looking at the one your company normally uses for something, or are you looking at some other user interface that's unique to SeaTunnel?

William Kwok (00:13:53):
Yeah, for users of SeaTunnel, I think it is data engineers who want to handle the data, because in the old days they had to write a lot of code <laugh> to handle the data synchronization. There were no very good tools like SeaTunnel to do the synchronization between different databases. Now they just drag the job, or they can just write SQL-like code to do the synchronization with Apache SeaTunnel. So I think Apache SeaTunnel is for data engineers, especially for big data engineers, I think.

Doc Searls (00:14:44):
Okay. So you mentioned engineers are the ones kind of doing this integration, and I also noticed in your background information that there's this field called data integration, mm-hmm <affirmative>, and that's what this is in. And it's a new field. Who all is in that, and where does Apache SeaTunnel fit into it? Yeah,

William Kwok (00:15:09):
Actually, we call that ETL; that means extract, transform, and load. In the old days, in what I'd call the data warehouse period, we just extracted data from Oracle or DB2 and then loaded it into Teradata or DB2 Data Warehouse Edition. But nowadays we do it in different ways. We do the data capture in real time, we load it into Snowflake in real time, and we do data analytics in real time. And now we found another very interesting story: many developers try to use SeaTunnel to extract data from SaaS or from a database into a target we call ChatGPT. ChatGPT is very hot, and, you know, ChatGPT only knows the knowledge from the internet, but ChatGPT cannot chat with you about your data, because ChatGPT does not know your data, or ChatGPT

(00:16:32):
cannot know the data in your database. But SeaTunnel can extract data from more than 100 data sources, and people are developing a connector for ChatGPT. When they finish, I think ChatGPT can just have a connection to your database, and then ChatGPT can chat with you about your data, no matter whether the data is in your Google Docs or Notion, or in your Oracle or MySQL or MongoDB. So it is a very interesting idea <laugh>. Yeah.

Shawn Powers (00:17:28):
So, okay. The connector idea is really cool. But part of the idea of a connector, there's the source and then the sink, right, the source is where the data's coming from, the sink is where the data ends up. How much transformation can take place in that interim step? And I realize this is mostly, you know, a stream or real-time kind of thing, and obviously there has to be some translation, because these are different data structures and stuff, but can other transforms happen? I mean, can you do stuff to make the data, you know, not just a different format, but actually have some transforms take place in the interim? And I have a follow-up question, 'cause I think I know the answer there. But can stuff be done in the interim, or not really? Or is it just a structure change?

William Kwok (00:18:20):
So we can do some transforms between the source and the sink. First, because different databases have different data types, the SeaTunnel engine will transform the source data type into the sink data type automatically, so you don't need to do the data type transformation yourself. But if you have another requirement, for example, I want to change zero to male and one to female, you can use a transform in SeaTunnel to do such things. Actually, you can use SQL-like code to do the transform in the SeaTunnel engine.
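
William's gender-code example translates naturally into a SQL-like transform sitting between source and sink. The sketch below uses SQLite's CASE expression just to show the shape of such a transform; SeaTunnel's own transform syntax and configuration differ, so treat this purely as an illustration of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_users (id INTEGER, gender_code INTEGER)")
conn.executemany("INSERT INTO source_users VALUES (?, ?)", [(1, 0), (2, 1)])

# The "transform" step: a SQL-like projection that rewrites values on the way
# from the source rows to the rows that will be written to the sink.
transformed = conn.execute("""
    SELECT id,
           CASE gender_code WHEN 0 THEN 'male'
                            WHEN 1 THEN 'female'
                            ELSE 'unknown' END AS gender
    FROM source_users
""").fetchall()

print(transformed)   # [(1, 'male'), (2, 'female')]
```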

Shawn Powers (00:19:15):
Oh, that was the exact answer to my question then. So yeah, you can do transforms, and you thought of a much better example; I couldn't think of one off the top of my head. So that's perfect <laugh>. My follow-up question then leads right up to that. This is something the SeaTunnel engine does, but I noticed that SeaTunnel, as the entire project, can use the newer SeaTunnel engine, but can also use Flink or Spark. Uh-huh. I mean, Flink and Spark already have people who are like, why would I ever use this over that? You know, why would I ever wanna switch, Spark is everything I want. What does SeaTunnel add to that argument that makes it a better fit for this? And if it's so much better, why are there options still to use Spark and/or Flink?

William Kwok (00:20:09):
Yeah, it is a very, very good question <laugh>. At first, SeaTunnel supported the Flink and Spark engines, but we found that Spark and Flink are designed for what we call computation. Mm-hmm. They are not designed for synchronization. For example, in our use case, our users will have more than 1,000 tables, and they want to synchronize those 1,000 tables to another database. But if you use Spark or Flink, there will be a lot of what we call JDBC connections to handle that, and actually that's a very heavy load for the source database. So in SeaTunnel we created what we call a connection pool, so you can reuse the JDBC connections.

(00:21:15):
Something like that. And also, there's another feature we call schema evolution; that's a technical word. What does that mean? It means if you change the data model of your source table, we want exactly the same thing to happen to the target table, which we call the sink table. Flink and Spark are not designed for that, because Spark and Flink are designed for computation, for grouping, for aggregation, and things like that. So we had to design another synchronization engine; we call it Zeta, SeaTunnel Zeta, to do this kind of thing. And SeaTunnel, this project, is designed only for synchronization, so the performance will be better, because we do not care about computation or some complex functions. The performance will be better than Flink and Spark in the synchronization scenario. So that's why we have to use SeaTunnel <laugh>.
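
The connection-pool point can be pictured with a toy pool: instead of opening a fresh connection for each of the 1,000 tables, sync tasks borrow and return a small, fixed set of connections. This is a simplified sketch using SQLite, not SeaTunnel's (or any JDBC driver's) actual pooling code.

```python
import sqlite3
from queue import Queue

class TinyConnectionPool:
    """A fixed-size pool: many table-sync tasks share a few connections."""
    def __init__(self, db_path: str, size: int = 4):
        self._pool: Queue = Queue()
        for _ in range(size):
            self._pool.put(sqlite3.connect(db_path, check_same_thread=False))

    def acquire(self) -> sqlite3.Connection:
        return self._pool.get()          # blocks if all connections are busy

    def release(self, conn: sqlite3.Connection) -> None:
        self._pool.put(conn)             # hand it back instead of closing it

pool = TinyConnectionPool(":memory:", size=2)

def sync_table(table_name: str) -> None:
    conn = pool.acquire()
    try:
        # A real task would SELECT from the source here; we just touch the connection.
        conn.execute("SELECT 1")
        print(f"synced {table_name} on a pooled connection")
    finally:
        pool.release(conn)

for i in range(5):                       # five "tables", only two connections ever opened
    sync_table(f"table_{i}")
```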

Shawn Powers (00:22:38):
Okay, good. So my guess was that the ability to use Flink and/or Spark was just because there wasn't a highly developed SeaTunnel engine yet; they seemed an odd fit for what SeaTunnel was designed for. And when I saw that you could use Flink or Spark, I was a little confused, like, that's not really what those do. That's more like, you know, data extraction for running real-time, whatever. Mm-hmm <affirmative>. I was a little surprised. So ideally the SeaTunnel engine itself is the better use case, is that fair to say?

William Kwok (00:23:21):
Yeah. And also, Spark and Flink do not have as many source and sink connectors, you know, so

Shawn Powers (00:23:29):
Yeah. I didn't realize they did a sink at all. I mean, I thought it was, like, extraction for their own use. Yeah, so that makes sense <laugh>. And so the source and the sink, I don't know if they're plugins, I don't know the terminology, but mm-hmm <affirmative>, are those designed engine-specific? So, like, there is SeaTunnel engine source and sink code, but then if you want to use Spark or Flink, you have to use something specifically designed for those engines? Is that, so the bulk of development is towards the SeaTunnel engine

William Kwok (00:24:05):
Actually, in the SeaTunnel engine we call them SeaTunnel connectors. Sometimes we adapt them to Spark and Flink because some people use Spark or Flink. But if you use Spark or Flink, you will not have features such as schema evolution, or the better performance, or some other features. So SeaTunnel supports Flink and Spark, but if you want better performance and better functions, you can use the SeaTunnel engine itself. So the SeaTunnel engine will help you do more; it will offer you more

Shawn Powers (00:25:00):
Functions. Okay. I'm sorry, this is just an add-on that for some reason popped into my head and I forgot to ask earlier. Mm-hmm <affirmative>. Can the flow, whatever, the connection from one database to another, can it be one-to-many? Or is that another thing where you just set up another connection? You know, like, say we have our SQL database, and I want it sent, to use your examples, like, I want tables put into Google Docs, and I also want it set up to my Notion, you know, personal database, mm-hmm <affirmative>. Is it one-to-many, or do I just set up two different connections?

William Kwok (00:25:37):
Yeah, you can use one-to-many. Okay. We call it load once and sink many times. Oh, nice. So you load from a Google Doc or Kafka, then you can sink it into Snowflake, and into Redshift, and into AWS S3. So that's a feature of SeaTunnel.

Shawn Powers (00:26:04):
So you don't have to do three queries, or whatever the terminology would be. You don't have to pull the data three times. Okay. That's really nice. Yes,

William Kwok (00:26:11):
Yes. Yeah <laugh>. It's very interesting, because at first I thought synchronization was a very easy thing, but when we created Apache SeaTunnel, we found, oh, there are a lot of user scenarios, and they're quite different from Flink and Spark, because synchronization is another kind of user story, I think <laugh>.
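
The "load once, sink many times" idea mentioned just above can be sketched as a simple fan-out: the source is read a single time and each record is handed to every configured sink. The sinks below are just Python lists standing in for Snowflake, Redshift, and S3; SeaTunnel wires this up through connector configuration rather than code like this.

```python
from collections.abc import Callable, Iterable

def fan_out(read_source: Callable[[], Iterable[dict]],
            sinks: list[Callable[[dict], None]]) -> None:
    """Read the source exactly once and deliver every record to every sink."""
    for record in read_source():           # single pass over the source
        for write in sinks:
            write(record)                   # each sink gets its own copy

snowflake_rows, redshift_rows, s3_objects = [], [], []
fan_out(
    read_source=lambda: [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}],
    sinks=[snowflake_rows.append, redshift_rows.append, s3_objects.append],
)
print(len(snowflake_rows), len(redshift_rows), len(s3_objects))   # 2 2 2
```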

Shawn Powers (00:26:39):
Yeah, I never thought about Spark and Flink as, like, cramming data back into a database. So that's why I didn't understand how they were engines in the same way that the SeaTunnel engine would've been. So thank you for the clarification there. 'cause yeah, I feel the SeaTunnel engine, I almost say sea turtle every time, I don't know why <laugh>, but the SeaTunnel engine is definitely the most efficient way to go. So thank you <laugh>.

Doc Searls (00:27:05):
I think that's the next project,

William Kwok (00:27:06):
Sea Turtle <laugh>.

Doc Searls (00:27:08):
So the sea turtle goes through the tunnel when he needs to. I'm worried, 'cause I'm quickly reading about Spark and Flink, 'cause again, I'm new to this stuff, and mm-hmm <affirmative>, I see in a piece about Spark and Flink that those are also Apache projects, earlier Apache projects. Yeah, I guess. And those are called third- and fourth-generation data processing frameworks. Mm-hmm <affirmative>. Does that make SeaTunnel fifth generation, or something, or a different species altogether?

William Kwok (00:27:42):
I don't think it's the fifth generation <laugh>, because I think a fifth generation would be another story, for example, quantum computation, perhaps that's a fifth generation. But Spark and Flink, I think, are focusing on computation, just like you said, on data processing. And data processing is quite different from data synchronization. So actually I think our project goes in a different direction from Spark and Flink <laugh>, so they can do the computation <laugh>.

Doc Searls (00:28:32):
So I'm wondering, well, this may be a self-answered question given what you just said, that one is focused on processing and the other on synchronization, but do any of the same people that worked on Spark and Flink also work on SeaTunnel? Or is it a different set of experts and forms of expertise? I guess a related question is, where are you getting your developers, and what are the developers working on? Mm-hmm. You know, what did it come out of? I know that DolphinScheduler was involved in part of it. So that may be a way to transition into that question as well.

William Kwok (00:29:10):
Yeah, actually, for the developers, we call them contributors to the open source project. Some Flink and Spark contributors are in Apache SeaTunnel and contribute code, because they need SeaTunnel to synchronize data from a data source to the target database, and then they will use Flink or Spark to do the data processing. So actually, our users are using both SeaTunnel and Flink or Spark to do the data synchronization and the data processing, because when you do the data processing, you have to store the data in one database. So you have to extract data from different databases and load it into that one database, such as Snowflake, or Delta Lake for Databricks. That's what SeaTunnel is doing. And when they load the data into that one database, they can use Spark to do the calculation, or use Flink to do the real-time processing. So actually, we are not competitors <laugh>; their work comes after our project <laugh>.

Shawn Powers (00:30:54):
So, speaking of developers again, I keep thinking about myself, 'cause that's apparently how I think. But the database that I had to struggle with when I was the manager of the database department, I'll even call them out, it was software developed by Datatel, and the program was called Colleague. It was bought by Ellucian. But nonetheless, this database, which was designed in the eighties, was really weird, non-compatible with anything SQL; it had, like, multi-value fields. It was just a weird database. Okay. Mm-hmm <affirmative>, mm-hmm <affirmative>. And so there was this ridiculously complicated spaghetti of a database interaction that was just custom designed, because something like SeaTunnel didn't exist. Mm-hmm <affirmative>. My question is, if I were to tackle this today, are the connectors, I think that's the proper term, are the connectors modular?

(00:31:52):
Is that something where somebody could develop, you know, their own connector to a database and use that with SeaTunnel? Or is it not modular, where a person could develop something for their own ridiculous backend database that they want to, you know, synchronize with a more modern version? We literally had to run it in this, like, old VM. It was like Star Trek, you know, the first Star Trek movie, with Voyager in the middle of this enormous alien monstrosity. That's basically what this database thing was, a tiny VM with this old database, and we had to figure out a way to connect to it. It was miserable. So is it modular, to connect to things like that?

William Kwok (00:32:34):
Actually, everyone can contribute a connector to SeaTunnel. And we have a very good example. There's a database called Informix; it's a very old database. I think that database's age is greater than mine, I think <laugh>. Mm-hmm <affirmative>. Some people actually use that old database, and they want to synchronize the data from this old database to Oracle, and then they can use Oracle instead of Informix. SeaTunnel can do this kind of thing: someone developed a connector for Informix so people can do real-time synchronization from Informix to Oracle. Then the applications based on Informix, they can change the applications, they design other applications based on Oracle, and then they can migrate from Informix to Oracle. So I think that's an example for your question. Everyone can contribute a connector to SeaTunnel, and because it's an open source project, everyone else who has the same issue can use this connector to solve the problem we met. So everyone can solve that kind of problem.
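
To picture what a modular connector means, here is a toy plugin interface in Python: any class that can read rows can be registered as a source, and any class that can write rows as a sink, so someone stuck with an archaic backend could add their own. SeaTunnel's real connectors are Java classes built against its connector API; this is only a sketch of the extension idea, and every name below is invented for illustration.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator

class SourceConnector(ABC):
    @abstractmethod
    def read(self) -> Iterator[dict]: ...

class SinkConnector(ABC):
    @abstractmethod
    def write(self, record: dict) -> None: ...

SOURCES: dict[str, type[SourceConnector]] = {}

def register_source(name: str):
    """Decorator that makes a home-grown connector discoverable by name."""
    def wrap(cls: type[SourceConnector]) -> type[SourceConnector]:
        SOURCES[name] = cls
        return cls
    return wrap

@register_source("legacy-multivalue-db")          # e.g. that 1980s multi-value database
class LegacySource(SourceConnector):
    def read(self) -> Iterator[dict]:
        yield {"id": 1, "fields": ["a", "b", "c"]}  # pretend parse of a multi-value row

class ListSink(SinkConnector):
    def __init__(self):
        self.rows = []
    def write(self, record: dict) -> None:
        self.rows.append(record)

source, sink = SOURCES["legacy-multivalue-db"](), ListSink()
for row in source.read():
    sink.write(row)
print(sink.rows)
```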

Shawn Powers (00:34:20):
<Laugh>. Okay, that's awesome. And in your example, though, there's this older database that some applications connect directly to, and then they want to synchronize that with an Oracle database, so some other, newer frontend, whatever, could connect to it. That kind of leads back to my original question about two-way synchronization. Would there just be two setups? I mean, if somebody's working on the Oracle database and they make a change and they want that reflected back in that older database, would there be, like, two different connectors, or, not connectors, I'm sorry, but two different synchronizations taking place,

William Kwok (00:34:59):
Uh-huh <affirmative>.

Shawn Powers (00:35:02):
Or would it be read only?

William Kwok (00:35:03):
Yeah, we don't suggest doing that, because actually things work like this: people do the new things in the new database, and they do not insert or feed the data back to the old database. So if it is one-way synchronization, they only want the old data in the new database. And if you have two-way synchronization, I think that will confuse the system, because the system will not know whether a new record is from the old database or from the new database. I haven't met that kind of scenario before <laugh>. Okay. I guess I think one-way is good.

Shawn Powers (00:36:04):
Okay. I'm glad we talked about it, 'cause that was my original question. I mean, if it did two-way, how would it handle conflicts? Because, you know, I see a whole bunch of problems that could happen there. But again, in my use case, the two-way synchronization would've been great, because when we developed, you know, programs or interfaces, we did not want to hook into that old archaic database directly, because nobody who was alive even knew how to program it. But we also needed to get data back and forth from it. So it was tough.

William Kwok (00:36:40):
Yeah, it's a very good question. Your requirement is something we can consider in the future. And now I think there may be a solution where you have a third data source, like Kafka, and the two databases each do synchronization with Kafka. So perhaps that's a solution for your question. But I think there will be some solution for the requirement you mentioned <laugh>. Yeah.

Shawn Powers (00:37:17):
And for what

William Kwok (00:37:17):
I'll figure it out.

Shawn Powers (00:37:18):
For what it's worth, I'm not going back to that job, so I don't really need a solution <laugh> <laugh>. Ten years ago, Shawn needed a solution, and I left the job. So <laugh>

William Kwok (00:37:28):
<Laugh>, yeah. It is a very interesting case.

Doc Searls (00:37:33):
So I'm wondering again, as an outsider to this, a couple things. One is, there's the connector API, and that is somewhere, mm-hmm <affirmative>, and somebody using this synchronized database, like a master database, must call it something. Is it known as the SeaTunnel database? Do they say, I've got a sea tunnel here, we've got all these databases and I'm using my sea tunnel? Is that how, mm-hmm <affirmative>, is that how a company or a user or a customer, you don't really have customers, I suppose, but, you know, some company using SeaTunnel, what do they call that? Mm-hmm. And where does it live? I mean, it may live in a cloud off somewhere. I'm not sure

William Kwok (00:38:22):
Actually, many, many companies use SeaTunnel, mm-hmm <affirmative>. I think in America there's a bank called Goldman Sachs, I think, or another investment bank, and they use SeaTunnel to extract data from AWS Aurora to AWS Redshift. And many other companies, such as Bilibili, that's a video company just like YouTube <laugh>. And also we have vip.com; that's a company just like Amazon, but a smaller one. They just synchronize data from different databases. And now I think there are some users in, yeah, you can see there are many, many companies, I think they are internet companies. That's JP Morgan, sorry, it's not Goldman Sachs.

(00:39:32):
Mm-hmm <affirmative>, it's JP Morgan. And many internet companies, such as <inaudible>, that's TikTok, you know. And also there are many other companies in Japan, in China, in Singapore, and also some users in America. I think most of them are internet companies, because they use the cloud, and in the cloud there are many new cloud databases, and there was no very good open source project for cloud database synchronization. So they are using SeaTunnel to solve that kind of thing. And I know there's a lot of older software such as Talend and Informatica in North America and Europe, but those kinds of tools do not support cloud databases very well. So that's why many internet companies use SeaTunnel. And also, because SeaTunnel is open source, it's free, I think <laugh>. Yeah,

Doc Searls (00:40:50):
Yeah. In some ways it makes it harder to track, I suppose. You know, somebody could be using it and, you know, they're not paying customers, they're not necessarily in your base, but they might have developers on the case. I know Shawn has a question, but we need to take a break, and we'll be back right after this.

Shawn Powers (00:41:08):
Okay. So again, I just have so many questions, and I didn't think I was going to, but I've really enjoyed the conversation here. So let's say I'm using SeaTunnel, and I assume I could run SeaTunnel, like, in a Docker container or something. You know, I assume it's just an engine that's running, and then it talks to the different, you know, data sources and sinks. Is it all or nothing? Like, if I have a database, do I have to then keep the entire database in sync across the connector? Can I do piecemeal? Like, can I say, you know, only this part of the database I want synced with, you know, this sink over here? Like, I want my contact database kept in sync with my Notion address book or whatever, and just have that stay in sync. And if so, I assume yes, I mean, it would be silly if I could only do the entire database. But what triggers the sync? Is this something where SeaTunnel stays connected to the data source watching for a change? Or is there something that has to trigger SeaTunnel? You know, do you send data to it? Mm-hmm <affirmative>. How does that connection actually happen?

William Kwok (00:42:18):
Yeah, very good question <laugh>. There are two ways. One way we call the batch job. That means you have to trigger SeaTunnel to extract data from the database and load it into the other database, and you can use Airflow or DolphinScheduler to trigger that batch job. We call those job orchestration tools. So you can use an orchestration tool such as Airflow or DolphinScheduler to trigger that. Or, if you have a real-time job, we call it real-time data synchronization, then SeaTunnel will watch the original database. If there are some new records, or you change some data records, SeaTunnel will know that and will synchronize the data into your target database in no more than one second. So there are two ways for SeaTunnel. I think that's for your question.
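
The two modes William separates here differ mainly in who owns the loop: an orchestrator such as Airflow or DolphinScheduler fires the batch job on a schedule, while the real-time mode is a long-running process that reacts to changes itself. The Python sketch below only illustrates that split; the scheduler and the change feed are stubs, not real Airflow or DolphinScheduler calls.

```python
import time
from collections.abc import Callable

def run_batch_job() -> None:
    """What an orchestrator (Airflow, DolphinScheduler) would trigger on a schedule."""
    print("batch: extracted source, loaded target, exiting")

def run_realtime_job(poll_changes: Callable[[], list[dict]], iterations: int = 3) -> None:
    """A long-running worker that keeps watching for changes and applies them."""
    for _ in range(iterations):            # a real job would loop indefinitely
        for change in poll_changes():
            print(f"realtime: applied {change}")
        time.sleep(0.1)                    # tiny pause standing in for sub-second latency

# Orchestrated mode: some external scheduler calls this once per schedule tick.
run_batch_job()

# Real-time mode: the job itself owns the loop; here a stub feeds it one change.
pending = [[{"op": "insert", "id": 3}], [], []]
run_realtime_job(lambda: pending.pop(0) if pending else [])
```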

Shawn Powers (00:43:34):
Yeah. Is there a downside to having SeaTunnel, you know, take out that extra tool, like DolphinScheduler or Airflow? Is there a downside to having SeaTunnel itself monitor directly for that real-time change? I mean, is that a performance hit on the source database? Or is it just, if you're already using something like DolphinScheduler to do these batch changes, you just want to do that? Is there a best practice, or is it all just dependent on what kind of data you're working with?

William Kwok (00:44:06):
Yeah. If you don't need to get the real-time data, you don't need to use real-time mode, because I think the real-time mode will affect the original database's performance a little. But it's very tiny, because we do not read the database tables directly. We read what we call the database binlog. The binlog is a file; it is not the database itself. So we read the binlog, or we read the redo log. That will affect some <inaudible>, but I think it will not affect the database very much.
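
The low-impact trick William describes, reading the binlog or redo log file instead of querying the tables, looks roughly like this. The "binlog" here is just an append-only file of JSON change records; real MySQL binlogs are a binary format read with a replication client, so this is only the shape of the idea.

```python
import json
import tempfile

def tail_binlog(path: str, offset: int) -> tuple[list[dict], int]:
    """Read change events appended after `offset`, without ever querying the tables."""
    with open(path, "r", encoding="utf-8") as f:
        f.seek(offset)
        events = [json.loads(line) for line in f if line.strip()]
        return events, f.tell()

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as log:
    # The database appends to its change log; the sync tool only ever reads this file.
    log.write(json.dumps({"op": "insert", "table": "users", "id": 1}) + "\n")
    log.write(json.dumps({"op": "update", "table": "users", "id": 1}) + "\n")
    binlog_path = log.name

events, new_offset = tail_binlog(binlog_path, offset=0)
for ev in events:
    print("replicate:", ev)          # apply each change to the target database
print("resume next read from byte", new_offset)
```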

Shawn Powers (00:44:57):
So it doesn't

William Kwok (00:44:58):
Actually have a Yeah,

Shawn Powers (00:44:59):
Yeah. It doesn't connect to the database unless it detects something that it would want from the logs. Okay, that makes sense. Yeah. And then, I guess, with something like DolphinScheduler, there would be no connection at all. And would Dolphin actually pass the data, or would it just trigger SeaTunnel to then come in and grab the data?

William Kwok (00:45:21):
Yeah, DolphinScheduler is just an orchestration tool; it just triggers SeaTunnel. Okay. And it can also trigger Spark, or other things, or EMR on AWS. It's an orchestration tool. Yeah.

Shawn Powers (00:45:37):
Okay. Okay. All right. I guess I, I see why you could do it two different ways, and probably the performance isn't drastically better one way or the other. Yeah.

William Kwok (00:45:47):
Okay. If you have a lot of data, you have to extract the data in batch mode, because the data is too big. <Laugh>

Shawn Powers (00:45:56):
Yeah, that

William Kwok (00:45:57):
Makes sense too. So if your data is not so big, you can do it in real time mode. Yeah,

Shawn Powers (00:46:01):
Yeah. If you're watching one field, you wouldn't have to do like a batch for, you know, <laugh> Oh, one name changed, you know? Yeah. Gotcha.

William Kwok (00:46:08):
Yes. Okay. Yeah,

Doc Searls (00:46:11):
There's a couple questions in the back channel. After somebody says they wanna do a show only with cool-sounding names, and SeaTunnel's on that list: is there a chance for errors with the sync that fast? And did they already talk about rollback of transactions? Hmm. Especially if someone has some files open at the time the sync happens. So can you address those? That's

William Kwok (00:46:40):
Yeah <laugh>, it is also a very good question. Actually, SeaTunnel can be deployed on one server or in cluster mode. And we have a global snapshot technique. That means if some error happens, the whole data synchronization process will roll back to the last global snapshot; we call it a checkpoint. So if some error happens, we will roll back to the last checkpoint. You don't need to worry that you will lose some data, because we make many, many checkpoints when you do the synchronization, and you can define those checkpoints yourself. For example, you can define that every 30 seconds you make one checkpoint, or every 100 records you make a checkpoint. So the data will not be missed during database synchronization, because we have what we call distributed checkpoints.

(00:48:08):
And we have a rollback function to ensure that data will not be lost. Yeah. So <laugh>, it is a very good question. And we also have some other functions to ensure that the data will be read again and synchronized from the last checkpoint onward for the future data synchronization. So SeaTunnel will ensure that the data is synchronized from the source to the target exactly once; we call it exactly-once. It will not double your data, and your data will not be lost
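
A stripped-down version of the checkpoint-and-replay idea: the job records its position every N records, and after a failure it restarts from the last checkpoint; because the target writes are keyed (idempotent), replaying a few records doesn't duplicate data. Real distributed checkpointing coordinates many workers and the sink's transactions, which this single-process sketch glosses over.

```python
def sync_with_checkpoints(records: list[dict], target: dict, state: dict,
                          checkpoint_every: int = 3, fail_at: int | None = None) -> None:
    """Apply records starting at the last checkpoint; crash at `fail_at` to simulate an error."""
    start = state.get("checkpoint", 0)
    for i in range(start, len(records)):
        if i == fail_at:
            raise RuntimeError(f"simulated crash at record {i}")
        rec = records[i]
        target[rec["id"]] = rec["value"]          # keyed write: replaying it is harmless
        if (i + 1) % checkpoint_every == 0:
            state["checkpoint"] = i + 1           # durable progress marker

records = [{"id": n, "value": f"v{n}"} for n in range(8)]
target, state = {}, {}

try:
    sync_with_checkpoints(records, target, state, fail_at=5)
except RuntimeError as err:
    print(err, "- restarting from checkpoint", state["checkpoint"])

sync_with_checkpoints(records, target, state)     # records 3 and 4 replay, nothing doubles
print(sorted(target) == list(range(8)))           # True: every record present exactly once
```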

Shawn Powers (00:48:59):
Unless you're Shawn and you're trying to sync data two ways at the same time, in which case <laugh> all bets are off <laugh>.

William Kwok (00:49:05):
<Laugh>, yeah. <Laugh>.

Doc Searls (00:49:08):
So, I have a question. There was a piece written by somebody with Apache DolphinScheduler that has a very provocative headline: train your own private ChatGPT model for the cost of a Starbucks coffee. And I'll read the paragraph that opens it: you can own your own trained open source, large-scale model. It can be fine-tuned according to different training data directions to enhance various skills such as medical, programming, stock trading, love advice, making your large-scale model more understanding of you. Let's try training. And then it goes into how you could do that. And I have a particular question about that, 'cause I want my own, mm-hmm <affirmative>, I want my own chatbot on my own data in my house, in my household: all my property, all my health records, all my financial records, all my contacts and calendars, my travels, where I've been, you know, like, where was I when I had that medical thing that happened?

(00:50:14):
You know, and what doctors did I see about that? I mean, I'm just making that up, but those are the kinds of things, I think, when the likes of ChatGPT become relevant to individuals. I think we're sort of at a moment now where we will start having our own databases in our homes that are not relevant to the world. We think about these easily for companies, 'cause companies have gigantic databases in most cases and wish to know a lot about themselves, and it would apply there, and that's probably where most of the uses are going to be early on. But I was thinking, we had a company on here a few weeks ago talking about control planes. It was called Crossplane, and it was about doing, you know, multiple control planes within a company. And we have control planes in our own lives. And I'm thinking, hmm, this seems relevant to me, especially when you're saying it's cheap. Mm-hmm. So do you have any thoughts about that?

William Kwok (00:51:15):
Yeah, yeah. Actually, we call it a private LLM, because what you need is only a GPU better than a 3090. Then you can use DolphinScheduler to train your own ChatGPT with that GPU; I think it takes about 24 hours. You can train your own ChatGPT. What you need to do is prepare the data and choose an open source large model, for example Llama or Llama 2, and DolphinScheduler will download Llama or Llama 2 automatically. And it will help you train with your personal data on your own personal computer, a laptop or a desktop, to train your own ChatGPT. You can just use DolphinScheduler to train the whole private model, and you can use the private model at your home. You don't need to worry about data leaks or your personal data being uploaded somewhere else, because you can train your data on your own laptop <laugh>. So it is a very interesting one <laugh>,
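
The workflow William describes, download an open model, prepare your own data, and fine-tune locally on a single consumer GPU, can be outlined as a few orchestrated steps. Every function below is a hypothetical placeholder: DolphinScheduler defines tasks in its own workflow format, and the real fine-tuning would use actual training tooling, so this sketch only shows the shape of the pipeline, not any real API.

```python
def download_base_model(name: str) -> str:
    # Placeholder: a real step would fetch the open-source weights (e.g. a Llama 2 variant).
    print(f"downloading base model: {name}")
    return f"/models/{name}"

def prepare_training_data(raw_paths: list[str]) -> str:
    # Placeholder: convert personal documents (Word, PowerPoint, notes) into a training set.
    print(f"preparing {len(raw_paths)} personal documents")
    return "/data/personal_dataset.jsonl"

def fine_tune(model_path: str, dataset_path: str, gpu: str) -> str:
    # Placeholder: roughly a day of local fine-tuning on a single consumer GPU.
    print(f"fine-tuning {model_path} on {dataset_path} using {gpu}")
    return "/models/private-chat-model"

def run_private_llm_workflow() -> str:
    """The kind of DAG an orchestrator like DolphinScheduler would run, as plain Python."""
    model = download_base_model("llama-2-7b")
    data = prepare_training_data(["notes.docx", "slides.pptx", "journal.txt"])
    return fine_tune(model, data, gpu="single consumer GPU (RTX 3090 class)")

print("private model saved to", run_private_llm_workflow())
```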

Doc Searls (00:53:00):
This is something I very much want. We were at the distributed web camp a few weeks ago, and I've mentioned this on earlier shows, but it's worth bringing up again. One of the hackers among us took everything I'd written for Linux Journal, where Shawn and I both used to work, over 24 years; I wrote many, many articles. And had it query those, in other words, it trained on those, I don't know what model was used, whether it was Llama or ChatGPT, mm-hmm <affirmative>, but there was one, and it gave good answers, and it put them in the form of a haiku as well. It gave you, like, the complete answer and then this haiku, and it was remarkably right and helpful. And that was just one thing. I mean, one could actually look back through all of one's emails, you know, and, you know, how many times did I talk to Shawn?

(00:53:50):
You know, when did this come up? I'd love to have it for, we've done this show for 16 years; it'd be really great to go back and say, hey <laugh>, you know, when did we last talk to William? When did we last talk to so-and-so? Who would we wanna have back? And, you know, what questions were left unanswered? I mean, there are lots of possibilities there. And until I started learning about this, I wasn't thinking about how possible this was at a relatively low cost. So that's intriguing.

William Kwok (00:54:27):
Yeah. Actually, I think DolphinScheduler can let almost everyone have their own private ChatGPT, but I think the hard part is preparing the data. Yeah, I think the data preparation is hard. So some people are creating a connector in SeaTunnel for preparing data; for example, you can extract data from your PowerPoint or from your Word documents, and then do the data preparation for Llama. And then you can train Llama with DolphinScheduler, and then you can have your own private ChatGPT. But I think it is hard <laugh>. If they succeed, I think everyone will be happy, because everyone will be able to do the data preparation very easily. For now, they're working on that kind of project, doing those connectors, I think <laugh>

Doc Searls (00:55:37):
We are getting down toward the end of our hour, and there are so many, I mean, I took so many notes in prep for this thing. But there are two questions we tend to end with. One is, are there any questions we haven't asked so far that you'd really like us to ask you, so we can cover that before we close out?

William Kwok (00:56:02):
Hmm. I think there's a question from the PMC. I don't know whether people like to use SQL-like code to do the data synchronization, or whether they want to use a UI, just like drag and drop <laugh>, to do the synchronization. So that's a question for the audience, actually, I think, mm-hmm <affirmative>. If they have the answers, they can tell me <laugh>,

Doc Searls (00:56:39):
Well, we have a dozen thousand of those or so, so <laugh> maybe one of them will. One last thing, which actually Jonathan Bennett, another co-host who we mentioned earlier and who has been in our chat, often brings up, which is: what is the weirdest use you've seen so far <laugh>? What's really unusual, uh-huh, or stands out as an exception?

William Kwok (00:57:05):
Actually, I think the weirdest thing I've met in this project is this: I thought the connectors would grow slowly, because when we entered the Apache Incubator, we only had 20 connectors, and we had only doubled the connectors in one year, from 20 to 40, in our company. But actually now it has more than 100 connectors. And I think the power of the open source community is more powerful than I thought. So I think it is interesting, because I never thought the connectors would grow so much, and I never thought there would be so many users who would contribute their connectors to this open source project. So I think that's the weird one for me <laugh>. Yeah.

Doc Searls (00:58:30):
Well, that's great. And given that it's growing that fast, and given that we will have SeaTunnel running in a year or two <laugh>, or less, maybe, if we follow that path, we'll be able to see how far it's gone and have you back on a future show. That'll be great.

William Kwok (00:58:51):
Yeah. So our goal is to connect to every data source in the world <laugh>. So I think it will be <crosstalk>. That'll

Doc Searls (00:58:59):
Be great.

William Kwok (00:59:00):
<Laugh>. Yeah.

Doc Searls (00:59:02):
Get 'em all. Thanks so much, William, for being on the show. Yeah, thank you. And we will have to have you back. Yeah.

William Kwok (00:59:09):
Thank you everyone. Thank you.

Doc Searls (00:59:10):
Thank you. So, Shawn

Shawn Powers (00:59:14):
I surprised myself with how many questions I had <laugh> and how much I wished that I had met William 10 years ago. Yeah,

Doc Searls (00:59:22):
Yeah, I know. And I didn't know that Ant, who was <laugh> producing the show, was also involved and was actually glad to get away from databases at some point.

Shawn Powers (00:59:31):
Yeah. You can tell he worked in databases. 'cause He's generally angry. No kidding. <Laugh>,

Doc Searls (00:59:38):
He's

Ant Pruitt (00:59:38):
Got a pretty stroke. That was a great conversation, but I gotta tell you, I was sweating a little bit over here, <laugh> little steam. Oh boy, me too. I,

Shawn Powers (00:59:47):
Yeah, I expect to have nightmares about databases tonight when I drift off to sleep <laugh>. But I'm still fascinated, though. And yeah, I mean, big data's not going anywhere, right? Our whole world is data. So the idea of being able to connect different data sources, especially crazy, wacky ones, and it sounds like the number of connectors is just, you know, off the charts. It is exciting. It's surprising that I'm this interested and excited about database connections, right? I mean <laugh>, on the tin, that doesn't sound like an exciting podcast topic, but it's pretty cool when you think about what's actually happening behind the scenes.

Doc Searls (01:00:26):
Well, I find it quite exciting, and of course I don't have any PTSD from dealing with databases <laugh>, other than, you know, my own. I mean, I just, wait a

Ant Pruitt (01:00:37):
Minute, minute. You didn't spend a whole weekend worrying about ODBC and JDBC and, and just, ugh, no.

Doc Searls (01:00:44):
Every

Ant Pruitt (01:00:45):
Day, Saturday, you're having a great time, and all of a sudden something just crashes and you get that phone call. Yeah, you really didn't have that problem, Doc? Never.

Doc Searls (01:00:56):
No, I didn't. My problem has always been that I have too many drives lying around that have my photography on them. And so that's why I now have this eight-terabyte laptop that's filled completely <laugh>, but I don't see it as a database. It is, in a way, I mean, I just know it's a directory I have to navigate. It's a different thing.

Shawn Powers (01:01:18):
If you write a connector, you could put that into any sort of database you want. You could, because now I know a tool. It's a real,

Doc Searls (01:01:23):
It's

Shawn Powers (01:01:23):
Sea turtle. No, it's not sea

Doc Searls (01:01:24):
Turtle. I mean, it's an interesting thought. I mean, we never get rid of databases. After all, we are digital animals now, and our corporations are digital, our governments are digital. All of it's digital. So

Ant Pruitt (01:01:39):
Now, don't get me wrong, I'm totally for databases and the advantages that they bring to day-to-day life, especially you talking about being able to build a large language model and being able to query it just to get stuff that you want.

Doc Searls (01:01:53):
I want, I mean, that's the stuff I want. I mean, I, yeah. You know,

Ant Pruitt (01:01:56):
I,

Doc Searls (01:01:56):
I'm,

Ant Pruitt (01:01:57):
I'm all for that. Just, you know, I love data. I just don't miss the management side of it. That's all I'm saying.

Shawn Powers (01:02:06):
Yeah, same. And that was the thing. I wasn't the database person; I was the manager. So not only was it that when things went wrong I had to fix it, but I also wasn't the person who could fix it. So yeah, it's stressful. Databases can be stressful.

Doc Searls (01:02:24):
On another podcast I was listening to, 'cause, oh, you listen to podcasts now, the host there, I forget who it was, they were talking about a chief digital officer. Like, some companies have a chief digital officer. And this guy said, well, isn't that like having a chief electricity officer now <laugh>, because it's all digital. There's no other kind of thing. So I wanna plug next week: we have Damien Riehl on, spelled R-I-E-H-L. He is this really smart lawyer, open source friendly, and he's big in the music business. I saw what he wrote, it was really interesting stuff, and I dragged him onto another list I'm on, where he is just killing it. He's really smart. And Catherine, who will be the co-host, and I have had him on our own little podcast as well. So he's a good guy. He's gonna be up next week. Catherine's gonna co-host. So I wanna plug that. If you wanna ask legal questions, all kinds of topics, he's your guy. He's a good guy. So what do you got to plug there, Shawn?

Shawn Powers (01:03:37):
Just, yeah, nothing special, just my YouTube channel, youtube.com, mm-hmm <affirmative>, slash Shawn Powers with a zero. Eh, there it is,

Doc Searls (01:03:44):
Yeah. So are you still doing the cartoon? And we got the book out.

Shawn Powers (01:03:47):
I haven't had a chance to, and it still bums me out. I just can't seem to get enough hours in a day since I had COVID. I don't think I have long COVID, but for some reason I just need more sleep now. And the time that I spend sleeping is when I used to draw my comic every morning. So maybe someday it'll happen. I hope so. But I haven't drawn my comic in a while.

Doc Searls (01:04:12):
I had the lamest COVID anybody had, which is, I had sniffles. I asked my daughter if she had a test, I tested positive, she banished me to the basement, but I had no other symptoms. That was the end of it <laugh>; nothing happened. So I'm jealous. I know there are so many bad stories, but mine wasn't one of 'em. Anyhow, again, see you next week with Damien Riehl. I'm Doc Searls. This is Floss Weekly. We'll see you then.

Rod Pyle (01:04:38):
Hey, I'm Rod Pyle, editor-in-chief of Ad Astra magazine. And each week I join with my co-host to bring you This Week in Space, the latest and greatest news from the final frontier. We talk to NASA chiefs, space scientists, engineers, educators, and artists. And sometimes we just shoot the breeze over what's hot and what's not in space, books, and TV. And we do it all for you, our fellow true believers. So whether you're an armchair adventurer or waiting for your turn to grab a slot in Elon's Mars rocket, join us on This Week in Space and be part of the greatest adventure of all time.
