Remarks & reflections by Matt Turner, MarkLogic, CTO, Media & Publishing
Recorded at Copyright Clearance Center, Danvers, Mass.
For podcast release Monday, May 18, 2015
Matt Turner is the CTO, Media and Publishing at MarkLogic, where he develops strategy and solutions for the Media, Publishing and Information Provider markets and works with customers and prospects to create leading-edge information and digital content applications with MarkLogic’s Enterprise NoSQL database. Matt works closely with MarkLogic’s customers, including McGraw-Hill, Warner Bros., Condé Nast and LexisNexis.
Previously, Matt worked with Sony Music, creating community, identity and content delivery applications for artist sites and reviewing investment opportunities for Sony Music’s venture arm. Before that, Matt was at PC World, where he developed some of the industry’s first XML-based publishing systems.
We were fortunate enough to have worked with this industry for the last 10 years, and it’s really been a remarkable decade of innovation. A lot of people might look back on the last 10 years and say, all my print revenue’s gone. Gosh, everything’s so complicated. I have all these websites, and I have these mobile devices, and there’s so much happening in the information industries. I don’t know where it’s going to go.
I like to take it the other way around. Think of what we accomplished. Think of all the innovation that we’ve put into this industry to take it to where it is today. We have vibrant digital businesses delivering information in context. We have guys at Springer who do things like sell queries. We have dynamic access to information that’s spreading not just from the scientific world, but down into the commercial world. We do something with BBC that uses semantics and actually tailors information in exactly the same way that some of our most innovative STM publishers pioneered.
Think of all the things we’ve accomplished in the last 10 years in an industry under threat. Most industries, if their main line of business absolutely disappeared, would fragment, and you might not even see the industry again. What I see is that, actually, the information industries are in an incredibly strong place. We’re part of it, but I wanted to give that little bit of perspective, because it’s a remarkable thing to work in an industry where the challenges just keep coming, and we’re all actually able to meet them.
So I’m going to go into a little bit about that. The most obvious statement in the world, I hope – change is the only constant. But this is obvious now. If you were at an executive meeting even in 2007 or 2008, it may not have been about change management and agility. I think the other big sign of that decade of innovation is that not only have people built the new products and changed with the market, but now everybody is realizing we’re not done. This industry is still changing faster than you can imagine and in ways you can’t imagine, and everybody’s gearing up to be part of the change.
So change management, adopting change – there’s a great talk by Freddie Quek of Wiley. He and his team did this project that everybody thought was impossible. It’s mission impossible. If you guys know Wiley, and I’m sure you do, it’s a kind of traditional space. It’s not necessarily known for being incredibly agile. But that organization moved very quickly to take on the business, knowing it had to be agile to deliver. In three or four months, they did the mission impossible. He talks a little bit about why and what the technology was, but what he really says is that the thing you need is a team that is going to expect the unexpected, and you need to adopt that and manage to that, because that’s what’s going to keep happening. Change is the only constant.
What’s happening to adopt this change? Well, the shift – and I like to frame it in terms of what a publisher is and what an information provider is. The shift is sort of technical, but it’s really strategic. It’s really about business. In a publishing mindset, you’re going to make a product, and you’re going to build all the stuff that you need to make that product. You’re going to build a whole pipeline and workflow of bringing the information together, putting it in the right packaging and delivering it. It’s all thinking about the form. This is what the product is. I need to put my pipeline, my production system up to make that form. And then when you make a new product, you’ve got to do the same thing again.
I put this up here because when I talk to publishers, there isn’t anybody that doesn’t have some bit of this going on. Silos of information. When you want to do something like a mobile app, you struggle, because you have to put together that entire production pipeline all over again.
The shift to an information provider – and again, I want to say this – it’s not necessarily technology. It’s really strategic. The shift is your thinking. If your thinking is about curating the information that’s important for your domain – if your thinking is about collecting it and augmenting it, just for the sake of the information itself, and you’re not necessarily thinking about how it might be delivered, then I think you’re more of an information provider. And I think that’s what people are turning into.
I think in the industry, and maybe for what you folks work with, that mentality – it’s analogous to what I was talking about before about the data first. It’s not about the app. It’s about the data. It’s not about the publishing. It’s about the information. And that’s a big shift, because it means that you are putting your efforts into curation. You’re putting your efforts into what makes you a publisher versus the product.
There’s a little bit of a theme. I was actually a part of a panel talking about the role of the editor. So what’s an editor? Well, an editor is a jack of all trades. It turns out an editor is a curator of information for a specific domain. They have trust. And that’s maybe what you can boil down a role of a publisher to – trust and information on a specific domain. I don’t know. But certainly having the information-centric approach is very important.
So there’s a technology angle to this that we’re going to get into, but I wanted to get to the kind of business angle. What happens when you’re in this space? Right, you’re all done. Sorry. You’ve got a big butterfly diagram with all your information, and all your products are there, and everything’s fine, right? Not at all. (laughter) There’s still more challenges.
So this is the kind of thing we’re working on with a lot of our customers who’ve kind of made the shift. I mentioned a few of them. Certainly someone like Springer is in there. What they’re struggling with now is what’s on top. It’s not just about getting the data in. It’s about insights. It’s about getting meaning out and getting it out faster. Nobody ever wanted their information slower, and nobody ever wanted 10 paragraphs when they can just get two.
Data is complex. People don’t want to see complexity. That’s a big problem when you begin to bring together disparate datasets, because they need to be very instinctually understood. Not everybody has permission – this should be resonating with everybody in the room – not everybody has permission, as it turns out. So you need to respect, you need to understand rights and permissioning and rules. That has to be in this mix.
Don’t forget repurposing, by the way. This is still the hardest thing. We were just doing the ROI for something, or the other way around, the total cost of ownership against the benefits – however people want to talk about it these days – and once again they were talking about two problems with their CMS.
The first problem is that nobody can find anything. So everybody’s buying all of this stuff again. They’re literally going out and repurposing – this is more of a DAM problem, actually, than a CMS problem – but literally going out and spending money to bring in new assets that are already in the system but just not accessible to people. So there’s a barrier to repurposing.
The second problem is that people are using things in the system that they don’t have rights to, so they’re getting sued all the time. (laughter) Two ends of the same problem. Repurposing is the goal, but you can only do it if you have everything else. And then once you attract them, you must be reliable. It’s a constant for these kinds of new technologies to make sure that they are there and stable. It’s not the information provider’s job to do this kind of core operational work, down at the bottom level, making sure everything’s working. We want people to start working on the information itself. So that’s that last one.
So when you look at MarkLogic under the covers – I did promise some of the tech talk – it is an inverted index scheme. The reason that we do schema flexibility is that you give us something that describes your data in your schema – JSON, XML, or RDF, whatever it is – you’re giving us the instructions for us to create our inverted indexes that we can resolve all of our queries against.
That’s what MarkLogic is. It’s 100% indexing based on search technology, or I would say advancing search technology, because Chris came into that world right at the right time to do this architecture, knowing that memory was going to be really, really cheap, knowing that disk was going to get cheap, knowing that CPUs were going to get faster.
So MarkLogic, the way it uses indexes and the way that we optimize for today’s hardware that everybody takes for granted now, it looked a little extreme when we first started out. We would walk into places and say, yeah, we want 64 gigs of RAM in 2003, and they would look at us a little funny. Now this is what you get.
But our architecture is about search. It is not just one index. That’s the other big thing that Chris broke down. It’s tens of thousands of little indexes. And even when it’s fully optimized – MarkLogic’s optimization scheme is not to build a big, huge index that’s going to be really, really optimized, you take holes of it, and then you rewrite it. That’s kind of the older search approach. It’s actually tens of thousands of little indexes that are brought into memory, mixed, and actually queried up and matched in memory.
So 90%, or probably 100% for almost all systems, are very, very optimized indexes interacting with each other in memory. The reason I’m saying that is that we get to use some very cool indexes together. The core of MarkLogic is like a search engine. I know this word is in this document. The second core is that I know this structure is in this document. I know this word is in this structure. That’s all kind of where it builds.
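The three "cores" described here (word in document, structure in document, word in structure) can be illustrated with a minimal Python sketch. This is not MarkLogic's implementation: the function name `build_indexes`, the toy documents, and the choice to treat top-level field names as the "structure" are all assumptions made for illustration.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build three toy inverted indexes from {doc_id: {field: text}}."""
    word_idx = defaultdict(set)            # word -> docs containing the word
    struct_idx = defaultdict(set)          # field -> docs containing the field
    word_in_struct_idx = defaultdict(set)  # (field, word) -> docs
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            struct_idx[field].add(doc_id)
            for word in text.lower().split():
                word_idx[word].add(doc_id)
                word_in_struct_idx[(field, word)].add(doc_id)
    return word_idx, struct_idx, word_in_struct_idx

# Hypothetical documents with different shapes (no fixed schema).
docs = {
    1: {"title": "cloud search", "body": "elastic servers"},
    2: {"title": "elastic cloud", "abstract": "search indexes"},
    3: {"body": "search in memory"},
}
word_idx, struct_idx, word_in_struct_idx = build_indexes(docs)

# Intersecting the first two indexes only says "has the word somewhere
# and has a title somewhere"; the third index answers the precise question.
loose = word_idx["search"] & struct_idx["title"]
exact = word_in_struct_idx[("title", "search")]
```

In this toy data, `loose` contains documents 1 and 2 while `exact` contains only document 1, which is why the word-in-structure index is worth keeping alongside the other two.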
Then you can get to things like scalar indexes – name-value pairs that allow you to very quickly go through lists of things like dates and values. Those are in the mix. Then you can get to things like the geospatial. Geospatial’s a first-class citizen for us. We have geospatial indexing that sits next to the name-value pairs for data, which sit next to the content. And now we have a triple index.
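As a rough illustration of how a scalar index over dates can sit next to a word index and be combined with it, here is a small Python sketch. The `RangeIndex` class, the `pub_dates` and `word_postings` names, and the sample data are hypothetical; the sketch only shows the general idea of sorted values plus set intersection, not how MarkLogic stores its indexes.

```python
import bisect
from datetime import date

class RangeIndex:
    """Toy scalar index: values kept sorted so range lookups use binary search."""

    def __init__(self):
        self._values = []   # sorted scalar values
        self._doc_ids = []  # doc id stored at the same position as its value

    def add(self, value, doc_id):
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._doc_ids.insert(pos, doc_id)

    def between(self, low, high):
        """Return ids of documents whose value v satisfies low <= v <= high."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return set(self._doc_ids[start:end])

# Hypothetical publication dates for three documents.
pub_dates = RangeIndex()
pub_dates.add(date(2015, 5, 18), 1)
pub_dates.add(date(2014, 1, 10), 2)
pub_dates.add(date(2015, 3, 2), 3)

# Hypothetical word index entry: documents mentioning "semantics".
word_postings = {"semantics": {1, 2}}

# "Documents about semantics published in 2015" is an index intersection.
in_2015 = pub_dates.between(date(2015, 1, 1), date(2015, 12, 31))
hits = word_postings["semantics"] & in_2015
```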
So triples and semantics are a big area of focus for us. I have a couple slides on it. But I wanted to point out that it’s not like we added something on the side. We took the way the system works and enabled a new data type on the same indexing. What this means is that you can do amazing queries on the data – ad hoc queries, as long as the data’s there – that bridge together all of these different types of data and all these different search functionalities, and it will continue to scale out, and it’ll be performant.
You don’t have this moment with MarkLogic where it’s going to go think for a while. You’re either getting an answer from indexes very, very quickly or – we talked about this – if you want the whole result set, it might take a while to aggregate them, but you’re always getting index result set answers across every piece of functionality. It’s the same piece of technology for every feature, and it all comes back to the same thing.
Security works the same way for us. We have very, very high grade security. Security’s just a query. If you don’t have rights to something, you can’t see it, because the first thing that happens is that index lops off all those things you can’t see from your result set. Everything is a query. What am I missing? Collections, permissions – all of the things we do in the system are actually put into the system as indexes, and that’s how we work it. So it’s a one-stop shop that we’re very, very good at and we spend a lot of time on, but it’s really what’s behind the engine.
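The idea that "security's just a query" can be sketched as one more set intersection. In the Python below, `permission_index`, `visible_results`, and the role names are assumptions for illustration only and do not reflect MarkLogic's actual security model or API.

```python
def visible_results(query_hits, permission_index, user_roles):
    """Trim query hits to documents readable by at least one of the user's roles."""
    readable = set()
    for role in user_roles:
        readable |= permission_index.get(role, set())
    # Permissions behave like any other index: intersect with the result set.
    return query_hits & readable

# Hypothetical permission index: role -> documents that role may read.
permission_index = {
    "editor": {1, 2, 3},
    "subscriber": {1},
}

query_hits = {1, 3}  # documents matching the user's search

editor_view = visible_results(query_hits, permission_index, ["editor"])
subscriber_view = visible_results(query_hits, permission_index, ["subscriber"])
anonymous_view = visible_results(query_hits, permission_index, [])
```

A user with no matching roles simply gets an empty result set; documents they cannot read never appear in the answer at all.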
But the fact that you can bring up servers and move around data on somebody else’s infrastructure – it’s like the shift with power plants. I’m from New York. There are buildings in New York City where you can find where the electricity generation plant was. They used to make their own electricity in the building. The superintendent took care of the electricity. That happened. We’re at that moment where the power plant is beginning to be built down the street, and where people are starting to shut off the electricity generation plants inside their own buildings. I think it’s huge, and I wanted to point it out here because it’s one of the other really major trends: this flexibility from semantics and the elasticity in the cloud.
What we’ve built and why we’re going down this direction is a take on this notion of scalability that you get from NoSQL. The big problem with a relational database is that you really have nowhere to go but up. It’s very hard to take multiple servers and make them work off relational tables. They just weren’t thinking about that when they designed them.
In a NoSQL world, you take pieces of information and you distribute them on multiple machines, and all the machines work together. It’s groundbreaking. It works. We maintain transactions across that. You don’t have to give up your data integrity to actually make this work. That’s a big shift right now, where you’re seeing people move towards, oh, yeah, we’re going to have transactions, as well. So that’s important for your data.
But what’s really important is if you do that and you run, as MarkLogic does, with your configuration and all of your security and everything in the same kind of data model, now a transaction can be something like taking five new servers and putting them online, taking data from your five existing servers and spreading it onto those new ones, and all of a sudden going from five servers to 10 servers in a single transaction while the system’s running, and then, when you’re done, moving the data back and dropping those five. And the whole time, because the whole system is transactional, you’re serving customers.
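A simplified way to picture scaling from five servers to 10 is to compute, up front, which documents would have to move. The Python sketch below assumes a plain hash-modulo placement rule; `host_for`, `plan_rebalance`, and the host and document names are hypothetical, and this is not a description of how MarkLogic actually assigns or rebalances data.

```python
import hashlib

def host_for(doc_id, hosts):
    """Pick a host for a document using a stable hash of its id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]

def plan_rebalance(doc_ids, old_hosts, new_hosts):
    """Return {doc_id: (old_host, new_host)} for documents that must move."""
    moves = {}
    for doc_id in doc_ids:
        src = host_for(doc_id, old_hosts)
        dst = host_for(doc_id, new_hosts)
        if src != dst:
            moves[doc_id] = (src, dst)
    return moves

# Hypothetical cluster: five hosts growing to ten.
old_hosts = [f"host-{n}" for n in range(1, 6)]
new_hosts = [f"host-{n}" for n in range(1, 11)]
doc_ids = [f"doc-{n}" for n in range(1000)]

# The full plan is computed first, so a system could apply it as one unit
# of work while continuing to answer queries from the old layout.
moves = plan_rebalance(doc_ids, old_hosts, new_hosts)
```

Hash-modulo placement is the simplest rule to show, but it relocates a large share of documents when the host count changes; schemes such as consistent hashing exist to reduce that movement.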
We have customers – the best one is the BBC iPlayer. Does anybody go to Europe and know what the BBC iPlayer is? It’s very, very cool. It’s their over-the-top. So if you’re in the UK, you can watch any show on any device for free out of the BBC iPlayer, and it’s part of their license agreement. We run that system. It’s about metadata, it’s about search, it’s about query. It’s great. You can watch Doctor Who. The screenshots I get from it are awesome.
Every day, it expands and contracts. Because if you’re renting your infrastructure from the power plant, you don’t want to buy anything more than what you’re using. And every day, they bulk up, and every night, they slim down. And then they bulk up, and it’s like a lung. This is very, very forward-thinking in terms of what architecture and infrastructure is going to look like, but we know it’s going to be really important, and we think – we know, actually – this generation of databases has to handle it. The database has to be able to go to the cloud, because the data and the operations on the data are really what’s driving your whole orientation if you’re doing information first.
So the impact of these approaches is about the innovation, kind of tying it all together, that you need to get by. We’re in this stage with the information industries where there’s a lot going on. I have my list of five. We probably know that it’s going to be a different list very soon. I like to put my solve thing on it, because this is a talk. So we’ve solved the five things with this technology, which is great, but that’s the main point, right? We know that even if we get by with these five, there are going to be five more things in the future. Thanks.