Following the intense experience of the startup weekend, we had a day of cloud education, starting with a visit to the headquarters of Cloudera in Palo Alto. Cloudera primarily provides services and support for Apache Hadoop solutions but also contributes to the ongoing development of the Hadoop open source project.
Hadoop is an open source Apache project that is used for the processing and storage of very large data sets. It was created by Doug Cutting, the current Chief Architect at Cloudera, and Mike Cafarella. Doug Cutting named the project after a toy elephant that his son owned.
After being treated to some amazing bagels and cream cheese, Vala Dormiani, product manager, gave us a great introduction to Cloudera and humbled us with his background and experience: it turns out he finished high school at 14, graduated with a master’s from Stanford at 18, and now manages product strategy and acquisitions at Cloudera.
It was very interesting to hear about Cloudera’s business model and the fact that 70% of the solutions that they provide are open source. Cloudera’s value add is that they give support for Hadoop solutions and provide a layer of proprietary software for the management of those solutions. It was refreshing to hear that a large majority of the work that they do actually benefits the entire Hadoop community, not just Cloudera’s customers.
Vala also gave us insight into Cloudera’s overall acquisition strategy and pointed out that, for Cloudera, it doesn’t much matter whether a startup or company that they want to acquire is profitable or even making any money at all. The important aspects that Cloudera examines when deciding to acquire a company are the technology stack that the company runs on and whether Cloudera can use that technology to make the companies they support more productive.
After a tour of Cloudera’s premises (cool standing desks, guys!), Eli Collins gave us more insights into the big data phenomenon. He pointed out that, quite often, metadata dwarfs actual data and gave the example of an ATM transaction. For an ATM transaction, the actual transaction data is relatively small compared to the vast amount of other data captured. As well as CCTV footage for security, ATMs are now capturing the input timing when a customer enters their PIN, and this can be used to assist in identifying fraud.
Eli also spoke about the difficulties in deleting data and how this is currently a big problem that companies are facing. With data protection laws such as the ‘right to be forgotten’ in the European Union, which affords individuals the right to have their data deleted, many companies are struggling to figure out how to delete all the data they store about individuals. Eli pointed out that the problem is not just the large amount of data but also the large number of replicas of that data that typically exist. He told us that, on average, there will be 42 replicas of a particular database within a single company.
It was great to visit Cloudera and gain an insight into, and appreciation of, the work that Cloudera performs. The big data space keeps accelerating, and it will be exciting to see how Cloudera continues to innovate.