Codio Co-Founder, Freddy May, chats with Dr. Gregory S. DeLozier, Adjunct Professor of Computer Science at Kent State University, about Big Data and teaching computer science.
Big Data
Twenty years ago, Big Data meant you had a huge amount of data in a MySQL, Oracle or SQL Server relational database. You would use SQL queries to perform massive joins that could take hours or even days.
It has evolved. Google launched MapReduce, which was then taken to the next stage with Hadoop. Hadoop combines a distributed file system with distributed MapReduce processing. As a result, what used to take days with an SQL query can now take minutes. Real-time data analysis is a reality, enabling critical, real-time business and scientific decisions.
Over time, server memory became cheap and systems like Spark emerged, which extended the distributed Hadoop approach and used memory rather than file systems for processing. As a result, processing times fell again, in some cases by orders of magnitude. Meanwhile, businesses, governments, science and new technology areas like the Internet of Things now generate massive amounts of data. Trillions is the new billions, underscoring the need for these new technologies.
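The map, shuffle and reduce steps behind MapReduce can be sketched in a few lines of plain Python. This is only an illustration of the programming model on a single machine, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data tools"]))
```

Because each document is mapped independently and each key is reduced independently, the work can be spread over many machines, which is where the speed-up over a single large SQL join comes from.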
Big Data in Academia
Computer Science students are interested in Big Data with a view to the jobs market. They want to broaden their range of skills in order to increase employability. They are interested in the process of Big Data analysis itself and how it can be applied to a broad range of problems.
However, there are also geologists, epidemiologists, statisticians and other students who do not specialize in Computer Science but want to learn how to use Big Data tools in order to extract meaning from vast amounts of data.
In both cases, understanding big data technology requires considerably more technical skill than the SQL query of old. With the correct big data solution architecture, a processing task may run an order of magnitude faster or more. This can be the difference between success and failure, or between making an informed decision and making a guess.
It is for this reason that we are seeing Computer Science departments starting to embrace Big Data. Currently it is often taught at the post-graduate level, but Codio allows lecturers to start incorporating it at the undergraduate level as well.
How Codio helps
Teaching big data to a class of students presents several infrastructural challenges. Big data has distributed processing resources at its heart. This means that several servers are needed to provide the realistic environment that students need in order to play with a proper system.
One approach is for a distributed system to be set up in a computer science laboratory. This presents the lecturer and students with a real problem. Such an environment is a single entity serving many students. The administrator will not want students to play around with the configuration in case they destabilise the entire system for everyone else. This restriction means that students never get their hands dirty and so do not build the overall experience they need.
Dr. DeLozier had investigated many solutions to this problem, including CS laboratory environments, cloud-based platforms such as AWS, Azure and Linode, and students using their own laptops. However, these options ultimately proved to be both expensive and cumbersome. They also lacked the many academic features offered by Codio that are designed to keep costs low, give lecturers instant access to student projects and let them constructively monitor student progress.
Dr. DeLozier quickly became impressed with Codio’s Box technology and the way it allowed his students to spin up a distributed system, or even variations of these systems, without having to purchase dedicated cloud servers or administer systems in-house. Codio’s boxes automatically deactivate when not in use, so Codio effectively charges for a single user even if they have 200 different configurations or projects.
He found that, using Codio boxes, he could install the base Hadoop software (and other supporting systems, like Java or web servers) and essentially create small, single-node Hadoop machines. These machines were capable enough to support HDFS, Hive, Pig, and other major components of the base Hadoop ecosystem. Having each student install and configure what amounted to their own small microcluster was essential to creating a deeper understanding of the basics of Hadoop and its theory of operation. Compared to the usual Hadoop-as-a-black-box method of giving access to a shared cluster, this experience was very well received by students learning the mechanisms of big data.
Dr. DeLozier compares the difference to that between taking an auto mechanics class, where students take engines apart, and learning to drive a car. Experience using a packaged environment is important, but it doesn’t compare to getting into the internals. Codio allows unrestricted internal access to every student, and if they break something, they can just start over.
Many students chose to stay with Codio even when a campus cluster was provided. While processing was not as fast, the convenience and reliability of the Codio box was such that many of the students actually chose that as the tool with which to deliver their class projects.
Via Codio’s classroom management dashboard, he can also instantly access a student’s projects and assess configurations, run jobs, diagnose issues and even grade each student’s project. There is no need for him to know URLs and passwords for remote systems.
Codio will shortly be releasing two important new features based on Dr. DeLozier’s requests for extensions to suit the outer limits of teaching big data.
Bigger Boxes - Codio’s boxes have a default CPU and RAM allocation that is adequate for almost all coding projects. However, some applications require higher levels of RAM and so Codio will shortly be offering lecturers the ability to configure larger allocations.
Time To Live Boxes - standard Codio boxes shut down automatically a few minutes after the student closes the project. It is this capability that ensures Codio can afford to offer users advanced server resources at a very low cost. However, for some projects such as big data and server side API development, this automatic termination is troublesome. Our TTL boxes remain online for a specific time period before they are killed. The student can restart the box at any point, whether in the CS lab or at home, so interruptions are rare and perfectly acceptable in a teaching environment.
Always On Boxes - we already offer this feature, but because these servers are permanently live and consuming resources, Codio needs to charge extra. In a teaching environment, the Time To Live boxes provide the availability that teachers need at a cost that they can afford.
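The difference between the three box types described above can be pictured as a simple shutdown policy. The sketch below is purely illustrative and is not Codio's implementation: the `Box` class, the policy names, the five-minute grace period and the assumption that a TTL is measured from the moment the project is closed are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for "a few minutes"; the real value is not stated.
IDLE_GRACE_SECONDS = 5 * 60


@dataclass
class Box:
    policy: str            # "standard", "ttl" or "always_on" (assumed names)
    ttl_seconds: int = 0   # only used by the "ttl" policy


def should_be_running(box: Box, seconds_since_closed: Optional[float]) -> bool:
    """Decide whether a box stays online.

    seconds_since_closed is None while the student still has the project open.
    """
    if seconds_since_closed is None:
        return True  # an open project always keeps its box alive
    if box.policy == "always_on":
        return True  # never shut down, hence the extra cost
    if box.policy == "ttl":
        return seconds_since_closed < box.ttl_seconds
    if box.policy == "standard":
        return seconds_since_closed < IDLE_GRACE_SECONDS
    raise ValueError(f"unknown policy: {box.policy}")


if __name__ == "__main__":
    ten_minutes = 10 * 60
    print(should_be_running(Box("standard"), ten_minutes))
    print(should_be_running(Box("ttl", ttl_seconds=4 * 3600), ten_minutes))
    print(should_be_running(Box("always_on"), ten_minutes))
```

Under these assumptions, a long-running Hadoop job survives the student closing the browser on a TTL box, while the box still releases its resources once the time period expires.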
About Dr. Gregory DeLozier
Dr. Gregory DeLozier is a long-time user of the Codio platform. He has seen it grow from a leading web-based development platform for professional developers into the world’s best environment for teaching Computer Science. Dr. DeLozier has degrees in Biology and Computer Science and is currently Adjunct Professor of Computer Science at Kent State University, a global top 200 university with an enrolment of over 40,000 students. Greg also has extensive industry experience as a research software engineer with companies such as Progressive Insurance, specializing in scientific computing and data analysis both in industry and academia.