Abstract
Working with large data sets is increasingly common in research and industry. Distributed data analytics solutions such as Hadoop offer high scalability and fault tolerance, but they usually lack a user interface, so only developers can exploit their functionality. In this paper, we present Radoop, an extension for the RapidMiner data mining tool that provides easy-to-use operators for running distributed processes on Hadoop. We describe integration and development details and provide runtime measurements for several data transformation tasks. We conclude that Radoop is an excellent tool for big data analytics and scales well with increasing data set size and the number of nodes in the cluster.