Modularizing Flink programs to enable stream analytics in IoT Mashup tools = Modularización de programas Flink para el análisis de datos en tiempo real en herramientas de Mashup para IOT

Texto completo

(1)UNIVERSIDAD POLITÉCNICA DE MADRID ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN. MODULARIZING FLINK PROGRAMS TO ENABLE STREAM ANALYTICS IN IOT MASHUP TOOLS. FEDERICO ALONSO FERNÁNDEZ MORENO. 2018.

(2) DEPARTMENT OF INFORMATICS INFORMATICS 4 - CHAIR OF SOFTWARE AND SYSTEMS ENGINEERING TECHNICAL UNIVERSITY OF MUNICH. Master’s Thesis in Informatics. Modularizing Flink Programs to Enable Stream Analytics in IoT Mashup Tools Federico Alonso Fernández Moreno.

(3)

(4) DEPARTMENT OF INFORMATICS INFORMATICS 4 - CHAIR OF SOFTWARE AND SYSTEMS ENGINEERING TECHNICAL UNIVERSITY OF MUNICH. Master’s Thesis in Informatics. Modularizing Flink Programs to Enable Stream Analytics in IoT Mashup Tools Modularisierung von Flink Programmen zur Ermöglichung von Stream Analysen in IoT Mashup Tools Author: Supervisor: Advisor: Submission Date:. Federico Alonso Fernández Moreno Prof. (Chang’an Univ.) PD Dr. habil. Christian Prehofer M.Sc. Tanmaya Mahapatra, Dr. Ilias Gerostathopoulos 16th July 2018.

(5) Ich versichere, dass ich diese Masterarbeit selbstständig verfasst und nur die angegebenen Quellen und Hilfsmittel verwendet habe.. I confirm that this Master’s Thesis in Informatics is my own work and I have documented all sources and material used.. Munich, 16th July 2018. Federico Alonso Fernández Moreno.

(6) Acknowledgments. I would like to thank in the first place Prof. (Chang’an Univ.) PD Dr. habil. Christian Prehofer, who gave me the opportunity of undertaking this project. I am also grateful to Dr. Ilias Gerostathopoulos, for being always there to help me when I needed it. And, of course, I am thankful to Tanmaya Mahapatra, for his advice and for trusting me during all these months. I would also like to thank my home university, the Technical University of Madrid (UPM), especially the staff in the International Office of the ETSIT department, because exchanges do make for better engineers and better people. My family has always supported me, regardless of the circumstances. I can only say thank you, because I have always felt the greatest support from them. I would simply not be here without them. To Alba, for sharing this path with me. And, of course, to all my friends that have been like family these months, those who I already knew and those who I have met during this year. They know what it means to me to complete this project. Thank you..

(7)

(8) Abstract Among all the challenges that the Internet of Things (IoT) has posed, the analysis of large amounts of information in real time is probably one of the most considerable ones. To this purpose, novel Big Data approaches foster Stream Analytics as a solution to the strict latency and throughput requirements that IoT platforms impose. Thus, Stream Analytics should constitute an essential component in IoT applications. However, developing IoT applications is no easy task. To simplify the process of addressing heterogeneous devices at once, mashup tools are widely extended. They provide a lightweight, user-friendly way to prototyping, but currently they lack integration with platforms for Stream Analytics. The main goal of this Thesis is to integrate Apache Flink, an open source platform for distributed Stream Analytics, into aFlux, an IoT mashup tool developed at the Chair of Software and Systems Engineering of the Technical University of Munich. To this end, a conceptual approach to modularize Flink programs in a generic, flexible and expandable way has been designed, implemented and evaluated. With this approach, end users may not only program Flink graphically, but also get support on how to do it right during the creation of these programs.. Keywords: Stream Analytics; IoT Mashup Tools; Big Data Analytics; IoT. v.

(9)

(10) Zusammenfassung Das Internet der Dinge („Internet of Things“ (IoT) auf Englisch) wirft verschiedene Herausforderungen auf, in denen die Echzeitanalyse großer Datenmengen wahrscheinlich eine der wichtigsten ist. Zu diesem Zweck unterstützen neuartige Big-Data-Ansätze Stream Analytics als Lösung der strengen Latenzzeit- und Durchsatzanforderungen, die IoT Plattformen auferlegen. Daher sollte Stream Analytics eine wesentliche Komponente in IoT-Anwendungsprogrammen darstellen. Trotzdem ist die Entwicklung von IoT-Anwendungsprogrammen keine leichte Aufgabe. Um den Prozess zu vereinfachen, heterogene Geräte direkt zu kontaktieren, werden „Mashup Tools“ weitgehend erweitert. Sie bieten eine einfache, benutzerfreundliche Art zum Prototyping, aber derzeit mangelt es ihnen noch eine Integration der Stream-Analytics-Plattformen. Das Hauptziel dieser Masterarbeit ist die Integration von Apache Flink, einer OpenSource-Plattform für verteilte Stream Analytics, in aFlux, ein am Lehrstuhl für Software und Systems Engineering der Technischen Universität München entwickeltes IoT Mashup Tool. Dazu wurde ein konzeptioneller Ansatz zur generischen, flexiblen und erweiterbaren Modularisierung von Flink Programmen konzipiert, implementiert und evaluiert. Durch diesen Ansatz können Endbenutzer Flink nicht nur graphisch programmieren, sondern auch Unterstützung bei der Erstellung dieser Programme erhalten.. Keywords: Stream Analytics; IoT Mashup Tools; Big Data Analytics; IoT. vii.

(11)

(12) Resumen. Autor: Título en castellano: Tutor:. Federico Alonso Fernández Moreno Modularización de programas Flink para el análisis de datos en tiempo real en herramientas de mashup para IoT Tanmaya Mahapatra. Institución:. Chair of Software And Systems Engineering. Universidad Politécnica de Múnich (TUM). Lugar de lectura:. Universidad Politécnica de Múnich (TUM). Garching bei München, Baviera, Alemania. Fecha de lectura:. 19 de julio de 2018. Contexto, descripción del problema y objetivos La cantidad de datos que se procesa en Internet está en constante crecimiento. El término inglés Big Data engloba los numerosos nuevos procedimientos que son necesarios para gestionar y analizar esta gran cantidad de información que proviene de diferentes fuentes (probablemente con estructuras también muy variadas), haciéndolo suficientemente rápido, pero sin dejar de proporcionar resultados que sean verdaderamente útiles. Los enfoques tradicionales de Big Data, denominados comúnmente procesamiento en lotes o, en inglés, Batch Analytics, tratan los datos como si fueran conjuntos acotados o tandas (batches en inglés). Se basan en recopilar los datos antes de analizarlos, lo cual simplifica notablemente su procesamiento, pero tiene muchas limitaciones, como las grandes capacidades de almacenamiento que se necesitan al no procesar los datos según se reciben, sino tras guardarlos. Aun así, los paradigmas de Big Data más conocidos, como MapReduce, están basados en este enfoque, el cual, aunque se comporta bien cuando lo más crítico es el volumen de los datos, no es capaz de proporcionar buenos resultados con baja latencia, lo que lo convierte en inapropiado para sistemas de tiempo real. Y es que, en muchos casos de uso, como transacciones financieras, interacciones de un usuario con las redes sociales, mensajes intercambiados entre terminales de usuario y estaciones base en una red celular, o eventos en las infraestructuras de una. ix.

(13) Resumen ciudad inteligente (monitorización de la contaminación del aire, control del tráfico, del alumbrado público, etc.), los datos se generan como resultado de una serie de eventos que siguen una estructura de flujo temporal y, por tanto, sin fin. Procesar estos datos en lotes es contraintuitivo, pues no son conjuntos con un principio y un final, y en la mayoría de los casos una latencia elevada hace que el procesamiento tarde demasiado y su resultado sea inválido directamente. Por eso, en estos casos se empezó a aplicar una variante de Batch Analytics basada en procesar micro-lotes, que permiten obtener resultados más rápido. Aun así, gran parte de los esfuerzos de las plataformas de Big Data actuales se centran en buscar enfoques para llevar a cabo el procesamiento de datos en flujo (Stream Analytics en inglés) ex profeso que sean más eficientes. La mayoría de los proyectos de software para procesamiento de datos en flujo (como Apache Spark Streaming o Apache Storm) suelen proporcionar un equilibrio entre latencia, tolerancia ante fallos, consistencia y capacidad de proceso, ofreciendo una funcionalidad limitada. Por el contrario, Apache Flink (en adelante, Flink) ofrece una solución completa diseñada directamente para procesamiento de datos en tiempo real, al contrario que los proyectos anteriores, que han evolucionado desde enfoques basados en lotes. Es por ello por lo que este Trabajo Fin de Máster (TFM) gira entorno a Flink. Por otro lado, la consolidación del Internet de las Cosas (IoT, según sus siglas en inglés) ha desembocado en la que probablemente sea una de las mayores fuentes de flujos de datos en tiempo real. Son miles de millones los dispositivos que, continuamente y en todo el mundo, toman medidas de su entorno, produciendo una ingente cantidad de datos que deben ser procesados, pero no de cualquier modo. Por ejemplo, en una ciudad inteligente, las decisiones sobre los eventos de monitorización del tráfico o del medio ambiente no tienen sentido a menos que se tomen en tiempo real. De esto resulta que, en la mayoría de sus variantes, el IoT impone unos requisitos de latencia en el procesamiento de los datos muy estrictos, pues los actuadores tienen que responder lo antes posible. Claramente, el procesamiento de datos en flujo encaja a la perfección con el IoT. Sin embargo, a pesar de esta gran relación que existe entre el IoT y las técnicas de Big Data, la investigación industrial y académica ha comprobado que la integración de estos dos mundos no es fácil, siendo la heterogeneidad de dispositivos el mayor impedimento. A esta se une el problema de coordinar e identificar los múltiples dispositivos, de tal forma que, en la mayoría de los casos, los desarrolladores de aplicaciones para el IoT deben crear programas muy extensos y complejos para que funcionen con todos los dispositivos. Las herramientas de mashup para IoT, como Node-RED, de IBM, proporcionan a los desarrolladores una capa de abstracción que simplifica la creación de aplicaciones, especialmente para usuarios finales, que carecen de conocimientos de programación. Estas herramientas cuentan habitualmente con una interfaz gráfica (GUI, por sus siglas en inglés), en la que el usuario puede combinar una serie de componentes que se le ofrecen para generar aplicaciones nuevas de forma intuitiva y rápida. Pero no solo son útiles para usuarios sin conocimientos de programación, sino también para usuarios. x.

(14) avanzados que quieran beneficiarse de un nivel de abstracción superior que simplifica su trabajo de desarrollo. Sin embargo, entre las funcionalidades de las herramientas de mashup para IoT actuales no se encuentra la creación de programas para procesamiento de datos en flujo. Este TFM toma este punto de partida, y tiene como objetivo principal hacer posible la creación de programas para procesamiento de datos en flujo con Flink en aFlux, una herramienta de mashup para IoT desarrollada en el Departamento de Ingeniería de Software y Sistemas (Chair of Software and Systems Engineering) de la Universidad Politécnica de Múnich (Technical University of Munich, TUM). Para ello, se han diseñado, implementado y evaluado un conjunto de componentes que el usuario puede combinar utilizando la GUI de aFlux para crear programas en Flink gráficamente. Son dos las preguntas que se pretenden responder con este TFM. En primer lugar, ¿qué abstracciones son necesarias para modularizar los programas Flink, de tal forma que se puedan crear programas para procesamiento de datos en flujo gráficamente, a través de una herramienta de mashup para IoT? Y, en segundo lugar, ¿cómo se puede ayudar a los usuarios finales durante el proceso de creación gráfica de programas Flink, en concreto para que la forma en la que combinan los componentes sea la correcta? Ambas se enmarcan en el reto que supone la integración de Big Data e IoT en sí misma, tan poco desarrollada hasta el momento, y en las limitaciones de las herramientas de mashup, que se ven limitadas precisamente por su simplicidad de uso.. Desarrollo del proyecto En la primera fase del proyecto se han diseñado dos modelos conceptuales. El primero permite traducir automáticamente elementos especificados en una GUI a código ejecutable; el segundo permite evaluar continuamente la composición que hace un usuario al crear un mashup, para llevar a cabo comprobaciones semánticas y proporcionarle realimentación al respecto. El traductor incluye un sistema de actores, que ejecutan la lógica que se requiere para la generación de los programas Flink. Esta generación de código fuente debe ser lo más genérica posible, para garantizar que la flexibilidad y extensibilidad del enfoque implementado sean máximas. Al mismo tiempo, debe ser suficientemente específica como para que tenga sentido realizarla a través de mashups. Por su parte, el segundo modelo incluye la definición de una serie de reglas semánticas entre los componentes de mashup desarrollados, que restringen las posiciones en las que dichos componentes pueden encontrarse dentro del mashup. En la siguiente fase del proyecto se han implementado los modelos anteriores en el contexto de aFlux, lo que ha dado lugar a un nuevo plugin para la herramienta que contiene todos los componentes gráficos necesarios para que el usuario pueda crear programas Flink gráficamente. El usuario obtiene directamente un ejecutable que puede desplegar en cualquier instancia de Flink, y todo ello sin necesidad de escribir código fuente. Además, el usuario recibe realimentación por distintas vías gráficas sobre el cumplimiento o incumplimiento de las reglas semánticas relativas a los componentes que ha incluido en su mashup. De esta forma, se asegura que el mashup creado se. xi.

(15) Resumen traducirá en un programa Flink perfectamente ejecutable y sin errores. Finalmente, se ha llevado a cabo una fase de evaluación en la que se ha puesto a prueba el enfoque desarrollado. Para ello, se ha tomado el caso práctico de las ciudades inteligentes, que contienen miles de sensores IoT que crean datos constantemente. Más específicamente, se estudia el caso concreto de la ciudad de Santander, que participa en el proyecto europeo "SmartSantander", fundado por la Comisión Europea como parte del Séptimo Programa Marco (7PM). El objetivo de SmartSantander es proporcionar un entorno de pruebas para ciudades inteligentes completo, con un extenso despliegue de sensores y demás infraestructura inteligente que puede ser utilizado para el desarrollo de nuevas aplicaciones. En esta fase, se han estudiado varios escenarios en los que un usuario utiliza aFlux para crear programas Flink que faciliten el acceso a los datos en tiempo real de SmartSantander. El resultado principal ha sido que es posible crear programas de procesamiento de datos en flujo gráficamente de forma muy fácil e intuitiva con el enfoque desarrollado en este TFM. El principal inconveniente identificado es que no todas las funcionalidades de Flink pueden programarse desde aFlux, aunque sí son fácilmente integrables dada la extensibilidad del modelo desarrollado. Otras limitaciones vienen dadas por las herramientas de mashup en sí mismas, que proporcionan simplicidad de uso a cambio de limitar la flexibilidad de las opciones que el usuario final puede configurar.. Conclusión Las aplicaciones para IoT necesitan cada vez más estar integradas con las plataformas para Big Data, para llevar a cabo análisis que den sentido a las ingentes cantidades de información que se generan en millones de sensores cada minuto. De entre todos los enfoques de Big Data, el procesamiento en flujo es el más indicado para eventos en tiempo real, pero sin embargo no está soportado por las herramientas de mashup para IoT, a pesar de ser un caso de uso que encaja a la perfección con este enfoque. A nivel de implementación, las principales contribuciones de este TFM son: un plugin para aFlux que permite la creación de programas Flink gráficamente, una serie de mejoras en el núcleo de aFlux que permiten el soporte continuo del usuario final validando los flujos gráficos que crea y un conector para Flink que permite el acceso directo a los datos de SmartSantander al crear programas de procesamiento de datos en flujo. Puede concluirse que las preguntas planteadas al principio del proyecto se han respondido con el trabajo desarrollado: el diseño, implementación y evaluación de un modelo genérico, expandible, escalable y flexible que permite que Flink, una de las soluciones de procesamiento de datos en flujo más punteras, pueda programarse gráficamente a través de aFlux, una herramienta de mashup para IoT. Se han creado un conjunto de componentes gráficos que un usuario puede combinar de múltiples formas, siempre supervisado por un sistema que continuamente le proporciona realimentación sobre si lo está haciendo correctamente o no, para garantizar el éxito de los programas generados.. xii.

(16) Contents Acknowledgments. iii. Abstract. v. Zusammenfassung. vii. Resumen. ix. List of Tables. xvii. List of Figures. xix. Listings. xxi. List of Acronyms. xxiii. 1. Introduction 1.1. Motivation . . . . . . . . . . . . 1.2. Research Questions and Goals 1.3. Methodology . . . . . . . . . . 1.4. Outline . . . . . . . . . . . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. 1 2 4 5 6. 2. Background 2.1. Stream Processing . . . . . . . . . . . . . . . . . . . . . . 2.1.1. Requirements of Stream Processing Platforms . 2.1.2. Integration with the IoT . . . . . . . . . . . . . . 2.2. Apache Flink . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1. The Concept of Time . . . . . . . . . . . . . . . . 2.2.2. Windows . . . . . . . . . . . . . . . . . . . . . . . 2.2.3. Programming Flink . . . . . . . . . . . . . . . . . 2.2.4. Flink Against its Competitors . . . . . . . . . . . 2.3. Mashups . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1. Integration with the IoT . . . . . . . . . . . . . . 2.4. aFlux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1. The Web Application . . . . . . . . . . . . . . . . 2.4.2. The aFlux Engine: Leveraging the Actor Model 2.4.3. The aFlux Plug-in Framework . . . . . . . . . . 2.5. SmartSantander . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. 7 7 8 9 10 11 12 12 14 15 16 17 17 19 20 20. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. xiii.

(17) Contents 3. Related Work 3.1. Nussknacker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2. IBM SPSS Modeler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3. Microsoft Azure Stream Analytics . . . . . . . . . . . . . . . . . . . . . . .. 23 23 24 24. 4. Conceptual Approach 4.1. Translation and Code Generation . . . . . . . . . . . . . . . . . . . . . . . . 4.2. Validation of Graphical Flows . . . . . . . . . . . . . . . . . . . . . . . . . .. 25 25 27. 5. Implementation 5.1. SmartSantander Connector for Flink . . . . . . . . 5.1.1. The SmartSantander API . . . . . . . . . . 5.1.2. The Data Model . . . . . . . . . . . . . . . . 5.1.3. The Business Logic . . . . . . . . . . . . . . 5.1.4. The Flink Data Source . . . . . . . . . . . . 5.2. Mashup Components for aFlux . . . . . . . . . . . 5.2.1. Java Code Generation . . . . . . . . . . . . 5.2.2. Message Passing Among Actors . . . . . . 5.2.3. Structure of the Actors . . . . . . . . . . . . 5.2.4. Setting Up the Flink Environment . . . . . 5.2.5. SmartSantander Data Source . . . . . . . . 5.2.6. Transformation Mashup Components . . . 5.2.7. Outputting a Data Stream . . . . . . . . . . 5.2.8. Complex Event Processing . . . . . . . . . 5.2.9. Executing and Generating Job . . . . . . . 5.3. Automatic Mapper for Flink API . . . . . . . . . . 5.3.1. The JavaParser Library . . . . . . . . . . . . 5.3.2. Parsing the Flink API . . . . . . . . . . . . 5.3.3. Analyzing the Flink API . . . . . . . . . . . 5.3.4. The Flink API Mapper . . . . . . . . . . . . 5.4. End-User Continuous Support . . . . . . . . . . . 5.4.1. The ToolSemanticsCondition . . . . . . . . 5.4.2. Back-End Job Validation . . . . . . . . . . . 5.4.3. Front-End Feedback . . . . . . . . . . . . . 5.4.4. Using Conditions in Mashup Components. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. 31 31 32 34 34 37 38 39 39 40 41 41 42 49 51 58 59 61 62 63 64 65 65 66 69 70. 6. Evaluation 6.1. The Evaluation Scenario . . . . . . . . 6.2. Overall Considerations . . . . . . . . . 6.3. Use Case 1: Real Time Data Analysis 6.3.1. Experiment 1 . . . . . . . . . . 6.3.2. Experiment 2 . . . . . . . . . . 6.3.3. Experiment 3 . . . . . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. 73 73 75 75 75 78 81. xiv. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . ..

(18) Contents 6.3.4. Experiment 4 . . . . . . . . . . . . . 6.4. Use Case 2: Pattern Detection . . . . . . . . 6.4.1. Experiment 1 . . . . . . . . . . . . . 6.4.2. Experiment 2 . . . . . . . . . . . . . 6.5. End-User Continuous Support Evaluation . 6.6. Critical Discussion . . . . . . . . . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. 82 87 90 92 92 95. 7. Conclusions 99 7.1. Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.2. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 A. Appendix A.1. SmartSantander Connector . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2. Mashup Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.3. Flink API Mapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 101 101 107 111. Bibliography. 113. xv.

(19)

(20) List of Tables 5.1. 5.2. 5.3. 5.4. 5.5. 5.6. 5.7. 5.8. 5.9. 5.10. 5.11. 5.12. 5.13. 5.14.. Main Endpoints of SmartSantander REST API . . . . . . . . . . . . . . . Properties of the "SmartSntndr Data" Mashup Component . . . . . . . . Properties of the "GPS Filter" Mashup Component . . . . . . . . . . . . . Properties of the "Select" Mashup Component . . . . . . . . . . . . . . . Supported Windows Application Programming Interfaces (APIs) . . . . Properties of the "Window" Mashup Component . . . . . . . . . . . . . . Properties of the "Window Operation" Mashup Component . . . . . . . Properties of the "Output Result" Mashup Component . . . . . . . . . . Defining Contiguity in FlinkCEP . . . . . . . . . . . . . . . . . . . . . . . Properties of the "CEP Begin" Mashup Component . . . . . . . . . . . . Properties of the "CEP New Patt." Mashup Component . . . . . . . . . . Properties of the "CEP Add Condition" Mashup Component . . . . . . . Properties of the "CEP End" Mashup Component . . . . . . . . . . . . . Semantics Conditions in the Mashup Components of the Flink Plug-in .. . . . . . . . . . . . . . .. 33 42 44 45 47 48 49 50 53 53 54 56 57 71. A.1. A.2. A.3. A.4.. Structure of the Traffic Dataset . . . . . . . . . . . . Structure of the Environment Dataset . . . . . . . . Structure of the Air Quality Dataset . . . . . . . . . Mashup Components in the Flink Plug-in for aFlux. . . . .. 102 103 104 108. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. xvii.

(21)

(22) List of Figures 1.1. Latest Gartner’s Hype Cycle, as of July 2017 . . . . . . . . . . . . . . . . . 2.1. 2.2. 2.3. 2.4. 2.5. 2.6. 2.7. 2.8.. Architecture of a Stream Processing Platform Different Concepts of Time in Flink . . . . . Windows in Flink . . . . . . . . . . . . . . . . Overall structure of a Flink Program . . . . . Analytics, IoT and Mashups . . . . . . . . . . High-Level Architecture of aFlux . . . . . . . Graphical User Interface of aFlux . . . . . . . Traffic Sensors in SmartSantander . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. 9 11 13 14 17 18 19 21. 3.1. The Nussknacker Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . .. 23. 4.1. Conceptual Approach for Translation and Code Generation . . . . . . . .. 28. 5.1. 5.2. 5.3. 5.4. 5.5. 5.6.. Live Data Provided by the SmartSantander API . The Location Picker Property . . . . . . . . . . . . Programming Flink from aFlux . . . . . . . . . . . ToolSemanticsCondition in the aFlux Tool Core . Validation Errors Rendered in aFlux’s Front-End . End-User Continuous Support . . . . . . . . . . .. . . . . . .. . . . . . .. . . . . . .. . . . . . .. 32 44 60 67 70 72. 6.1. 6.2. 6.3. 6.4. 6.5. 6.6. 6.7. 6.8. 6.9. 6.10. 6.11. 6.12. 6.13. 6.14. 6.15.. The Evaluation Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . Flows in aFlux for Use Case 1 . . . . . . . . . . . . . . . . . . . . . . . Tumbling vs. Sliding Windows in aFlux . . . . . . . . . . . . . . . . . Use Case 1, Experiment 1 (Tumbling Windows) . . . . . . . . . . . . Use Case 1, Experiment 1 (Sliding Windows) . . . . . . . . . . . . . . Changing Filtering Location from aFlux . . . . . . . . . . . . . . . . . Use Case 1, Experiment 2 (Tumbling Windows) . . . . . . . . . . . . Use Case 1, Experiment 2 (Sliding Windows) . . . . . . . . . . . . . . Use Case 1, Experiment 3 (Tumbling Windows) . . . . . . . . . . . . Use Case 1, Experiment 3 (Sliding Windows) . . . . . . . . . . . . . . Different Window Operations in aFlux . . . . . . . . . . . . . . . . . . Use Case 1, Experiment 4 (Tumbling Windows) . . . . . . . . . . . . Use Case 1, Experiment 4 (Sliding Windows) . . . . . . . . . . . . . . Flow in aFlux for Use Case 2 . . . . . . . . . . . . . . . . . . . . . . . Sample Configuration of Components in aFlux for Pattern Detection. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. 73 76 77 79 80 81 83 84 85 86 87 88 89 90 91. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. 3. xix.

(23) List of Figures 6.16. 6.17. 6.18. 6.19.. Use Case 2, Experiment 1 . . . . . . . . . . . . . . . . . Use Case 2, Experiment 2 . . . . . . . . . . . . . . . . . Step-by-Step Flow Composition in aFlux . . . . . . . . Details About the Errors in the "Window" Component. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. . . . .. 93 93 94 95. A.1. Model used for the SmartSantander Connector . . . . . . . . . . . . . . . . 105 A.2. The SmartSantander Connector . . . . . . . . . . . . . . . . . . . . . . . . . 106 A.3. The Flink API Mapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111. xx.

(24) Listings 5.1. 5.2. 5.3. 5.4. 5.5. 5.6. 5.7. 5.8. 5.9. 5.10. 5.11. 5.12. 5.13. 5.14. 5.15. 5.16.. Deserialization of JSON Resources with Annotations . . . . . . . . Constructor of TrafficObservation.java . . . . . . . . . . . . . . . . . TrafficObservationDeserializer . . . . . . . . . . . . . . . . . . . . . Registering the Traffic Deserializer in Gson . . . . . . . . . . . . . . Instantiation of SmartSantanderObservationStream . . . . . . . . . SmartSantanderSource . . . . . . . . . . . . . . . . . . . . . . . . . . Sample Code Generated by the "SmrtSntnder Data" Component . Sample Code Generated by the "GPS Filter" Component . . . . . . Sample Code Generated by the "Select" Component . . . . . . . . . Sample Code Generated by the "Window" Component . . . . . . . Sample Code Generated by the "Window Operation" Component . Sample Code Generated by the "Output Result" Component . . . . Sample Code Generated by the "CEP Begin" Component . . . . . . Sample Code Generated by the "CEP New Patt." Component . . . Sample Code Generated by the "CEP Add Condition" Component Sample Code Generated by the "CEP End" Component . . . . . . .. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .. 35 36 36 36 37 37 42 44 46 47 49 51 54 55 56 58. 6.1. 6.2. 6.3. 6.4. 6.5.. Required Code to Select Two Different Gas Levels . . . . . . Tumbling vs. Sliding Windows in Java . . . . . . . . . . . . . Changing Filtering Location in Java . . . . . . . . . . . . . . Different Window Operations in Java . . . . . . . . . . . . . Required Code to Create a Pattern that Detects Traffic Jams. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. 76 78 78 82 90. A.1. A.2. A.3. A.4. A.5.. Sample SmartSantander API Response . . . . . Generating an Anonymous Class with JavaPoet Output of Listing A.2 . . . . . . . . . . . . . . . ExecuteAndGenerateJobActor.java . . . . . . . . Basic Example with FlinkCEP . . . . . . . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. 101 107 107 109 110. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. xxi.

(25)

(26) List of Acronyms API Application Programming Interface. AST Abstract Syntax Tree. CEP Complex Event Processing. CLI Command-Line Interface. CSV Comma Separated Values. DAG Directed Acyclic Graph. EUD End-User Development. GUI Graphical User Interface. HTTP HyperText Transfer Protocol. IDE Integrated Development Environment. IoT Internet of Things. JSON JavaScript Object Notation. MDSD Model-Driven Software Development. POJO Plain Old Java Object. REST REpresentational State Transfer. SQL Structured Query Language. UML Unified Modeling Language.. xxiii.

(27)

(28) 1. Introduction The amount of data that needs to be processed in the Internet is constantly growing. Big Data is an umbrella term to refer to the new approaches that are needed to properly manage and analyze these ever-growing amounts of information, which differ from traditional data in the so-called 4 V’s: volume, variety, velocity and veracity [1]. Each of these properties entails a different challenge that Big Data platforms must address: they must be capable of handling great amounts of data that come from different sources (hence with different structures), and do it quickly enough while, at the same time, provide results that really matter. In many use cases, like stocks transactions, user interactions with social networks, exchange of messages between end-user terminals and mobile base stations, or events in smart cities infrastructures (e.g. pollution monitoring, traffic control, lamp posts, etc.), data occur as a series of events that follow a flow-like structure. However, rather than treating them as a stream, traditional Big Data approaches have processed data as if they were finite, bounded datasets called batches. This is known as Batch Analytics. Batch Analytics systems rely on collecting data before it is analyzed, so they simplify the whole process notably. However, they require great storage facilities, as they are not capable of handling data as they arrive. In any case, the most currently well-known Big Data paradigms, like MapReduce [2], are based on batch processing. Batch processing behaves reasonably well when assessing large volumes of data is the priority, especially regarding bounded datasets. However, they fail in providing good results with low latency, thus being inappropriate for real-time systems. They can nonetheless be applied to do some sort of stream processing, in what are known as micro-batches, but clearly there must be a more efficient way to do true Stream Analytics. The main strengths of Stream Analytics involve also great challenges, like ensuring low latency without abandoning high throughput, fail tolerance and consistency. For this reason, most stream processing software projects usually take a trade-off among these challenges and focus just on ensuring a subset of them. This is the case of the open-source projects Apache Spark Streaming1 and Apache Storm2 . Both evolved from a Batch Analytics system, so they mainly work with micro-batches[3, 4]. On the contrary, Apache Flink3 (in this report it will be written just "Flink") takes no trade-off; instead it provides a complete solution focused on stream processing. For this reason, Flink is the data processor that this thesis addresses. https://spark.apache.org/streaming/ https://storm.apache.org/ 3 [Online] https://flink.apache.org/ 1 [Online] 2 [Online]. 1.

(29) 1. Introduction The consolidation of the Internet of Things (IoT) has brought probably one of the most remarkable sources of stream-like Big Data. Indeed, with thousands of millions of sensors taking measurements of their environment all over the world, an incredibly extensive amount of data is constantly being produced [5]. Take, for instance, any scenario in a smart city, such as traffic monitoring, environment tracking, etc. Decisions in these scenarios need to be taken in real-time. Therefore, in most cases the IoT clearly enforces strict latency requirements, because actuators need to take action as soon as possible as a result of processing the measured data. Stream Analytics fits the IoT like a glove. In spite of this strong bond between IoT and Big Data analytics [6], enabling Stream Analytics in IoT platforms is no easy task. There are many challenges to it, being probably the diversity of IoT devices the most important one of them. If we consider also identity of devices and coordination among them, usually application developers have no alternative but to create complex boilerplate code to address all devices at once [7]. IoT mashup tools, like Node-RED4 , provide them with an abstraction layer to simplify the creation of applications, especially for non-programmer, end users. However, they do not support the creation of Stream Analytics flows so far. This thesis starts from this point and is aimed at enabling Stream Analytics with Flink in aFlux, an IoT Mashup Tool developed at the Chair of Software and Systems Engineering of the Technical University of Munich.. 1.1. Motivation There is a great number of challenges that encourage undertaking this project. For starters, the integration between the Big Data and IoT worlds is itself an issue, as very little research has been done on this topic. In fact, as it can be seen in Fig. 1.1 (taken from [8]), the IoT platform itself is currently considered as a key platform-enabling technology, to be adopted within the next 2 to 5 years. Similarly, connected homes —one of the main IoT use case scenarios— is also raising great expectations for the upcoming years. It is clear that it is the right moment to do research in IoT. The limitations of mashups also apply here. Mashup tools have many advantages, especially for end users with no programming skills, but they also entail some limitations which derive from their simplicity of use. The domain of application of the mashup is critical [9], as it determines the way its mashup elements are designed, so a complete generality cannot be reached —and should in fact be avoided because it would go against the main advantage of mashups: its ease of use. Thus, the level of abstraction that is accomplished should be generic enough to allow multiple variations of the created jobs but also specific enough to the domain of application so that end users find it easy (and even fun) to develop new applications. 4 [Online]. 2. https://nodered.org/.

(30) 1.1. Motivation. Figure 1.1.: Latest Gartner’s Hype Cycle, as of July 2017 [8]. The IoT platform just left the innovation trigger area and is supposed to be widely adopted within 2-5 years.. In this thesis, the domain of application is focused on Smart Cities, which contain thousands of IoT sensors that constantly create data. More specifically, the city of Santander, in the North of Spain, participates in a European project named "SmartSantander", which was in its start funded by the European Commission as part of the 7th Framework Program (FP7) [10]. The aim of SmartSantander is to provide a full smart-city testbed with a complete deployment that leverages Future Internet to become a reality [11]. SmartSantander serves as a practical use case scenario for this thesis. It has been taken into consideration both when designing the modularization of Flink API and when evaluating the outcome of the implementation. Some part of the code is also planned to be contributed to the community, to allow others to experiment with the data from SmartSantander as it has been done in this thesis. The integration of Flink and an IoT mashup tool like aFlux constitutes a great challenge as well. The conceptual approach of aFlux must be studied in detail to fully understand how it behaves, and hence be able to design a modularization of Flink programs that suits the architecture and behavioral model of aFlux. Another issue when it comes to modularizing the API of Flink is that the code generation should also be made as generic as possible. As many Flink APIs as possible should be supported, providing an abstraction to end-users of aFlux, while at the same time ensuring some specificity derived from the domain of application.. 3.

(31) 1. Introduction As it can be seen, there are many important challenges that motivate this thesis. In the following section the main goals to face them are stated.. 1.2. Research Questions and Goals The main task of this thesis is to enable the creation of Flink programs from aFlux. By designing and implementing a set of new mashup components inside aFlux (packaged in the form of a plugin), IoT application developers will be able to create Flink jobs graphically. But not only do they need the tools to create Flink programs, they also need support on how to use them, especially considering that the target user of mashup tools are profiles with no programming skills, or in general any end user that is not willing to take care of coding. Thus, two research questions are devised in the scope of this thesis: Research Questions. #1 #2. Which abstractions are necessary to modularize Flink programs so that they can be created from flow-based, graphical mashup tools? How can end users get support during the process of creating Flink programs graphically?. These questions remain in the context of this thesis, which is aimed at somehow giving a response to them. However, they should be narrowed down to have a properly defined problem that can be addressed completely in this thesis. For instance, not a whole modularization of Flink programs has been sought, but just a subset of the Stream Analytics functionalities. The same applies to the support stated in question #2: it will be limited to enforcing a certain order of the mashup components in the Graphical User Interface (GUI) of aFlux. Consequently, the revisited questions are as follows: Research Questions Revisited. #1 #2. Which abstractions are necessary to modularize Flink streaming programs so that they can be created from flow-based, graphical mashup tools? How can end users get support during the process of creating Flink programs graphically so that they place visual components in the right order?. After defining a set of the different modules that compose a Flink program, a mashup component has been created out of each of them. In the end, the user will be able to combine these building blocks in whichever way they want, to create a Flink job. And. 4.

(32) 1.3. Methodology they will get continuous support on how to fulfil this task successfully, so as to avoid errors that could result in the job not being created properly. As a result, the goals of this thesis are as follow:. • Perform a complete analysis of the Flink software project, to study its structure and come up with a way to modularize its programs. This approach should be flexible, generic and extensible. • Integrate this modularization approach into aFlux, by developing a set of mashup components that can be combined using the tool. • Provide the users with a support mechanism that helps them in the creation of Flink programs with the developed artifacts. • Contribute the implemented artifacts and approaches to the community of aFlux, Flink and SmartSantander.. 1.3. Methodology To accomplish the goals stated in Section 1.2, the following methodology has been followed: 1. Literature review. In this stage, the state-of-the-art with regard to stream analytics and IoT mashup tools were analyzed. 2. Design. This step dealt with seeking for an approach to frame Flink programs to enable its creation from aFlux. It included the following tasks: a) Analyze how mashup tools in general, and aFlux in particular, work, both from the front-end and back-end perspectives. b) Analyze how Flink works and its APIs, to fully understand what it provides and how it is achieved. c) Design a set of mashup components that allow the creation of Flink jobs. These components are mapped to Akka actors in the background, which execute the required logic to generate Flink programs. A right modularization has to be designed, so that the approach is extensible and supports error traceability. d) Design an approach to give users continuous support on the way they are using the mashup components, to prevent errors. 3. Implementation. In this step, the results from the design phase were used as input to a development stage of all the necessary components. Java has been used as the main programming language. The main outcome of this step is a set of mashup elements for aFlux that can be combined to create Flink jobs.. 5.

(33) 1. Introduction 4. Evaluation. This step was about assessing how easy it is to create Flink jobs with the designed approach. The extensibility and limitations of the implemented code-generation technique were assessed, analyzed and some conclusions were drawn. This whole project work has been undertaken in the Chair of Software and Systems Engineering of the Department of Informatics of the Technical University of Munich.. 1.4. Outline The structure of this document is as follows: Chapter 2 contains the background that is of relevance for this work. Concepts like Stream Analytics and mashup tools are presented, as well as Apache Flink and aFlux, the most relevant software projects that take part in this work. Finally, an overview of the SmartSantander project, the main use-case scenario on which this thesis is based, will be given. After that, the relevant works related to the matter of this thesis are discussed in Chapter 3. In Chapter 4 the conceptual approach to address the research questions is described, and the most important design choices are highlighted. Chapter 5 deals with the implementation tasks that have been accomplished as part of this thesis. Fine details are given to show how each component has been designed and works, and the main problems that have been faced are outlined. The implemented artifacts are evaluated in Chapter 6, by means of a set of use cases and examples that show how easy it is to experiment with the data from SmartSantander with the implemented approach. In light of the evaluation results, a critical discussion of the work accomplished is provided at the end of this chapter. Finally, the overall conclusions are drawn in Chapter 7, in which the main contributions of the thesis are also highlighted, and some future lines are identified.. 6.

(34) 2. Background In this chapter, the concepts that are required for a better understanding of this thesis are presented. On the one hand, a description of Stream Analytics is provided, then Flink is described as a software platform to perform those analytics. On the other hand, mashups are described from a theoretical perspective, and then aFlux is presented as a practical example of IoT mashup tools. Finally, the SmartSantander project, which is the main use case that has been considered for this thesis, is shortly introduced.. 2.1. Stream Processing The term "Big Data" was coined to refer to the great amounts of data that need to be processed, not only in terms of volume but also in terms of other subtler parameters like velocity. The MapReduce framework, initially presented by Google [2, 12], was then implemented as an open-source project by the Apache Software Foundation, called Hadoop1 . Both of them had been designed for processing data batches with high throughput, and actually could do it in a very efficient way, though these tasks could take several hours or even days [13]. These time restrictions are simply unacceptable for many real-world applications that require the shortest response times available (a couple of minutes or seconds), like most IoT applications. Even more important is the fact that the aforementioned batch processing enforces all the data being available before it can be fed to the processing system. But, again, sometimes it is necessary to make computations before all the data have arrived, in order to have real-time results, regardless of whether or not all data input is already in the system. The requirements and essence of IoT applications and scenarios encourage the use of Stream Analytics, because they meet the characteristics of streaming data: they produce continuous, unbounded and sometimes disordered streams of events [14]. In these contexts, data monitoring is also desirable, in order to search for patterns, anomalies, etc. Thus, Stream Processing is about a continuous processing of data, not only to provide instantaneous answers, but to endlessly monitor data in the wide sense. It is important not to mix Stream Processing with Real-Time Processing. The second one is an evolution of MapReduce that focuses on providing results quickly, but no stream of events is considered. For instance, the in-memory processing approach of Apache Spark makes for a real-time solution, though no streaming is supported in the first place [13]. 1 [Online]. http://hadoop.apache.org/. 7.

(35) 2. Background Of course, it is possible to somehow adapt MapReduce to data streams, by breaking the data into a set of finite batches. This is known as micro-batching and it is what Apache Spark Streaming does [15]. However, Spark is not a stream-first platform, unlike other approaches like Apache Storm and, especially, Flink.. 2.1.1. Requirements of Stream Processing Platforms An ideal stream-first platform should meet the following requirements [16]:. • Low latency. Streaming platforms usually make use of in-memory processing, in order to avoid the time required to read/write data in a storage facility. They should be capable of handling data as they arrive. For this reason, a message transport is vital in a streaming architecture, as it can be seen in Fig. 2.1. This transport layer should feature not only extraordinary performance, but also decoupling between data publishers and data consumers [17]. • Data querying. Streaming platforms should make it possible to find events in the whole data stream. Typically, an Structured Query Language (SQL)-like language is advised [16]. But, since data streams never come to an end, there has to be a mechanism to define the limits of a query (otherwise it would be impossible to query streaming data). This is where the window concept takes part. Windows define the data in which an operation may be applied, this is why they make key elements in stream processing. • Out-of-order data. Since the streaming platform does not wait for all the data to be available, it must feature a mechanism to handle data coming late or even not coming whatsoever. A concept of time needs to be introduced, as opposed to batch analytics systems, which process data in chunks regardless of which arrived before. • Replicable computations and stored state. In spite of the disorder that may exist in data, computations need to be repeatable. Besides, they may sometimes include information from the past, e.g. to make comparisons. This is to say that there has to be a system state, and it needs to be consistent [18]. • High availability and scalability. Stream processors will most probably handle ever-growing amounts of data, and in most cases other systems could rely on them, e.g. in IoT scenarios. For this reason, it has to be guaranteed that the stream processing platform is reliable, fault-tolerant and can handle just about any amount of data events. • High throughput. Scalability and parallelism are key to enable high performance in terms of the amount of data that can be processed. Stream Processing systems are often asked for a real-time performance, so they are useless if they cannot handle new data as they arrive: they should behave properly under stress [17].. 8.

(36) 2.1. Stream Processing. Real-Time Streams Transport. Stream Processor. Store, Transport, Serve. Events. Stored Past Data. Figure 2.1.: Architecture of a Stream Processing Platform The first approaches to stream processing, like Apache Storm and Apache Spark Streaming, used to focus on some of these requirements, such as low latency in the former example and high throughput in the latter one [3]. They were built on top of MapReduce, as stated above. An improvement was made with the so-called lambda architecture, a very well-known approach [17, 19, 20] that combines batch and stream-like approaches in order to reach short response times (in the order of seconds). This approach has some advantages, but one critical downside: the business logic needs to be duplicated into the stream-like and the batch processors. Experts agreed that there had to be a better way to face this challenge [21], and stream-first solutions like Flink appeared, promising to meet all the requirements [17]. They all have a similar architecture, which is depicted in Fig. 2.1. Apache Kafka2 and Flink can be used in the transport and stream processing components respectively.. 2.1.2. Integration with the IoT Stream Analytics makes sense for a wide variety of scenarios and practical use cases, ranging from the financial sector, telecommunications, and even retail and marketing [17]. In general, any area or field of study in which data are generated as a flow of events is suitable for Stream Analytics. The IoT is composed of millions of sensors that measure a quantity in the real physical world, and actuators that need this data to be processed with almost no latency, to ensure a proper response in time. These requirements encourage the use of Stream Analytics to the greatest extent [22]. For this reason, a lot of work and research has been conducted to bring the world of Big Data (especially Stream Analytics) and the IoT together [23]. In some cases, new semantics to specifically target IoT are designed [22]. Some proposals are based on the aforementioned lambda architecture [24] to combine both 2 [Online]. http://kafka.apache.org/. 9.

(37) 2. Background batch and stream processing, while others dive into new network paradigms like Fog Computing [25] to achieve near-to-zero network latency. In the world of Big Data management in a more general sense, many frameworks have been proposed. For instance, a complete data-centric design for a smart city is presented in [26], with a framework that spreads along all the levels of the communication stack: from the physical level to applications. They emphasize the necessity to make sense out of all the data generated in the IoT, just like in the cognitive IoT paradigm presented in [27]. Clearly, there is great interest in bringing these two worlds together. Still, very few research has been done in enabling fast prototyping of these analytics applications. To this purpose, mashups offer a powerful yet simple approach that even users with no programming skills can use.. 2.2. Apache Flink In its official website [18], Apache Flink is defined as "a framework and distributed processing engine for stateful computations over unbounded and bounded data streams". In more detail:. • It is a framework, because it allows the creation of programs through a set of public APIs, which currently support Java, Scala and the REpresentational State Transfer (REST) architecture. A web interface is also available to easily manage the jobs in execution. • It is a distributed engine, and it is built upon a distributed runtime [28] that can be executed in a cluster, to benefit from high availability and high-performance computing resources. A wide range of deployment options are supported [29]: YARN, Mesos, Docker, Kubernettes, Amazon Web Services, Google Compute Engine, and even Hadoop. • It is based on stateful computations. Indeed, Flink offers exactly-once state consistency [17], which means that it is able to ensure correctness even in case of failure. Flink is also scalable, because the state can also be distributed among several systems. • It supports both bounded and unbounded data streams. Flink cannot only process data as it arrives, but also process a historical stream of events. This is not the same as batch processing, because data are still processed as streams — it is just that the time reference is taken back to the past. In the latest releases, the support of bounded data sets has been extended to perform proper batch analytics, as this is indeed a special case of streaming [17]. Flink achieves all this by means of a distributed dataflow runtime that allows a real stream pipelined processing of data [28].. 10.

(38) 2.2. Apache Flink. 2.2.1. The Concept of Time As it was stated in Section 2.1, a streaming platform should be able to handle time, because this is the reference that is used for understanding how the data stream flows, that is to say, which events come before or after a given one. Time is used for creating windows, and to perform operations on streaming data on the wide sense. Time is the key to solve data disorder [17]. Flink supports several concepts of time. Fig. 2.2 depicts them using the streaming architecture presented in the previous section.. Real-Time Streams Transport. Stream Processor. Store, Transport, Serve. Events. Stored Past Data Processing Time Event Time. Ingestion Time. Figure 2.2.: Different Concepts of Time in Flink. Event Time Event time refers to the time in which an event took place. It is usually appended to the event metadata by the event producer as some timestamp attribute. By featuring event-time semantics, Flink allows processing the data considering where the events happened in the real world. No matter if they arrive out of order, or if the message transport makes a mess of them, or if they come straight from a recorded source. To achieve this, Flink support a mechanism of additional timestamps called watermarks, which allow making progress in event time [30]. Watermarks flow along with the data, and they indicate that by that point of the stream, all the events prior to a certain timestamp must have arrived. Now, the processor may advance its internal event-time clock. Watermarks permit the processing of out-of-order streams [30, 17]. And even if events arrive after a later watermark has already been processed, Flink can recalculate computations or send them to a side output.. 11.

(39) 2. Background Processing Time This time is related to the machine in which the streams are processed. Its internal system clock is used to process data. Of course, it provides the shortest latency [30], but it is not fault-tolerant and depends on the speed at which the messages arrive from the network and the message transport. Flink features processing-time semantics for those applications which do not need exact results. Ingestion Time This time concept is between event and processing time. In this case, there are watermarks (like in event time), but they are assigned automatically by Flink as the data enters it. They are therefore simpler in this sense, but they do not allow processing out-of-order streams. Internally, it is treated as event time [30].. 2.2.2. Windows Windows are a basic element in stream processors. Flink supports different types of windows [31], and all of them rely on the notion of time that has been defined in the previous section.. • Tumbling windows (Fig. 2.3a). With a specified window size, tumbling windows assign each event to one and only one window, without any overlap. Their size is fixed. • Sliding windows (Fig. 2.3b). Their size is also fixed, but an overlap called slide is allowed. • Session windows (Fig. 2.3c). They are really interesting for some applications, because sometimes it may be interesting to process events in sessions. This is something that cannot be done in micro-batching, as shown in [17]. • Global window (Fig. 2.3d). By assigning all the elements to one single window, this approach allows the definition of triggers, which tell Flink when exactly the computations should be performed. It is a completely customizable window.. 2.2.3. Programming Flink As stated above, Flink is a framework, and it can be used to develop stream processing jobs that can be applied on just about any data source. To this purpose, it offers a wide set of APIs, both in Java and in Scala, that expose the capabilities of Flink on three different abstraction levels:. • Stateful stream processing. It features full flexibility, by enabling low-level processing and control. In most applications, such a degree of possibilities is not required and can even be cumbersome.. 12.

(40) 2.2. Apache Flink. Window 2 Window 1 Window 1. Window 2. Window 4. Window 3. Window 3. A. A. B. B. C. C. D. D. Size. Slide Size. (a) Tumbling Windows. (b) Sliding Windows. Window 2 Window 1. Window 3. A. A. B. B. C. C. D. D Session gap. (c) Session Windows. (d) Global Window. Figure 2.3.: Windows in Flink. • Core level. This is the widest abstraction level. By means of both a DataStream API and a Dataset API, Flink enables not only stream processing but also batch analytics on "bounded data streams", i.e. data sets with fixed length. In this thesis, only the DataStream functionalities have been considered, because they are what make Flink disrupting —but it should be kept in mind that Flink can successfully handle batch jobs. • Declarative domain-specific language. Flink offers a Table API as well, which provides a high-level abstraction to data processing. With this tool, a data set or data stream can be converted to a table that follows a relational model. The Table API is more concise, because instead of the exact code of the operation, logical. 13.

(41) 2. Background operations are defined [32], but at the same time less expressive than the core APIs. In the latest Flink releases, an even-higher-level SQL abstraction has been created, as an evolution of this declarative domain-specific language.. • Libraries. On top of the user-facing APIs stated above, some libraries with special functionality are built. The added value ranges from Machine Learning algorithms (currently just available in Scala) to Complex Event Processing (CEP) and graph processing [17]. In any case, the structure of a Flink program (especially when using the core-level APIs), is the one shown in Fig. 2.4. Data from a source enters Flink, where a set of transformations is applied to them (window operations, data filtering, data mapping, etc.). The results are in turn yielded to a data sink.. Data Source. Transformations. Data Sink. Figure 2.4.: Overall Structure of a Flink Program [33]. 2.2.4. Flink Against its Competitors As a final note, it is interesting to see how Flink behaves when it is compared to other popular platforms like Apache Spark Streaming and Apache Storm. It has been demonstrated [34] that Flink and Storm, as true Stream Analytics processors, show much lower latency than Spark Streaming, yet Storm seems to include a lot of overhead that could be a problem with high throughputs On the other hand, Spark Streaming, being based in micro-batches, can handle higher throughputs with the right configuration [34]. But it does not perform automatic optimization of job execution, unlike Flink [35]. It seems like, because of its pipelined execution, automatic optimizations and ease of configuration, Flink should be chosen in most scenarios [35]. In any case, not a single framework is suitable for all data and jobs: as an example, Spark is nearly 2 times faster than Flink for large graph processing [35].. 14.

(42) 2.3. Mashups. 2.3. Mashups There is much controversy in research on how to define a mashup [36]. They are usually considered to be applications that have been built by combining different elements that were previously available and reuse data or business logic [37]. These building blocks are commonly referred to as mashup components, and the mashup logic is the way in which they are assembled altogether [36]. To draw mashups a mashup tool is used, which may or may not have a web interface. Mashups open a wide range of possibilities to software engineering and software development. For starters, there is Model-Driven Software Development (MDSD), which is about using a detailed model that a machine can understand to automatically create artifacts that are part of the final software deliverable [38]. MDSD offers higher speed at a lower cost [36], because unnecessary low-level details are hidden from the model. It is important not to confuse MDSD with Model-Based Software Development, which makes use of models in the software design stage, but these models do not need to be fully detailed —they are just an aid for developers. MDSD ultimately enables End-User Development (EUD), which refers to non-professionals able to develop software by means of (in this case) a mashup tool. Although mashups are typically simple applications, they provide a set of benefits that undoubtedly motivate the development of this modeling approach. They enable fast prototyping, so that even skilled developers may find mashups useful, because they are abstracted from a lot of boilerplate code that can be easily reused instead. Besides, their target user ranges from end users with no programming skills to experienced software engineers. Mashups provide quick responses to simple problems and can be even be fun to use [36]. Depending on what the mashup components represent, typically mashups and mashup components are classified according to three levels of abstraction:. • Data mashups, when their components allow composing low-level interfaces and raw data (e.g. integrating data from a REST interface). • Logic mashups, when their components provide access to business logic or functionalities (such as any given algorithm). • User-interface mashups, when they provide a GUI that allows combining fully operating components that provide an added value to end users (like a customizable news web dashboard). • Hybrid mashups, when they combine components of different types, to provide richer capabilities to the mashup developer. In this thesis, hybrid mashups remain as the most thoughtful alternative, as long as its limitations (e.g. conflicts between elements of different levels) are taken into consideration.. 15.

(43) 2. Background. 2.3.1. Integration with the IoT The benefits of mashups make it suitable for many fields of application, ranging from the traditional web to mobile environments [39, 40], both for end users and also large enterprises [41]. Among all of them, the IoT is indeed one of the most promising fields of application for mashups as well. Certainly, mashups suit the IoT use case like a glove. For instance, one of the greatest challenges in IoT is the high degree of devices heterogeneity, which requires an abstraction from the low-level layer to address all devices at once. For this reason, a lot of IoT mashup tools have been developed recently both in research and in production environments [42]. Here are the most remarkable ones:. • Node-RED is a mashup tool developed by IBM which offers a set of visual components that stand for JavaScript code. It is suitable for real-time applications that usually run on embedded hardware and handle significant amounts of data. • IoT-MAP [43] is a solution that focuses on the mobile world. The authors propose the creation of a platform that brings both mobile application developers, smart things manufacturers and end users together. • IoTLink [44] is said to be a development toolkit based on MDSD, that makes use of a high-level model to enable the distributed composition of devices and services into a mashup to visually define how they represent the concept of a "thing". The model can then be translated into Java, for further refinement by a human developer. • Other approaches treat IoT mashups as a middleware that can be leveraged to enable wider applications. For instance, the authors in [45] present a lightweight model that uses the REST principles as a basis to build an IoT-mashup-development platform. However, these approaches do not support data analytics, especially Stream Analytics. Instead, they aim to provide a way to create applications for the IoT (which of course is of great value), rather than applications for analyzing the data that is produced inside them. There is a lack of tools that combine the development of IoT applications and stream analytics [7, 46]. Recap Current approaches of IoT mashup tools are limited in terms of Stream Analytics, whereas Data Analytics for IoT do not feature any type of EUD. It seems necessary to bring this together to offer added value to IoT application developers and users.. 16.

(44) 2.4. aFlux. 2.4. aFlux To overcome the limitations specified above, the Chair of Software and Systems Engineering of the Technical University of Munich has created aFlux [47], an IoT mashup tool aiming to support Stream Analytics when graphically developing services and applications for the IoT [48]. As shown in Fig. 2.5, it brings together the IoT and analytics using mashups as the enabling technology.. IoT Data Frameworks. Big Data & Stream Analytics. Internet of Things aFlux. Analytics Mashup Tools. Mashups. IoT Mashup Tools. Figure 2.5.: Analytics, IoT and Mashups Among the most relevant features of aFlux are a set of new semantics to support asynchronous execution patterns and multi-threaded applications, which most IoT mashup tools, like Node-RED, still lack. Querying Big Data Analytics systems is supported in aFlux, as well as some "unified analytics" that enable addressing different systems at once [49]. Fig. 2.6 shows the overall architecture of aFlux. Each element will be explained in detail: first the web application will be covered, and then the rest of the subsystems of the backend.. 2.4.1. The Web Application The web application is composed of two main entities: the front-end and a REST API back-end.. 17.

(45) Engine. Common Analytics. Main Plug-ins. REST API. Components. State. Containers. Actions. 2. Background. Plugins. Front-End Back-End aFlux. Figure 2.6.: High-Level Architecture of aFlux The Front-End The front-end of aFlux provides a GUI to the creation of mashups. It is based on React3 and Redux4 , two frameworks for building user interfaces. The focus of these frameworks is on structuring the application as a set of components, be them either visual GUI components or a bundle of several of them. When the user interacts with these components, they generate actions that are dispatched and handled by containers, that contain the business logic. Apart from this, there is an overall stored state of the application, used to render the visual components and modified by the business logic. Fig. 2.7 shows the GUI of aFlux. As it can be seen, mashups are created by drag-anddropping mashup components from the left panel. Mashup components are loaded from plug-ins, which are explained below. The application shows a console-like output in the footer, and the details about a selected item are shown on the right panel. The following elements are considered in the GUI of aFlux:. • Mashup components are the building blocks of the tool. • A Flow is a synonym for mashup. It is composed of several mashup components that the user wires together. The tool also supports subflows, which are user-created mashups that are then available as a mashup component, to encourage reuse. • An Activity is a set of flows that are located in the same canvas. aFlux supports adding several flows to the same activity. • A Job is the most general data structure in aFlux. It can contain several activities, which can be selected from the tab bar in the bottom part of the application. 3 [Online] 4 [Online]. 18. https://reactjs.org/ https://redux.js.org/.

(46) 2.4. aFlux. Available Mashup Components. Application Header & Menu Bar. Side Panel. Mashups. Activity Tabs. Add-Plug-in Button. Canvas. Console-like Output. Figure 2.7.: Graphical User Interface of aFlux The Back-End The jobs that the user has created, as well as the main configuration parameters of the GUI and some internal metadata are stored in a MongoDB5 database for persistency. Besides, the logic to execute the flows in a job is not located in the front-end. The back-end of aFlux offers a REST API for the GUI to trigger these operations. The back-end of the web application is based on the Spring framework for Java6 .. 2.4.2. The aFlux Engine: Leveraging the Actor Model One of the most important features of aFlux is its execution model. Addressing it in detail is outside of the scope of this document, yet some important remarks are needed in order to understand the implementation approach adopted for this thesis. When a flow is sent to the back-end, it must be translated to an internal model, as it was explained in Section 2.3. In aFlux, this internal model is a graph called the "Flow Execution Model" [49]. This model is composed of actors, since aFlux makes use of the actor system approach. More specifically, it makes use of the Akka actor system7 , made for Java. https://www.mongodb.com/ https://spring.io/ 7 [Online] https://akka.io/ 5 [Online] 6 [Online]. 19.

(47) 2. Background In an actor system, actors encapsulate both state and behavior [50]. When an actor receives a message, it starts to perform its associated computations, and it may send a message to another actor when finished. In aFlux, messages can only be sent asynchronously [47]. Concurrency of actors is also supported. The actor system indeed fits the mashups use case, in which mashup components communicate also through messages (e.g. Node-RED offers the message.payload to send information from one node to another). The Java back-end and engine are based on Java and use Maven8 for project management and comprehension.. 2.4.3. The aFlux Plug-in Framework Not only is aFlux an IoT mashup tool offering novel features like asynchronous flows, its functionality is also expandable. The aFlux engine offers a java API that, despite being simple, enables the development of plug-ins. This API abstracts all the internal logic related to mashup components, such as how they are treated by the web application and the flow execution model, and leaves just two main issues to be handled by the plug-in: how the mashup component looks in the GUI, which includes its properties, and which is the business logic that the component stands for, i.e. the business logic of the actor associated to it. Plug-ins are Maven projects that are packaged (in Java’s .jar format) and directly loaded from aFlux in run-time. The dependencies that they required are also integrated inside the .jar, to make it standalone. With this approach, several plug-ins are already available, which range from generalpurpose utilities to a common analytics language, including input/output operations with Kafka, files, MQTT (a widely-used connectivity protocol for IoT9 ), etc.. 2.5. SmartSantander Smart Cities are considered to be one of the most relevant fields of study of Future Internet, and because of that it is a popular topic both in research and industry [51]. Cities make indeed a great testbed for IoT technology, because it has a direct impact on citizens. For this purpose, the European Commission, as part of the FP7 [10], invested in the creation of a complete testbed environment in the city of Santander, Spain, that can be used to develop meaningful IoT applications. In the context of the SmartSantander project [52], a whole experimental research facility for smart cities has been developed, so that it can be deployed in other cities apart from Santander. In fact, there are already nodes in Guilford, UK; Lübeck, Germany and Belgrade, Serbia [11]. 8 [Online] 9 [Online]. 20. https://maven.apache.org/ http://mqtt.org/.

(48) 2.5. SmartSantander The city-scale deployment was rolled out in three phases, starting in November 2011 [52], and it addresses many application domains: traffic, parking, environmental monitoring, parks and gardens irrigation, etc. Around 3,000 IEEE 802.15.4 devices [53], 200 GPRS10 modules and 2000 smart tags (RFID, QR) code labels compose the SmartSantander facilities. They are both located at static locations (streetlights, façades, bus stops) as well as embedded in public vehicles like buses and taxis.. Figure 2.8.: Traffic Sensors in SmartSantander Apart from IoT nodes, repeaters and gateways, the SmartSantander facility is structured in four subsystems: an Authentication, Authorization and Accounting subsystem, a Testbed Management subsystem, an Experimental Support subsystem and an Application Support subsystem. To encourage the development of new applications, the SmartSantander facility has an open API, managed both as part of the project and from the City Hall of the city. As an example, Fig. 2.8 contains a map of Santander with the locations of the traffic sensors, which have been retrieved from the API and represented using a cloud computing service11 . The SmartSantander facility is also accessible from other external APIs, like the global instance of the FIWARE Context Broker [54, 55], a publish-subscribe component for handling IoT data that is part of the European-funded FIWARE Platform12 .. https://www.etsi.org/technologies-clusters/technologies/mobile/gprs https://mygeodata.cloud/ 12 [Online] https://www.fiware.org/ 10 [Online] 11 [Online]. 21.

(49)