Dwarf
Dwarf is Ning's data pipeline and analytics platform. It is a collection of open-source libraries, utilities, and servers for building large-scale analytics infrastructures.
Ning has been working with Hadoop and related technologies since 2007. Over the years, we built a large-scale data pipeline internally, which we open-sourced in 2010. It is composed of several building blocks that can be used independently:
- Action core: Exposes HDFS over HTTP, providing both a REST API and a browsing UI.
- action-access library: Java library for the Action core; it provides an API to retrieve and store data in HDFS over HTTP.
- Collector core: Data aggregator service, similar to Scribe or Kafka. It exposes both an HTTP and a Thrift API.
- eventtracker library: Java library to send data (events) to the Collector core (see the HTTP sketch after this list).
- Goodwill core: Metadata repository for events; it stores schema definitions and is used by the Collector core and the Action core for data validation.
- goodwill-access library: Java library for Goodwill; it lets you access schemata from Goodwill programmatically.
- serialization library: Contains the building blocks of the Dwarf framework; each component depends on it.
- Meteo: Realtime event processing engine. It leverages Esper for runtime analysis and can output data to different rendering engines for graphing purposes (an illustrative Esper snippet follows this list).
- HFind: A `find` implementation for Hadoop; see the POSIX specification for the command's semantics.
- Sweeper: Hadoop utility to quickly find large directories to clean up or small files to combine (see the HDFS traversal sketch after this list).
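In practice you would use the eventtracker library rather than talking to the Collector core directly, but since the collector exposes an HTTP API, a raw request is easy to sketch. The snippet below posts a single event over HTTP; the host, endpoint path, and JSON payload shape are assumptions for illustration only, not the Collector core's documented wire format.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SendEventExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical collector host, path, and payload: the real endpoint and
        // serialization are defined by the Collector core and are normally
        // handled for you by the eventtracker library.
        URL url = new URL("http://collector.example.com:8080/rest/1.0/event");
        String payload = "{\"eventName\":\"PageView\",\"userId\":42,\"path\":\"/home\"}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // A 2xx status means the collector accepted the event for aggregation.
        System.out.println("Collector responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```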
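Meteo builds on Esper, so its continuous queries are ordinary Esper EPL statements. The following is a minimal, self-contained Esper 5.x example of the kind of sliding-window aggregation such an engine can run; the PageView event class and the EPL query are illustrative and are not part of Meteo itself.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class MeteoStyleEsperExample {
    // Illustrative event type; in a real pipeline, events arrive from the collector.
    public static class PageView {
        private final String path;
        private final long latencyMs;
        public PageView(String path, long latencyMs) { this.path = path; this.latencyMs = latencyMs; }
        public String getPath() { return path; }
        public long getLatencyMs() { return latencyMs; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("PageView", PageView.class);
        EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider(config);

        // Continuously compute hit count and average latency over a sliding
        // 60-second window, emitting results every 10 seconds.
        EPStatement stmt = epService.getEPAdministrator().createEPL(
            "select count(*) as hits, avg(latencyMs) as avgLatency " +
            "from PageView.win:time(60 sec) output every 10 seconds");

        stmt.addListener((newEvents, oldEvents) -> {
            if (newEvents != null) {
                System.out.println("hits=" + newEvents[0].get("hits")
                    + " avgLatency=" + newEvents[0].get("avgLatency"));
            }
        });

        // Feed a sample event into the engine.
        epService.getEPRuntime().sendEvent(new PageView("/home", 25));
    }
}
```

The listener output could then be pushed to whatever rendering engine you use for graphing.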
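HFind and Sweeper both amount to walking the HDFS namespace with the Hadoop FileSystem API. The sketch below is not taken from either tool; it shows a find-style recursive walk that flags files over an assumed 1 GB threshold and prints a Sweeper-style size summary for each directory (the starting path is also an assumption).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWalkExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is taken from the standard Hadoop configuration on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        walk(fs, new Path("/user/dwarf"));   // starting directory is an assumption
    }

    static void walk(FileSystem fs, Path dir) throws Exception {
        // Sweeper-style report: total bytes and file count under this directory.
        ContentSummary summary = fs.getContentSummary(dir);
        System.out.printf("%s: %d bytes in %d files%n",
            dir, summary.getLength(), summary.getFileCount());

        // find-style recursion: descend into subdirectories, flag very large files.
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                walk(fs, status.getPath());
            } else if (status.getLen() > 1024L * 1024 * 1024) {
                System.out.println("large file: " + status.getPath());
            }
        }
    }
}
```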
Contact
IRC: #dwarf on irc.freenode.net.
There is a general mailing list that carries announcements as well as discussions for all Dwarf subprojects. If you run into a problem, browsing the archives first is a good idea.
Resources
- Realtime Analytics Tech Talk (slides)
- Low latency Analytics
- HFind: 1.0.1 release
- HFind: initial release announcement
- Sweeper, an HDFS browser
- Scribe case study
License
Released under the Apache License, Version 2.0.