Dwarf

Dwarf is Ning's data pipeline and analytics platform: a collection of open-source libraries, utilities, and servers for building large-scale analytics infrastructures.

Ning has been working with Hadoop and related technologies since 2007. Over the years, we built a large-scale data pipeline internally, which we open-sourced in 2010. It is composed of several building blocks, each of which can be used independently:

Action core
Exposes HDFS over HTTP, providing both a REST API and a browsing UI (see the read sketch after this list).
action-access library
Java library to access the Action core. It provides an API to retrieve and store data in HDFS over HTTP.
Collector core
Data aggregator service, similar to Scribe or Kafka; it exposes both an HTTP and a Thrift API (see the send sketch after this list).
eventtracker library
Java library to send data (events) to the Collector core.
Goodwill core
Metadata repository for events; it stores schema definitions, which the Collector core and the Action core use for data validation.
goodwill-access library
Java library for Goodwill; it lets you access schemata from Goodwill programmatically.
serialization library
Contains the building blocks of the Dwarf framework; every other component depends on it.
Meteo
Real-time event-processing engine. It leverages Esper for runtime analysis and can output data to different rendering engines for graphing (see the Esper sketch after this list).
HFind
A `find` implementation for Hadoop, following the POSIX specification for find (see the traversal sketch after this list).
Sweeper
Hadoop utility to quickly find large directories to clean up or small files to combine.
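
Because the Action core serves HDFS over plain HTTP, any HTTP client can read a file. Below is a minimal read sketch in Java; the host, port, and URL layout are assumptions for illustration, not the Action core's documented scheme. In practice the action-access library wraps these calls for you.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ActionCoreReadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL: host, port, and path layout are illustrative only.
            URL url = new URL("http://action.example.com:8080/rest/1.0/hdfs/user/logs/part-00000");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // Stream the file contents back, line by line.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }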
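
Similarly, an event can be sent to the Collector core's HTTP API with a plain POST. This send sketch is also hypothetical: the endpoint path and the JSON payload format are assumptions, and the eventtracker library is the supported way to deliver events.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class CollectorSendSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint and payload format, for illustration only.
            URL url = new URL("http://collector.example.com:8080/events");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");

            // A made-up PageView event; real events conform to a schema stored in Goodwill.
            byte[] payload = "{\"eventType\":\"PageView\",\"siteId\":\"site-1\"}"
                    .getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload);
            }
            System.out.println("Collector responded: " + conn.getResponseCode());
        }
    }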
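
Meteo's runtime analysis is built on Esper, which evaluates continuous EPL queries over event streams. The Esper sketch below uses Esper's classic public API (com.espertech.esper.client) with a hypothetical PageView event type to count views over a sliding window; it illustrates the underlying engine, not Meteo's own configuration.

    import com.espertech.esper.client.Configuration;
    import com.espertech.esper.client.EPServiceProvider;
    import com.espertech.esper.client.EPServiceProviderManager;
    import com.espertech.esper.client.EPStatement;
    import com.espertech.esper.client.EventBean;

    public class EsperWindowSketch {
        // Hypothetical event type; Esper reads properties through JavaBean getters.
        public static class PageView {
            private final String siteId;
            public PageView(String siteId) { this.siteId = siteId; }
            public String getSiteId() { return siteId; }
        }

        public static void main(String[] args) {
            Configuration config = new Configuration();
            config.addEventType(PageView.class);
            EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

            // Continuous query: page views per site over a sliding 60-second window.
            EPStatement stmt = engine.getEPAdministrator().createEPL(
                    "select siteId, count(*) as views from PageView.win:time(60 sec) group by siteId");
            stmt.addListener((newEvents, oldEvents) -> {
                if (newEvents == null) return;
                for (EventBean event : newEvents) {
                    System.out.println(event.get("siteId") + ": " + event.get("views"));
                }
            });

            // Push a couple of events into the engine; the listener fires per update.
            engine.getEPRuntime().sendEvent(new PageView("site-1"));
            engine.getEPRuntime().sendEvent(new PageView("site-1"));
        }
    }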
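
HFind applies find-style matching to HDFS. The traversal sketch below shows equivalent logic against Hadoop's FileSystem API: it prints files older than seven days, roughly what a POSIX `find /user/logs -mtime +7` would match on a local file system. The starting path is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFindSketch {
        public static void main(String[] args) throws Exception {
            // Uses whichever cluster the default Hadoop configuration points at.
            FileSystem fs = FileSystem.get(new Configuration());
            long sevenDaysAgo = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
            find(fs, new Path("/user/logs"), sevenDaysAgo);
        }

        // Recursively print files last modified before the cutoff.
        private static void find(FileSystem fs, Path dir, long cutoffMillis) throws Exception {
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isDirectory()) {
                    find(fs, status.getPath(), cutoffMillis);
                } else if (status.getModificationTime() < cutoffMillis) {
                    System.out.println(status.getPath());
                }
            }
        }
    }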

Contact

IRC: #dwarf on irc.freenode.net.

A general mailing list carries announcements and discussions for all Dwarf subprojects. If you have a problem, browsing the archives is a good first step.

License

Released under the Apache License, Version 2.0.