Dwarf
Dwarf is Ning's data pipeline and analytics platform. It is a collection of open-source libraries, utilities, and servers for building large-scale analytics infrastructures.
Ning has been working with Hadoop and related technologies since 2007. Over the years, we built a large-scale data pipeline internally, which we open-sourced in 2010. It is composed of several building blocks that can be used independently:
- Action core: Exposes HDFS over HTTP, providing both a REST API and a browsing UI.
- action-access library: Java library for the Action core; it provides an API to retrieve and store data in HDFS over HTTP.
- Collector core: Data aggregator service, similar to Scribe or Kafka. It exposes both an HTTP and a Thrift API.
- eventtracker library: Java library to send data (events) to the Collector core (see the HTTP sketch after this list).
- Goodwill core: Metadata repository for events; it stores schema definitions and is used by the Collector core and the Action core for data validation.
- goodwill-access library: Java library for Goodwill; it lets you access schemata from Goodwill programmatically.
- serialization library: Contains the building blocks of the Dwarf framework; each component depends on it.
- Meteo: Realtime event processing engine. It leverages Esper for runtime analysis and can output data to different rendering engines for graphing purposes (an illustrative Esper snippet follows this list).
- HFind: A `find` implementation for Hadoop; see the POSIX specification for the command's semantics.
- Sweeper: Hadoop utility to quickly find large directories to clean up or small files to combine (see the HDFS traversal sketch after this list).
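In practice you would use the eventtracker library rather than talking to the Collector core directly, but since the collector exposes an HTTP API, a raw request is easy to sketch. The snippet below posts a single event over HTTP; the host, endpoint path, and JSON payload shape are assumptions for illustration only, not the Collector core's documented wire format.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SendEventExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical collector host, path, and payload: the real endpoint and
        // serialization are defined by the Collector core and are normally
        // handled for you by the eventtracker library.
        URL url = new URL("http://collector.example.com:8080/rest/1.0/event");
        String payload = "{\"eventName\":\"PageView\",\"userId\":42,\"path\":\"/home\"}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // A 2xx status means the collector accepted the event for aggregation.
        System.out.println("Collector responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```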
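Meteo builds on Esper, so its continuous queries are ordinary Esper EPL statements. The following is a minimal, self-contained Esper 5.x example of the kind of sliding-window aggregation such an engine can run; the PageView event class and the EPL query are illustrative and are not part of Meteo itself.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class MeteoStyleEsperExample {
    // Illustrative event type; in a real pipeline, events arrive from the collector.
    public static class PageView {
        private final String path;
        private final long latencyMs;
        public PageView(String path, long latencyMs) { this.path = path; this.latencyMs = latencyMs; }
        public String getPath() { return path; }
        public long getLatencyMs() { return latencyMs; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("PageView", PageView.class);
        EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider(config);

        // Continuously compute hit count and average latency over a sliding
        // 60-second window, emitting results every 10 seconds.
        EPStatement stmt = epService.getEPAdministrator().createEPL(
            "select count(*) as hits, avg(latencyMs) as avgLatency " +
            "from PageView.win:time(60 sec) output every 10 seconds");

        stmt.addListener((newEvents, oldEvents) -> {
            if (newEvents != null) {
                System.out.println("hits=" + newEvents[0].get("hits")
                    + " avgLatency=" + newEvents[0].get("avgLatency"));
            }
        });

        // Feed a sample event into the engine.
        epService.getEPRuntime().sendEvent(new PageView("/home", 25));
    }
}
```

The listener output could then be pushed to whatever rendering engine you use for graphing.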
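HFind and Sweeper both amount to walking the HDFS namespace with the Hadoop FileSystem API. The sketch below is not taken from either tool; it shows a find-style recursive walk that flags files over an assumed 1 GB threshold and prints a Sweeper-style size summary for each directory (the starting path is also an assumption).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWalkExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is taken from the standard Hadoop configuration on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        walk(fs, new Path("/user/dwarf"));   // starting directory is an assumption
    }

    static void walk(FileSystem fs, Path dir) throws Exception {
        // Sweeper-style report: total bytes and file count under this directory.
        ContentSummary summary = fs.getContentSummary(dir);
        System.out.printf("%s: %d bytes in %d files%n",
            dir, summary.getLength(), summary.getFileCount());

        // find-style recursion: descend into subdirectories, flag very large files.
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                walk(fs, status.getPath());
            } else if (status.getLen() > 1024L * 1024 * 1024) {
                System.out.println("large file: " + status.getPath());
            }
        }
    }
}
```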
Contact
IRC: #dwarf on irc.freenode.net.
There is a general mailing list that carries announcements as well as discussions for all Dwarf subprojects. If you run into a problem, browsing the archives first is a good idea.
Resources
- Realtime Analytics Tech Talk (slides)
- Low latency Analytics
- HFind: 1.0.1 release
- HFind: initial release announcement
- Sweeper, an HDFS browser
- Scribe case study
License
Released under the Apache License, Version 2.0.