27-02-2013, 03:04 PM
Data Analysis infrastructure at Facebook
ABSTRACT
Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for Facebook advertisers to more advanced kinds such as friend recommendations. To support this diversity of use cases on an ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. Facebook has leveraged, authored and contributed to a number of open source technologies in order to address these requirements. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.