Academic Paper

Random access in nondelimited variable-length record collections for parallel reading with Hadoop
Document Type
Conference
Source
2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pp. 965-970, May 2017
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Lenses
Heuristic algorithms
Scalability
Payloads
Indexing
Standards
Metadata
Language
Abstract
The industry-standard Packet CAPture (PCAP) format for storing network packet traces can normally be read only serially because it lacks delimiters, indexing, and blocking. This presents a challenge for parallel analysis of large networks, where packet traces can be many gigabytes in size. In this work we present RAPCAP, a novel method for random access into variable-length record collections like PCAP that identifies a record boundary within a small number of bytes of the access point. Unlike related heuristic methods, whose nonzero probability of error can limit scalability, the new method guarantees correctness for a well-formed file and does not rely on prior knowledge of the contents. We include a practical implementation of the algorithm as an extension to the Hadoop framework, and a performance comparison with serial ingestion. Finally, we present a number of similar storage types that could use a modified version of RAPCAP for random access.
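To make the serial-read constraint concrete, the sketch below walks a classic little-endian PCAP byte string record by record. It is not the RAPCAP algorithm, which the abstract does not detail; it only illustrates why record boundaries are normally found by starting at the beginning of the file. The helper names `build_pcap` and `record_offsets` are hypothetical, and the layout assumed is the standard 24-byte global header followed by 16-byte record headers.

```python
import struct

# Classic PCAP layout (little-endian variant assumed for this sketch).
# Global header: magic, version major, version minor, thiszone, sigfigs, snaplen, linktype
GLOBAL_HEADER = struct.Struct("<IHHiIII")
# Record header: ts_sec, ts_usec, incl_len (bytes stored), orig_len (bytes on the wire)
RECORD_HEADER = struct.Struct("<IIII")


def build_pcap(payloads):
    """Hypothetical helper: build an in-memory PCAP byte string as test data."""
    out = bytearray(GLOBAL_HEADER.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1))
    for i, payload in enumerate(payloads):
        out += RECORD_HEADER.pack(i, 0, len(payload), len(payload))
        out += payload
    return bytes(out)


def record_offsets(data):
    """Walk the file serially and return the byte offset of every record header.

    Each record's length is stored only in its own header, so the position
    of record n is known only after reading the headers of records 0..n-1.
    """
    offsets = []
    pos = GLOBAL_HEADER.size
    while pos + RECORD_HEADER.size <= len(data):
        _, _, incl_len, _ = RECORD_HEADER.unpack_from(data, pos)
        offsets.append(pos)
        pos += RECORD_HEADER.size + incl_len
    return offsets


pcap = build_pcap([b"a" * 60, b"b" * 1500, b"c" * 42])
print(record_offsets(pcap))
```

A reader dropped at an arbitrary byte offset, as happens at a Hadoop input split, has no delimiter to scan for and cannot tell a record header from payload bytes without further reasoning. That is the gap the abstract says RAPCAP addresses.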