Data Duplication Removal Using File Checksum

Abstract:

Data duplication removal is a crucial aspect of managing the explosive growth of data and reducing storage consumption. This project aims to address the issue of redundant data in NoSQL key-value stores by implementing a file checksum technique. The project focuses on maximizing duplicate reduction, improving performance, and ensuring horizontal scalability on a Cloud Platform. This article provides an overview of the project’s objectives, details, and its potential advantages and disadvantages.

Introduction:

Data duplication poses challenges in terms of storage space, data management complexity, and performance. To overcome these challenges, the project proposes the use of file checksums. By comparing file checksums, redundant data can be identified and eliminated efficiently. However, false positives can occur when two different files produce the same checksum, so new data chunks must still be compared with the stored data chunks before they are treated as duplicates. Existing research extracts checksums from the file data to minimize false positives. This project specifically addresses duplication in NoSQL key-value stores, which hold multiple attributes for each file, such as user ID, filename, size, extension, checksum, and date-time.
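As a rough illustration of the checksum step, the sketch below computes a file checksum and assembles the attributes mentioned above into one record. It is not the project's actual code: SHA-256 is an assumed algorithm (the article does not name one), and the function names `file_checksum` and `build_record`, as well as the record field names, are hypothetical.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks.

    SHA-256 is an assumption; the article does not say which checksum is used.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(user_id, path):
    """Assemble the listed attributes into one record (hypothetical layout)."""
    name = os.path.basename(path)
    return {
        "user_id": user_id,
        "filename": name,
        "size": os.path.getsize(path),
        "extension": os.path.splitext(name)[1],
        "checksum": file_checksum(path),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Two files with identical bytes always yield the same checksum, so the checksum can serve as the lookup key when checking for duplicates.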

Objective:

The main objectives of this project are as follows:

Maximize the reduction of duplicate data in NoSQL key-value stores.
Improve process performance while minimizing the impact on the backup window.
Design the system with horizontal scaling capabilities, enabling competitive operation on a Cloud Platform.

Project Details:

The project eliminates duplicate data using file checksums. When a user uploads a file, the system calculates its checksum and looks that checksum up in the database. If a matching entry already exists, the existing entry is updated instead of storing a second copy; otherwise, a new entry is created. This reduces unnecessary data, streamlines data administration, and improves file upload and download.
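The upload flow described here might look roughly like the following sketch. It uses an in-memory dictionary as a stand-in for the NoSQL key-value store, SHA-256 as an assumed checksum, and a byte comparison to rule out false positives. The class `DedupStore` and its attributes are hypothetical names chosen for illustration, not the project's implementation.

```python
import hashlib


class DedupStore:
    """In-memory stand-in for a key-value store, keyed by checksum."""

    def __init__(self):
        self.blobs = {}   # checksum -> file bytes, stored once
        self.owners = {}  # checksum -> set of (user_id, filename) references

    def upload(self, user_id, filename, data):
        """Store data only if unseen; return True when a new copy was written."""
        checksum = hashlib.sha256(data).hexdigest()  # assumed algorithm
        existing = self.blobs.get(checksum)
        if existing is not None:
            # Guard against a false positive: confirm the bytes really match.
            if existing != data:
                raise ValueError("checksum collision: contents differ")
            # Duplicate: record the new reference, keep the single stored copy.
            self.owners[checksum].add((user_id, filename))
            return False
        self.blobs[checksum] = data
        self.owners[checksum] = {(user_id, filename)}
        return True


store = DedupStore()
first = store.upload("u1", "a.txt", b"same bytes")   # written as a new copy
second = store.upload("u2", "b.txt", b"same bytes")  # detected as a duplicate
```

Keying the store by checksum makes the duplicate check a single lookup, which is the kind of access pattern a key-value store handles well.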

Conclusion:

Data duplication removal using file checksums is an effective approach to the challenges posed by redundant data. This project implements the technique in NoSQL key-value stores, aiming to reduce storage consumption, improve performance, and ensure horizontal scalability on a Cloud Platform. The project offers advantages such as efficient storage utilization and simplified data management; its main potential limitation is that it requires an active internet connection.
