Detecting Duplicate Content

Detecting and classifying content as spam or duplicate is a significant challenge for social media firms because of the vast amount of data published every second. The simplest mitigation is to rate-limit posts, follows, and messages. You can also add a "report as spam" button to posts and offload the task to the community, but keep in mind that the community is cunning and may report posts that simply do not match their ideology, so you still need a content moderation team in place. Too much work. :')

I was assigned the task of detecting duplicate content on the feed, which prompted me to delve into spam detection and research papers published by Facebook on the subject. After thorough research, I developed a pragmatic solution.

Did you know that malware detection in binaries often relies on fuzzy hashing algorithms? Malware variants tend to share recognizable byte patterns, so a binary can be flagged when its fuzzy hash is similar to that of a known sample.

Similarly, content can be converted into an MD5 hash, but since even slightly different content produces a completely different hash, we need to create fuzzy hashes and search for similar ones instead.
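To make the problem with exact hashes concrete, here is a minimal Python sketch using the standard hashlib module. The two sample posts and the helper name md5_hex are made up for illustration; the posts differ by a single trailing character.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of a string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two hypothetical posts that differ by one trailing character.
post_a = "Win a free phone today, click the link below!"
post_b = "Win a free phone today, click the link below!!"

digest_a = md5_hex(post_a)
digest_b = md5_hex(post_b)

# Count how many hex positions happen to agree between the two digests.
matching = sum(x == y for x, y in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{matching} of {len(digest_a)} hex characters match")
```

The two digests are unrelated even though the posts are nearly identical, so an exact-match lookup on MD5 only catches byte-for-byte copies.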

By using fuzzy hashing algorithms to find similar hashes, we can effectively detect duplicate content on social media feeds, which is essential for preventing spam and improving the user experience. Spam is subjective, though, so context also matters when deciding whether a post is spam.

I used the ctph.js library, an implementation of context-triggered piecewise hashing (CTPH), to hash the content and detect similarities.
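For intuition about what a context-triggered piecewise hash does, here is a deliberately simplified Python sketch of the idea. It is not the ctph.js API or its actual algorithm: the names fuzzy_hash, similarity, and piece_char, the WINDOW and BLOCK_SIZE values, and the sample posts are all assumptions for illustration. The content is cut into pieces wherever a small sliding window of bytes hits a trigger value, each piece is hashed down to one character, and two signatures are compared with difflib from the standard library.

```python
import hashlib
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7      # bytes of context examined at each position (assumed value)
BLOCK_SIZE = 8  # assumed value: a boundary fires roughly once every 8 bytes


def piece_char(piece: bytes) -> str:
    """Reduce one piece of content to a single signature character."""
    return ALPHABET[hashlib.md5(piece).digest()[0] % len(ALPHABET)]


def fuzzy_hash(text: str) -> str:
    """Build a signature with one character per context-triggered piece."""
    data = text.encode("utf-8")
    signature = []
    piece_start = 0
    for i in range(len(data)):
        # The trigger depends only on the last WINDOW bytes, so an edit
        # elsewhere in the text does not move this boundary.
        context = sum(data[max(0, i - WINDOW + 1): i + 1])
        if context % BLOCK_SIZE == BLOCK_SIZE - 1:
            signature.append(piece_char(data[piece_start:i + 1]))
            piece_start = i + 1
    if piece_start < len(data):
        # Whatever is left after the last boundary forms the final piece.
        signature.append(piece_char(data[piece_start:]))
    return "".join(signature)


def similarity(a: str, b: str) -> float:
    """Score two texts between 0.0 and 1.0 by comparing their signatures."""
    return SequenceMatcher(None, fuzzy_hash(a), fuzzy_hash(b)).ratio()


# Hypothetical posts for illustration.
original = "Win a free phone today, click the link below to claim your prize!"
near_copy = original + " Hurry!"
unrelated = "Our team just published the quarterly engineering report."

print(fuzzy_hash(original))
print(fuzzy_hash(near_copy))
print(similarity(original, near_copy))
print(similarity(original, unrelated))
```

Because boundaries depend only on local context, appending text to a post leaves every earlier piece unchanged, so a near copy should generally score higher than an unrelated post. Real CTPH implementations use a proper rolling hash and pick the block size from the input length; this sketch fixes both for brevity.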

Some good blog posts on the same topic:

https://medium.com/@glaslos/locality-sensitive-fuzzy-hashing-66127178ebdc

https://blog.pythonicforensics.com/fuzzy-hashing-and-ctph-69f4a7bbfe48

Cheers.