In February, we saw signs of suspicious activity and we quickly moved to fix the problem. Here’s the story.
Please, Clean My Data
By their very nature, open source projects have little upfront control over contributors. Anyone can join at any time. This is an advantage for innovation, but for community managers, who need to know the players, it poses a challenge. They need to know: who is contributing and how much?
Sometimes, contributors use pseudonyms. Sometimes they even use different usernames and email addresses for each collaboration tool. In those cases, Bitergia Analytics Platform offers features to help users to view different identities as the same contributor. Managers can then keep track of the contributors and accurately gauge their impact. Some of our customers choose to entrust Bitergia with the task of ensuring a high level of data quality for the contributor identities. That way, they can rely on accurate information for their reporting and decision-making. During one routine data quality check, we discovered strange contributor activity.
Spam attacks against collaboration platforms add an extra layer of difficulty for trustworthy data. Cleaning up attacks can be cumbersome and expensive, and that’s why we apply our tools and expertise to help our customers do it quickly. And that’s just what we did this year when we saw strange contributor data and we suspected that something was going on.
What's That Smell?
In mid-February 2025, we detected a sudden, unusually high number of contributors with no known affiliation.
We were conducting one of our regular identity reviews, and saw a spike in unaffiliated contributions. For this particular instance, the percentage of unaffiliated contributors would normally rise to about 22% each month, and we would bring it down to 18% or 19%. This month, though, it peaked at 27%.
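A check like this can be sketched in a few lines of Python. This is only an illustration, not how Bitergia Analytics Platform computes the figure: the `org` field, the `"Unknown"` label, and the `margin` parameter are assumptions made for the example, while the 22% typical peak comes from the numbers above.

```python
def unaffiliated_share(contributors):
    """Return the percentage of contributors with no known affiliation.

    Each contributor is a dict; the "org" key and the "Unknown" label
    are hypothetical names chosen for this sketch.
    """
    if not contributors:
        return 0.0
    unknown = sum(1 for c in contributors if c.get("org") in (None, "Unknown"))
    return 100.0 * unknown / len(contributors)


def is_anomalous(share, typical_peak=22.0, margin=3.0):
    """Flag a month whose share exceeds the usual peak by more than a margin.

    typical_peak reflects the roughly 22% mentioned in the text;
    margin is an arbitrary tolerance picked for illustration.
    """
    return share > typical_peak + margin
```

With these assumed defaults, a month at 27% would be flagged, while a month at the usual 22% would not.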
We always review the ranking of recent contributors to gather their various identities and to affiliate them. We’re used to seeing many full names and a few pseudonyms, but this month, most of the top contributors were account aliases, and they had impressive numbers of contributions.
Something unusual was happening, so we drilled down to display their contributions and found that they had contributed thousands of comments within just three days. What a flood of activity!
Drilling down further, we found what we had expected to find: lots and lots of spam.
A spam attack had happened and it was impacting not only the affiliation level, but also the accuracy of the metrics we provide!
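The drill-down described above amounts to ranking accounts by activity inside a short time window. A minimal sketch of that idea, assuming comments are available as plain records with hypothetical `author` and `date` fields (the platform does this interactively through its dashboards, not with this code):

```python
from collections import Counter
from datetime import date


def burst_authors(comments, start, end, min_comments=1000):
    """Rank authors by comment count between start and end (inclusive).

    Only authors with at least min_comments in the window are returned.
    The "author" and "date" keys and the default threshold are
    assumptions made for this sketch.
    """
    counts = Counter(
        c["author"] for c in comments if start <= c["date"] <= end
    )
    return [(author, n) for author, n in counts.most_common() if n >= min_comments]


# Example window: the first five days of February 2025.
window = (date(2025, 2, 1), date(2025, 2, 5))
```

Accounts that appear near the top of this ranking under an alias, with no earlier history, are the ones worth opening up and reading.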
Neutralizing the Attack
We set about gathering data about the attack to alert the upstream forge, but soon discovered that they had already detected and fixed the attack on their side. Nice! So we focused on doing our own downstream clean-up.
Bitergia Analytics Platform’s drill-down capabilities helped. We applied the same Pareto-based, top-down prioritization we use in our identity reviews to speed up the cleaning. Reviewing the top contributors in the platform confirmed the pattern: the attack took place over a few days in early February and was most likely contained by the forge admins.
When we focused on the three days of the attack, we saw that the percentage of unaffiliated contributors had peaked at over 50%. That’s more than 2.5 times the usual 20%. We checked the days before and after the attack and confirmed that both the affiliation data and the identities of the top contributors looked normal. That gave us certainty about the extent of the spam attack.
Then, it was just a matter of gathering all 93 offending identities and merging them into a single entity, which we marked as a bot.
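Conceptually, that clean-up step looks like the sketch below. It uses plain Python dictionaries and sets rather than the platform's own identity management, so every name here (`merge_as_bot`, the mapping layout, the entity ID) is hypothetical and only meant to show the idea.

```python
def merge_as_bot(identity_to_entity, bots, offending, entity_id):
    """Point every offending identity at one entity and flag it as a bot.

    identity_to_entity maps an identity (username or email) to the entity
    it belongs to; bots is the set of entities already flagged as bots.
    Both structures are assumptions for this sketch. The inputs are not
    modified; updated copies are returned.
    """
    merged = dict(identity_to_entity)
    for identity in offending:
        merged[identity] = entity_id
    return merged, set(bots) | {entity_id}
```

Because all of the spam accounts end up under one bot entity, any metric that excludes bots drops them in a single step, and the attack remains traceable as one record instead of 93.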
All metrics and visualizations were now back to normal.
If you run your own instance of Bitergia Analytics Platform or GrimoireLab, you can check the ranking of contributors in the Affiliations panel for the first five days of February 2025. Here’s the list of offending accounts: Spammers_list.csv.
Our Takeaways
Our experience with the February 2025 spam attack highlights the very real, often messy, day-to-day work of data cleaning. It’s not just about setting up a platform; it’s about continuously monitoring, investigating anomalies, and meticulously correcting errors. This incident, with its sudden spike in unaffiliated contributions and the discovery of coordinated spam activity, underscores that the integrity of auto-ingested data demands ongoing attention.
For organizations managing large-scale open source projects, this level of detailed, specialized data cleaning can be overwhelming. Engaging specialists like Bitergia, with the expertise and tools to detect and rectify data issues at scale, ultimately translates to significant savings. Trustworthy data doesn’t happen by accident; it’s the result of dedicated, consistent effort.

