User talk:Dušan Kreheľ/Wikipedia talk:New matrix format
Existing "ez" compressed format, as well as pageviews complete
Hi! The pageviews "complete" dump does just this. It's a bit of a mess because the Analytics team that maintains these dumps changed a lot in the middle of a big effort to create the new dump. The relevant details are:
- pagecounts-ez is the old dump that first implemented an idea similar to yours: https://dumps.wikimedia.org/other/pagecounts-ez/
- pageview_complete is the new version that should meet your needs going forward: https://dumps.wikimedia.org/other/pageview_complete/readme.html
- lots of documentation updates are needed to make this clear
- we need to clean up old jobs that are still running and giving the impression that other datasets are supposed to be how people download data
Milimetric (WMF) (talk) 19:46, 5 September 2022 (UTC)
- @Milimetric (WMF): Thanks, I had a look. My idea was to have yearly exports; pageview_complete has only daily statistics. Dušan Kreheľ (talk) 20:32, 18 September 2022 (UTC)
- @Dušan Kreheľ: Actually, pageview_complete has both daily and monthly statistics. The monthly rollups are here, linked from the daily ones: https://dumps.wikimedia.org/other/pageview_complete/monthly/. Perhaps that should be clearer from the front page. If yearly rollups are useful as well, we should probably just add them to this dataset rather than creating a different dataset, in my opinion. What do you think? Milimetric (WMF) (talk) 13:36, 19 September 2022 (UTC)
- @Milimetric (WMF): Thanks for the comment and the links. My answer to your question is in the Epilogue section of the article; please have a look. Dušan Kreheľ (talk) 20:21, 16 October 2022 (UTC)
Comparison for other formats
Thanks for sharing it; this is an interesting idea. I wonder how it compares to other known formats for storing a matrix with many zeros, such as those described in Sparse_matrix#Storing_a_sparse_matrix. Eran (talk) 09:50, 15 October 2022 (UTC)
- @ערן: Excellent comment. I compared my format against the examples in the section Compressed sparse row (CSR, CRS or Yale format) of the enwiki page, and my format is better. Dušan Kreheľ (talk) 07:08, 16 October 2022 (UTC)
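For readers unfamiliar with the baseline being compared against, here is a minimal sketch of compressed sparse row (CSR) storage. It is only an illustration of the general technique, not the format proposed in the article and not how the pageview dumps are stored; the function names `to_csr` and `from_csr` and the sample matrix are assumptions made for this example.

```python
def to_csr(dense):
    """Convert a dense matrix (list of equal-length rows) to CSR arrays.

    Returns (values, col_index, row_ptr), where row i's nonzero entries
    are values[row_ptr[i]:row_ptr[i + 1]].
    """
    values, col_index, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)      # the nonzero entry itself
                col_index.append(j)   # the column it came from
        row_ptr.append(len(values))   # running count marks the end of this row
    return values, col_index, row_ptr


def from_csr(values, col_index, row_ptr, n_cols):
    """Rebuild the dense matrix from CSR arrays (n_cols must be supplied)."""
    dense = []
    for i in range(len(row_ptr) - 1):
        row = [0] * n_cols
        for k in range(row_ptr[i], row_ptr[i + 1]):
            row[col_index[k]] = values[k]
        dense.append(row)
    return dense


# Hypothetical 4x4 matrix with four nonzero entries.
matrix = [
    [5, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 3, 0],
    [0, 6, 0, 0],
]
values, col_index, row_ptr = to_csr(matrix)
# values    == [5, 8, 3, 6]
# col_index == [0, 1, 2, 1]
# row_ptr   == [0, 1, 2, 3, 4]
```

CSR stores 2 × (number of nonzeros) + (number of rows) + 1 integers, so any size comparison with another format depends on how sparse the matrix is and on how the integers are encoded.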