DOI: 10.1145/988672.988674
Article

What's new on the web?: the evolution of the web from a search engine perspective

Authors:
Alexandros Ntoulas, University of California at Los Angeles, Los Angeles, CA
Junghoo Cho, University of California at Los Angeles, Los Angeles, CA
Christopher Olston, Carnegie Mellon University, Pittsburgh, PA
Published: 17 May 2004

Abstract

We seek to gain improved insight into how Web search engines should cope with the evolving Web, in an attempt to provide users with the most up-to-date results possible. For this purpose we collected weekly snapshots of some 150 Web sites over the course of one year, and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate of creation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.

Our findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them. For pages that persist over time we found that, perhaps surprisingly, the degree of content shift as measured using TF.IDF cosine distance does not appear to be consistently correlated with the frequency of content updating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications of our results for the design of effective Web search engines.
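To make the "TF.IDF cosine distance" measure mentioned in the abstract more concrete, the following is a minimal Python sketch of how the content shift between two snapshots of a page could be computed. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `smoothed_idf`, `tfidf_vector`, `content_shift`) are all assumptions made for illustration, and the exact weighting used in the paper is not given on this page.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real crawler would
    # strip markup and normalize punctuation first.
    return text.lower().split()


def smoothed_idf(corpus_tokens):
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that terms
    # appearing in every document still carry a positive weight.
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF.IDF vector; terms missing from the IDF table get weight 0.
    counts = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_distance(u, v):
    # 1 - cosine similarity of two sparse vectors stored as dicts.
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0 if norm_u == norm_v else 1.0
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    # Clamp to [0, 1] to absorb floating-point rounding.
    return max(0.0, min(1.0, 1.0 - dot / (norm_u * norm_v)))


def content_shift(old_text, new_text, idf):
    # Degree of change between two snapshots of the same page.
    old_vec = tfidf_vector(tokenize(old_text), idf)
    new_vec = tfidf_vector(tokenize(new_text), idf)
    return cosine_distance(old_vec, new_vec)


if __name__ == "__main__":
    week1 = "breaking news markets rally today"
    week2 = "breaking news markets fall today"
    other = "recipe for apple pie"
    idf = smoothed_idf([tokenize(d) for d in (week1, week2, other)])
    print("shift week1 -> week2:", content_shift(week1, week2, idf))
    print("shift week1 -> other:", content_shift(week1, other, idf))
```

A distance near 0 means the two snapshots use nearly the same weighted vocabulary, and a distance of 1 means they share no weighted terms. Under a measure of this kind, a page can be updated often while shifting very little in content, which is the distinction the abstract draws between frequency of updating and degree of change.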

Information & Contributors

Information

Published In

WWW '04: Proceedings of the 13th international conference on World Wide Web
May 2004
754 pages
ISBN: 158113844X
DOI: 10.1145/988672
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 May 2004

Author Tags

  1. change prediction
  2. degree of change
  3. link structure evolution
  4. rate of change
  5. search engines
  6. web characterization
  7. web evolution
  8. web pages

Qualifiers

  • Article

Conference

WWW04
Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 116
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 21 Nov 2024

Citations

Cited By

  • (2024) Industry 4.0: a bibliometric analysis of social partners’ public messages in France and Germany. The Economic and Labour Relations Review (1-24). DOI: 10.1017/elr.2023.52. Online publication date: 30-Jan-2024
  • (2023) An Analysis of Information Search and Retrieval Techniques. Romanian Journal of Petroleum & Gas Technology 4 (75):2 (137-148). DOI: 10.51865/JPGT.2023.02.14. Online publication date: 30-Dec-2023
  • (2023) Understanding the Research Challenges in Low-Resource Language and Linking Bilingual News Articles in Multilingual News Archive. Applied Sciences 13:15 (8566). DOI: 10.3390/app13158566. Online publication date: 25-Jul-2023
  • (2023) Prepandemic Antivaccination Websites' COVID-19 Vaccine Behavior: Content Analysis of Archived Websites. JMIR Formative Research 7 (e40291). DOI: 10.2196/40291. Online publication date: 11-Jan-2023
  • (2023) Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling. Knowledge-Based Systems 260 (110126). DOI: 10.1016/j.knosys.2022.110126. Online publication date: Jan-2023
  • (2022) Preservação de sites oficiais. Revista Brasileira de Preservação Digital 3 (e022010). DOI: 10.20396/rebpred.v3i00.16587. Online publication date: 12-Jul-2022
  • (2022) Noise-Reduction for Automatically Transferred Relevance Judgments. Experimental IR Meets Multilinguality, Multimodality, and Interaction (48-61). DOI: 10.1007/978-3-031-13643-6_4. Online publication date: 25-Aug-2022
  • (2021) The Problem of Reference Rot in Spatial Metadata Catalogues. ISPRS International Journal of Geo-Information 11:1 (27). DOI: 10.3390/ijgi11010027. Online publication date: 31-Dec-2021
  • (2021) inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence. Electronics 10:7 (818). DOI: 10.3390/electronics10070818. Online publication date: 30-Mar-2021
  • (2021) RisGraph: A Real-Time Streaming System for Evolving Graphs to Support Sub-millisecond Per-update Analysis at Millions Ops/s. Proceedings of the 2021 International Conference on Management of Data (513-527). DOI: 10.1145/3448016.3457263. Online publication date: 9-Jun-2021
